Build RAG with Hugging Face and Milvus
Authored by: Chen Zhang
Milvus is a popular open-source vector database that powers AI applications with high-performance and scalable vector similarity search. In this tutorial, we will show you how to build a RAG (Retrieval-Augmented Generation) pipeline with Hugging Face and Milvus.
The RAG system combines a retrieval system with an LLM. The system first retrieves relevant documents from a corpus using the Milvus vector database, then uses an LLM hosted on Hugging Face to generate an answer based on the retrieved documents.
Preparation
Dependencies and Environment
! pip install --upgrade pymilvus sentence-transformers huggingface-hub langchain_community langchain-text-splitters pypdf tqdm
If you are using Google Colab, you may need to restart the runtime to enable the dependencies you just installed (click on the "Runtime" menu at the top of the screen, and select "Restart session" from the dropdown menu).
In addition, we recommend that you configure your Hugging Face User Access Token and set it as an environment variable, because we will use an LLM from the Hugging Face Hub. You may run into low rate limits if you do not set the token environment variable.
import os
os.environ["HF_TOKEN"] = "hf_..."
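If you prefer not to hard-code the token in your script, a minimal optional alternative (not part of the original steps) is to read it interactively with getpass:
import os
from getpass import getpass

# Prompt for the token only if it is not already set in the environment.
if not os.environ.get("HF_TOKEN"):
    os.environ["HF_TOKEN"] = getpass("Enter your Hugging Face token: ")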
Prepare the Data
We use the AI Act PDF as the private knowledge in our RAG. It is a regulatory framework for AI, with different risk levels corresponding to more or less regulation.
%%bash
if [ ! -f "The-AI-Act.pdf" ]; then
    wget -q https://artificialintelligenceact.eu/wp-content/uploads/2021/08/The-AI-Act.pdf
fi
We use the PyPDFLoader from LangChain to extract the text from the PDF, and then split the text into smaller chunks. By default, we set the chunk size to 1000 and the overlap to 200, which means each chunk is roughly 1000 characters long and two consecutive chunks overlap by 200 characters.
>>> from langchain_community.document_loaders import PyPDFLoader
>>> loader = PyPDFLoader("The-AI-Act.pdf")
>>> docs = loader.load()
>>> print(len(docs))
108
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(docs)
text_lines = [chunk.page_content for chunk in chunks]
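As an optional sanity check (not part of the original steps), you can inspect how many chunks the splitter produced and preview the beginning of one:
# Number of chunks produced by the splitter, and a preview of the first one.
print(len(text_lines))
print(text_lines[0][:200])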
Prepare the Embedding Model
Define a function to generate text embeddings. We use the BGE embedding model as an example, but you can use any embedding model, such as those on the MTEB leaderboard.
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer("BAAI/bge-small-en-v1.5")
def emb_text(text):
    return embedding_model.encode([text], normalize_embeddings=True).tolist()[0]
Generate a test embedding and print its dimension and the first few elements.
>>> test_embedding = emb_text("This is a test")
>>> embedding_dim = len(test_embedding)
>>> print(embedding_dim)
>>> print(test_embedding[:10])
384
[-0.07660683244466782, 0.025316666811704636, 0.012505513615906239, 0.004595153499394655, 0.025780051946640015, 0.03816710412502289, 0.08050819486379623, 0.003035430097952485, 0.02439221926033497, 0.0048803347162902355]
Load Data into Milvus
Create the Collection
from pymilvus import MilvusClient
milvus_client = MilvusClient(uri="./hf_milvus_demo.db")
collection_name = "rag_collection"
As for the arguments of MilvusClient (see the sketch after this list):
- Setting the uri as a local file, e.g. ./hf_milvus_demo.db, is the most convenient method, as it automatically utilizes Milvus Lite to store all data in this file.
- If you have a large amount of data, say more than a million vectors, you can set up a more performant Milvus server on Docker or Kubernetes. In that setup, use the server URI, e.g. http://localhost:19530, as your uri.
- If you want to use Zilliz Cloud, the fully managed cloud service for Milvus, adjust the uri and token, which correspond to the Public Endpoint and API Key in Zilliz Cloud.
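The sketch below summarizes the three connection modes; the server address, endpoint, and API key are placeholder values, not ones provided by this tutorial:
from pymilvus import MilvusClient

# Milvus Lite: store everything in a local file (what this tutorial uses).
client = MilvusClient(uri="./hf_milvus_demo.db")

# Standalone Milvus server on Docker or Kubernetes (placeholder address).
# client = MilvusClient(uri="http://localhost:19530")

# Zilliz Cloud: fully managed Milvus (placeholder endpoint and API key).
# client = MilvusClient(uri="https://<your-public-endpoint>", token="<your-api-key>")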
Check if the collection already exists and drop it if it does.
if milvus_client.has_collection(collection_name):
    milvus_client.drop_collection(collection_name)
Create a new collection with the specified parameters.
If we do not specify any field information, Milvus will automatically create a default id field as the primary key, and a vector field to store the vector data. A reserved JSON field is used to store fields that are not defined in the schema, together with their values.
milvus_client.create_collection(
    collection_name=collection_name,
    dimension=embedding_dim,
    metric_type="IP",  # Inner product distance
    consistency_level="Strong",  # Strong consistency level
)
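Optionally, you can confirm that the collection was created as expected. The describe_collection call below is a standard MilvusClient method, used here only as a quick check:
# Quick check: print the collection's schema and settings.
print(milvus_client.describe_collection(collection_name))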
Insert Data
Iterate through the text lines, create the embeddings, and then insert the data into Milvus.
Here is a new field text, which is not defined in the collection schema. It will be automatically added to the reserved JSON dynamic field, which can be treated like a normal field at a high level.
from tqdm import tqdm
data = []
for i, line in enumerate(tqdm(text_lines, desc="Creating embeddings")):
    data.append({"id": i, "vector": emb_text(line), "text": line})
insert_res = milvus_client.insert(collection_name=collection_name, data=data)
insert_res["insert_count"]
Build RAG
Retrieve Data for a Query
Let's specify a question about the corpus.
question = "What is the legal basis for the proposal?"
Search for the question in the collection and retrieve the top 3 semantic matches.
search_res = milvus_client.search(
    collection_name=collection_name,
    data=[emb_text(question)],  # Use the `emb_text` function to convert the question to an embedding vector
    limit=3,  # Return top 3 results
    search_params={"metric_type": "IP", "params": {}},  # Inner product distance
    output_fields=["text"],  # Return the text field
)
Let's take a look at the search results for the query.
>>> import json
>>> retrieved_lines_with_distances = [(res["entity"]["text"], res["distance"]) for res in search_res[0]]
>>> print(json.dumps(retrieved_lines_with_distances, indent=4))
[ [ "EN 6 EN 2. LEGAL BASIS, SUBSIDIARITY AND PROPORTIONALITY \n2.1. Legal basis \nThe legal basis for the proposal is in the first place Article 114 of the Treaty on the \nFunctioning of the European Union (TFEU), which provides for the adoption of measures to \nensure the establishment and f unctioning of the internal market. \nThis proposal constitutes a core part of the EU digital single market strategy. The primary \nobjective of this proposal is to ensure the proper functioning of the internal market by setting \nharmonised rules in particular on the development, placing on the Union market and the use \nof products and services making use of AI technologies or provided as stand -alone AI \nsystems. Some Member States are already considering national rules to ensure that AI is safe \nand is developed a nd used in compliance with fundamental rights obligations. This will likely \nlead to two main problems: i) a fragmentation of the internal market on essential elements", 0.7412998080253601 ], [ "applications and prevent market fragmentation. \nTo achieve those objectives, this proposal presents a balanced and proportionate horizontal \nregulatory approach to AI that is limited to the minimum necessary requirements to address \nthe risks and problems linked to AI, withou t unduly constraining or hindering technological \ndevelopment or otherwise disproportionately increasing the cost of placing AI solutions on \nthe market. The proposal sets a robust and flexible legal framework. On the one hand, it is \ncomprehensive and future -proof in its fundamental regulatory choices, including the \nprinciple -based requirements that AI systems should comply with. On the other hand, it puts \nin place a proportionate regulatory system centred on a well -defined risk -based regulatory \napproach that does not create unnecessary restrictions to trade, whereby legal intervention is \ntailored to those concrete situations where there is a justified cause for concern or where such", 0.696428656578064 ], [ "approach that does not create unnecessary restrictions to trade, whereby legal intervention is \ntailored to those concrete situations where there is a justified cause for concern or where such \nconcern can reasonably be anticipated in the near future. At the same time, t he legal \nframework includes flexible mechanisms that enable it to be dynamically adapted as the \ntechnology evolves and new concerning situations emerge. \nThe proposal sets harmonised rules for the development, placement on the market and use of \nAI systems i n the Union following a proportionate risk -based approach. It proposes a single \nfuture -proof definition of AI. Certain particularly harmful AI practices are prohibited as \ncontravening Union values, while specific restrictions and safeguards are proposed in relation \nto certain uses of remote biometric identification systems for the purpose of law enforcement. \nThe proposal lays down a solid risk methodology to define \u201chigh -risk\u201d AI systems that pose", 0.6891457438468933 ] ]
Use an LLM to Get a RAG Response
Before composing the prompt for the LLM, let's first flatten the retrieved document list into a plain string.
context = "\n".join([line_with_distance[0] for line_with_distance in retrieved_lines_with_distances])
Define the prompt for the language model. The prompt is assembled from the documents retrieved from Milvus.
PROMPT = """
Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.
<context>
{context}
</context>
<question>
{question}
</question>
"""
We use the Mixtral-8x7B-Instruct-v0.1 model, hosted on the Hugging Face inference server, to generate a response based on the prompt.
from huggingface_hub import InferenceClient
repo_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
llm_client = InferenceClient(model=repo_id, timeout=120)
Finally, we can format the prompt and generate the answer.
prompt = PROMPT.format(context=context, question=question)
>>> answer = llm_client.text_generation(
... prompt,
... max_new_tokens=1000,
... ).strip()
>>> print(answer)
The legal basis for the proposal is Article 114 of the Treaty on the Functioning of the European Union (TFEU), which provides for the adoption of measures to ensure the establishment and functioning of the internal market. The proposal aims to establish harmonized rules for the development, placing on the market, and use of AI systems in the Union following a proportionate risk-based approach.
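As a recap, the retrieval and generation steps above can be wrapped into a single helper. The ask function below is a minimal sketch assembled from the objects already defined in this tutorial (milvus_client, emb_text, PROMPT, llm_client); it is not part of the original notebook:
def ask(question, top_k=3):
    # Retrieve the top-k most similar chunks from Milvus.
    search_res = milvus_client.search(
        collection_name=collection_name,
        data=[emb_text(question)],
        limit=top_k,
        search_params={"metric_type": "IP", "params": {}},
        output_fields=["text"],
    )
    # Flatten the retrieved chunks into a single context string.
    context = "\n".join(res["entity"]["text"] for res in search_res[0])
    # Fill the prompt template and generate the answer with the LLM.
    prompt = PROMPT.format(context=context, question=question)
    return llm_client.text_generation(prompt, max_new_tokens=1000).strip()

print(ask("What is the legal basis for the proposal?"))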
Congratulations! You have successfully built a RAG pipeline with Hugging Face and Milvus.