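Build RAG with Hugging Face and Milvus
Author: Zhang Chen (张晨)
Milvus is a popular open-source vector database that powers AI applications with high-performance, scalable vector similarity search. In this tutorial, we show how to build a RAG (Retrieval-Augmented Generation) pipeline with Hugging Face and Milvus.
A RAG system combines a retrieval system with a large language model (LLM). It first retrieves relevant documents from a corpus using the Milvus vector database, then uses an LLM hosted on Hugging Face to generate an answer grounded in the retrieved documents.
Preparation
Dependencies and Environment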
! pip install --upgrade pymilvus sentence-transformers huggingface-hub langchain_community langchain-text-splitters pypdf tqdm
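If you are using Google Colab, you may need to **restart the runtime** for the newly installed dependencies to take effect (click the "Runtime" menu at the top of the screen, then select "Restart session" from the dropdown menu).
We also recommend configuring your Hugging Face user access token and setting it as an environment variable, because we will use an LLM from the Hugging Face Hub. Without the token, you may run into request rate limits.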
import os
os.environ["HF_TOKEN"] = "hf_..."
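Prepare the data
We use the AI Act PDF as the private knowledge for our RAG. The AI Act is a regulatory framework for AI that defines different risk levels, each corresponding to more or less regulation.
%%bash
if [ ! -f "The-AI-Act.pdf" ]; then
    wget -q https://artificialintelligenceact.eu/wp-content/uploads/2021/08/The-AI-Act.pdf
fi
We use PyPDFLoader from LangChain to extract the text from the PDF, then split the text into smaller chunks. We set the chunk size to 1000 and the overlap to 200, which means each chunk holds at most about 1000 characters and adjacent chunks share up to 200 characters.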
>>> from langchain_community.document_loaders import PyPDFLoader
>>> loader = PyPDFLoader("The-AI-Act.pdf")
>>> docs = loader.load()
>>> print(len(docs))
108
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(docs)
text_lines = [chunk.page_content for chunk in chunks]
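To make the chunk size and overlap settings concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is a simplified stand-in, not the algorithm used by RecursiveCharacterTextSplitter, which prefers to cut at separators such as paragraph breaks and spaces and therefore usually produces chunks shorter than the limit. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters. This is a simplified
    illustration, not LangChain's separator-aware splitting.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# A synthetic 2500-character string stands in for the PDF text.
sample = "".join(str(i % 10) for i in range(2500))
pieces = sliding_window_chunks(sample)
print([len(p) for p in pieces])            # lengths of the three windows
print(pieces[0][-200:] == pieces[1][:200])  # adjacent windows share 200 characters
```

With a step of 800 characters, the 2500-character sample yields windows of 1000, 1000, and 900 characters.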
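Prepare the embedding model
Define a function that generates text embeddings. We use the BGE embedding model as an example, but you can use any embedding model, such as those listed on the MTEB leaderboard.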
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer("BAAI/bge-small-en-v1.5")
def emb_text(text):
    return embedding_model.encode([text], normalize_embeddings=True).tolist()[0]
Generate a test embedding and print its dimension and first few elements.
>>> test_embedding = emb_text("This is a test")
>>> embedding_dim = len(test_embedding)
>>> print(embedding_dim)
>>> print(test_embedding[:10])
384
[-0.07660683244466782, 0.025316666811704636, 0.012505513615906239, 0.004595153499394655, 0.025780051946640015, 0.03816710412502289, 0.08050819486379623, 0.003035430097952485, 0.02439221926033497, 0.0048803347162902355]
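Because `emb_text` passes `normalize_embeddings=True`, each embedding has unit length, and for unit-length vectors the inner product equals the cosine similarity. The sketch below checks this property on two small hand-written vectors in plain Python; the helper names are illustrative and are not part of sentence-transformers or Milvus.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def inner_product(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )


a = [3.0, 4.0]
b = [1.0, 2.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

# A normalized vector has inner product 1 with itself.
print(math.isclose(inner_product(a_unit, a_unit), 1.0))
# Inner product of the normalized vectors matches cosine similarity of the originals.
print(math.isclose(inner_product(a_unit, b_unit), cosine_similarity(a, b)))
```

This is why an inner-product metric is a reasonable choice for searching normalized embeddings.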
Load data into Milvus
Create the collection
from pymilvus import MilvusClient
milvus_client = MilvusClient(uri="./hf_milvus_demo.db")
collection_name = "rag_collection"
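About the arguments of MilvusClient:
- Setting uri to a local file, e.g. ./hf_milvus_demo.db, is the most convenient method, as it automatically uses Milvus Lite to store all data in this file.
- If you have a large amount of data, say more than a million vectors, you can set up a more performant Milvus server on Docker or Kubernetes. In this setup, use the server URI, e.g. https://127.0.0.1:19530, as your uri.
- If you want to use Zilliz Cloud, the fully managed cloud service for Milvus, adjust the uri and token, which correspond to the Public Endpoint and API key in Zilliz Cloud.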
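One way to switch between the three options above without editing code is to read the connection settings from the environment. The following is a minimal sketch under stated assumptions: the helper `milvus_client_kwargs` and the environment variable names `MILVUS_URI` and `MILVUS_TOKEN` are hypothetical conventions chosen for this example, not something Milvus defines.

```python
import os


def milvus_client_kwargs(env=None):
    """Build keyword arguments for MilvusClient from environment-style settings.

    MILVUS_URI and MILVUS_TOKEN are hypothetical variable names for this sketch.
    Falls back to the local Milvus Lite file used in this tutorial.
    """
    env = os.environ if env is None else env
    kwargs = {"uri": env.get("MILVUS_URI", "./hf_milvus_demo.db")}
    token = env.get("MILVUS_TOKEN")
    if token:
        # Only needed for deployments that require authentication, e.g. Zilliz Cloud.
        kwargs["token"] = token
    return kwargs


# With no settings present, the local file is used.
print(milvus_client_kwargs({}))

# Intended use (requires pymilvus, so it is left as a comment here):
# milvus_client = MilvusClient(**milvus_client_kwargs())
```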
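Check whether the collection already exists and drop it if it does.
if milvus_client.has_collection(collection_name):
    milvus_client.drop_collection(collection_name)
Create a new collection with the specified parameters.
If we don't specify any field information, Milvus automatically creates a default id field as the primary key and a vector field to store the vector data. A reserved JSON field stores fields that are not defined in the schema, along with their values.
milvus_client.create_collection(
    collection_name=collection_name,
    dimension=embedding_dim,
    metric_type="IP",  # Inner product distance
    consistency_level="Strong",  # Strong consistency level
)
Insert data
Iterate over the text lines, create embeddings, and insert the data into Milvus.
The text field used here is not defined in the collection schema. It is automatically added to the reserved JSON dynamic field, and at a high level it can be treated as a normal field.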
from tqdm import tqdm
data = []
for i, line in enumerate(tqdm(text_lines, desc="Creating embeddings")):
    data.append({"id": i, "vector": emb_text(line), "text": line})
insert_res = milvus_client.insert(collection_name=collection_name, data=data)
insert_res["insert_count"]
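The tutorial inserts all rows in a single call, which is fine for one PDF. For a larger corpus, a common pattern is to send rows in batches. The sketch below shows that pattern with an injected `insert_fn` callable so it runs without a Milvus instance. The helper names and the batch size of 500 are assumptions for illustration, not Milvus recommendations.

```python
def iter_batches(rows, batch_size):
    """Yield consecutive slices of rows, each with at most batch_size items."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def insert_in_batches(insert_fn, rows, batch_size=500):
    """Call insert_fn once per batch and return the number of rows sent."""
    total = 0
    for batch in iter_batches(rows, batch_size):
        insert_fn(batch)
        total += len(batch)
    return total


# A list's append method stands in for the real insert call, which could be:
# lambda batch: milvus_client.insert(collection_name=collection_name, data=batch)
received = []
rows = [{"id": i, "vector": [0.0, 1.0], "text": f"chunk {i}"} for i in range(1200)]

print(insert_in_batches(received.append, rows, batch_size=500))  # total rows sent
print([len(batch) for batch in received])                        # size of each batch
```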
Build RAG
Retrieve data for a query
Let's specify a question about the corpus.
question = "What is the legal basis for the proposal?"
Search the collection for the question and retrieve the top 3 semantic matches.
search_res = milvus_client.search(
    collection_name=collection_name,
    data=[emb_text(question)],  # Use the `emb_text` function to convert the question to an embedding vector
    limit=3,  # Return top 3 results
    search_params={"metric_type": "IP", "params": {}},  # Inner product distance
    output_fields=["text"],  # Return the text field
)
Let's look at the search results for the query.
>>> import json
>>> retrieved_lines_with_distances = [(res["entity"]["text"], res["distance"]) for res in search_res[0]]
>>> print(json.dumps(retrieved_lines_with_distances, indent=4))
[
    [
        "EN 6 EN 2. LEGAL BASIS, SUBSIDIARITY AND PROPORTIONALITY \n2.1. Legal basis \nThe legal basis for the proposal is in the first place Article 114 of the Treaty on the \nFunctioning of the European Union (TFEU), which provides for the adoption of measures to \nensure the establishment and f unctioning of the internal market. \nThis proposal constitutes a core part of the EU digital single market strategy. The primary \nobjective of this proposal is to ensure the proper functioning of the internal market by setting \nharmonised rules in particular on the development, placing on the Union market and the use \nof products and services making use of AI technologies or provided as stand -alone AI \nsystems. Some Member States are already considering national rules to ensure that AI is safe \nand is developed a nd used in compliance with fundamental rights obligations. This will likely \nlead to two main problems: i) a fragmentation of the internal market on essential elements",
        0.7412998080253601
    ],
    [
        "applications and prevent market fragmentation. \nTo achieve those objectives, this proposal presents a balanced and proportionate horizontal \nregulatory approach to AI that is limited to the minimum necessary requirements to address \nthe risks and problems linked to AI, withou t unduly constraining or hindering technological \ndevelopment or otherwise disproportionately increasing the cost of placing AI solutions on \nthe market. The proposal sets a robust and flexible legal framework. On the one hand, it is \ncomprehensive and future -proof in its fundamental regulatory choices, including the \nprinciple -based requirements that AI systems should comply with. On the other hand, it puts \nin place a proportionate regulatory system centred on a well -defined risk -based regulatory \napproach that does not create unnecessary restrictions to trade, whereby legal intervention is \ntailored to those concrete situations where there is a justified cause for concern or where such",
        0.696428656578064
    ],
    [
        "approach that does not create unnecessary restrictions to trade, whereby legal intervention is \ntailored to those concrete situations where there is a justified cause for concern or where such \nconcern can reasonably be anticipated in the near future. At the same time, t he legal \nframework includes flexible mechanisms that enable it to be dynamically adapted as the \ntechnology evolves and new concerning situations emerge. \nThe proposal sets harmonised rules for the development, placement on the market and use of \nAI systems i n the Union following a proportionate risk -based approach. It proposes a single \nfuture -proof definition of AI. Certain particularly harmful AI practices are prohibited as \ncontravening Union values, while specific restrictions and safeguards are proposed in relation \nto certain uses of remote biometric identification systems for the purpose of law enforcement. \nThe proposal lays down a solid risk methodology to define \u201chigh -risk\u201d AI systems that pose",
        0.6891457438468933
    ]
]
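Conceptually, the search above scores every stored vector against the query embedding by inner product and returns the highest-scoring rows. The brute-force sketch below illustrates that idea on toy two-dimensional vectors. It is not how Milvus executes a search internally, since Milvus relies on its own index structures, and the function name `top_k_by_inner_product` is hypothetical.

```python
def top_k_by_inner_product(query, rows, k=3):
    """Score every row against the query and return the k best (text, score) pairs."""
    scored = [
        (row["text"], sum(q * v for q, v in zip(query, row["vector"])))
        for row in rows
    ]
    # A higher inner product means a closer match, so sort in descending order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy unit-length vectors standing in for real 384-dimensional embeddings.
toy_rows = [
    {"text": "legal basis", "vector": [1.0, 0.0]},
    {"text": "risk levels", "vector": [0.6, 0.8]},
    {"text": "unrelated", "vector": [0.0, 1.0]},
]

hits = top_k_by_inner_product([1.0, 0.0], toy_rows, k=2)
print(hits)
```

The result has the same shape as `retrieved_lines_with_distances`: a list of text and score pairs, best match first.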
Use an LLM to get a RAG response
Before composing the prompt for the LLM, let's first flatten the list of retrieved documents into a plain string.
context = "\n".join([line_with_distance[0] for line_with_distance in retrieved_lines_with_distances])
Define the prompt for the language model. The prompt is assembled with the documents retrieved from Milvus.
PROMPT = """
Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.
<context>
{context}
</context>
<question>
{question}
</question>
"""
We use Mixtral-8x7B-Instruct-v0.1, hosted on the Hugging Face inference server, to generate a response from the prompt.
from huggingface_hub import InferenceClient
repo_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
llm_client = InferenceClient(model=repo_id, timeout=120)
Finally, we can format the prompt and generate the answer.
prompt = PROMPT.format(context=context, question=question)
>>> answer = llm_client.text_generation(
...     prompt,
...     max_new_tokens=1000,
... ).strip()
>>> print(answer)
The legal basis for the proposal is Article 114 of the Treaty on the Functioning of the European Union (TFEU), which provides for the adoption of measures to ensure the establishment and functioning of the internal market. The proposal aims to establish harmonized rules for the development, placing on the market, and use of AI systems in the Union following a proportionate risk-based approach.
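The retrieve, format, and generate steps of this section can be collected into one reusable function. The sketch below is one possible arrangement, not part of the original tutorial: `answer_question` is a hypothetical helper that takes the retrieval and generation steps as callables, so it can be exercised here with stand-in functions instead of a live Milvus collection and Hugging Face endpoint.

```python
PROMPT_TEMPLATE = """
Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.
<context>
{context}
</context>
<question>
{question}
</question>
"""


def answer_question(question, retrieve, generate, top_k=3):
    """Run retrieve -> build prompt -> generate, using caller-supplied functions.

    retrieve(question, top_k) must return a list of passage strings;
    generate(prompt) must return the model's text output.
    """
    passages = retrieve(question, top_k)
    context = "\n".join(passages)
    prompt = PROMPT_TEMPLATE.format(context=context, question=question)
    return generate(prompt).strip()


# Stand-ins for the real steps. In the tutorial, retrieve would wrap
# milvus_client.search and generate would wrap llm_client.text_generation.
seen_prompts = []


def fake_retrieve(question, top_k):
    passages = ["The legal basis is Article 114 TFEU.", "An unrelated passage."]
    return passages[:top_k]


def fake_generate(prompt):
    seen_prompts.append(prompt)  # keep the prompt so it can be inspected
    return "  Article 114 TFEU.  "


answer = answer_question(
    "What is the legal basis for the proposal?", fake_retrieve, fake_generate, top_k=1
)
print(answer)
```

Passing the two steps in as arguments keeps the prompt logic independent of any particular vector store or model endpoint.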
Congratulations! You have built a RAG pipeline with Hugging Face and Milvus.