Build RAG with Hugging Face and Milvus
Authored by: Chen Zhang
Milvus is a popular open-source vector database that powers AI applications with high-performance and scalable vector similarity search. In this tutorial, we will show you how to build a RAG (Retrieval-Augmented Generation) pipeline with Hugging Face and Milvus.
The RAG system combines a retrieval system with an LLM. The system first retrieves relevant documents from a corpus using the Milvus vector database, then uses an LLM hosted on Hugging Face to generate an answer based on the retrieved documents.
Preparation
Dependencies and Environment
! pip install --upgrade pymilvus sentence-transformers huggingface-hub langchain_community langchain-text-splitters pypdf tqdm
If you are using Google Colab, you may need to restart the runtime to enable the dependencies you just installed (click on the "Runtime" menu at the top of the screen, and select "Restart session" from the dropdown menu).
In addition, we recommend that you configure your Hugging Face User Access Token and set it as an environment variable, because we will use an LLM from the Hugging Face Hub. You may run into low rate limits if you do not set the token environment variable.
import os
os.environ["HF_TOKEN"] = "hf_..."
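If you prefer not to hard-code the token in your script, a minimal optional alternative (not part of the original steps) is to read it interactively with getpass:
import os
from getpass import getpass

# Prompt for the token only if it is not already set in the environment.
if not os.environ.get("HF_TOKEN"):
    os.environ["HF_TOKEN"] = getpass("Enter your Hugging Face token: ")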
Prepare the Data
We use the AI Act PDF as the private knowledge in our RAG. It is a regulatory framework for AI, with different risk levels corresponding to more or less regulation.
%%bash
if [ ! -f "The-AI-Act.pdf" ]; then
    wget -q https://artificialintelligenceact.eu/wp-content/uploads/2021/08/The-AI-Act.pdf
fi
We use the PyPDFLoader from LangChain to extract the text from the PDF, and then split the text into smaller chunks. By default, we set the chunk size to 1000 and the overlap to 200, which means each chunk is roughly 1000 characters long and two consecutive chunks overlap by 200 characters.
>>> from langchain_community.document_loaders import PyPDFLoader
>>> loader = PyPDFLoader("The-AI-Act.pdf")
>>> docs = loader.load()
>>> print(len(docs))
108
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(docs)
text_lines = [chunk.page_content for chunk in chunks]
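As an optional sanity check (not part of the original steps), you can inspect how many chunks the splitter produced and preview the beginning of one:
# Number of chunks produced by the splitter, and a preview of the first one.
print(len(text_lines))
print(text_lines[0][:200])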
Prepare the Embedding Model
Define a function to generate text embeddings. We use the BGE embedding model as an example, but you can use any embedding model, such as those on the MTEB leaderboard.
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer("BAAI/bge-small-en-v1.5")
def emb_text(text):
    return embedding_model.encode([text], normalize_embeddings=True).tolist()[0]
Generate a test embedding and print its dimension and the first few elements.
>>> test_embedding = emb_text("This is a test")
>>> embedding_dim = len(test_embedding)
>>> print(embedding_dim)
>>> print(test_embedding[:10])
384
[-0.07660683244466782, 0.025316666811704636, 0.012505513615906239, 0.004595153499394655, 0.025780051946640015, 0.03816710412502289, 0.08050819486379623, 0.003035430097952485, 0.02439221926033497, 0.0048803347162902355]
Load Data into Milvus
Create the Collection
from pymilvus import MilvusClient
milvus_client = MilvusClient(uri="./hf_milvus_demo.db")
collection_name = "rag_collection"
As for the arguments of MilvusClient (see the sketch after this list):
- Setting the uri as a local file, e.g. ./hf_milvus_demo.db, is the most convenient method, as it automatically utilizes Milvus Lite to store all data in this file.
- If you have a large amount of data, say more than a million vectors, you can set up a more performant Milvus server on Docker or Kubernetes. In that setup, use the server URI, e.g. http://localhost:19530, as your uri.
- If you want to use Zilliz Cloud, the fully managed cloud service for Milvus, adjust the uri and token, which correspond to the Public Endpoint and API Key in Zilliz Cloud.
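The sketch below summarizes the three connection modes; the server address, endpoint, and API key are placeholder values, not ones provided by this tutorial:
from pymilvus import MilvusClient

# Milvus Lite: store everything in a local file (what this tutorial uses).
client = MilvusClient(uri="./hf_milvus_demo.db")

# Standalone Milvus server on Docker or Kubernetes (placeholder address).
# client = MilvusClient(uri="http://localhost:19530")

# Zilliz Cloud: fully managed Milvus (placeholder endpoint and API key).
# client = MilvusClient(uri="https://<your-public-endpoint>", token="<your-api-key>")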
Check if the collection already exists and drop it if it does.
if milvus_client.has_collection(collection_name):
    milvus_client.drop_collection(collection_name)
Create a new collection with the specified parameters.
If we do not specify any field information, Milvus will automatically create a default id field as the primary key, and a vector field to store the vector data. A reserved JSON field is used to store fields that are not defined in the schema, together with their values.
milvus_client.create_collection(
    collection_name=collection_name,
    dimension=embedding_dim,
    metric_type="IP",  # Inner product distance
    consistency_level="Strong",  # Strong consistency level
)
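Optionally, you can confirm that the collection was created as expected. The describe_collection call below is a standard MilvusClient method, used here only as a quick check:
# Quick check: print the collection's schema and settings.
print(milvus_client.describe_collection(collection_name))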
Insert Data
Iterate through the text lines, create the embeddings, and then insert the data into Milvus.
Here is a new field text, which is not defined in the collection schema. It will be automatically added to the reserved JSON dynamic field, which can be treated like a normal field at a high level.
from tqdm import tqdm
data = []
for i, line in enumerate(tqdm(text_lines, desc="Creating embeddings")):
    data.append({"id": i, "vector": emb_text(line), "text": line})
insert_res = milvus_client.insert(collection_name=collection_name, data=data)
insert_res["insert_count"]
Build RAG
Retrieve Data for a Query
Let's specify a question about the corpus.
question = "What is the legal basis for the proposal?"
Search for the question in the collection and retrieve the top 3 semantic matches.
search_res = milvus_client.search(
    collection_name=collection_name,
    data=[emb_text(question)],  # Use the `emb_text` function to convert the question to an embedding vector
    limit=3,  # Return top 3 results
    search_params={"metric_type": "IP", "params": {}},  # Inner product distance
    output_fields=["text"],  # Return the text field
)
Let's take a look at the search results for the query.
>>> import json
>>> retrieved_lines_with_distances = [(res["entity"]["text"], res["distance"]) for res in search_res[0]]
>>> print(json.dumps(retrieved_lines_with_distances, indent=4))
[ [ "EN 6 EN 2. LEGAL BASIS, SUBSIDIARITY AND PROPORTIONALITY \n2.1. Legal basis \nThe legal basis for the proposal is in the first place Article 114 of the Treaty on the \nFunctioning of the European Union (TFEU), which provides for the adoption of measures to \nensure the establishment and f unctioning of the internal market. \nThis proposal constitutes a core part of the EU digital single market strategy. The primary \nobjective of this proposal is to ensure the proper functioning of the internal market by setting \nharmonised rules in particular on the development, placing on the Union market and the use \nof products and services making use of AI technologies or provided as stand -alone AI \nsystems. Some Member States are already considering national rules to ensure that AI is safe \nand is developed a nd used in compliance with fundamental rights obligations. This will likely \nlead to two main problems: i) a fragmentation of the internal market on essential elements", 0.7412998080253601 ], [ "applications and prevent market fragmentation. \nTo achieve those objectives, this proposal presents a balanced and proportionate horizontal \nregulatory approach to AI that is limited to the minimum necessary requirements to address \nthe risks and problems linked to AI, withou t unduly constraining or hindering technological \ndevelopment or otherwise disproportionately increasing the cost of placing AI solutions on \nthe market. The proposal sets a robust and flexible legal framework. On the one hand, it is \ncomprehensive and future -proof in its fundamental regulatory choices, including the \nprinciple -based requirements that AI systems should comply with. On the other hand, it puts \nin place a proportionate regulatory system centred on a well -defined risk -based regulatory \napproach that does not create unnecessary restrictions to trade, whereby legal intervention is \ntailored to those concrete situations where there is a justified cause for concern or where such", 0.696428656578064 ], [ "approach that does not create unnecessary restrictions to trade, whereby legal intervention is \ntailored to those concrete situations where there is a justified cause for concern or where such \nconcern can reasonably be anticipated in the near future. At the same time, t he legal \nframework includes flexible mechanisms that enable it to be dynamically adapted as the \ntechnology evolves and new concerning situations emerge. \nThe proposal sets harmonised rules for the development, placement on the market and use of \nAI systems i n the Union following a proportionate risk -based approach. It proposes a single \nfuture -proof definition of AI. Certain particularly harmful AI practices are prohibited as \ncontravening Union values, while specific restrictions and safeguards are proposed in relation \nto certain uses of remote biometric identification systems for the purpose of law enforcement. \nThe proposal lays down a solid risk methodology to define \u201chigh -risk\u201d AI systems that pose", 0.6891457438468933 ] ]
Use an LLM to Get a RAG Response
Before composing the prompt for the LLM, let's first flatten the retrieved document list into a plain string.
context = "\n".join([line_with_distance[0] for line_with_distance in retrieved_lines_with_distances])
Define the prompt for the language model. The prompt is assembled from the documents retrieved from Milvus.
PROMPT = """
Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.
<context>
{context}
</context>
<question>
{question}
</question>
"""
We use the Mixtral-8x7B-Instruct-v0.1 model, hosted on the Hugging Face inference server, to generate a response based on the prompt.
from huggingface_hub import InferenceClient
repo_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
llm_client = InferenceClient(model=repo_id, timeout=120)
Finally, we can format the prompt and generate the answer.
prompt = PROMPT.format(context=context, question=question)
>>> answer = llm_client.text_generation(
... prompt,
... max_new_tokens=1000,
... ).strip()
>>> print(answer)
The legal basis for the proposal is Article 114 of the Treaty on the Functioning of the European Union (TFEU), which provides for the adoption of measures to ensure the establishment and functioning of the internal market. The proposal aims to establish harmonized rules for the development, placing on the market, and use of AI systems in the Union following a proportionate risk-based approach.
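As a recap, the retrieval and generation steps above can be wrapped into a single helper. The ask function below is a minimal sketch assembled from the objects already defined in this tutorial (milvus_client, emb_text, PROMPT, llm_client); it is not part of the original notebook:
def ask(question, top_k=3):
    # Retrieve the top-k most similar chunks from Milvus.
    search_res = milvus_client.search(
        collection_name=collection_name,
        data=[emb_text(question)],
        limit=top_k,
        search_params={"metric_type": "IP", "params": {}},
        output_fields=["text"],
    )
    # Flatten the retrieved chunks into a single context string.
    context = "\n".join(res["entity"]["text"] for res in search_res[0])
    # Fill the prompt template and generate the answer with the LLM.
    prompt = PROMPT.format(context=context, question=question)
    return llm_client.text_generation(prompt, max_new_tokens=1000).strip()

print(ask("What is the legal basis for the proposal?"))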
Congratulations! You have successfully built a RAG pipeline with Hugging Face and Milvus.