Agentic RAG: turbocharge your RAG with query reformulation and self-query! 🚀
This is an advanced tutorial: you should be familiar with the notions covered in another cookbook first!
Reminder: Retrieval-Augmented Generation (RAG) means "using an LLM to answer a user query, but basing the answer on information retrieved from a knowledge base". It has many advantages over using a vanilla or fine-tuned LLM: to name a few, it allows grounding the answer on true facts and reducing confabulations, it allows providing the LLM with domain-specific knowledge, and it allows fine-grained control over access to information from the knowledge base.
But vanilla RAG has limitations, most importantly these two:
- It **performs only one retrieval step**: if the results are bad, the generation will in turn be bad.
- **Semantic similarity is computed with the user query as a reference**, which may not be optimal: for instance, the user query will often be a question while the document containing the true answer is phrased in the affirmative, so its similarity score will be lower than that of other source documents phrased as questions, creating a risk of missing the relevant information.
But we can alleviate these problems by making a **RAG agent**: very simply, **an agent armed with a retriever tool!**
This agent will: ✅ formulate the query itself and ✅ critique the results and re-retrieve if needed.
So it should naturally recover some advanced RAG techniques:
- Instead of directly using the user query as the reference in semantic search, the agent formulates a reference sentence itself that can be closer to the targeted documents, as in HyDE.
- The agent can critique the retrieved snippets and re-retrieve if needed, as in Self-Query.
Let's build this system. 🛠️
Run the line below to install the required dependencies:
!pip install pandas langchain langchain-community sentence-transformers faiss-cpu "transformers[agents]" --upgrade -q
We first load a knowledge base on which we want to perform RAG: this dataset is a compilation of the documentation pages for many huggingface packages, stored as markdown.
import datasets
knowledge_base = datasets.load_dataset("m-ric/huggingface_doc", split="train")
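Optionally, you can peek at one record to see the two fields we rely on below: `text` holds the page content and `source` its path.
# Optional: inspect one entry of the knowledge base
print(knowledge_base[0]["source"])
print(knowledge_base[0]["text"][:300])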
Now we prepare the knowledge base by processing the dataset and storing it into a vector database, to be used by the retriever.
We use LangChain for its excellent vector database utilities. For the embedding model, we use thenlper/gte-small since it performed well in our RAG evaluation cookbook.
>>> from tqdm import tqdm
>>> from transformers import AutoTokenizer
>>> from langchain.docstore.document import Document
>>> from langchain.text_splitter import RecursiveCharacterTextSplitter
>>> from langchain.vectorstores import FAISS
>>> from langchain_community.embeddings import HuggingFaceEmbeddings
>>> from langchain_community.vectorstores.utils import DistanceStrategy
>>> source_docs = [
... Document(page_content=doc["text"], metadata={"source": doc["source"].split("/")[1]}) for doc in knowledge_base
... ]
>>> text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
... AutoTokenizer.from_pretrained("thenlper/gte-small"),
... chunk_size=200,
... chunk_overlap=20,
... add_start_index=True,
... strip_whitespace=True,
... separators=["\n\n", "\n", ".", " ", ""],
... )
>>> # Split docs and keep only unique ones
>>> print("Splitting documents...")
>>> docs_processed = []
>>> unique_texts = {}
>>> for doc in tqdm(source_docs):
...     new_docs = text_splitter.split_documents([doc])
...     for new_doc in new_docs:
...         if new_doc.page_content not in unique_texts:
...             unique_texts[new_doc.page_content] = True
...             docs_processed.append(new_doc)
>>> print("Embedding documents... This should take a few minutes (5 minutes on MacBook with M1 Pro)")
>>> embedding_model = HuggingFaceEmbeddings(model_name="thenlper/gte-small")
>>> vectordb = FAISS.from_documents(
... documents=docs_processed,
... embedding=embedding_model,
... distance_strategy=DistanceStrategy.COSINE,
... )
Splitting documents...
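As an optional sanity check, you can query the vector store directly before wrapping it in a tool (the query string below is just an example):
# Optional: make sure retrieval works on the raw vector store
for doc in vectordb.similarity_search("push a model to the Hub", k=2):
    print(doc.metadata["source"], "->", doc.page_content[:100])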
Now the database is ready: let's build our agentic RAG system!
👉 We only need a `RetrieverTool` that our agent can leverage to retrieve information from the knowledge base.
Since we need to add a vectordb as an attribute of the tool, we cannot simply use the simple tool constructor with a `@tool` decorator: so we will follow the advanced setup highlighted in the advanced agents documentation.
from transformers.agents import Tool
from langchain_core.vectorstores import VectorStore
class RetrieverTool(Tool):
    name = "retriever"
    description = "Using semantic similarity, retrieves some documents from the knowledge base that have the closest embeddings to the input query."
    inputs = {
        "query": {
            "type": "string",
            "description": "The query to perform. This should be semantically close to your target documents. Use the affirmative form rather than a question.",
        }
    }
    output_type = "string"
    def __init__(self, vectordb: VectorStore, **kwargs):
        super().__init__(**kwargs)
        self.vectordb = vectordb
    def forward(self, query: str) -> str:
        assert isinstance(query, str), "Your search query must be a string"
        docs = self.vectordb.similarity_search(
            query,
            k=7,
        )
        return "\nRetrieved documents:\n" + "".join(
            [f"===== Document {str(i)} =====\n" + doc.page_content for i, doc in enumerate(docs)]
        )
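Before handing the tool to an agent, you can optionally instantiate it and call it directly to see the formatted string it returns (again with an example query):
# Optional: Tool instances are callable and dispatch to forward()
retriever_tool = RetrieverTool(vectordb)
print(retriever_tool("push a model to the Hub")[:500])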
Now it's straightforward to create an agent that leverages this tool!
The agent needs these arguments upon initialization:
- `tools`: a list of tools the agent will be able to call.
- `llm_engine`: the LLM that powers the agent.
Our `llm_engine` must be a callable that takes a list of messages as input and returns text. It also needs to accept a `stop_sequences` argument that indicates when to stop generating. For convenience, we directly use the `HfApiEngine` class provided in the package to get an LLM engine that calls Hugging Face's Inference API.
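If you'd rather not use `HfApiEngine`, here is a minimal sketch of what such a callable could look like, assuming you query a chat model through `huggingface_hub.InferenceClient` (the function name and the `max_tokens` value are illustrative choices, not part of the original recipe):
from huggingface_hub import InferenceClient
client = InferenceClient(model="meta-llama/Llama-3.1-70B-Instruct")
def custom_llm_engine(messages, stop_sequences=["Task"]) -> str:
    # The agent passes a list of chat messages plus stop sequences;
    # the engine must return the generated text as a plain string.
    response = client.chat_completion(messages, stop=stop_sequences, max_tokens=1000)
    return response.choices[0].message.content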
We use meta-llama/Llama-3.1-70B-Instruct as the LLM engine because:
- It has a long 128k-token context, which is helpful for processing long source documents.
- It is served for free at all times on HF's Inference API!
from transformers.agents import HfApiEngine, ReactJsonAgent
llm_engine = HfApiEngine("meta-llama/Llama-3.1-70B-Instruct")
retriever_tool = RetrieverTool(vectordb)
agent = ReactJsonAgent(tools=[retriever_tool], llm_engine=llm_engine, max_iterations=4, verbose=2)
Since we initialized the agent as a `ReactJsonAgent`, it has automatically been given a default system prompt that tells the LLM engine to process step by step and to generate tool calls as JSON blobs (you could replace this prompt template with your own if needed).
Then, when its `.run()` method is launched, the agent takes care of calling the LLM engine, parsing the tool-call JSON blobs and executing the tool calls, all in a loop that ends only when the final answer is provided.
>>> agent_output = agent.run("How can I push a model to the Hub?")
>>> print("Final output:")
>>> print(agent_output)
Final output: To push a model to the Hub, use `model.push_to_hub()`.
Agentic RAG vs. standard RAG
Does the agent setup make a better RAG system? Well, let's compare it to a standard RAG system using an LLM judge!
We will use meta-llama/Meta-Llama-3-70B-Instruct for evaluation since it's one of the strongest open models we have tested for LLM-judge use cases.
eval_dataset = datasets.load_dataset("m-ric/huggingface_doc_qa_eval", split="train")
Before running the tests, let's make the agent less verbose.
import logging
agent.logger.setLevel(logging.WARNING)
outputs_agentic_rag = []
for example in tqdm(eval_dataset):
    question = example["question"]
    enhanced_question = f"""Using the information contained in your knowledge base, which you can access with the 'retriever' tool,
give a comprehensive answer to the question below.
Respond only to the question asked, response should be concise and relevant to the question.
If you cannot find information, do not give up and try calling your retriever again with different arguments!
Make sure to have covered the question completely by calling the retriever tool several times with semantically different queries.
Your queries should not be questions but affirmative form sentences: e.g. rather than "How do I load a model from the Hub in bf16?", query should be "load a model from the Hub bf16 weights".
Question:
{question}"""
    answer = agent.run(enhanced_question)
    print("=======================================================")
    print(f"Question: {question}")
    print(f"Answer: {answer}")
    print(f'True answer: {example["answer"]}')
    results_agentic = {
        "question": question,
        "true_answer": example["answer"],
        "source_doc": example["source_doc"],
        "generated_answer": answer,
    }
    outputs_agentic_rag.append(results_agentic)
from huggingface_hub import InferenceClient
reader_llm = InferenceClient("CohereForAI/c4ai-command-r-plus")
outputs_standard_rag = []
for example in tqdm(eval_dataset):
    question = example["question"]
    context = retriever_tool(question)
    prompt = f"""Given the question and supporting documents below, give a comprehensive answer to the question.
Respond only to the question asked, response should be concise and relevant to the question.
Provide the number of the source document when relevant.
If you cannot find information, do not give up and try calling your retriever again with different arguments!
Question:
{question}
{context}
"""
    messages = [{"role": "user", "content": prompt}]
    answer = reader_llm.chat_completion(messages).choices[0].message.content
    print("=======================================================")
    print(f"Question: {question}")
    print(f"Answer: {answer}")
    print(f'True answer: {example["answer"]}')
    results_standard = {
        "question": question,
        "true_answer": example["answer"],
        "source_doc": example["source_doc"],
        "generated_answer": answer,
    }
    outputs_standard_rag.append(results_standard)
The evaluation prompt follows some of the best principles shown in our llm_judge cookbook: it uses a small-integer Likert scale, with clear criteria and a description for each score.
EVALUATION_PROMPT = """You are a fair evaluator language model.
You will be given an instruction, a response to evaluate, a reference answer that gets a score of 3, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 3. You should refer to the score rubric.
3. The output format should look as follows: \"Feedback: {{write a feedback for criteria}} [RESULT] {{an integer number between 1 and 3}}\"
4. Please do not generate any other opening, closing, and explanations. Be sure to include [RESULT] in your output.
5. Do not score conciseness: a correct answer that covers the question should receive max score, even if it contains additional useless information.
The instruction to evaluate:
{instruction}
Response to evaluate:
{response}
Reference Answer (Score 3):
{reference_answer}
Score Rubrics:
[Is the response complete, accurate, and factual based on the reference answer?]
Score 1: The response is completely incomplete, inaccurate, and/or not factual.
Score 2: The response is somewhat complete, accurate, and/or factual.
Score 3: The response is completely complete, accurate, and/or factual.
Feedback:"""
from huggingface_hub import InferenceClient
evaluation_client = InferenceClient("meta-llama/Meta-Llama-3-70B-Instruct")
>>> import pandas as pd
>>> for type, outputs in [
...     ("agentic", outputs_agentic_rag),
...     ("standard", outputs_standard_rag),
... ]:
...     for experiment in tqdm(outputs):
...         eval_prompt = EVALUATION_PROMPT.format(
...             instruction=experiment["question"],
...             response=experiment["generated_answer"],
...             reference_answer=experiment["true_answer"],
...         )
...         messages = [
...             {"role": "system", "content": "You are a fair evaluator language model."},
...             {"role": "user", "content": eval_prompt},
...         ]
...         eval_result = evaluation_client.text_generation(eval_prompt, max_new_tokens=1000)
...         try:
...             feedback, score = [item.strip() for item in eval_result.split("[RESULT]")]
...             experiment["eval_score_LLM_judge"] = score
...             experiment["eval_feedback_LLM_judge"] = feedback
...         except:
...             print(f"Parsing failed - output was: {eval_result}")
...     results = pd.DataFrame.from_dict(outputs)
...     results = results.loc[~results["generated_answer"].str.contains("Error")]
...     results["eval_score_LLM_judge_int"] = results["eval_score_LLM_judge"].fillna(1).apply(lambda x: int(x))
...     results["eval_score_LLM_judge_int"] = (results["eval_score_LLM_judge_int"] - 1) / 2
...     print(f"Average score for {type} RAG: {results['eval_score_LLM_judge_int'].mean()*100:.1f}%")
Average score for agentic RAG: 78.5%
Let's recap: the agent setup improves the score by 8.5% compared to a standard RAG! (from 70.0% to 78.5%)
This is a great improvement, with a very simple setup 🚀
(For a baseline, using Llama-3-70B without the knowledge base got 36%.)