Agentic RAG: turbocharge your RAG with query reformulation and self-query! 🚀
This tutorial is advanced. You should first be familiar with the concepts covered in this other cookbook!
Reminder: Retrieval-Augmented Generation (RAG) is "using an LLM to answer a user query, but basing the answer on information retrieved from a knowledge base". It has many advantages over using a vanilla or fine-tuned LLM: to name a few, it allows grounding the answer on true facts and reducing hallucinations, it allows providing the LLM with domain-specific knowledge, and it allows fine-grained control over access to information from the knowledge base.
But vanilla RAG has limitations, most importantly these two:
- It performs only one retrieval step: if the results are bad, the generation in turn will be bad.
- Semantic similarity is computed with the user query as a reference, which may be suboptimal: for instance, the user query will often be a question while the document containing the true answer is phrased affirmatively, so its similarity score will be downgraded compared to other source documents phrased in interrogative form, and relevant information risks being missed (see the short sketch below).
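To make this concrete, here is a minimal sketch of the second point. The document and query strings are made-up examples, and the exact scores depend on the embedding model (it assumes sentence-transformers is installed, which the pip cell below covers); it only illustrates the comparison, not a guarantee that the affirmative query always wins.
# Hypothetical example: compare a question-form and an affirmative-form query
# against the same documentation passage in embedding space.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("thenlper/gte-small")

doc = "To push a model to the Hub, call model.push_to_hub() after training."
question_query = "How can I push a model to the Hub?"
affirmative_query = "push a trained model to the Hub with push_to_hub"

doc_emb, q_emb, a_emb = model.encode([doc, question_query, affirmative_query])
print("question query vs doc:   ", util.cos_sim(q_emb, doc_emb).item())
print("affirmative query vs doc:", util.cos_sim(a_emb, doc_emb).item())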
But we can alleviate these problems by making a RAG agent: very simply, an agent armed with a retriever tool!
This agent will: ✅ formulate the query itself and ✅ critique the results and re-retrieve if needed.
So it should naively recover some advanced RAG techniques!
Let's build this system. 🛠️
Run the line below to install the required dependencies:
!pip install pandas langchain langchain-community sentence-transformers faiss-cpu smolagents --upgrade -q
Let's log in so we can call the HF Inference API:
from huggingface_hub import notebook_login
notebook_login()
We first load a knowledge base on which we want to perform RAG: this dataset is a compilation of the documentation pages for many huggingface packages, stored as markdown.
import datasets
knowledge_base = datasets.load_dataset("m-ric/huggingface_doc", split="train")
Now we prepare the knowledge base for the retriever by processing the dataset and storing it into a vector database. We use LangChain for its excellent vector database utilities. For the embedding model, we use thenlper/gte-small since it performed well in our RAG_evaluation cookbook.
>>> from tqdm import tqdm
>>> from transformers import AutoTokenizer
>>> from langchain.docstore.document import Document
>>> from langchain.text_splitter import RecursiveCharacterTextSplitter
>>> from langchain.vectorstores import FAISS
>>> from langchain_community.embeddings import HuggingFaceEmbeddings
>>> from langchain_community.vectorstores.utils import DistanceStrategy
>>> source_docs = [
...     Document(page_content=doc["text"], metadata={"source": doc["source"].split("/")[1]}) for doc in knowledge_base
... ]

>>> text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
...     AutoTokenizer.from_pretrained("thenlper/gte-small"),
...     chunk_size=200,
...     chunk_overlap=20,
...     add_start_index=True,
...     strip_whitespace=True,
...     separators=["\n\n", "\n", ".", " ", ""],
... )
>>> # Split docs and keep only unique ones
>>> print("Splitting documents...")
>>> docs_processed = []
>>> unique_texts = {}
>>> for doc in tqdm(source_docs):
...     new_docs = text_splitter.split_documents([doc])
...     for new_doc in new_docs:
...         if new_doc.page_content not in unique_texts:
...             unique_texts[new_doc.page_content] = True
...             docs_processed.append(new_doc)
>>> print("Embedding documents... This should take a few minutes (5 minutes on MacBook with M1 Pro)")
>>> embedding_model = HuggingFaceEmbeddings(model_name="thenlper/gte-small")
>>> vectordb = FAISS.from_documents(
...     documents=docs_processed,
...     embedding=embedding_model,
...     distance_strategy=DistanceStrategy.COSINE,
... )
Splitting documents...
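Before wiring the vector store into an agent, you can sanity-check it with a direct similarity search. A minimal sketch, where the query string is just an example:
# Quick check that retrieval returns sensible documentation chunks.
retrieved_docs = vectordb.similarity_search("How to push a model to the Hub?", k=3)
for i, doc in enumerate(retrieved_docs):
    print(f"===== Document {i} (source: {doc.metadata['source']}) =====")
    print(doc.page_content[:200])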
Now the database is ready: let's build our agentic RAG system!
👉 We only need a RetrieverTool that our agent can leverage to retrieve information from the knowledge base.
Since we need to add a vectordb as an attribute of the tool, we cannot simply use the simple tool constructor with a @tool decorator: so we will follow the advanced setup highlighted in the advanced agents documentation.
from smolagents import Tool
from langchain_core.vectorstores import VectorStore
class RetrieverTool(Tool):
    name = "retriever"
    description = "Using semantic similarity, retrieves some documents from the knowledge base that have the closest embeddings to the input query."
    inputs = {
        "query": {
            "type": "string",
            "description": "The query to perform. This should be semantically close to your target documents. Use the affirmative form rather than a question.",
        }
    }
    output_type = "string"

    def __init__(self, vectordb: VectorStore, **kwargs):
        super().__init__(**kwargs)
        self.vectordb = vectordb

    def forward(self, query: str) -> str:
        assert isinstance(query, str), "Your search query must be a string"

        docs = self.vectordb.similarity_search(
            query,
            k=7,
        )

        return "\nRetrieved documents:\n" + "".join(
            [f"===== Document {str(i)} =====\n" + doc.page_content for i, doc in enumerate(docs)]
        )
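Since smolagents tools are callable objects (we call the tool directly this way in the standard-RAG comparison later on), you can try the retriever on its own before handing it to the agent. A quick check with an illustrative query; the tool is re-instantiated below when we build the agent:
# Instantiate the tool and preview the first retrieved snippets.
retriever_tool = RetrieverTool(vectordb)
print(retriever_tool("push a model to the Hub")[:500])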
Now it's straightforward to create an agent that leverages this tool!
The agent will need these arguments upon initialization:
- tools: the list of tools that the agent will be able to call.
- model: the LLM that powers the agent.
Our model must be a callable that takes as input a list of messages and returns text. It also needs to accept a stop_sequences argument that indicates when to stop generating. For convenience, we directly use the InferenceClientModel class provided in the package to get an LLM engine that calls our Inference API.
We use meta-llama/Llama-3.1-70B-Instruct, served for free on Hugging Face's Inference API!
Note: The Inference API hosts models based on various criteria, and deployed models may be updated or replaced without prior notice. Learn more about it here.
from smolagents import InferenceClientModel, ToolCallingAgent
model = InferenceClientModel("meta-llama/Llama-3.1-70B-Instruct")
retriever_tool = RetrieverTool(vectordb)
agent = ToolCallingAgent(tools=[retriever_tool], model=model)
Since we initialized the agent as a ToolCallingAgent, it has automatically been given a default system prompt that tells the LLM engine to process step by step and generate tool calls as JSON blobs (you can replace this prompt template with your own as needed).
Then, when its .run() method is launched, the agent takes care of calling the LLM engine, parsing the tool call JSON blobs and executing these tool calls, all in a loop that ends only when the final answer is provided.
>>> agent_output = agent.run("How can I push a model to the Hub?")
>>> print("Final output:")
>>> print(agent_output)
Final output: To push a model to the Hub, you can use the push_to_hub() method after training. You can also use the PushToHubCallback to upload checkpoints regularly during a longer training run. Additionally, you can push the model up to the hub using the api.upload_folder() method.
Agentic RAG vs. standard RAG
Does the agentic setup make a better RAG system? Well, let's compare it to a standard RAG system using an LLM judge!
We will use meta-llama/Llama-3.1-70B-Instruct for evaluation since it's one of the strongest OS models we tested for LLM-judge use cases.
eval_dataset = datasets.load_dataset("m-ric/huggingface_doc_qa_eval", split="train")
Before running the test, let's make the agent less verbose.
import logging
agent.logger.setLevel(logging.WARNING) # Let's reduce the agent's verbosity level
eval_dataset = datasets.load_dataset("m-ric/huggingface_doc_qa_eval", split="train")
outputs_agentic_rag = []

for example in tqdm(eval_dataset):
    question = example["question"]
    enhanced_question = f"""Using the information contained in your knowledge base, which you can access with the 'retriever' tool,
give a comprehensive answer to the question below.
Respond only to the question asked, response should be concise and relevant to the question.
If you cannot find information, do not give up and try calling your retriever again with different arguments!
Make sure to have covered the question completely by calling the retriever tool several times with semantically different queries.
Your queries should not be questions but affirmative form sentences: e.g. rather than "How do I load a model from the Hub in bf16?", query should be "load a model from the Hub bf16 weights".

Question:
{question}"""
    answer = agent.run(enhanced_question)

    print("=======================================================")
    print(f"Question: {question}")
    print(f"Answer: {answer}")
    print(f'True answer: {example["answer"]}')

    results_agentic = {
        "question": question,
        "true_answer": example["answer"],
        "source_doc": example["source_doc"],
        "generated_answer": answer,
    }
    outputs_agentic_rag.append(results_agentic)
from huggingface_hub import InferenceClient
reader_llm = InferenceClient("Qwen/Qwen2.5-72B-Instruct")
outputs_standard_rag = []

for example in tqdm(eval_dataset):
    question = example["question"]
    context = retriever_tool(question)

    prompt = f"""Given the question and supporting documents below, give a comprehensive answer to the question.
Respond only to the question asked, response should be concise and relevant to the question.
Provide the number of the source document when relevant.
If you cannot find information, do not give up and try calling your retriever again with different arguments!

Question:
{question}

{context}
"""
    messages = [{"role": "user", "content": prompt}]
    answer = reader_llm.chat_completion(messages).choices[0].message.content

    print("=======================================================")
    print(f"Question: {question}")
    print(f"Answer: {answer}")
    print(f'True answer: {example["answer"]}')

    results_standard = {
        "question": question,
        "true_answer": example["answer"],
        "source_doc": example["source_doc"],
        "generated_answer": answer,
    }
    outputs_standard_rag.append(results_standard)
The evaluation prompt follows some of the best principles shown in our llm_judge cookbook: it uses a small integer Likert scale, with clear criteria and a description for each score.
EVALUATION_PROMPT = """You are a fair evaluator language model.
You will be given an instruction, a response to evaluate, a reference answer that gets a score of 3, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 3. You should refer to the score rubric.
3. The output format should look as follows: \"Feedback: {{write a feedback for criteria}} [RESULT] {{an integer number between 1 and 3}}\"
4. Please do not generate any other opening, closing, and explanations. Be sure to include [RESULT] in your output.
5. Do not score conciseness: a correct answer that covers the question should receive max score, even if it contains additional useless information.
The instruction to evaluate:
{instruction}
Response to evaluate:
{response}
Reference Answer (Score 3):
{reference_answer}
Score Rubrics:
[Is the response complete, accurate, and factual based on the reference answer?]
Score 1: The response is completely incomplete, inaccurate, and/or not factual.
Score 2: The response is somewhat complete, accurate, and/or factual.
Score 3: The response is completely complete, accurate, and/or factual.
Feedback:"""
from huggingface_hub import InferenceClient
evaluation_client = InferenceClient("meta-llama/Llama-3.1-70B-Instruct")
import pandas as pd
results = {}
for system_type, outputs in [
    ("agentic", outputs_agentic_rag),
    ("standard", outputs_standard_rag),
]:
    for experiment in tqdm(outputs):
        eval_prompt = EVALUATION_PROMPT.format(
            instruction=experiment["question"],
            response=experiment["generated_answer"],
            reference_answer=experiment["true_answer"],
        )
        messages = [
            {"role": "system", "content": "You are a fair evaluator language model."},
            {"role": "user", "content": eval_prompt},
        ]

        eval_result = evaluation_client.text_generation(eval_prompt, max_new_tokens=1000)
        try:
            feedback, score = [item.strip() for item in eval_result.split("[RESULT]")]
            experiment["eval_score_LLM_judge"] = score
            experiment["eval_feedback_LLM_judge"] = feedback
        except:
            print(f"Parsing failed - output was: {eval_result}")

    results[system_type] = pd.DataFrame.from_dict(outputs)
    results[system_type] = results[system_type].loc[~results[system_type]["generated_answer"].str.contains("Error")]
>>> DEFAULT_SCORE = 2  # Give average score whenever scoring fails

>>> def fill_score(x):
...     try:
...         return int(x)
...     except:
...         return DEFAULT_SCORE

>>> for system_type, outputs in [
...     ("agentic", outputs_agentic_rag),
...     ("standard", outputs_standard_rag),
... ]:
...     results[system_type]["eval_score_LLM_judge_int"] = (
...         results[system_type]["eval_score_LLM_judge"].fillna(DEFAULT_SCORE).apply(fill_score)
...     )
...     results[system_type]["eval_score_LLM_judge_int"] = (results[system_type]["eval_score_LLM_judge_int"] - 1) / 2
...     print(
...         f"Average score for {system_type} RAG: {results[system_type]['eval_score_LLM_judge_int'].mean()*100:.1f}%"
...     )
Average score for agentic RAG: 86.9%
Average score for standard RAG: 73.1%
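The averages hide per-question behavior; to see where the agentic setup actually helps, you can compare the two score columns built above. A small sketch with pandas, assuming the question texts match across both runs (rows dropped by the error filter simply fall out of the inner join):
# Inspect questions where the judge scored the two systems differently.
merged = results["agentic"].merge(results["standard"], on="question", suffixes=("_agentic", "_standard"))
disagreements = merged[merged["eval_score_LLM_judge_int_agentic"] != merged["eval_score_LLM_judge_int_standard"]]
print(disagreements[["question", "eval_score_LLM_judge_int_agentic", "eval_score_LLM_judge_int_standard"]])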
Conclusion: the agentic setup improves scores by about 14 points over standard RAG (from 73.1% to 86.9%)!
That is a huge improvement, with a very simple setup 🚀
(As a baseline, Llama-3-70B without the knowledge base got 36%.)