Agentic RAG: turbocharge your RAG with query reformulation and self-query! 🚀
This is an advanced tutorial: you should be familiar with the notions covered in another cookbook first!
Reminder: Retrieval-Augmented Generation (RAG) means "using an LLM to answer a user query, but basing the answer on information retrieved from a knowledge base". It has many advantages over using a vanilla or fine-tuned LLM: to name a few, it allows grounding the answer on true facts and reducing confabulations, it allows providing the LLM with domain-specific knowledge, and it allows fine-grained control over access to information from the knowledge base.
But vanilla RAG has limitations, most importantly these two:
- It **performs only one retrieval step**: if the results are bad, the generation will in turn be bad.
- **Semantic similarity is computed with the user query as a reference**, which may not be optimal: for instance, the user query will often be a question while the document containing the true answer is phrased in the affirmative, so its similarity score will be lower than that of other source documents phrased as questions, creating a risk of missing the relevant information.
But we can alleviate these problems by making a **RAG agent**: very simply, **an agent armed with a retriever tool!**
This agent will: ✅ formulate the query itself and ✅ critique the results and re-retrieve if needed.
So it should naturally recover some advanced RAG techniques:
- Instead of directly using the user query as the reference in semantic search, the agent formulates a reference sentence itself that can be closer to the targeted documents, as in HyDE.
- The agent can critique the retrieved snippets and re-retrieve if needed, as in Self-Query.
Let's build this system. 🛠️
Run the line below to install the required dependencies:
!pip install pandas langchain langchain-community sentence-transformers faiss-cpu "transformers[agents]" --upgrade -q
We first load a knowledge base on which we want to perform RAG: this dataset is a compilation of the documentation pages for many huggingface packages, stored as markdown.
import datasets
knowledge_base = datasets.load_dataset("m-ric/huggingface_doc", split="train")
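Optionally, you can peek at one record to see the two fields we rely on below: `text` holds the page content and `source` its path.
# Optional: inspect one entry of the knowledge base
print(knowledge_base[0]["source"])
print(knowledge_base[0]["text"][:300])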
Now we prepare the knowledge base by processing the dataset and storing it into a vector database, to be used by the retriever.
We use LangChain for its excellent vector database utilities. For the embedding model, we use thenlper/gte-small since it performed well in our RAG evaluation cookbook.
>>> from tqdm import tqdm
>>> from transformers import AutoTokenizer
>>> from langchain.docstore.document import Document
>>> from langchain.text_splitter import RecursiveCharacterTextSplitter
>>> from langchain.vectorstores import FAISS
>>> from langchain_community.embeddings import HuggingFaceEmbeddings
>>> from langchain_community.vectorstores.utils import DistanceStrategy
>>> source_docs = [
... Document(page_content=doc["text"], metadata={"source": doc["source"].split("/")[1]}) for doc in knowledge_base
... ]
>>> text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
... AutoTokenizer.from_pretrained("thenlper/gte-small"),
... chunk_size=200,
... chunk_overlap=20,
... add_start_index=True,
... strip_whitespace=True,
... separators=["\n\n", "\n", ".", " ", ""],
... )
>>> # Split docs and keep only unique ones
>>> print("Splitting documents...")
>>> docs_processed = []
>>> unique_texts = {}
>>> for doc in tqdm(source_docs):
...     new_docs = text_splitter.split_documents([doc])
...     for new_doc in new_docs:
...         if new_doc.page_content not in unique_texts:
...             unique_texts[new_doc.page_content] = True
...             docs_processed.append(new_doc)
>>> print("Embedding documents... This should take a few minutes (5 minutes on MacBook with M1 Pro)")
>>> embedding_model = HuggingFaceEmbeddings(model_name="thenlper/gte-small")
>>> vectordb = FAISS.from_documents(
... documents=docs_processed,
... embedding=embedding_model,
... distance_strategy=DistanceStrategy.COSINE,
... )
Splitting documents...
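As an optional sanity check, you can query the vector store directly before wrapping it in a tool (the query string below is just an example):
# Optional: make sure retrieval works on the raw vector store
for doc in vectordb.similarity_search("push a model to the Hub", k=2):
    print(doc.metadata["source"], "->", doc.page_content[:100])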
Now the database is ready: let's build our agentic RAG system!
👉 We only need a `RetrieverTool` that our agent can leverage to retrieve information from the knowledge base.
Since we need to add a vectordb as an attribute of the tool, we cannot simply use the simple tool constructor with a `@tool` decorator: so we will follow the advanced setup highlighted in the advanced agents documentation.
from transformers.agents import Tool
from langchain_core.vectorstores import VectorStore
class RetrieverTool(Tool):
    name = "retriever"
    description = "Using semantic similarity, retrieves some documents from the knowledge base that have the closest embeddings to the input query."
    inputs = {
        "query": {
            "type": "string",
            "description": "The query to perform. This should be semantically close to your target documents. Use the affirmative form rather than a question.",
        }
    }
    output_type = "string"
    def __init__(self, vectordb: VectorStore, **kwargs):
        super().__init__(**kwargs)
        self.vectordb = vectordb
    def forward(self, query: str) -> str:
        assert isinstance(query, str), "Your search query must be a string"
        docs = self.vectordb.similarity_search(
            query,
            k=7,
        )
        return "\nRetrieved documents:\n" + "".join(
            [f"===== Document {str(i)} =====\n" + doc.page_content for i, doc in enumerate(docs)]
        )
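Before handing the tool to an agent, you can optionally instantiate it and call it directly to see the formatted string it returns (again with an example query):
# Optional: Tool instances are callable and dispatch to forward()
retriever_tool = RetrieverTool(vectordb)
print(retriever_tool("push a model to the Hub")[:500])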
Now it's straightforward to create an agent that leverages this tool!
The agent needs these arguments upon initialization:
- `tools`: a list of tools the agent will be able to call.
- `llm_engine`: the LLM that powers the agent.
Our `llm_engine` must be a callable that takes a list of messages as input and returns text. It also needs to accept a `stop_sequences` argument that indicates when to stop generating. For convenience, we directly use the `HfApiEngine` class provided in the package to get an LLM engine that calls Hugging Face's Inference API.
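If you'd rather not use `HfApiEngine`, here is a minimal sketch of what such a callable could look like, assuming you query a chat model through `huggingface_hub.InferenceClient` (the function name and the `max_tokens` value are illustrative choices, not part of the original recipe):
from huggingface_hub import InferenceClient
client = InferenceClient(model="meta-llama/Llama-3.1-70B-Instruct")
def custom_llm_engine(messages, stop_sequences=["Task"]) -> str:
    # The agent passes a list of chat messages plus stop sequences;
    # the engine must return the generated text as a plain string.
    response = client.chat_completion(messages, stop=stop_sequences, max_tokens=1000)
    return response.choices[0].message.content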
We use meta-llama/Llama-3.1-70B-Instruct as the LLM engine because:
- It has a long 128k-token context, which is helpful for processing long source documents.
- It is served for free at all times on HF's Inference API!
from transformers.agents import HfApiEngine, ReactJsonAgent
llm_engine = HfApiEngine("meta-llama/Llama-3.1-70B-Instruct")
retriever_tool = RetrieverTool(vectordb)
agent = ReactJsonAgent(tools=[retriever_tool], llm_engine=llm_engine, max_iterations=4, verbose=2)
Since we initialized the agent as a `ReactJsonAgent`, it has automatically been given a default system prompt that tells the LLM engine to process step by step and to generate tool calls as JSON blobs (you could replace this prompt template with your own if needed).
Then, when its `.run()` method is launched, the agent takes care of calling the LLM engine, parsing the tool-call JSON blobs and executing the tool calls, all in a loop that ends only when the final answer is provided.
>>> agent_output = agent.run("How can I push a model to the Hub?")
>>> print("Final output:")
>>> print(agent_output)
Final output: To push a model to the Hub, use `model.push_to_hub()`.
Agentic RAG vs. standard RAG
Does the agent setup make a better RAG system? Well, let's compare it to a standard RAG system using an LLM judge!
We will use meta-llama/Meta-Llama-3-70B-Instruct for evaluation since it's one of the strongest open models we have tested for LLM-judge use cases.
eval_dataset = datasets.load_dataset("m-ric/huggingface_doc_qa_eval", split="train")
Before running the tests, let's make the agent less verbose.
import logging
agent.logger.setLevel(logging.WARNING)
outputs_agentic_rag = []
for example in tqdm(eval_dataset):
    question = example["question"]
    enhanced_question = f"""Using the information contained in your knowledge base, which you can access with the 'retriever' tool,
give a comprehensive answer to the question below.
Respond only to the question asked, response should be concise and relevant to the question.
If you cannot find information, do not give up and try calling your retriever again with different arguments!
Make sure to have covered the question completely by calling the retriever tool several times with semantically different queries.
Your queries should not be questions but affirmative form sentences: e.g. rather than "How do I load a model from the Hub in bf16?", query should be "load a model from the Hub bf16 weights".
Question:
{question}"""
    answer = agent.run(enhanced_question)
    print("=======================================================")
    print(f"Question: {question}")
    print(f"Answer: {answer}")
    print(f'True answer: {example["answer"]}')
    results_agentic = {
        "question": question,
        "true_answer": example["answer"],
        "source_doc": example["source_doc"],
        "generated_answer": answer,
    }
    outputs_agentic_rag.append(results_agentic)
from huggingface_hub import InferenceClient
reader_llm = InferenceClient("CohereForAI/c4ai-command-r-plus")
outputs_standard_rag = []
for example in tqdm(eval_dataset):
    question = example["question"]
    context = retriever_tool(question)
    prompt = f"""Given the question and supporting documents below, give a comprehensive answer to the question.
Respond only to the question asked, response should be concise and relevant to the question.
Provide the number of the source document when relevant.
If you cannot find information, do not give up and try calling your retriever again with different arguments!
Question:
{question}
{context}
"""
    messages = [{"role": "user", "content": prompt}]
    answer = reader_llm.chat_completion(messages).choices[0].message.content
    print("=======================================================")
    print(f"Question: {question}")
    print(f"Answer: {answer}")
    print(f'True answer: {example["answer"]}')
    results_standard = {
        "question": question,
        "true_answer": example["answer"],
        "source_doc": example["source_doc"],
        "generated_answer": answer,
    }
    outputs_standard_rag.append(results_standard)
The evaluation prompt follows some of the best principles shown in our llm_judge cookbook: it uses a small-integer Likert scale, with clear criteria and a description for each score.
EVALUATION_PROMPT = """You are a fair evaluator language model.
You will be given an instruction, a response to evaluate, a reference answer that gets a score of 3, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 3. You should refer to the score rubric.
3. The output format should look as follows: \"Feedback: {{write a feedback for criteria}} [RESULT] {{an integer number between 1 and 3}}\"
4. Please do not generate any other opening, closing, and explanations. Be sure to include [RESULT] in your output.
5. Do not score conciseness: a correct answer that covers the question should receive max score, even if it contains additional useless information.
The instruction to evaluate:
{instruction}
Response to evaluate:
{response}
Reference Answer (Score 3):
{reference_answer}
Score Rubrics:
[Is the response complete, accurate, and factual based on the reference answer?]
Score 1: The response is completely incomplete, inaccurate, and/or not factual.
Score 2: The response is somewhat complete, accurate, and/or factual.
Score 3: The response is completely complete, accurate, and/or factual.
Feedback:"""
from huggingface_hub import InferenceClient
evaluation_client = InferenceClient("meta-llama/Meta-Llama-3-70B-Instruct")
>>> import pandas as pd
>>> for type, outputs in [
...     ("agentic", outputs_agentic_rag),
...     ("standard", outputs_standard_rag),
... ]:
...     for experiment in tqdm(outputs):
...         eval_prompt = EVALUATION_PROMPT.format(
...             instruction=experiment["question"],
...             response=experiment["generated_answer"],
...             reference_answer=experiment["true_answer"],
...         )
...         messages = [
...             {"role": "system", "content": "You are a fair evaluator language model."},
...             {"role": "user", "content": eval_prompt},
...         ]
...         eval_result = evaluation_client.text_generation(eval_prompt, max_new_tokens=1000)
...         try:
...             feedback, score = [item.strip() for item in eval_result.split("[RESULT]")]
...             experiment["eval_score_LLM_judge"] = score
...             experiment["eval_feedback_LLM_judge"] = feedback
...         except:
...             print(f"Parsing failed - output was: {eval_result}")
...     results = pd.DataFrame.from_dict(outputs)
...     results = results.loc[~results["generated_answer"].str.contains("Error")]
...     results["eval_score_LLM_judge_int"] = results["eval_score_LLM_judge"].fillna(1).apply(lambda x: int(x))
...     results["eval_score_LLM_judge_int"] = (results["eval_score_LLM_judge_int"] - 1) / 2
...     print(f"Average score for {type} RAG: {results['eval_score_LLM_judge_int'].mean()*100:.1f}%")
Average score for agentic RAG: 78.5%
Let's recap: the agent setup improves the score by 8.5% compared to a standard RAG! (from 70.0% to 78.5%)
This is a great improvement, with a very simple setup 🚀
(For a baseline, using Llama-3-70B without the knowledge base got 36%.)