开源 AI 食谱文档

RAG 评估

Hugging Face's logo
加入 Hugging Face 社区

并获得增强的文档体验

开始

Open In Colab

RAG 评估

作者:Aymeric Roucher

此笔记本演示了如何通过构建合成评估数据集并使用 LLM 作为评判者来计算系统的准确性,来评估 RAG(检索增强生成)。

有关 RAG 的介绍,您可以查看另一个食谱

RAG 系统很复杂:这里是一个 RAG 图,我们用蓝色标记了系统增强的所有可能性

实现任何这些改进都可以带来巨大的性能提升;但如果无法监控更改对系统性能的影响,则更改任何内容都是无用的!因此,让我们看看如何评估我们的 RAG 系统。

评估 RAG 性能

由于有许多移动部件需要调整,对性能有很大影响,因此对 RAG 系统进行基准测试至关重要。

对于我们的评估管道,我们需要

  1. 一个包含问题-答案对(QA 对)的评估数据集
  2. 一个评估器来计算系统在上述评估数据集上的准确性。

➡️ 事实证明,我们可以使用 LLM 在整个过程中帮助我们!

  1. 评估数据集将由 LLM 🤖 合成生成,问题将由其他 LLM 🤖 过滤掉
  2. 然后,LLM 作为评判者 代理 🤖 将在这个合成数据集上执行评估。

让我们深入了解它并开始构建我们的评估管道!首先,我们安装所需的模型依赖项。

!pip install -q torch transformers transformers langchain sentence-transformers tqdm openpyxl openai pandas datasets langchain-community ragatouille
%reload_ext autoreload
%autoreload 2
from tqdm.auto import tqdm
import pandas as pd
from typing import Optional, List, Tuple
import json
import datasets

pd.set_option("display.max_colwidth", None)
from huggingface_hub import notebook_login

notebook_login()

加载您的知识库

ds = datasets.load_dataset("m-ric/huggingface_doc", split="train")

1. 构建用于评估的合成数据集

我们首先构建一个包含问题和相关上下文的合成数据集。方法是从我们的知识库中获取元素,并要求一个 LLM 根据这些文档生成问题。

然后,我们设置其他 LLM 代理作为生成 QA 对的质量过滤器:每个代理将充当特定缺陷的过滤器。

1.1. 准备源文档

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document as LangchainDocument

langchain_docs = [LangchainDocument(page_content=doc["text"], metadata={"source": doc["source"]}) for doc in tqdm(ds)]


text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=200,
    add_start_index=True,
    separators=["\n\n", "\n", ".", " ", ""],
)

docs_processed = []
for doc in langchain_docs:
    docs_processed += text_splitter.split_documents([doc])

1.2. 设置问题生成代理

我们使用 Mixtral 来生成 QA 对,因为它在 聊天机器人竞技场 等排行榜上表现出色。

from huggingface_hub import InferenceClient


repo_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

llm_client = InferenceClient(
    model=repo_id,
    timeout=120,
)


def call_llm(inference_client: InferenceClient, prompt: str):
    response = inference_client.post(
        json={
            "inputs": prompt,
            "parameters": {"max_new_tokens": 1000},
            "task": "text-generation",
        },
    )
    return json.loads(response.decode())[0]["generated_text"]


call_llm(llm_client, "This is a test context")
QA_generation_prompt = """
Your task is to write a factoid question and an answer given a context.
Your factoid question should be answerable with a specific, concise piece of factual information from the context.
Your factoid question should be formulated in the same style as questions users could ask in a search engine.
This means that your factoid question MUST NOT mention something like "according to the passage" or "context".

Provide your answer as follows:

Output:::
Factoid question: (your factoid question)
Answer: (your answer to the factoid question)

Now here is the context.

Context: {context}\n
Output:::"""

现在让我们生成我们的 QA 对。在这个例子中,我们只生成 10 个 QA 对,其余的将从 Hub 加载。

但是对于您的特定知识库,鉴于您希望获得至少 ~100 个测试样本,并考虑到我们稍后将使用我们的批评代理过滤掉大约一半的样本,您应该生成更多样本,超过 200 个样本。

import random

N_GENERATIONS = 10  # We intentionally generate only 10 QA couples here for cost and time considerations

print(f"Generating {N_GENERATIONS} QA couples...")

outputs = []
for sampled_context in tqdm(random.sample(docs_processed, N_GENERATIONS)):
    # Generate QA couple
    output_QA_couple = call_llm(llm_client, QA_generation_prompt.format(context=sampled_context.page_content))
    try:
        question = output_QA_couple.split("Factoid question: ")[-1].split("Answer: ")[0]
        answer = output_QA_couple.split("Answer: ")[-1]
        assert len(answer) < 300, "Answer is too long"
        outputs.append(
            {
                "context": sampled_context.page_content,
                "question": question,
                "answer": answer,
                "source_doc": sampled_context.metadata["source"],
            }
        )
    except:
        continue
display(pd.DataFrame(outputs).head(1))

1.3. 设置批评代理

先前代理生成的问题可能存在许多缺陷:在验证这些问题之前,我们应该进行质量检查。

因此,我们构建了批评代理,根据 这篇论文 中的几个标准来评估每个问题。

  • **基础性:**问题是否可以从给定的上下文中得到解答?
  • **相关性:**问题对用户是否相关?例如,"变压器 4.29.1 发布的日期是哪一天?" 对 ML 从业人员不相关。

我们还注意到最后一个失败案例,即当函数针对问题生成时的特定环境而定制,但本身无法理解时,例如 "本指南中使用的函数的名称是什么?"。我们还针对此标准构建了批评代理。

  • **独立性:**问题是否可以独立于任何上下文,被具有领域知识/互联网访问权限的人理解?相反的情况是,针对特定博客文章生成的问题是 本文中使用的函数是什么?

我们系统地使用所有这些代理对函数进行评分,只要任何一个代理的评分过低,我们就从评估数据集中删除该问题。

💡 在要求代理输出评分时,我们首先要求它们提供其理由。这将帮助我们验证评分,但最重要的是,要求它们先输出理由会为模型提供更多代币来思考和详细说明答案,然后将其总结为单个评分代币。

我们现在构建并运行这些批评代理。

question_groundedness_critique_prompt = """
You will be given a context and a question.
Your task is to provide a 'total rating' scoring how well one can answer the given question unambiguously with the given context.
Give your answer on a scale of 1 to 5, where 1 means that the question is not answerable at all given the context, and 5 means that the question is clearly and unambiguously answerable with the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here are the question and context.

Question: {question}\n
Context: {context}\n
Answer::: """

question_relevance_critique_prompt = """
You will be given a question.
Your task is to provide a 'total rating' representing how useful this question can be to machine learning developers building NLP applications with the Hugging Face ecosystem.
Give your answer on a scale of 1 to 5, where 1 means that the question is not useful at all, and 5 means that the question is extremely useful.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here is the question.

Question: {question}\n
Answer::: """

question_standalone_critique_prompt = """
You will be given a question.
Your task is to provide a 'total rating' representing how context-independant this question is.
Give your answer on a scale of 1 to 5, where 1 means that the question depends on additional information to be understood, and 5 means that the question makes sense by itself.
For instance, if the question refers to a particular setting, like 'in the context' or 'in the document', the rating must be 1.
The questions can contain obscure technical nouns or acronyms like Gradio, Hub, Hugging Face or Space and still be a 5: it must simply be clear to an operator with access to documentation what the question is about.

For instance, "What is the name of the checkpoint from which the ViT model is imported?" should receive a 1, since there is an implicit mention of a context, thus the question is not independant from the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here is the question.

Question: {question}\n
Answer::: """
print("Generating critique for each QA couple...")
for output in tqdm(outputs):
    evaluations = {
        "groundedness": call_llm(
            llm_client,
            question_groundedness_critique_prompt.format(context=output["context"], question=output["question"]),
        ),
        "relevance": call_llm(
            llm_client,
            question_relevance_critique_prompt.format(question=output["question"]),
        ),
        "standalone": call_llm(
            llm_client,
            question_standalone_critique_prompt.format(question=output["question"]),
        ),
    }
    try:
        for criterion, evaluation in evaluations.items():
            score, eval = (
                int(evaluation.split("Total rating: ")[-1].strip()),
                evaluation.split("Total rating: ")[-2].split("Evaluation: ")[1],
            )
            output.update(
                {
                    f"{criterion}_score": score,
                    f"{criterion}_eval": eval,
                }
            )
    except Exception as e:
        continue

现在,让我们根据批评代理评分过滤掉不好的问题。

>>> import pandas as pd

>>> pd.set_option("display.max_colwidth", None)

>>> generated_questions = pd.DataFrame.from_dict(outputs)

>>> print("Evaluation dataset before filtering:")
>>> display(
...     generated_questions[
...         [
...             "question",
...             "answer",
...             "groundedness_score",
...             "relevance_score",
...             "standalone_score",
...         ]
...     ]
... )
>>> generated_questions = generated_questions.loc[
...     (generated_questions["groundedness_score"] >= 4)
...     & (generated_questions["relevance_score"] >= 4)
...     & (generated_questions["standalone_score"] >= 4)
... ]
>>> print("============================================")
>>> print("Final evaluation dataset:")
>>> display(
...     generated_questions[
...         [
...             "question",
...             "answer",
...             "groundedness_score",
...             "relevance_score",
...             "standalone_score",
...         ]
...     ]
... )

>>> eval_dataset = datasets.Dataset.from_pandas(generated_questions, split="train", preserve_index=False)
Evaluation dataset before filtering:

现在,我们的合成评估数据集已完成!我们可以使用这个评估数据集评估不同的 RAG 系统。

我们这里只生成了几个 QA 对,以减少时间和成本。但让我们通过加载预生成的数据集来启动下一部分。

eval_dataset = datasets.load_dataset("m-ric/huggingface_doc_qa_eval", split="train")

2. 构建我们的 RAG 系统

2.1. 预处理文档以构建我们的向量数据库

  • 在本部分中,**我们将知识库中的文档拆分为更小的块**:这些将是检索器选择的片段,然后作为阅读器 LLM 的支持元素被输入到阅读器 LLM 中,用于其答案。
  • 目标是构建语义相关的片段:不要太小,以至于不足以支持答案,也不要太大,以避免稀释单个想法。

文本拆分有许多选择。

  • n 个词/字符拆分一次,但这样做有将段落甚至句子切成两半的风险。
  • n 个词/字符拆分一次,但只在句子边界处拆分。
  • **递归拆分** 试图保留更多文档结构,通过树状方式处理文档,首先在最大单位(章节)上拆分,然后递归地在较小单位(段落、句子)上拆分。

要了解更多关于分块的信息,我建议您阅读 Greg Kamradt 的 这个很棒的笔记本

这个空间 使您可以可视化不同的拆分选项如何影响获得的块。

在下面,我们使用 Langchain 的 RecursiveCharacterTextSplitter

💡 为了衡量文本拆分器中的块长度,我们的长度函数将不是字符的计数,而是标记化文本中标记的计数:实际上,对于后续处理标记的嵌入器,用标记衡量长度更相关,并且在经验上表现更好。

from langchain.docstore.document import Document as LangchainDocument

RAW_KNOWLEDGE_BASE = [
    LangchainDocument(page_content=doc["text"], metadata={"source": doc["source"]}) for doc in tqdm(ds)
]
from langchain.text_splitter import RecursiveCharacterTextSplitter
from transformers import AutoTokenizer


def split_documents(
    chunk_size: int,
    knowledge_base: List[LangchainDocument],
    tokenizer_name: str,
) -> List[LangchainDocument]:
    """
    Split documents into chunks of size `chunk_size` characters and return a list of documents.
    """
    text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
        AutoTokenizer.from_pretrained(tokenizer_name),
        chunk_size=chunk_size,
        chunk_overlap=int(chunk_size / 10),
        add_start_index=True,
        strip_whitespace=True,
        separators=["\n\n", "\n", ".", " ", ""],
    )

    docs_processed = []
    for doc in knowledge_base:
        docs_processed += text_splitter.split_documents([doc])

    # Remove duplicates
    unique_texts = {}
    docs_processed_unique = []
    for doc in docs_processed:
        if doc.page_content not in unique_texts:
            unique_texts[doc.page_content] = True
            docs_processed_unique.append(doc)

    return docs_processed_unique

2.2. 检索器 - 嵌入 🗂️

**检索器就像一个内部搜索引擎**:给定用户查询,它会从您的知识库中返回最相关的文档。

对于知识库,我们使用 Langchain 向量数据库,因为它**提供了方便的 FAISS 索引,并允许我们在整个处理过程中保留文档元数据**。

🛠️ **包含的选项:**

  • 调整分块方法。
    • 块的大小。
    • 方法:在不同的分隔符上拆分,使用 语义分块 ……
  • 更改嵌入模型。
from langchain.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores.utils import DistanceStrategy
import os


def load_embeddings(
    langchain_docs: List[LangchainDocument],
    chunk_size: int,
    embedding_model_name: Optional[str] = "thenlper/gte-small",
) -> FAISS:
    """
    Creates a FAISS index from the given embedding model and documents. Loads the index directly if it already exists.

    Args:
        langchain_docs: list of documents
        chunk_size: size of the chunks to split the documents into
        embedding_model_name: name of the embedding model to use

    Returns:
        FAISS index
    """
    # load embedding_model
    embedding_model = HuggingFaceEmbeddings(
        model_name=embedding_model_name,
        multi_process=True,
        model_kwargs={"device": "cuda"},
        encode_kwargs={"normalize_embeddings": True},  # set True to compute cosine similarity
    )

    # Check if embeddings already exist on disk
    index_name = f"index_chunk:{chunk_size}_embeddings:{embedding_model_name.replace('/', '~')}"
    index_folder_path = f"./data/indexes/{index_name}/"
    if os.path.isdir(index_folder_path):
        return FAISS.load_local(
            index_folder_path,
            embedding_model,
            distance_strategy=DistanceStrategy.COSINE,
        )

    else:
        print("Index not found, generating it...")
        docs_processed = split_documents(
            chunk_size,
            langchain_docs,
            embedding_model_name,
        )
        knowledge_index = FAISS.from_documents(
            docs_processed, embedding_model, distance_strategy=DistanceStrategy.COSINE
        )
        knowledge_index.save_local(index_folder_path)
        return knowledge_index

2.3. 阅读器 - LLM 💬

在本部分,LLM 阅读器会阅读检索到的文档以制定其答案。

🛠️ 我们尝试了以下选项来改善结果

  • 打开/关闭重排序
  • 更改阅读器模型
RAG_PROMPT_TEMPLATE = """
<|system|>
Using the information contained in the context,
give a comprehensive answer to the question.
Respond only to the question asked, response should be concise and relevant to the question.
Provide the number of the source document when relevant.
If the answer cannot be deduced from the context, do not give an answer.</s>
<|user|>
Context:
{context}
---
Now here is the question you need to answer.

Question: {question}
</s>
<|assistant|>
"""
from langchain_community.llms import HuggingFaceHub

repo_id = "HuggingFaceH4/zephyr-7b-beta"
READER_MODEL_NAME = "zephyr-7b-beta"
HF_API_TOKEN = ""

READER_LLM = HuggingFaceHub(
    repo_id=repo_id,
    task="text-generation",
    huggingfacehub_api_token=HF_API_TOKEN,
    model_kwargs={
        "max_new_tokens": 512,
        "top_k": 30,
        "temperature": 0.1,
        "repetition_penalty": 1.03,
    },
)
from ragatouille import RAGPretrainedModel
from langchain_core.vectorstores import VectorStore
from langchain_core.language_models.llms import LLM


def answer_with_rag(
    question: str,
    llm: LLM,
    knowledge_index: VectorStore,
    reranker: Optional[RAGPretrainedModel] = None,
    num_retrieved_docs: int = 30,
    num_docs_final: int = 7,
) -> Tuple[str, List[LangchainDocument]]:
    """Answer a question using RAG with the given knowledge index."""
    # Gather documents with retriever
    relevant_docs = knowledge_index.similarity_search(query=question, k=num_retrieved_docs)
    relevant_docs = [doc.page_content for doc in relevant_docs]  # keep only the text

    # Optionally rerank results
    if reranker:
        relevant_docs = reranker.rerank(question, relevant_docs, k=num_docs_final)
        relevant_docs = [doc["content"] for doc in relevant_docs]

    relevant_docs = relevant_docs[:num_docs_final]

    # Build the final prompt
    context = "\nExtracted documents:\n"
    context += "".join([f"Document {str(i)}:::\n" + doc for i, doc in enumerate(relevant_docs)])

    final_prompt = RAG_PROMPT_TEMPLATE.format(question=question, context=context)

    # Redact an answer
    answer = llm(final_prompt)

    return answer, relevant_docs

3. RAG 系统的基准测试

RAG 系统和评估数据集现已准备就绪。最后一步是在此评估数据集上评判 RAG 系统的输出。

为此,我们设置了一个评判代理。⚖️🤖

不同的 RAG 评估指标中,我们选择只关注忠实度,因为它是我们系统性能的最佳端到端指标。

我们使用 GPT4 作为评判,因为它在经验上表现良好,但您也可以尝试使用其他模型,例如 kaist-ai/prometheus-13b-v1.0BAAI/JudgeLM-33B-v1.0

💡 在评估提示中,我们针对 1-5 规模的每个指标提供详细描述,正如 Prometheus 的提示模板 中所做的那样:这有助于模型精确地对其指标进行定位。相反,如果您给评判 LLM 一个模糊的规模,则不同示例之间的输出将不够一致。

💡 同样,提示 LLM 在给出最终分数之前输出理由,会给它更多令牌,帮助它形式化和详细说明判断。

from langchain_core.language_models import BaseChatModel


def run_rag_tests(
    eval_dataset: datasets.Dataset,
    llm,
    knowledge_index: VectorStore,
    output_file: str,
    reranker: Optional[RAGPretrainedModel] = None,
    verbose: Optional[bool] = True,
    test_settings: Optional[str] = None,  # To document the test settings used
):
    """Runs RAG tests on the given dataset and saves the results to the given output file."""
    try:  # load previous generations if they exist
        with open(output_file, "r") as f:
            outputs = json.load(f)
    except:
        outputs = []

    for example in tqdm(eval_dataset):
        question = example["question"]
        if question in [output["question"] for output in outputs]:
            continue

        answer, relevant_docs = answer_with_rag(question, llm, knowledge_index, reranker=reranker)
        if verbose:
            print("=======================================================")
            print(f"Question: {question}")
            print(f"Answer: {answer}")
            print(f'True answer: {example["answer"]}')
        result = {
            "question": question,
            "true_answer": example["answer"],
            "source_doc": example["source_doc"],
            "generated_answer": answer,
            "retrieved_docs": [doc for doc in relevant_docs],
        }
        if test_settings:
            result["test_settings"] = test_settings
        outputs.append(result)

        with open(output_file, "w") as f:
            json.dump(outputs, f)
EVALUATION_PROMPT = """###Task Description:
An instruction (might include an Input inside it), a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric.
3. The output format should look as follows: \"Feedback: {{write a feedback for criteria}} [RESULT] {{an integer number between 1 and 5}}\"
4. Please do not generate any other opening, closing, and explanations. Be sure to include [RESULT] in your output.

###The instruction to evaluate:
{instruction}

###Response to evaluate:
{response}

###Reference Answer (Score 5):
{reference_answer}

###Score Rubrics:
[Is the response correct, accurate, and factual based on the reference answer?]
Score 1: The response is completely incorrect, inaccurate, and/or not factual.
Score 2: The response is mostly incorrect, inaccurate, and/or not factual.
Score 3: The response is somewhat correct, accurate, and/or factual.
Score 4: The response is mostly correct, accurate, and factual.
Score 5: The response is completely correct, accurate, and factual.

###Feedback:"""

from langchain.prompts.chat import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
)
from langchain.schema import SystemMessage


evaluation_prompt_template = ChatPromptTemplate.from_messages(
    [
        SystemMessage(content="You are a fair evaluator language model."),
        HumanMessagePromptTemplate.from_template(EVALUATION_PROMPT),
    ]
)
from langchain.chat_models import ChatOpenAI

OPENAI_API_KEY = ""

eval_chat_model = ChatOpenAI(model="gpt-4-1106-preview", temperature=0, openai_api_key=OPENAI_API_KEY)
evaluator_name = "GPT4"


def evaluate_answers(
    answer_path: str,
    eval_chat_model,
    evaluator_name: str,
    evaluation_prompt_template: ChatPromptTemplate,
) -> None:
    """Evaluates generated answers. Modifies the given answer file in place for better checkpointing."""
    answers = []
    if os.path.isfile(answer_path):  # load previous generations if they exist
        answers = json.load(open(answer_path, "r"))

    for experiment in tqdm(answers):
        if f"eval_score_{evaluator_name}" in experiment:
            continue

        eval_prompt = evaluation_prompt_template.format_messages(
            instruction=experiment["question"],
            response=experiment["generated_answer"],
            reference_answer=experiment["true_answer"],
        )
        eval_result = eval_chat_model.invoke(eval_prompt)
        feedback, score = [item.strip() for item in eval_result.content.split("[RESULT]")]
        experiment[f"eval_score_{evaluator_name}"] = score
        experiment[f"eval_feedback_{evaluator_name}"] = feedback

        with open(answer_path, "w") as f:
            json.dump(answers, f)

🚀 让我们运行测试并评估答案!👇

if not os.path.exists("./output"):
    os.mkdir("./output")

for chunk_size in [200]:  # Add other chunk sizes (in tokens) as needed
    for embeddings in ["thenlper/gte-small"]:  # Add other embeddings as needed
        for rerank in [True, False]:
            settings_name = f"chunk:{chunk_size}_embeddings:{embeddings.replace('/', '~')}_rerank:{rerank}_reader-model:{READER_MODEL_NAME}"
            output_file_name = f"./output/rag_{settings_name}.json"

            print(f"Running evaluation for {settings_name}:")

            print("Loading knowledge base embeddings...")
            knowledge_index = load_embeddings(
                RAW_KNOWLEDGE_BASE,
                chunk_size=chunk_size,
                embedding_model_name=embeddings,
            )

            print("Running RAG...")
            reranker = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0") if rerank else None
            run_rag_tests(
                eval_dataset=eval_dataset,
                llm=READER_LLM,
                knowledge_index=knowledge_index,
                output_file=output_file_name,
                reranker=reranker,
                verbose=False,
                test_settings=settings_name,
            )

            print("Running evaluation...")
            evaluate_answers(
                output_file_name,
                eval_chat_model,
                evaluator_name,
                evaluation_prompt_template,
            )

检查结果

import glob

outputs = []
for file in glob.glob("./output/*.json"):
    output = pd.DataFrame(json.load(open(file, "r")))
    output["settings"] = file
    outputs.append(output)
result = pd.concat(outputs)
result["eval_score_GPT4"] = result["eval_score_GPT4"].apply(lambda x: int(x) if isinstance(x, str) else 1)
result["eval_score_GPT4"] = (result["eval_score_GPT4"] - 1) / 4
average_scores = result.groupby("settings")["eval_score_GPT4"].mean()
average_scores.sort_values()

示例结果

让我们加载我通过调整此笔记本中提供的不同选项获得的结果。有关这些选项为何有效或无效的更多详细信息,请参阅有关 advanced_RAG 的笔记本。

正如您在下面的图表中看到的,一些调整并没有带来任何改进,而另一些则带来了巨大的性能提升。

➡️ 没有一种单一的最佳方案:在调整 RAG 系统时,您应该尝试几个不同的方向。

import plotly.express as px

scores = datasets.load_dataset("m-ric/rag_scores_cookbook", split="train")
scores = pd.Series(scores["score"], index=scores["settings"])
fig = px.bar(
    scores,
    color=scores,
    labels={
        "value": "Accuracy",
        "settings": "Configuration",
    },
    color_continuous_scale="bluered",
)
fig.update_layout(
    width=1000,
    height=600,
    barmode="group",
    yaxis_range=[0, 100],
    title="<b>Accuracy of different RAG configurations</b>",
    xaxis_title="RAG settings",
    font=dict(size=15),
)
fig.layout.yaxis.ticksuffix = "%"
fig.update_coloraxes(showscale=False)
fig.update_traces(texttemplate="%{y:.1f}", textposition="outside")
fig.show()

如您所见,这些对性能的影响各不相同。特别是,调整块大小既容易又非常有效。

但这是我们情况:您的结果可能截然不同:现在您拥有了可靠的评估管道,您可以开始探索其他选项!🗺️

< > 更新 在 GitHub 上