开源 AI 食谱文档

RAG 评估

Hugging Face's logo
加入 Hugging Face 社区

并获得增强的文档体验

开始使用

Open In Colab

RAG 评估

作者:Aymeric Roucher

本笔记本演示了如何通过构建一个合成评估数据集并使用“LLM 作为评判者”来计算系统准确性,从而评估您的 RAG(检索增强生成)系统。

有关 RAG 的介绍,您可以查阅另一篇指南

RAG 系统很复杂:这是一张 RAG 图表,我们用蓝色标注了所有可以增强系统的可能性。

实施这些改进中的任何一项都可以带来巨大的性能提升;但是,如果您无法监控更改对系统性能的影响,那么任何更改都是徒劳的!所以,让我们来看看如何评估我们的 RAG 系统。

评估 RAG 性能

由于有如此多的可调动部分对性能有巨大影响,因此对 RAG 系统进行基准测试至关重要。

对于我们的评估流程,我们将需要:

  1. 一个包含问答对(QA 对)的评估数据集
  2. 一个评估器,用于计算我们的系统在上述评估数据集上的准确性。

➡️ 事实证明,我们可以使用 LLM 在整个过程中为我们提供帮助!

  1. 评估数据集将由一个 LLM 🤖 合成生成,问题将由其他 LLM 🤖 过滤。
  2. 一个将 LLM 作为评判者的代理 🤖 将在这个合成数据集上执行评估。

让我们深入研究并开始构建我们的评估流程! 首先,我们安装所需的模型依赖项。

!pip install -q torch transformers langchain sentence-transformers tqdm openpyxl openai pandas datasets langchain-community ragatouille
%reload_ext autoreload
%autoreload 2
from tqdm.auto import tqdm
import pandas as pd
from typing import Optional, List, Tuple
import json
import datasets

pd.set_option("display.max_colwidth", None)
from huggingface_hub import notebook_login

notebook_login()

加载您的知识库

ds = datasets.load_dataset("m-ric/huggingface_doc", split="train")

1. 构建一个用于评估的合成数据集

我们首先构建一个包含问题和相关上下文的合成数据集。方法是从我们的知识库中获取元素,并要求 LLM 根据这些文档生成问题。

然后我们设置其他 LLM 代理作为生成的 QA 对的质量过滤器:每个代理都将针对特定缺陷进行过滤。

1.1. 准备源文档

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document as LangchainDocument

langchain_docs = [LangchainDocument(page_content=doc["text"], metadata={"source": doc["source"]}) for doc in tqdm(ds)]


text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=200,
    add_start_index=True,
    separators=["\n\n", "\n", ".", " ", ""],
)

docs_processed = []
for doc in langchain_docs:
    docs_processed += text_splitter.split_documents([doc])

1.2. 设置用于问题生成的代理

我们使用 Mixtral 来生成 QA 对,因为它在诸如 Chatbot Arena 等排行榜中表现出色。

from huggingface_hub import InferenceClient


repo_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

llm_client = InferenceClient(
    model=repo_id,
    timeout=120,
)


def call_llm(inference_client: InferenceClient, prompt: str):
    response = inference_client.post(
        json={
            "inputs": prompt,
            "parameters": {"max_new_tokens": 1000},
            "task": "text-generation",
        },
    )
    return json.loads(response.decode())[0]["generated_text"]


call_llm(llm_client, "This is a test context")
QA_generation_prompt = """
Your task is to write a factoid question and an answer given a context.
Your factoid question should be answerable with a specific, concise piece of factual information from the context.
Your factoid question should be formulated in the same style as questions users could ask in a search engine.
This means that your factoid question MUST NOT mention something like "according to the passage" or "context".

Provide your answer as follows:

Output:::
Factoid question: (your factoid question)
Answer: (your answer to the factoid question)

Now here is the context.

Context: {context}\n
Output:::"""

现在让我们生成我们的 QA 对。在这个例子中,我们只生成 10 个 QA 对,剩下的将从 Hub 加载。

但是对于您特定的知识库,考虑到您希望至少获得约 100 个测试样本,并且考虑到我们稍后会用我们的评判代理过滤掉其中大约一半,您应该生成更多,数量应大于 200 个样本。

import random

N_GENERATIONS = 10  # We intentionally generate only 10 QA couples here for cost and time considerations

print(f"Generating {N_GENERATIONS} QA couples...")

outputs = []
for sampled_context in tqdm(random.sample(docs_processed, N_GENERATIONS)):
    # Generate QA couple
    output_QA_couple = call_llm(llm_client, QA_generation_prompt.format(context=sampled_context.page_content))
    try:
        question = output_QA_couple.split("Factoid question: ")[-1].split("Answer: ")[0]
        answer = output_QA_couple.split("Answer: ")[-1]
        assert len(answer) < 300, "Answer is too long"
        outputs.append(
            {
                "context": sampled_context.page_content,
                "question": question,
                "answer": answer,
                "source_doc": sampled_context.metadata["source"],
            }
        )
    except:
        continue
display(pd.DataFrame(outputs).head(1))

1.3. 设置评判代理

之前代理生成的问题可能存在许多缺陷:在验证这些问题之前,我们应该进行质量检查。

因此,我们构建评判代理,它们将根据这篇论文中给出的几个标准对每个问题进行评分。

  • 依据性: 问题能否从给定的上下文中得到回答?
  • 相关性: 问题是否与用户相关?例如,“transformers 4.29.1 版本发布的日期是什么?”对机器学习从业者来说并不相关。

我们注意到的最后一个失败案例是,当一个函数是为生成问题的特定环境量身定制的,但本身无法理解,比如“本指南中使用的函数名称是什么?”。我们还为这个标准构建了一个评判代理。

  • 独立性:对于具备领域知识/互联网访问权限的人来说,这个问题在没有任何上下文的情况下是否可以理解?与此相反的例子是,从一篇特定博客文章中生成的问题:“本文中使用的函数是什么?”。

我们系统地用所有这些代理对函数进行评分,只要任何一个代理的得分过低,我们就会从评估数据集中剔除该问题。

💡 在要求代理输出分数时,我们首先要求它们给出其理由。这将帮助我们验证分数,但更重要的是,要求它首先输出理由可以给模型更多的词元来思考和阐述答案,然后再将其总结为单个分数词元。

我们现在构建并运行这些评判代理。

question_groundedness_critique_prompt = """
You will be given a context and a question.
Your task is to provide a 'total rating' scoring how well one can answer the given question unambiguously with the given context.
Give your answer on a scale of 1 to 5, where 1 means that the question is not answerable at all given the context, and 5 means that the question is clearly and unambiguously answerable with the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here are the question and context.

Question: {question}\n
Context: {context}\n
Answer::: """

question_relevance_critique_prompt = """
You will be given a question.
Your task is to provide a 'total rating' representing how useful this question can be to machine learning developers building NLP applications with the Hugging Face ecosystem.
Give your answer on a scale of 1 to 5, where 1 means that the question is not useful at all, and 5 means that the question is extremely useful.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here is the question.

Question: {question}\n
Answer::: """

question_standalone_critique_prompt = """
You will be given a question.
Your task is to provide a 'total rating' representing how context-independent this question is.
Give your answer on a scale of 1 to 5, where 1 means that the question depends on additional information to be understood, and 5 means that the question makes sense by itself.
For instance, if the question refers to a particular setting, like 'in the context' or 'in the document', the rating must be 1.
The questions can contain obscure technical nouns or acronyms like Gradio, Hub, Hugging Face or Space and still be a 5: it must simply be clear to an operator with access to documentation what the question is about.

For instance, "What is the name of the checkpoint from which the ViT model is imported?" should receive a 1, since there is an implicit mention of a context, thus the question is not independent from the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here is the question.

Question: {question}\n
Answer::: """
print("Generating critique for each QA couple...")
for output in tqdm(outputs):
    evaluations = {
        "groundedness": call_llm(
            llm_client,
            question_groundedness_critique_prompt.format(context=output["context"], question=output["question"]),
        ),
        "relevance": call_llm(
            llm_client,
            question_relevance_critique_prompt.format(question=output["question"]),
        ),
        "standalone": call_llm(
            llm_client,
            question_standalone_critique_prompt.format(question=output["question"]),
        ),
    }
    try:
        for criterion, evaluation in evaluations.items():
            score, eval = (
                int(evaluation.split("Total rating: ")[-1].strip()),
                evaluation.split("Total rating: ")[-2].split("Evaluation: ")[1],
            )
            output.update(
                {
                    f"{criterion}_score": score,
                    f"{criterion}_eval": eval,
                }
            )
    except Exception as e:
        continue

现在,让我们根据评判代理的分数过滤掉不好的问题。

>>> import pandas as pd

>>> pd.set_option("display.max_colwidth", None)

>>> generated_questions = pd.DataFrame.from_dict(outputs)

>>> print("Evaluation dataset before filtering:")
>>> display(
...     generated_questions[
...         [
...             "question",
...             "answer",
...             "groundedness_score",
...             "relevance_score",
...             "standalone_score",
...         ]
...     ]
... )
>>> generated_questions = generated_questions.loc[
...     (generated_questions["groundedness_score"] >= 4)
...     & (generated_questions["relevance_score"] >= 4)
...     & (generated_questions["standalone_score"] >= 4)
... ]
>>> print("============================================")
>>> print("Final evaluation dataset:")
>>> display(
...     generated_questions[
...         [
...             "question",
...             "answer",
...             "groundedness_score",
...             "relevance_score",
...             "standalone_score",
...         ]
...     ]
... )

>>> eval_dataset = datasets.Dataset.from_pandas(generated_questions, split="train", preserve_index=False)
Evaluation dataset before filtering:

现在我们的合成评估数据集已经完成了!我们可以在这个评估数据集上评估不同的 RAG 系统。

我们在这里只生成了几个 QA 对以减少时间和成本。但让我们通过加载一个预先生成的数据集来开始下一部分。

eval_dataset = datasets.load_dataset("m-ric/huggingface_doc_qa_eval", split="train")

2. 构建我们的 RAG 系统

2.1. 预处理文档以构建我们的向量数据库

  • 在这一部分,我们将知识库中的文档分割成更小的块:这些块将是被检索器选中的片段,然后被阅读器 LLM 作为其答案的支持元素。
  • 目标是构建语义相关的片段:既不能太小以至于不足以支持答案,也不能太大以免稀释单个观点。

文本分割有多种选择:

  • n 个单词/字符分割一次,但这有切断段落甚至句子的风险。
  • n 个单词/字符后分割,但仅在句子边界处分割。
  • 递归分割 试图通过树状处理方式,首先按最大的单位(章节)分割,然后递归地按更小的单位(段落、句子)分割,从而更多地保留文档结构。

要了解有关分块的更多信息,我建议您阅读 Greg Kamradt 撰写的这本很棒的笔记本

这个空间可以让你可视化不同分割选项对你得到的块的影响。

在下文中,我们使用 Langchain 的 RecursiveCharacterTextSplitter

💡 为了在我们的文本分割器中测量块长度,我们的长度函数将不是字符数,而是分词后文本中的词元数:确实,对于处理词元的后续嵌入器,以词元为单位测量长度更相关,并且经验上表现更好。

from langchain.docstore.document import Document as LangchainDocument

RAW_KNOWLEDGE_BASE = [
    LangchainDocument(page_content=doc["text"], metadata={"source": doc["source"]}) for doc in tqdm(ds)
]
from langchain.text_splitter import RecursiveCharacterTextSplitter
from transformers import AutoTokenizer


def split_documents(
    chunk_size: int,
    knowledge_base: List[LangchainDocument],
    tokenizer_name: str,
) -> List[LangchainDocument]:
    """
    Split documents into chunks of size `chunk_size` characters and return a list of documents.
    """
    text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
        AutoTokenizer.from_pretrained(tokenizer_name),
        chunk_size=chunk_size,
        chunk_overlap=int(chunk_size / 10),
        add_start_index=True,
        strip_whitespace=True,
        separators=["\n\n", "\n", ".", " ", ""],
    )

    docs_processed = []
    for doc in knowledge_base:
        docs_processed += text_splitter.split_documents([doc])

    # Remove duplicates
    unique_texts = {}
    docs_processed_unique = []
    for doc in docs_processed:
        if doc.page_content not in unique_texts:
            unique_texts[doc.page_content] = True
            docs_processed_unique.append(doc)

    return docs_processed_unique

2.2. 检索器 - 嵌入 🗂️

检索器就像一个内部搜索引擎:给定用户查询,它会从您的知识库中返回最相关的文档。

对于知识库,我们使用 Langchain 向量数据库,因为它提供了一个方便的 FAISS 索引,并允许我们在整个处理过程中保留文档元数据

🛠️ 包含的选项:

  • 调整分块方法
    • 块的大小
    • 方法:按不同的分隔符分割,使用语义分块
  • 更改嵌入模型
from langchain.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores.utils import DistanceStrategy
import os


def load_embeddings(
    langchain_docs: List[LangchainDocument],
    chunk_size: int,
    embedding_model_name: Optional[str] = "thenlper/gte-small",
) -> FAISS:
    """
    Creates a FAISS index from the given embedding model and documents. Loads the index directly if it already exists.

    Args:
        langchain_docs: list of documents
        chunk_size: size of the chunks to split the documents into
        embedding_model_name: name of the embedding model to use

    Returns:
        FAISS index
    """
    # load embedding_model
    embedding_model = HuggingFaceEmbeddings(
        model_name=embedding_model_name,
        multi_process=True,
        model_kwargs={"device": "cuda"},
        encode_kwargs={"normalize_embeddings": True},  # set True to compute cosine similarity
    )

    # Check if embeddings already exist on disk
    index_name = f"index_chunk:{chunk_size}_embeddings:{embedding_model_name.replace('/', '~')}"
    index_folder_path = f"./data/indexes/{index_name}/"
    if os.path.isdir(index_folder_path):
        return FAISS.load_local(
            index_folder_path,
            embedding_model,
            distance_strategy=DistanceStrategy.COSINE,
        )

    else:
        print("Index not found, generating it...")
        docs_processed = split_documents(
            chunk_size,
            langchain_docs,
            embedding_model_name,
        )
        knowledge_index = FAISS.from_documents(
            docs_processed, embedding_model, distance_strategy=DistanceStrategy.COSINE
        )
        knowledge_index.save_local(index_folder_path)
        return knowledge_index

2.3. 阅读器 - LLM 💬

在这一部分,LLM 阅读器会阅读检索到的文档来组织其答案。

🛠️ 这里我们尝试了以下选项来改进结果

  • 开启/关闭重排
  • 更改阅读器模型
RAG_PROMPT_TEMPLATE = """
<|system|>
Using the information contained in the context,
give a comprehensive answer to the question.
Respond only to the question asked, response should be concise and relevant to the question.
Provide the number of the source document when relevant.
If the answer cannot be deduced from the context, do not give an answer.</s>
<|user|>
Context:
{context}
---
Now here is the question you need to answer.

Question: {question}
</s>
<|assistant|>
"""
from langchain_community.llms import HuggingFaceHub

repo_id = "HuggingFaceH4/zephyr-7b-beta"
READER_MODEL_NAME = "zephyr-7b-beta"
HF_API_TOKEN = ""

READER_LLM = HuggingFaceHub(
    repo_id=repo_id,
    task="text-generation",
    huggingfacehub_api_token=HF_API_TOKEN,
    model_kwargs={
        "max_new_tokens": 512,
        "top_k": 30,
        "temperature": 0.1,
        "repetition_penalty": 1.03,
    },
)
from ragatouille import RAGPretrainedModel
from langchain_core.vectorstores import VectorStore
from langchain_core.language_models.llms import LLM


def answer_with_rag(
    question: str,
    llm: LLM,
    knowledge_index: VectorStore,
    reranker: Optional[RAGPretrainedModel] = None,
    num_retrieved_docs: int = 30,
    num_docs_final: int = 7,
) -> Tuple[str, List[LangchainDocument]]:
    """Answer a question using RAG with the given knowledge index."""
    # Gather documents with retriever
    relevant_docs = knowledge_index.similarity_search(query=question, k=num_retrieved_docs)
    relevant_docs = [doc.page_content for doc in relevant_docs]  # keep only the text

    # Optionally rerank results
    if reranker:
        relevant_docs = reranker.rerank(question, relevant_docs, k=num_docs_final)
        relevant_docs = [doc["content"] for doc in relevant_docs]

    relevant_docs = relevant_docs[:num_docs_final]

    # Build the final prompt
    context = "\nExtracted documents:\n"
    context += "".join([f"Document {str(i)}:::\n" + doc for i, doc in enumerate(relevant_docs)])

    final_prompt = RAG_PROMPT_TEMPLATE.format(question=question, context=context)

    # Redact an answer
    answer = llm(final_prompt)

    return answer, relevant_docs

3. RAG 系统基准测试

RAG 系统和评估数据集现已准备就绪。最后一步是在此评估数据集上评判 RAG 系统的输出。

为此,我们设置了一个评判代理。⚖️🤖

不同的 RAG 评估指标中,我们选择只关注答案正确性,因为它是我们系统性能的最佳端到端指标。

我们使用 GPT4 作为评判者,因为它在经验上表现良好,但你也可以尝试其他模型,例如 kaist-ai/prometheus-13b-v1.0BAAI/JudgeLM-33B-v1.0

💡 在评估提示中,我们对 1-5 分的每个指标都给出了详细描述,就像在 Prometheus 的提示模板中所做的那样:这有助于模型精确地确定其指标。相反,如果你给评判者 LLM 一个模糊的评分标准,不同示例之间的输出将不够一致。

💡 再次强调,在给出最终评分之前提示 LLM 输出理由,可以为其提供更多词元来帮助其形成和阐述判断。

from langchain_core.language_models import BaseChatModel


def run_rag_tests(
    eval_dataset: datasets.Dataset,
    llm,
    knowledge_index: VectorStore,
    output_file: str,
    reranker: Optional[RAGPretrainedModel] = None,
    verbose: Optional[bool] = True,
    test_settings: Optional[str] = None,  # To document the test settings used
):
    """Runs RAG tests on the given dataset and saves the results to the given output file."""
    try:  # load previous generations if they exist
        with open(output_file, "r") as f:
            outputs = json.load(f)
    except:
        outputs = []

    for example in tqdm(eval_dataset):
        question = example["question"]
        if question in [output["question"] for output in outputs]:
            continue

        answer, relevant_docs = answer_with_rag(question, llm, knowledge_index, reranker=reranker)
        if verbose:
            print("=======================================================")
            print(f"Question: {question}")
            print(f"Answer: {answer}")
            print(f'True answer: {example["answer"]}')
        result = {
            "question": question,
            "true_answer": example["answer"],
            "source_doc": example["source_doc"],
            "generated_answer": answer,
            "retrieved_docs": [doc for doc in relevant_docs],
        }
        if test_settings:
            result["test_settings"] = test_settings
        outputs.append(result)

        with open(output_file, "w") as f:
            json.dump(outputs, f)
EVALUATION_PROMPT = """###Task Description:
An instruction (might include an Input inside it), a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric.
3. The output format should look as follows: \"Feedback: {{write a feedback for criteria}} [RESULT] {{an integer number between 1 and 5}}\"
4. Please do not generate any other opening, closing, and explanations. Be sure to include [RESULT] in your output.

###The instruction to evaluate:
{instruction}

###Response to evaluate:
{response}

###Reference Answer (Score 5):
{reference_answer}

###Score Rubrics:
[Is the response correct, accurate, and factual based on the reference answer?]
Score 1: The response is completely incorrect, inaccurate, and/or not factual.
Score 2: The response is mostly incorrect, inaccurate, and/or not factual.
Score 3: The response is somewhat correct, accurate, and/or factual.
Score 4: The response is mostly correct, accurate, and factual.
Score 5: The response is completely correct, accurate, and factual.

###Feedback:"""

from langchain.prompts.chat import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
)
from langchain.schema import SystemMessage


evaluation_prompt_template = ChatPromptTemplate.from_messages(
    [
        SystemMessage(content="You are a fair evaluator language model."),
        HumanMessagePromptTemplate.from_template(EVALUATION_PROMPT),
    ]
)
from langchain.chat_models import ChatOpenAI

OPENAI_API_KEY = ""

eval_chat_model = ChatOpenAI(model="gpt-4-1106-preview", temperature=0, openai_api_key=OPENAI_API_KEY)
evaluator_name = "GPT4"


def evaluate_answers(
    answer_path: str,
    eval_chat_model,
    evaluator_name: str,
    evaluation_prompt_template: ChatPromptTemplate,
) -> None:
    """Evaluates generated answers. Modifies the given answer file in place for better checkpointing."""
    answers = []
    if os.path.isfile(answer_path):  # load previous generations if they exist
        answers = json.load(open(answer_path, "r"))

    for experiment in tqdm(answers):
        if f"eval_score_{evaluator_name}" in experiment:
            continue

        eval_prompt = evaluation_prompt_template.format_messages(
            instruction=experiment["question"],
            response=experiment["generated_answer"],
            reference_answer=experiment["true_answer"],
        )
        eval_result = eval_chat_model.invoke(eval_prompt)
        feedback, score = [item.strip() for item in eval_result.content.split("[RESULT]")]
        experiment[f"eval_score_{evaluator_name}"] = score
        experiment[f"eval_feedback_{evaluator_name}"] = feedback

        with open(answer_path, "w") as f:
            json.dump(answers, f)

🚀 让我们运行测试并评估答案吧!👇

if not os.path.exists("./output"):
    os.mkdir("./output")

for chunk_size in [200]:  # Add other chunk sizes (in tokens) as needed
    for embeddings in ["thenlper/gte-small"]:  # Add other embeddings as needed
        for rerank in [True, False]:
            settings_name = f"chunk:{chunk_size}_embeddings:{embeddings.replace('/', '~')}_rerank:{rerank}_reader-model:{READER_MODEL_NAME}"
            output_file_name = f"./output/rag_{settings_name}.json"

            print(f"Running evaluation for {settings_name}:")

            print("Loading knowledge base embeddings...")
            knowledge_index = load_embeddings(
                RAW_KNOWLEDGE_BASE,
                chunk_size=chunk_size,
                embedding_model_name=embeddings,
            )

            print("Running RAG...")
            reranker = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0") if rerank else None
            run_rag_tests(
                eval_dataset=eval_dataset,
                llm=READER_LLM,
                knowledge_index=knowledge_index,
                output_file=output_file_name,
                reranker=reranker,
                verbose=False,
                test_settings=settings_name,
            )

            print("Running evaluation...")
            evaluate_answers(
                output_file_name,
                eval_chat_model,
                evaluator_name,
                evaluation_prompt_template,
            )

检查结果

import glob

outputs = []
for file in glob.glob("./output/*.json"):
    output = pd.DataFrame(json.load(open(file, "r")))
    output["settings"] = file
    outputs.append(output)
result = pd.concat(outputs)
result["eval_score_GPT4"] = result["eval_score_GPT4"].apply(lambda x: int(x) if isinstance(x, str) else 1)
result["eval_score_GPT4"] = (result["eval_score_GPT4"] - 1) / 4
average_scores = result.groupby("settings")["eval_score_GPT4"].mean()
average_scores.sort_values()

示例结果

让我们加载我通过调整此笔记本中可用的不同选项获得的结果。有关这些选项为何可能有效或无效的更多详细信息,请参阅关于高级 RAG 的笔记本。

正如下图所示,有些调整并没有带来任何改进,而有些则带来了巨大的性能提升。

➡️ 没有单一的良方:在调整 RAG 系统时,您应该尝试几个不同的方向。

import plotly.express as px

scores = datasets.load_dataset("m-ric/rag_scores_cookbook", split="train")
scores = pd.Series(scores["score"], index=scores["settings"])
fig = px.bar(
    scores,
    color=scores,
    labels={
        "value": "Accuracy",
        "settings": "Configuration",
    },
    color_continuous_scale="bluered",
)
fig.update_layout(
    width=1000,
    height=600,
    barmode="group",
    yaxis_range=[0, 100],
    title="<b>Accuracy of different RAG configurations</b>",
    xaxis_title="RAG settings",
    font=dict(size=15),
)
fig.layout.yaxis.ticksuffix = "%"
fig.update_coloraxes(showscale=False)
fig.update_traces(texttemplate="%{y:.1f}", textposition="outside")
fig.show()

如您所见,这些调整对性能的影响各不相同。特别是,调整块大小既简单又影响巨大。

但这是我们的情况:您的结果可能会大不相同:既然您有了一个稳健的评估流程,您就可以开始探索其他选项了!🗺️

< > 在 GitHub 上更新