Using LLM-as-a-judge 🧑‍⚖️ for an automated and versatile evaluation

Authored by: Aymeric Roucher

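Evaluating Large Language Models (LLMs) is often a difficult endeavour: given their broad capabilities, the tasks given to them often have to be judged on requirements that are very broad and loosely defined. For example, an assistant's answer to a question can be:

not grounded in context
repetitive, repetitive, repetitive
grammatically incorrect
excessively lengthy and characterized by an overabundance of words, leading to a situation where the discourse or written content becomes overly detailed and protracted
incoherent
...

The list of criteria goes on and on. And even if we had a limited list, each of these would be hard to measure: "devising a rule-based program to assess the outputs is extremely challenging. Traditional evaluation metrics based on the similarity between outputs and reference answers (e.g., ROUGE, BLEU) are also ineffective for these questions."

✅ A powerful solution to assess outputs in a human way, without requiring costly human time, is LLM-as-a-judge. This method was introduced in the paper Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, which I encourage you to read.

💡 The idea is simple: ask an LLM to do the grading for you. 🤖✓

But we'll see that it will not work well out of the box: you need to set it up carefully for good results.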

!pip install huggingface_hub datasets pandas tqdm -q

import re
import pandas as pd
from tqdm.auto import tqdm
from datasets import load_dataset
from huggingface_hub import InferenceClient, notebook_login

tqdm.pandas()  # load tqdm's pandas support
pd.set_option("display.max_colwidth", None)

notebook_login()

repo_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

llm_client = InferenceClient(
    model=repo_id,
    timeout=120,
)

# Test your LLM client
llm_client.text_generation(prompt="How are you today?", max_new_tokens=20)

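1. Prepare the creation and evaluation of our LLM judge

Let's say you want to give an LLM a specific task, like answering open-ended questions.

The difficulty is that, as we discussed above, measuring the answer's quality is hard: for instance, an exact string match will flag too many correct but differently worded answers as wrong.

You could get human labellers to judge the outputs, but this is very time-consuming, and if you want to update the model or the questions, you have to do it all over again.

✅ In this case you can set up an LLM-as-a-judge.

But to use an LLM-as-a-judge, you will first need to evaluate how reliably it rates your model outputs.

➡️ So the first step will be... to create a human evaluation dataset. But you only need human annotations for a few examples: something like 30 should be enough to get a good idea of the performance. And you will be able to reuse this dataset every time you want to test your LLM-as-a-judge.

In our case, we will use feedbackQA, which contains 2 human evaluations and scores for each question/answer couple: a sample of about 30 examples (28 here, 7 per score level) is representative of what your small evaluation dataset could look like.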
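
To make the exact-match problem concrete, here is a minimal sketch using a made-up question whose reference answer is "Paris". The `normalize` and `exact_match` helpers and the candidate answers are illustrative assumptions, not part of the original notebook.

```python
import string


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


def exact_match(prediction: str, reference: str) -> bool:
    """True only if both strings are identical after normalization."""
    return normalize(prediction) == normalize(reference)


reference = "Paris"
candidates = [
    "Paris",
    "paris.",
    "The capital of France is Paris.",
    "It's Paris, the capital city.",
]

# All four candidates are correct, yet only the first two pass the check.
results = [exact_match(candidate, reference) for candidate in candidates]
for candidate, is_match in zip(candidates, results):
    print(f"{is_match!s:5} | {candidate}")
```

Even with normalization, the two fully worded answers are rejected although a human would accept them, which is the gap an LLM judge is meant to close.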

ratings = load_dataset("McGill-NLP/feedbackQA")["train"]
ratings = pd.DataFrame(ratings)

ratings["review_1"] = ratings["feedback"].apply(lambda x: x["rating"][0])
ratings["explanation_1"] = ratings["feedback"].apply(lambda x: x["explanation"][0])
ratings["review_2"] = ratings["feedback"].apply(lambda x: x["rating"][1])
ratings["explanation_2"] = ratings["feedback"].apply(lambda x: x["explanation"][1])
ratings = ratings.drop(columns=["feedback"])

# Map scores to numeric values
conversion_dict = {"Excellent": 4, "Acceptable": 3, "Could be Improved": 2, "Bad": 1}
ratings["score_1"] = ratings["review_1"].map(conversion_dict)
ratings["score_2"] = ratings["review_2"].map(conversion_dict)

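Computing a performance baseline is always a good idea: here it can be, for instance, the agreement between the two human raters, as measured by the Pearson correlation of the scores they give.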

>>> print("Correlation between 2 human raters:")
>>> print(f"{ratings['score_1'].corr(ratings['score_2'], method='pearson'):.3f}")

Correlation between 2 human raters:
0.563
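
For readers who want to see what the `corr(..., method='pearson')` call measures, here is a hand-rolled sketch of the Pearson coefficient in plain Python. The two score lists are invented toy data, not taken from feedbackQA, and `pearson` is a hypothetical helper name.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy scores on the 1-4 scale from two hypothetical raters
rater_1 = [4, 3, 2, 1, 4, 2]
rater_2 = [4, 2, 2, 1, 3, 3]

r = pearson(rater_1, rater_2)
print(f"{r:.3f}")
```

A value of 1 means the two raters rank and space the answers identically, 0 means no linear relationship, and -1 means they are perfectly opposed.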

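The correlation between the two human raters is not that good. If your human ratings are really bad, it probably means the rating criteria are not clear enough.

This means that our "ground truth" contains noise: hence we cannot expect any algorithmic evaluation to come that close to it.

However, we could reduce this noise:

by taking the average score as our ground truth instead of any single score, we should even out some of the irregularities.
by only selecting the samples where the human reviewers are in agreement.

Here, we will choose the last option and only keep examples where the 2 human reviewers are in agreement.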
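
The two noise-reduction options can be compared on a toy example. This is only a sketch in plain Python: the score pairs are invented, and the variable names are assumptions made for illustration.

```python
# Hypothetical (score_1, score_2) pairs from two human raters on the 1-4 scale
pairs = [(4, 4), (3, 2), (1, 1), (2, 4), (3, 3)]

# Option 1: keep every example and use the mean of both scores as ground truth
mean_labels = [(s1 + s2) / 2 for s1, s2 in pairs]

# Option 2: keep only the examples where both raters gave the same score
agreed_labels = [s1 for s1, s2 in pairs if s1 == s2]

print(mean_labels)
print(agreed_labels)
```

Averaging keeps all the data but produces half-point labels, while filtering on agreement keeps clean integer labels at the cost of dropping examples. On the `ratings` DataFrame, the first option would amount to something like `ratings[["score_1", "score_2"]].mean(axis=1)`.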

# Keep examples where both raters agree, then sample 7 per score level (28 in total)
ratings_where_raters_agree = ratings.loc[ratings["score_1"] == ratings["score_2"]]
examples = ratings_where_raters_agree.groupby("score_1").sample(7, random_state=1214)
examples["human_score"] = examples["score_1"]

# Visualize 1 sample for each score
display(examples.groupby("human_score").first())

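2. Create our LLM judge

We build our LLM judge with a basic prompt, containing these elements:

task description
scale description: minimum, maximum, value types (float here)
explanation of the output format
a beginning of an answer, to take the LLM by the hand as far as we can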

JUDGE_PROMPT = """
You will be given a user_question and system_answer couple.
Your task is to provide a 'total rating' scoring how well the system_answer answers the user concerns expressed in the user_question.
Give your answer as a float on a scale of 0 to 10, where 0 means that the system_answer is not helpful at all, and 10 means that the answer completely and helpfully addresses the question.

Provide your feedback as follows:

Feedback:::
Total rating: (your rating, as a float between 0 and 10)

Now here are the question and answer.

Question: {question}
Answer: {answer}

Feedback:::
Total rating: """

examples["llm_judge"] = examples.progress_apply(
    lambda x: llm_client.text_generation(
        prompt=JUDGE_PROMPT.format(question=x["question"], answer=x["answer"]),
        max_new_tokens=1000,
    ),
    axis=1,
)

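from typing import Optional


def extract_judge_score(answer: str, split_str: str = "Total rating:") -> Optional[float]:
    """Return the first number found after split_str, or None if parsing fails."""
    try:
        if split_str in answer:
            rating = answer.split(split_str)[1]
        else:
            rating = answer
        # Match integers or decimals, e.g. "3" or "7.5"
        digit_groups = [el.strip() for el in re.findall(r"\d+(?:\.\d+)?", rating)]
        return float(digit_groups[0])
    except Exception as e:
        print(e)
        return None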


examples["llm_judge_score"] = examples["llm_judge"].apply(extract_judge_score)
# Rescale the LLM score from the 0-10 range onto the 1-4 range of the human score
examples["llm_judge_score"] = (examples["llm_judge_score"] / 10) * 3 + 1
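
As a side note, mapping the judge's 0-10 range onto the 1-4 human range is a plain min-max rescaling. The helper below is a generic sketch; `rescale` and its parameter names are hypothetical and not part of the original notebook.

```python
def rescale(score, src_min=0.0, src_max=10.0, dst_min=1.0, dst_max=4.0):
    """Linearly map a score from [src_min, src_max] onto [dst_min, dst_max]."""
    fraction = (score - src_min) / (src_max - src_min)
    return dst_min + fraction * (dst_max - dst_min)


# The endpoints and midpoint of the judge scale land on 1, 2.5 and 4
rescaled = [rescale(s) for s in (0, 5, 10)]
print(rescaled)
```

Pearson correlation is unaffected by any such linear rescaling with a positive slope, so the choice of target range changes the raw scores but not the correlation reported below. It only matters when comparing judge and human scores directly.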

>>> print("Correlation between LLM-as-a-judge and the human raters:")
>>> print(f"{examples['llm_judge_score'].corr(examples['human_score'], method='pearson'):.3f}")

Correlation between LLM-as-a-judge and the human raters:
0.567

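This is not bad, given that the Pearson correlation between 2 random, independent variables would be 0!

But we can easily do better. 🔝

3. Improve the LLM judge

As shown by Aparna Dhinakaran, LLMs are bad at evaluating outputs on continuous ranges. This article gives us a few best practices to build a better prompt:

⏳ Leave more time for thought by adding an Evaluation field before the final answer.
🔢 Use a small integer scale like 1-4 or 1-5 instead of a large float scale as we had previously.
👩‍🏫 Provide an indicative scale for guidance.
We even add a carrot to motivate the LLM!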

IMPROVED_JUDGE_PROMPT = """
You will be given a user_question and system_answer couple.
Your task is to provide a 'total rating' scoring how well the system_answer answers the user concerns expressed in the user_question.
Give your answer on a scale of 1 to 4, where 1 means that the system_answer is not helpful at all, and 4 means that the system_answer completely and helpfully addresses the user_question.

Here is the scale you should use to build your answer:
1: The system_answer is terrible: completely irrelevant to the question asked, or very partial
2: The system_answer is mostly not helpful: misses some key aspects of the question
3: The system_answer is mostly helpful: provides support, but still could be improved
4: The system_answer is excellent: relevant, direct, detailed, and addresses all the concerns raised in the question

Provide your feedback as follows:

Feedback:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 4)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here are the question and answer.

Question: {question}
Answer: {answer}

Provide your feedback. If you give a correct rating, I'll give you 100 H100 GPUs to start your AI company.
Feedback:::
Evaluation: """

examples["llm_judge_improved"] = examples.progress_apply(
    lambda x: llm_client.text_generation(
        prompt=IMPROVED_JUDGE_PROMPT.format(question=x["question"], answer=x["answer"]),
        max_new_tokens=500,
    ),
    axis=1,
)
examples["llm_judge_improved_score"] = examples["llm_judge_improved"].apply(extract_judge_score)
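
Since the improved prompt makes the judge write free text before the score, it helps to check how the parsing behaves on different responses. The sketch below uses `parse_rating`, a simplified stand-in for `extract_judge_score`, and three invented responses; none of these strings come from a real model call.

```python
import re
from typing import Optional


def parse_rating(answer: str, split_str: str = "Total rating:") -> Optional[float]:
    """Return the first number after split_str, or None if no number is found."""
    tail = answer.split(split_str, 1)[1] if split_str in answer else answer
    match = re.search(r"\d+(?:\.\d+)?", tail)
    return float(match.group()) if match else None


well_formed = "The answer covers 2 of the 3 points raised.\nTotal rating: 3"
missing_field = "The answer covers 2 of the 3 points raised."
no_number = "I cannot rate this answer."

parsed = [parse_rating(text) for text in (well_formed, missing_field, no_number)]
print(parsed)
```

The second case is the one to watch: when the judge omits the 'Total rating:' field, the first number in the rationale is silently taken as the score. Spot-checking a few raw responses, or rejecting answers that lack the field, guards against this.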

>>> print("Correlation between LLM-as-a-judge and the human raters:")
>>> print(f"{examples['llm_judge_improved_score'].corr(examples['human_score'], method='pearson'):.3f}")

Correlation between LLM-as-a-judge and the human raters:
0.843

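The correlation rose from 0.567 to 0.843, a gain of nearly 0.3, with only a few tweaks to the prompt (a few points of which are due to my shameless tip to the LLM, which I hereby declare not legally binding).

Quite impressive! 👏

Let's display a few errors of our LLM judge to analyse them: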

errors = pd.concat(
    [
        examples.loc[examples["llm_judge_improved_score"] > examples["human_score"]].head(1),
        examples.loc[examples["llm_judge_improved_score"] < examples["human_score"]].head(2),
    ]
)

display(
    errors[
        [
            "question",
            "answer",
            "human_score",
            "explanation_1",
            "llm_judge_improved_score",
            "llm_judge_improved",
        ]
    ]
)

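The disagreements are minor: overall, we seem to have reached a good level of performance for our system!

4. How do we take our LLM judge even further?

🎯 You will never reach 100%: Let's first note that our human ground truth certainly has some noise, so agreement/correlation will never go up to 100% even with a perfect LLM judge.

🧭 Provide a reference: If you have access to a reference answer for each question, you should definitely give it to the judge LLM in its prompt to get better results!

▶️ Provide few-shot examples: Adding some few-shot examples of questions and ground truth evaluations in the prompt can improve the results. (I tried it here, it did not improve results in this case so I skipped it, but it could work for your dataset!)

➕ Additive scale: When the judgement can be split into atomic criteria, using an additive scale can further improve results: see below 👇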

ADDITIVE_PROMPT = """
(...)
- Award 1 point if the answer is related to the question.
- Give 1 additional point if the answer is clear and precise.
- Provide 1 further point if the answer is true.
- One final point should be awarded if the answer provides additional resources to support the user.
...
"""

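The reference-answer and few-shot suggestions above can be combined in a single prompt template. The following is only a sketch: `REFERENCE_JUDGE_PROMPT`, `format_few_shot`, and the sample question and ratings are hypothetical names and data introduced for illustration, and the wording of the template has not been tuned or evaluated.

```python
REFERENCE_JUDGE_PROMPT = """You will be given a user_question, a reference_answer and a system_answer.
Rate how well the system_answer answers the user_question on a scale of 1 to 4, using the reference_answer as the ground truth.
{few_shot_block}
Question: {question}
Reference answer: {reference}
Answer: {answer}

Feedback:::
Evaluation: """


def format_few_shot(rated_examples):
    """Render (question, answer, evaluation, rating) tuples as worked examples."""
    blocks = []
    for question, answer, evaluation, rating in rated_examples:
        blocks.append(
            f"Question: {question}\nAnswer: {answer}\n"
            f"Feedback:::\nEvaluation: {evaluation}\nTotal rating: {rating}\n"
        )
    if not blocks:
        return ""
    return "\nHere are some rated examples:\n\n" + "\n".join(blocks)


# Hypothetical human-rated example used as a demonstration for the judge
few_shot = [("What is 2 + 2?", "4", "Correct and direct.", 4)]

prompt = REFERENCE_JUDGE_PROMPT.format(
    few_shot_block=format_few_shot(few_shot),
    question="What is the capital of France?",
    reference="Paris",
    answer="The capital of France is Paris.",
)
print(prompt)
```

Passing an empty list to `format_few_shot` yields a reference-only prompt, which makes it easy to measure whether the few-shot examples actually help on your dataset.
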
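Implementation with structured generation

Using structured generation, you can configure the LLM judge to directly provide its output as a JSON with fields Evaluation and Total rating, which makes parsing easier: see our structured generation cookbook to learn more!

Conclusion

That's all for today, congrats for following along! 🥳

I'll have to leave you, some weirdos are banging on my door, claiming they have come on behalf of Mixtral to collect H100s. 🤔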
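
On the parsing side, a JSON judgement could be consumed roughly as follows. This sketch assumes the judge returns a JSON object with exactly the two field names mentioned above; `parse_structured_judgement` and the sample output string are hypothetical, and the call that produces the JSON is deliberately left out.

```python
import json


def parse_structured_judgement(raw: str, min_rating: int = 1, max_rating: int = 4) -> dict:
    """Parse and validate a JSON judgement with 'Evaluation' and 'Total rating' fields."""
    data = json.loads(raw)
    evaluation = data["Evaluation"]
    rating = data["Total rating"]
    if not isinstance(evaluation, str):
        raise ValueError("'Evaluation' must be a string")
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(rating, bool) or not isinstance(rating, int):
        raise ValueError("'Total rating' must be an integer")
    if not min_rating <= rating <= max_rating:
        raise ValueError(f"'Total rating' must be between {min_rating} and {max_rating}")
    return {"evaluation": evaluation, "rating": rating}


# Made-up judge output, standing in for a real structured-generation response
raw_output = '{"Evaluation": "Relevant but misses one concern.", "Total rating": 3}'
judgement = parse_structured_judgement(raw_output)
print(judgement["rating"])
```

Compared with regex extraction, a malformed or out-of-range answer raises an error here instead of silently producing a wrong score.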