Evaluate open LLMs with Vertex AI and Gemini

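The Gen AI Evaluation Service in Vertex AI lets us evaluate LLMs or applications against existing or custom evaluation criteria. It supports academic metrics such as BLEU and ROUGE, as well as LLM-as-a-judge with pointwise and pairwise metrics or custom metrics you define. By default it uses Gemini 1.5 Pro as the judge.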

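We can use the Gen AI Evaluation Service to evaluate the performance of open models and fine-tuned models that use Vertex AI Endpoints and compute resources. In this example, we evaluate news article summaries generated by meta-llama/Meta-Llama-3.1-8B-Instruct with a pointwise metric based on the G-Eval coherence metric.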

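We will cover the following topics:

Setup / Configuration
Deploy Llama 3.1 8B on Vertex AI
Evaluate Llama 3.1 8B on coherence using different prompts
Interpret the results
Clean up resources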

Setup / Configuration

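First, you need to install gcloud, the command-line tool for Google Cloud, on your local machine. Follow the instructions in Cloud SDK Documentation - Install the gcloud CLI.

You also need to install the google-cloud-aiplatform Python SDK, which is required to programmatically create the Vertex AI model, register it, create the endpoint, and deploy the model on Vertex AI.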

!pip install --upgrade --quiet "google-cloud-aiplatform[evaluation]"  huggingface_hub transformers datasets

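For convenience, we define the following environment variables for GCP.

Note 1: Make sure to change the project ID to your own GCP project.
Note 2: The Gen AI Evaluation Service is not available in all regions. To use it, you need to choose a region that supports it. us-central1 is currently supported.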

%env PROJECT_ID=gcp-partnership-412108
%env LOCATION=us-central1
%env CONTAINER_URI=us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.2-2.ubuntu2204.py310

Next, log in to your GCP account and set the project ID to the project in which you want to register and deploy the model on Vertex AI.

!gcloud auth login
!gcloud auth application-default login  # For local development
!gcloud config set project $PROJECT_ID

Once you are logged in, enable the required service APIs in GCP: the Vertex AI API, the Compute Engine API, and the Google Container Registry related APIs.

!gcloud services enable aiplatform.googleapis.com
!gcloud services enable compute.googleapis.com
!gcloud services enable container.googleapis.com
!gcloud services enable containerregistry.googleapis.com
!gcloud services enable containerfilesystem.googleapis.com

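Deploy Llama 3.1 8B on Vertex AI

With everything set up, we can deploy the Llama 3.1 8B model on Vertex AI. We will use the google-cloud-aiplatform Python SDK to do so. meta-llama/Meta-Llama-3.1-8B-Instruct is a gated model, so you need to log in to your Hugging Face Hub account with a read access token. This can be either a fine-grained token with access to the gated model or a token with overall read access to your account. For more information on how to generate a read-only access token for the Hugging Face Hub, see the instructions at https://huggingface.co/docs/hub/en/security-tokens.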

from huggingface_hub import interpreter_login

interpreter_login()

After logging in, we can "upload" the model, i.e. register it on Vertex AI. To learn more about the arguments you can pass to the upload method, see Deploy Gemma 7B with TGI on Vertex AI.

import os
from google.cloud import aiplatform

aiplatform.init(
    project=os.getenv("PROJECT_ID"),
    location=os.getenv("LOCATION"),
)
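os.getenv returns None when a variable is unset, so a missing PROJECT_ID or LOCATION would only surface later as a less obvious error. A minimal sketch of a fail-fast check follows. require_env is a hypothetical helper for illustration, not part of the Vertex AI SDK, and the default values are placeholders.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Placeholder values for illustration only; in the notebook these come from %env.
os.environ.setdefault("PROJECT_ID", "my-gcp-project")
os.environ.setdefault("LOCATION", "us-central1")

project_id = require_env("PROJECT_ID")
location = require_env("LOCATION")
print(project_id, location)
```

The returned values could then be passed to aiplatform.init in place of the bare os.getenv calls.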

We will deploy meta-llama/Meta-Llama-3.1-8B-Instruct to 1x NVIDIA L4 accelerator with 24GB of memory. We set the TGI parameters to allow a maximum of 8000 input tokens, 8192 maximum total tokens, and 8192 maximum batch prefill tokens.

from huggingface_hub import get_token

vertex_model_name = "llama-3-1-8b-instruct"

model = aiplatform.Model.upload(
    display_name=vertex_model_name,
    serving_container_image_uri=os.getenv("CONTAINER_URI"),
    serving_container_environment_variables={
        "MODEL_ID": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "MAX_INPUT_TOKENS": "8000",
        "MAX_TOTAL_TOKENS": "8192",
        "MAX_BATCH_PREFILL_TOKENS": "8192",
        "HUGGING_FACE_HUB_TOKEN": get_token(),
    },
    serving_container_ports=[8080],
)
model.wait()  # wait for the model to be registered

# create endpoint
endpoint = aiplatform.Endpoint.create(display_name=f"{vertex_model_name}-endpoint")

# deploy model to 1x NVIDIA L4
deployed_model = model.deploy(
    endpoint=endpoint,
    machine_type="g2-standard-4",
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
)

Deploying to the Vertex AI endpoint via the deploy method can take 15 to 25 minutes.
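To our understanding, TGI rejects a request when the prompt tokens plus max_new_tokens exceed MAX_TOTAL_TOKENS. Under that assumption, the limits configured above leave the generation budget sketched below. generation_budget is a hypothetical helper for illustration only.

```python
MAX_INPUT_TOKENS = 8000
MAX_TOTAL_TOKENS = 8192


def generation_budget(prompt_tokens: int) -> int:
    """Tokens left for generation for a prompt of the given length."""
    if prompt_tokens > MAX_INPUT_TOKENS:
        raise ValueError(
            f"prompt has {prompt_tokens} tokens, limit is {MAX_INPUT_TOKENS}"
        )
    return MAX_TOTAL_TOKENS - prompt_tokens


print(generation_budget(1200))  # 6992
print(generation_budget(8000))  # 192
```

A prompt at the 8000-token input limit leaves only 192 tokens, which is fewer than the max_new_tokens of 256 used in the generation config below, so very long articles may need truncation or a smaller max_new_tokens.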

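Once the model is deployed, we can test the endpoint. We define a helper function, generate, that sends a request to the deployed model. We will use it later to query the deployed model and collect the outputs for evaluation.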

import re
from transformers import AutoTokenizer

# grep the model id from the container spec environment variables
model_id = next(
    (
        re.search(r'value: "(.+)"', str(item)).group(1)
        for item in list(model.container_spec.env)
        if "MODEL_ID" in str(item)
    ),
    None,
)
# load the tokenizer for the same model id that the endpoint serves
tokenizer = AutoTokenizer.from_pretrained(model_id)

generation_config = {
    "max_new_tokens": 256,
    "do_sample": True,
    "top_p": 0.2,
    "temperature": 0.2,
}


def generate(prompt, generation_config=generation_config):
    formatted_prompt = tokenizer.apply_chat_template(
        [
            {"role": "user", "content": prompt},
        ],
        tokenize=False,
        add_generation_prompt=True,
    )

    payload = {"inputs": formatted_prompt, "parameters": generation_config}
    output = deployed_model.predict(instances=[payload])
    generated_text = output.predictions[0]
    return generated_text


generate("How many people live in Berlin?", generation_config)
# 'The population of Berlin is approximately 6.578 million as of my cut off data. However, considering it provides real-time updates, the current population might be slightly higher'

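Evaluate Llama 3.1 8B on coherence using different prompts

We will evaluate the Llama 3.1 8B model on coherence using different prompts. Coherence measures how well the individual sentences of a summarized news article connect to form a unified and easily understandable narrative.

We will use the new Gen AI Evaluation Service, which can be used for: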

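Model selection: Choose the best pre-trained model for your task based on benchmark results and its performance on your specific data.
Generation settings: Tweak model parameters (such as temperature) to optimize the output for your needs.
Prompt engineering: Craft effective prompts and prompt templates to guide the model toward your preferred behavior and responses.
Improve and safeguard fine-tuning: Fine-tune a model to improve performance for your use case while avoiding biases or undesirable behaviors.
RAG optimization: Select the most effective Retrieval Augmented Generation (RAG) architecture to enhance your application's performance.
Migration: Continually assess and improve the performance of your AI solution by migrating to newer models when they provide a clear advantage for your specific use case.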

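In this example, we use it to evaluate different prompt templates and find the one that yields the most coherent summaries with Llama 3.1 8B Instruct.

We will use a reference-free pointwise metric based on the G-Eval coherence metric.

The first step is to define our prompt template and create our PointwiseMetric. Vertex AI returns the model's response in the response field, and the news article is available in the text field.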

from vertexai.evaluation import EvalTask, PointwiseMetric

g_eval_coherence = """
You are an expert evaluator. You will be given one summary written for a news article.
Your task is to rate the summary on one metric.
Please make sure you read and understand these instructions carefully. Please keep this document open while reviewing, and refer to it as needed.

Evaluation Criteria:

Coherence (1-5) - the collective quality of all sentences. We align this dimension with the DUC quality question of structure and coherence whereby "the summary should be well-structured and well-organized. The summary should not just be a heap of related information, but should build from sentence to a coherent body of information about a topic."

Evaluation Steps:

1. Read the news article carefully and identify the main topic and key points.
2. Read the summary and compare it to the news article. Check if the summary covers the main topic and key points of the news article, and if it presents them in a clear and logical order.
3. Assign a score for coherence on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria.


Example:


Source Text:

{text}

Summary:

{response}

Evaluation Form (scores ONLY):

- Coherence:"""

metric = PointwiseMetric(
    metric="g-eval-coherence",
    metric_prompt_template=g_eval_coherence,
)
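To see roughly what the judge model receives, the two placeholders can be filled locally. This is only an illustration with a shortened stand-in template and made-up text. The evaluation service performs the substitution itself, and its exact mechanism is not shown here.

```python
# Shortened stand-in for the metric prompt template; same placeholders.
template = (
    "Source Text:\n\n{text}\n\n"
    "Summary:\n\n{response}\n\n"
    "Evaluation Form (scores ONLY):\n\n- Coherence:"
)

# Made-up row: `text` is the dataset column, `response` is the model output.
row = {
    "text": "The city council approved a new budget on Monday.",
    "response": "The council passed a budget.",
}

judge_prompt = template.format(**row)
print(judge_prompt)
```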

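We will use the argilla/news-summary dataset, which consists of news articles from Reuters. We use a random subset of 15 articles to keep the evaluation fast. Feel free to change the dataset and the number of articles to evaluate the model with more data and different topics.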

from datasets import load_dataset

subset_size = 15
dataset = load_dataset("argilla/news-summary", split="train").shuffle(seed=42).select(range(subset_size))

# print first 150 characters of the first article
print(dataset[0]["text"][:150])

Before we can run the evaluation, we need to convert our dataset into a pandas DataFrame.

# remove all columns except for "text"
to_remove = [col for col in dataset.features.keys() if col != "text"]
dataset = dataset.remove_columns(to_remove)
df = dataset.to_pandas()
df.head()

Awesome! We are almost done. The last step is to define the different summarization prompts we want to evaluate.

summarization_prompts = {
    "simple": "Summarize the following news article: {text}",
    "eli5": "Summarize the following news article in a way a 5 year old would understand: {text}",
    "detailed": """Summarize the given news article, text, including all key points and supporting details? The summary should be comprehensive and accurately reflect the main message and arguments presented in the original text, while also being concise and easy to understand. To ensure accuracy, please read the text carefully and pay attention to any nuances or complexities in the language.
  
Article:
{text}""",
}

Now we can iterate over our prompts, create an evaluation task for each, score the summaries with our coherence metric, and collect the results.

import uuid


results = {}
for prompt_name, prompt in summarization_prompts.items():

    # 1. add new prompt column
    df["prompt"] = df["text"].apply(lambda x: prompt.format(text=x))

    # 2. create eval task
    eval_task = EvalTask(
        dataset=df,
        metrics=[metric],
        experiment="llama-3-1-8b-instruct",
    )
    # 3. run eval task
    # Note: If the last iteration takes > 1 minute you might need to retry the evaluation
    exp_results = eval_task.evaluate(
        model=generate, experiment_run_name=f"prompt-{prompt_name}-{str(uuid.uuid4())[:8]}"
    )
    print(f"{prompt_name}: {exp_results.summary_metrics['g-eval-coherence/mean']}")
    results[prompt_name] = exp_results.summary_metrics["g-eval-coherence/mean"]

for prompt_name, score in sorted(results.items(), key=lambda x: x[1], reverse=True):
    print(f"{prompt_name}: {score}")
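The note in the loop above mentions that an evaluation run may occasionally need to be retried. One way to handle that is a small retry wrapper, sketched here with a stand-in function instead of the real eval_task.evaluate call. with_retries and flaky_evaluate are hypothetical names, and the returned score is made up.

```python
import time


def with_retries(fn, attempts=3, wait_seconds=0.0):
    """Call fn() up to `attempts` times and re-raise the last error."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as error:  # broad on purpose for this sketch
            last_error = error
            print(f"attempt {attempt} failed: {error}")
            time.sleep(wait_seconds)
    raise last_error


# Stand-in for eval_task.evaluate(...): fails once, then succeeds.
calls = {"count": 0}


def flaky_evaluate():
    calls["count"] += 1
    if calls["count"] < 2:
        raise TimeoutError("simulated timeout")
    return {"g-eval-coherence/mean": 4.0}  # made-up value


result = with_retries(flaky_evaluate)
print(result)
```

In the loop, the call could be wrapped as with_retries(lambda: eval_task.evaluate(...)), with a non-zero wait between attempts.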

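Nice, it looks like the "simple" prompt yields the best results in our limited test. You can inspect and compare the results in the GCP Console under Vertex AI > Model Development > Experiments.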

[Image: experiment-results]

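The overview lets you compare results across experiments and inspect individual evaluations. Here we can see that the standard deviation for "detailed" is quite high. This could be due to the small sample size, or it could mean the prompt needs further improvement.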
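With only 15 articles, the mean score is itself uncertain. The sketch below uses made-up scores, not real results, to show how the sample standard deviation translates into a standard error of the mean.

```python
import math
import statistics

# Made-up coherence scores (1-5) for 15 summaries; not real results.
scores = [5, 4, 2, 5, 3, 5, 1, 4, 5, 2, 5, 3, 4, 5, 2]

mean = statistics.mean(scores)
std = statistics.stdev(scores)  # sample standard deviation
stderr = std / math.sqrt(len(scores))  # standard error of the mean

print(f"mean={mean:.2f} std={std:.2f} stderr={stderr:.2f}")
```

Because the standard error shrinks with the square root of the sample size, quadrupling the number of articles would roughly halve it.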

You can find more examples of how to use the Gen AI Evaluation Service in the Vertex AI Generative AI documentation.

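Clean up resources

Finally, you can release the resources you created as follows, to avoid unnecessary costs:

deployed_model.undeploy_all undeploys the model from all endpoints.
deployed_model.delete gracefully deletes the endpoint(s) the model was deployed to, after the undeploy_all call.
model.delete deletes the model from the registry.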

deployed_model.undeploy_all()
deployed_model.delete()
model.delete()

📍 Find the complete example on GitHub here!
