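Deploy Meta Llama 3.1 405B on Vertex AI with the TGI DLC

Meta Llama 3.1 is the latest open LLM from Meta, the successor to Llama 3, released in July 2024. It comes in three sizes: 8B for efficient deployment and development on consumer-grade GPUs, 70B for large-scale AI-native applications, and 405B for use cases such as synthetic data generation, LLM-as-a-judge, or distillation. Notable new features include a large context length of 128K tokens (up from the original 8K), multilingual capabilities, tool use, and a more permissive license.

This example shows how to deploy meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 on Vertex AI, on an A3 accelerator-optimized instance with 8 NVIDIA H100 GPUs, using the Hugging Face Deep Learning Container (DLC) for Text Generation Inference (TGI) that is purpose-built for Google Cloud.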

'meta-llama/Meta-Llama-3.1-405B-Instruct-FP8' in the Hugging Face Hub

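Setup / Configuration

First, install gcloud, the Google Cloud command-line tool, on your local machine by following the instructions in Cloud SDK Documentation - Install the gcloud CLI.

Then install the google-cloud-aiplatform Python SDK, which is used to programmatically register the model, create the endpoint, and deploy the model on Vertex AI.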

!pip install --upgrade --quiet google-cloud-aiplatform

Optionally, to make the commands in this tutorial easier to run, set the following environment variables for GCP:

%env PROJECT_ID=your-project-id
%env LOCATION=your-location
%env CONTAINER_URI=us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu124.2-3.ubuntu2204.py311

Next, log in to your GCP account and set the project ID to the project where you want to register and deploy the model on Vertex AI.

!gcloud auth login
!gcloud auth application-default login  # For local development
!gcloud config set project $PROJECT_ID

Once logged in, enable the required service APIs in GCP: the Vertex AI API, the Compute Engine API, and the APIs related to Google Container Registry.

!gcloud services enable aiplatform.googleapis.com
!gcloud services enable compute.googleapis.com
!gcloud services enable container.googleapis.com
!gcloud services enable containerregistry.googleapis.com
!gcloud services enable containerfilesystem.googleapis.com

With everything set up, initialize the Vertex AI session through the google-cloud-aiplatform Python SDK as follows:

import os
from google.cloud import aiplatform

aiplatform.init(
    project=os.getenv("PROJECT_ID"),
    location=os.getenv("LOCATION"),
)
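Since `os.getenv` silently returns `None` for an unset variable, it can help to fail early before calling `aiplatform.init`. The following is a minimal sketch under that assumption; `require_env` is a hypothetical helper, not part of the google-cloud-aiplatform SDK.

```python
import os


def require_env(*names: str) -> dict[str, str]:
    """Return the requested environment variables, raising if any is unset or empty.

    Hypothetical helper for this tutorial; it only reads os.environ.
    """
    missing = [name for name in names if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in names}


# Example usage (assumes the variables from the setup step were exported):
# settings = require_env("PROJECT_ID", "LOCATION", "CONTAINER_URI")
# aiplatform.init(project=settings["PROJECT_ID"], location=settings["LOCATION"])
```

Checking all three variables up front surfaces a typo in the setup step immediately, rather than at model registration time.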

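Quotas on Google Cloud

Serving meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 requires an instance with at least 400 GiB of GPU VRAM that supports the FP8 data type. On Google Cloud, the A3 accelerator-optimized machines meet both requirements.

Even though A3 accelerator-optimized machines with 8 NVIDIA H100 80GB GPUs are available on Google Cloud, you still need to request a custom quota increase, because these machines require specific approval. Note that A3 machines are only available in some zones, so check the availability of A3 High or A3 Mega machines per zone in Compute Engine - GPU regions and zones.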

A3 availability in Google Cloud

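To request the quota increase needed for a machine with 8 NVIDIA H100 GPUs, raise the following quotas:

Service: Vertex AI API, Name: Custom model serving Nvidia H100 80GB GPUs per region, set to 8
Service: Vertex AI API, Name: Custom model serving A3 CPUs per region, set to 208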
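If you plan to run more than one replica, both quotas scale with the number of machines. The sketch below only does that arithmetic, using the per-machine figures from the list above (8 GPUs and 208 CPUs); `required_quota` and its dictionary keys are illustrative names, not Google Cloud quota identifiers.

```python
# Per-machine figures taken from the quota values listed above.
GPUS_PER_A3_MACHINE = 8
CPUS_PER_A3_MACHINE = 208


def required_quota(replicas: int) -> dict[str, int]:
    """Return the per-region quota values needed for `replicas` A3 machines."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return {
        "h100_80gb_gpus": replicas * GPUS_PER_A3_MACHINE,
        "a3_cpus": replicas * CPUS_PER_A3_MACHINE,
    }


print(required_quota(1))
```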

A3 Quota Request in Google Cloud

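For more information on how to request a quota increase, see Google Cloud Documentation - View and manage quotas.

Register the model on Vertex AI

Since meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 is a gated model, you need to log in to your Hugging Face Hub account, accept the gating requirements, and then generate an access token, either with fine-grained read access to just the gated model (recommended) or with read access to your whole account.

Read more about access tokens for the Hugging Face Hub.

To authenticate, you can either use the huggingface_hub Python SDK (recommended) or simply set the HF_TOKEN environment variable.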

!pip install --upgrade --quiet huggingface_hub

from huggingface_hub import interpreter_login

interpreter_login()

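Then you can "upload" the model, i.e. register it on Vertex AI. It is not an upload in the strict sense: at startup, the Hugging Face DLC for TGI downloads the model from the Hugging Face Hub, using the MODEL_ID environment variable. What gets uploaded is only the configuration, not the model weights.

Before going into the code, here is a quick review of the arguments passed to the upload method:

display_name is the name shown in the Vertex AI Model Registry.
serving_container_image_uri is the location of the Hugging Face DLC for TGI that will be used to serve the model.
serving_container_environment_variables are the environment variables used at container runtime. They match the ones that TGI defines through text-generation-launcher, which exposes, among others:
- MODEL_ID: the model ID on the Hugging Face Hub.
- NUM_SHARD: the number of shards, i.e. the number of GPUs to use; set to 8 here because the node has 8 NVIDIA H100 GPUs.
- HUGGING_FACE_HUB_TOKEN: the Hugging Face Hub token, required because meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 is a gated model.
- HF_HUB_ENABLE_HF_TRANSFER: enables faster downloads through the hf_transfer library.

For more information on the supported arguments, see the Python reference for aiplatform.Model.upload.

Starting with the TGI 2.3 DLC (i.e. us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu124.2-3.ubuntu2204.py311), you can set the environment variable MESSAGES_API_ENABLED="true" to deploy the Messages API on Vertex AI; otherwise the Generate API is deployed.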
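The environment variables described above can be assembled in one place, which makes the Messages API toggle explicit. This is a minimal sketch: `build_tgi_env` is a hypothetical helper, and the only variable names it uses are the ones listed above. All values are kept as strings, since environment variables are passed to the container as text.

```python
def build_tgi_env(
    model_id: str,
    hf_token: str,
    num_shard: int,
    messages_api: bool = False,
) -> dict[str, str]:
    """Build the container environment for the TGI DLC (hypothetical helper)."""
    env = {
        "MODEL_ID": model_id,
        "HUGGING_FACE_HUB_TOKEN": hf_token,
        "HF_HUB_ENABLE_HF_TRANSFER": "1",
        "NUM_SHARD": str(num_shard),
    }
    if messages_api:
        # Only honoured by TGI 2.3 DLCs and later, as noted above.
        env["MESSAGES_API_ENABLED"] = "true"
    return env


env = build_tgi_env(
    "meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",
    hf_token="<your-token>",  # placeholder, replace with a real token
    num_shard=8,
)
print(sorted(env))
```

The resulting dictionary can be passed as `serving_container_environment_variables`.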

import os
from huggingface_hub import get_token

model = aiplatform.Model.upload(
    display_name="meta-llama--Meta-Llama-3.1-405B-Instruct-FP8",
    serving_container_image_uri=os.getenv("CONTAINER_URI"),
    serving_container_environment_variables={
        "MODEL_ID": "meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",
        "HUGGING_FACE_HUB_TOKEN": get_token(),
        "HF_HUB_ENABLE_HF_TRANSFER": "1",
        "NUM_SHARD": "8",
    },
)
model.wait()

Meta Llama 3.1 405B FP8 registered on Vertex AI

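Deploy the model on Vertex AI

Once Meta Llama 3.1 405B is registered in the Vertex AI Model Registry, you can deploy it to a Vertex AI endpoint using the Hugging Face DLC for TGI.

The deploy method links the previously created endpoint resource with the model, which contains the serving container configuration, and then deploys the model on Vertex AI on the specified instance.

Before going into the code, here is a quick review of the arguments passed to the deploy method:

endpoint is the endpoint to deploy the model to. It is optional; by default, an endpoint named after the model display name with the suffix _endpoint is used.
machine_type, accelerator_type, and accelerator_count define which instance to use and, respectively, which accelerator and how many of them. machine_type and accelerator_type are tied together, so you need to pick an instance that supports the accelerator you are using, and vice versa. For more information on the available instances, see Compute Engine Documentation - GPU machine types; for the accelerator_type naming, see Vertex AI Documentation - MachineSpec.

For more information on the supported arguments, see the Python reference for aiplatform.Model.deploy.

As mentioned above, Meta Llama 3.1 405B in FP8 takes about 400 GiB of disk space, so you need at least 400 GiB of GPU VRAM to load the model, and the GPUs in the node must support the FP8 data type. This example uses an A3 instance with 8 NVIDIA H100 80GB GPUs, for a total of about 640 GiB of VRAM, which is enough to load the model and still leave VRAM free for the KV cache and CUDA graphs.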
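The memory budget above is simple arithmetic, sketched here with the approximate figures from the text (8 GPUs, 80 per GPU, about 400 for the weights). The numbers are rough and ignore the GB versus GiB distinction; `vram_headroom` is an illustrative helper, not a TGI or Vertex AI function.

```python
def vram_headroom(gpu_count: int, vram_per_gpu: float, weights: float) -> tuple[float, float, float]:
    """Return (total VRAM, VRAM left after loading weights, headroom per GPU)."""
    total = gpu_count * vram_per_gpu
    free = total - weights
    if free < 0:
        raise ValueError("the weights do not fit in the available VRAM")
    return total, free, free / gpu_count


total, free, per_gpu = vram_headroom(gpu_count=8, vram_per_gpu=80, weights=400)
print(f"total={total}, free for KV cache and CUDA graphs={free}, per GPU={per_gpu}")
```

With these inputs, roughly 240 of the 640 remain for the KV cache and CUDA graphs, or about 30 per GPU once the weights are sharded evenly.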

deployed_model = model.deploy(
    endpoint=aiplatform.Endpoint.create(display_name="Meta-Llama-3.1-405B-FP8-Endpoint"),
    machine_type="a3-highgpu-8g",
    accelerator_type="NVIDIA_H100_80GB",
    accelerator_count=8,
    enable_access_logging=True,
)

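Deploying meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 on Vertex AI takes about 30 minutes, since it needs to allocate the resources on Google Cloud, download the weights from the Hugging Face Hub (about 10 minutes), and load them in TGI for inference (about 3 minutes).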

Meta Llama 3.1 405B Instruct FP8 deployed on Vertex AI

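Online predictions on Vertex AI

Finally, you can run online predictions on Vertex AI using the predict method, which sends the request to the /predict route of the running endpoint, as specified within the container, following the Vertex AI I/O payload format.

Because /generate is the endpoint that TGI exposes on Vertex AI, the messages need to be formatted with the chat template before the request is sent. To do that, install 🤗 transformers and use the apply_chat_template method of PreTrainedTokenizerFast.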

%%bash
pip install --upgrade --quiet transformers

Then apply the chat template to the conversation with the tokenizer as follows:

import os
from huggingface_hub import get_token
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",
    token=get_token(),
)

messages = [
    {"role": "system", "content": "You are an assistant that responds as a pirate."},
    {"role": "user", "content": "What's the Theory of Relativity?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
# <|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are an assistant that responds as a pirate.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat's the Theory of Relativity?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n

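This is what you will send in the payload to the deployed Vertex AI endpoint, together with the generation parameters, as described in Consuming Text Generation Inference (TGI) -> Generate.

Via Python

Within the same session

To run the online prediction within the current session, you can send requests programmatically through the aiplatform.Endpoint returned by the aiplatform.Model.deploy method, as in the following snippet: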
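To make the structure of the formatted prompt easier to see, here is a hand-rolled sketch that reproduces the string shown in the comment above for simple system and user turns. It is an illustration only, not a replacement for apply_chat_template: the real template shipped with the tokenizer may handle additional cases (tools, extra system text, trimming) that this hypothetical `format_llama3_prompt` function ignores.

```python
def format_llama3_prompt(messages: list[dict[str, str]]) -> str:
    """Concatenate chat turns using the special tokens visible in the example above."""
    prompt = "<|begin_of_text|>"
    for message in messages:
        # Each turn is a role header, a blank line, the content, and an end-of-turn token.
        prompt += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    # Equivalent of add_generation_prompt=True: open the assistant turn.
    prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt


messages = [
    {"role": "system", "content": "You are an assistant that responds as a pirate."},
    {"role": "user", "content": "What's the Theory of Relativity?"},
]
print(format_llama3_prompt(messages))
```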

output = deployed_model.predict(
    instances=[
        {
            "inputs": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are an assistant that responds as a pirate.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat's the Theory of Relativity?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
            "parameters": {
                "max_new_tokens": 128,
                "do_sample": True,
                "top_p": 0.95,
                "temperature": 1.0,
            },
        },
    ]
)
print(output.predictions[0])

The returned output object looks like this (the print call above shows only the first prediction string):

Prediction(predictions=["Yer want ta know about them fancy science things, eh? Alright then, matey, settle yerself down with a pint o' grog and listen close. I be tellin' ye about the Theory o' Relativity, as proposed by that swashbucklin' genius, Albert Einstein.\n\nNow, ye see, Einstein said that time and space be connected like the sea and the wind. Ye can't have one without the other, savvy? And he proposed that how ye see time and space depends on how fast ye be movin' and where ye be standin'. That be called relativity, me"], deployed_model_id='***', metadata=None, model_version_id='1', model_resource_name='projects/***/locations/us-central1/models/***', explanations=None)

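From a different session

If the Vertex AI endpoint was deployed in a different session and you no longer have access to the deployed_model variable returned by aiplatform.Model.deploy in the previous section, you can instantiate the deployed aiplatform.Endpoint from its resource name, which has the form projects/{PROJECT_ID}/locations/{LOCATION}/endpoints/{ENDPOINT_ID}, with the snippet below.

You need to either retrieve the resource name (i.e. the projects/{PROJECT_ID}/locations/{LOCATION}/endpoints/{ENDPOINT_ID} path) yourself from the Google Cloud Console, or just replace ENDPOINT_ID below. The ID is available as endpoint.id on a previously instantiated endpoint, or under Online predictions in the Google Cloud Console, where the endpoints are listed.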
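The resource name format above can be built and checked with a couple of small helpers, which is handy when the name is copied from the console by hand. This is a sketch using only the standard library; both function names are hypothetical and nothing here calls Vertex AI.

```python
import re

# Matches projects/{PROJECT_ID}/locations/{LOCATION}/endpoints/{ENDPOINT_ID}
_ENDPOINT_NAME = re.compile(
    r"^projects/(?P<project>[^/]+)/locations/(?P<location>[^/]+)/endpoints/(?P<endpoint_id>[^/]+)$"
)


def endpoint_resource_name(project: str, location: str, endpoint_id: str) -> str:
    """Build the full endpoint resource name from its three components."""
    return f"projects/{project}/locations/{location}/endpoints/{endpoint_id}"


def parse_endpoint_resource_name(name: str) -> dict[str, str]:
    """Split a resource name into its components, raising ValueError if malformed."""
    match = _ENDPOINT_NAME.match(name)
    if match is None:
        raise ValueError(f"Not a valid endpoint resource name: {name!r}")
    return match.groupdict()


name = endpoint_resource_name("my-project", "us-central1", "1234567890")
print(parse_endpoint_resource_name(name))
```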

import os
from google.cloud import aiplatform

aiplatform.init(project=os.getenv("PROJECT_ID"), location=os.getenv("LOCATION"))

endpoint_display_name = "Meta-Llama-3.1-405B-FP8-Endpoint"  # TODO: change to your endpoint display name

# Iterates over all the Vertex AI Endpoints within the current project and keeps the first match (if any), otherwise set to None
ENDPOINT_ID = next(
    (endpoint.name for endpoint in aiplatform.Endpoint.list() if endpoint.display_name == endpoint_display_name), None
)
assert ENDPOINT_ID, (
    "`ENDPOINT_ID` is not set, please make sure that the `endpoint_display_name` is correct at "
    f"https://console.cloud.google.com/vertex-ai/online-prediction/endpoints?project={os.getenv('PROJECT_ID')}"
)

endpoint = aiplatform.Endpoint(
    f"projects/{os.getenv('PROJECT_ID')}/locations/{os.getenv('LOCATION')}/endpoints/{ENDPOINT_ID}"
)
output = endpoint.predict(
    instances=[
        {
            "inputs": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are an assistant that responds as a pirate.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat's the Theory of Relativity?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
            "parameters": {
                "max_new_tokens": 128,
                "do_sample": True,
                "top_p": 0.95,
                "temperature": 0.7,
            },
        },
    ],
)
print(output.predictions[0])

The returned output object looks like this (the print call above shows only the first prediction string):

Prediction(predictions=["Yer lookin' fer a treasure trove o' knowledge about them fancy physics, eh? Alright then, matey, settle yerself down with a pint o' grog and listen close, as I spin ye the yarn o' Einstein's Theory o' Relativity.\n\nIt be a tale o' two parts, me hearty: Special Relativity and General Relativity. Now, I know what ye be thinkin': what in blazes be the difference? Well, matey, let me break it down fer ye.\n\nSpecial Relativity be the idea that time and space be connected like the sea and the sky."], deployed_model_id='***', metadata=None, model_version_id='1', model_resource_name='projects/***/locations/us-central1/models/***', explanations=None)

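Via the Vertex AI Online Prediction UI

Alternatively, for testing purposes, you can use the Vertex AI Online Prediction UI, which provides a field that expects the JSON payload formatted according to the Vertex AI specification (as in the examples above), i.e.: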

{
    "instances": [
        {
            "inputs": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are an assistant that responds as a pirate.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat's the Theory of Relativity?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
            "parameters": {
                "max_new_tokens": 128,
                "do_sample": true,
                "top_p": 0.95,
                "temperature": 0.7
            }
        }
    ]
}
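Writing the JSON by hand is error-prone because the prompt contains escaped newlines and Python's True must become JSON's true. A minimal sketch that serializes the same structure with the standard json module follows; `build_payload` is a hypothetical helper and the short prompt is a placeholder.

```python
import json


def build_payload(prompt: str, **parameters) -> str:
    """Serialize one instance in the instances/inputs/parameters layout shown above."""
    payload = {"instances": [{"inputs": prompt, "parameters": parameters}]}
    return json.dumps(payload, indent=4)


text = build_payload(
    "<|begin_of_text|>...",  # placeholder: paste the formatted prompt here
    max_new_tokens=128,
    do_sample=True,
    top_p=0.95,
    temperature=0.7,
)
print(text)
```

The printed text can be pasted directly into the request field of the UI.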

Meta Llama 3.1 405B Instruct FP8 online prediction on Vertex AI

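Resource clean-up

Finally, you can release the resources you created as follows, to avoid unnecessary costs.

deployed_model.undeploy_all undeploys the model from all the endpoints.
deployed_model.delete deletes the endpoint the model was deployed to, gracefully, after the undeploy_all method.
model.delete deletes the model from the registry.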

deployed_model.undeploy_all()
deployed_model.delete()
model.delete()
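The order of these three calls matters: the endpoint is emptied first, then deleted, and the model is removed last. The sketch below wraps that sequence in a hypothetical `cleanup` function that accepts any objects exposing the same three methods, so the ordering can be exercised with stand-ins before touching real resources. Note that the variable called deployed_model in this tutorial is the endpoint returned by deploy.

```python
def cleanup(endpoint, model) -> list[str]:
    """Run the clean-up steps in order and return the labels of the completed steps."""
    steps = (
        ("undeploy_all", endpoint.undeploy_all),  # empty the endpoint first
        ("delete_endpoint", endpoint.delete),     # then remove the endpoint itself
        ("delete_model", model.delete),           # finally remove the registry entry
    )
    completed = []
    for label, action in steps:
        action()
        completed.append(label)
    return completed


# Example usage with the objects from this tutorial:
# cleanup(endpoint=deployed_model, model=model)
```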

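Alternatively, you can remove these resources from the Google Cloud Console with the following steps:

Go to Vertex AI in Google Cloud.
Go to Deploy and use -> Online prediction.
Click the endpoint, then click the deployed model and choose "Undeploy model from endpoint".
Go back to the endpoint list and delete the endpoint.
Finally, go to Deploy and use -> Model Registry and delete the model.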

📍 Find the complete example on GitHub here!
