Deploy Gemma 7B with TGI DLC on Vertex AI

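Gemma is a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models, developed by Google DeepMind and other teams across Google. Text Generation Inference (TGI) is a toolkit developed by Hugging Face for deploying and serving LLMs with high-performance text generation. Google Vertex AI is a machine learning (ML) platform that lets you train and deploy ML models and AI applications, and customize large language models (LLMs) for use in your AI-powered applications.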

This example shows how to deploy any supported text-generation model, in this case google/gemma-7b-it, from the Hugging Face Hub on Vertex AI using the TGI DLC available in Google Cloud Platform (GCP).

[Image: 'google/gemma-7b-it' in the Hugging Face Hub]

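Setup / Configuration

First, you need to install gcloud, the Google Cloud command-line tool, on your local machine by following the instructions at Cloud SDK Documentation - Install the gcloud CLI.

Then you also need to install the google-cloud-aiplatform Python SDK, which is required to programmatically create the Vertex AI model, register it, create the endpoint, and deploy it on Vertex AI.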

!pip install --upgrade --quiet google-cloud-aiplatform

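Optionally, to make the commands in this tutorial easier to run, you can set the following environment variables for GCP, replacing the placeholder values with your own project ID and location: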

%env PROJECT_ID=your-project-id
%env LOCATION=your-location
%env CONTAINER_URI=us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu124.2-3.ubuntu2204.py311
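Because the later Python cells read these values with os.getenv, a missing variable only shows up as a None argument further down. The following is a minimal sketch of an early check; the helper name missing_env_vars is hypothetical and not part of any SDK.

```python
import os

# The three variables this tutorial reads later via os.getenv(...)
REQUIRED_VARS = ("PROJECT_ID", "LOCATION", "CONTAINER_URI")


def missing_env_vars(required=REQUIRED_VARS, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_env_vars()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required environment variables are set.")
```

Running it right after the %env cells gives a clear message before any gcloud or Vertex AI call is made.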

Then you need to log in to your GCP account and set the project ID to the one you want to use to register and deploy the models on Vertex AI.

!gcloud auth login
!gcloud auth application-default login  # For local development
!gcloud config set project $PROJECT_ID

Once you are logged in, you need to enable the necessary service APIs in GCP, such as the Vertex AI API, the Compute Engine API, and the Google Container Registry related APIs.

!gcloud services enable aiplatform.googleapis.com
!gcloud services enable compute.googleapis.com
!gcloud services enable container.googleapis.com
!gcloud services enable containerregistry.googleapis.com
!gcloud services enable containerfilesystem.googleapis.com

Register model on Vertex AI

Once everything is set up, you can initialize the Vertex AI session via the google-cloud-aiplatform Python SDK as follows:

import os
from google.cloud import aiplatform

aiplatform.init(
    project=os.getenv("PROJECT_ID"),
    location=os.getenv("LOCATION"),
)

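Since google/gemma-7b-it is a gated model, you need to log in to your Hugging Face Hub account with a read-access token, either a fine-grained token with access to the gated model or a token with overall read access to your account. For more information on how to generate a read-only access token for the Hugging Face Hub, see the instructions at https://huggingface.co/docs/hub/en/security-tokens.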

!pip install --upgrade --quiet huggingface_hub

from huggingface_hub import interpreter_login

interpreter_login()

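Then you can "upload" the model, i.e. register it on Vertex AI. It is not an upload per se, since the model is downloaded automatically from the Hugging Face Hub inside the Hugging Face DLC for TGI on startup, via the MODEL_ID environment variable. What is uploaded is only the configuration, not the model weights.

Before going into the code, let's quickly review the arguments provided to the upload method: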

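display_name is the name that will be shown in the Vertex AI Model Registry.
serving_container_image_uri is the location of the Hugging Face DLC for TGI that will be used for serving the model.
serving_container_environment_variables are the environment variables used during the container runtime, so they are aligned with the environment variables defined by text-generation-inference, which are analogous to the text-generation-launcher arguments. Additionally, the Hugging Face DLC for TGI also captures the AIP_ environment variables from Vertex AI, as described in Vertex AI Documentation - Custom container requirements for prediction.
- MODEL_ID is the identifier of the model in the Hugging Face Hub. To explore all the supported models, see https://huggingface.co/models?sort=trending&other=text-generation-inference.
- NUM_SHARD is the number of shards to use, which you set when you don't want to use all the GPUs on a given machine (e.g. if you have two GPUs but want to use only one for TGI, then NUM_SHARD=1; otherwise it matches CUDA_VISIBLE_DEVICES).
- MAX_INPUT_TOKENS is the maximum allowed input length (expressed in number of tokens). The larger it is, the longer the prompt can be, but also the more memory it consumes.
- MAX_TOTAL_TOKENS is the most important value to set, as it defines the "memory budget" for running client requests. The larger this value, the more memory each request takes up and the less effective batching becomes.
- MAX_BATCH_PREFILL_TOKENS limits the number of tokens for the prefill operation. Prefill takes the most memory and is compute bound, so it is useful to limit the number of requests that can be sent.
- HUGGING_FACE_HUB_TOKEN is the Hugging Face Hub token, required because google/gemma-7b-it is a gated model.
(Optional) serving_container_ports is the port where the Vertex AI endpoint will be exposed, 8080 by default.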
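The token limits above depend on each other, so it can help to check them before registering the model. The sketch below builds the environment-variable dictionary with string values, as the upload call expects. The helper build_tgi_env is hypothetical, and the two checks encode my understanding of what the TGI launcher enforces (input limit below total limit, prefill budget at least the input limit); treat them as assumptions and confirm against the TGI documentation.

```python
def build_tgi_env(model_id, max_input_tokens, max_total_tokens,
                  max_batch_prefill_tokens, num_shard=1, hub_token=None):
    """Return TGI environment variables as a dict of strings.

    Raises ValueError when the token limits look inconsistent (assumed rules).
    """
    if max_input_tokens >= max_total_tokens:
        raise ValueError("MAX_INPUT_TOKENS must be smaller than MAX_TOTAL_TOKENS")
    if max_batch_prefill_tokens < max_input_tokens:
        raise ValueError("MAX_BATCH_PREFILL_TOKENS must be at least MAX_INPUT_TOKENS")

    env = {
        "MODEL_ID": model_id,
        "NUM_SHARD": str(num_shard),
        "MAX_INPUT_TOKENS": str(max_input_tokens),
        "MAX_TOTAL_TOKENS": str(max_total_tokens),
        "MAX_BATCH_PREFILL_TOKENS": str(max_batch_prefill_tokens),
    }
    # Only gated models need the token, so it is added when provided
    if hub_token:
        env["HUGGING_FACE_HUB_TOKEN"] = hub_token
    return env


# Same values as the ones used in this tutorial
env = build_tgi_env("google/gemma-7b-it", 512, 1024, 1512)
print(env)
```

With these values, each request can use up to 512 prompt tokens and generate up to 1024 - 512 = 512 new tokens.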

For more information on the supported aiplatform.Model.upload arguments, see its Python reference at https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.Model#google_cloud_aiplatform_Model_upload.

Starting with the TGI 2.3 DLC, i.e. us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu124.2-3.ubuntu2204.py311 and later, you can set the environment variable MESSAGES_API_ENABLED="true" to deploy the Messages API on Vertex AI; otherwise the Generate API is deployed.
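As a small illustration of that switch, the sketch below toggles the variable on a copy of an environment dictionary. The helper select_tgi_api is hypothetical; only the variable name and its "true" value come from the text above.

```python
def select_tgi_api(env, messages_api):
    """Return a copy of `env` configured for the Messages or the Generate API."""
    updated = dict(env)  # leave the caller's dictionary untouched
    if messages_api:
        updated["MESSAGES_API_ENABLED"] = "true"
    else:
        # Without the variable, the container serves the Generate API
        updated.pop("MESSAGES_API_ENABLED", None)
    return updated


base_env = {"MODEL_ID": "google/gemma-7b-it", "NUM_SHARD": "1"}
print(select_tgi_api(base_env, messages_api=True))
```

This tutorial keeps the Generate API, so the upload call that follows does not set the variable.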

from huggingface_hub import get_token

model = aiplatform.Model.upload(
    display_name="google--gemma-7b-it",
    serving_container_image_uri=os.getenv("CONTAINER_URI"),
    serving_container_environment_variables={
        "MODEL_ID": "google/gemma-7b-it",
        "NUM_SHARD": "1",
        "MAX_INPUT_TOKENS": "512",
        "MAX_TOTAL_TOKENS": "1024",
        "MAX_BATCH_PREFILL_TOKENS": "1512",
        "HUGGING_FACE_HUB_TOKEN": get_token(),
    },
    serving_container_ports=[8080],
)
model.wait()

[Image: Model on Vertex AI Model Registry]

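Deploy model on Vertex AI

Once the model is registered on Vertex AI, you need to define the endpoint you want to deploy the model to, and then link the model deployment to that endpoint resource.

To do so, call the aiplatform.Endpoint.create method to create a new Vertex AI endpoint resource (which is not yet linked to a model or anything usable).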

endpoint = aiplatform.Endpoint.create(display_name="google--gemma-7b-it-endpoint")

[Image: Vertex AI Endpoint created]

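Now you can deploy the registered model to an endpoint on Vertex AI.

The deploy method links the previously created endpoint resource with the model, which contains the configuration of the serving container, and then deploys the model on Vertex AI on the specified instance.

Before going into the code, let's quickly review the arguments provided to the deploy method:

endpoint is the endpoint to deploy the model to. It is optional, and by default it is set to the model display name with the _endpoint suffix.
machine_type, accelerator_type and accelerator_count define which instance to use, which accelerator to use, and how many accelerators to run on, respectively. machine_type and accelerator_type are tied together, so you need to select an instance that supports the accelerator you are using, and vice versa. For more information about the different instances, see Compute Engine Documentation - GPU machine types; for more information about accelerator_type naming, see Vertex AI Documentation - MachineSpec.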
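To illustrate how machine_type and accelerator_type are tied together, the sketch below checks a pair against a small lookup table. The table is partial and written from memory of the GPU machine families, so treat every entry as an assumption and rely on the Compute Engine and MachineSpec documentation for the authoritative list. The helper check_machine_spec is hypothetical.

```python
# Partial, illustrative table: machine family -> accepted accelerator_type values.
# Assumed from the GPU machine type docs; verify before relying on it.
KNOWN_FAMILIES = {
    "g2-standard": {"NVIDIA_L4"},
    "a2-highgpu": {"NVIDIA_TESLA_A100"},
    "a2-ultragpu": {"NVIDIA_A100_80GB"},
    "a3-highgpu": {"NVIDIA_H100_80GB"},
}


def check_machine_spec(machine_type, accelerator_type):
    """Return True if the pair matches the table, False if it conflicts.

    Raises ValueError when the machine family is not in the table at all,
    since that says nothing about whether the pair is actually valid.
    """
    family = machine_type.rsplit("-", 1)[0]  # "g2-standard-4" -> "g2-standard"
    if family not in KNOWN_FAMILIES:
        raise ValueError(f"{family!r} is not covered by this illustrative table")
    return accelerator_type in KNOWN_FAMILIES[family]


print(check_machine_spec("g2-standard-4", "NVIDIA_L4"))
```

The pair used in this tutorial, g2-standard-4 with one NVIDIA_L4, is the smallest G2 configuration.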

For more information on the supported aiplatform.Model.deploy arguments, see its Python reference at https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.Model#google_cloud_aiplatform_Model_deploy.

deployed_model = model.deploy(
    endpoint=endpoint,
    machine_type="g2-standard-4",
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
)

Warning: deploying the model to a Vertex AI endpoint via the deploy method can take from 15 to 25 minutes.

[Image: Vertex AI Endpoint running the model]

[Image: Vertex AI Endpoint logs in Cloud Logging]

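Online predictions on Vertex AI

Finally, you can run online predictions on Vertex AI using the predict method, which sends requests to the running endpoint on the /predict route specified within the container, following the Vertex AI I/O payload formatting.

Since you are serving a text-generation model, you need to make sure that the chat template, if any, is applied correctly to the input conversation. This means installing transformers to instantiate the tokenizer for google/gemma-7b-it, and running the apply_chat_template method over the input conversation before sending the input within the payload to the Vertex AI endpoint.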

!pip install --upgrade --quiet transformers

Once it is installed, the following snippet applies the chat template to the conversation:

from huggingface_hub import get_token
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it", token=get_token())

messages = [
    {"role": "user", "content": "What's Deep Learning?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
# <bos><start_of_turn>user\nWhat's Deep Learning?<end_of_turn>\n<start_of_turn>model\n

This is what you will send within the payload to the deployed Vertex AI endpoint, along with the generation parameters described at https://huggingface.co/docs/huggingface_hub/main/en/package_reference/inference_client#huggingface_hub.InferenceClient.text_generation.
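To make the payload shape concrete without needing transformers or a running endpoint, here is a minimal sketch. format_gemma_prompt is a hypothetical, hand-rolled stand-in that only reproduces the template output shown above for plain user and assistant turns; the real tokenizer remains the authoritative source. build_instance is likewise a hypothetical helper.

```python
import json


def format_gemma_prompt(messages):
    """Approximate apply_chat_template(..., add_generation_prompt=True) for Gemma."""
    prompt = "<bos>"
    for message in messages:
        # Gemma's template names the assistant role "model"
        role = "model" if message["role"] == "assistant" else message["role"]
        prompt += f"<start_of_turn>{role}\n{message['content'].strip()}<end_of_turn>\n"
    # Trailing generation prompt so the model answers as "model"
    return prompt + "<start_of_turn>model\n"


def build_instance(prompt, **parameters):
    """Wrap a formatted prompt and its generation parameters into one instance."""
    return {"inputs": prompt, "parameters": parameters}


messages = [{"role": "user", "content": "What's Deep Learning?"}]
instance = build_instance(
    format_gemma_prompt(messages),
    max_new_tokens=256,
    do_sample=True,
    top_p=0.95,
    temperature=1.0,
)
print(json.dumps({"instances": [instance]}, indent=4))
```

The printed JSON has the same structure as the instances passed to predict in the sections that follow.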

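Via Python

Within the same session

To run the online prediction within the current session, you can send requests programmatically via the aiplatform.Endpoint (returned by the aiplatform.Model.deploy method), as in the following snippet: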

output = deployed_model.predict(
    instances=[
        {
            "inputs": "<bos><start_of_turn>user\nWhat's Deep Learning?<end_of_turn>\n<start_of_turn>model\n",
            "parameters": {
                "max_new_tokens": 256,
                "do_sample": True,
                "top_p": 0.95,
                "temperature": 1.0,
            },
        },
    ]
)
print(output.predictions[0])

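print(output.predictions[0]) shows only the generated text. The full Prediction object returned by predict looks like the following: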

Prediction(predictions=['\n\nDeep learning is a type of machine learning that uses artificial neural networks to learn from large amounts of data, making it a powerful tool for various tasks, including image recognition, natural language processing, and speech recognition.\n\n**Key Concepts:**\n\n* **Artificial Neural Networks (ANNs):** Structures that mimic the interconnected neurons in the brain.\n* **Deep Learning Architectures:** Multi-layered ANNs that learn hierarchical features from data.\n* **Transfer Learning:** Reusing learned features from one task to improve performance on another.\n\n**Types of Deep Learning:**\n\n* **Supervised Learning:** Models are trained on labeled data, where inputs are paired with corresponding outputs.\n* **Unsupervised Learning:** Models learn patterns from unlabeled data, such as clustering or dimensionality reduction.\n* **Reinforcement Learning:** Models learn through trial-and-error by interacting with an environment to optimize a task.\n\n**Benefits:**\n\n* **High Accuracy:** Deep learning models can achieve high accuracy on complex tasks.\n* **Adaptability:** Deep learning models can adapt to new data and tasks.\n* **Scalability:** Deep learning models can handle large amounts of data.\n\n**Applications:**\n\n* Image recognition\n* Natural language processing (NLP)\n'], deployed_model_id='***', metadata=None, model_version_id='1', model_resource_name='projects/***/locations/us-central1/models/***', explanations=None)

[Image: Vertex AI Endpoint logs in Cloud Logging after predict]

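From a different session

If the Vertex AI endpoint was deployed in a different session and you want to use it but don't have access to the deployed_model variable returned by the aiplatform.Model.deploy method in the previous section, you can run the following snippet to instantiate the deployed aiplatform.Endpoint via its resource name, which has the form projects/{PROJECT_ID}/locations/{LOCATION}/endpoints/{ENDPOINT_ID}.

You need to either retrieve the resource name (i.e. the projects/{PROJECT_ID}/locations/{LOCATION}/endpoints/{ENDPOINT_ID} path) yourself via the Google Cloud Console, or just replace the ENDPOINT_ID below. The ID can be found either via the previously instantiated endpoint as endpoint.name, or via the Google Cloud Console under Online prediction, where the endpoints are listed.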

import os
from google.cloud import aiplatform

aiplatform.init(project=os.getenv("PROJECT_ID"), location=os.getenv("LOCATION"))

endpoint_display_name = "google--gemma-7b-it-endpoint"  # TODO: change to your endpoint display name

# Iterates over all the Vertex AI Endpoints within the current project and keeps the first match (if any), otherwise set to None
ENDPOINT_ID = next(
    (endpoint.name for endpoint in aiplatform.Endpoint.list() if endpoint.display_name == endpoint_display_name), None
)
assert ENDPOINT_ID, (
    "`ENDPOINT_ID` is not set, please make sure that the `endpoint_display_name` is correct at "
    f"https://console.cloud.google.com/vertex-ai/online-prediction/endpoints?project={os.getenv('PROJECT_ID')}"
)

endpoint = aiplatform.Endpoint(
    f"projects/{os.getenv('PROJECT_ID')}/locations/{os.getenv('LOCATION')}/endpoints/{ENDPOINT_ID}"
)
output = endpoint.predict(
    instances=[
        {
            "inputs": "<bos><start_of_turn>user\nWhat's Deep Learning?<end_of_turn>\n<start_of_turn>model\n",
            "parameters": {
                "max_new_tokens": 128,
                "do_sample": True,
                "top_p": 0.95,
                "temperature": 0.7,
            },
        },
    ],
)
print(output.predictions[0])
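The resource name used above follows a fixed pattern, so it can be built and checked with plain string handling before it is passed to aiplatform.Endpoint. The following is a minimal sketch; both helper names are hypothetical and the example values are placeholders.

```python
import re

# projects/{PROJECT_ID}/locations/{LOCATION}/endpoints/{ENDPOINT_ID}
_ENDPOINT_RE = re.compile(
    r"^projects/(?P<project>[^/]+)/locations/(?P<location>[^/]+)/endpoints/(?P<endpoint_id>[^/]+)$"
)


def endpoint_resource_name(project, location, endpoint_id):
    """Build the full endpoint resource name from its three parts."""
    return f"projects/{project}/locations/{location}/endpoints/{endpoint_id}"


def parse_endpoint_resource_name(resource_name):
    """Split a resource name into its parts, or raise ValueError if malformed."""
    match = _ENDPOINT_RE.match(resource_name)
    if match is None:
        raise ValueError(f"Not an endpoint resource name: {resource_name!r}")
    return match.groupdict()


name = endpoint_resource_name("your-project-id", "us-central1", "1234567890")
print(name)
print(parse_endpoint_resource_name(name))
```

Parsing first gives a readable error for a typo in the path instead of a failed API call.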

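Via the Vertex AI Online Prediction UI

Alternatively, for testing purposes you can also use the Vertex AI Online Prediction UI, which provides a field that expects a JSON payload formatted according to the Vertex AI specification (as in the examples above), that is: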

{
    "instances": [
        {
            "inputs": "<bos><start_of_turn>user\nWhat's Deep Learning?<end_of_turn>\n<start_of_turn>model\n",
            "parameters": {
                "max_new_tokens": 128,
                "do_sample": true,
                "top_p": 0.95,
                "temperature": 0.7
            }
        }
    ]
}
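A common slip when copying a payload from the Python snippets into the UI is leaving Python literals such as True in place of JSON's true. The sketch below checks a payload string locally before pasting it. The helper validate_payload is hypothetical, and the structural checks only cover the fields used in this tutorial.

```python
import json


def validate_payload(text):
    """Return the parsed payload, or raise ValueError describing the problem."""
    try:
        payload = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc

    instances = payload.get("instances") if isinstance(payload, dict) else None
    if not isinstance(instances, list) or not instances:
        raise ValueError('expected a non-empty "instances" list')
    for i, instance in enumerate(instances):
        if not isinstance(instance, dict) or not isinstance(instance.get("inputs"), str):
            raise ValueError(f'instance {i} needs a string "inputs" field')
    return payload


text = '{"instances": [{"inputs": "Hello", "parameters": {"do_sample": true}}]}'
print(validate_payload(text)["instances"][0]["parameters"])
```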

[Image: Vertex AI Endpoint online inference]

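Resource clean-up

Finally, you can release the resources you created as follows, to avoid unnecessary costs:

deployed_model.undeploy_all undeploys every model from the endpoint.
deployed_model.delete deletes the endpoint the model was deployed to, gracefully, after the undeploy_all call.
model.delete deletes the model from the registry.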

deployed_model.undeploy_all()
deployed_model.delete()
model.delete()

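Alternatively, you can also remove them from the Google Cloud Console with the following steps:

Go to Vertex AI in Google Cloud
Go to Deploy and use -> Online prediction
Click on the endpoint and then on the deployed model(s) to "Undeploy model from endpoint"
Then go back to the endpoint list and remove the endpoint
Finally, go to Deploy and use -> Model Registry, and remove the model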

📍 Find the complete example on GitHub here!
