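Deploy Gemma 7B with TGI DLC from GCS on Vertex AI

Gemma is a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models, developed by Google DeepMind and other teams across Google. Text Generation Inference (TGI) is a toolkit developed by Hugging Face for deploying and serving Large Language Models (LLMs) with high-performance text generation. Google Vertex AI is a Machine Learning (ML) platform that lets you train and deploy ML models and AI applications, and customize LLMs for use in your AI-powered applications.

This example showcases how to deploy any supported text-generation model on Vertex AI, in this case google/gemma-7b-it. The model is downloaded from the Hugging Face Hub, uploaded to a Google Cloud Storage (GCS) bucket, and served with the Hugging Face DLC for TGI available in Google Cloud Platform (GCP).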

'google/gemma-7b-it' in the Hugging Face Hub

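Setup / Configuration

First, you need to install gcloud, the Google Cloud command-line tool, on your local machine by following the instructions in Cloud SDK Documentation - Install the gcloud CLI.

Then, you also need to install the google-cloud-aiplatform Python SDK, which is used to programmatically register the model, create the endpoint, and deploy the model on Vertex AI.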

!pip install --upgrade --quiet google-cloud-aiplatform

Optionally, to make the commands in this tutorial easier to use, set the following environment variables for GCP:

%env PROJECT_ID=your-project-id
%env LOCATION=your-location
%env BUCKET_URI=gs://hf-tgi-vertex-ai
%env ARTIFACT_NAME=google--gemma-7b-it
%env CONTAINER_URI=us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu124.2-3.ubuntu2204.py311
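Later steps read these values with os.getenv, so a missing variable only shows up as a harder-to-read error further down. A minimal sketch of an up-front check follows; the helper name missing_env_vars is hypothetical and not part of any SDK.

```python
import os

# Variable names used throughout this tutorial.
REQUIRED_VARS = ("PROJECT_ID", "LOCATION", "BUCKET_URI", "ARTIFACT_NAME", "CONTAINER_URI")


def missing_env_vars(env, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]


# Check the real process environment and report what is missing, if anything.
missing = missing_env_vars(os.environ)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Passing the environment in as an argument keeps the check easy to exercise with a plain dictionary.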

Then you need to log in to your GCP account and set the project ID to the one you want to use to register and deploy the models on Vertex AI.

!gcloud auth login
!gcloud auth application-default login  # For local development
!gcloud config set project $PROJECT_ID

Once you are logged in, you need to enable the necessary service APIs in GCP, such as the Vertex AI API, the Compute Engine API, and the Google Container Registry related APIs.

!gcloud services enable aiplatform.googleapis.com
!gcloud services enable compute.googleapis.com
!gcloud services enable container.googleapis.com
!gcloud services enable containerregistry.googleapis.com
!gcloud services enable containerfilesystem.googleapis.com

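Optional: Create a bucket in GCS and upload the model from the Hub

Unless you already have a GCS bucket with the artifact that you want to serve, follow the instructions below to create a new bucket and to download and upload the model weights to it.

To create the bucket on Google Cloud Storage (GCS), you first need to check whether a bucket with the same name already exists, because bucket names must be unique. To do so, both the gsutil SDK and the crcmod Python package need to be installed in advance, as follows: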

!gcloud components install gsutil
!pip install --upgrade --quiet crcmod

Then you can use the following bash script to check whether the bucket exists in GCS and create it if it doesn't:

%%bash

# Parse the bucket from the provided $BUCKET_URI path i.e. given gs://bucket-name/dir, extract bucket-name
BUCKET_NAME=$(echo $BUCKET_URI | cut -d'/' -f3)
# Check if the bucket exists, if not create it
if [ -z "$(gsutil ls | grep gs://$BUCKET_NAME)" ]; then
    gcloud storage buckets create gs://$BUCKET_NAME --project=$PROJECT_ID --location=$LOCATION --default-storage-class=STANDARD --uniform-bucket-level-access
fi
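The bucket-name extraction done with cut in the script above can also be sketched in Python, which is convenient when the rest of the workflow already runs in a notebook. The helper name bucket_name_from_uri is hypothetical; it relies only on the standard library's urlparse.

```python
from urllib.parse import urlparse


def bucket_name_from_uri(bucket_uri):
    """Extract `bucket-name` from a URI such as gs://bucket-name/dir."""
    parsed = urlparse(bucket_uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"Expected a gs://bucket-name[/path] URI, got {bucket_uri!r}")
    return parsed.netloc


# Both forms resolve to the same bucket name.
print(bucket_name_from_uri("gs://hf-tgi-vertex-ai"))
print(bucket_name_from_uri("gs://hf-tgi-vertex-ai/google--gemma-7b-it"))
```

Unlike the cut version, this rejects values that are missing the gs:// scheme instead of silently returning an empty string.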

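Once the bucket exists, whether newly created or already present, you can upload google/gemma-7b-it either from the Hugging Face Hub or from local storage.

Artifact from disk / local storage

If the model is available locally, e.g. under the Hugging Face cache path ~/.cache/huggingface/hub/models--google--gemma-7b-it/snapshots/8adab6a35fdbcdae0ae41ab1f711b1bc8d05727e, run the following script to upload it to the GCS bucket.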

%%bash
# Upload the model to Google Cloud Storage
LOCAL_DIR=~/.cache/huggingface/hub/models--google--gemma-7b-it/snapshots/8adab6a35fdbcdae0ae41ab1f711b1bc8d05727e
if [ -d "$LOCAL_DIR" ]; then
    gsutil -o GSUtil:parallel_composite_upload_threshold=150M -m cp -r $LOCAL_DIR/* $BUCKET_URI/$ARTIFACT_NAME
fi

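Artifact from the Hugging Face Hub

Alternatively, you can upload the model to the GCS bucket from the Hugging Face Hub. As google/gemma-7b-it is a gated model, you need to log in to your Hugging Face Hub account with either a fine-grained read access token that has access to the gated model, or a token with overall read access to your account.

For more information on how to generate a read-only access token for the Hugging Face Hub, see the instructions at https://huggingface.co/docs/hub/en/security-tokens.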

!pip install "huggingface_hub[hf_transfer]" --upgrade --quiet

from huggingface_hub import interpreter_login

interpreter_login()

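Once huggingface_hub is installed and you are logged in, you can run the following bash script to download the model into a local temporary directory and then upload it to the GCS bucket.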

%%bash
# Ensure the necessary environment variables are set
export HF_HUB_ENABLE_HF_TRANSFER=1

# Create a local directory to store the downloaded models
LOCAL_DIR="tmp/google--gemma-7b-it"
mkdir -p $LOCAL_DIR

# Download models from Hugging Face, excluding certain file types
huggingface-cli download google/gemma-7b-it --exclude "*.bin" "*.pth" "*.gguf" ".gitattributes" --local-dir $LOCAL_DIR

# Upload the downloaded models to Google Cloud Storage
gsutil -o GSUtil:parallel_composite_upload_threshold=150M -m cp -e -r $LOCAL_DIR/* $BUCKET_URI/$ARTIFACT_NAME

# Remove the local temporary directory and everything in it
rm -rf tmp/
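To see which files the --exclude patterns in the script above would skip, the filtering can be sketched with Python's fnmatch. This assumes the patterns follow shell-style glob matching; the file listing is hypothetical and only illustrates the effect.

```python
from fnmatch import fnmatch

# Same patterns as passed to --exclude in the script above.
EXCLUDE_PATTERNS = ["*.bin", "*.pth", "*.gguf", ".gitattributes"]


def keep_file(filename, exclude_patterns=EXCLUDE_PATTERNS):
    """Return True when `filename` matches none of the exclude patterns."""
    return not any(fnmatch(filename, pattern) for pattern in exclude_patterns)


# Hypothetical repository listing, for illustration only.
candidate_files = [
    "config.json",
    "model-00001-of-00004.safetensors",
    "pytorch_model.bin",
    "gemma-7b-it.gguf",
    ".gitattributes",
    "tokenizer.json",
]
kept_files = [name for name in candidate_files if keep_file(name)]
print(kept_files)
```

Skipping the .bin, .pth and .gguf variants avoids uploading duplicate copies of the weights in formats other than safetensors.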

To see the end-to-end script, check ./scripts/upload_model_to_gcs.sh at the root of this repository.

GCS Bucket with model artifact

Register model on Vertex AI

Once everything is set up, you can initialize the Vertex AI session with the google-cloud-aiplatform Python SDK as follows:

import os
from google.cloud import aiplatform

aiplatform.init(
    project=os.getenv("PROJECT_ID"),
    location=os.getenv("LOCATION"),
    staging_bucket=os.getenv("BUCKET_URI"),
)

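Then you can "upload" the model, i.e. register it on Vertex AI. It is not an upload per se, since the model is automatically downloaded from the GCS bucket URI on startup, so what is uploaded is only the configuration, not the model weights.

Before going into the code, let's quickly review the arguments provided to the upload method:

display_name is the name that will be shown in the Vertex AI Model Registry.
artifact_uri is the path to the directory with the model artifacts within the GCS bucket.
serving_container_image_uri is the location of the Hugging Face DLC for TGI that will be used for serving the model.
serving_container_environment_variables are the environment variables used during the container runtime, so they are aligned with the environment variables defined by text-generation-inference, which are analogous to the text-generation-launcher arguments. Additionally, the Hugging Face DLCs for TGI also capture the AIP_ environment variables from Vertex AI, as described in Vertex AI Documentation - Custom container requirements for prediction.
- NUM_SHARD is the number of shards to use. Set it if you don't want to use all the GPUs on a given machine (e.g. if you have two GPUs but only want to use one for TGI, set NUM_SHARD=1); otherwise it matches CUDA_VISIBLE_DEVICES.
- MAX_INPUT_TOKENS is the maximum allowed input length, expressed in number of tokens. The larger it is, the longer the prompt can be, but the more memory it consumes.
- MAX_TOTAL_TOKENS is the most important value to set, as it defines the "memory budget" of running client requests. The larger this value, the more RAM each request takes and the less effective batching becomes.
- MAX_BATCH_PREFILL_TOKENS limits the number of tokens for the prefill operation. Since prefill takes the most memory and is compute bound, it is worth limiting the number of requests that can be sent.
- HUGGING_FACE_HUB_TOKEN is the Hugging Face Hub token, required when TGI has to download google/gemma-7b-it from the Hub, as it is a gated model. The upload call below doesn't set it, since the weights are read from the GCS bucket given in artifact_uri.
(optional) serving_container_ports is the port that the Vertex AI endpoint will expose, by default 8080.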
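The token-related settings depend on each other: the prompt plus the generated tokens must fit in MAX_TOTAL_TOKENS, so MAX_INPUT_TOKENS has to stay below it, and the prefill budget should be able to hold at least one full prompt. The sketch below encodes those relationships in a hypothetical helper, build_tgi_env. The checks are assumptions about what the TGI launcher expects and should be confirmed against the TGI documentation.

```python
def build_tgi_env(num_shard, max_input_tokens, max_total_tokens, max_batch_prefill_tokens):
    """Validate the token budget and return string-valued environment variables."""
    if num_shard < 1:
        raise ValueError("NUM_SHARD must be at least 1")
    # Assumed constraint: room must remain for at least one generated token.
    if max_input_tokens >= max_total_tokens:
        raise ValueError("MAX_INPUT_TOKENS must be smaller than MAX_TOTAL_TOKENS")
    # Assumed constraint: the prefill budget must fit one maximum-length prompt.
    if max_batch_prefill_tokens < max_input_tokens:
        raise ValueError("MAX_BATCH_PREFILL_TOKENS must be at least MAX_INPUT_TOKENS")
    # Environment variable values are passed as strings.
    return {
        "NUM_SHARD": str(num_shard),
        "MAX_INPUT_TOKENS": str(max_input_tokens),
        "MAX_TOTAL_TOKENS": str(max_total_tokens),
        "MAX_BATCH_PREFILL_TOKENS": str(max_batch_prefill_tokens),
    }


# Same values as used in this tutorial.
tgi_env = build_tgi_env(1, 512, 1024, 1512)
print(tgi_env)
```

The returned dictionary has the shape expected by serving_container_environment_variables, with every value as a string.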

For more information on the supported aiplatform.Model.upload arguments, check its Python reference at https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.Model#google_cloud_aiplatform_Model_upload.

Starting from the TGI 2.3 DLC, i.e. us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu124.2-3.ubuntu2204.py311, and onwards, you can set the environment variable MESSAGES_API_ENABLED="true" to deploy the Messages API on Vertex AI; otherwise, the Generate API is deployed.

model = aiplatform.Model.upload(
    display_name="google--gemma-7b-it",
    artifact_uri=f"{os.getenv('BUCKET_URI')}/{os.getenv('ARTIFACT_NAME')}",
    serving_container_image_uri=os.getenv("CONTAINER_URI"),
    serving_container_environment_variables={
        "NUM_SHARD": "1",
        "MAX_INPUT_TOKENS": "512",
        "MAX_TOTAL_TOKENS": "1024",
        "MAX_BATCH_PREFILL_TOKENS": "1512",
    },
    serving_container_ports=[8080],
)
model.wait()

Model on Vertex AI Model Registry

Model on Vertex AI Model Registry with path to GCS

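Deploy model on Vertex AI

After the model is registered on Vertex AI, you need to define the endpoint that you want to deploy the model to, and then link the model deployment to that endpoint resource.

To do so, call the aiplatform.Endpoint.create method to create a new Vertex AI endpoint resource (which is not yet linked to a model or anything usable).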

endpoint = aiplatform.Endpoint.create(display_name="google--gemma-7b-it-endpoint")

Vertex AI Endpoint created

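Now you can deploy the registered model to the endpoint on Vertex AI.

The deploy method links the previously created endpoint resource with the model, which contains the configuration of the serving container, and then deploys the model on Vertex AI on the specified instance.

Before going into the code, let's quickly review the arguments provided to the deploy method:

endpoint is the endpoint to deploy the model to. It is optional, and by default it is set to the model display name followed by the _endpoint suffix.
machine_type, accelerator_type and accelerator_count define which instance to use, and additionally which accelerator and how many accelerators to use, respectively. machine_type and accelerator_type are tied together, so you need to select an instance that supports the accelerator you are using, and vice versa. For more information about the different instances, see Compute Engine Documentation - GPU machine types; for more information about accelerator_type naming, see Vertex AI Documentation - MachineSpec.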
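Because machine_type and accelerator_type are tied together, it can help to check a combination before starting a deployment that takes many minutes. The sketch below covers only the G2 family used in this example. The GPU counts in the table are assumed from the Compute Engine GPU machine types documentation and should be verified there; is_valid_g2_spec is a hypothetical helper.

```python
# Number of NVIDIA L4 GPUs attached to each G2 machine type.
# Assumed from the Compute Engine GPU machine types documentation; verify before relying on it.
G2_L4_COUNTS = {
    "g2-standard-4": 1,
    "g2-standard-8": 1,
    "g2-standard-12": 1,
    "g2-standard-16": 1,
    "g2-standard-24": 2,
    "g2-standard-32": 1,
    "g2-standard-48": 4,
    "g2-standard-96": 8,
}


def is_valid_g2_spec(machine_type, accelerator_type, accelerator_count):
    """Check a (machine_type, accelerator_type, accelerator_count) triple against the table."""
    expected_count = G2_L4_COUNTS.get(machine_type)
    if expected_count is None:
        raise KeyError(f"Machine type not covered by this sketch: {machine_type}")
    return accelerator_type == "NVIDIA_L4" and accelerator_count == expected_count


# The combination used in this tutorial.
print(is_valid_g2_spec("g2-standard-4", "NVIDIA_L4", 1))
```

Other machine families pair with other accelerators, so the table would need to be extended before using this check outside the G2 family.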

For more information on the supported aiplatform.Model.deploy arguments, you can check its Python reference at https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.Model#google_cloud_aiplatform_Model_deploy.

deployed_model = model.deploy(
    endpoint=endpoint,
    machine_type="g2-standard-4",
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
)

WARNING: The Vertex AI endpoint deployment via the deploy method may take from 15 to 25 minutes.

Vertex AI Endpoint running the model

Vertex AI Endpoint logs in Cloud Logging

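Online predictions on Vertex AI

Finally, you can run online predictions on Vertex AI with the predict method, which sends the requests to the running endpoint on the /predict route specified within the container, following the Vertex AI I/O payload formatting.

As you are serving a text-generation model, you need to make sure that the chat template, if any, is applied correctly to the input conversation. This means that transformers needs to be installed in order to instantiate the tokenizer for google/gemma-7b-it and run its apply_chat_template method over the input conversation before sending it in the payload to the Vertex AI endpoint.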

!pip install --upgrade --quiet transformers

After the installation is complete, the following snippet applies the chat template to the conversation:

from huggingface_hub import get_token
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it", token=get_token())

messages = [
    {"role": "user", "content": "What's Deep Learning?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
# <bos><start_of_turn>user\nWhat's Deep Learning?<end_of_turn>\n<start_of_turn>model\n

This is exactly what you will send within the payload to the deployed Vertex AI endpoint, along with the generation parameters, as described in https://huggingface.co/docs/huggingface_hub/main/en/package_reference/inference_client#huggingface_hub.InferenceClient.text_generation.
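To make the structure of the formatted prompt and of a single payload instance explicit, the sketch below rebuilds the string shown above by hand. It is only an approximation of the Gemma chat template, written for illustration; the tokenizer's apply_chat_template remains the reliable way to produce the prompt. The helpers format_gemma_prompt and build_instance are hypothetical.

```python
def format_gemma_prompt(messages):
    """Approximate the Gemma chat template for a list of {"role", "content"} dicts."""
    prompt = "<bos>"
    for message in messages:
        # The Gemma template labels assistant turns as "model".
        role = "model" if message["role"] == "assistant" else message["role"]
        prompt += f"<start_of_turn>{role}\n{message['content'].strip()}<end_of_turn>\n"
    # Counterpart of add_generation_prompt=True: open the model's turn.
    return prompt + "<start_of_turn>model\n"


def build_instance(prompt, **parameters):
    """Pair the formatted prompt with the generation parameters."""
    return {"inputs": prompt, "parameters": parameters}


prompt = format_gemma_prompt([{"role": "user", "content": "What's Deep Learning?"}])
instance = build_instance(prompt, max_new_tokens=256, do_sample=True, top_p=0.95, temperature=1.0)
print(instance)
```

Each dictionary built this way is one entry of the instances list passed to the predict method.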

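Via Python

Within the same session

To run the online prediction within the current session, you can send requests programmatically via the aiplatform.Endpoint returned by the aiplatform.Model.deploy method, as in the following snippet: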

output = deployed_model.predict(
    instances=[
        {
            "inputs": "<bos><start_of_turn>user\nWhat's Deep Learning?<end_of_turn>\n<start_of_turn>model\n",
            "parameters": {
                "max_new_tokens": 256,
                "do_sample": True,
                "top_p": 0.95,
                "temperature": 1.0,
            },
        },
    ]
)
print(output.predictions[0])

The returned output object looks like the following (the print call above shows only its first prediction):

Prediction(predictions=['\n\nDeep learning is a type of machine learning that uses artificial neural networks to learn from large amounts of data, making it a powerful tool for various tasks, including image recognition, natural language processing, and speech recognition.\n\n**Key Concepts:**\n\n* **Artificial Neural Networks (ANNs):** Structures that mimic the interconnected neurons in the brain.\n* **Deep Learning Architectures:** Multi-layered ANNs that learn hierarchical features from data.\n* **Transfer Learning:** Reusing learned features from one task to improve performance on another.\n\n**Types of Deep Learning:**\n\n* **Supervised Learning:** Models are trained on labeled data, where inputs are paired with corresponding outputs.\n* **Unsupervised Learning:** Models learn patterns from unlabeled data, such as clustering or dimensionality reduction.\n* **Reinforcement Learning:** Models learn through trial-and-error by interacting with an environment to optimize a task.\n\n**Benefits:**\n\n* **High Accuracy:** Deep learning models can achieve high accuracy on complex tasks.\n* **Adaptability:** Deep learning models can adapt to new data and tasks.\n* **Scalability:** Deep learning models can handle large amounts of data.\n\n**Applications:**\n\n* Image recognition\n* Natural language processing (NLP)\n'], deployed_model_id='***', metadata=None, model_version_id='1', model_resource_name='projects/***/locations/us-central1/models/***', explanations=None)

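From a different session

If the Vertex AI endpoint was deployed in a different session and you want to use it but don't have access to the deployed_model variable returned by the aiplatform.Model.deploy method in the previous section, you can also run the following snippet to instantiate the deployed aiplatform.Endpoint via its resource name, e.g. projects/{PROJECT_ID}/locations/{LOCATION}/endpoints/{ENDPOINT_ID}.

You need to either retrieve the resource name, i.e. the projects/{PROJECT_ID}/locations/{LOCATION}/endpoints/{ENDPOINT_ID} URL, yourself via the Google Cloud Console, or just replace the ENDPOINT_ID below, which can be found either via the previously instantiated endpoint as endpoint.name or via the Google Cloud Console under Online prediction, where the endpoints are listed.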
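Since the resource name follows a fixed pattern, composing and validating it can be sketched with two small helpers. The function names and the endpoint ID in the usage lines are hypothetical placeholders.

```python
import re

# Pattern of a Vertex AI endpoint resource name, as described above.
ENDPOINT_NAME_RE = re.compile(
    r"^projects/(?P<project_id>[^/]+)/locations/(?P<location>[^/]+)/endpoints/(?P<endpoint_id>[^/]+)$"
)


def endpoint_resource_name(project_id, location, endpoint_id):
    """Compose the full resource name from its three parts."""
    return f"projects/{project_id}/locations/{location}/endpoints/{endpoint_id}"


def parse_endpoint_resource_name(resource_name):
    """Split a full resource name back into its parts, or raise ValueError."""
    match = ENDPOINT_NAME_RE.match(resource_name)
    if match is None:
        raise ValueError(f"Not an endpoint resource name: {resource_name!r}")
    return match.groupdict()


# Placeholder values, for illustration only.
name = endpoint_resource_name("your-project-id", "us-central1", "1234567890")
print(name)
print(parse_endpoint_resource_name(name))
```

Parsing the name before instantiating the endpoint catches a value copied with a missing or extra path segment early.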

import os
from google.cloud import aiplatform

aiplatform.init(project=os.getenv("PROJECT_ID"), location=os.getenv("LOCATION"))

endpoint_display_name = "google--gemma-7b-it-endpoint"  # TODO: change to your endpoint display name

# Iterates over all the Vertex AI Endpoints within the current project and keeps the first match (if any), otherwise set to None
ENDPOINT_ID = next(
    (endpoint.name for endpoint in aiplatform.Endpoint.list() if endpoint.display_name == endpoint_display_name), None
)
assert ENDPOINT_ID, (
    "`ENDPOINT_ID` is not set, please make sure that the `endpoint_display_name` is correct at "
    f"https://console.cloud.google.com/vertex-ai/online-prediction/endpoints?project={os.getenv('PROJECT_ID')}"
)

endpoint = aiplatform.Endpoint(
    f"projects/{os.getenv('PROJECT_ID')}/locations/{os.getenv('LOCATION')}/endpoints/{ENDPOINT_ID}"
)
output = endpoint.predict(
    instances=[
        {
            "inputs": "<bos><start_of_turn>user\nWhat's Deep Learning?<end_of_turn>\n<start_of_turn>model\n",
            "parameters": {
                "max_new_tokens": 128,
                "do_sample": True,
                "top_p": 0.95,
                "temperature": 0.7,
            },
        },
    ],
)
print(output.predictions[0])

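Via the Vertex AI Online Prediction UI

Alternatively, for testing purposes, you can also use the Vertex AI Online Prediction UI, which provides a field that expects the JSON payload formatted according to the Vertex AI specification (as in the examples above), i.e.: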

{
    "instances": [
        {
            "inputs": "<bos><start_of_turn>user\nWhat's Deep Learning?<end_of_turn>\n<start_of_turn>model\n",
            "parameters": {
                "max_new_tokens": 128,
                "do_sample": true,
                "top_p": 0.95,
                "temperature": 0.7
            }
        }
    ]
}
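When copying a payload from Python into the UI, note that the JSON form differs slightly from the Python literal: True becomes true, and the newlines in the prompt must be escaped. A small sketch using the standard json module produces text that can be pasted as is.

```python
import json

# Same payload as in the Python examples above, written as a Python dictionary.
payload = {
    "instances": [
        {
            "inputs": "<bos><start_of_turn>user\nWhat's Deep Learning?<end_of_turn>\n<start_of_turn>model\n",
            "parameters": {
                "max_new_tokens": 128,
                "do_sample": True,
                "top_p": 0.95,
                "temperature": 0.7,
            },
        }
    ]
}

# json.dumps converts Python booleans and escapes newlines inside strings.
payload_json = json.dumps(payload, indent=4)
print(payload_json)
```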

Vertex AI Endpoint online inference

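Resource clean-up

Finally, you can release the resources you've created as follows, to avoid unnecessary costs:

deployed_model.undeploy_all undeploys every model deployed to the endpoint.
deployed_model.delete gracefully deletes the endpoint the model was deployed to, after the undeploy_all method.
model.delete deletes the model from the registry.

Deleting the model from Vertex AI deletes neither the GCS bucket nor its contents, since the model artifacts are stored in the bucket.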

deployed_model.undeploy_all()
deployed_model.delete()
model.delete()
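The order of the three calls matters: the endpoint is emptied first, then deleted, and only then is the model removed from the registry. The sketch below wraps that order in a hypothetical helper, release_resources, and runs it against stand-in objects so it works without GCP access. With real resources, the endpoint and model objects from this tutorial would be passed instead.

```python
class FakeResource:
    """Stand-in that records calls, so the sketch runs without GCP access."""

    def __init__(self, label, log):
        self.label = label
        self.log = log

    def undeploy_all(self):
        self.log.append(f"{self.label}.undeploy_all")

    def delete(self):
        self.log.append(f"{self.label}.delete")


def release_resources(endpoint, model):
    """Undeploy everything from the endpoint, delete it, then delete the model."""
    endpoint.undeploy_all()
    endpoint.delete()
    model.delete()


calls = []
release_resources(FakeResource("endpoint", calls), FakeResource("model", calls))
print(calls)
```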

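Alternatively, you can also remove them from the Google Cloud Console with the following steps:

Go to Vertex AI in Google Cloud
Go to Deploy and use -> Online prediction
Click on the endpoint and then on the deployed model to "Undeploy model from endpoint"
Then go back to the endpoint list and remove the endpoint
Finally, go to Deploy and use -> Model Registry, and remove the model

Additionally, you may also want to remove the GCS bucket; to do so, you can use the following gcloud command: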

!gcloud storage rm -r $BUCKET_URI

Alternatively, just remove the bucket and/or its contents from the Google Cloud Console.

📍 Find the complete example on GitHub here!