Deploy Llama 3.2 11B Vision with TGI DLC on GKE

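Llama 3.2 is the latest release (as of October 2024) of the Llama family of open LLMs from Meta. Llama 3.2 Vision comes in two sizes: 11B, for efficient deployment and development on consumer-grade GPUs, and 90B, for large-scale applications. Text Generation Inference (TGI) is a toolkit developed by Hugging Face for deploying and serving LLMs with high-performance text generation. Google Kubernetes Engine (GKE) is a fully managed Kubernetes service in Google Cloud that can be used to deploy and operate containerized applications at scale on Google's infrastructure.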

This example shows how to deploy meta-llama/Llama-3.2-11B-Vision-Instruct on GKE via the Hugging Face purpose-built Deep Learning Container (DLC) for Text Generation Inference (TGI) on Google Cloud.

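Regarding the licensing terms, the Llama 3.2 license is very similar to that of Llama 3.1, with one key difference in the acceptable use policy: any individual domiciled in, or company with a principal place of business in, the European Union (EU) is not granted the license rights to use the multimodal models included in Llama 3.2. This restriction does not apply to end users of a product or service that incorporates any such multimodal models, so people can still build global products with the vision variants.

For full details, make sure to read the official license and the acceptable use policy.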

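Setup / Configuration

First, you need to install both gcloud and kubectl on your local machine. These are the command-line tools for Google Cloud and Kubernetes, used to interact with GCP and the GKE cluster, respectively.

To install gcloud, follow the instructions at Cloud SDK Documentation - Install the gcloud CLI.
To install kubectl, follow the instructions at Kubernetes Documentation - Install Tools.

Optionally, to make the commands in this tutorial easier to use, you can set the following environment variables for GCP: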

export PROJECT_ID=your-project-id
export LOCATION=your-location
export CLUSTER_NAME=your-cluster-name

Then you need to log in to your GCP account and set the project ID to the one you want to use to deploy the GKE cluster.

gcloud auth login
gcloud auth application-default login  # For local development
gcloud config set project $PROJECT_ID

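Once you are logged in, you need to enable the necessary service APIs in GCP, i.e. the Google Kubernetes Engine API, the Google Container Registry API, and the Google Container File System API, which are required to deploy the GKE cluster and the Hugging Face DLC for TGI.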

gcloud services enable container.googleapis.com
gcloud services enable containerregistry.googleapis.com
gcloud services enable containerfilesystem.googleapis.com

Additionally, to use kubectl with the GKE cluster credentials, you also need to install the gke-gcloud-auth-plugin, which can be installed with gcloud as follows:

gcloud components install gke-gcloud-auth-plugin

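The gke-gcloud-auth-plugin does not have to be installed via gcloud specifically; to read about the alternative installation methods, see GKE Documentation - Install kubectl and configure cluster access.

Create GKE Cluster

Once everything is set up, you can proceed with creating the GKE cluster in either Autopilot or Standard mode. In Autopilot mode, you only need to create the cluster, and the node pools are created on demand based on the requirements of your deployments. In Standard mode, you need to create the node pools yourself and manage most of the underlying infrastructure.

This example uses Autopilot mode, which is the recommended mode for most workloads since the underlying infrastructure is managed by Google; alternatively, you can use Standard mode.

Before creating the GKE Autopilot cluster, make sure to check GKE Documentation - Optimize Autopilot Pod performance by choosing a machine series, since not all cluster versions support GPU accelerators; for example, nvidia-l4 is not supported in GKE cluster versions 1.28.3 or lower.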

gcloud container clusters create-auto $CLUSTER_NAME \
    --project=$PROJECT_ID \
    --location=$LOCATION \
    --release-channel=stable \
    --cluster-version=1.29 \
    --no-autoprovisioning-enable-insecure-kubelet-readonly-port

To select a specific GKE cluster version available in your location, you can run the following command:

gcloud container get-server-config \
    --flatten="channels" \
    --filter="channels.channel=STABLE" \
    --format="yaml(channels.channel,channels.defaultVersion)" \
    --location=$LOCATION

For more information, see GKE Documentation - Specifying cluster version.

Once the GKE cluster is created, you can get the credentials to access it via kubectl with the following command:

gcloud container clusters get-credentials $CLUSTER_NAME --location=$LOCATION

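Get Hugging Face token and set secrets in GKE

As meta-llama/Llama-3.2-11B-Vision-Instruct is a gated model whose access is restricted in the European Union (EU), you need to set a Kubernetes secret containing a Hugging Face Hub token via kubectl.

To generate a custom token for the Hugging Face Hub, you can follow the instructions at Hugging Face Hub - User access tokens; the recommended way of setting it is to install the huggingface_hub Python SDK as follows: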

pip install --upgrade --quiet huggingface_hub

Then log in using the generated token, which needs read access to gated/private models:

huggingface-cli login

Finally, you can create the Kubernetes secret holding the generated Hugging Face Hub token, retrieving the token with the huggingface_hub Python SDK as follows:

kubectl create secret generic hf-secret \
    --from-literal=hf_token=$(python -c "from huggingface_hub import get_token; print(get_token())") \
    --dry-run=client -o yaml | kubectl apply -f -

Alternatively, you can set the token directly as follows:

kubectl create secret generic hf-secret \
    --from-literal=hf_token=hf_*** \
    --dry-run=client -o yaml | kubectl apply -f -
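The kubectl commands above render a Secret manifest client-side and then apply it. As a minimal sketch of what that manifest contains, the snippet below builds an equivalent object in Python; the token value is a placeholder and the helper name `build_hf_secret` is hypothetical. Note that values under `data` are base64-encoded, not encrypted.

```python
import base64
import json


def build_hf_secret(token: str, name: str = "hf-secret", key: str = "hf_token") -> dict:
    """Build a Secret manifest similar to `kubectl create secret generic --from-literal`."""
    # Kubernetes stores Secret values under `data` as base64-encoded strings.
    encoded = base64.b64encode(token.encode("utf-8")).decode("ascii")
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "Opaque",
        "data": {key: encoded},
    }


# Placeholder token for illustration only; never hard-code a real token.
manifest = build_hf_secret("hf_example_token")
print(json.dumps(manifest, indent=2))
```

Since kubectl accepts JSON as well as YAML, the printed manifest could be piped to `kubectl apply -f -` in the same way as the commands above.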

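For more information on how to set Kubernetes secrets in a GKE cluster, see Secret Manager Documentation - Use Secret Manager add-on with Google Kubernetes Engine.

Deploy TGI

Now you can proceed with the Kubernetes deployment of the Hugging Face DLC for TGI, serving the meta-llama/Llama-3.2-11B-Vision-Instruct model from the Hugging Face Hub.

To explore all the models that can be served via TGI, you can browse the models tagged with text-generation-inference on the Hub; specifically, if you are interested in Vision Language Models (VLMs), you can browse the models tagged with both text-generation-inference and image-text-to-text.

The Hugging Face DLC for TGI will be deployed via kubectl, from the configuration files in the config/ directory:

deployment.yaml: contains the deployment details of the pod, including the reference to the Hugging Face DLC for TGI, with MODEL_ID set to meta-llama/Llama-3.2-11B-Vision-Instruct. As the GKE cluster was deployed in Autopilot mode, the specified resources (i.e. 2 x L4 GPUs) will be allocated automatically; if you used Standard mode instead, you should make sure that your node pool has those GPUs available.
service.yaml: contains the service details of the pod, exposing port 8080 for the TGI service.
(optional) ingress.yaml: contains the ingress details of the pod, exposing the service to the external world so that it can be accessed via the ingress IP.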

kubectl apply -f config/
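The configuration files themselves are not reproduced in this section. As a rough sketch of what a deployment along the lines described above might contain, the snippet below assembles a Deployment manifest in Python. Only `tgi-deployment`, `MODEL_ID`, port 8080, the two L4 GPUs, and the `hf-secret` / `hf_token` secret come from the text; the image URI placeholder, the labels, the container name, and the `NUM_SHARD`, `PORT`, and `HF_TOKEN` environment variables are assumptions, so the actual config/deployment.yaml may differ.

```python
import json

MODEL_ID = "meta-llama/Llama-3.2-11B-Vision-Instruct"
# Placeholder: the image URI of the Hugging Face DLC for TGI is not given in this section.
TGI_IMAGE = "<TGI_DLC_IMAGE_URI>"


def build_tgi_deployment(image: str = TGI_IMAGE, gpus: int = 2) -> dict:
    """Assemble an illustrative Deployment manifest for TGI on L4 GPUs."""
    labels = {"app": "tgi-server"}  # hypothetical label
    container = {
        "name": "tgi-container",  # hypothetical name
        "image": image,
        "ports": [{"containerPort": 8080}],
        "env": [
            {"name": "MODEL_ID", "value": MODEL_ID},
            # Assumption: shard the model across all requested GPUs.
            {"name": "NUM_SHARD", "value": str(gpus)},
            # Assumption: make the server listen on the port the service exposes.
            {"name": "PORT", "value": "8080"},
            # Assumption: pass the Hub token from the secret created earlier.
            {
                "name": "HF_TOKEN",
                "valueFrom": {"secretKeyRef": {"name": "hf-secret", "key": "hf_token"}},
            },
        ],
        "resources": {"limits": {"nvidia.com/gpu": gpus}},
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "tgi-deployment"},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [container],
                    # GKE node label used to request a specific accelerator type.
                    "nodeSelector": {"cloud.google.com/gke-accelerator": "nvidia-l4"},
                },
            },
        },
    }


deployment = build_tgi_deployment()
print(json.dumps(deployment, indent=2))
```

This sketch omits details a production manifest would likely need, such as a shared-memory volume and readiness probes.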

The Kubernetes deployment may take a few minutes to be ready, so you can check its status with the following command:

kubectl get pods

You can also check the logs of the pod being deployed:

kubectl logs -f <POD>

Alternatively, you can wait for the deployment to be ready with the following command:

kubectl wait --for=condition=Available --timeout=700s deployment/tgi-deployment

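Inference with TGI

To run inference against the deployed TGI service, you can either:

Port-forward the deployed TGI service to port 8080, so that it can be accessed via localhost, with the following command: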
```
kubectl port-forward service/tgi-service 8080:8080
```
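Access the TGI service via the external IP of the ingress, which is the default scenario here since the ingress configuration is defined in the config/ingress.yaml file (although it can be skipped in favour of port-forwarding); the IP can be retrieved with the following command: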
```
kubectl get ingress tgi-ingress -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
```
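Whichever access path you choose, it can help to confirm that the server responds before sending requests. The sketch below polls TGI's `/health` route using only the standard library; the helper names, the retry counts, and the delays are assumptions for illustration rather than part of the original example.

```python
import time
import urllib.error
import urllib.request


def is_healthy(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_ready(probe, attempts: int = 30, delay: float = 10.0, sleep=time.sleep) -> bool:
    """Call `probe` up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


# Example usage with the port-forward running (not executed here):
# ready = wait_until_ready(lambda: is_healthy("http://localhost:8080"))
```

Passing the probe as a callable keeps the retry loop independent of the transport, so the same loop works against the ingress IP by changing only the base URL.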

Via cURL

To send a POST request to the TGI service using cURL, you can run the following command:

curl http://localhost:8080/v1/chat/completions \
    -X POST \
    -d '{"messages":[{"role":"user","content":[{"type":"text","text":"What'\''s in this image?"},{"type":"image_url","image_url":{"url":"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png"}}]}],"temperature":0.7,"top_p":0.95,"max_tokens":128,"stream":false}' \
    -H 'Content-Type: application/json'

Or send a POST request to the ingress IP instead (no port needs to be specified):

curl http://$(kubectl get ingress tgi-ingress -o jsonpath='{.status.loadBalancer.ingress[0].ip}')/v1/chat/completions \
    -X POST \
    -d '{"messages":[{"role":"user","content":[{"type":"text","text":"What'\''s in this image?"},{"type":"image_url","image_url":{"url":"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png"}}]}],"temperature":0.7,"top_p":0.95,"max_tokens":128,"stream":false}' \
    -H 'Content-Type: application/json'

This produces the following output:

{"object":"chat.completion","id":"","created":1728041178,"model":"meta-llama/Llama-3.2-11B-Vision-Instruct","system_fingerprint":"2.3.1-native","choices":[{"index":0,"message":{"role":"assistant","content":"The image shows a rabbit wearing a space suit, standing on a rocky, orange-colored surface. The background is a reddish-brown color with a bright light shining from the right side of the image."},"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":43,"completion_tokens":42,"total_tokens":85}}
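As a small illustration of how such a response can be consumed, the sketch below parses the JSON output shown above with Python's standard library and pulls out the generated text, the finish reason, and the token usage. The variable names are arbitrary; a `finish_reason` of `length` would indicate that generation stopped at `max_tokens` rather than at a natural end.

```python
import json

# The response body shown above, copied verbatim.
raw = (
    '{"object":"chat.completion","id":"","created":1728041178,'
    '"model":"meta-llama/Llama-3.2-11B-Vision-Instruct","system_fingerprint":"2.3.1-native",'
    '"choices":[{"index":0,"message":{"role":"assistant","content":"The image shows a rabbit '
    'wearing a space suit, standing on a rocky, orange-colored surface. The background is a '
    'reddish-brown color with a bright light shining from the right side of the image."},'
    '"logprobs":null,"finish_reason":"stop"}],'
    '"usage":{"prompt_tokens":43,"completion_tokens":42,"total_tokens":85}}'
)

response = json.loads(raw)
choice = response["choices"][0]
text = choice["message"]["content"]
# "length" means the answer was cut off at max_tokens; "stop" means it ended naturally.
truncated = choice["finish_reason"] == "length"
usage = response["usage"]

print(text)
print(f"truncated={truncated}, total_tokens={usage['total_tokens']}")
```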

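Via Python

To run inference using Python, you can use either the huggingface_hub Python SDK (recommended) or the openai Python SDK.

The examples below use localhost, but if you deployed TGI with the ingress, feel free to use the ingress IP mentioned above instead (without specifying the port).

huggingface_hub

You can install it via pip as pip install --upgrade --quiet huggingface_hub, and then run the following snippet to mimic the cURL command above, i.e. sending a request to the Messages API: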

from huggingface_hub import InferenceClient

client = InferenceClient(base_url="http://localhost:8080", api_key="-")

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png"
                    },
                },
            ],
        },
    ],
    max_tokens=128,
)

This produces the following output:

ChatCompletionOutput(choices=[ChatCompletionOutputComplete(finish_reason='length', index=0, message=ChatCompletionOutputMessage(role='assistant', content="The image depicts an astronaut rabbit standing on a rocky surface, which is likely Mars or a similar planet. The astronaut's suit takes up most of the shot, with a clear distinction between its lighter parts, such as the chest and limbs, and its darker parts, like the shell that protects it from space's toxic environment. The head portion consists of the full-length helmet that's transparent at the front, allowing the rabbit's face to be visible. \n\nThe astronaut rabbit stands vertically, looking into the distance, its head slightly pointed forward and pointing with its right arm down. Its left arm hangs at its side, adding balance to its stance", tool_calls=None), logprobs=None)], created=1728041247, id='', model='meta-llama/Llama-3.2-11B-Vision-Instruct', system_fingerprint='2.3.1-native', usage=ChatCompletionOutputUsage(completion_tokens=128, prompt_tokens=43, total_tokens=171))

openai

Additionally, you can also use the Messages API via openai; you can install it via pip as pip install --upgrade openai, and then run:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1/",
    api_key="-",
)

chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png"
                    },
                },
            ],
        },
    ],
    max_tokens=128,
)

This produces the following output:

ChatCompletion(id='', choices=[Choice(finish_reason='length', index=0, logprobs=None, message=ChatCompletionMessage(content='The image features an astronaut rabbit on the surface of Mars. Given that rabbits (Oryctolagus cuniculus) are mammals, temperature regulation, gravity, and radiation exposure are all potential hazards in an extraterrestrial space suit designed for a rabbit. As optimal space exploration wear is human-centric, it is challenging to transpose a hypothetical rabbit astronaut suit to adapt to the specialized needs of rabbits.\n\n**Adaptations to Suit Rabbit Physiology**\n\nTo simulate a normalized temperature change, a cooling system designed for thermal comfort could be used. This system might involve a water-based cooling mechanism, similar to a panty hose or suit liner filled with', refusal=None, role='assistant', function_call=None, tool_calls=None))], created=1728041281, model='meta-llama/Llama-3.2-11B-Vision-Instruct', object='chat.completion', service_tier=None, system_fingerprint='2.3.1-native', usage=CompletionUsage(completion_tokens=128, prompt_tokens=43, total_tokens=171))

Other use cases

Besides the image captioning shown above, some other potential VLM use cases are the following:
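The requests in this section share the same structure as the captioning request and differ only in the question and the image URL. As a minimal sketch, a small helper (the name `build_chat_payload` is hypothetical) can assemble that payload so it can be reused across use cases; the sampling defaults mirror the values used in the cURL commands.

```python
import json


def build_chat_payload(
    question: str,
    image_url: str,
    max_tokens: int = 128,
    temperature: float = 0.7,
    top_p: float = 0.95,
) -> dict:
    """Build a Messages API payload with one text part and one image part."""
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        "temperature": temperature,
        "top_p": top_p,
        "max_tokens": max_tokens,
        "stream": False,
    }


payload = build_chat_payload(
    "Which era does this piece belong to? Give details about the era.",
    "https://huggingface.co/datasets/huggingface/release-assets/resolve/main/rococo.jpg",
)
print(json.dumps(payload))
```

The printed JSON can be used as the `-d` body of a cURL request, or the `messages` list can be passed to either Python client.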

Visual Question Answering (VQA)

Given an image and a question about the image, generate an answer to the question.

curl http://localhost:8080/v1/chat/completions \
    -X POST \
    -d '{"messages":[{"role":"user","content":[{"type":"text","text":"Which era does this piece belong to? Give details about the era."},{"type":"image_url","image_url":{"url":"https://huggingface.co/datasets/huggingface/release-assets/resolve/main/rococo.jpg"}}]}],"temperature":0.7,"top_p":0.95,"max_tokens":128,"stream":false}' \
    -H 'Content-Type: application/json'

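For example, given an artwork, you can ask the VLM questions about it.

The piece you're referring to is a painting from the Rococo era, specifically from the 18th century. The Rococo style emerged in France in the early 18th century and spread throughout Europe, particularly in Germany and Russia, during the mid-18th century.

Characteristics of Rococo Art

Lightness and Airiness: Rococo art is characterized by its lightness and airiness, often featuring delicate lines, pastel colors, and intricate details.
Curves and Lines: Rococo artists favored curved lines and shapes over straight lines, creating a sense of fluidity and movement in their artwork.

Image information retrieval

Given an image, retrieve information from the image.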

curl http://localhost:8080/v1/chat/completions \
    -X POST \
    -d '{"messages":[{"role":"user","content":[{"type":"text","text":"How long does it take from invoice date to due date? Be short and concise."},{"type":"image_url","image_url":{"url":"https://huggingface.co/datasets/huggingface/release-assets/resolve/main/invoice.png"}}]}],"temperature":0.7,"top_p":0.95,"max_tokens":128,"stream":false}' \
    -H 'Content-Type: application/json'

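For example, given an invoice, you can ask the VLM about information that is present in, or can be inferred from, the provided image.

To calculate the time difference between the invoice date and the due date, we need to subtract the invoice date from the due date.

Invoice Date: 11 February 2019
Due Date: 26 February 2019

Time Difference = Due Date - Invoice Date
Time Difference = 26 February 2019 - 11 February 2019
Time Difference = 15 days

Therefore, it takes 15 days from the invoice date to the due date.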
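Since model outputs can contain arithmetic slips, it is worth checking such an answer independently. The sketch below redoes the subtraction with Python's `datetime`, reading the two dates as day/month/year (11/02/2019 and 26/02/2019), which is an assumption consistent with the 15-day result reported by the model.

```python
from datetime import date

# Dates read as day/month/year: 11 February 2019 and 26 February 2019.
invoice_date = date(2019, 2, 11)
due_date = date(2019, 2, 26)

# Subtracting two dates yields a timedelta; .days gives the whole-day difference.
days_until_due = (due_date - invoice_date).days
print(days_until_due)  # 15
```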

Delete GKE Cluster

Finally, once you are done using TGI on the GKE cluster, you can safely delete the cluster to avoid incurring unnecessary costs.

gcloud container clusters delete $CLUSTER_NAME --location=$LOCATION

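Alternatively, if you want to keep the cluster, you can scale the replicas of the deployment down to zero. Because the GKE cluster was deployed in Autopilot mode, node pools are created only when needed and destroyed when they are not, and by default the cluster runs on just a single e2-small instance.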

kubectl scale --replicas=0 deployment/tgi-deployment

📍 Find the complete example on GitHub here!
