Deploy Llama 3.1 405B on GKE with the TGI DLC

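Llama 3.1 is one of the latest large language models in the Llama family released by Meta (as of October 2024, the newest release is Llama 3.2). It comes in three sizes: 8B for efficient deployment and development on consumer-size GPUs, 70B for large-scale AI-native applications, and 405B for synthetic data generation, LLM-as-a-judge, distillation, and other use cases. The 405B variant is one of the largest open-source LLMs. Text Generation Inference (TGI) is a toolkit developed by Hugging Face for deploying and serving LLMs with high-performance text generation. Google Kubernetes Engine (GKE) is a fully managed Kubernetes service in Google Cloud for deploying and operating containerized applications at scale on Google's infrastructure.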

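This example shows how to deploy meta-llama/Llama-3.1-405B-Instruct-FP8 on a GKE cluster with a node of 8 NVIDIA H100 GPUs, using the Deep Learning Container (DLC) that Hugging Face builds specifically for Text Generation Inference (TGI) on Google Cloud.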


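First, install gcloud and kubectl on your local machine. They are the command-line tools for Google Cloud and Kubernetes, used to interact with GCP and the GKE cluster respectively.

Optionally, to make the commands in this tutorial easier to reuse, set the following environment variables for GCP: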

export PROJECT_ID=your-project-id
export LOCATION=your-location
export CLUSTER_NAME=your-cluster-name

Then log in to your GCP account and set the project ID to the project you want to use for the GKE cluster.

gcloud auth login
gcloud auth application-default login  # For local development
gcloud config set project $PROJECT_ID

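Once logged in, enable the required service APIs in GCP: the Google Kubernetes Engine API, the Google Container Registry API, and the Google Container File System API. These are needed to deploy the GKE cluster and the Hugging Face DLC for TGI.

gcloud services enable container.googleapis.com
gcloud services enable containerregistry.googleapis.com
gcloud services enable containerfilesystem.googleapis.com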

In addition, to access the GKE cluster credentials with kubectl, you also need the gke-gcloud-auth-plugin, which can be installed with gcloud as follows:

gcloud components install gke-gcloud-auth-plugin

You do not have to install gke-gcloud-auth-plugin through gcloud. For other installation methods, see GKE Documentation - Install kubectl and configure cluster access.

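Finally, note that you will probably need to request a quota increase to access A3 instances with 8 NVIDIA H100 GPUs, since these require manual approval from Google Cloud. To do so, go to IAM Admin - Quotas and apply the following filters:

  • Service: Compute Engine API: GKE relies on Compute Engine for resource allocation.

  • Dimensions (e.g. location): region: $LOCATION: replace $LOCATION with the location set above. Not all regions offer NVIDIA H100 GPUs, so check Compute Engine Documentation - Available regions and zones.

  • gpu_family: NVIDIA_H100: the identifier for NVIDIA H100 GPUs on Google Cloud.

Then request a quota increase to 8 NVIDIA H100 GPUs so that you can run meta-llama/Llama-3.1-405B-Instruct-FP8.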
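As a rough sanity check on why a full node of 8 GPUs is requested, the weight footprint can be estimated with simple arithmetic. The sketch below assumes 80 GB of memory per H100 and counts only the model weights; the KV cache, activations, and runtime overhead also need room, so treat the numbers as an approximation rather than a sizing guarantee.

```python
# Back-of-the-envelope check: do the FP8 weights fit on one 8 x H100 node?
PARAMS = 405e9            # parameter count of Llama 3.1 405B
BYTES_PER_PARAM_FP8 = 1   # FP8 stores each weight in one byte
GPU_MEMORY_GB = 80        # assumption: 80 GB of memory per H100
NUM_GPUS = 8


def weights_gb(params: float, bytes_per_param: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


fp8_gb = weights_gb(PARAMS, BYTES_PER_PARAM_FP8)
bf16_gb = weights_gb(PARAMS, 2)  # 16-bit weights take two bytes each
total_gb = GPU_MEMORY_GB * NUM_GPUS

print(f"FP8 weights:  ~{fp8_gb:.0f} GB")
print(f"BF16 weights: ~{bf16_gb:.0f} GB")
print(f"Node memory:  {total_gb} GB")
print(f"Headroom with FP8: ~{total_gb - fp8_gb:.0f} GB for KV cache and activations")
```

Under these assumptions the FP8 checkpoint fits on a single node with room to spare, while 16-bit weights alone would exceed the node's total GPU memory.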

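Create GKE Cluster

With everything set up, you can create the GKE cluster and node pool, which in this case is a single GPU node, so that the GPU accelerators can be used for high-performance inference, in line with TGI's internal optimizations for GPUs.

The cluster will be deployed in Autopilot mode, which is the recommended mode for most workloads because Google manages the underlying infrastructure. Alternatively, you can use Standard mode.

Before creating the GKE Autopilot cluster, check GKE Documentation - Optimize Autopilot Pod performance by choosing a machine series, since not every cluster version supports every GPU accelerator. For example, nvidia-l4 is not supported in GKE cluster versions 1.28.3 or lower.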

gcloud container clusters create-auto $CLUSTER_NAME \
    --project=$PROJECT_ID \
    --location=$LOCATION \
    --release-channel=stable \
    --cluster-version=1.29

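To pick a specific GKE cluster version available in your location, you can run:

gcloud container get-server-config \
    --flatten="channels" \
    --filter="channels.channel=STABLE" \
    --format="yaml(channels.channel,channels.defaultVersion)" \
    --location=$LOCATION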

For more information, see GKE Documentation - Specifying cluster version.
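The version constraint mentioned above (an accelerator being unavailable at or below a given cluster version) can be checked mechanically before creating the cluster. The following is a minimal sketch; the helper names and the example version string `1.29.8-gke.1031000` are illustrative assumptions, not values taken from this tutorial.

```python
def parse_gke_version(version: str) -> tuple[int, ...]:
    """Return the numeric part of a version such as '1.29' or '1.29.8-gke.1031000'."""
    core = version.split("-", 1)[0]  # drop a '-gke.N' style suffix if present
    return tuple(int(part) for part in core.split("."))


def is_newer_than(version: str, threshold: str) -> bool:
    """True if `version` is strictly newer than `threshold`."""
    a, b = parse_gke_version(version), parse_gke_version(threshold)
    width = max(len(a), len(b))
    # Pad with zeros so that '1.29' compares as '1.29.0'.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a > b


# nvidia-l4 is reported as unsupported on 1.28.3 or lower, so a newer version is needed.
print(is_newer_than("1.29", "1.28.3"))
print(is_newer_than("1.28.3-gke.1286000", "1.28.3"))
```

The first call prints True and the second prints False, because a version equal to the threshold is still in the unsupported range.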

[Figure: GKE Cluster in the GCP Console]

Once the GKE cluster is created, get the credentials to access it with:

gcloud container clusters get-credentials $CLUSTER_NAME --location=$LOCATION

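Get Hugging Face token and set secrets in GKE

Because meta-llama/Llama-3.1-405B-Instruct-FP8 is a gated model, you need to set a Kubernetes secret containing your Hugging Face Hub token via kubectl.

To generate a custom token for the Hugging Face Hub, follow the instructions in Hugging Face Hub - User access tokens. The recommended way to set it up is to install the huggingface_hub Python SDK: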

pip install --upgrade --quiet huggingface_hub


Then log in with the generated token:

huggingface-cli login

Finally, create the Kubernetes secret from the generated Hugging Face Hub token, retrieving it with the huggingface_hub Python SDK:

kubectl create secret generic hf-secret \
    --from-literal=hf_token=$(python -c "from huggingface_hub import get_token; print(get_token())") \
    --dry-run=client -o yaml | kubectl apply -f -


Alternatively, though not recommended, you can paste the token directly into the kubectl command:

kubectl create secret generic hf-secret \
    --from-literal=hf_token=hf_*** \
    --dry-run=client -o yaml | kubectl apply -f -

[Figure: GKE Secret in the GCP Console]

For more information on setting up Kubernetes secrets in a GKE cluster, see Secret Manager Documentation - Use Secret Manager add-on with Google Kubernetes Engine.
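To make it clearer what the secret created above contains, the sketch below builds an equivalent Secret manifest by hand. The function name and the `hf_placeholder` value are hypothetical. The point it illustrates is that values under `data` in a Kubernetes Secret are base64-encoded, which is an encoding and not encryption.

```python
import base64
import json


def build_secret_manifest(name: str, key: str, value: str) -> dict:
    """Build an Opaque Secret manifest with a single base64-encoded entry."""
    encoded = base64.b64encode(value.encode("utf-8")).decode("ascii")
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "Opaque",
        "data": {key: encoded},
    }


# "hf_placeholder" stands in for a real token; never commit real tokens to files.
manifest = build_secret_manifest("hf-secret", "hf_token", "hf_placeholder")
print(json.dumps(manifest, indent=2))
```

Anyone with read access to the Secret can decode the value, so access to it should be restricted accordingly.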

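Deploy TGI

You can now proceed with the Kubernetes deployment of the Hugging Face DLC for TGI, serving the meta-llama/Llama-3.1-405B-Instruct-FP8 model from the Hugging Face Hub.

To explore all the models that can be served with TGI, browse the models tagged text-generation-inference on the Hub.

The Hugging Face DLC for TGI is deployed with kubectl from the configuration files in the config/ directory: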

kubectl apply -f config/

The Kubernetes deployment may take a few minutes to become ready, so you can check its status with:

kubectl get pods


Alternatively, you can wait for the deployment to become available:

kubectl wait --for=condition=Available --timeout=700s deployment/tgi-deployment
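Waiting for the Available condition with a timeout amounts to polling the deployment status until the condition is true or the deadline passes. The sketch below shows that logic in isolation. Both helper names are hypothetical, and the status dictionary is assumed to have the shape returned by `kubectl get deployment tgi-deployment -o json`.

```python
import time
from typing import Callable


def deployment_available(deployment: dict) -> bool:
    """True if the deployment reports an 'Available' condition with status 'True'."""
    conditions = deployment.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Available" and c.get("status") == "True"
        for c in conditions
    )


def wait_until(check: Callable[[], bool], timeout_s: float, interval_s: float = 5.0) -> bool:
    """Poll `check` until it returns True or `timeout_s` seconds have elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Illustrative status objects, not output captured from a real cluster.
ready = {"status": {"conditions": [{"type": "Available", "status": "True"}]}}
pending = {"status": {"conditions": [{"type": "Available", "status": "False"}]}}

print(wait_until(lambda: deployment_available(ready), timeout_s=1.0))
print(wait_until(lambda: deployment_available(pending), timeout_s=0.05, interval_s=0.01))
```

In a real script the `check` callable would fetch the current deployment status on each call instead of reading a fixed dictionary.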

[Figure: GKE Deployment in the GCP Console]

[Figure: GKE Deployment Logs in the GCP Console]

Inference with TGI

To run inference against the deployed TGI service, you can either:

  • Port-forward the deployed TGI service to port 8080 so that it is reachable via localhost:

    kubectl port-forward service/tgi-service 8080:8080
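  • Access the TGI service through the external IP of the ingress. This is the default scenario here because the ingress is defined in config/ingress.yaml, although you can skip the ingress and use port-forwarding instead. Retrieve the external IP with: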

    kubectl get ingress tgi-ingress -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
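The two access options differ only in the base URL that clients point at. A minimal sketch of that choice follows; the function names and the example IP address (taken from a documentation-only range) are assumptions for illustration.

```python
from typing import Optional


def tgi_base_url(ingress_ip: Optional[str] = None, local_port: int = 8080) -> str:
    """Return the base URL: the ingress IP if given, else the port-forwarded localhost."""
    if ingress_ip:
        # The ingress serves plain HTTP on the default port, so no port is appended.
        return f"http://{ingress_ip}"
    return f"http://localhost:{local_port}"


def chat_completions_url(base_url: str) -> str:
    """Append the Messages API route used by the cURL examples."""
    return base_url.rstrip("/") + "/v1/chat/completions"


print(chat_completions_url(tgi_base_url()))
print(chat_completions_url(tgi_base_url(ingress_ip="203.0.113.10")))
```

Keeping the base URL in one place makes it easy to switch between port-forwarding during development and the ingress afterwards.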

Via cURL

To send a POST request to the TGI service with cURL, run:

curl http://localhost:8080/v1/chat/completions \
    -X POST \
    -d '{"messages":[{"role":"system","content":"You are a helpful assistant."},{"role":"user","content":"What'\''s Deep Learning?"}],"temperature":0.7,"top_p":0.95,"max_tokens":128}' \
    -H 'Content-Type: application/json'

Or send the POST request to the ingress IP (no port needs to be specified):

curl http://$(kubectl get ingress tgi-ingress -o jsonpath='{.status.loadBalancer.ingress[0].ip}')/v1/chat/completions \
    -X POST \
    -d '{"messages":[{"role":"system","content":"You are a helpful assistant."},{"role":"user","content":"What'\''s Deep Learning?"}],"temperature":0.7,"top_p":0.95,"max_tokens":128}' \
    -H 'Content-Type: application/json'


{"object":"chat.completion","id":"","created":1727782287,"model":"meta-llama/Llama-3.1-405B-Instruct-FP8","system_fingerprint":"2.2.0-native","choices":[{"index":0,"message":{"role":"assistant","content":"Deep learning is a subset of machine learning, which is a field of artificial intelligence (AI) that enables computers to learn from data without being explicitly programmed. It's a type of neural network that's inspired by the structure and function of the human brain.\n\nIn traditional machine learning, computers are trained on data using algorithms that are designed to recognize patterns and make predictions. However, these algorithms are often limited in their ability to handle complex data, such as images, speech, and text.\n\nDeep learning, on the other hand, uses multiple layers of artificial neural networks to analyze data. Each layer processes the data in a different way, allowing the"},"logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":46,"completion_tokens":128,"total_tokens":174}}
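Hand-written JSON in a shell command is easy to get wrong (quoting, stray braces), so it can help to build the request body and read the response programmatically. The sketch below uses only the standard library. The helper names are hypothetical, and the sample response is an abbreviated stand-in for the output shown above, not real server output.

```python
import json


def build_chat_payload(system: str, user: str, max_tokens: int = 128) -> str:
    """Serialize a Messages API request body with the same fields as the cURL example."""
    body = {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "temperature": 0.7,
        "top_p": 0.95,
        "max_tokens": max_tokens,
    }
    return json.dumps(body)


def extract_reply(response_text: str) -> tuple[str, str, int]:
    """Return (assistant text, finish_reason, total_tokens) from a chat completion."""
    data = json.loads(response_text)
    choice = data["choices"][0]
    return choice["message"]["content"], choice["finish_reason"], data["usage"]["total_tokens"]


# Abbreviated sample with the same structure as the response above.
sample_response = json.dumps({
    "object": "chat.completion",
    "choices": [{
        "index": 0,
        "message": {"role": "assistant", "content": "Deep learning is a subset of machine learning"},
        "finish_reason": "length",
    }],
    "usage": {"prompt_tokens": 46, "completion_tokens": 128, "total_tokens": 174},
})

print(build_chat_payload("You are a helpful assistant.", "What's Deep Learning?"))
print(extract_reply(sample_response))
```

A finish_reason of "length" means generation stopped because it reached max_tokens, which is why the reply in the output above ends mid-sentence.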

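Via Python

To run inference from Python, you can use either the huggingface_hub Python SDK (recommended) or the openai Python SDK.

The examples below use localhost, but if you deployed TGI with the ingress, feel free to use the ingress IP mentioned above instead (no port needed).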


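You can install huggingface_hub via pip with pip install --upgrade --quiet huggingface_hub, then run the following snippet to mimic the cURL command above, i.e. send a request to the Messages API:

from huggingface_hub import InferenceClient

# Point the client at the port-forwarded TGI service; the API key is a dummy value.
client = InferenceClient(base_url="http://localhost:8080", api_key="-")

chat_completion = client.chat.completions.create(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What's Deep Learning?"},
    ],
    max_tokens=128,
)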


ChatCompletionOutput(choices=[ChatCompletionOutputComplete(finish_reason='length', index=0, message=ChatCompletionOutputMessage(role='assistant', content='Deep learning is a subset of machine learning that focuses on neural networks with many layers, typically more than two. These neural networks are designed to mimic the structure and function of the human brain, with each layer processing and transforming inputs in a hierarchical manner.\n\nIn traditional machine learning, models are trained using a set of predefined features, such as edges, textures, or shapes. In contrast, deep learning models learn to extract features from raw data automatically, without the need for manual feature engineering.\n\nDeep learning models are trained using large amounts of data and computational power, which enables them to learn complex patterns and relationships in the data. These models can be', tool_calls=None), logprobs=None)], created=1727782322, id='', model='meta-llama/Llama-3.1-405B-Instruct-FP8', system_fingerprint='2.2.0-native', usage=ChatCompletionOutputUsage(completion_tokens=128, prompt_tokens=46, total_tokens=174))

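Alternatively, you can format the prompt yourself and send it via the Text Generation API:

from huggingface_hub import InferenceClient, get_token
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-405B-Instruct-FP8", token=get_token())
client = InferenceClient("http://localhost:8080", api_key="-")

generation = client.text_generation(
    # Render the messages into a single prompt string with the model's chat template.
    prompt=tokenizer.apply_chat_template(
        [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What's Deep Learning?"},
        ],
        tokenize=False,
        add_generation_prompt=True,
    ),
    max_new_tokens=128,
)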


'Deep learning is a subset of machine learning that involves the use of artificial neural networks to analyze and interpret data. Inspired by the structure and function of the human brain, deep learning algorithms are designed to learn and improve on their own by automatically adjusting the connections between nodes or "neurons" in the network.\n\nIn traditional machine learning, algorithms are trained on a set of data and then use that training to make predictions or decisions on new, unseen data. However, these algorithms often rely on hand-engineered features and rules to extract relevant information from the data. In contrast, deep learning algorithms can automatically learn to extract relevant features and patterns from the'
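To give a feel for what "formatting the prompt yourself" produces, here is a simplified sketch of the Llama 3 style chat layout written as plain string formatting. It is an approximation for illustration only: the chat template shipped with the model's tokenizer is the source of truth and may insert additional text (for example, date lines in the system block), so the tokenizer should be used for real requests. The function name is hypothetical.

```python
def format_llama3_prompt(messages: list[dict]) -> str:
    """Render messages in a simplified Llama 3 style chat layout (illustrative only)."""
    prompt = "<|begin_of_text|>"
    for message in messages:
        # Each turn is a role header, a blank line, the content, then an end-of-turn token.
        prompt += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    # An open assistant header tells the model to generate the next turn.
    prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt


prompt = format_llama3_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's Deep Learning?"},
])
print(prompt)
```

Because the Text Generation API receives this single string rather than a list of messages, any mistake in the layout goes straight to the model, which is the main reason the Messages API is the more convenient option.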


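You can also use the Messages API via openai; install it with pip install --upgrade openai, then run:

from openai import OpenAI

# The API key is a dummy value; the base URL targets the port-forwarded TGI service.
client = OpenAI(base_url="http://localhost:8080/v1/", api_key="-")

chat_completion = client.chat.completions.create(
    model="tgi",  # assumed placeholder name; the client requires a model argument
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What's Deep Learning?"},
    ],
    max_tokens=128,
)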


ChatCompletion(id='', choices=[Choice(finish_reason='length', index=0, logprobs=None, message=ChatCompletionMessage(content='Deep learning is a subset of machine learning that involves the use of artificial neural networks to analyze and interpret data. Inspired by the structure and function of the human brain, deep learning algorithms are designed to learn and improve on their own by automatically adjusting the connections between nodes or "neurons" in the network.\n\nIn traditional machine learning, algorithms are trained using a set of predefined rules and features. In contrast, deep learning algorithms learn to identify patterns and features from the data itself, eliminating the need for manual feature engineering. This allows deep learning models to be highly accurate and efficient, especially when dealing with large and complex datasets.\n\nKey characteristics of deep', refusal=None, role='assistant', function_call=None, tool_calls=None))], created=1727782478, model='meta-llama/Llama-3.1-405B-Instruct-FP8', object='chat.completion', service_tier=None, system_fingerprint='2.2.0-native', usage=CompletionUsage(completion_tokens=128, prompt_tokens=46, total_tokens=174))

Delete GKE Cluster

Finally, once you are done using TGI on the GKE cluster, you can safely delete the cluster to avoid unnecessary costs.

gcloud container clusters delete $CLUSTER_NAME --location=$LOCATION

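Alternatively, if you want to keep the cluster, you can scale the replicas of the deployed pod down to 0, since a default GKE cluster deployed in Autopilot mode runs only a single e2-small instance.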

kubectl scale --replicas=0 deployment/tgi-deployment

📍 Find the complete example on GitHub here.

