Deploy Gemma2 with multiple LoRA adapters with TGI DLC on GKE

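Gemma 2 is an advanced, lightweight open model that improves on performance and efficiency, building on the research and technology of its predecessor and of the Gemini models developed by Google DeepMind and other teams across Google. Text Generation Inference (TGI) is a toolkit developed by Hugging Face for deploying and serving LLMs with high-performance text generation. Google Kubernetes Engine (GKE) is a fully managed Kubernetes service in Google Cloud for deploying and operating containerized applications at scale on GCP infrastructure.

This example shows how to deploy Gemma 2 2B from the Hugging Face Hub, together with multiple LoRA adapters fine-tuned for different purposes such as coding, SQL, or Japanese, on a GKE cluster running the Hugging Face DLC for TGI, a purpose-built container for deploying LLMs in a secure and managed environment.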

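Setup / Configuration

First, you need to install both gcloud and kubectl on your local machine. These are the command-line tools for Google Cloud and Kubernetes, used to interact with GCP and with the GKE cluster, respectively.

To install gcloud, follow the instructions at Cloud SDK Documentation - Install the gcloud CLI.
To install kubectl, follow the instructions at Kubernetes Documentation - Install Tools.

Optionally, to make the commands in this tutorial easier to use, you can set the following environment variables for GCP: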

export PROJECT_ID=your-project-id
export LOCATION=your-location
export CLUSTER_NAME=your-cluster-name

Then you need to log in to your GCP account and set the project ID to the one you want to use to deploy the GKE cluster.

gcloud auth login
gcloud auth application-default login  # For local development
gcloud config set project $PROJECT_ID

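Once you are logged in, you need to enable the necessary service APIs in GCP, namely the Google Kubernetes Engine API and the Google Container Registry API, which are required to deploy the GKE cluster and the Hugging Face DLC for TGI.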

gcloud services enable container.googleapis.com
gcloud services enable containerregistry.googleapis.com

Additionally, to use kubectl with the GKE cluster credentials, you also need to install the gke-gcloud-auth-plugin, which can be installed with gcloud as follows:

gcloud components install gke-gcloud-auth-plugin

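The gke-gcloud-auth-plugin does not have to be installed through gcloud specifically; to learn about alternative installation methods, see GKE Documentation - Install kubectl and configure cluster access.

Create GKE Cluster

Once everything is set up, you can proceed with creating the GKE cluster and the node pool, which in this case will be a single GPU node. This uses the GPU accelerator for high-performance inference and follows TGI's recommendations, which are based on its internal optimizations for GPUs.

The GKE cluster will be deployed in Autopilot mode, the recommended mode for most workloads because the underlying infrastructure is managed by Google. Alternatively, you can use Standard mode.

Before creating the GKE Autopilot cluster, make sure to check GKE Documentation - Optimize Autopilot Pod performance by choosing a machine series, since not every cluster version supports every GPU accelerator; for example, nvidia-l4 is not supported in GKE cluster versions 1.28.3 or lower.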

gcloud container clusters create-auto $CLUSTER_NAME \
    --project=$PROJECT_ID \
    --location=$LOCATION \
    --release-channel=stable \
    --cluster-version=1.29 \
    --no-autoprovisioning-enable-insecure-kubelet-readonly-port

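To select a specific GKE cluster version in your location, you can run the following command, which lists the default version of the stable release channel: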

gcloud container get-server-config \
    --flatten="channels" \
    --filter="channels.channel=STABLE" \
    --format="yaml(channels.channel,channels.defaultVersion)" \
    --location=$LOCATION

For more information, see GKE Documentation - Specifying cluster version.
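The version constraint mentioned above (nvidia-l4 is not supported in GKE versions 1.28.3 or lower) can be checked before creating the cluster. The following is a minimal sketch, not part of the original example: the helper names are hypothetical, and the threshold is simply the one quoted in this tutorial, so verify it against the current GKE documentation.

```python
import re

# Threshold quoted in the text above: nvidia-l4 is reported as unsupported
# on GKE versions 1.28.3 or lower. Treat this as an assumption to verify.
MAX_UNSUPPORTED = (1, 28, 3)


def parse_gke_version(version: str) -> tuple[int, int, int]:
    """Parse strings such as '1.29.4-gke.1043002' or '1.29' into a tuple.

    A missing patch component is treated as 0.
    """
    match = re.match(r"^(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognised GKE version: {version!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def supports_nvidia_l4(version: str) -> bool:
    """Return True when the version is strictly newer than the threshold."""
    return parse_gke_version(version) > MAX_UNSUPPORTED


print(supports_nvidia_l4("1.29"))
```

The string passed to `supports_nvidia_l4` could be the `defaultVersion` value printed by the `gcloud container get-server-config` command above.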

[Image: GKE Cluster in the GCP Console]

Once the GKE cluster is created, you can get the credentials to access it through kubectl with the following command:

gcloud container clusters get-credentials $CLUSTER_NAME --location=$LOCATION

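Get Hugging Face token and set secrets in GKE

As google/gemma-2-2b-it is a gated model, you need to set a Kubernetes secret with the Hugging Face Hub token via kubectl.

To generate a custom token for the Hugging Face Hub, follow the instructions at Hugging Face Hub - User access tokens; the recommended way of setting it is to install the huggingface_hub Python SDK as follows: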

pip install --upgrade --quiet huggingface_hub

Then log in with the generated token, which needs read access over gated/private models:

huggingface-cli login

Finally, you can create the Kubernetes secret holding the generated Hugging Face Hub token, retrieving the token through the huggingface_hub Python SDK as follows:

kubectl create secret generic hf-secret \
    --from-literal=hf_token=$(python -c "from huggingface_hub import get_token; print(get_token())") \
    --dry-run=client -o yaml | kubectl apply -f -

Alternatively, you can set the token directly as follows:

kubectl create secret generic hf-secret \
    --from-literal=hf_token=hf_*** \
    --dry-run=client -o yaml | kubectl apply -f -

[Image: GKE Secret in the GCP Console]

For more information on how to set Kubernetes secrets in a GKE cluster, see Secret Manager Documentation - Use Secret Manager add-on with Google Kubernetes Engine.
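To make it clearer what the `kubectl create secret generic ... --dry-run=client -o yaml` commands above produce, here is a minimal sketch of the equivalent Secret manifest built in Python. The function name and the placeholder token are hypothetical; the point is only that values under `data` are stored base64-encoded, which is an encoding, not encryption.

```python
import base64
import json


def build_hf_secret(token: str, name: str = "hf-secret", key: str = "hf_token") -> dict:
    """Build a generic (Opaque) Kubernetes Secret manifest as a dict.

    The name and key defaults mirror the kubectl commands in this tutorial.
    """
    encoded = base64.b64encode(token.encode("utf-8")).decode("ascii")
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "Opaque",
        "data": {key: encoded},
    }


# Placeholder value for illustration only; never hard-code a real token.
manifest = build_hf_secret("hf_example_token")
print(json.dumps(manifest, indent=2))
```

Because base64 is trivially reversible, anyone with read access to the Secret can recover the token, so access to it should be restricted accordingly.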

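Deploy TGI

Now you can proceed with the Kubernetes deployment of the Hugging Face DLC for TGI, serving the google/gemma-2-2b-it model from the Hugging Face Hub together with multiple LoRA adapters fine-tuned on top of it.

To explore all the models that can be served via TGI, you can browse the models tagged with text-generation-inference on the Hub.

The Hugging Face DLC for TGI will be deployed via kubectl, from the configuration files in the config/ directory: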

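- deployment.yaml: contains the deployment details of the pod, including the reference to the Hugging Face DLC for TGI, with MODEL_ID set to google/gemma-2-2b-it and LORA_ADAPTERS set to google-cloud-partnership/gemma-2-2b-it-lora-magicoder,google-cloud-partnership/gemma-2-2b-it-lora-sql. The adapters are the following:
  - google-cloud-partnership/gemma-2-2b-it-lora-sql: fine-tuned with gretelai/synthetic_text_to_sql to generate SQL queries with an explanation, given an SQL context and a prompt or question about it.
  - google-cloud-partnership/gemma-2-2b-it-lora-magicoder: fine-tuned with ise-uiuc/Magicoder-OSS-Instruct-75K to generate code in several programming languages, such as Python, Rust, or C, based on an input problem.
  - google-cloud-partnership/gemma-2-2b-it-lora-jap-en: fine-tuned with Jofthomas/japanese-english-translation, a synthetically generated dataset of short Japanese sentences translated into English, to translate between English and Japanese in either direction.
- service.yaml: contains the service details of the pod, exposing port 8080 for the TGI service.
- (optional) ingress.yaml: contains the ingress details of the pod, exposing the service to the outside world so that it can be reached via the ingress IP.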
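As an illustration of how the settings described above fit together, the following sketch builds the container environment entries that a manifest like deployment.yaml would need. It is not the actual file from config/, which is not reproduced here: the function name is hypothetical, and the `HF_TOKEN` variable name is an assumption about how the token is passed to the container.

```python
def build_tgi_env(model_id: str, adapters: list[str],
                  secret_name: str = "hf-secret",
                  secret_key: str = "hf_token") -> list[dict]:
    """Return Kubernetes-style env entries for a TGI container.

    LORA_ADAPTERS is a single comma-separated string, as in the text above.
    HF_TOKEN is an assumed variable name; check the real manifest.
    """
    return [
        {"name": "MODEL_ID", "value": model_id},
        {"name": "LORA_ADAPTERS", "value": ",".join(adapters)},
        {
            "name": "HF_TOKEN",
            # Read the token from the Secret created earlier instead of
            # embedding it in the manifest.
            "valueFrom": {"secretKeyRef": {"name": secret_name, "key": secret_key}},
        },
    ]


env = build_tgi_env(
    "google/gemma-2-2b-it",
    [
        "google-cloud-partnership/gemma-2-2b-it-lora-magicoder",
        "google-cloud-partnership/gemma-2-2b-it-lora-sql",
    ],
)
print(env[1]["value"])
```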

Note that the selected LoRA adapters are not intended for production use, as the fine-tuned adapters have not been extensively tested.

kubectl apply -f config/

The Kubernetes deployment may take a few minutes to be ready, so you can check its status with the following command:

kubectl get pods

Alternatively, you can simply wait for the deployment to be ready with the following command:

kubectl wait --for=condition=Available --timeout=700s deployment/tgi-deployment

[Image: GKE Deployment in the GCP Console]

[Image: GKE Deployment Logs in the GCP Console]

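Inference with TGI

To run inference against the deployed TGI service, you first need to make sure the service is accessible, which you can do in either of the following two ways.

Port-forward the deployed TGI service to port 8080, so that it can be accessed via localhost, with the following command: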
```
kubectl port-forward service/tgi-service 8080:8080
```
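Access the TGI service via the external IP of the ingress, which is the default scenario here since the ingress configuration is defined in config/ingress.yaml (it can be skipped in favour of port-forwarding). The external IP can be retrieved with the following command: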
```
kubectl get ingress tgi-ingress -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
```
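Before sending requests, it can help to confirm that the service actually answers on the chosen address. The sketch below polls a health endpoint using only the standard library. It assumes that TGI exposes `GET /health` on the same port as the inference routes; the function names and retry defaults are illustrative.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def tgi_is_healthy(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status.
        return False


def wait_until_ready(probe: Callable[[], bool], attempts: int = 30,
                     delay: float = 10.0) -> bool:
    """Call probe() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


# Example usage with the port-forward described above:
# ready = wait_until_ready(lambda: tgi_is_healthy("http://localhost:8080"))
```

Keeping the probe separate from the retry loop means the same loop works for either access method, by passing the ingress IP instead of localhost.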

Via cURL

To send a POST request to the TGI service using cURL, you can run the following command:

curl http://localhost:8080/v1/chat/completions \
    -X POST \
    -d '{"messages":[{"role":"user","content":"What is Deep Learning?"}],"temperature":0.7,"top_p":0.95,"max_tokens":128}' \
    -H 'Content-Type: application/json'

Or send the POST request to the ingress IP instead of localhost:

curl http://$(kubectl get ingress tgi-ingress -o jsonpath='{.status.loadBalancer.ingress[0].ip}')/v1/chat/completions \
    -X POST \
    -d '{"messages":[{"role":"user","content":"What is Deep Learning?"}],"temperature":0.7,"top_p":0.95,"max_tokens":128}' \
    -H 'Content-Type: application/json'

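In this case, multiple LoRA adapters are being served. To use one of them, you need to specify the model parameter when using the /v1/chat/completions endpoint (or the adapter_id parameter when using the /generate endpoint). In any other case the base model is used, meaning that an adapter is only applied when explicitly specified.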
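The difference between the two endpoints can be sketched as two small payload builders. The helper names are hypothetical, and placing `adapter_id` inside the `parameters` object of a /generate request is an assumption about the TGI request schema that should be checked against the TGI API reference.

```python
from typing import Optional


def chat_payload(prompt: str, adapter: Optional[str] = None,
                 max_tokens: int = 128) -> dict:
    """Body for POST /v1/chat/completions; the adapter goes in `model`."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    if adapter is not None:
        payload["model"] = adapter
    return payload


def generate_payload(prompt: str, adapter: Optional[str] = None,
                     max_new_tokens: int = 128) -> dict:
    """Body for POST /generate; the adapter goes in `parameters.adapter_id`."""
    parameters = {"max_new_tokens": max_new_tokens}
    if adapter is not None:
        parameters["adapter_id"] = adapter
    return {"inputs": prompt, "parameters": parameters}


sql_adapter = "google-cloud-partnership/gemma-2-2b-it-lora-sql"
print(chat_payload("List all users", adapter=sql_adapter))
print(generate_payload("List all users"))  # no adapter: base model is used
```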

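For example, say you want to generate a piece of code for a problem you cannot solve yourself. In that case you should use the adapter google-cloud-partnership/gemma-2-2b-it-lora-magicoder, which was fine-tuned specifically for that purpose. Alternatively, you could use the base instruction-tuned model, as it can handle a wide variety of tasks, whereas the Japanese-to-English adapter, for example, would not be a good fit for this task.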

curl http://localhost:8080/v1/chat/completions \
    -X POST \
    -d '{"messages":[{"role":"user","content":"You are given a vector of integers, A, of length n. Your task is to implement a function that finds the maximum product of any two distinct elements in the vector. Write a function in Rust to return this maximum product. Function Signature: rust fn max_product(a: Vec<i32>) -> i32  Input: - A vector a of length n (2 <= n <= 10^5), where each element is an integer (-10^4 <= a[i] <= 10^4). Output: - Return the maximum product of two distinct elements. Example: Input: a = vec![1, 5, 3, 9] Output: max_product(a) -> 45"}],"temperature":0.7,"top_p":0.95,"max_tokens":256,"model":"google-cloud-partnership/gemma-2-2b-it-lora-magicoder"}' \
    -H 'Content-Type: application/json'

This generates the following response for the given prompt:

{"object":"chat.completion","id":"","created":1727378101,"model":"google/gemma-2-2b-it","system_fingerprint":"2.3.1-dev0-native","choices":[{"index":0,"message":{"role":"assistant","content":"\`\`\`rust\nfn max_product(a: Vec<i32>) -> i32 {\n    let mut max1 = a[0];\n    let mut max2 = a[1];\n    if max2 < max1 {\n        std::mem::swap(&mut max1, &mut max2);\n    }\n    for i in 2..a.len() {\n        if a[i] > max1 {\n            max2 = max1;\n            max1 = a[i];\n        } else if a[i] > max2 {\n            "},"logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":163,"completion_tokens":128,"total_tokens":291}}

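Rendered as Rust code, a complete solution would look like the following. Generated code should be reviewed before use: in this listing the outer loop starts at index 1, so pairs that combine a[0] with any element other than a[1] are never checked.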

fn max_product(a: Vec<i32>) -> i32 {
    if a.len() < 2 {
        return 0;
    }
    let mut max_product = a[0] * a[1];
    for i in 1..a.len() {
        for j in i + 1..a.len() {
            if a[i] * a[j] > max_product {
                max_product = a[i] * a[j];
            }
        }
    }
    max_product
}
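The JSON response shown earlier ends with `"finish_reason":"length"`, which indicates that generation stopped at the token limit rather than at a natural end. A minimal sketch of how a client might extract the generated text and detect this is shown below; the function name and the shortened sample response are hypothetical, with the same field layout as the response above.

```python
import json


def extract_completion(raw: str) -> tuple[str, bool]:
    """Return (generated text, truncated?) from a chat completion response."""
    body = json.loads(raw)
    choice = body["choices"][0]
    truncated = choice.get("finish_reason") == "length"
    return choice["message"]["content"], truncated


# Hypothetical, shortened response for illustration only.
sample_response = json.dumps({
    "object": "chat.completion",
    "choices": [{
        "index": 0,
        "message": {"role": "assistant", "content": "fn max_product("},
        "finish_reason": "length",
    }],
    "usage": {"prompt_tokens": 163, "completion_tokens": 128, "total_tokens": 291},
})

content, truncated = extract_completion(sample_response)
if truncated:
    print("Generation stopped at the token limit; consider raising max_tokens.")
```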

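Via Python

To run inference using Python, you can use either the huggingface_hub Python SDK (recommended) or the openai Python SDK.

The examples below use localhost, but if you deployed TGI with the ingress, feel free to use the ingress IP mentioned above instead (without specifying the port).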

huggingface_hub

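You can install it via pip with pip install --upgrade --quiet huggingface_hub, and then run the following snippet to mimic the cURL command above, i.e. send a request to the Messages API, providing the adapter identifier via the model parameter.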

from huggingface_hub import InferenceClient

client = InferenceClient(base_url="http://localhost:8080", api_key="-")

chat_completion = client.chat.completions.create(
    model="google-cloud-partnership/gemma-2-2b-it-lora-magicoder",
    messages=[
        {"role": "user", "content": "You are given a vector of integers, A, of length n. Your task is to implement a function that finds the maximum product of any two distinct elements in the vector. Write a function in Rust to return this maximum product. Function Signature: rust fn max_product(a: Vec<i32>) -> i32  Input: - A vector a of length n (2 <= n <= 10^5), where each element is an integer (-10^4 <= a[i] <= 10^4). Output: - Return the maximum product of two distinct elements. Example: Input: a = vec![1, 5, 3, 9] Output: max_product(a) -> 45"},
    ],
    max_tokens=128,
)

Alternatively, you can format the prompt yourself and send it via the Text Generation API, providing the adapter identifier via the adapter_id parameter, as follows:

from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080", api_key="-")

generation = client.text_generation(
    prompt="You are given a vector of integers, A, of length n. Your task is to implement a function that finds the maximum product of any two distinct elements in the vector. Write a function in Rust to return this maximum product. Function Signature: rust fn max_product(a: Vec<i32>) -> i32  Input: - A vector a of length n (2 <= n <= 10^5), where each element is an integer (-10^4 <= a[i] <= 10^4). Output: - Return the maximum product of two distinct elements. Example: Input: a = vec![1, 5, 3, 9] Output: max_product(a) -> 45",
    max_new_tokens=128,
    adapter_id="google-cloud-partnership/gemma-2-2b-it-lora-magicoder",
)
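When formatting the prompt yourself for the Text Generation API, the chat template is not applied for you. The sketch below shows one way to wrap messages in Gemma-style turn markers. The function name is hypothetical, and the template is an assumption based on the publicly documented Gemma chat format: the `<bos>` token is omitted on the assumption that the server-side tokenizer adds it, so check the model's own chat template before relying on this.

```python
def format_gemma_prompt(messages: list[dict]) -> str:
    """Wrap chat messages in Gemma-style turn markers.

    Assumes only 'user' and 'assistant' roles; 'assistant' is mapped to
    'model', the role name used by the Gemma format.
    """
    parts = []
    for message in messages:
        role = message["role"]
        if role not in ("user", "assistant"):
            raise ValueError(f"unsupported role for this sketch: {role!r}")
        turn_role = "model" if role == "assistant" else "user"
        content = message["content"].strip()
        parts.append(f"<start_of_turn>{turn_role}\n{content}<end_of_turn>\n")
    # Open the model's turn so that generation continues from here.
    parts.append("<start_of_turn>model\n")
    return "".join(parts)


prompt = format_gemma_prompt([{"role": "user", "content": "What is Deep Learning?"}])
print(prompt)
```

The resulting string could then be passed as the `prompt` argument of `client.text_generation` in the snippet above.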

openai

Additionally, you can also use the Messages API via openai; you can install it via pip with pip install --upgrade openai, and then run:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1/",
    api_key="-",
)

chat_completion = client.chat.completions.create(
    model="google-cloud-partnership/gemma-2-2b-it-lora-magicoder",
    messages=[
        {"role": "user", "content": "You are given a vector of integers, A, of length n. Your task is to implement a function that finds the maximum product of any two distinct elements in the vector. Write a function in Rust to return this maximum product. Function Signature: rust fn max_product(a: Vec<i32>) -> i32  Input: - A vector a of length n (2 <= n <= 10^5), where each element is an integer (-10^4 <= a[i] <= 10^4). Output: - Return the maximum product of two distinct elements. Example: Input: a = vec![1, 5, 3, 9] Output: max_product(a) -> 45"},
    ],
    max_tokens=128,
)

Delete GKE Cluster

Finally, once you are done using TGI on the GKE cluster, you can safely delete the cluster to avoid incurring unnecessary costs.

gcloud container clusters delete $CLUSTER_NAME --location=$LOCATION

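Alternatively, if you want to keep the cluster, you can scale the replicas of the deployed pod down to 0, since a default GKE cluster deployed in Autopilot mode then runs just a single e2-small instance.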

kubectl scale --replicas=0 deployment/tgi-deployment

📍 Find the complete example on GitHub here!