在 Vertex AI 上部署带有 PyTorch 推理 DLC 的 FLUX

FLUX 是一个开源的 120 亿参数整流流变压器，它根据文本描述生成图像，突破了 Black Forest Labs 创建的文本到图像生成的界限，并带有非商业许可，使其广泛可用于探索和实验。Google Cloud Vertex AI 是一个机器学习 (ML) 平台，可让您训练和部署 ML 模型和 AI 应用程序，并自定义大型语言模型 (LLM) 以用于您的 AI 驱动应用程序。

本示例展示了如何使用 Google Cloud Platform (GCP) 中可用的 Hugging Face PyTorch 推理 DLC，在 CPU 和 GPU 实例上，在 Vertex AI 上部署 Hugging Face Hub 中任何受支持的 diffusers 文本到图像模型，在本例中是 black-forest-labs/FLUX.1-dev。

'black-forest-labs/FLUX.1-dev' in the Hugging Face Hub

设置/配置

首先，您需要在本地机器上安装 gcloud，这是 Google Cloud 的命令行工具，请按照 Cloud SDK 文档 - 安装 gcloud CLI 中的说明进行操作。

然后，您还需要安装 google-cloud-aiplatform Python SDK，这是以编程方式创建 Vertex AI 模型、注册模型、创建端点并在 Vertex AI 上部署模型所需的。

!pip install --upgrade --quiet google-cloud-aiplatform

或者，为了简化本教程中命令的使用，您需要为 GCP 设置以下环境变量

%env PROJECT_ID=your-project-id
%env LOCATION=your-location
%env CONTAINER_URI=us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-inference-cu121.2-2.transformers.4-44.ubuntu2204.py311

然后您需要登录您的 GCP 帐户，并将项目 ID 设置为您想要用于在 Vertex AI 上注册和部署模型的项目 ID。

!gcloud auth login
!gcloud auth application-default login  # For local development
!gcloud config set project $PROJECT_ID

登录后，您需要启用 GCP 中必要的服务 API，例如 Vertex AI API、Compute Engine API 和 Google Container Registry 相关 API。

!gcloud services enable aiplatform.googleapis.com
!gcloud services enable compute.googleapis.com
!gcloud services enable container.googleapis.com
!gcloud services enable containerregistry.googleapis.com
!gcloud services enable containerfilesystem.googleapis.com

在 Vertex AI 中注册模型

一切设置完成后，您可以通过 google-cloud-aiplatform Python SDK 初始化 Vertex AI 会话，如下所示：

import os
from google.cloud import aiplatform

aiplatform.init(
    project=os.getenv("PROJECT_ID"),
    location=os.getenv("LOCATION"),
)

由于 black-forest-labs/FLUX.1-dev 是一个门控模型，您需要使用具有门控模型访问权限的细粒度访问令牌，或仅具有帐户读取访问权限的读取访问令牌登录到 Hugging Face Hub 帐户。有关如何生成 Hugging Face Hub 的只读访问令牌的更多信息，请参阅 https://huggingface.co/docs/hub/en/security-tokens。

!pip install --upgrade --quiet huggingface_hub

from huggingface_hub import interpreter_login

interpreter_login()

然后，您就可以“上传”模型，即在 Vertex AI 上注册模型。这本身并不是一个上传过程，因为模型会在启动时通过 HF_MODEL_ID 环境变量从 Hugging Face Hub 的 Hugging Face PyTorch 推理 DLC 中自动下载，因此上传的只是配置，而不是模型权重。

在深入代码之前，我们先快速回顾一下提供给 upload 方法的参数

display_name 是将在 Vertex AI 模型注册表中显示的名称。
serving_container_image_uri 是用于提供模型服务的 Hugging Face PyTorch 推理 DLC 的位置。
serving_container_environment_variables 是容器运行时将使用的环境变量，因此它们与 huggingface-inference-toolkit Python SDK 定义的环境变量对齐，该 SDK 公开了一些环境变量，例如以下变量：
- HF_MODEL_ID 是 Hugging Face Hub 中模型的标识符。要探索所有支持的模型，请访问 https://huggingface.co/models?sort=trending 并按您要使用的任务（例如 text-classification）进行筛选。
- HF_TASK 是 Hugging Face Hub 中的任务标识符。要查看所有支持的任务，请访问 [https://huggingface.co/docs/transformers/en/task_summary#natural-language-processing。
- HUGGING_FACE_HUB_TOKEN 是 Hugging Face Hub 令牌，因为 black-forest-labs/FLUX.1-dev 是一个门控模型，所以需要它。

有关受支持的 aiplatform.Model.upload 参数的更多信息，请查阅其 Python 参考资料：https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.Model#google_cloud_aiplatform_Model_upload。

from huggingface_hub import get_token

model = aiplatform.Model.upload(
    display_name="black-forest-labs--FLUX.1-dev",
    serving_container_image_uri=os.getenv("CONTAINER_URI"),
    serving_container_environment_variables={
        "HF_MODEL_ID": "black-forest-labs/FLUX.1-dev",
        "HF_TASK": "text-to-image",
        "HF_TOKEN": get_token(),
    },
)
model.wait()

Model in Vertex AI Model Registry

Model Version in Vertex AI Model Registry

在 Vertex AI 中部署模型

在 Vertex AI 上注册模型后，您需要定义要将模型部署到的端点，然后将模型部署链接到该端点资源。

为此，您需要调用 aiplatform.Endpoint.create 方法来创建一个新的 Vertex AI 端点资源（该资源尚未链接到模型或任何可用的东西）。

endpoint = aiplatform.Endpoint.create(display_name="black-forest-labs--FLUX.1-dev-endpoint")

Vertex AI Endpoint created

现在您可以在 Vertex AI 上的端点中部署已注册的模型。

deploy 方法会将之前创建的端点资源与包含服务容器配置的模型链接起来，然后将其部署到指定实例的 Vertex AI 上。

在进入代码之前，让我们快速回顾一下提供给deploy方法的参数

endpoint 是要将模型部署到的端点，它是可选的，默认情况下将设置为模型显示名称加上 _endpoint 后缀。
machine_type、accelerator_type 和 accelerator_count 是定义要使用哪个实例以及要使用的加速器和加速器数量的参数。machine_type 和 accelerator_type 是关联的，因此您需要选择支持您正在使用的加速器的实例，反之亦然。有关不同实例的更多信息，请参见 Compute Engine 文档 - GPU 机器类型，有关 accelerator_type 命名方式的更多信息，请参见 Vertex AI 文档 - MachineSpec。
enable_access_logging 是一个参数，用于启用端点访问日志记录，即在 Google Cloud Logging 中记录对端点发出的请求信息。

有关受支持的 aiplatform.Model.deploy 参数的更多信息，您可以查阅其 Python 参考资料：https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.Model#google_cloud_aiplatform_Model_deploy。

%%time
deployed_model = model.deploy(
    endpoint=endpoint,
    machine_type="g2-standard-48",
    accelerator_type="NVIDIA_L4",
    accelerator_count=4,
    enable_access_logging=True,
)

通过 deploy 方法部署 Vertex AI 端点可能需要 15 到 25 分钟。

Vertex AI Endpoint Ready

Vertex AI Model Ready

Vertex AI 上的在线预测

最后，您可以使用 predict 方法在 Vertex AI 上运行在线预测，该方法将根据 Vertex AI I/O 有效负载格式将请求发送到容器中指定在 /predict 路由中的运行端点。

%%time
output = deployed_model.predict(
    instances=["a photo of an astronaut riding a horse on mars"],
    parameters={
        "width": 512,
        "height": 512,
        "num_inference_steps": 8,
        "guidance_scale": 3.5,
    },
)

由于部署的模型是文本到图像模型，输出负载将包含所提供文本描述生成的图像，但以 base64 编码；这意味着您需要使用 Pillow Python 库对其进行解码和加载。

!pip install --upgrade --quiet Pillow

import base64
import io

from IPython.display import display
from PIL import Image

image = Image.open(io.BytesIO(base64.b64decode(output.predictions[0])))
display(image)

FLUX.1-dev output

警告目前，Google Cloud 上的 Hugging Face DLC 的请求超时时间为 60 秒，这意味着如果预测请求超过该时间，将引发 HTTP 503 错误。为了防止这种情况发生，您可以按照 Vertex AI 文档 - 从自定义训练模型获取在线预测中的说明，通过提交支持工单或联系您的 Google Cloud 代表来增加默认的预测超时时间。无论如何，对于公开发布的 Hugging Face DLC，Google Cloud 团队目前正在努力将其列入白名单，以将预测超时时间增加到 600 秒。

资源清理

最后，您可以在同一个 Python 会话中以编程方式释放资源，如下所示：

deployed_model.undeploy_all 用于从所有端点取消部署模型。
deployed_model.delete 在 undeploy_all 之后优雅地删除部署模型的端点。
model.delete 用于从注册表中删除模型。

deployed_model.undeploy_all()
deployed_model.delete()
model.delete()

📍 在 GitHub 上找到完整的示例此处！

< > 在 GitHub 上更新