Deploying a Google TPU instance on Google Cloud Platform (GCP)

Context

We assume you have already created a Google Cloud Platform (GCP) user or organization account and an associated project.

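We also assume you have the Google Cloud CLI installed. If not, install and set it up first by following the official Google Cloud CLI documentation.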

Creating the initial TPU VM on GCP

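To create your initial TPU instance, you need to provide the following information:

The GCP zone where you want the instance deployed (for example, close to you for development, close to your end users for production)
The TPU type you want to target
The TPU runtime version you want to use on the instance
A custom instance name, so you can quickly find and reference the instance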

Overall, the final command looks like this:

gcloud compute tpus tpu-vm create <ref_instance_name> \
--zone=<deployment_zone> \
--accelerator-type=<target_tpu_generation> \
--version=<runtime_version>
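
As a minimal sketch of how the four inputs above map onto the command, the following Python helper assembles the argument list without running anything. The function name `build_tpu_create_command` and the `use_alpha` switch are hypothetical names introduced here for illustration; only the gcloud subcommand and its three flags come from the command shown above.

```python
import shlex


def build_tpu_create_command(name, zone, accelerator_type, runtime_version,
                             use_alpha=False):
    """Assemble the `gcloud ... tpu-vm create` argument list (nothing is executed).

    `use_alpha` is an illustrative switch that prepends the `alpha` command
    group, which some TPU generations require.
    """
    command = ["gcloud"]
    if use_alpha:
        command.append("alpha")
    command += [
        "compute", "tpus", "tpu-vm", "create", name,
        f"--zone={zone}",
        f"--accelerator-type={accelerator_type}",
        f"--version={runtime_version}",
    ]
    return command


if __name__ == "__main__":
    # Placeholder values: replace them with your own name, zone, type and runtime.
    args = build_tpu_create_command(
        "my-tpu-instance", "us-west4-a", "v5litepod-8", "v2-alpha-tpuv5-lite",
        use_alpha=True,
    )
    # Print a shell-quoted command line that can be reviewed before running it.
    print(shlex.join(args))
```

Building the command as a list rather than a single string keeps each value as one argument, so the list can be handed to `subprocess.run` later without extra quoting.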

Deploying a TPU v5litepod-8 instance

In our case, we will deploy a v5litepod-8 instance named optimum-tpu-get-started in the GCP zone us-west4-a, using the latest v2-alpha-tpuv5-lite runtime version.

Of course, feel free to adjust all these parameters to match your usage and quotas.

Before creating the instance, make sure the gcloud alpha component is installed, since it is required to target TPU v5 VMs: gcloud components install alpha

gcloud alpha compute tpus tpu-vm create optimum-tpu-get-started \
--zone=us-west4-a \
--accelerator-type=v5litepod-8 \
--version=v2-alpha-tpuv5-lite

Connecting to the instance

gcloud compute tpus tpu-vm ssh <ref_instance_name> --zone=<deployment_zone>
$ >

For the v5litepod-8 example deployed above, this would look like:

gcloud compute tpus tpu-vm ssh optimum-tpu-get-started --zone=us-west4-a
$ >
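
For scripted use, the same ssh subcommand can run a single remote command instead of opening an interactive shell, using its `--command` flag. The sketch below wraps that in Python; the helper names `tpu_ssh_command` and `run_on_tpu_vm` are hypothetical, and the remote command `uname -a` is only a placeholder.

```python
import shlex
import subprocess


def tpu_ssh_command(name, zone, remote_command=None):
    """Build the `gcloud compute tpus tpu-vm ssh` argument list.

    Without `remote_command` this matches the interactive form shown above;
    with it, gcloud runs that one command on the VM and returns.
    """
    command = ["gcloud", "compute", "tpus", "tpu-vm", "ssh", name, f"--zone={zone}"]
    if remote_command is not None:
        command.append(f"--command={remote_command}")
    return command


def run_on_tpu_vm(name, zone, remote_command):
    """Run one command on the TPU VM and return its standard output.

    Needs an installed, authenticated gcloud CLI and an existing instance.
    """
    result = subprocess.run(
        tpu_ssh_command(name, zone, remote_command),
        check=True,           # raise if gcloud or the remote command fails
        capture_output=True,
        text=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Only print the command here; call run_on_tpu_vm(...) to actually execute it.
    print(shlex.join(tpu_ssh_command("optimum-tpu-get-started", "us-west4-a", "uname -a")))
```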

Setting up the instance to run AI workloads on TPUs

Optimum-TPU with PyTorch/XLA

If you want to leverage PyTorch/XLA through Optimum-TPU, it should be as simple as:

$ python3 -m pip install optimum-tpu -f https://storage.googleapis.com/libtpu-releases/index.html
$ export PJRT_DEVICE=TPU

You can now verify the installation with the command below, which should print xla:0 since a single TPU device is bound to this instance.

$ python3 -c "import torch_xla.core.xla_model as xm; print(xm.xla_device())"
xla:0
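
Beyond printing the device name, a slightly stronger check is to run a small computation on the device. The sketch below assumes it runs on the TPU VM after the install step above, with PJRT_DEVICE=TPU exported; the function name `xla_smoke_test` is hypothetical, and the script returns a short message instead of failing when torch_xla is not installed.

```python
import importlib.util


def xla_smoke_test():
    """Add two small tensors on the XLA device and report the result."""
    if importlib.util.find_spec("torch_xla") is None:
        return "torch_xla is not installed; run this on the configured TPU VM"

    import torch
    import torch_xla.core.xla_model as xm

    device = xm.xla_device()
    a = torch.ones(2, 2, device=device)
    b = torch.ones(2, 2, device=device)
    # .item() forces the pending XLA computation to execute and copies the
    # scalar back to the host; four elements of value 2 should sum to 8.0.
    total = (a + b).sum().item()
    return f"device={device}, sum={total}"


if __name__ == "__main__":
    print(xla_smoke_test())
```

Forcing a value back to the host matters here because PyTorch/XLA records operations lazily, so creating tensors alone does not prove that anything ran on the TPU.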

Optimum-TPU with JAX

JAX support is coming soon. Stay tuned!