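Quick Tour

Set up

The easiest way to get started with TEI is to use one of the official Docker containers (see Supported models and hardware to choose the right container).

You therefore need Docker installed; follow its installation instructions.

TEI supports inference on both GPU and CPU. If you plan to use a GPU, check the supported hardware table to confirm that yours is supported, then install the NVIDIA Container Toolkit. The NVIDIA drivers on your device need to be compatible with CUDA version 12.2 or higher.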

Deploy

Next, deploy a model. Suppose you want to use Qwen/Qwen3-Embedding-0.6B. Here is how to do it:

model=Qwen/Qwen3-Embedding-0.6B
volume=$PWD/data

docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.8 --model-id $model

We also recommend sharing a volume with the Docker container (volume=$PWD/data) to avoid downloading the weights on every run.
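If you start the container from a Python script rather than a shell, the same command can be assembled as an argument list. The following is a minimal sketch, not part of TEI: the helper names tei_docker_command and run_tei are hypothetical, and the flags are copied from the docker run command above.

```python
import subprocess
from pathlib import Path

# Image tag copied from the docker command above.
TEI_IMAGE = "ghcr.io/huggingface/text-embeddings-inference:1.8"


def tei_docker_command(model_id, volume, host_port=8080, image=TEI_IMAGE):
    """Return the argv list equivalent to the `docker run` command above.

    Hypothetical helper for illustration; it only builds the arguments.
    """
    return [
        "docker", "run",
        "--gpus", "all",
        "-p", f"{host_port}:80",      # publish container port 80 on the host
        "-v", f"{volume}:/data",      # shared volume so weights are cached
        "--pull", "always",
        image,
        "--model-id", model_id,
    ]


def run_tei(model_id, volume=None):
    """Run the container in the foreground; raises if docker exits non-zero."""
    volume = volume or Path.cwd() / "data"
    return subprocess.run(tei_docker_command(model_id, volume), check=True)
```

Passing the arguments as a list avoids shell quoting problems when the volume path contains spaces. Calling run_tei("Qwen/Qwen3-Embedding-0.6B") would block until the container stops.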

Inference

Inference can be performed in three ways: with cURL, with the InferenceClient, or with the OpenAI Python SDK.

cURL

To send a POST request to the TEI endpoint with cURL, run the following command:

curl 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs":"What is Deep Learning?"}' \
    -H 'Content-Type: application/json'
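For comparison, the same request can be sent from Python using only the standard library. This is a sketch under assumptions: the helper names build_embed_request and embed are hypothetical, and the response is assumed to be a JSON list with one embedding vector per input.

```python
import json
import urllib.request


def build_embed_request(text, base_url="http://127.0.0.1:8080"):
    """Build the POST request that the cURL command above sends."""
    body = json.dumps({"inputs": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/embed",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(text, base_url="http://127.0.0.1:8080"):
    """Send the request and decode the JSON body (needs a running server)."""
    with urllib.request.urlopen(build_embed_request(text, base_url)) as resp:
        return json.load(resp)
```

With the container from the deployment step running, embed("What is Deep Learning?") would return the decoded JSON response.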

Python

To run inference from Python, you can use either the huggingface_hub Python SDK (recommended) or the openai Python SDK.

huggingface_hub

Install it with pip by running pip install --upgrade --quiet huggingface_hub, then run:

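from huggingface_hub import InferenceClient

client = InferenceClient()

# `model` is the URL of the local TEI /embed endpoint (host port 8080, as published above).
embedding = client.feature_extraction("What is deep learning?",
                                      model="http://127.0.0.1:8080/embed")
print(len(embedding[0]))  # length of the returned embedding vector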

OpenAI

Install it with pip by running pip install --upgrade openai, then run:

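from openai import OpenAI

# The SDK appends /embeddings to base_url, so point it at the /v1 prefix.
# The SDK requires an API key; a placeholder is assumed to be enough for a
# local server started without authentication.
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="-")

response = client.embeddings.create(
  model="tei",
  input="What is deep learning?"
)

print(response)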

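Re-rankers and sequence classification

TEI also supports re-ranker and classic sequence classification models.

Re-rankers

Re-rankers, also known as cross-encoders, are sequence classification models with a single class that score the similarity between a query and a text. See this blog post by the LlamaIndex team to learn how to use re-ranker models in your RAG pipeline to improve downstream performance.

Suppose you want to use BAAI/bge-reranker-large. First, deploy it like this: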

model=BAAI/bge-reranker-large
volume=$PWD/data

docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.8 --model-id $model

Once the model is deployed, you can use the rerank endpoint to rank the similarity between a query and a list of texts. With cURL:

curl 127.0.0.1:8080/rerank \
    -X POST \
    -d '{"query":"What is Deep Learning?", "texts": ["Deep Learning is not...", "Deep learning is..."], "raw_scores": false}' \
    -H 'Content-Type: application/json'
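A client usually wants the texts themselves back in ranked order. The sketch below assumes the rerank endpoint responds with a JSON list of objects that carry an index into the submitted texts and a score; that shape is an assumption, so check it against your server's output. The helper names rank_texts and rerank are hypothetical.

```python
import json
import urllib.request


def rank_texts(texts, results):
    """Pair each text with its score, best match first.

    `results` is assumed to look like [{"index": 0, "score": 0.12}, ...].
    """
    ordered = sorted(results, key=lambda r: r["score"], reverse=True)
    return [(texts[r["index"]], r["score"]) for r in ordered]


def rerank(query, texts, base_url="http://127.0.0.1:8080"):
    """POST the same payload as the cURL command above (needs a running server)."""
    body = json.dumps({"query": query, "texts": texts, "raw_scores": False})
    request = urllib.request.Request(
        f"{base_url}/rerank",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as resp:
        return rank_texts(texts, json.load(resp))
```

Sorting on the client side keeps the helper correct whether or not the server already returns results in score order.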

Sequence classification models

You can also use classic sequence classification models such as SamLowe/roberta-base-go_emotions:

model=SamLowe/roberta-base-go_emotions
volume=$PWD/data

docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.8 --model-id $model

Once the model is deployed, you can use the predict endpoint to get the emotions most associated with an input:

curl 127.0.0.1:8080/predict \
    -X POST \
    -d '{"inputs":"I like you."}' \
    -H 'Content-Type: application/json'
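To reduce the response to a single emotion, pick the entry with the highest score. This sketch assumes that, for a single input, the predict endpoint returns a JSON list of objects with label and score fields; that shape is an assumption, and the helper names top_label and predict are hypothetical.

```python
import json
import urllib.request


def top_label(predictions):
    """Return the label with the highest score.

    `predictions` is assumed to look like [{"label": "love", "score": 0.9}, ...].
    """
    return max(predictions, key=lambda p: p["score"])["label"]


def predict(text, base_url="http://127.0.0.1:8080"):
    """POST the same payload as the cURL command above (needs a running server)."""
    request = urllib.request.Request(
        f"{base_url}/predict",
        data=json.dumps({"inputs": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```

With the container running, top_label(predict("I like you.")) would give the single most likely label under that assumed response shape.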

Batching

You can send multiple inputs in a single batch. For example, for embeddings:

curl 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs":["Today is a nice day", "I like you"]}' \
    -H 'Content-Type: application/json'

And for sequence classification:

curl 127.0.0.1:8080/predict \
    -X POST \
    -d '{"inputs":[["I like you."], ["I hate pineapples"]]}' \
    -H 'Content-Type: application/json'
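When you have more inputs than you want to send in one request, you can split them into fixed-size batches on the client. The sketch below builds one embed payload per batch. The batch size of 32 is an illustrative assumption, not a documented limit, and the helper names batched and embed_payloads are hypothetical.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_payloads(texts, batch_size=32):
    """Build one /embed request body per batch, preserving input order."""
    return [{"inputs": batch} for batch in batched(texts, batch_size)]
```

Each payload can then be posted to the embed endpoint in turn, and the returned lists concatenated to recover one result per original input.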

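Air-gapped deployment

To deploy Text Embeddings Inference in an air-gapped environment, first download the weights, then mount them inside the container using a volume.

For example: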

# (Optional) create a `models` directory
mkdir models
cd models

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/Alibaba-NLP/gte-base-en-v1.5

# Set the models directory as the volume path
volume=$PWD

# Mount the models directory inside the container with a volume and set the model ID
docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.8 --model-id /data/gte-base-en-v1.5
