text-generation-inference documentation

Vision Language Model Inference in TGI

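Vision language models (VLMs) are models that take both image and text inputs and generate text.

VLMs are trained on a combination of image and text data and can handle a wide range of tasks, such as image captioning, visual question answering, and visual dialog.

What distinguishes VLMs from other text and image models is their ability to handle long context and to generate text that stays coherent and relevant to the image, even after multiple turns or, in some cases, multiple images.

Below are a few common use cases for vision language models:

  • Image captioning: given an image, generate a caption that describes it.
  • Visual question answering (VQA): given an image and a question about it, generate an answer to the question.
  • Multimodal dialog: generate responses across multiple turns of images and conversation.
  • Image information retrieval: given an image, retrieve information from it.

How to Use a Vision Language Model?

Hugging Face Hub Python Library

To run vision language model inference from Python, you can use the huggingface_hub library. The InferenceClient class provides a simple way to interact with the Inference API. Images can be passed as URLs or as Base64-encoded strings. The InferenceClient will automatically detect the image format.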

from huggingface_hub import InferenceClient

client = InferenceClient(base_url="http://127.0.0.1:3000")
image = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png"
prompt = f"![]({image})What is this a picture of?\n\n"
for token in client.text_generation(prompt, max_new_tokens=16, stream=True):
    print(token)

# This is a picture of an anthropomorphic rabbit in a space suit.

A local image can be sent the same way by encoding it as a Base64 data URL:

from huggingface_hub import InferenceClient
import base64

client = InferenceClient(base_url="http://127.0.0.1:3000")

# read image from local file
image_path = "rabbit.png"
with open(image_path, "rb") as f:
    image = base64.b64encode(f.read()).decode("utf-8")

image = f"data:image/png;base64,{image}"
prompt = f"![]({image})What is this a picture of?\n\n"

for token in client.text_generation(prompt, max_new_tokens=16, stream=True):
    print(token)

# This is a picture of an anthropomorphic rabbit in a space suit.
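
The snippet above hardcodes the `image/png` MIME type. For other image formats, a small helper can derive the type from the file extension. The following is a minimal sketch using only the standard library; `to_data_url` and `build_prompt` are hypothetical helper names and are not part of huggingface_hub, and the PNG fallback is an assumption.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_url(image_path: str) -> str:
    """Encode a local image file as a Base64 data URL."""
    # Guess the MIME type from the file extension.
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None:
        # Assumption: fall back to PNG when the extension is unknown.
        mime_type = "image/png"
    encoded = base64.b64encode(Path(image_path).read_bytes()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


def build_prompt(image_ref: str, question: str) -> str:
    """Embed an image URL or data URL in the Markdown-style prompt used above."""
    return f"![]({image_ref}){question}\n\n"
```

The resulting prompt can then be passed to `client.text_generation` exactly as in the examples above, for instance `build_prompt(to_data_url("rabbit.png"), "What is this a picture of?")`.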

Or via the chat_completion endpoint:

from huggingface_hub import InferenceClient

client = InferenceClient(base_url="http://127.0.0.1:3000")

chat = client.chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png"
                    },
                },
            ],
        },
    ],
    seed=42,
    max_tokens=100,
)

print(chat)
# ChatCompletionOutput(choices=[ChatCompletionOutputComplete(finish_reason='length', index=0, message=ChatCompletionOutputMessage(role='assistant', content=" The image you've provided features an anthropomorphic rabbit in spacesuit attire. This rabbit is depicted with human-like posture and movement, standing on a rocky terrain with a vast, reddish-brown landscape in the background. The spacesuit is detailed with mission patches, circuitry, and a helmet that covers the rabbit's face and ear, with an illuminated red light on the chest area.\n\nThe artwork style is that of a", name=None, tool_calls=None), logprobs=None)], created=1714589614, id='', model='llava-hf/llava-v1.6-mistral-7b-hf', object='text_completion', system_fingerprint='2.0.2-native', usage=ChatCompletionOutputUsage(completion_tokens=100, prompt_tokens=2943, total_tokens=3043))
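
For the multimodal dialog use case, the same message format can carry several turns. The sketch below only builds the `messages` list and makes no request; `user_message` and `assistant_message` are hypothetical helpers, and the assistant reply text is a placeholder standing in for a previous model response.

```python
from typing import Optional

IMAGE_URL = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png"


def user_message(text: str, image_url: Optional[str] = None) -> dict:
    """Build a user turn with text and, optionally, one image."""
    content = [{"type": "text", "text": text}]
    if image_url is not None:
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    return {"role": "user", "content": content}


def assistant_message(text: str) -> dict:
    """Build an assistant turn holding an earlier reply (plain string content)."""
    return {"role": "assistant", "content": text}


# A two-question conversation about the same image.
messages = [
    user_message("What's in this image?", IMAGE_URL),
    assistant_message("An anthropomorphic rabbit in a space suit."),  # placeholder reply
    user_message("What is the rabbit standing on?"),
]
```

The list can then be passed as `client.chat_completion(messages=messages, max_tokens=100)`, assuming the served model's chat template supports multi-turn input.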

Or with OpenAI's client library:

from openai import OpenAI

# init the client but point it to TGI
client = OpenAI(base_url="http://localhost:3000/v1", api_key="-")

chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png"
                    },
                },
            ],
        },
    ],
    stream=False,
)

print(chat_completion)
# ChatCompletion(id='', choices=[Choice(finish_reason='eos_token', index=0, logprobs=None, message=ChatCompletionMessage(content=' The image depicts an anthropomorphic rabbit dressed in a space suit with gear that resembles NASA attire. The setting appears to be a solar eclipse with dramatic mountain peaks and a partial celestial body in the sky. The artwork is detailed and vivid, with a warm color palette and a sense of an adventurous bunny exploring or preparing for a journey beyond Earth. ', role='assistant', function_call=None, tool_calls=None))], created=1714589732, model='llava-hf/llava-v1.6-mistral-7b-hf', object='text_completion', system_fingerprint='2.0.2-native', usage=CompletionUsage(completion_tokens=84, prompt_tokens=2943, total_tokens=3027))

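Inference Through Sending cURL Requests

To use curl with the generate_stream endpoint, add the -N flag. This flag disables curl's default buffering and shows data as it arrives from the server.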

curl -N 127.0.0.1:3000/generate_stream \
    -X POST \
    -d '{"inputs":"![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png)What is this a picture of?\n\n","parameters":{"max_new_tokens":16, "seed": 42}}' \
    -H 'Content-Type: application/json'

# ...
# data:{"index":16,"token":{"id":28723,"text":".","logprob":-0.6196289,"special":false},"generated_text":"This is a picture of an anthropomorphic rabbit in a space suit.","details":null}
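
Each streamed line shown above is a `data:` prefix followed by a JSON object. If the stream is consumed by a script rather than read in a terminal, the lines could be decoded roughly as follows. This is a sketch based only on the line format shown in the sample output; `parse_sse_line` and `iter_token_text` are hypothetical helper names.

```python
import json
from typing import Iterable, Iterator, Optional


def parse_sse_line(line: str) -> Optional[dict]:
    """Return the JSON payload of one `data:` line, or None for any other line."""
    line = line.strip()
    if not line.startswith("data:"):
        return None
    return json.loads(line[len("data:"):])


def iter_token_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text of each non-special token from a sequence of streamed lines."""
    for line in lines:
        event = parse_sse_line(line)
        if event is not None and not event["token"]["special"]:
            yield event["token"]["text"]


# The final event from the curl output above.
sample = 'data:{"index":16,"token":{"id":28723,"text":".","logprob":-0.6196289,"special":false},"generated_text":"This is a picture of an anthropomorphic rabbit in a space suit.","details":null}'
event = parse_sse_line(sample)
print(event["token"]["text"])
print(event["generated_text"])
```

In the sample, `generated_text` is populated on the last event, so a consumer can either join the token texts as they arrive or read the full text from that final event.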

Inference Through JavaScript

First, we need to install the @huggingface/inference library.

npm install @huggingface/inference

Whether you use Inference Providers (our serverless API) or Inference Endpoints, you can call InferenceClient.

We can create an InferenceClient by providing our endpoint URL and a Hugging Face access token.

import { InferenceClient } from "@huggingface/inference";

const client = new InferenceClient('hf_YOUR_TOKEN', { endpointUrl: 'https://YOUR_ENDPOINT.endpoints.huggingface.cloud' });

const prompt =
  "![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png)What is this a picture of?\n\n";

const stream = client.textGenerationStream({
  inputs: prompt,
  parameters: { max_new_tokens: 16, seed: 42 },
});
for await (const r of stream) {
  // yield the generated token
  process.stdout.write(r.token.text);
}

// This is a picture of an anthropomorphic rabbit in a space suit.

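Combining Vision Language Models with Other Features

One advantage of serving VLMs through TGI is that they can be combined with other features to accomplish more complex tasks. For example, a VLM can be used together with guided generation to produce structured JSON data from an image.

For instance, we can extract information from the rabbit image and generate a JSON object containing the location, the activity, the number of animals seen, and the animals seen. The result would look like this: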

{
  "activity": "Standing",
  "animals": ["Rabbit"],
  "animals_seen": 1,
  "location": "Rocky surface with mountains in the background and a red light on the rabbit's chest"
}

All we need to do is provide a JSON schema to the VLM, and it will generate the JSON object for us.

curl localhost:3000/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{
    "inputs":"![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png)What is this a picture of?\n\n",
    "parameters": {
        "max_new_tokens": 100,
        "seed": 42,
        "grammar": {
            "type": "json",
            "value": {
                "properties": {
                    "location": {
                        "type": "string"
                    },
                    "activity": {
                        "type": "string"
                    },
                    "animals_seen": {
                        "type": "integer",
                        "minimum": 1,
                        "maximum": 5
                    },
                    "animals": {
                        "type": "array",
                        "items": {
                            "type": "string"
                        }
                    }
                },
                "required": ["location", "activity", "animals_seen", "animals"]
            }
        }
    }
}'

# {
#   "generated_text": "{ \"activity\": \"Standing\", \"animals\": [ \"Rabbit\" ], \"animals_seen\": 1, \"location\": \"Rocky surface with mountains in the background and a red light on the rabbit's chest\" }"
# }
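
The same request body can be assembled in Python, which also makes it easier to decode the result, since `generated_text` is itself a JSON string. The sketch below builds the payload and parses a response without contacting a server; `build_request` and `parse_response` are hypothetical helpers, and the required-key check is a simple sanity check rather than full JSON Schema validation.

```python
import json

IMAGE_URL = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png"

SCHEMA = {
    "properties": {
        "location": {"type": "string"},
        "activity": {"type": "string"},
        "animals_seen": {"type": "integer", "minimum": 1, "maximum": 5},
        "animals": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["location", "activity", "animals_seen", "animals"],
}


def build_request(image_url: str, question: str, schema: dict,
                  max_new_tokens: int = 100, seed: int = 42) -> dict:
    """Build the JSON body for /generate with a JSON grammar constraint."""
    return {
        "inputs": f"![]({image_url}){question}\n\n",
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "seed": seed,
            "grammar": {"type": "json", "value": schema},
        },
    }


def parse_response(body: str, schema: dict) -> dict:
    """Decode `generated_text` and check that every required key is present."""
    data = json.loads(json.loads(body)["generated_text"])
    missing = [key for key in schema.get("required", []) if key not in data]
    if missing:
        raise ValueError(f"missing required keys: {missing}")
    return data


payload = build_request(IMAGE_URL, "What is this a picture of?", SCHEMA)

# Stand-in for the HTTP response body, mirroring the curl output above.
sample_body = json.dumps({"generated_text": json.dumps({
    "activity": "Standing",
    "animals": ["Rabbit"],
    "animals_seen": 1,
    "location": "Rocky surface with mountains in the background and a red light on the rabbit's chest",
})})
result = parse_response(sample_body, SCHEMA)
print(result["animals"], result["animals_seen"])
```

The payload could be sent with any HTTP client, for example `requests.post("http://localhost:3000/generate", json=payload)`, assuming TGI is listening on that address as in the curl example.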

Want to learn more about how vision language models work? Check out this excellent blog post on the topic.
