Vision Language Model Inference in TGI
Vision language models (VLMs) are models that can take both image and text inputs and generate text.
VLMs are trained on a combination of image and text data and can handle a wide range of tasks, such as image captioning, visual question answering, and visual dialogue.
What sets VLMs apart from other text and image models is their ability to handle long contexts and generate coherent, image-grounded text across multiple turns or, in some cases, multiple images.
Below are some common use cases for vision language models:
- Image captioning: given an image, generate a caption that describes it.
- Visual question answering (VQA): given an image and a question about it, generate an answer to the question.
- Multimodal dialogue: respond to multiple turns of images and conversation.
- Image information retrieval: given an image, retrieve information from it.
How to Use a Vision Language Model?
Hugging Face Hub Python Library
To infer with vision language models through Python, you can use the huggingface_hub library. The InferenceClient class provides a simple way to interact with the Inference API. Images can be passed as URLs or base64-encoded strings; InferenceClient will automatically detect the image format.
from huggingface_hub import InferenceClient
client = InferenceClient("http://127.0.0.1:3000")
image = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png"
prompt = f"![]({image})What is this a picture of?\n\n"
for token in client.text_generation(prompt, max_new_tokens=16, stream=True):
    print(token)
# This is a picture of an anthropomorphic rabbit in a space suit.
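Alternatively, the image can be read from a local file and passed as a base64-encoded data URI: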
from huggingface_hub import InferenceClient
import base64
import requests
import io
client = InferenceClient("http://127.0.0.1:3000")
# read image from local file
image_path = "rabbit.png"
with open(image_path, "rb") as f:
    image = base64.b64encode(f.read()).decode("utf-8")
image = f"data:image/png;base64,{image}"
prompt = f"![]({image})What is this a picture of?\n\n"
for token in client.text_generation(prompt, max_new_tokens=10, stream=True):
    print(token)
# This is a picture of an anthropomorphic rabbit in a space suit.
Or via the chat_completion endpoint:
from huggingface_hub import InferenceClient
client = InferenceClient("http://127.0.0.1:3000")
chat = client.chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Whats in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png"
                    },
                },
            ],
        },
    ],
    seed=42,
    max_tokens=100,
)
print(chat)
# ChatCompletionOutput(choices=[ChatCompletionOutputComplete(finish_reason='length', index=0, message=ChatCompletionOutputMessage(role='assistant', content=" The image you've provided features an anthropomorphic rabbit in spacesuit attire. This rabbit is depicted with human-like posture and movement, standing on a rocky terrain with a vast, reddish-brown landscape in the background. The spacesuit is detailed with mission patches, circuitry, and a helmet that covers the rabbit's face and ear, with an illuminated red light on the chest area.\n\nThe artwork style is that of a", name=None, tool_calls=None), logprobs=None)], created=1714589614, id='', model='llava-hf/llava-v1.6-mistral-7b-hf', object='text_completion', system_fingerprint='2.0.2-native', usage=ChatCompletionOutputUsage(completion_tokens=100, prompt_tokens=2943, total_tokens=3043))
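If you want the reply token by token as it is generated, chat_completion can also stream. A minimal sketch, assuming a recent huggingface_hub version in which chat_completion accepts stream=True and yields chunks carrying a delta field:

from huggingface_hub import InferenceClient

client = InferenceClient("http://127.0.0.1:3000")

# stream the assistant's reply instead of waiting for the full message
for chunk in client.chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Whats in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png"
                    },
                },
            ],
        },
    ],
    seed=42,
    max_tokens=100,
    stream=True,
):
    # each chunk carries only the newly generated text; content can be None on the final chunk
    print(chunk.choices[0].delta.content or "", end="")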
Or with OpenAI's client library:
from openai import OpenAI
# init the client but point it to TGI
client = OpenAI(base_url="http://localhost:3000/v1", api_key="-")
chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Whats in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png"
                    },
                },
            ],
        },
    ],
    stream=False,
)
print(chat_completion)
# ChatCompletion(id='', choices=[Choice(finish_reason='eos_token', index=0, logprobs=None, message=ChatCompletionMessage(content=' The image depicts an anthropomorphic rabbit dressed in a space suit with gear that resembles NASA attire. The setting appears to be a solar eclipse with dramatic mountain peaks and a partial celestial body in the sky. The artwork is detailed and vivid, with a warm color palette and a sense of an adventurous bunny exploring or preparing for a journey beyond Earth. ', role='assistant', function_call=None, tool_calls=None))], created=1714589732, model='llava-hf/llava-v1.6-mistral-7b-hf', object='text_completion', system_fingerprint='2.0.2-native', usage=CompletionUsage(completion_tokens=84, prompt_tokens=2943, total_tokens=3027))
Inference Through Sending cURL Requests
To use the generate_stream endpoint with curl, you can add the -N flag. This flag disables curl's default buffering and shows data as it arrives from the server.
curl -N 127.0.0.1:3000/generate_stream \
    -X POST \
    -d '{"inputs":"![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png)What is this a picture of?\n\n","parameters":{"max_new_tokens":16, "seed": 42}}' \
    -H 'Content-Type: application/json'
# ...
# data:{"index":16,"token":{"id":28723,"text":".","logprob":-0.6196289,"special":false},"generated_text":"This is a picture of an anthropomorphic rabbit in a space suit.","details":null}
Inference Through JavaScript
First, we need to install the @huggingface/inference library.
npm install @huggingface/inference
If you are using the free Inference API, you can use HfInference from Huggingface.js. If you are using an inference endpoint, you can use the HfInferenceEndpoint class to easily interact with the Inference API. We can create an HfInferenceEndpoint by providing our endpoint URL and a Hugging Face access token:
import { HfInferenceEndpoint } from "@huggingface/inference";
const hf = new HfInferenceEndpoint("http://127.0.0.1:3000", "HF_TOKEN");
const prompt =
  "![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png)What is this a picture of?\n\n";

const stream = hf.textGenerationStream({
  inputs: prompt,
  parameters: { max_new_tokens: 16, seed: 42 },
});

for await (const r of stream) {
  // yield the generated token
  process.stdout.write(r.token.text);
}
// This is a picture of an anthropomorphic rabbit in a space suit.
Combining Vision Language Models with Other Features
VLMs in TGI have several advantages; for example, these models can be used together with other features to accomplish more complex tasks. For instance, you can use VLMs with guided generation to produce specific JSON data from an image.
For example, we can extract information from the rabbit image and generate a JSON object with the location, activity, number of animals seen, and the animals seen. That could look like this:
{
    "activity": "Standing",
    "animals": ["Rabbit"],
    "animals_seen": 1,
    "location": "Rocky surface with mountains in the background and a red light on the rabbit's chest"
}
All we need to do is provide the VLM with the JSON schema, and it will generate the JSON object for us.
curl localhost:3000/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{
    "inputs":"![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png)What is this a picture of?\n\n",
    "parameters": {
        "max_new_tokens": 100,
        "seed": 42,
        "grammar": {
            "type": "json",
            "value": {
                "properties": {
                    "location": {
                        "type": "string"
                    },
                    "activity": {
                        "type": "string"
                    },
                    "animals_seen": {
                        "type": "integer",
                        "minimum": 1,
                        "maximum": 5
                    },
                    "animals": {
                        "type": "array",
                        "items": {
                            "type": "string"
                        }
                    }
                },
                "required": ["location", "activity", "animals_seen", "animals"]
            }
        }
    }
}'
# {
# "generated_text": "{ \"activity\": \"Standing\", \"animals\": [ \"Rabbit\" ], \"animals_seen\": 1, \"location\": \"Rocky surface with mountains in the background and a red light on the rabbit's chest\" }"
# }
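The same constrained request can be sent from Python as well. A minimal sketch, assuming a huggingface_hub version whose text_generation accepts the grammar parameter (the schema is passed here as a plain dict):

from huggingface_hub import InferenceClient
import json

client = InferenceClient("http://127.0.0.1:3000")

image = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png"
prompt = f"![]({image})What is this a picture of?\n\n"

# constrain generation to the JSON schema via TGI's grammar feature
generated = client.text_generation(
    prompt,
    max_new_tokens=100,
    seed=42,
    grammar={
        "type": "json",
        "value": {
            "properties": {
                "location": {"type": "string"},
                "activity": {"type": "string"},
                "animals_seen": {"type": "integer", "minimum": 1, "maximum": 5},
                "animals": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["location", "activity", "animals_seen", "animals"],
        },
    },
)

# the grammar constrains the output to valid JSON, so it can be parsed directly
print(json.loads(generated))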
Want to learn more about how vision language models work? Check out the awesome blog post on the topic.