Vision Language Model Inference in TGI
Vision language models (VLMs) are models that can take both image and text inputs and generate text.
VLMs are trained on a combination of image and text data and can handle a wide range of tasks, such as image captioning, visual question answering, and visual dialogue.
What sets VLMs apart from other text and image models is their ability to handle long contexts and generate coherent, image-grounded text across multiple turns or, in some cases, multiple images.
Below are some common use cases for vision language models:
- Image captioning: given an image, generate a caption that describes it.
- Visual question answering (VQA): given an image and a question about it, generate an answer to the question.
- Multimodal dialogue: respond to multiple turns of images and conversation.
- Image information retrieval: given an image, retrieve information from it.
How to Use a Vision Language Model?
Hugging Face Hub Python Library
To infer with vision language models through Python, you can use the huggingface_hub library. The InferenceClient class provides a simple way to interact with the Inference API. Images can be passed as URLs or base64-encoded strings; InferenceClient will automatically detect the image format.
from huggingface_hub import InferenceClient
client = InferenceClient("http://127.0.0.1:3000")
image = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png"
prompt = f"![]({image})What is this a picture of?\n\n"
for token in client.text_generation(prompt, max_new_tokens=16, stream=True):
    print(token)
# This is a picture of an anthropomorphic rabbit in a space suit.
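Alternatively, the image can be read from a local file and passed as a base64-encoded data URI: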
from huggingface_hub import InferenceClient
import base64
import requests
import io
client = InferenceClient("http://127.0.0.1:3000")
# read image from local file
image_path = "rabbit.png"
with open(image_path, "rb") as f:
    image = base64.b64encode(f.read()).decode("utf-8")
image = f"data:image/png;base64,{image}"
prompt = f"![]({image})What is this a picture of?\n\n"
for token in client.text_generation(prompt, max_new_tokens=10, stream=True):
    print(token)
# This is a picture of an anthropomorphic rabbit in a space suit.
Or via the chat_completion endpoint:
from huggingface_hub import InferenceClient
client = InferenceClient("http://127.0.0.1:3000")
chat = client.chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Whats in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png"
                    },
                },
            ],
        },
    ],
    seed=42,
    max_tokens=100,
)
print(chat)
# ChatCompletionOutput(choices=[ChatCompletionOutputComplete(finish_reason='length', index=0, message=ChatCompletionOutputMessage(role='assistant', content=" The image you've provided features an anthropomorphic rabbit in spacesuit attire. This rabbit is depicted with human-like posture and movement, standing on a rocky terrain with a vast, reddish-brown landscape in the background. The spacesuit is detailed with mission patches, circuitry, and a helmet that covers the rabbit's face and ear, with an illuminated red light on the chest area.\n\nThe artwork style is that of a", name=None, tool_calls=None), logprobs=None)], created=1714589614, id='', model='llava-hf/llava-v1.6-mistral-7b-hf', object='text_completion', system_fingerprint='2.0.2-native', usage=ChatCompletionOutputUsage(completion_tokens=100, prompt_tokens=2943, total_tokens=3043))
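If you want the reply token by token as it is generated, chat_completion can also stream. A minimal sketch, assuming a recent huggingface_hub version in which chat_completion accepts stream=True and yields chunks carrying a delta field:

from huggingface_hub import InferenceClient

client = InferenceClient("http://127.0.0.1:3000")

# stream the assistant's reply instead of waiting for the full message
for chunk in client.chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Whats in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png"
                    },
                },
            ],
        },
    ],
    seed=42,
    max_tokens=100,
    stream=True,
):
    # each chunk carries only the newly generated text; content can be None on the final chunk
    print(chunk.choices[0].delta.content or "", end="")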
Or with OpenAI's client library:
from openai import OpenAI
# init the client but point it to TGI
client = OpenAI(base_url="http://localhost:3000/v1", api_key="-")
chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Whats in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png"
                    },
                },
            ],
        },
    ],
    stream=False,
)
print(chat_completion)
# ChatCompletion(id='', choices=[Choice(finish_reason='eos_token', index=0, logprobs=None, message=ChatCompletionMessage(content=' The image depicts an anthropomorphic rabbit dressed in a space suit with gear that resembles NASA attire. The setting appears to be a solar eclipse with dramatic mountain peaks and a partial celestial body in the sky. The artwork is detailed and vivid, with a warm color palette and a sense of an adventurous bunny exploring or preparing for a journey beyond Earth. ', role='assistant', function_call=None, tool_calls=None))], created=1714589732, model='llava-hf/llava-v1.6-mistral-7b-hf', object='text_completion', system_fingerprint='2.0.2-native', usage=CompletionUsage(completion_tokens=84, prompt_tokens=2943, total_tokens=3027))
Inference Through Sending cURL Requests
To use the generate_stream endpoint with curl, you can add the -N flag. This flag disables curl's default buffering and shows data as it arrives from the server.
curl -N 127.0.0.1:3000/generate_stream \
    -X POST \
    -d '{"inputs":"![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png)What is this a picture of?\n\n","parameters":{"max_new_tokens":16, "seed": 42}}' \
    -H 'Content-Type: application/json'
# ...
# data:{"index":16,"token":{"id":28723,"text":".","logprob":-0.6196289,"special":false},"generated_text":"This is a picture of an anthropomorphic rabbit in a space suit.","details":null}
Inference Through JavaScript
First, we need to install the @huggingface/inference library.
npm install @huggingface/inference
If you are using the free Inference API, you can use HfInference from Huggingface.js. If you are using an inference endpoint, you can use the HfInferenceEndpoint class to easily interact with the Inference API. We can create an HfInferenceEndpoint by providing our endpoint URL and a Hugging Face access token:
import { HfInferenceEndpoint } from "@huggingface/inference";
const hf = new HfInferenceEndpoint("http://127.0.0.1:3000", "HF_TOKEN");
const prompt =
  "![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png)What is this a picture of?\n\n";

const stream = hf.textGenerationStream({
  inputs: prompt,
  parameters: { max_new_tokens: 16, seed: 42 },
});

for await (const r of stream) {
  // yield the generated token
  process.stdout.write(r.token.text);
}
// This is a picture of an anthropomorphic rabbit in a space suit.
Combining Vision Language Models with Other Features
VLMs in TGI have several advantages; for example, these models can be used together with other features to accomplish more complex tasks. For instance, you can use VLMs with guided generation to produce specific JSON data from an image.
For example, we can extract information from the rabbit image and generate a JSON object with the location, activity, number of animals seen, and the animals seen. That could look like this:
{
    "activity": "Standing",
    "animals": ["Rabbit"],
    "animals_seen": 1,
    "location": "Rocky surface with mountains in the background and a red light on the rabbit's chest"
}
All we need to do is provide the VLM with the JSON schema, and it will generate the JSON object for us.
curl localhost:3000/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{
    "inputs":"![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png)What is this a picture of?\n\n",
    "parameters": {
        "max_new_tokens": 100,
        "seed": 42,
        "grammar": {
            "type": "json",
            "value": {
                "properties": {
                    "location": {
                        "type": "string"
                    },
                    "activity": {
                        "type": "string"
                    },
                    "animals_seen": {
                        "type": "integer",
                        "minimum": 1,
                        "maximum": 5
                    },
                    "animals": {
                        "type": "array",
                        "items": {
                            "type": "string"
                        }
                    }
                },
                "required": ["location", "activity", "animals_seen", "animals"]
            }
        }
    }
}'
# {
# "generated_text": "{ \"activity\": \"Standing\", \"animals\": [ \"Rabbit\" ], \"animals_seen\": 1, \"location\": \"Rocky surface with mountains in the background and a red light on the rabbit's chest\" }"
# }
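The same constrained request can be sent from Python as well. A minimal sketch, assuming a huggingface_hub version whose text_generation accepts the grammar parameter (the schema is passed here as a plain dict):

from huggingface_hub import InferenceClient
import json

client = InferenceClient("http://127.0.0.1:3000")

image = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png"
prompt = f"![]({image})What is this a picture of?\n\n"

# constrain generation to the JSON schema via TGI's grammar feature
generated = client.text_generation(
    prompt,
    max_new_tokens=100,
    seed=42,
    grammar={
        "type": "json",
        "value": {
            "properties": {
                "location": {"type": "string"},
                "activity": {"type": "string"},
                "animals_seen": {"type": "integer", "minimum": 1, "maximum": 5},
                "animals": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["location", "activity", "animals_seen", "animals"],
        },
    },
)

# the grammar constrains the output to valid JSON, so it can be parsed directly
print(json.loads(generated))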
Want to learn more about how vision language models work? Check out the awesome blog post on the topic.