Image-text-to-text

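Image-text-to-text models, also known as vision language models (VLMs), are language models that accept image input. These models can handle a variety of tasks, from visual question answering to image segmentation. This task shares many similarities with image-to-text, and the two overlap in use cases such as image captioning. However, image-to-text models take only image inputs and usually accomplish one specific task, whereas VLMs take open-ended text and image inputs and are more general-purpose models.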

In this guide, we give a brief overview of VLMs and show how to use them with Transformers for inference.

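To begin with, there are several types of VLMs:

base models used for fine-tuning
chat fine-tuned models for conversation
instruction fine-tuned models

This guide focuses on inference with an instruction-tuned model.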

Let's begin by installing the dependencies.

pip install -q transformers accelerate flash_attn

Let's initialize the model and the processor.

from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

device = torch.device("cuda")
model = AutoModelForImageTextToText.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).to(device)

processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")

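This model has a chat template that helps users parse chat outputs. The model can also accept multiple images as input in a single conversation or message. We will now prepare the inputs.

The image inputs look like the following.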

from PIL import Image
import requests

img_urls =["https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png",
           "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"]
images = [Image.open(requests.get(img_urls[0], stream=True).raw),
          Image.open(requests.get(img_urls[1], stream=True).raw)]

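Below is an example of the chat template input. We can pass earlier conversation turns together with the latest message by appending them to the end of the message list.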

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What do we see in this image?"},
        ]
    },
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": "In this image we can see two cats on the nets."},
        ]
    },
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "And how about this image?"},
        ]
    },
]

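We now call the processor's apply_chat_template() method to turn the messages into a prompt, then preprocess that prompt together with the image inputs.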

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[images[0], images[1]], return_tensors="pt").to(device)
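
In the example above, each {"type": "image"} entry in the messages lines up with one image passed to the processor, in order. A mismatch between the two is an easy mistake to make, so a small check before preprocessing can help. The sketch below is a minimal illustration in plain Python; count_image_placeholders is a hypothetical helper, not part of Transformers, and the file names stand in for the PIL images.

```python
# Hypothetical helper (not part of Transformers): count the image placeholders
# in a chat-style message list so the count can be compared with the image list.
def count_image_placeholders(messages):
    total = 0
    for message in messages:
        content = message["content"]
        if isinstance(content, str):
            # Plain-string content carries no image placeholders.
            continue
        for part in content:
            if part.get("type") == "image":
                total += 1
    return total


example_messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "What do we see in this image?"},
    ]},
    {"role": "assistant", "content": [
        {"type": "text", "text": "In this image we can see two cats on the nets."},
    ]},
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "And how about this image?"},
    ]},
]
example_images = ["cats.png", "bee.jpg"]  # stand-ins for the two PIL images

n_placeholders = count_image_placeholders(example_messages)
if n_placeholders != len(example_images):
    raise ValueError(
        f"{n_placeholders} image placeholders but {len(example_images)} images"
    )
print(n_placeholders)
```

Running the check before calling the processor surfaces the problem with a readable message instead of a failure deeper in preprocessing.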

We can now pass the preprocessed inputs to the model.

with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)

print(generated_texts)
## ['User: What do we see in this image? \nAssistant: In this image we can see two cats on the nets. \nUser: And how about this image? \nAssistant: In this image we can see flowers, plants and insect.']
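
The decoded text above contains the whole conversation, not only the new answer. If only the last reply is needed, one option is to split the decoded string on the final "Assistant:" marker. The sketch below assumes the "User:" / "Assistant:" layout shown in the output above; last_assistant_reply is a hypothetical helper name, and other models may format their output differently.

```python
# Hypothetical helper: keep only the text after the last "Assistant:" marker.
# Assumes the "User: ... \nAssistant: ..." layout shown in the output above.
def last_assistant_reply(decoded_text, marker="Assistant:"):
    # rpartition splits on the last occurrence of the marker.
    _, found, tail = decoded_text.rpartition(marker)
    if not found:
        # Marker missing: return the text unchanged apart from whitespace.
        return decoded_text.strip()
    return tail.strip()


decoded = (
    "User: What do we see in this image? \n"
    "Assistant: In this image we can see two cats on the nets. \n"
    "User: And how about this image? \n"
    "Assistant: In this image we can see flowers, plants and insect."
)
print(last_assistant_reply(decoded))
# In this image we can see flowers, plants and insect.
```

An alternative that avoids string parsing is to drop the prompt tokens before decoding, for example by slicing generated_ids with the length of inputs["input_ids"], which should work when the generated sequence starts with the prompt.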

Pipeline

The fastest way to get started is to use the Pipeline API. Specify the "image-text-to-text" task and the model you want to use.

from transformers import pipeline
pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf")

The example below uses a chat template to format the text input.

messages = [
     {
         "role": "user",
         "content": [
             {
                 "type": "image",
                 "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
             },
             {"type": "text", "text": "Describe this image."},
         ],
     },
     {
         "role": "assistant",
         "content": [
             {"type": "text", "text": "There's a pink flower"},
         ],
     },
 ]

Pass the chat-template-formatted text and image to the Pipeline and set return_full_text=False to remove the input from the generated output.

outputs = pipe(text=messages, max_new_tokens=20, return_full_text=False)
outputs[0]["generated_text"]
#  with a yellow center in the foreground. The flower is surrounded by red and white flowers with green stems
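
In the example above, the final assistant message acts as a prefix that the model continues, which is why the output starts mid-sentence. When the pipeline is called for many images, building the message list with a small function keeps the structure consistent. The sketch below is one possible way to do that; build_messages and its parameters are hypothetical names, not part of the Pipeline API.

```python
# Hypothetical helper: build the chat-format message list used in the
# pipeline example above. The names here are illustrative only.
def build_messages(image_url, question, assistant_prefix=None):
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_url},
                {"type": "text", "text": question},
            ],
        }
    ]
    if assistant_prefix is not None:
        # Optional partial answer for the model to continue.
        messages.append(
            {
                "role": "assistant",
                "content": [{"type": "text", "text": assistant_prefix}],
            }
        )
    return messages


messages_for_bee = build_messages(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
    "Describe this image.",
    assistant_prefix="There's a pink flower",
)
print(len(messages_for_bee))
```

The resulting list has the same shape as the messages variable above, so it can be passed to the pipeline through the text argument in the same way.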

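Streaming

We can use text streaming for a better generation experience. Transformers supports streaming through the TextStreamer and TextIteratorStreamer classes. We will use TextIteratorStreamer with Idefics2-8B, the model loaded earlier.

Assume we have an application that keeps the chat history and receives new user input. We preprocess the inputs as usual and initialize a TextIteratorStreamer so that generation can run in a separate thread. This lets you stream the generated text in real time. The streamer is passed to model.generate() together with the other generation arguments, as the generation_args dictionary in the code shows.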

import time
from transformers import TextIteratorStreamer
from threading import Thread

def model_inference(
    user_prompt,
    chat_history,
    max_new_tokens,
    images
):
    user_prompt = {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": user_prompt},
        ]
    }
    chat_history.append(user_prompt)
    streamer = TextIteratorStreamer(
        processor.tokenizer,
        skip_prompt=True,
        timeout=5.0,
    )

    generation_args = {
        "max_new_tokens": max_new_tokens,
        "streamer": streamer,
        "do_sample": False
    }

    # add_generation_prompt=True makes model generate bot response
    prompt = processor.apply_chat_template(chat_history, add_generation_prompt=True)
    inputs = processor(
        text=prompt,
        images=images,
        return_tensors="pt",
    ).to(device)
    generation_args.update(inputs)

    thread = Thread(
        target=model.generate,
        kwargs=generation_args,
    )
    thread.start()

    acc_text = ""
    for text_token in streamer:
        time.sleep(0.04)
        acc_text += text_token
        if acc_text.endswith("<end_of_utterance>"):
            acc_text = acc_text[:-18]
        yield acc_text

    thread.join()

Now let's call the model_inference function we created and stream the values.

generator = model_inference(
    user_prompt="And what is in this image?",
    chat_history=messages[:2],
    max_new_tokens=100,
    images=images
)

for value in generator:
  print(value)

# In
# In this
# In this image ...
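
To see why generation runs in its own thread, it can help to look at the pattern in isolation: a producer pushes text chunks into a queue while the consumer iterates over them as they arrive. The sketch below uses only the standard library. ToyIteratorStreamer and fake_generate are hypothetical stand-ins for TextIteratorStreamer and model.generate, written for illustration and not taken from the Transformers source.

```python
import queue
import threading


class ToyIteratorStreamer:
    """Simplified, illustrative stand-in for an iterator-style streamer."""

    _STOP = object()  # sentinel that marks the end of the stream

    def __init__(self, timeout=None):
        self._queue = queue.Queue()
        self._timeout = timeout

    def put(self, text):
        # Called by the producer thread for every new chunk of text.
        self._queue.put(text)

    def end(self):
        # Called by the producer thread once generation is finished.
        self._queue.put(self._STOP)

    def __iter__(self):
        return self

    def __next__(self):
        # Blocks until a chunk arrives; raises queue.Empty after the timeout.
        item = self._queue.get(timeout=self._timeout)
        if item is self._STOP:
            raise StopIteration
        return item


def fake_generate(streamer, chunks):
    # Stand-in for model.generate: emits chunks, then signals the end.
    for chunk in chunks:
        streamer.put(chunk)
    streamer.end()


def stream_text(chunks, stop_marker="<end_of_utterance>"):
    streamer = ToyIteratorStreamer(timeout=5.0)
    worker = threading.Thread(target=fake_generate, args=(streamer, chunks))
    worker.start()

    acc_text = ""
    for piece in streamer:
        acc_text += piece
        if acc_text.endswith(stop_marker):
            # Strip the end marker, as the model_inference function does.
            acc_text = acc_text[: -len(stop_marker)]
        yield acc_text

    worker.join()


for value in stream_text(["In", " this", " image", "<end_of_utterance>"]):
    print(value)
```

Because the consumer only blocks on the queue, the calling code can update a user interface between chunks while the producer keeps working in the background.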

Fitting models on smaller hardware

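VLMs are often large and need to be optimized to run on smaller hardware. Transformers supports many model quantization libraries; here we only show int8 quantization with Quanto. int8 quantization can reduce memory usage by up to 75% compared with float32 weights (if all weights are quantized). However, it is no free lunch: because 8-bit is not a CUDA-native precision, the weights are quantized and dequantized on the fly, which adds latency.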
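
The memory figure follows from simple arithmetic on bytes per parameter. The sketch below estimates weight memory only, ignoring activations, the key-value cache, and quantization scales. The parameter count of roughly 8 billion is an assumption taken from the model name, not a measured value.

```python
# Bytes needed to store one parameter in each precision.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}


def weight_memory_gb(num_params, dtype):
    # Weight storage only, in gigabytes (10**9 bytes).
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def savings_vs(num_params, baseline, target):
    # Fraction of weight memory saved when moving from baseline to target.
    base = weight_memory_gb(num_params, baseline)
    return 1 - weight_memory_gb(num_params, target) / base


num_params = 8e9  # assumption: roughly 8 billion parameters

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(num_params, dtype):.0f} GB")

print(f"int8 vs float32: {savings_vs(num_params, 'float32', 'int8'):.0%} saved")
print(f"int8 vs bfloat16: {savings_vs(num_params, 'bfloat16', 'int8'):.0%} saved")
```

Under these assumptions, the 75% figure applies to a float32 baseline. Against the bfloat16 weights loaded earlier in this guide, the saving is closer to half.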

First, install the dependencies.

pip install -U quanto bitsandbytes

To quantize a model while loading it, we first create a QuantoConfig. We then load the model as usual, passing quantization_config during model initialization.

from transformers import AutoModelForImageTextToText, QuantoConfig

model_id = "HuggingFaceM4/idefics2-8b"
quantization_config = QuantoConfig(weights="int8")
quantized_model = AutoModelForImageTextToText.from_pretrained(
    model_id, device_map="cuda", quantization_config=quantization_config
)

That's it. We can use the quantized model in the same way as before, with no changes.
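
For intuition about the back-and-forth quantization mentioned earlier, the sketch below round-trips a few weights through 8-bit integers in plain Python. It uses a generic symmetric absmax scheme with a single scale, chosen for illustration; it is not necessarily the exact scheme Quanto applies, and the function names are hypothetical.

```python
# Illustrative symmetric int8 quantization with one scale for all values.
# This is a generic scheme, not necessarily what Quanto does internally.
def quantize_int8(weights):
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude to 127; fall back to 1.0 for all-zero input.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_int8(quantized, scale):
    # Recover approximate floating-point values for use in computation.
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q)
print(restored)
print(f"max rounding error: {max_error:.6f} (bound: {scale / 2:.6f})")
```

Each weight is stored in one byte, and the rounding error stays within about half the scale. The dequantization step is the extra work at run time that contributes to the added latency.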
