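Multimodal templates

Chat templates for multimodal models work much like those for text-only models. They take messages, a list of dictionaries that each hold a role and content.

Multimodal templates are included in the Processor class and require an additional type key that specifies whether the included content is an image, a video, or text.

This guide shows you how to format chat templates for multimodal models and covers some best practices for configuring the template.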

ImageTextToTextPipeline

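ImageTextToTextPipeline is a high-level image and text generation class with a "chat mode". Chat mode is enabled when a conversational model is detected and the chat prompt is properly formatted.

Start by building a chat history with the following two roles.

system describes how the model should behave and respond when you chat with it. Not all chat models support this role.
user is where you enter your first message to the model.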

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a friendly chatbot who always responds in the style of a pirate"}],
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
            {"type": "text", "text": "What are these?"},
        ],
    },
]
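As a quick sanity check on the structure above, the following sketch walks a messages list and reports entries that are missing a role, content, or a recognized type. It uses only the standard library; find_message_problems, ALLOWED_TYPES, and SOURCE_KEYS are hypothetical names for illustration and are not part of Transformers. The accepted keys are assumed from what this guide describes.

```python
# Content types and source keys as described in this guide (assumed, not exhaustive).
ALLOWED_TYPES = {"text", "image", "video"}
SOURCE_KEYS = {"image": ("url", "path", "base64"), "video": ("url", "path")}


def find_message_problems(messages):
    """Return a list of readable problems; an empty list means the shape looks right."""
    problems = []
    for i, message in enumerate(messages):
        if "role" not in message or "content" not in message:
            problems.append(f"message {i}: needs both 'role' and 'content'")
            continue
        content = message["content"]
        if isinstance(content, str):
            continue  # a plain string is fine for a text-only turn
        for j, item in enumerate(content):
            kind = item.get("type")
            if kind not in ALLOWED_TYPES:
                problems.append(f"message {i}, item {j}: unknown type {kind!r}")
            elif kind == "text" and "text" not in item:
                problems.append(f"message {i}, item {j}: text item has no 'text'")
            elif kind in SOURCE_KEYS and not any(k in item for k in SOURCE_KEYS[kind]):
                problems.append(f"message {i}, item {j}: {kind} item has no source")
    return problems


example = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
            {"type": "text", "text": "What are these?"},
        ],
    },
]
print(find_message_problems(example))  # prints []
```

Running a check like this before calling the pipeline can surface a missing type key early, instead of as an error deep inside preprocessing.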

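Create an ImageTextToTextPipeline and pass the chat to it. For large models, setting device_map="auto" helps load the model more quickly and automatically places it on the fastest available device. Changing the data type to torch.bfloat16 also helps save memory. The example below loads a small model, so it simply sets device="cuda" and torch.float16.

ImageTextToTextPipeline accepts chats in the OpenAI format to make inference easier and more accessible.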

import torch
from transformers import pipeline

# Use a different name for the instance so the pipeline() factory is not shadowed.
pipe = pipeline("image-text-to-text", model="llava-hf/llava-onevision-qwen2-0.5b-ov-hf", device="cuda", torch_dtype=torch.float16)
pipe(text=messages, max_new_tokens=50, return_full_text=False)
[{'input_text': [{'role': 'system',
    'content': [{'type': 'text',
      'text': 'You are a friendly chatbot who always responds in the style of a pirate'}]},
   {'role': 'user',
    'content': [{'type': 'image',
      'url': 'http://images.cocodataset.org/val2017/000000039769.jpg'},
     {'type': 'text', 'text': 'What are these?'}]}],
  'generated_text': 'The image shows two cats lying on a pink surface, which appears to be a cushion or a soft blanket. The cat on the left has a striped coat, typical of tabby cats, and is lying on its side with its head resting on the'}]

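Image inputs

For multimodal models that accept images, such as LLaVA, include the following in content.

The content "type" can be "image" or "text".
For images, the source can be a link to the image ("url"), a file path ("path"), or "base64". Images are automatically loaded, processed, and prepared into pixel values as inputs to the model.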
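For the "base64" option, a content entry can be built from raw image bytes with the standard library. This is a minimal sketch that assumes the "base64" value is the base64-encoded file contents as a plain string; image_entry_from_bytes is a hypothetical helper, not a Transformers function.

```python
import base64


def image_entry_from_bytes(image_bytes):
    """Wrap raw image bytes in a content entry that uses the "base64" key."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {"type": "image", "base64": encoded}


# Placeholder bytes stand in for the contents of a real image file,
# which would normally come from open(some_path, "rb").read().
entry = image_entry_from_bytes(b"\x89PNG placeholder")
print(entry["type"])  # prints image
```

The resulting dictionary goes into the content list in the same position where a "url" or "path" entry would go.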

from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model = LlavaOnevisionForConditionalGeneration.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf")
processor = AutoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf")

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a friendly chatbot who always responds in the style of a pirate"}],
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
            {"type": "text", "text": "What are these?"},
        ],
    },
]

Pass messages to apply_chat_template() to tokenize the input content and return the input_ids and pixel_values.

processed_chat = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt")
print(processed_chat.keys())

These inputs are now ready to be used in generate().
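One way to run that step is sketched below. It assumes model, processor, and processed_chat are the objects created earlier in this section; generate_reply is a hypothetical helper name. It calls generate(), slices off the prompt tokens, and decodes only the newly generated ones.

```python
def generate_reply(model, processor, processed_chat, max_new_tokens=50):
    """Generate a reply and decode only the tokens produced after the prompt."""
    # Move the tokenized inputs to the same device as the model.
    inputs = processed_chat.to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # generate() returns the prompt followed by the new tokens, so drop the prompt part.
    prompt_length = inputs["input_ids"].shape[1]
    new_tokens = output_ids[:, prompt_length:]
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0]


# Usage, with the objects from this section:
# print(generate_reply(model, processor, processed_chat))
```

Slicing by the prompt length keeps the system and user text out of the decoded reply.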

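Video inputs

Some vision models also support video inputs. The message format is very similar to the format for image inputs.

The content "type" should be "video" to indicate that the content is a video.
For videos, the source can be a link to the video ("url") or a file path ("path"). Loading a video from "url" is only supported by the PyAV or Decord backends.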

from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "llava-hf/llava-onevision-qwen2-0.5b-ov-hf"
model = LlavaOnevisionForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a friendly chatbot who always responds in the style of a pirate"}],
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "url": "https://test-videos.co.uk/vids/bigbuckbunny/mp4/h264/720/Big_Buck_Bunny_720_10s_10MB.mp4"},
            {"type": "text", "text": "What do you see in this video?"},
        ],
    },
]

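Pass messages to apply_chat_template() to tokenize the input content. A few extra parameters in apply_chat_template() control the frame sampling process.

The video_load_backend parameter selects the framework used to load a video. It supports PyAV, Decord, OpenCV, and torchvision.

The examples below use Decord as the backend because it is a bit faster than PyAV.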

Fixed number of frames
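To illustrate what sampling a fixed number of frames means, the sketch below picks evenly spaced frame indices from a clip. It is plain Python for illustration and not the library's implementation; uniform_frame_indices and its parameters are hypothetical names.

```python
def uniform_frame_indices(total_frames, num_frames):
    """Pick num_frames evenly spaced frame indices from a clip of total_frames."""
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    # A clip cannot yield more distinct frames than it contains.
    num_frames = min(num_frames, total_frames)
    step = total_frames / num_frames
    return [int(i * step) for i in range(num_frames)]


print(uniform_frame_indices(300, 4))  # prints [0, 75, 150, 225]
```

A fixed frame count keeps the number of visual tokens predictable regardless of how long the video is.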

fps
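Sampling by frame rate instead ties the number of frames to the clip's duration. The sketch below shows the arithmetic in plain Python, under the assumption that the source frame rate and total frame count are known; fps_frame_indices is a hypothetical helper and not the library's implementation.

```python
def fps_frame_indices(total_frames, source_fps, target_fps):
    """Pick frame indices so that roughly target_fps frames are kept per second."""
    if total_frames <= 0 or source_fps <= 0 or target_fps <= 0:
        raise ValueError("all arguments must be positive")
    # Sampling faster than the source frame rate would only repeat frames.
    target_fps = min(target_fps, source_fps)
    duration_seconds = total_frames / source_fps
    num_frames = max(1, int(duration_seconds * target_fps))
    step = total_frames / num_frames
    return [int(i * step) for i in range(num_frames)]


indices = fps_frame_indices(total_frames=240, source_fps=24.0, target_fps=2.0)
print(len(indices), indices[:3])  # prints 20 [0, 12, 24]
```

With this approach a longer video produces more frames, so the input length grows with duration.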

Custom frame sampling
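A custom sampling strategy is, at its core, a function that looks at video metadata and returns the frame indices to keep. The sketch below keeps the first, middle, and last frame. The function name and the "total_frames" metadata key are assumptions made for illustration; how such a function is handed to the processor depends on the Transformers version and is left out here.

```python
def first_middle_last(metadata):
    """Return indices of the first, middle, and last frame, without duplicates."""
    total = metadata["total_frames"]  # hypothetical metadata key
    if total <= 0:
        raise ValueError("the video must contain at least one frame")
    # A set removes duplicates for very short clips; sorting restores frame order.
    return sorted({0, total // 2, total - 1})


print(first_middle_last({"total_frames": 100}))  # prints [0, 50, 99]
```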

List of image frames

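Template configuration

You can create a custom chat template with Jinja and set it with apply_chat_template(). Refer to the template writing guide for more details.

For example, to let a template handle a list of content from multiple modalities while still supporting plain strings for text-only inference, specify how to handle content['type'] depending on whether it is an image or text, as shown in the Llama 3.2 Vision Instruct template below.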

{% for message in messages %}
{% if loop.index0 == 0 %}{{ bos_token }}{% endif %}
{{ '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n' }}
{% if message['content'] is string %}
{{ message['content'] }}
{% else %}
{% for content in message['content'] %}
{% if content['type'] == 'image' %}
{{ '<|image|>' }}
{% elif content['type'] == 'text' %}
{{ content['text'] }}
{% endif %}
{% endfor %}
{% endif %}
{{ '<|eot_id|>' }}
{% endfor %}
{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}
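To make the branching in this template easier to follow, the sketch below mirrors it in plain Python. It is not how the library renders templates (that is done with Jinja). It assumes the line breaks between tags above are only for readability and are not emitted, and the default bos_token value is an assumption. render_llama_vision_style is a hypothetical name.

```python
def render_llama_vision_style(messages, bos_token="<|begin_of_text|>", add_generation_prompt=False):
    """Mirror the template's logic: string content passes through, lists are dispatched by type."""
    parts = []
    for index, message in enumerate(messages):
        if index == 0:
            parts.append(bos_token)
        parts.append("<|start_header_id|>" + message["role"] + "<|end_header_id|>\n\n")
        content = message["content"]
        if isinstance(content, str):
            # Text-only inference: the content is a plain string.
            parts.append(content)
        else:
            for item in content:
                if item["type"] == "image":
                    parts.append("<|image|>")  # placeholder token for the image
                elif item["type"] == "text":
                    parts.append(item["text"])
        parts.append("<|eot_id|>")
    if add_generation_prompt:
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


chat = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "What are these?"}]}]
print(render_llama_vision_style(chat, add_generation_prompt=True))
```

Passing a message whose content is a plain string takes the first branch, which is what keeps text-only prompts working with the same template.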
