Idefics2

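Overview

The Idefics2 model was proposed in What matters when building vision-language models? by Léo Tronchon, Hugo Laurencon, and Victor Sanh. The accompanying blog post can be found here.

Idefics2 is an open multimodal model that accepts arbitrary sequences of image and text inputs and produces text outputs. The model can answer questions about images, describe visual content, create stories grounded on multiple images, or simply behave as a pure language model without visual inputs. It improves on IDEFICS-1, notably in document understanding, OCR, and visual reasoning. Idefics2 is lightweight (8 billion parameters) and processes images in their native aspect ratio and resolution, which allows for varying inference efficiency.

The abstract from the paper is the following:

The growing interest in vision-language models (VLMs) has been driven by improvements in large language models and vision transformers. Despite the abundance of literature on this subject, we observe that critical decisions regarding the design of VLMs are often not justified. We argue that these unsupported decisions impede progress in the field by making it difficult to identify which choices improve model performance. To address this issue, we conduct extensive experiments around pre-trained models, architecture choice, data, and training methods. Our consolidation of findings includes the development of Idefics2, an efficient foundational VLM of 8 billion parameters. Idefics2 achieves state-of-the-art performance within its size category across various multimodal benchmarks, and is often on par with models four times its size. We release the model (base, instructed, and chat) along with the datasets created for its training.

Idefics2 architecture. Taken from the original paper.

This model was contributed by amyeroberts. The original code can be found here.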

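Usage tips

- Each sample can contain multiple images, and the number of images can vary between samples. The processor pads the inputs to the maximum number of images in the batch before they are passed to the model.
- The processor has a `do_image_splitting` option. If `True`, each input image is split into 4 sub-images, which are concatenated with the original to form 5 images. This is useful for increasing model performance. If the model was not trained with this option, make sure `processor.image_processor.do_image_splitting` is set to `False`.
- The `text` passed to the processor should contain `<image>` tokens where the images should be inserted, and `<end_of_utterance>` at the end of each utterance if the text is a chat message.
- The processor has its own `apply_chat_template` method to convert chat messages to text, which can then be passed as `text` to the processor.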

Example of how to use the processor on chat messages:

import requests
from PIL import Image
from transformers import Idefics2Processor, Idefics2ForConditionalGeneration
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

url_1 = "http://images.cocodataset.org/val2017/000000039769.jpg"
url_2 = "http://images.cocodataset.org/val2017/000000219578.jpg"

image_1 = Image.open(requests.get(url_1, stream=True).raw)
image_2 = Image.open(requests.get(url_2, stream=True).raw)
images = [image_1, image_2]

messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "What’s the difference between these two images?"},
        {"type": "image"},
        {"type": "image"},
    ],
}]

processor = Idefics2Processor.from_pretrained("HuggingFaceM4/idefics2-8b")
model = Idefics2ForConditionalGeneration.from_pretrained("HuggingFaceM4/idefics2-8b")
model.to(device)

# at inference time, one needs to pass `add_generation_prompt=True` in order to make sure the model completes the prompt
text = processor.apply_chat_template(messages, add_generation_prompt=True)
print(text)
# 'User: What’s the difference between these two images?<image><image><end_of_utterance>\nAssistant:'

inputs = processor(images=images, text=text, return_tensors="pt").to(device)

generated_text = model.generate(**inputs, max_new_tokens=500)
generated_text = processor.batch_decode(generated_text, skip_special_tokens=True)[0]
print("Generated text:", generated_text)
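
To make the prompt format printed by `apply_chat_template` more concrete, here is a minimal sketch that rebuilds the same string by hand. `render_chat` is a hypothetical helper, not part of transformers, and it only covers the message layout used in the example above (a text part followed by image parts); the real chat template may treat other layouts differently.

```python
def render_chat(messages, add_generation_prompt=False):
    """Hypothetical stand-in that mimics the prompt string shown above."""
    text = ""
    for message in messages:
        # Role name is capitalized and followed by a colon, e.g. "User:"
        text += message["role"].capitalize() + ":"
        for part in message["content"]:
            if part["type"] == "text":
                text += " " + part["text"]
            elif part["type"] == "image":
                # Each image is represented by one placeholder token
                text += "<image>"
        # Every utterance is closed by the end-of-utterance token
        text += "<end_of_utterance>\n"
    if add_generation_prompt:
        # Leave the assistant turn open so the model completes it
        text += "Assistant:"
    return text


messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "What is in these two images?"},
        {"type": "image"},
        {"type": "image"},
    ],
}]
prompt = render_chat(messages, add_generation_prompt=True)
print(repr(prompt))
```

In real code, prefer `processor.apply_chat_template`, which uses the template shipped with the checkpoint.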

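At training time, it is important to determine which tokens the model should not learn. For Idefics2, this typically comes down to the image and padding tokens. This means that the labels can be created as follows: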

import requests
from PIL import Image
from transformers import Idefics2Processor, Idefics2ForConditionalGeneration
import torch

url_1 = "http://images.cocodataset.org/val2017/000000039769.jpg"
url_2 = "http://images.cocodataset.org/val2017/000000219578.jpg"

image_1 = Image.open(requests.get(url_1, stream=True).raw)
image_2 = Image.open(requests.get(url_2, stream=True).raw)
images = [image_1, image_2]

messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "What’s the difference between these two images?"},
        {"type": "image"},
        {"type": "image"},
    ],
},
{
    "role": "assistant",
    "content": [
        {"type": "text", "text": "The difference is that one image is about dogs and the other one about cats."},
    ],
}]

device = "cuda" if torch.cuda.is_available() else "cpu"

processor = Idefics2Processor.from_pretrained("HuggingFaceM4/idefics2-8b")
model = Idefics2ForConditionalGeneration.from_pretrained("HuggingFaceM4/idefics2-8b")
model.to(device)

text = processor.apply_chat_template(messages, add_generation_prompt=False)
inputs = processor(images=images, text=text, return_tensors="pt").to(device)

labels = inputs.input_ids.clone()
labels[labels == processor.tokenizer.pad_token_id] = -100
labels[labels == model.config.image_token_id] = -100

inputs["labels"] = labels

outputs = model(**inputs)
loss = outputs.loss
loss.backward()

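Note that when training Idefics2 on multi-turn conversations between a user and an assistant, one typically also sets all tokens corresponding to the user messages to -100.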
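
The masking of user turns can be sketched in plain Python as follows. This is only an illustration under stated assumptions: `mask_labels` and the per-token `roles` list are hypothetical (the processor does not return role information, so the caller would have to track it while tokenizing each turn), and the token ids are toy values apart from 32001, the default `image_token_id` of `Idefics2Config`.

```python
IGNORE_INDEX = -100  # label value skipped by the loss, as in the snippet above


def mask_labels(input_ids, roles, pad_token_id, image_token_id):
    """Return labels where user, padding and image tokens are ignored.

    `roles` is a hypothetical per-token list ("user" or "assistant")
    built by the caller; it is not an output of the processor.
    """
    labels = []
    for token_id, role in zip(input_ids, roles):
        ignore = (
            role == "user"
            or token_id == pad_token_id
            or token_id == image_token_id
        )
        labels.append(IGNORE_INDEX if ignore else token_id)
    return labels


# Toy sequence: two padding tokens, a user turn with one image, an assistant turn.
input_ids = [0, 0, 11, 32001, 12, 21, 22, 23]
roles = ["user"] * 5 + ["assistant"] * 3
labels = mask_labels(input_ids, roles, pad_token_id=0, image_token_id=32001)
print(labels)
```

With tensors, the same idea is usually expressed with a boolean mask over `input_ids`, in the same way the padding and image tokens are masked in the training snippet.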

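Model optimizations: Flash Attention

The code snippets above showcase inference without any optimization tricks. However, one can drastically speed up the model by leveraging Flash Attention, which is a faster implementation of the attention mechanism used inside the model.

First, make sure to install the latest version of Flash Attention 2 to include the sliding window attention feature.

pip install -U flash-attn --no-build-isolation

Also make sure that your hardware is compatible with Flash-Attention 2. Read more about it in the official documentation of the Flash Attention repository. Also make sure to load your model in half-precision (e.g. `torch.float16`).

To load and run a model using Flash Attention-2, change the model loading call in the code snippet above to the following: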

model = Idefics2ForConditionalGeneration.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    torch_dtype=torch.float16,  # added: load the weights in half precision
    attn_implementation="flash_attention_2",  # added: enable Flash Attention 2
).to(device)

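Shrinking down Idefics2 using quantization

As the Idefics2 model has 8 billion parameters, it requires about 16GB of GPU RAM in half precision (float16), since each parameter is stored in 2 bytes. However, one can shrink down the size of the model using quantization. If the model is quantized to 4 bits (or half a byte per parameter), the weights require only about 4GB of RAM.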
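
The arithmetic behind these figures can be checked with a short back-of-the-envelope sketch. `weight_memory_gb` is a hypothetical helper; it assumes exactly 8 billion parameters and counts only the weights, ignoring activations, the key/value cache and the metadata that quantization adds, so real memory usage will differ.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 8e9  # "8 billion parameters", rounded as in the text

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # 2 bytes per parameter
int4_gb = weight_memory_gb(NUM_PARAMS, 4)   # half a byte per parameter

print(f"float16: {fp16_gb:.1f} GB")
print(f"4-bit:   {int4_gb:.1f} GB")
```

Under these assumptions the count gives 16 GB in float16 and 4 GB at 4 bits.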

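Quantizing a model is as simple as passing a `quantization_config` to the model. The code snippet above can be changed as shown below. We use bitsandbytes quantization here (but refer to this page for other quantization methods):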

from transformers import BitsAndBytesConfig  # added

# added: 4-bit NF4 quantization with double quantization, computing in float16
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = Idefics2ForConditionalGeneration.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    torch_dtype=torch.float16,  # added
    quantization_config=quantization_config,  # added
).to(device)

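Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with Idefics2. If you are interested in submitting a resource to be included here, please feel free to open a Pull Request and we will review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

- A notebook on how to fine-tune Idefics2 on a custom dataset using the Trainer can be found here. It supports both full fine-tuning and (quantized) LoRA.
- A script showing how to fine-tune Idefics2 using the TRL library can be found here.
- A demo notebook on fine-tuning Idefics2 for JSON extraction use cases can be found here. 🌎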

Idefics2Config

class transformers.Idefics2Config

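< source >

( use_cache = True image_token_id = 32001 tie_word_embeddings = False vision_config = None perceiver_config = None text_config = None **kwargs )

Parameters

use_cache (bool, optional, defaults to True) — Whether or not the model should cache the key/value pairs of the attention mechanism.
image_token_id (int, optional, defaults to 32001) — The id of the "image" token.
tie_word_embeddings (bool, optional, defaults to False) — Whether or not to tie the word embeddings with the token embeddings.
vision_config (IdeficsVisionConfig or dict, optional) — Custom vision config or dict.
perceiver_config (IdeficsPerceiverConfig or dict, optional) — Custom perceiver config or dict.
text_config (MistralConfig or dict, optional) — Custom text config or dict for the text model.

This is the configuration class to store the configuration of an Idefics2Model. It is used to instantiate an Idefics2 model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the Idefics2 HuggingFaceM4/idefics2-8b architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example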

>>> from transformers import Idefics2Model, Idefics2Config
>>> # Initializing configuration
>>> configuration = Idefics2Config()
>>> # Initializing a model from the configuration
>>> model = Idefics2Model(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config

Idefics2Model

class transformers.Idefics2Model

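< source >

( config: Idefics2Config )

Parameters

config (Idefics2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

Idefics2 model consisting of a SIGLIP vision encoder and a Mistral language decoder.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.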

forward

< source >

( input_ids: typing.Optional[torch.LongTensor] = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.LongTensor] = None past_key_values: typing.Optional[list[torch.FloatTensor]] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None pixel_values: typing.Optional[torch.FloatTensor] = None pixel_attention_mask: typing.Optional[torch.BoolTensor] = None image_hidden_states: typing.Optional[torch.FloatTensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None cache_position: typing.Optional[torch.LongTensor] = None return_dict: typing.Optional[bool] = None **kwargs: typing_extensions.Unpack[transformers.modeling_flash_attention_utils.FlashAttentionKwargs] ) → transformers.models.idefics2.modeling_idefics2.Idefics2BaseModelOutputWithPast or tuple(torch.FloatTensor)

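Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default.

Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

What are input IDs?
attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
What are attention masks?
position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].

What are position IDs?
past_key_values (list[torch.FloatTensor], optional) — Pre-computed hidden states (keys and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists of the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True.

Two formats are allowed:
- a Cache instance, see our kv cache guide;
- a tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head). This is also known as the legacy cache format.
The model will output the same cache format that is fed as input. If no past_key_values are passed, the legacy cache format will be returned.

If past_key_values are used, the user can optionally input only the last input_ids (those that do not have their past key value states given to this model) of shape (batch_size, 1) instead of all input_ids of shape (batch_size, sequence_length).
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size), optional) — The tensors corresponding to the input images. Pixel values can be obtained using Idefics2ImageProcessor. See Idefics2ImageProcessor.__call__ for details (Idefics2Processor uses Idefics2ImageProcessor for processing images).
pixel_attention_mask (torch.Tensor of shape (batch_size, image_size, image_size), optional) — Mask to avoid performing attention on padding pixel indices.
image_hidden_states (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)) — The hidden states of the image encoder after modality projection and perceiver resampling.
use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
output_attentions (bool, optional) — Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrary to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.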

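transformers.models.idefics2.modeling_idefics2.Idefics2BaseModelOutputWithPast or tuple(torch.FloatTensor)

A transformers.models.idefics2.modeling_idefics2.Idefics2BaseModelOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Idefics2Config) and inputs.

last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden states at the output of the last layer of the model. If past_key_values is used, only the last hidden state of the sequences, of shape (batch_size, 1, hidden_size), is output.
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and, optionally if config.is_encoder_decoder=True, 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden states (keys and values in the self-attention blocks and, optionally if config.is_encoder_decoder=True, in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
hidden_states (tuple[torch.FloatTensor], optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple[torch.FloatTensor], optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
image_hidden_states (tuple(torch.FloatTensor), optional) — Tuple of torch.FloatTensor (one for the output of the image embeddings, (batch_size, num_images, sequence_length, hidden_size)). Image hidden states of the model produced by the vision encoder, and optionally by the perceiver.

The number of images fed to the model can be arbitrary. To handle this, the pixel_values fed to the model include image padding -> (batch_size, max_num_images, 3, max_heights, max_widths), where max_num_images is the maximum number of images among the batch_size samples in the batch.

Padding images are not needed beyond padding the pixel_values at the entrance of the model. For efficiency, only the real images are passed through the forward of the vision_model and the padding images are discarded, i.e. pixel_values of size (image_batch_size, 3, height, width), where image_batch_size would be 7 when num_images_per_sample=[1, 3, 1, 2] and max_num_images would be 3.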
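
The relationship between `max_num_images` and `image_batch_size` can be sketched in plain Python. `image_padding_layout` is a hypothetical helper written for illustration; it only does the bookkeeping and is not how the model locates padding images internally.

```python
def image_padding_layout(num_images_per_sample):
    """Compute padding bookkeeping for a batch with a varying image count."""
    # Every sample is padded up to the largest image count in the batch
    max_num_images = max(num_images_per_sample)
    # Only real images need to go through the vision model
    image_batch_size = sum(num_images_per_sample)
    # True marks a real image slot, False marks a padding slot
    real_image_mask = [
        [slot < count for slot in range(max_num_images)]
        for count in num_images_per_sample
    ]
    return max_num_images, image_batch_size, real_image_mask


max_num_images, image_batch_size, real_image_mask = image_padding_layout([1, 3, 1, 2])
print(max_num_images, image_batch_size)
for row in real_image_mask:
    print(row)
```

For `num_images_per_sample=[1, 3, 1, 2]` the padded batch holds 4 x 3 = 12 image slots, of which 7 are real images.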

Idefics2ForConditionalGeneration

class transformers.Idefics2ForConditionalGeneration

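< source >

( config )

Parameters

config (Idefics2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The Idefics2 model with a language modeling head. It is made up of a SigLIP vision encoder with a language modeling head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.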

forward

< source >

( input_ids: typing.Optional[torch.LongTensor] = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.LongTensor] = None past_key_values: typing.Optional[list[torch.FloatTensor]] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None pixel_values: typing.Optional[torch.FloatTensor] = None pixel_attention_mask: typing.Optional[torch.BoolTensor] = None image_hidden_states: typing.Optional[torch.FloatTensor] = None labels: typing.Optional[torch.LongTensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None cache_position: typing.Optional[torch.LongTensor] = None logits_to_keep: typing.Union[int, torch.Tensor] = 0 **kwargs: typing_extensions.Unpack[transformers.models.idefics2.modeling_idefics2.KwargsForCausalLM] ) → transformers.models.idefics2.modeling_idefics2.Idefics2CausalLMOutputWithPast or tuple(torch.FloatTensor)

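Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default.

Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

What are input IDs?
attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
What are attention masks?
position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].

What are position IDs?
past_key_values (list[torch.FloatTensor], optional) — Pre-computed hidden states (keys and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists of the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True.

Two formats are allowed:
- a Cache instance, see our kv cache guide;
- a tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head). This is also known as the legacy cache format.
The model will output the same cache format that is fed as input. If no past_key_values are passed, the legacy cache format will be returned.

If past_key_values are used, the user can optionally input only the last input_ids (those that do not have their past key value states given to this model) of shape (batch_size, 1) instead of all input_ids of shape (batch_size, sequence_length).
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size), optional) — The tensors corresponding to the input images. Pixel values can be obtained using Idefics2ImageProcessor. See Idefics2ImageProcessor.__call__ for details (Idefics2Processor uses Idefics2ImageProcessor for processing images).
pixel_attention_mask (torch.Tensor of shape (batch_size, image_size, image_size), optional) — Mask to avoid performing attention on padding pixel indices.
image_hidden_states (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)) — The hidden states of the image encoder after modality projection and perceiver resampling.
labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.vocab_size] or model.image_token_id (where model is your instance of Idefics2ForConditionalGeneration). Tokens with indices set to model.image_token_id are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
output_attentions (bool, optional) — Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrary to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
logits_to_keep (Union[int, torch.Tensor], defaults to 0) — If an int, compute logits for the last logits_to_keep tokens. If 0, calculate logits for all input_ids (special case). Only the last token's logits are needed for generation, and calculating them only for that token can save memory, which becomes significant for long sequences or large vocabulary sizes. If a torch.Tensor, it must be 1D and correspond to the indices to keep in the sequence length dimension. This is useful when using a packed tensor format (single dimension for batch and sequence length).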
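
The selection rule described for `logits_to_keep` can be sketched on a plain Python list. `select_positions` is a hypothetical helper that operates on a list of per-position items instead of a tensor, and a list of indices stands in for the 1D tensor case; it illustrates the rule as documented above, not the library's actual code.

```python
def select_positions(sequence, logits_to_keep=0):
    """Pick the sequence positions for which logits would be computed."""
    if isinstance(logits_to_keep, int):
        # slice(-0, None) is slice(0, None), so 0 keeps every position
        return sequence[slice(-logits_to_keep, None)]
    # Otherwise treat the argument as explicit indices to keep
    return [sequence[index] for index in logits_to_keep]


hidden = ["h0", "h1", "h2", "h3", "h4"]  # stand-ins for per-token hidden states

keep_all = select_positions(hidden, 0)
keep_last = select_positions(hidden, 1)
keep_two = select_positions(hidden, 2)
keep_indices = select_positions(hidden, [0, 2])
print(keep_all, keep_last, keep_two, keep_indices)
```

During generation only the last position is needed, which is why passing 1 avoids materializing logits for the whole prompt.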

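transformers.models.idefics2.modeling_idefics2.Idefics2CausalLMOutputWithPast or tuple(torch.FloatTensor)

A transformers.models.idefics2.modeling_idefics2.Idefics2CausalLMOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Idefics2Config) and inputs.

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction).
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed hidden states (keys and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
hidden_states (tuple[torch.FloatTensor], optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple[torch.FloatTensor], optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
image_hidden_states (tuple(torch.FloatTensor), optional) — Tuple of torch.FloatTensor (one for the output of the image embeddings, (batch_size, num_images, sequence_length, hidden_size)). Image hidden states of the model produced by the vision encoder, and optionally by the perceiver.

The Idefics2ForConditionalGeneration forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example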

>>> import requests
>>> import torch
>>> from PIL import Image
>>> from io import BytesIO

>>> from transformers import AutoProcessor, AutoModelForVision2Seq
>>> from transformers.image_utils import load_image

>>> # Note that passing the image urls (instead of the actual pil images) to the processor is also possible
>>> image1 = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
>>> image2 = load_image("https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg")
>>> image3 = load_image("https://cdn.britannica.com/68/170868-050-8DDE8263/Golden-Gate-Bridge-San-Francisco.jpg")

>>> processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b-base")
>>> model = AutoModelForVision2Seq.from_pretrained("HuggingFaceM4/idefics2-8b-base", device_map="auto")

>>> BAD_WORDS_IDS = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids
>>> EOS_WORDS_IDS = [processor.tokenizer.eos_token_id]

>>> # Create inputs
>>> prompts = [
...   "<image>In this image, we can see the city of New York, and more specifically the Statue of Liberty.<image>In this image,",
...   "In which city is that bridge located?<image>",
... ]
>>> images = [[image1, image2], [image3]]
>>> inputs = processor(images=images, text=prompts, padding=True, return_tensors="pt").to("cuda")

>>> # Generate
>>> generated_ids = model.generate(**inputs, bad_words_ids=BAD_WORDS_IDS, max_new_tokens=20)
>>> generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)

>>> print(generated_texts)
['In this image, we can see the city of New York, and more specifically the Statue of Liberty. In this image, we can see the city of New York, and more specifically the Statue of Liberty.\n\n', 'In which city is that bridge located?\n\nThe bridge is located in the city of Pittsburgh, Pennsylvania.\n\n\nThe bridge is']

Idefics2ImageProcessor

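class transformers.Idefics2ImageProcessor

< source >

( do_convert_rgb: bool = True do_resize: bool = True size: typing.Optional[dict[str, int]] = None resample: Resampling = <Resampling.BILINEAR: 2> do_rescale: bool = True rescale_factor: float = 0.00392156862745098 do_normalize: bool = True image_mean: typing.Union[float, list[float], NoneType] = None image_std: typing.Union[float, list[float], NoneType] = None do_pad: bool = True do_image_splitting: bool = False **kwargs )

Parameters

do_convert_rgb (bool, optional, defaults to True) — Whether to convert the image to RGB. This is useful if the input image is of a different format, e.g. RGBA. Only has an effect if the input image is in the PIL format.
do_resize (bool, optional, defaults to True) — Whether to resize the image. The longest edge of the image is resized to be <= size["longest_edge"], with the shortest edge resized to keep the input aspect ratio, with a minimum size of size["shortest_edge"].
size (Dict, optional) — Controls the size of the output image. This is a dictionary containing the keys "shortest_edge" and "longest_edge".
resample (Resampling, optional, defaults to Resampling.BILINEAR) — Resampling filter to use when resizing the image.
do_rescale (bool, optional, defaults to True) — Whether to rescale the image. If set to True, the image is rescaled to have pixel values between 0 and 1.
rescale_factor (float, optional, defaults to 1/255) — Rescale factor to rescale the image by if do_rescale is set to True.
do_normalize (bool, optional, defaults to True) — Whether to normalize the image. If set to True, the image is normalized to have a mean of image_mean and a standard deviation of image_std.
image_mean (float or list[float], optional, defaults to IDEFICS_STANDARD_MEAN) — Mean to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the image_mean parameter in the preprocess method.
image_std (float or list[float], optional, defaults to IDEFICS_STANDARD_STD) — Standard deviation to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the image_std parameter in the preprocess method.
do_pad (bool, optional, defaults to True) — Whether or not to pad the images to the largest height and width in the batch and to the number of images per sample in the batch, such that the returned tensor is of shape (batch_size, max_num_images, num_channels, max_height, max_width).
do_image_splitting (bool, optional, defaults to False) — Whether to split the image into a sequence of 4 equal sub-images concatenated with the original image. That strategy was first introduced in https://huggingface.co/papers/2311.06607.

Constructs an Idefics image processor.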
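
As an illustration of what `do_image_splitting` means geometrically, the sketch below computes crop boxes for four quadrants plus the full image. `split_boxes` is a hypothetical helper; the exact crop coordinates and ordering used by the image processor are an assumption here, so treat this as a picture of the idea only.

```python
def split_boxes(width, height):
    """Return (left, top, right, bottom) boxes: 4 quadrants, then the full image."""
    mid_w, mid_h = width // 2, height // 2
    return [
        (0, 0, mid_w, mid_h),           # top-left quadrant
        (mid_w, 0, width, mid_h),       # top-right quadrant
        (0, mid_h, mid_w, height),      # bottom-left quadrant
        (mid_w, mid_h, width, height),  # bottom-right quadrant
        (0, 0, width, height),          # original image
    ]


boxes = split_boxes(640, 480)
for box in boxes:
    print(box)
```

Each box could be passed to a cropping routine such as PIL's `Image.crop`, which accepts a `(left, upper, right, lower)` tuple. One input image then yields 5 images, so the number of image tokens in the prompt grows accordingly.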

preprocess

< source >

( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] do_convert_rgb: typing.Optional[bool] = None do_resize: typing.Optional[bool] = None size: typing.Optional[dict[str, int]] = None resample: Resampling = None do_rescale: typing.Optional[bool] = None rescale_factor: typing.Optional[float] = None do_normalize: typing.Optional[bool] = None image_mean: typing.Union[float, list[float], NoneType] = None image_std: typing.Union[float, list[float], NoneType] = None do_pad: typing.Optional[bool] = None do_image_splitting: typing.Optional[bool] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None input_data_format: typing.Optional[transformers.image_utils.ChannelDimension] = None data_format: typing.Optional[transformers.image_utils.ChannelDimension] = <ChannelDimension.FIRST: 'channels_first'> )

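Parameters

images (ImageInput) — A list of images to preprocess.
do_convert_rgb (bool, optional, defaults to self.do_convert_rgb) — Whether to convert the image to RGB.
do_resize (bool, optional, defaults to self.do_resize) — Whether to resize the image.
size (dict[str, int], optional, defaults to self.size) — Size of the image after resizing. The shortest edge of the image is resized to size["shortest_edge"], with the longest edge resized to keep the input aspect ratio.
resample (int, optional, defaults to self.resample) — Resampling filter to use if resizing the image. This can be one of the enum PILImageResampling. Only has an effect if do_resize is set to True.
do_rescale (bool, optional, defaults to self.do_rescale) — Whether to rescale the image.
rescale_factor (float, optional, defaults to self.rescale_factor) — Rescale factor to rescale the image by if do_rescale is set to True.
do_normalize (bool, optional, defaults to self.do_normalize) — Whether to normalize the image.
image_mean (float or list[float], optional, defaults to self.image_mean) — Image mean to use for normalization. Only has an effect if do_normalize is set to True.
image_std (float or list[float], optional, defaults to self.image_std) — Image standard deviation to use for normalization. Only has an effect if do_normalize is set to True.
do_pad (bool, optional, defaults to self.do_pad) — Whether or not to pad the images to the largest height and width in the batch.
do_image_splitting (bool, optional, defaults to self.do_image_splitting) — Whether to split the image into a sequence of 4 equal sub-images concatenated with the original image. That strategy was first introduced in https://huggingface.co/papers/2311.06607.
return_tensors (str or TensorType, optional) — The type of tensors to return. Can be one of:
- Unset: Return a list of np.ndarray.
- TensorType.TENSORFLOW or 'tf': Return a batch of type tf.Tensor.
- TensorType.PYTORCH or 'pt': Return a batch of type torch.Tensor.
- TensorType.NUMPY or 'np': Return a batch of type np.ndarray.
- TensorType.JAX or 'jax': Return a batch of type jax.numpy.ndarray.
data_format (ChannelDimension or str, optional, defaults to ChannelDimension.FIRST) — The channel dimension format for the output image. Can be one of:
- "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
- Unset: Use the channel dimension format of the input image.
input_data_format (ChannelDimension or str, optional) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
- "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
- "none" or ChannelDimension.NONE: image in (height, width) format.

Preprocess a batch of images.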
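
The rescale and normalize steps are simple per-pixel arithmetic, sketched below for a single channel value. `rescale_and_normalize` is a hypothetical helper, and the `mean=0.5` and `std=0.5` values are placeholders chosen for readability; they are assumptions, not the values of IDEFICS_STANDARD_MEAN or IDEFICS_STANDARD_STD.

```python
def rescale_and_normalize(pixel, rescale_factor=1 / 255, mean=0.5, std=0.5):
    """Map a raw 0..255 channel value to its normalized value."""
    rescaled = pixel * rescale_factor  # do_rescale: bring the value into [0, 1]
    return (rescaled - mean) / std     # do_normalize: center and scale


darkest = rescale_and_normalize(0)
brightest = rescale_and_normalize(255)
print(darkest, brightest)
```

With these placeholder statistics, raw values in [0, 255] end up approximately in [-1, 1]. If the input is already in [0, 1], rescaling should be skipped (`do_rescale=False`), otherwise the values are scaled twice.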

Idefics2ImageProcessorFast

class transformers.Idefics2ImageProcessorFast

< source >

( **kwargs: typing_extensions.Unpack[transformers.image_processing_utils_fast.DefaultFastImageProcessorKwargs] )

Constructs a fast Idefics2 image processor.

preprocess

< source >

( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] **kwargs: typing_extensions.Unpack[transformers.models.idefics2.image_processing_idefics2_fast.Idefics2FastImageProcessorKwargs] ) → <class 'transformers.image_processing_base.BatchFeature'>

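Parameters

images (Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']]) — Image to preprocess. Expects a single image or a batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.
do_resize (bool, optional) — Whether to resize the image.
size (dict[str, int], optional) — Describes the maximum input dimensions to the model.
default_to_square (bool, optional) — Whether to default to a square image when resizing, if size is an int.
resample (Union[PILImageResampling, F.InterpolationMode, NoneType]) — Resampling filter to use if resizing the image. This can be one of the enum PILImageResampling. Only has an effect if do_resize is set to True.
do_center_crop (bool, optional) — Whether to center crop the image.
crop_size (dict[str, int], optional) — Size of the output image after applying center_crop.
do_rescale (bool, optional) — Whether to rescale the image.
rescale_factor (Union[int, float, NoneType]) — Rescale factor to rescale the image by if do_rescale is set to True.
do_normalize (bool, optional) — Whether to normalize the image.
image_mean (Union[float, list[float], NoneType]) — Image mean to use for normalization. Only has an effect if do_normalize is set to True.
image_std (Union[float, list[float], NoneType]) — Image standard deviation to use for normalization. Only has an effect if do_normalize is set to True.
do_convert_rgb (bool, optional) — Whether to convert the image to RGB.
return_tensors (Union[str, ~utils.generic.TensorType, NoneType]) — Returns stacked tensors if set to "pt", otherwise returns a list of tensors.
data_format (~image_utils.ChannelDimension, optional) — Only ChannelDimension.FIRST is supported. Added for compatibility with slow processors.
input_data_format (Union[str, ~image_utils.ChannelDimension, NoneType]) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
- "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
- "none" or ChannelDimension.NONE: image in (height, width) format.
device (torch.device, optional) — The device to process the images on. If unset, the device is inferred from the input images.
disable_grouping (bool, optional) — Whether to disable grouping of images by size, so that they are processed individually and not in batches. If None, it is set to True if the images are on CPU, and to False otherwise. This choice is based on empirical observations, as detailed here: https://github.com/huggingface/transformers/pull/38157
do_image_splitting (bool, optional, defaults to False) — Whether to split the image into a sequence of 4 equal sub-images concatenated with the original image.
do_pad (bool, optional, defaults to True) — Whether to pad the images to the largest height and width in the batch.

<class 'transformers.image_processing_base.BatchFeature'>

data (dict) — Dictionary of lists/arrays/tensors returned by the __call__ method ("pixel_values", etc.).
tensor_type (Union[None, str, TensorType], optional) — You can give a `tensor_type` here to convert the lists of integers into PyTorch/TensorFlow/Numpy tensors at initialization.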

Idefics2Processor

class transformers.Idefics2Processor

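< source >

( image_processor tokenizer = None image_seq_len: int = 64 chat_template: typing.Optional[str] = None **kwargs )

Parameters

image_processor (Idefics2ImageProcessor) — An instance of Idefics2ImageProcessor. The image processor is a required input.
tokenizer (PreTrainedTokenizerBase, optional) — An instance of PreTrainedTokenizerBase. This should correspond to the model's text model. The tokenizer is a required input.
image_seq_len (int, optional, defaults to 64) — The length of the image sequence, i.e. the number of image tokens per image in the input. This parameter is used to build the string from the input prompt and image tokens and should match the config.perceiver_config.resampler_n_latents value for the model used.
chat_template (str, optional) — A Jinja template used to convert lists of messages in a chat into a tokenizable string.

Constructs an IDEFICS2 processor which wraps a LLama tokenizer and an IDEFICS2 image processor into a single processor.

IdeficsProcessor offers all the functionalities of Idefics2ImageProcessor and LlamaTokenizerFast. See the docstring of __call__() and decode() for more information.

__call__

< source >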

( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], ForwardRef('torch.Tensor')], list[typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], ForwardRef('torch.Tensor']]], list[list[typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], ForwardRef('torch.Tensor']]]]] = None text: typing.Union[str, ForwardRef('PreTokenizedInput'), list[str], list['PreTokenizedInput']] = None audio = None videos = None **kwargs: typing_extensions.Unpack[transformers.models.idefics2.processing_idefics2.Idefics2ProcessorKwargs] )

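Parameters

images (PIL.Image.Image, np.ndarray, torch.Tensor, list[PIL.Image.Image], list[np.ndarray], list[torch.Tensor], optional) — The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch tensor. If it is of type list[ImageInput], it is assumed that this is for a single prompt, i.e. of batch size 1.
text (Union[TextInput, PreTokenizedInput, list[TextInput], list[PreTokenizedInput]], optional) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as lists of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).

Wherever an image token, `<image>`, is encountered, it is expanded to `<fake_token_around_image>` + `<image>` * `image_seq_len` + `<fake_token_around_image>`.
return_tensors (Union[str, TensorType], optional) — If set, will return tensors of a particular framework. See PreTrainedTokenizerFast.__call__() for more information.

Processes the input prompts and returns a BatchEncoding.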
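
The way a single `<image>` placeholder grows into a run of image tokens can be sketched with plain string replacement. `expand_image_tokens` is a hypothetical helper, and the pattern (one `<fake_token_around_image>` on each side of `image_seq_len` copies of `<image>`) is inferred from the decoded tokens in the processor example on this page. It does not cover image splitting or how the tokenizer later handles whitespace.

```python
FAKE_TOKEN = "<fake_token_around_image>"
IMAGE_TOKEN = "<image>"


def expand_image_tokens(text, image_seq_len):
    """Replace each image placeholder with its expanded token run."""
    expanded = FAKE_TOKEN + IMAGE_TOKEN * image_seq_len + FAKE_TOKEN
    # str.replace does not rescan inserted text, so each placeholder expands once
    return text.replace(IMAGE_TOKEN, expanded)


prompt = expand_image_tokens("bla bla bla<image>", image_seq_len=2)
print(prompt)
```

With the default `image_seq_len=64`, every image therefore adds 66 tokens to the prompt under this pattern, which is worth keeping in mind when budgeting sequence length.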

Example

>>> import requests
>>> from transformers import Idefics2Processor
>>> from transformers.image_utils import load_image

>>> processor = Idefics2Processor.from_pretrained("HuggingFaceM4/idefics2-8b", image_seq_len=2)
>>> processor.image_processor.do_image_splitting = False  # Force as False to simplify the example

>>> url1 = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
>>> url2 = "https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg"

>>> image1, image2 = load_image(url1), load_image(url2)
>>> images = [[image1], [image2]]

>>> text = [
...     "<image>In this image, we see",
...     "bla bla bla<image>",
... ]
>>> outputs = processor(images=images, text=text, return_tensors="pt", padding=True)
>>> input_ids = outputs.input_ids
>>> input_tokens = processor.tokenizer.batch_decode(input_ids)
>>> print(input_tokens)
['<s><fake_token_around_image><image><image><fake_token_around_image> In this image, we see', '<s> bla bla bla<fake_token_around_image><image><image><fake_token_around_image>']
