Idefics3

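Overview

The Idefics3 model was proposed in Building and better understanding vision-language models: insights and future directions by Hugo Laurençon, Andrés Marafioti, Victor Sanh, and Léo Tronchon.

Idefics3 is an adaptation of the Idefics2 model with three main differences:

- It uses Llama3 for the text model.
- It uses an updated processing logic for the images.
- It removes the perceiver.

The abstract from the paper is the following:

The field of vision-language models (VLMs), which take images and texts as inputs and output texts, is rapidly evolving and has yet to reach consensus on several key aspects of the development pipeline, including data, architecture, and training methods. This paper can be seen as a tutorial for building a VLM. We begin by providing a comprehensive overview of the current state-of-the-art approaches, highlighting the strengths and weaknesses of each, addressing the major challenges in the field, and suggesting promising research directions for underexplored areas. We then walk through the practical steps to build Idefics3-8B, a powerful VLM that significantly outperforms its predecessor Idefics2-8B, while being trained efficiently, exclusively on open datasets, and using a straightforward pipeline. These steps include the creation of Docmatix, a dataset for improving document understanding capabilities, which is 240 times larger than previously available datasets. We release the model along with the datasets created for its training.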

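Usage tips

Input images are processed either by upsampling (if resizing is enabled) or at their original resolution. The resizing behavior depends on two parameters: do_resize and size.

If do_resize is set to True, the model resizes images so that the longest edge is 4*364 pixels by default. The default resizing behavior can be customized by passing a dictionary to the size parameter. For example, `{"longest_edge": 4 * 364}` is the default, but you can change it to a different value if needed.

Here is how to control resizing and set a custom size:

from transformers import Idefics3ImageProcessor

image_processor = Idefics3ImageProcessor(do_resize=True, size={"longest_edge": 2 * 364}, max_image_size={"longest_edge": 364})

Additionally, the max_image_size parameter, which controls the size of each square patch the image is decomposed into, is set to 364 by default but can be adjusted as needed. After resizing (if applicable), the image processor decomposes the images into square patches based on the max_image_size parameter.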
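The arithmetic behind these two parameters can be sketched in plain Python. The helper below is hypothetical and not part of transformers: it scales an image so that its longest edge equals the configured longest edge, then rounds each side up to a multiple of the patch size and counts the resulting square patches. The rounding rule is an assumption made for illustration, so the library's exact output sizes may differ.

```python
import math


def sketch_resize_and_tile(width, height, longest_edge=4 * 364, patch=364):
    """Hypothetical helper: estimate resized dimensions and the patch grid."""
    # Scale so that the longest edge matches `longest_edge`, keeping the aspect ratio.
    scale = longest_edge / max(width, height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    # Assumption: each side is rounded up to a whole number of square patches.
    cols = math.ceil(new_w / patch)
    rows = math.ceil(new_h / patch)
    return {
        "resized": (cols * patch, rows * patch),
        "rows": rows,
        "cols": cols,
        "num_patches": rows * cols,
    }


# A 1000x500 image with the default settings, then with a smaller longest edge.
print(sketch_resize_and_tile(1000, 500))
print(sketch_resize_and_tile(1000, 500, longest_edge=2 * 364))
```

Halving the longest edge in this sketch cuts the patch count by roughly four, which is the main reason to lower size["longest_edge"] when memory is tight.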

This model was contributed by amyeroberts and andimarafioti.

Idefics3Config

class transformers.Idefics3Config


( use_cache = True image_token_id = 128257 tie_word_embeddings = False vision_config = None text_config = None scale_factor = 2 pad_token_id = 128002 **kwargs )

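Parameters

use_cache (bool, optional, defaults to True) — Whether or not the model should cache the key/value pairs of the attention mechanism. Only relevant if config.is_decoder=True.
image_token_id (int, optional, defaults to 128257) — The id of the "image" token.
tie_word_embeddings (bool, optional, defaults to False) — Whether or not to tie the word embeddings with the token embeddings.
vision_config (Idefics3VisionConfig or dict, optional, defaults to Idefics3VisionConfig) — Custom vision config or dict for the vision tower.
text_config (PretrainedConfig or dict, optional, defaults to LlamaConfig) — Custom text config or dict for the text model.
scale_factor (int, optional, defaults to 2) — The scale factor for the image encoder.
pad_token_id (int, optional, defaults to 128002) — The id of the padding token.

This is the configuration class to store the configuration of an Idefics3Model. It is used to instantiate an Idefics3 model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the HuggingFaceM4/Idefics3-8B-Llama3 architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example: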

>>> from transformers import Idefics3Model, Idefics3Config
>>> # Initializing configuration
>>> configuration = Idefics3Config()
>>> # Initializing a model from the configuration
>>> model = Idefics3Model(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config

Idefics3VisionConfig

class transformers.Idefics3VisionConfig


( hidden_size = 1152 intermediate_size = 3072 num_hidden_layers = 12 num_attention_heads = 16 num_channels = 3 image_size = 224 patch_size = 32 hidden_act = 'gelu_pytorch_tanh' layer_norm_eps = 1e-06 attention_dropout = 0.0 initializer_range = 0.02 **kwargs )

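Parameters

hidden_size (int, optional, defaults to 1152) — Dimensionality of the encoder layers and the pooler layer.
intermediate_size (int, optional, defaults to 3072) — Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
num_hidden_layers (int, optional, defaults to 12) — Number of hidden layers in the Transformer encoder.
num_attention_heads (int, optional, defaults to 16) — Number of attention heads for each attention layer in the Transformer encoder.
num_channels (int, optional, defaults to 3) — Number of channels in the input images.
image_size (int, optional, defaults to 224) — The size (resolution) of each image.
patch_size (int, optional, defaults to 32) — The size (resolution) of each patch.
hidden_act (str or function, optional, defaults to "gelu_pytorch_tanh") — The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "selu", "gelu_new" and "quick_gelu" are supported.
layer_norm_eps (float, optional, defaults to 1e-06) — The epsilon used by the layer normalization layers.
attention_dropout (float, optional, defaults to 0.0) — The dropout ratio for the attention probabilities.
initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

This is the configuration class to store the configuration of an Idefics3VisionModel. It is used to instantiate an Idefics3 vision encoder according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the SigLIP checkpoint google/siglip-base-patch16-224 used in the Idefics3 model HuggingFaceM4/Idefics3-8B-Llama3.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.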
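A few derived quantities follow directly from the default values in the signature above. The snippet below is only a back-of-the-envelope sketch in plain Python: it assumes the usual Vision Transformer conventions that an image is cut into a square grid of non-overlapping patches and that the hidden size is split evenly across attention heads.

```python
# Default values copied from the Idefics3VisionConfig signature.
hidden_size = 1152
num_attention_heads = 16
image_size = 224
patch_size = 32

# Assumption: non-overlapping square patches on a square image.
patches_per_side = image_size // patch_size
num_patches = patches_per_side ** 2

# Assumption: the hidden size is divided evenly across attention heads.
head_dim = hidden_size // num_attention_heads

print(f"patches per side: {patches_per_side}")
print(f"patches per image: {num_patches}")
print(f"dimension per attention head: {head_dim}")
```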

Example:

>>> from transformers.models.idefics3.modeling_idefics3 import Idefics3VisionTransformer
>>> from transformers.models.idefics3.configuration_idefics3 import Idefics3VisionConfig

>>> # Initializing a Idefics3VisionConfig with google/siglip-base-patch16-224 style configuration
>>> configuration = Idefics3VisionConfig()

>>> # Initializing a Idefics3VisionTransformer (with random weights) from the google/siglip-base-patch16-224 style configuration
>>> model = Idefics3VisionTransformer(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

Idefics3VisionTransformer

class transformers.Idefics3VisionTransformer


( config: Idefics3VisionConfig )

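Parameters

config (Idefics3VisionConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The Idefics3 Vision Transformer model, which outputs raw image embeddings. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.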

Idefics3Model

class transformers.Idefics3Model


( config: Idefics3Config )

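Parameters

config (Idefics3Config or Idefics3VisionConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The Idefics3 model consists of a SigLIP vision encoder and a Llama3 language decoder. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.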

forward


( input_ids: LongTensor = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.LongTensor] = None past_key_values: typing.Optional[typing.List[torch.FloatTensor]] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None pixel_values: typing.Optional[torch.FloatTensor] = None pixel_attention_mask: typing.Optional[torch.BoolTensor] = None image_hidden_states: typing.Optional[torch.FloatTensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None cache_position: typing.Optional[torch.LongTensor] = None return_dict: typing.Optional[bool] = None )

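Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.

Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

What are input IDs?
attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
What are attention masks?

Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

If past_key_values is used, optionally only the last decoder_input_ids have to be input (see past_key_values).

If you want to change padding behavior, you should read modeling_opt._prepare_decoder_attention_mask and modify to your needs. See diagram 1 in the paper for more information on the default strategy.
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1]. What are position IDs?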
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).

Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don't have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)) — The tensors corresponding to the input images. Pixel values can be obtained using AutoImageProcessor. See CLIPImageProcessor.__call__() for details (LlavaProcessor uses CLIPImageProcessor for processing images).
pixel_attention_mask (torch.Tensor of shape (batch_size, image_size, image_size), optional) — Mask to avoid performing attention on padding pixel indices.
image_hidden_states (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)) — The hidden states of the image encoder after modality projection.
use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrary to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
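The attention_mask convention described in the parameter list (1 for real tokens, 0 for padding) can be illustrated with plain Python lists. This is a minimal sketch: the two helpers are hypothetical, the token IDs are made up, and left padding is an assumption chosen for illustration. In practice the processor returns the mask for you, so deriving it from the pad token ID is for explanation only.

```python
PAD_TOKEN_ID = 128002  # default pad_token_id listed for Idefics3Config


def left_pad(sequences, pad_token_id=PAD_TOKEN_ID):
    """Hypothetical helper: left-pad lists of token IDs to a common length."""
    longest = max(len(seq) for seq in sequences)
    return [[pad_token_id] * (longest - len(seq)) + list(seq) for seq in sequences]


def build_attention_mask(padded, pad_token_id=PAD_TOKEN_ID):
    """Return 1 for tokens that are not masked and 0 for padding tokens."""
    return [[0 if tok == pad_token_id else 1 for tok in seq] for seq in padded]


# Made-up token IDs, for illustration only.
batch = left_pad([[11, 12, 13, 14], [21, 22]])
mask = build_attention_mask(batch)
print(batch)
print(mask)
```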

The Idefics3Model forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Inputs fed to the model can have an arbitrary number of images. To account for this, the pixel_values fed to the model have image padding -> (batch_size, max_num_images, 3, max_heights, max_widths), where max_num_images is the maximum number of images among the batch_size samples in the batch. Padding images are not needed beyond padding the pixel_values at the entrance of the model. For efficiency, only the real images are passed through the vision_model's forward by discarding the padding images, i.e. pixel_values of size (image_batch_size, 3, height, width), where image_batch_size would be 7 when num_images_per_sample=[1, 3, 1, 2] and max_num_images would be 3.
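The numbers in the paragraph above can be reproduced with a short sketch. The helper below is hypothetical and works on image counts only, not on tensors: it reports how many image slots the padded batch has per sample, how many real images would reach the vision model, and which slots hold real images as opposed to padding.

```python
def summarize_image_padding(num_images_per_sample):
    """Hypothetical helper: describe the padded image layout of a batch."""
    # Every sample is padded up to the largest image count in the batch.
    max_num_images = max(num_images_per_sample)
    # Only real images are kept for the vision model's forward pass.
    image_batch_size = sum(num_images_per_sample)
    # True marks a real image slot, False marks a padding slot.
    real_image_mask = [
        [slot < count for slot in range(max_num_images)]
        for count in num_images_per_sample
    ]
    return max_num_images, image_batch_size, real_image_mask


max_num_images, image_batch_size, real_image_mask = summarize_image_padding([1, 3, 1, 2])
print(max_num_images, image_batch_size)
for row in real_image_mask:
    print(row)
```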

Idefics3ForConditionalGeneration

class transformers.Idefics3ForConditionalGeneration


( config )

Parameters

config (Idefics3Config or Idefics3VisionConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The Idefics3 Model with a language modeling head. It is made up of a SigLIP vision encoder, with a language modeling head on top. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward


( input_ids: LongTensor = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.LongTensor] = None past_key_values: typing.Optional[typing.List[torch.FloatTensor]] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None pixel_values: typing.Optional[torch.FloatTensor] = None pixel_attention_mask: typing.Optional[torch.BoolTensor] = None image_hidden_states: typing.Optional[torch.FloatTensor] = None labels: typing.Optional[torch.LongTensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None cache_position: typing.Optional[torch.LongTensor] = None return_dict: typing.Optional[bool] = None logits_to_keep: typing.Union[int, torch.Tensor] = 0 ) → transformers.models.idefics3.modeling_idefics3.Idefics3CausalLMOutputWithPast or tuple(torch.FloatTensor)

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.

Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

What are input IDs?
attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
What are attention masks?

Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

If past_key_values is used, optionally only the last decoder_input_ids have to be input (see past_key_values).

If you want to change padding behavior, you should read modeling_opt._prepare_decoder_attention_mask and modify to your needs. See diagram 1 in the paper for more information on the default strategy.
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1]. What are position IDs?
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).

Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)) — The tensors corresponding to the input images. Pixel values can be obtained using AutoImageProcessor. See CLIPImageProcessor.__call__() for details (LlavaProcessor uses CLIPImageProcessor for processing images).
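pixel_attention_mask (torch.Tensor of shape (batch_size, image_size, image_size), optional) — Mask to avoid performing attention on padding pixel indices.
image_hidden_states (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)) — The hidden states of the image encoder after modality projection.
use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrary to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.vocab_size] or model.image_token_id (where model is your instance of Idefics3ForConditionalGeneration). Tokens with indices set to model.image_token_id are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
logits_to_keep (int or torch.Tensor, optional) — If an int, compute logits for the last logits_to_keep tokens. If 0, calculate logits for all input_ids (special case). Only the last token's logits are needed for generation, and calculating them only for that token can save memory, which becomes significant for long sequences or large vocabulary sizes. If a torch.Tensor, it must be 1D, corresponding to the indices to keep in the sequence length dimension. This is useful when using packed tensor format (single dimension for batch and sequence length).

Returns

transformers.models.idefics3.modeling_idefics3.Idefics3CausalLMOutputWithPast or tuple(torch.FloatTensor)

A transformers.models.idefics3.modeling_idefics3.Idefics3CausalLMOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Idefics3Config) and inputs.

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction).
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
image_hidden_states (tuple(torch.FloatTensor), optional) — Tuple of torch.FloatTensor (one for the output of the image embeddings) of shape (batch_size, num_images, sequence_length, hidden_size). The image_hidden_states of the model produced by the vision encoder.

The Idefics3ForConditionalGeneration forward method overrides the __call__ special method.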
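The selection rule documented for logits_to_keep can be sketched on a plain Python list standing in for the sequence dimension. The helper name is hypothetical, and the sketch only mirrors the documented behavior (0 keeps everything, a positive int keeps the last positions, a 1D collection keeps the listed indices). It is not the model's implementation.

```python
def select_positions(per_position_logits, logits_to_keep=0):
    """Hypothetical helper: pick which sequence positions get logits."""
    if isinstance(logits_to_keep, int):
        # slice(-0, None) is the same as slice(0, None), so 0 keeps every position.
        return per_position_logits[slice(-logits_to_keep, None)]
    # Otherwise treat the argument as a 1D collection of indices to keep.
    return [per_position_logits[i] for i in logits_to_keep]


positions = ["t0", "t1", "t2", "t3", "t4"]  # stand-ins for per-token logits
print(select_positions(positions, 0))       # all positions
print(select_positions(positions, 1))       # only the last position
print(select_positions(positions, [0, 4]))  # explicit indices
```

During generation only the last position is needed, which is why keeping a single position saves memory on long sequences.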

Example:

>>> import requests
>>> import torch
>>> from PIL import Image
>>> from io import BytesIO

>>> from transformers import AutoProcessor, AutoModelForVision2Seq
>>> from transformers.image_utils import load_image

>>> # Note that passing the image urls (instead of the actual pil images) to the processor is also possible
>>> image1 = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
>>> image2 = load_image("https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg")
>>> image3 = load_image("https://cdn.britannica.com/68/170868-050-8DDE8263/Golden-Gate-Bridge-San-Francisco.jpg")

>>> processor = AutoProcessor.from_pretrained("HuggingFaceM4/Idefics3-8B-Llama3")
>>> model = AutoModelForVision2Seq.from_pretrained("HuggingFaceM4/Idefics3-8B-Llama3", torch_dtype=torch.bfloat16, device_map="auto")

>>> # Create inputs
>>> messages = [
...     {
...         "role": "user",
...         "content": [
...             {"type": "image"},
...             {"type": "text", "text": "In this image, we can see the city of New York, and more specifically the Statue of Liberty."},
...             {"type": "image"},
...             {"type": "text", "text": "What can we see in this image?"},
...         ]
...     },
...     {
...         "role": "user",
...         "content": [
...             {"type": "image"},
...             {"type": "text", "text": "In which city is that bridge located?"},
...         ]
...     }
... ]

>>> prompts = [processor.apply_chat_template([message], add_generation_prompt=True) for message in messages]
>>> images = [[image1, image2], [image3]]
>>> inputs = processor(text=prompts, images=images, padding=True, return_tensors="pt").to(model.device)

>>> # Generate
>>> generated_ids = model.generate(**inputs, max_new_tokens=256)
>>> generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)

>>> print(generated_texts[0])
Assistant: There are buildings, trees, lights, and water visible in this image.

>>> print(generated_texts[1])
Assistant: The bridge is in San Francisco.

Idefics3ImageProcessor

class transformers.Idefics3ImageProcessor


( do_convert_rgb: bool = True do_resize: bool = True size: typing.Dict[str, int] = None resample: Resampling = <Resampling.LANCZOS: 1> do_image_splitting: bool = True max_image_size: typing.Dict[str, int] = None do_rescale: bool = True rescale_factor: float = 0.00392156862745098 do_normalize: bool = True image_mean: typing.Union[float, typing.List[float], NoneType] = None image_std: typing.Union[float, typing.List[float], NoneType] = None do_pad: bool = True **kwargs )

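Parameters

do_convert_rgb (bool, optional, defaults to True) — Whether to convert the image to RGB. This is useful if the input image is of a different format, e.g. RGBA. Only has an effect if the input image is in the PIL format.
do_resize (bool, optional, defaults to True) — Whether to resize the image. The longest edge of the image is resized to be <= size["longest_edge"], with the shortest edge resized to keep the input aspect ratio.
size (Dict, optional, defaults to {"longest_edge": 4 * 364}) — Controls the size of the output image. This is a dictionary containing the key "longest_edge". The image will be resized such that the longest edge is <= size["longest_edge"], with the shortest edge resized to keep the input aspect ratio.
resample (Resampling, optional, defaults to Resampling.LANCZOS) — Resampling filter to use when resizing the image.
do_image_splitting (bool, optional, defaults to True) — Whether to split the image into sub-images concatenated with the original image. They are split into patches such that each patch has a size of max_image_size["height"] x max_image_size["width"].
max_image_size (Dict, optional, defaults to {"longest_edge": 364}) — Maximum resolution of the patches of images accepted by the model. This is a dictionary containing the key "longest_edge".
do_rescale (bool, optional, defaults to True) — Whether to rescale the image. If set to True, the image is rescaled to have pixel values between 0 and 1.
rescale_factor (float, optional, defaults to 1/255) — Rescale factor to rescale the image by if do_rescale is set to True.
do_normalize (bool, optional, defaults to True) — Whether to normalize the image. If set to True, the image is normalized to have a mean of image_mean and a standard deviation of image_std.
image_mean (float or List[float], optional, defaults to IDEFICS_STANDARD_MEAN) — Mean to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the image_mean parameter in the preprocess method.
image_std (float or List[float], optional, defaults to IDEFICS_STANDARD_STD) — Standard deviation to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the image_std parameter in the preprocess method.
do_pad (bool, optional, defaults to True) — Whether or not to pad the images to the largest height and width in the batch and to the largest number of images per sample in the batch, such that the returned tensor is of shape (batch_size, max_num_images, num_channels, max_height, max_width).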
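The effect of do_rescale followed by do_normalize on a single pixel value can be written out as arithmetic. This is a minimal sketch: the helper is hypothetical, and the mean and standard deviation of 0.5 are placeholders, not the values of IDEFICS_STANDARD_MEAN and IDEFICS_STANDARD_STD, which this page does not list.

```python
RESCALE_FACTOR = 1 / 255  # documented default for rescale_factor


def rescale_and_normalize(pixel, mean, std, rescale_factor=RESCALE_FACTOR):
    """Hypothetical helper: apply rescaling, then normalization, to one value."""
    scaled = pixel * rescale_factor  # do_rescale: map 0..255 to roughly 0..1
    return (scaled - mean) / std     # do_normalize: center and scale


# Placeholder statistics (assumption), applied to the darkest and brightest pixel.
PLACEHOLDER_MEAN = 0.5
PLACEHOLDER_STD = 0.5
for value in (0, 128, 255):
    print(value, rescale_and_normalize(value, PLACEHOLDER_MEAN, PLACEHOLDER_STD))
```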

Constructs an Idefics3 image processor.

preprocess


( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] do_convert_rgb: typing.Optional[bool] = None do_resize: typing.Optional[bool] = None size: typing.Optional[typing.Dict[str, int]] = None resample: Resampling = None do_image_splitting: typing.Optional[bool] = None do_rescale: typing.Optional[bool] = None max_image_size: typing.Optional[typing.Dict[str, int]] = None rescale_factor: typing.Optional[float] = None do_normalize: typing.Optional[bool] = None image_mean: typing.Union[float, typing.List[float], NoneType] = None image_std: typing.Union[float, typing.List[float], NoneType] = None do_pad: typing.Optional[bool] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None return_row_col_info: bool = False data_format: typing.Optional[transformers.image_utils.ChannelDimension] = <ChannelDimension.FIRST: 'channels_first'> input_data_format: typing.Union[transformers.image_utils.ChannelDimension, str, NoneType] = None )

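Parameters

images (ImageInput) — A list of images to preprocess.
do_convert_rgb (bool, optional, defaults to self.do_convert_rgb) — Whether to convert the image to RGB.
do_resize (bool, optional, defaults to self.do_resize) — Whether to resize the image.
size (Dict[str, int], optional, defaults to self.size) — Size of the image after resizing, with the longest edge resized to keep the input aspect ratio.
resample (int, optional, defaults to self.resample) — Resampling filter to use if resizing the image. This can be one of the enum PILImageResampling. Only has an effect if do_resize is set to True.
do_image_splitting (bool, optional, defaults to self.do_image_splitting) — Whether to split the image into sub-images concatenated with the original image. They are split into patches such that each patch has a size of max_image_size["height"] x max_image_size["width"].
max_image_size (Dict, optional, defaults to self.max_image_size) — Maximum resolution of the images. If the image is larger than this size, the image is split into patches.
do_rescale (bool, optional, defaults to self.do_rescale) — Whether to rescale the image.
rescale_factor (float, optional, defaults to self.rescale_factor) — Rescale factor to rescale the image by if do_rescale is set to True.
do_normalize (bool, optional, defaults to self.do_normalize) — Whether to normalize the image.
image_mean (float or List[float], optional, defaults to self.image_mean) — Image mean to use for normalization. Only has an effect if do_normalize is set to True.
image_std (float or List[float], optional, defaults to self.image_std) — Image standard deviation to use for normalization. Only has an effect if do_normalize is set to True.
do_pad (bool, optional, defaults to self.do_pad) — Whether or not to pad the images to the largest height and width in the batch.
return_tensors (str or TensorType, optional) — The type of tensors to return. Can be one of:
- Unset: Return a list of np.ndarray.
- TensorType.TENSORFLOW or 'tf': Return a batch of type tf.Tensor.
- TensorType.PYTORCH or 'pt': Return a batch of type torch.Tensor.
- TensorType.NUMPY or 'np': Return a batch of type np.ndarray.
- TensorType.JAX or 'jax': Return a batch of type jax.numpy.ndarray.
return_row_col_info (bool, optional, defaults to False) — Whether to return the number of rows and columns of the split images. This is used by Idefics3Processor to generate prompt strings based on the number of rows and columns.
data_format (ChannelDimension or str, optional, defaults to ChannelDimension.FIRST) — The channel dimension format for the output image. Can be one of:
- "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
- Unset: Use the channel dimension format of the input image.
input_data_format (ChannelDimension or str, optional) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
- "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
- "none" or ChannelDimension.NONE: image in (height, width) format.

Preprocess a batch of images.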
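The shape that padding produces can be worked out from the image sizes alone. The sketch below is a hypothetical helper, not library code: given, for each sample, a list of (height, width) pairs, it returns the (batch_size, max_num_images, num_channels, max_height, max_width) shape described for do_pad on the image processor. It assumes every sample holds at least one image.

```python
def padded_batch_shape(batch, num_channels=3):
    """Hypothetical helper: shape of the padded pixel_values for a batch.

    `batch` is a list of samples; each sample is a list of (height, width) pairs.
    """
    max_num_images = max(len(sample) for sample in batch)
    max_height = max(height for sample in batch for height, _ in sample)
    max_width = max(width for sample in batch for _, width in sample)
    return (len(batch), max_num_images, num_channels, max_height, max_width)


# Two samples: the first has two images, the second has one.
example_batch = [[(364, 364), (364, 728)], [(728, 364)]]
print(padded_batch_shape(example_batch))
```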

Idefics3Processor

class transformers.Idefics3Processor


( image_processor tokenizer = None image_seq_len: int = 169 chat_template: str = None **kwargs )

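Parameters

image_processor (Idefics3ImageProcessor) — An instance of Idefics3ImageProcessor. The image processor is a required input.
tokenizer (PreTrainedTokenizerBase, optional) — An instance of PreTrainedTokenizerBase. This should correspond with the model's text model. The tokenizer is a required input.
image_seq_len (int, optional, defaults to 169) — The length of the image sequence, i.e. the number of <image> tokens per image in the input. This parameter is used to build the string from the input prompt and image tokens and should match the value the model used. It is computed as: image_seq_len = int(((image_size // patch_size) ** 2) / (scale_factor ** 2))
chat_template (str, optional) — A Jinja template which will be used to convert lists of messages in a chat into a tokenizable string.

Constructs an Idefics3 processor which wraps a Llama tokenizer and Idefics3 image processor into a single processor.

Idefics3Processor offers all the functionalities of Idefics3ImageProcessor and Idefics3TokenizerFast. See the docstring of __call__() and decode() for more information.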
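The image_seq_len formula, int(((image_size // patch_size) ** 2) / (scale_factor ** 2)), can be checked with a few lines of Python. In the sketch below, scale_factor=2 is the Idefics3Config default, while image_size=364 and patch_size=14 are assumptions that do not appear on this page; they are used because they reproduce the default of 169.

```python
def compute_image_seq_len(image_size, patch_size, scale_factor):
    """Number of <image> tokens per image, following the documented formula."""
    patches_per_side = image_size // patch_size
    return int((patches_per_side ** 2) / (scale_factor ** 2))


# Assumed values: 364-pixel patches and a vision patch size of 14.
seq_len = compute_image_seq_len(image_size=364, patch_size=14, scale_factor=2)
print(seq_len)
```

With these values the vision encoder sees 26 x 26 = 676 patches, and the scale factor of 2 reduces them by a factor of 4.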

__call__


( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor'], typing.List[typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']]], typing.List[typing.List[typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']]]]] = None text: typing.Union[str, ForwardRef('PreTokenizedInput'), typing.List[str], typing.List[ForwardRef('PreTokenizedInput')]] = None audio = None videos = None image_seq_len: typing.Optional[int] = None **kwargs: typing_extensions.Unpack[transformers.models.idefics3.processing_idefics3.Idefics3ProcessorKwargs] )

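Parameters

images (PIL.Image.Image, np.ndarray, torch.Tensor, List[PIL.Image.Image], List[np.ndarray], List[torch.Tensor], optional) — The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch tensor. If of type List[ImageInput], it is assumed that this is for a single prompt, i.e. of batch size 1.
text (Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]], optional) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as a list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences). Wherever an image token, <image>, is encountered, it is expanded to <fake_token_around_image> + <row_x_col_y> + <image> * image_seq_len.
image_seq_len (int, optional) — The length of the image sequence. If not provided, the default value of self.image_seq_len is used. image_seq_len should be equal to int(((image_size // patch_size) ** 2) / (scale_factor ** 2)).
return_tensors (Union[str, TensorType], optional) — If set, will return tensors of a particular framework. See PreTrainedTokenizerFast.__call__() for more information.

Process input prompts and return a BatchEncoding.

Example: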

>>> import requests
>>> from transformers import Idefics3Processor
>>> from transformers.image_utils import load_image

>>> processor = Idefics3Processor.from_pretrained("HuggingFaceM4/Idefics3-8B-Llama3")
>>> processor.image_processor.do_image_splitting = False  # Force as False to simplify the example

>>> url1 = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
>>> url2 = "https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg"

>>> image1, image2 = load_image(url1), load_image(url2)
>>> images = [[image1], [image2]]

>>> text = [
...     "<image>In this image, we see",
...     "bla bla bla<image>",
... ]
>>> outputs = processor(images=images, text=text, return_tensors="pt", padding=True)
>>> input_ids = outputs.input_ids
>>> input_tokens = processor.tokenizer.batch_decode(input_ids)
>>> print(input_tokens)
['<|begin_of_text|><fake_token_around_image><global-img>((<image>)*169)<fake_token_around_image> In this image, we see', '<|reserved_special_token_0|><|reserved_special_token_0|><|reserved_special_token_0|><|begin_of_text|>bla bla bla<fake_token_around_image><global-img>((<image>)*169)<fake_token_around_image>']
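The decoded output above shows how each <image> placeholder is expanded when image splitting is disabled. A minimal string-level sketch of that expansion follows. The helper is hypothetical, the token names are copied from the output above, and the sketch covers only the no-splitting case with a single global image; it does not reproduce the tokenizer's whitespace handling or padding.

```python
FAKE_TOKEN = "<fake_token_around_image>"
GLOBAL_IMAGE_TOKEN = "<global-img>"
IMAGE_TOKEN = "<image>"


def expand_image_placeholder(image_seq_len=169):
    """Hypothetical helper: expansion of one <image> placeholder (no splitting)."""
    return FAKE_TOKEN + GLOBAL_IMAGE_TOKEN + IMAGE_TOKEN * image_seq_len + FAKE_TOKEN


prompt = "<image>In this image, we see"
# str.replace does not rescan inserted text, so the new <image> tokens are left alone.
expanded = prompt.replace(IMAGE_TOKEN, expand_image_placeholder())
print(expanded.count(IMAGE_TOKEN))
print(expanded[:60])
```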
