InternVL

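The InternVL3 family of visual language models was proposed in InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models.

The abstract from the paper is the following:

We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single pre-training stage. This unified training paradigm effectively addresses the complexities and alignment challenges commonly encountered in conventional post-hoc training pipelines for MLLMs. To further improve performance and scalability, InternVL3 incorporates variable visual position encoding (V2PE) to support extended multimodal contexts, employs advanced post-training techniques such as supervised fine-tuning (SFT) and mixed preference optimization (MPO), and adopts test-time scaling strategies alongside an optimized training infrastructure. Extensive empirical evaluations demonstrate that InternVL3 delivers superior performance across a wide range of multimodal tasks. In particular, InternVL3-78B achieves a score of 72.2 on the MMMU benchmark, setting a new state of the art among open-source MLLMs. Its capabilities remain highly competitive with leading proprietary models, including ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while also maintaining strong pure-language proficiency. In pursuit of open-science principles, we will publicly release both the training data and model weights to foster further research and development in next-generation MLLMs.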

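Overview of the InternVL3 model architecture, which is the same as InternVL2.5. Taken from the original checkpoint.

Comparison of InternVL3 performance on OpenCompass against other SOTA VLLMs. Taken from the original checkpoint.

This model was contributed by yonigozlan. The original code can be found here.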

Usage example

Inference with Pipeline

Here is how you can use the image-text-to-text pipeline to run inference with the InternVL3 models in just a few lines of code:

>>> from transformers import pipeline

>>> messages = [
...     {
...         "role": "user",
...         "content": [
...             {
...                 "type": "image",
...                 "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
...             },
...             {"type": "text", "text": "Describe this image."},
...         ],
...     },
... ]

>>> pipe = pipeline("image-text-to-text", model="OpenGVLab/InternVL3-1B-hf")
>>> outputs = pipe(text=messages, max_new_tokens=50, return_full_text=False)
>>> outputs[0]["generated_text"]
'The image showcases a vibrant scene of nature, featuring several flowers and a bee. \n\n1. **Foreground Flowers**: \n   - The primary focus is on a large, pink cosmos flower with a prominent yellow center. The petals are soft and slightly r'

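Single-image inference

This example demonstrates how to perform inference on a single image with the InternVL models using chat templates.

[!NOTE] Note that the model has been trained with a specific prompt format for chatting. Use processor.apply_chat_template(my_conversation_dict) to correctly format your prompts.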

>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch

>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-1B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

>>> messages = [
...     {
...         "role": "user",
...         "content": [
...             {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
...             {"type": "text", "text": "Please describe the image explicitly."},
...         ],
...     }
... ]

>>> inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)

>>> generate_ids = model.generate(**inputs, max_new_tokens=50)
>>> decoded_output = processor.decode(generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)

>>> decoded_output
'The image shows two cats lying on a pink blanket. The cat on the left is a tabby with a mix of brown, black, and white fur, and it appears to be sleeping with its head resting on the blanket. The cat on the'
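The conversation passed to apply_chat_template above is a plain list of dictionaries. As a minimal sketch, the helper below assembles one user turn in that shape; the name build_user_message is hypothetical and not part of Transformers, and the key layout simply mirrors the examples on this page.

```python
def build_user_message(text, image_urls=()):
    """Return one user turn shaped like the conversations used in the examples above.

    `build_user_message` is a hypothetical helper for illustration only.
    """
    # Image entries come first, followed by the text prompt, mirroring the examples.
    content = [{"type": "image", "url": url} for url in image_urls]
    content.append({"type": "text", "text": text})
    return {"role": "user", "content": content}


messages = [
    build_user_message(
        "Please describe the image explicitly.",
        image_urls=["http://images.cocodataset.org/val2017/000000039769.jpg"],
    )
]
```

The resulting messages list can be passed to processor.apply_chat_template in the same way as the hand-written conversation. Calling the helper without image_urls yields a text-only turn.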

Text-only generation

This example shows how to generate text with the InternVL model without providing any image input.

>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch

>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-1B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

>>> messages = [
...     {
...         "role": "user",
...         "content": [
...             {"type": "text", "text": "Write a haiku"},
...         ],
...     }
... ]

>>> inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(torch_device, dtype=torch.bfloat16)

>>> generate_ids = model.generate(**inputs, max_new_tokens=50)
>>> decoded_output = processor.decode(generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)

>>> decoded_output
"Whispers of dawn,\nSilent whispers of the night,\nNew day's light begins."

Batched image and text inputs

InternVL models also support batched image and text inputs.

>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch

>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-1B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

>>> messages = [
...     [
...         {
...             "role": "user",
...             "content": [
...                 {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
...                 {"type": "text", "text": "Write a haiku for this image"},
...             ],
...         },
...     ],
...     [
...         {
...             "role": "user",
...             "content": [
...                 {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
...                 {"type": "text", "text": "Describe this image"},
...             ],
...         },
...     ],
... ]


>>> inputs = processor.apply_chat_template(messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)

>>> output = model.generate(**inputs, max_new_tokens=25)

>>> decoded_outputs = processor.batch_decode(output, skip_special_tokens=True)
>>> decoded_outputs
["user\n\nWrite a haiku for this image\nassistant\nSilky lake,  \nWooden pier,  \nNature's peace.",
 'user\n\nDescribe this image\nassistant\nThe image shows a street scene with a traditional Chinese archway, known as a "Chinese Gate" or "Chinese Gate of']
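The decoded strings above still contain the prompt because the full generated sequences are decoded. The sketch below illustrates, with plain Python lists standing in for tensors, how the prompt positions can be sliced off before decoding. It assumes the batch is left-padded so that every prompt occupies the same number of positions; strip_prompt is a hypothetical helper, not a Transformers function.

```python
def strip_prompt(generated_ids, prompt_length):
    """Drop the first `prompt_length` token IDs from every generated sequence.

    Hypothetical helper; assumes all prompts share one (padded) length.
    """
    return [sequence[prompt_length:] for sequence in generated_ids]


# Toy token IDs: 0 is padding, the first four positions are the left-padded prompt.
generated_ids = [
    [0, 11, 12, 13, 101, 102],
    [21, 22, 23, 24, 201, 202],
]
new_tokens = strip_prompt(generated_ids, prompt_length=4)
print(new_tokens)  # [[101, 102], [201, 202]]
```

With real tensors the same idea is the slice output[:, inputs["input_ids"].shape[1] :], which is the batched form of the slicing used in the single-image example.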

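Batched multi-image input

This implementation of the InternVL models supports batched text-image inputs with a different number of images for each text.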

>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch

>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-1B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

>>> messages = [
...     [
...         {
...             "role": "user",
...             "content": [
...                 {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
...                 {"type": "text", "text": "Write a haiku for this image"},
...             ],
...         },
...     ],
...     [
...         {
...             "role": "user",
...             "content": [
...                 {"type": "image", "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"},
...                 {"type": "image", "url": "https://thumbs.dreamstime.com/b/golden-gate-bridge-san-francisco-purple-flowers-california-echium-candicans-36805947.jpg"},
...                 {"type": "text", "text": "These images depict two different landmarks. Can you identify them?"},
...             ],
...         },
...     ],
... ]

>>> inputs = processor.apply_chat_template(messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)

>>> output = model.generate(**inputs, max_new_tokens=25)

>>> decoded_outputs = processor.batch_decode(output, skip_special_tokens=True)
>>> decoded_outputs
["user\n\nWrite a haiku for this image\nassistant\nSilky lake,  \nWooden pier,  \nNature's peace.",
 'user\n\n\nThese images depict two different landmarks. Can you identify them?\nassistant\nYes, these images depict the Statue of Liberty and the Golden Gate Bridge.']

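Video input

InternVL models can also handle video inputs. Here is an example of how to perform inference on a video input using chat templates.

>>> import torch
>>> from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig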

>>> model_checkpoint = "OpenGVLab/InternVL3-8B-hf"
>>> quantization_config = BitsAndBytesConfig(load_in_4bit=True)
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, quantization_config=quantization_config)

>>> messages = [
...     {
...         "role": "user",
...         "content": [
...             {
...                 "type": "video",
...                 "url": "https://huggingface.co/datasets/hf-internal-testing/fixtures_videos/resolve/main/tennis.mp4",
...             },
...             {"type": "text", "text": "What type of shot is the man performing?"},
...         ],
...     }
... ]
>>> inputs = processor.apply_chat_template(
...     messages,
...     return_tensors="pt",
...     add_generation_prompt=True,
...     tokenize=True,
...     return_dict=True,
...     num_frames=8,
... ).to(model.device, dtype=torch.float16)

>>> output = model.generate(**inputs, max_new_tokens=25)

>>> decoded_output = processor.decode(output[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)
>>> decoded_output
'The man is performing a forehand shot.'

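Interleaved image and video inputs

This example showcases how to handle a batch of chat conversations with interleaved image and video inputs using a chat template.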

>>> from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig
>>> import torch

>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-1B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

>>> messages = [
...     [
...         {
...             "role": "user",
...             "content": [
...                 {"type": "image", "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"},
...                 {"type": "image", "url": "https://thumbs.dreamstime.com/b/golden-gate-bridge-san-francisco-purple-flowers-california-echium-candicans-36805947.jpg"},
...                 {"type": "text", "text": "These images depict two different landmarks. Can you identify them?"},
...             ],
...         },
...     ],
...     [
...         {
...             "role": "user",
...             "content": [
...                 {"type": "video", "url": "https://huggingface.co/datasets/hf-internal-testing/fixtures_videos/resolve/main/tennis.mp4"},
...                 {"type": "text", "text": "What type of shot is the man performing?"},
...             ],
...         },
...     ],
...     [
...         {
...             "role": "user",
...             "content": [
...                 {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
...                 {"type": "text", "text": "Write a haiku for this image"},
...             ],
...         },
...     ],
... ]
>>> inputs = processor.apply_chat_template(
...     messages,
...     padding=True,
...     add_generation_prompt=True,
...     tokenize=True,
...     return_dict=True,
...     return_tensors="pt",
... ).to(model.device, dtype=torch.bfloat16)

>>> outputs = model.generate(**inputs, max_new_tokens=25)

>>> decoded_outputs = processor.batch_decode(outputs, skip_special_tokens=True)
>>> decoded_outputs
['user\n\n\nThese images depict two different landmarks. Can you identify them?\nassistant\nThe images depict the Statue of Liberty and the Golden Gate Bridge.',
 'user\nFrame1: \nFrame2: \nFrame3: \nFrame4: \nFrame5: \nFrame6: \nFrame7: \nFrame8: \nWhat type of shot is the man performing?\nassistant\nA forehand shot',
 "user\n\nWrite a haiku for this image\nassistant\nSilky lake,  \nWooden pier,  \nNature's peace."]

InternVLVisionConfig

class transformers.InternVLVisionConfig

< source >

( hidden_size = 1024 num_hidden_layers = 24 num_attention_heads = 16 attention_bias = False use_qk_norm = False intermediate_size = 4096 hidden_act = 'gelu' hidden_dropout_prob = 0.0 attention_dropout = 0.0 projection_dropout = 0.0 initializer_range = 0.02 norm_type = 'layer_norm' layer_norm_eps = 1e-06 image_size = [448, 448] patch_size = [14, 14] num_channels = 3 use_mask_token = False use_absolute_position_embeddings = True layer_scale_init_value = 0.1 use_mean_pooling = True **kwargs )

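Parameters

hidden_size (int, optional, defaults to 1024) — Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (int, optional, defaults to 24) — Number of hidden layers in the Transformer encoder.
num_attention_heads (int, optional, defaults to 16) — Number of attention heads for each attention layer in the Transformer encoder.
attention_bias (bool, optional, defaults to False) — Whether to add a bias to the queries, keys and values.
use_qk_norm (bool, optional, defaults to False) — Whether to apply normalization to the queries and keys before the attention operation.
intermediate_size (int, optional, defaults to 4096) — Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
hidden_act (str or function, optional, defaults to "gelu") — The non-linear activation function (function or string) in the encoder and pooler. If a string, "gelu", "relu", "selu" and "gelu_new" are supported.
hidden_dropout_prob (float, optional, defaults to 0.0) — The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_dropout (float, optional, defaults to 0.0) — Dropout probability for the attention weights.
projection_dropout (float, optional, defaults to 0.0) — Dropout probability for the projection layer.
initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated normal initializer used to initialize all weight matrices.
norm_type (str, optional, defaults to "layer_norm") — The type of normalization to use in the encoder. Can be "layer_norm" or "rms_norm".
layer_norm_eps (float, optional, defaults to 1e-06) — The epsilon used by the layer normalization layers.
image_size (int or list[int], optional, defaults to [448, 448]) — The size (resolution) of each image.
patch_size (int or list[int], optional, defaults to [14, 14]) — The size (resolution) of each patch.
num_channels (int, optional, defaults to 3) — The number of input channels.
use_mask_token (bool, optional, defaults to False) — Whether to use a mask token for masked image modeling.
use_absolute_position_embeddings (bool, optional, defaults to True) — Whether to use BERT-style absolute position embeddings.
layer_scale_init_value (float, optional, defaults to 0.1) — Scale to use in the self-attention layers: 0.1 for base models, 1e-5 for large models. Set to 0 to disable layer scale.
use_mean_pooling (bool, optional, defaults to True) — Whether to mean-pool the final hidden states of the patches instead of using the final hidden state of the CLS token before applying the classification head.

This is the configuration class to store the configuration of an InternVLVisionModel. It is used to instantiate an InternVLVisionModel according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of InternVL3-1B, e.g. OpenGVLab/InternVL3-1B-hf.

Example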

>>> from transformers import InternVLVisionConfig, InternVLVisionModel

>>> # Initializing a InternVLVisionModel OpenGVLab/InternVL3-1B-hf style configuration
>>> configuration = InternVLVisionConfig()

>>> # Initializing a model (with random weights) from the OpenGVLab/InternVL3-1B-hf configuration
>>> model = InternVLVisionModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

InternVLConfig

class transformers.InternVLConfig

< source >

( vision_config = None text_config = None image_token_id = 151667 image_seq_length = 256 downsample_ratio = 0.5 projector_hidden_act = 'gelu' vision_feature_layer = -1 vision_feature_select_strategy = 'default' **kwargs )

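Parameters

vision_config (Union[AutoConfig, dict], optional, defaults to InternVLVisionConfig) — The config object or dictionary of the vision backbone.
text_config (Union[AutoConfig, dict], optional, defaults to Qwen2Config) — The config object or dictionary of the text backbone.
image_token_id (int, optional, defaults to 151667) — The image token index used to encode the image prompt.
image_seq_length (int, optional, defaults to 256) — Number of image tokens to use per image patch.
downsample_ratio (float, optional, defaults to 0.5) — Factor by which to downsample the image.
projector_hidden_act (str or function, optional, defaults to "gelu") — The non-linear activation function (function or string) in the projector.
vision_feature_layer (int, optional, defaults to -1) — The index of the layer to use as the image features.
vision_feature_select_strategy (str, optional, defaults to "default") — The feature selection strategy used to select the vision feature from the vision backbone. Can be one of "default" or "full".

This is the configuration class to store the configuration of an InternVLForConditionalGeneration. It is used to instantiate an InternVL model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of InternVL3-1B, e.g. OpenGVLab/InternVL3-1B-hf.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation of PretrainedConfig for more information.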

>>> from transformers import InternVLForConditionalGeneration, InternVLConfig

>>> # Initializing a InternVL style configuration
>>> configuration = InternVLConfig()

>>> # Initializing a model (with random weights) from the OpenGVLab/InternVL3-1B-hf configuration
>>> model = InternVLForConditionalGeneration(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

InternVLVisionModel

class transformers.InternVLVisionModel

< source >

( config: InternVLVisionConfig )

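Parameters

config (InternVLVisionConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare InternVL model outputting raw hidden states without any specific head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.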

forward

< source >

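( pixel_values: Tensor bool_masked_pos: typing.Optional[torch.BoolTensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None ) → transformers.models.internvl.modeling_internvl.InternVLVisionModelOutputWithPooling or tuple(torch.FloatTensor)

Parameters

pixel_values (torch.Tensor of shape (batch_size, num_channels, image_size, image_size)) — The tensors corresponding to the input images. Pixel values can be obtained using {image_processor_class}. See {image_processor_class}.__call__ for details ({processor_class} uses {image_processor_class} for processing images).
bool_masked_pos (torch.BoolTensor of shape (batch_size, num_patches), optional) — Boolean masked positions. Indicates which patches are masked (1) and which are not (0).
output_attentions (bool, optional) — Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.

transformers.models.internvl.modeling_internvl.InternVLVisionModelOutputWithPooling or tuple(torch.FloatTensor)

A transformers.models.internvl.modeling_internvl.InternVLVisionModelOutputWithPooling or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (InternVLConfig) and inputs.

last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden states at the output of the last layer of the model.
pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)) — Average of the last-layer hidden states of the patch tokens (excluding the [CLS] token) if config.use_mean_pooling is set to True. If set to False, the final hidden state of the [CLS] token is returned.
hidden_states (tuple[torch.FloatTensor, ...], optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple[torch.FloatTensor, ...], optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The InternVLVisionModel forward method, overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.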
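To make the pooler_output description concrete, here is a minimal sketch of the two pooling modes on plain Python lists. It assumes a single image (no batch dimension) and that the [CLS] token is the first token in the sequence; pool_hidden_states is a hypothetical name, and the sketch does not reproduce any extra normalization the real model may apply.

```python
def pool_hidden_states(last_hidden_state, use_mean_pooling=True):
    """Pool a list of token vectors into one vector.

    Assumption for illustration: index 0 holds the [CLS] token and the
    remaining entries are patch tokens.
    """
    cls_token, patch_tokens = last_hidden_state[0], last_hidden_state[1:]
    if not use_mean_pooling:
        # Return the final hidden state of the [CLS] token.
        return list(cls_token)
    # Average each hidden dimension over the patch tokens only.
    count = len(patch_tokens)
    return [sum(values) / count for values in zip(*patch_tokens)]


tokens = [[9.0, 9.0], [1.0, 2.0], [3.0, 4.0]]  # [CLS], patch 1, patch 2
print(pool_hidden_states(tokens))  # [2.0, 3.0]
print(pool_hidden_states(tokens, use_mean_pooling=False))  # [9.0, 9.0]
```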

InternVLModel

class transformers.InternVLModel

< source >

( config: InternVLConfig )

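Parameters

config (InternVLConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The InternVL model, which consists of a vision backbone and a language model, without a language modeling head.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.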

forward

< source >

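( input_ids: LongTensor = None pixel_values: FloatTensor = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.LongTensor] = None past_key_values: typing.Optional[list[torch.FloatTensor]] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None vision_feature_layer: typing.Union[int, list[int], NoneType] = None vision_feature_select_strategy: typing.Optional[str] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None cache_position: typing.Optional[torch.LongTensor] = None **kwargs: typing_extensions.Unpack[transformers.modeling_flash_attention_utils.FlashAttentionKwargs] ) → transformers.models.internvl.modeling_internvl.InternVLModelOutputWithPast or tuple(torch.FloatTensor)

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default.

Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

What are input IDs?
pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)) — The tensors corresponding to the input images. Pixel values can be obtained using {image_processor_class}. See {image_processor_class}.__call__ for details ({processor_class} uses {image_processor_class} for processing images).
attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values are selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
What are attention masks?
position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of the positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].

What are position IDs?
past_key_values (list[torch.FloatTensor], optional) — Pre-computed hidden states (keys and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists of the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True.

Two formats are allowed:
- a Cache instance, see our kv cache guide;
- a tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head). This is also known as the legacy cache format.
The model will output the same cache format that is fed as input. If no past_key_values are passed, the legacy cache format will be returned.

If past_key_values are used, the user can optionally input only the last input_ids (those that don't have their past key-value states given to this model) of shape (batch_size, 1) instead of all input_ids of shape (batch_size, sequence_length).
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix provides.
vision_feature_layer (Union[int, list[int], NoneType]) — The index of the layer from which to select the vision feature. If multiple indices are provided, the vision features of the corresponding indices are concatenated to form the vision features.
vision_feature_select_strategy (str, optional) — The feature selection strategy used to select the vision feature from the vision backbone. Can be one of "default" or "full".
use_cache (bool, optional) — If set to True, past_key_values key-value states are returned and can be used to speed up decoding (see past_key_values).
output_attentions (bool, optional) — Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Unlike position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.

transformers.models.internvl.modeling_internvl.InternVLModelOutputWithPast or tuple(torch.FloatTensor)

A transformers.models.internvl.modeling_internvl.InternVLModelOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (InternVLConfig) and inputs.

last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden states at the output of the last layer of the model.
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head).

Contains pre-computed hidden states (keys and values in the self-attention blocks) that can be used (see the past_key_values input) to speed up sequential decoding.
hidden_states (tuple[torch.FloatTensor, ...], optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple[torch.FloatTensor, ...], optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
image_hidden_states (torch.FloatTensor, optional) — A torch.FloatTensor of size (batch_size, num_images, sequence_length, hidden_size). The image_hidden_states of the model, produced by the vision encoder and after projecting the last hidden state.

The InternVLModel forward method, overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.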

InternVLForConditionalGeneration

class transformers.InternVLForConditionalGeneration

< source >

( config: InternVLConfig )

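Parameters

config (InternVLConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The InternVL model, which consists of a vision backbone and a language model.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.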

forward

< source >

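( input_ids: LongTensor = None pixel_values: FloatTensor = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.LongTensor] = None past_key_values: typing.Optional[list[torch.FloatTensor]] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None vision_feature_layer: typing.Union[int, list[int], NoneType] = None vision_feature_select_strategy: typing.Optional[str] = None labels: typing.Optional[torch.LongTensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None cache_position: typing.Optional[torch.LongTensor] = None logits_to_keep: typing.Union[int, torch.Tensor] = 0 image_sizes: typing.Optional[torch.Tensor] = None **kwargs: typing_extensions.Unpack[transformers.models.internvl.modeling_internvl.KwargsForCausalLM] ) → transformers.models.internvl.modeling_internvl.InternVLCausalLMOutputWithPast or tuple(torch.FloatTensor)

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default.

Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

What are input IDs?
pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)) — The tensors corresponding to the input images. Pixel values can be obtained using {image_processor_class}. See {image_processor_class}.__call__ for details ({processor_class} uses {image_processor_class} for processing images).
attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values are selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
What are attention masks?
position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of the positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].

What are position IDs?
past_key_values (list[torch.FloatTensor], optional) — Pre-computed hidden states (keys and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists of the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True.

Two formats are allowed:
- a Cache instance, see our kv cache guide;
- a tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head). This is also known as the legacy cache format.
The model will output the same cache format that is fed as input. If no past_key_values are passed, the legacy cache format will be returned.

If past_key_values are used, the user can optionally input only the last input_ids (those that don't have their past key-value states given to this model) of shape (batch_size, 1) instead of all input_ids of shape (batch_size, sequence_length).
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix provides.
vision_feature_layer (Union[int, list[int], NoneType]) — The index of the layer from which to select the vision feature. If multiple indices are provided, the vision features of the corresponding indices are concatenated to form the vision features.
vision_feature_select_strategy (str, optional) — The feature selection strategy used to select the vision feature from the vision backbone. Can be one of "default" or "full".
labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.vocab_size] or -100 (see the input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
use_cache (bool, optional) — If set to True, past_key_values key-value states are returned and can be used to speed up decoding (see past_key_values).
output_attentions (bool, optional) — Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Unlike position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
logits_to_keep (Union[int, torch.Tensor], defaults to 0) — If an int, compute logits for the last logits_to_keep tokens. If 0, calculate logits for all input_ids (special case). Only the last token's logits are needed for generation, and calculating them only for that token can save memory, which becomes significant for long sequences or large vocabulary sizes. If a torch.Tensor, it must be 1D and correspond to the indices to keep in the sequence length dimension. This is useful when using a packed tensor format (a single dimension for batch and sequence length).
image_sizes (torch.Tensor of shape (batch_size, 2), optional) — The sizes of the images in the batch, given as (height, width) for each image.

transformers.models.internvl.modeling_internvl.InternVLCausalLMOutputWithPast or tuple(torch.FloatTensor)

A transformers.models.internvl.modeling_internvl.InternVLCausalLMOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (InternVLConfig) and inputs.

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction).
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head).

Contains pre-computed hidden states (keys and values in the self-attention blocks) that can be used (see the past_key_values input) to speed up sequential decoding.
hidden_states (tuple[torch.FloatTensor], optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple[torch.FloatTensor], optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
image_hidden_states (torch.FloatTensor, optional) — A torch.FloatTensor of size (batch_size, num_images, sequence_length, hidden_size). The image_hidden_states of the model, produced by the vision encoder and after projecting the last hidden state.

The InternVLForConditionalGeneration forward method, overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.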
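The three cases of logits_to_keep described in the parameter list can be pictured as a choice of sequence positions. The following is a minimal sketch with plain Python lists standing in for tensors; select_logit_positions is a hypothetical name, and this is an illustration of the documented behavior rather than the library's implementation.

```python
def select_logit_positions(sequence_length, logits_to_keep=0):
    """Return the sequence positions for which logits would be computed.

    Hypothetical helper: an int keeps the last N positions (0 keeps all),
    while a list of indices (standing in for a 1D tensor) is used as given.
    """
    positions = list(range(sequence_length))
    if isinstance(logits_to_keep, int):
        if logits_to_keep == 0:
            return positions  # special case: keep every position
        return positions[-logits_to_keep:]
    return [positions[index] for index in logits_to_keep]


print(select_logit_positions(6))             # [0, 1, 2, 3, 4, 5]
print(select_logit_positions(6, 1))          # [5]
print(select_logit_positions(6, [0, 2, 5]))  # [0, 2, 5]
```

Keeping only the last position is what generation needs, which is why a value of 1 reduces the size of the logits along the sequence dimension.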

Example

>>> import torch
>>> from transformers import AutoProcessor, AutoModelForImageTextToText

>>> torch_device = "cuda"
>>> processor = AutoProcessor.from_pretrained("OpenGVLab/InternVL3-1B-hf")
>>> model = AutoModelForImageTextToText.from_pretrained(
...     "OpenGVLab/InternVL3-1B-hf", torch_dtype=torch.bfloat16, device_map=torch_device
... )

>>> messages = [
...     {
...         "role": "user",
...         "content": [
...             {
...                 "type": "image",
...                 "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
...             },
...             {
...                 "type": "image",
...                 "url": "https://thumbs.dreamstime.com/b/golden-gate-bridge-san-francisco-purple-flowers-california-echium-candicans-36805947.jpg",
...             },
...             {"type": "text", "text": "These images depict two different landmarks. Can you identify them?"},
...         ],
...     },
... ]

>>> inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(torch_device, dtype=torch.bfloat16)
>>> generate_ids = model.generate(**inputs, max_new_tokens=200)
>>> print(processor.decode(generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True))
The images depict the Statue of Liberty and the Golden Gate Bridge.
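The labels parameter documented above ignores every position set to -100 when computing the loss. As a minimal sketch, the helper below builds such a label sequence from plain token IDs by masking the prompt positions. Masking the prompt is a common fine-tuning convention assumed here for illustration, not something this page prescribes, and mask_prompt_labels is a hypothetical name.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def mask_prompt_labels(input_ids, prompt_length):
    """Copy `input_ids` and replace the first `prompt_length` entries with -100.

    Hypothetical helper; works on a single sequence given as a list of ints.
    """
    labels = list(input_ids)
    masked = min(prompt_length, len(labels))
    labels[:masked] = [IGNORE_INDEX] * masked
    return labels


print(mask_prompt_labels([5, 6, 7, 8, 9], prompt_length=3))  # [-100, -100, -100, 8, 9]
```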

InternVLProcessor

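class transformers.InternVLProcessor

< source >

( image_processor = None tokenizer = None video_processor = None image_seq_length: int = 256 chat_template = None **kwargs )

Parameters

image_processor (AutoImageProcessor, optional) — The image processor is a required input.
tokenizer ([PreTrainedTokenizer, PreTrainedTokenizerFast], optional) — The tokenizer is a required input.
video_processor (AutoVideoProcessor, optional) — The video processor is a required input.
image_seq_length (int, optional, defaults to 256) — The number of image tokens to use per image patch. It should be set to: image_seq_length = (config.image_size // config.patch_size) ** 2 * (config.scale_factor**2)
chat_template (str, optional) — A Jinja template used to convert lists of messages in a chat into a tokenizable string.

Constructs an InternVL processor which wraps an AutoImageProcessor and a PreTrainedTokenizerFast tokenizer into a single processor that inherits both the image processor and tokenizer functionalities. See __call__() and decode() for more information.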
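The image_seq_length formula above can be checked against the default configuration values. The sketch below assumes that scale_factor in the formula corresponds to the downsample_ratio of InternVLConfig and that images and patches are square; image_tokens_per_patch is a hypothetical helper name.

```python
def image_tokens_per_patch(image_size=448, patch_size=14, downsample_ratio=0.5):
    """Evaluate (image_size // patch_size) ** 2 * downsample_ratio ** 2.

    Assumption: `downsample_ratio` plays the role of `scale_factor`
    in the formula quoted in the parameter description.
    """
    patches_per_side = image_size // patch_size
    return int(patches_per_side**2 * downsample_ratio**2)


print(image_tokens_per_patch())  # 256
```

With the defaults documented for InternVLVisionConfig (448-pixel images, 14-pixel patches) and a ratio of 0.5, this gives 32 × 32 = 1024 vision patches reduced to 256 image tokens, which matches the default image_seq_length.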

batch_decode

< source >

( *args **kwargs )

This method forwards all its arguments to PreTrainedTokenizerFast's batch_decode(). Please refer to the docstring of that method for more information.

decode

< source >

( *args **kwargs )

This method forwards all its arguments to PreTrainedTokenizerFast's decode(). Please refer to the docstring of that method for more information.

InternVLVideoProcessor

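class transformers.InternVLVideoProcessor

< source >

( **kwargs: typing_extensions.Unpack[transformers.models.internvl.video_processing_internvl.InternVLVideoProcessorInitKwargs] )

sample_frames

< source >

( video: torch.Tensor metadata: typing.Union[transformers.video_utils.VideoMetadata, dict, NoneType] = None num_frames: typing.Optional[int] = None fps: typing.Optional[int] = None initial_shift: typing.Union[bool, float, int, NoneType] = None ) → torch.Tensor

Parameters

video (torch.Tensor) — The video to sample from.
metadata (VideoMetadata, optional) — Metadata of the video, containing information such as total duration, frame rate and total number of frames.
num_frames (int, optional) — Maximum number of frames to sample. Defaults to self.num_frames.
fps (int, optional) — Target number of frames to sample per second. Defaults to self.fps.
initial_shift (bool, float or int, defaults to self.initial_shift) — The initial shift to apply when sampling frames. If True, the shift is set so that frames are sampled from the middle of the video.

torch.Tensor

The sampled video frames.

Default sampling function which uniformly samples the desired number of frames between 0 and the total number of frames. If fps is passed along with metadata, fps frames per second are sampled uniformly. The arguments num_frames and fps are mutually exclusive.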
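The sampling behavior described above can be sketched in a few lines of plain Python that compute frame indices only. This is an illustration under stated assumptions, not the library's implementation: uniform_frame_indices and video_fps are hypothetical names, initial_shift is ignored, and the exact rounding used by the real sample_frames may differ.

```python
def uniform_frame_indices(total_frames, num_frames=None, fps=None, video_fps=None):
    """Return evenly spaced frame indices in [0, total_frames).

    Hypothetical sketch: `video_fps` stands for the native frame rate that
    would normally come from the video metadata.
    """
    if num_frames is not None and fps is not None:
        raise ValueError("`num_frames` and `fps` are mutually exclusive.")
    if fps is not None:
        if video_fps is None:
            raise ValueError("Sampling by `fps` needs the video's native frame rate.")
        # Convert the target rate into a frame count via the video duration.
        duration_seconds = total_frames / video_fps
        num_frames = int(duration_seconds * fps)
    if num_frames is None:
        num_frames = total_frames
    # Never ask for more frames than exist, and always return at least one.
    num_frames = max(1, min(num_frames, total_frames))
    step = total_frames / num_frames
    return [int(i * step) for i in range(num_frames)]


print(uniform_frame_indices(80, num_frames=8))  # [0, 10, 20, 30, 40, 50, 60, 70]
```

For a 300-frame video recorded at 30 frames per second, requesting fps=2 in this sketch yields 20 indices spaced 15 frames apart.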
