Introducing Idefics2: A Powerful 8B Vision-Language Model for the Community

Published April 15, 2024


Idefics-Obelics logo

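We are excited to release Idefics2, a general multimodal model that takes arbitrary sequences of text and images as input and generates text responses. It can answer questions about images, describe visual content, create stories grounded in multiple images, extract information from documents, and perform basic arithmetic operations.
Idefics2 improves upon Idefics1: with 8B parameters, an open license (Apache 2.0), and enhanced OCR (Optical Character Recognition) capabilities, it is a strong foundation for the multimodal community. Its performance on visual question answering benchmarks is at the top of its size class, and it competes with much larger models such as LLaVa-NeXT-34B and MM1-Chat-30B.
Idefics2 is also integrated in 🤗 Transformers from the start, so it is easy to fine-tune for many multimodal applications. You can try out the models on the Hub right now!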

The Cauldron

Model	Open weights	Size	# tokens per image	MMMU (val/test)	MathVista (testmini)	TextVQA (val)	MMBench (test)	VQAv2 (test-dev)	DocVQA (test)
DeepSeek-VL	✅	7B	576	36.6/-	36.1	64.4	73.2	-	49.6
LLaVa-NeXT-Mistral-7B	✅	7B	2880	35.3/-	37.7	65.7	68.7	82.2	-
LLaVa-NeXT-13B	✅	13B	2880	36.2/-	35.3	67.1	70.0	82.8	-
LLaVa-NeXT-34B	✅	34B	2880	51.1/44.7	46.5	69.5	79.3	83.7	-
MM1-Chat-7B	❌	7B	720	37.0/35.6	35.9	72.8	72.3	82.8	-
MM1-Chat-30B	❌	30B	720	44.7/40.3	39.4	73.5	75.1	83.7	-
Gemini 1.0 Pro	❌	🤷‍♂️	🤷‍♂️	47.9/-	45.2	74.6	-	71.2	88.1
Gemini 1.5 Pro	❌	🤷‍♂️	🤷‍♂️	58.5/-	52.1	73.5	-	73.2	86.5
Claude 3 Haiku	❌	🤷‍♂️	🤷‍♂️	50.2/-	46.4	-	-	-	88.8

Idefics1 instruct (32-shots)	✅	80B	-	-	-	39.3	-	68.8	-

Idefics2 (w/o im. split)*	✅	8B	64	43.5/37.9	51.6	70.4	76.8	80.8	67.3
Idefics2 (w/ im. split)*	✅	8B	320	43.0/37.7	51.4	73.0	76.7	81.2	74.0

* w/ im. split: Following the strategy of SPHINX and LLaVa-NeXT, we optionally split each image into 4 sub-images.
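The jump from 64 to 320 tokens per image in the table is consistent with the model seeing the four sub-images plus the original image, each encoded into 64 tokens. The sketch below illustrates that arithmetic and one simple way to compute quadrant crop boxes. The function names and the "4 crops + 1 full image" reading are assumptions made for illustration, not the released preprocessing code.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, upper, right, lower), PIL-style crop box

TOKENS_PER_IMAGE = 64  # taken from the "w/o im. split" row of the table


def split_into_four(width: int, height: int) -> List[Box]:
    """Return crop boxes for the four quadrants of a width x height image."""
    mid_x, mid_y = width // 2, height // 2
    return [
        (0, 0, mid_x, mid_y),           # top-left
        (mid_x, 0, width, mid_y),       # top-right
        (0, mid_y, mid_x, height),      # bottom-left
        (mid_x, mid_y, width, height),  # bottom-right
    ]


def visual_token_count(do_split: bool) -> int:
    """Assumption: with splitting, the model sees 4 crops plus the full image."""
    n_images = 4 + 1 if do_split else 1
    return n_images * TOKENS_PER_IMAGE


boxes = split_into_four(980, 980)
print(boxes[0], boxes[3])
print(visual_token_count(False), visual_token_count(True))
```

Each box can be passed to `PIL.Image.Image.crop` to obtain the corresponding sub-image.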

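Training Data

Idefics2 was pretrained on a mixture of openly available datasets: interleaved web documents (Wikipedia, OBELICS), image-caption pairs (Public Multimodal Dataset, LAION-COCO), OCR data (PDFA (en), IDL and Rendered-text), and image-to-code data (WebSight).
The interactive visualization allows exploring the OBELICS dataset.
Following common practice in the foundation model community, we further trained the base model on task-oriented data. However, these data are often in disparate formats and scattered across various places, and gathering them is a barrier for the community. To address this, we are releasing the multimodal instruction fine-tuning dataset we have been building: _The Cauldron_, an open compilation of **50** manually curated datasets formatted for multi-turn conversations. We instruction fine-tuned Idefics2 on the concatenation of The Cauldron and various text-only instruction fine-tuning datasets.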
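To make "formatted for multi-turn conversations" concrete, here is a minimal sketch that converts question/answer pairs about a single image into the user/assistant message structure used in the quickstart code later in this post. The input field names `question` and `answer` are hypothetical and are not meant to describe The Cauldron's actual schema.

```python
from typing import Dict, List


def to_chat_messages(qa_pairs: List[Dict[str, str]]) -> List[Dict]:
    """Turn question/answer pairs about one image into user/assistant turns.

    `question` and `answer` are hypothetical field names used for illustration.
    The image placeholder is attached to the first user turn only.
    """
    messages = []
    for i, pair in enumerate(qa_pairs):
        user_content = [{"type": "text", "text": pair["question"]}]
        if i == 0:
            user_content.insert(0, {"type": "image"})
        messages.append({"role": "user", "content": user_content})
        messages.append(
            {"role": "assistant", "content": [{"type": "text", "text": pair["answer"]}]}
        )
    return messages


example = to_chat_messages(
    [
        {"question": "What is shown in the chart?", "answer": "Monthly sales."},
        {"question": "Which month is highest?", "answer": "December."},
    ]
)
print(len(example), example[0]["content"][0])
```

Normalizing every source dataset into one structure like this is what allows the datasets to be concatenated and fed through a single chat template during fine-tuning.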

The Cauldron

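Improvements over Idefics1

We process images at their native resolution (up to 980 x 980) and native aspect ratio by following the NaViT strategy. This removes the need to resize images to fixed-size squares, as has historically been done in the computer vision community. Additionally, we follow the strategy from SPHINX and (optionally) allow sub-image splitting and passing images of very large resolution.
We significantly enhanced OCR abilities by integrating data that requires the model to transcribe text in an image or a document. We also improved the ability to answer questions on charts, figures, and documents with appropriate training data.
We departed from the Idefics1 architecture (gated cross-attention) and simplified the integration of visual features into the language backbone. Images are fed to the vision encoder, followed by a learned Perceiver pooling and an MLP modality projection. The pooled sequence is then concatenated with the text embeddings to obtain an (interleaved) sequence of images and text.

All of these improvements, along with better pretrained backbones, yield a significant jump in performance over Idefics1 for a model that is 10x smaller.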
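The architecture described in the third point above can be sketched with NumPy as a toy forward pass: patch features are pooled by a fixed number of latent queries, projected to the text embedding size, and concatenated with text embeddings. All dimensions and random weights here are placeholders, with only the 64 pooled tokens taken from the table. The stages follow the order in which the text lists them, and the released implementation may arrange or parameterize them differently.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes for illustration; only N_LATENTS = 64 comes from the table above.
N_PATCHES, VISION_DIM, TEXT_DIM, N_LATENTS, N_TEXT = 196, 32, 48, 64, 10


def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def perceiver_pool(patches, latents, w_q, w_k, w_v):
    """One cross-attention step: latent queries attend over patch features."""
    q, k, v = latents @ w_q, patches @ w_k, patches @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (N_LATENTS, N_PATCHES)
    return attn @ v                                 # (N_LATENTS, VISION_DIM)


def mlp_project(x, w1, w2):
    """Two-layer MLP with ReLU mapping vision features to the text embedding size."""
    return np.maximum(x @ w1, 0.0) @ w2


patches = rng.normal(size=(N_PATCHES, VISION_DIM))  # stand-in for vision encoder output
latents = rng.normal(size=(N_LATENTS, VISION_DIM))  # would be learned parameters
w_q, w_k, w_v = (rng.normal(size=(VISION_DIM, VISION_DIM)) * 0.1 for _ in range(3))
w1 = rng.normal(size=(VISION_DIM, 4 * VISION_DIM)) * 0.1
w2 = rng.normal(size=(4 * VISION_DIM, TEXT_DIM)) * 0.1
text_embeds = rng.normal(size=(N_TEXT, TEXT_DIM))   # stand-in for text token embeddings

pooled = perceiver_pool(patches, latents, w_q, w_k, w_v)
image_embeds = mlp_project(pooled, w1, w2)
sequence = np.concatenate([image_embeds, text_embeds], axis=0)
print(sequence.shape)  # (74, 48): 64 image tokens followed by 10 text tokens
```

The key property is that the number of image tokens is fixed by the number of latents, regardless of how many patches the vision encoder produces for a given resolution.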

Idefics2 Architecture

Getting Started with Idefics2

Idefics2 is available on the Hugging Face Hub and supported in the latest transformers version. Here is a code sample to try it out:

import requests
import torch
from PIL import Image

from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

DEVICE = "cuda:0"

# Note that passing the image urls (instead of the actual pil images) to the processor is also possible
image1 = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
image2 = load_image("https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg")
image3 = load_image("https://cdn.britannica.com/68/170868-050-8DDE8263/Golden-Gate-Bridge-San-Francisco.jpg")


processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
).to(DEVICE)


# Create inputs
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What do we see in this image?"},
        ]
    },
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": "In this image, we can see the city of New York, and more specifically the Statue of Liberty."},
        ]
    },
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "And how about this image?"},
        ]
    },
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image1, image2], return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}


# Generate
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)

print(generated_texts)
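Because `generate` on a decoder-only model typically returns the prompt tokens followed by the new tokens, each string in `generated_texts` usually contains the whole conversation and not just the final answer. The helper below is a minimal sketch for pulling out the last reply. It assumes the chat template renders turns with an `Assistant:` prefix, which should be checked against the actual decoded output, and the sample string is made up for illustration.

```python
def last_assistant_reply(decoded: str, marker: str = "Assistant:") -> str:
    """Return the text after the final `marker`, or the whole string if absent."""
    _, found, tail = decoded.rpartition(marker)
    return tail.strip() if found else decoded.strip()


# Made-up decoded string in the assumed "User: ... Assistant: ..." layout.
sample = (
    "User: What do we see in this image? \n"
    "Assistant: The Statue of Liberty. \n"
    "User: And how about this image? \n"
    "Assistant: A city skyline."
)
print(last_assistant_reply(sample))
```

An alternative that does not depend on the template text is to slice off the prompt tokens before decoding, using the length of the input IDs.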