Image tasks with IDEFICS

While individual tasks can be tackled by fine-tuning specialized models, an alternative approach that has recently emerged and gained popularity is to use large models for a diverse set of tasks without fine-tuning. For instance, large language models can handle NLP tasks such as summarization, translation, classification, and more. This approach is no longer limited to a single modality, such as text, and in this guide we will illustrate how you can solve image-text tasks with a large multimodal model called IDEFICS.

IDEFICS is an open-access vision and language model based on Flamingo, a state-of-the-art visual language model initially developed by DeepMind. The model accepts arbitrary sequences of image and text inputs and generates coherent text as output. It can answer questions about images, describe visual content, create stories grounded in multiple images, and so on. IDEFICS comes in two variants - 80 billion parameters and 9 billion parameters - both of which are available on the 🤗 Hub. For each variant, you can also find fine-tuned instructed versions of the model adapted for conversational use cases.

This model is exceptionally versatile and can be used for a wide range of image and multimodal tasks. However, being a large model means it requires significant computational resources and infrastructure. It is up to you to decide whether this approach suits your use case better than fine-tuning specialized models for each individual task.

In this guide, you'll learn how to:
- load IDEFICS and load the quantized version of the model
- use IDEFICS for image captioning, prompted image captioning, few-shot prompting, visual question answering, image classification, and image-guided text generation
- run inference in batch mode
- run IDEFICS instruct for conversational use

Before you begin, make sure you have all the necessary libraries installed.

pip install -q bitsandbytes sentencepiece accelerate transformers
To run the following examples with a non-quantized version of the model checkpoint you will need at least 20GB of GPU memory.

Loading the model

Let's start by loading the model's 9 billion parameter checkpoint:

>>> checkpoint = "HuggingFaceM4/idefics-9b"

Just like for other Transformers models, you need to load a processor and the model itself from the checkpoint. The IDEFICS processor wraps a LlamaTokenizer and the IDEFICS image processor into a single processor to take care of preparing text and image inputs for the model.

>>> import torch

>>> from transformers import IdeficsForVisionText2Text, AutoProcessor

>>> processor = AutoProcessor.from_pretrained(checkpoint)

>>> model = IdeficsForVisionText2Text.from_pretrained(checkpoint, torch_dtype=torch.bfloat16, device_map="auto")

Setting device_map to "auto" will automatically determine how to load and store the model weights in the most optimized manner given the available devices.
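If you are curious where the weights ended up, a minimal sketch of inspecting the placement - the hf_device_map attribute is populated by Accelerate when a device_map is used:

>>> # see how the model's modules were distributed across the available devices
>>> print(model.hf_device_map)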

Quantized model

If high-memory GPU availability is an issue, you can load a quantized version of the model. To load the model and the processor in 4-bit precision, pass a BitsAndBytesConfig to the from_pretrained method and the model will be compressed on the fly while loading.

>>> import torch
>>> from transformers import IdeficsForVisionText2Text, AutoProcessor, BitsAndBytesConfig

>>> quantization_config = BitsAndBytesConfig(
...     load_in_4bit=True,
...     bnb_4bit_compute_dtype=torch.float16,
... )

>>> processor = AutoProcessor.from_pretrained(checkpoint)

>>> model = IdeficsForVisionText2Text.from_pretrained(
...     checkpoint,
...     quantization_config=quantization_config,
...     device_map="auto"
... )

Now that you have the model loaded in one of the suggested ways, let's move on to exploring tasks that you can use IDEFICS for.

Image captioning

Image captioning is the task of predicting a caption for a given image. A common application is to aid visually impaired people navigate through different situations, for instance, explore image content online.

To illustrate the task, get an image to be captioned, e.g.:

Image of a puppy in a flower bed

Photo by Hendo Wang.

IDEFICS accepts prompts consisting of both text and images. However, to caption an image, you do not have to provide a text prompt to the model - only the preprocessed input image. Without a text prompt, the model will start generating text from the BOS (beginning-of-sequence) token, thus creating a caption.

As image input to the model, you can use either an image object (PIL.Image) or a URL from which the image can be retrieved.

>>> prompt = [
...     "https://images.unsplash.com/photo-1583160247711-2191776b4b91?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3542&q=80",
... ]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
A puppy in a flower bed
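As noted above, you can also pass a PIL.Image object instead of a URL. A minimal sketch, assuming a hypothetical local file my_image.jpg and reusing the processor, model, and bad_words_ids from above:

>>> from PIL import Image

>>> image = Image.open("my_image.jpg")  # hypothetical local file
>>> inputs = processor([image], return_tensors="pt").to("cuda")
>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
>>> print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])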

It is a good idea to include bad_words_ids in the call to generate to avoid errors that arise when increasing max_new_tokens: the model will want to generate a new <image> or <fake_token_around_image> token when there is no image being generated by the model. You can set it on the fly as in this guide, or store it in a GenerationConfig as described in the Text generation strategies guide.
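For instance, a minimal sketch of storing these settings once in a GenerationConfig instead of passing them on every call, reusing bad_words_ids from above:

>>> from transformers import GenerationConfig

>>> generation_config = GenerationConfig(max_new_tokens=10, bad_words_ids=bad_words_ids)
>>> generated_ids = model.generate(**inputs, generation_config=generation_config)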

Prompted image captioning

You can extend image captioning by providing a text prompt, which the model will continue given the image. Let's take another image to illustrate:

Image of the Eiffel Tower at night

Photo by Denys Nevozhai.

Text and image prompts can be passed to the model's processor as a single list to create the appropriate inputs.

>>> prompt = [
...     "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80",
...     "This is an image of ",
... ]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
This is an image of the Eiffel Tower in Paris, France.

Few-shot prompting

While IDEFICS demonstrates great zero-shot results, your task may require a certain format of the caption, or come with other restrictions or requirements that increase the task's complexity. Few-shot prompting can be used to enable in-context learning. By providing examples in the prompt, you can steer the model to generate results that mimic the format of the given examples.

Let's use the previous image of the Eiffel Tower as an example and build a prompt that demonstrates to the model that, in addition to learning what the object in an image is, we would also like to get some interesting information about it. Then, let's see if we can get the same response format for an image of the Statue of Liberty:

Image of the Statue of Liberty

Photo by Juan Mayobre.

>>> prompt = ["User:",
...            "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80",
...            "Describe this image.\nAssistant: An image of the Eiffel Tower at night. Fun fact: the Eiffel Tower is the same height as an 81-storey building.\n",
...            "User:",
...            "https://images.unsplash.com/photo-1524099163253-32b7f0256868?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3387&q=80",
...            "Describe this image.\nAssistant:"
...            ]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=30, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
User: Describe this image.
Assistant: An image of the Eiffel Tower at night. Fun fact: the Eiffel Tower is the same height as an 81-storey building. 
User: Describe this image.
Assistant: An image of the Statue of Liberty. Fun fact: the Statue of Liberty is 151 feet tall.

Notice that just from a single example (i.e., 1-shot) the model has learned how to perform the task. For more complex tasks, feel free to experiment with a larger number of examples (e.g., 3-shot, 5-shot, etc.).
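If you want to scale beyond 1-shot, one option is to assemble the prompt programmatically from (image, caption) pairs. A minimal sketch, where the examples list and the query URL are hypothetical placeholders:

>>> # hypothetical (image URL, demonstration caption) pairs
>>> examples = [
...     ("https://example.com/eiffel.jpg", "An image of the Eiffel Tower at night. Fun fact: ..."),
...     ("https://example.com/colosseum.jpg", "An image of the Colosseum. Fun fact: ..."),
... ]
>>> prompt = []
>>> for url, caption in examples:
...     prompt += ["User:", url, f"Describe this image.\nAssistant: {caption}\n"]
>>> # append the query image with an open-ended turn for the model to complete
>>> prompt += ["User:", "https://example.com/query.jpg", "Describe this image.\nAssistant:"]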

Visual question answering

Visual Question Answering (VQA) is the task of answering open-ended questions based on an image. Similar to image captioning, it can be used in accessibility applications, but also in education (reasoning about visual materials), customer service (questions about products based on images), and image retrieval.

Let's get a new image for this task:

Image of a couple having a picnic

Photo by Jarritos Mexican Soda.

You can steer the model from image captioning to visual question answering by prompting it with appropriate instructions:

>>> prompt = [
...     "Instruction: Provide an answer to the question. Use the image to answer.\n",
...     "https://images.unsplash.com/photo-1623944889288-cd147dbb517c?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
...     "Question: Where are these people and what's the weather like? Answer:"
... ]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=20, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
Instruction: Provide an answer to the question. Use the image to answer.
 Question: Where are these people and what's the weather like? Answer: They're in a park in New York City, and it's a beautiful day.

Image classification

IDEFICS is capable of classifying images into different categories without being explicitly trained on data containing labeled examples from those specific categories. Given a list of categories, and using its image and text understanding capabilities, the model can infer which category the image likely belongs to.

Say we have this image of a vegetable stand:

Image of a vegetable stand

Photo by Peter Wendt.

We can instruct the model to classify the image into one of the categories that we have:

>>> categories = ['animals','vegetables', 'city landscape', 'cars', 'office']
>>> prompt = [f"Instruction: Classify the following image into a single category from the following list: {categories}.\n",
...     "https://images.unsplash.com/photo-1471193945509-9ad0617afabf?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",    
...     "Category: "
... ]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=6, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
Instruction: Classify the following image into a single category from the following list: ['animals', 'vegetables', 'city landscape', 'cars', 'office'].
Category: Vegetables

In the example above we instructed the model to classify the image into a single category; however, you can also prompt the model to do ranked classification.
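For instance, a minimal sketch of a ranked-classification prompt, reusing the image and categories from above - the exact wording of the instruction is just one possible choice:

>>> prompt = [f"Instruction: Rank the categories in the following list from most to least relevant to the image: {categories}.\n",
...     "https://images.unsplash.com/photo-1471193945509-9ad0617afabf?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
...     "Ranking: "
... ]
>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> generated_ids = model.generate(**inputs, max_new_tokens=20, bad_words_ids=bad_words_ids)
>>> print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])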

Image-guided text generation

For more creative applications, you can use image-guided text generation to generate text based on an image. This can be useful to create descriptions of products, ads, descriptions of a scene, etc.

Let's prompt IDEFICS to write a story based on a simple image of a red door:

Image of a red door with a pumpkin on the steps

Photo by Craig Tidball.

>>> prompt = ["Instruction: Use the image to write a story. \n",
...     "https://images.unsplash.com/photo-1517086822157-2b0358e7684a?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=2203&q=80",
...     "Story: \n"]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, num_beams=2, max_new_tokens=200, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0]) 
Instruction: Use the image to write a story. 
 Story: 
Once upon a time, there was a little girl who lived in a house with a red door.  She loved her red door.  It was the prettiest door in the whole world.

One day, the little girl was playing in her yard when she noticed a man standing on her doorstep.  He was wearing a long black coat and a top hat.

The little girl ran inside and told her mother about the man.

Her mother said, “Don’t worry, honey.  He’s just a friendly ghost.”

The little girl wasn’t sure if she believed her mother, but she went outside anyway.

When she got to the door, the man was gone.

The next day, the little girl was playing in her yard again when she noticed the man standing on her doorstep.

He was wearing a long black coat and a top hat.

The little girl ran

Looks like IDEFICS noticed the pumpkin on the doorstep and went with a spooky Halloween story about a ghost.

For longer outputs like this, you will greatly benefit from tweaking the text generation strategy. This can help you significantly improve the quality of the generated output. Check out Text generation strategies to learn more.
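For example, a minimal sketch of switching from beam search to sampling for this story prompt - the parameter values shown are illustrative starting points, not tuned recommendations:

>>> generated_ids = model.generate(
...     **inputs,
...     do_sample=True,            # sample from the distribution instead of beam search
...     temperature=0.7,           # illustrative value; lower is more conservative
...     top_p=0.9,                 # nucleus sampling cutoff
...     repetition_penalty=1.2,    # discourage repeating the same phrases
...     max_new_tokens=200,
...     bad_words_ids=bad_words_ids,
... )
>>> print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])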

Running inference in batch mode

All of the earlier sections illustrated IDEFICS for a single example. In a very similar fashion, you can run inference for a batch of examples by passing a list of prompts:

>>> prompts = [
...     [   "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80",
...         "This is an image of ",
...     ],
...     [   "https://images.unsplash.com/photo-1623944889288-cd147dbb517c?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
...         "This is an image of ",
...     ],
...     [   "https://images.unsplash.com/photo-1471193945509-9ad0617afabf?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
...         "This is an image of ",
...     ],
... ]

>>> inputs = processor(prompts, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> for i,t in enumerate(generated_text):
...     print(f"{i}:\n{t}\n") 
0:
This is an image of the Eiffel Tower in Paris, France.

1:
This is an image of a couple on a picnic blanket.

2:
This is an image of a vegetable stand.

IDEFICS instruct for conversational use

For conversational use cases, you can find fine-tuned instructed versions of the model on the 🤗 Hub: HuggingFaceM4/idefics-80b-instruct and HuggingFaceM4/idefics-9b-instruct.

These checkpoints are the result of fine-tuning the respective base models on a mixture of supervised and instruction fine-tuning datasets, which boosts the downstream performance while making the models more usable in conversational settings.

The use and prompting for the conversational use is very similar to using the base models:

>>> import torch
>>> from transformers import IdeficsForVisionText2Text, AutoProcessor
>>> from accelerate.test_utils.testing import get_backend

>>> device, _, _ = get_backend() # automatically detects the underlying device type (CUDA, CPU, XPU, MPS, etc.)
>>> checkpoint = "HuggingFaceM4/idefics-9b-instruct"
>>> model = IdeficsForVisionText2Text.from_pretrained(checkpoint, torch_dtype=torch.bfloat16).to(device)
>>> processor = AutoProcessor.from_pretrained(checkpoint)

>>> prompts = [
...     [
...         "User: What is in this image?",
...         "https://upload.wikimedia.org/wikipedia/commons/8/86/Id%C3%A9fix.JPG",
...         "<end_of_utterance>",

...         "\nAssistant: This picture depicts Idefix, the dog of Obelix in Asterix and Obelix. Idefix is running on the ground.<end_of_utterance>",

...         "\nUser:",
...         "https://static.wikia.nocookie.net/asterix/images/2/25/R22b.gif/revision/latest?cb=20110815073052",
...         "And who is that?<end_of_utterance>",

...         "\nAssistant:",
...     ],
... ]

>>> # --batched mode
>>> inputs = processor(prompts, add_end_of_utterance_token=False, return_tensors="pt").to(device)
>>> # --single sample mode
>>> # inputs = processor(prompts[0], return_tensors="pt").to(device)

>>> # Generation args
>>> exit_condition = processor.tokenizer("<end_of_utterance>", add_special_tokens=False).input_ids
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, eos_token_id=exit_condition, bad_words_ids=bad_words_ids, max_length=100)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> for i, t in enumerate(generated_text):
...     print(f"{i}:\n{t}\n")