Image tasks with IDEFICS

While individual tasks can be tackled by fine-tuning specialized models, an alternative approach that has recently emerged and gained popularity is to use large models for a diverse set of tasks without fine-tuning. For instance, large language models can handle NLP tasks such as summarization, translation, classification, and more. This approach is no longer limited to a single modality, such as text, and in this guide we will illustrate how you can solve image-text tasks with a large multimodal model called IDEFICS.

IDEFICS is an open-access vision and language model based on Flamingo, a state-of-the-art visual language model initially developed by DeepMind. The model accepts arbitrary sequences of image and text inputs and generates coherent text as output. It can answer questions about images, describe visual content, create stories grounded in multiple images, and so on. IDEFICS comes in two variants - 80 billion parameters and 9 billion parameters - both of which are available on the 🤗 Hub. For each variant, you can also find fine-tuned instructed versions of the model adapted for conversational use cases.

This model is exceptionally versatile and can be used for a wide range of image and multimodal tasks. However, being a large model means it requires significant computational resources and infrastructure. It is up to you to decide whether this approach suits your use case better than fine-tuning specialized models for each individual task.

In this guide, you'll learn how to:
- load IDEFICS and load the quantized version of the model
- use IDEFICS for image captioning, prompted image captioning, few-shot prompting, visual question answering, image classification, and image-guided text generation
- run inference in batch mode
- run IDEFICS instruct for conversational use

Before you begin, make sure you have all the necessary libraries installed.

pip install -q bitsandbytes sentencepiece accelerate transformers
To run the following examples with a non-quantized version of the model checkpoint you will need at least 20GB of GPU memory.

Loading the model

Let's start by loading the model's 9 billion parameter checkpoint:

>>> checkpoint = "HuggingFaceM4/idefics-9b"

Just like for other Transformers models, you need to load a processor and the model itself from the checkpoint. The IDEFICS processor wraps a LlamaTokenizer and the IDEFICS image processor into a single processor to take care of preparing text and image inputs for the model.

>>> import torch

>>> from transformers import IdeficsForVisionText2Text, AutoProcessor

>>> processor = AutoProcessor.from_pretrained(checkpoint)

>>> model = IdeficsForVisionText2Text.from_pretrained(checkpoint, torch_dtype=torch.bfloat16, device_map="auto")

Setting device_map to "auto" will automatically determine how to load and store the model weights in the most optimized manner given the available devices.
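If you are curious where the weights ended up, a minimal sketch of inspecting the placement - the hf_device_map attribute is populated by Accelerate when a device_map is used:

>>> # see how the model's modules were distributed across the available devices
>>> print(model.hf_device_map)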

Quantized model

If high-memory GPU availability is an issue, you can load a quantized version of the model. To load the model and the processor in 4-bit precision, pass a BitsAndBytesConfig to the from_pretrained method and the model will be compressed on the fly while loading.

>>> import torch
>>> from transformers import IdeficsForVisionText2Text, AutoProcessor, BitsAndBytesConfig

>>> quantization_config = BitsAndBytesConfig(
...     load_in_4bit=True,
...     bnb_4bit_compute_dtype=torch.float16,
... )

>>> processor = AutoProcessor.from_pretrained(checkpoint)

>>> model = IdeficsForVisionText2Text.from_pretrained(
...     checkpoint,
...     quantization_config=quantization_config,
...     device_map="auto"
... )

Now that you have the model loaded in one of the suggested ways, let's move on to exploring tasks that you can use IDEFICS for.

Image captioning

Image captioning is the task of predicting a caption for a given image. A common application is to aid visually impaired people navigate through different situations, for instance, explore image content online.

To illustrate the task, get an image to be captioned, e.g.:

Image of a puppy in a flower bed

Photo by Hendo Wang.

IDEFICS accepts prompts consisting of both text and images. However, to caption an image, you do not have to provide a text prompt to the model - only the preprocessed input image. Without a text prompt, the model will start generating text from the BOS (beginning-of-sequence) token, thus creating a caption.

As image input to the model, you can use either an image object (PIL.Image) or a URL from which the image can be retrieved.

>>> prompt = [
...     "https://images.unsplash.com/photo-1583160247711-2191776b4b91?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3542&q=80",
... ]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
A puppy in a flower bed
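As noted above, you can also pass a PIL.Image object instead of a URL. A minimal sketch, assuming a hypothetical local file my_image.jpg and reusing the processor, model, and bad_words_ids from above:

>>> from PIL import Image

>>> image = Image.open("my_image.jpg")  # hypothetical local file
>>> inputs = processor([image], return_tensors="pt").to("cuda")
>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
>>> print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])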

It is a good idea to include bad_words_ids in the call to generate to avoid errors that arise when increasing max_new_tokens: the model will want to generate a new <image> or <fake_token_around_image> token when there is no image being generated by the model. You can set it on the fly as in this guide, or store it in a GenerationConfig as described in the Text generation strategies guide.
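For instance, a minimal sketch of storing these settings once in a GenerationConfig instead of passing them on every call, reusing bad_words_ids from above:

>>> from transformers import GenerationConfig

>>> generation_config = GenerationConfig(max_new_tokens=10, bad_words_ids=bad_words_ids)
>>> generated_ids = model.generate(**inputs, generation_config=generation_config)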

Prompted image captioning

You can extend image captioning by providing a text prompt, which the model will continue given the image. Let's take another image to illustrate:

Image of the Eiffel Tower at night

Photo by Denys Nevozhai.

Text and image prompts can be passed to the model's processor as a single list to create the appropriate inputs.

>>> prompt = [
...     "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80",
...     "This is an image of ",
... ]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
This is an image of the Eiffel Tower in Paris, France.

Few-shot prompting

While IDEFICS demonstrates great zero-shot results, your task may require a certain format of the caption, or come with other restrictions or requirements that increase the task's complexity. Few-shot prompting can be used to enable in-context learning. By providing examples in the prompt, you can steer the model to generate results that mimic the format of the given examples.

Let's use the previous image of the Eiffel Tower as an example and build a prompt that demonstrates to the model that, in addition to learning what the object in an image is, we would also like to get some interesting information about it. Then, let's see if we can get the same response format for an image of the Statue of Liberty:

Image of the Statue of Liberty

Photo by Juan Mayobre.

>>> prompt = ["User:",
...            "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80",
...            "Describe this image.\nAssistant: An image of the Eiffel Tower at night. Fun fact: the Eiffel Tower is the same height as an 81-storey building.\n",
...            "User:",
...            "https://images.unsplash.com/photo-1524099163253-32b7f0256868?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3387&q=80",
...            "Describe this image.\nAssistant:"
...            ]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=30, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
User: Describe this image.
Assistant: An image of the Eiffel Tower at night. Fun fact: the Eiffel Tower is the same height as an 81-storey building. 
User: Describe this image.
Assistant: An image of the Statue of Liberty. Fun fact: the Statue of Liberty is 151 feet tall.

Notice that just from a single example (i.e., 1-shot) the model has learned how to perform the task. For more complex tasks, feel free to experiment with a larger number of examples (e.g., 3-shot, 5-shot, etc.).
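If you want to scale beyond 1-shot, one option is to assemble the prompt programmatically from (image, caption) pairs. A minimal sketch, where the examples list and the query URL are hypothetical placeholders:

>>> # hypothetical (image URL, demonstration caption) pairs
>>> examples = [
...     ("https://example.com/eiffel.jpg", "An image of the Eiffel Tower at night. Fun fact: ..."),
...     ("https://example.com/colosseum.jpg", "An image of the Colosseum. Fun fact: ..."),
... ]
>>> prompt = []
>>> for url, caption in examples:
...     prompt += ["User:", url, f"Describe this image.\nAssistant: {caption}\n"]
>>> # append the query image with an open-ended turn for the model to complete
>>> prompt += ["User:", "https://example.com/query.jpg", "Describe this image.\nAssistant:"]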

Visual question answering

Visual Question Answering (VQA) is the task of answering open-ended questions based on an image. Similar to image captioning, it can be used in accessibility applications, but also in education (reasoning about visual materials), customer service (questions about products based on images), and image retrieval.

Let's get a new image for this task:

Image of a couple having a picnic

Photo by Jarritos Mexican Soda.

You can steer the model from image captioning to visual question answering by prompting it with appropriate instructions:

>>> prompt = [
...     "Instruction: Provide an answer to the question. Use the image to answer.\n",
...     "https://images.unsplash.com/photo-1623944889288-cd147dbb517c?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
...     "Question: Where are these people and what's the weather like? Answer:"
... ]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=20, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
Instruction: Provide an answer to the question. Use the image to answer.
 Question: Where are these people and what's the weather like? Answer: They're in a park in New York City, and it's a beautiful day.

Image classification

IDEFICS is capable of classifying images into different categories without being explicitly trained on data containing labeled examples from those specific categories. Given a list of categories, and using its image and text understanding capabilities, the model can infer which category the image likely belongs to.

Say we have this image of a vegetable stand:

Image of a vegetable stand

Photo by Peter Wendt.

We can instruct the model to classify the image into one of the categories that we have:

>>> categories = ['animals','vegetables', 'city landscape', 'cars', 'office']
>>> prompt = [f"Instruction: Classify the following image into a single category from the following list: {categories}.\n",
...     "https://images.unsplash.com/photo-1471193945509-9ad0617afabf?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",    
...     "Category: "
... ]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=6, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
Instruction: Classify the following image into a single category from the following list: ['animals', 'vegetables', 'city landscape', 'cars', 'office'].
Category: Vegetables

In the example above we instructed the model to classify the image into a single category; however, you can also prompt the model to do ranked classification.
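For instance, a minimal sketch of a ranked-classification prompt, reusing the image and categories from above - the exact wording of the instruction is just one possible choice:

>>> prompt = [f"Instruction: Rank the categories in the following list from most to least relevant to the image: {categories}.\n",
...     "https://images.unsplash.com/photo-1471193945509-9ad0617afabf?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
...     "Ranking: "
... ]
>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> generated_ids = model.generate(**inputs, max_new_tokens=20, bad_words_ids=bad_words_ids)
>>> print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])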

Image-guided text generation

For more creative applications, you can use image-guided text generation to generate text based on an image. This can be useful to create descriptions of products, ads, descriptions of a scene, etc.

Let's prompt IDEFICS to write a story based on a simple image of a red door:

Image of a red door with a pumpkin on the steps

Photo by Craig Tidball.

>>> prompt = ["Instruction: Use the image to write a story. \n",
...     "https://images.unsplash.com/photo-1517086822157-2b0358e7684a?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=2203&q=80",
...     "Story: \n"]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, num_beams=2, max_new_tokens=200, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0]) 
Instruction: Use the image to write a story. 
 Story: 
Once upon a time, there was a little girl who lived in a house with a red door.  She loved her red door.  It was the prettiest door in the whole world.

One day, the little girl was playing in her yard when she noticed a man standing on her doorstep.  He was wearing a long black coat and a top hat.

The little girl ran inside and told her mother about the man.

Her mother said, “Don’t worry, honey.  He’s just a friendly ghost.”

The little girl wasn’t sure if she believed her mother, but she went outside anyway.

When she got to the door, the man was gone.

The next day, the little girl was playing in her yard again when she noticed the man standing on her doorstep.

He was wearing a long black coat and a top hat.

The little girl ran

Looks like IDEFICS noticed the pumpkin on the doorstep and went with a spooky Halloween story about a ghost.

For longer outputs like this, you will greatly benefit from tweaking the text generation strategy. This can help you significantly improve the quality of the generated output. Check out Text generation strategies to learn more.
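For example, a minimal sketch of switching from beam search to sampling for this story prompt - the parameter values shown are illustrative starting points, not tuned recommendations:

>>> generated_ids = model.generate(
...     **inputs,
...     do_sample=True,            # sample from the distribution instead of beam search
...     temperature=0.7,           # illustrative value; lower is more conservative
...     top_p=0.9,                 # nucleus sampling cutoff
...     repetition_penalty=1.2,    # discourage repeating the same phrases
...     max_new_tokens=200,
...     bad_words_ids=bad_words_ids,
... )
>>> print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])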

Running inference in batch mode

All of the earlier sections illustrated IDEFICS for a single example. In a very similar fashion, you can run inference for a batch of examples by passing a list of prompts:

>>> prompts = [
...     [   "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80",
...         "This is an image of ",
...     ],
...     [   "https://images.unsplash.com/photo-1623944889288-cd147dbb517c?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
...         "This is an image of ",
...     ],
...     [   "https://images.unsplash.com/photo-1471193945509-9ad0617afabf?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
...         "This is an image of ",
...     ],
... ]

>>> inputs = processor(prompts, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> for i,t in enumerate(generated_text):
...     print(f"{i}:\n{t}\n") 
0:
This is an image of the Eiffel Tower in Paris, France.

1:
This is an image of a couple on a picnic blanket.

2:
This is an image of a vegetable stand.

IDEFICS instruct for conversational use

For conversational use cases, you can find fine-tuned instructed versions of the model on the 🤗 Hub: HuggingFaceM4/idefics-80b-instruct and HuggingFaceM4/idefics-9b-instruct.

These checkpoints are the result of fine-tuning the respective base models on a mixture of supervised and instruction fine-tuning datasets, which boosts the downstream performance while making the models more usable in conversational settings.

The use and prompting for the conversational use is very similar to using the base models:

>>> import torch
>>> from transformers import IdeficsForVisionText2Text, AutoProcessor
>>> from accelerate.test_utils.testing import get_backend

>>> device, _, _ = get_backend() # automatically detects the underlying device type (CUDA, CPU, XPU, MPS, etc.)
>>> checkpoint = "HuggingFaceM4/idefics-9b-instruct"
>>> model = IdeficsForVisionText2Text.from_pretrained(checkpoint, torch_dtype=torch.bfloat16).to(device)
>>> processor = AutoProcessor.from_pretrained(checkpoint)

>>> prompts = [
...     [
...         "User: What is in this image?",
...         "https://upload.wikimedia.org/wikipedia/commons/8/86/Id%C3%A9fix.JPG",
...         "<end_of_utterance>",

...         "\nAssistant: This picture depicts Idefix, the dog of Obelix in Asterix and Obelix. Idefix is running on the ground.<end_of_utterance>",

...         "\nUser:",
...         "https://static.wikia.nocookie.net/asterix/images/2/25/R22b.gif/revision/latest?cb=20110815073052",
...         "And who is that?<end_of_utterance>",

...         "\nAssistant:",
...     ],
... ]

>>> # --batched mode
>>> inputs = processor(prompts, add_end_of_utterance_token=False, return_tensors="pt").to(device)
>>> # --single sample mode
>>> # inputs = processor(prompts[0], return_tensors="pt").to(device)

>>> # Generation args
>>> exit_condition = processor.tokenizer("<end_of_utterance>", add_special_tokens=False).input_ids
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, eos_token_id=exit_condition, bad_words_ids=bad_words_ids, max_length=100)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> for i, t in enumerate(generated_text):
...     print(f"{i}:\n{t}\n")