
Image tasks with IDEFICS

While individual tasks can be tackled by fine-tuning specialized models, an alternative approach that has recently emerged and gained popularity is to use large models for a diverse set of tasks without fine-tuning. For instance, large language models can handle NLP tasks such as summarization, translation, classification, and more. This approach is no longer limited to a single modality such as text, and in this guide we will illustrate how to solve image-text tasks with a large multimodal model called IDEFICS.

IDEFICS is an open-access vision and language model based on Flamingo, a state-of-the-art visual language model developed by DeepMind. The model accepts arbitrary sequences of image and text inputs and generates coherent text as output. It can answer questions about images, describe visual content, create stories grounded in multiple images, and so on. IDEFICS comes in two variants - 80 billion parameters and 9 billion parameters - both of which are available on the 🤗 Hub. For each variant, you can also find fine-tuned instructed versions of the model adapted for conversational use cases.

This model is exceptionally versatile and can be used for a wide range of image and multimodal tasks. However, being a large model means it requires significant computational resources and infrastructure. It is up to you to decide whether this approach suits your use case better than fine-tuning a specialized model for each individual task.

In this guide, you'll learn how to load IDEFICS (including a quantized version of the model), how to use it for image captioning, prompted image captioning, few-shot prompting, visual question answering, image classification, and image-guided text generation, as well as how to run inference in batch mode and how to run IDEFICS instruct for conversational use.

Before you begin, make sure you have all the necessary libraries installed.

pip install -q bitsandbytes sentencepiece accelerate transformers
To run the following examples with a non-quantized version of the model checkpoint you will need at least 20GB of GPU memory.

Loading the model

Let's start by loading the model's 9 billion parameter checkpoint:

>>> checkpoint = "HuggingFaceM4/idefics-9b"

Just like for other Transformers models, you need to load a processor and the model itself from the checkpoint. The IDEFICS processor wraps a LlamaTokenizer and the IDEFICS image processor into a single processor that takes care of preparing text and image inputs for the model.

>>> import torch

>>> from transformers import IdeficsForVisionText2Text, AutoProcessor

>>> processor = AutoProcessor.from_pretrained(checkpoint)

>>> model = IdeficsForVisionText2Text.from_pretrained(checkpoint, torch_dtype=torch.bfloat16, device_map="auto")

Setting device_map to "auto" will automatically determine how to load and store the model weights in the most optimized manner given the existing devices.
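
If you are curious how the weights ended up being distributed, you can inspect the device map that accelerate produced. This is a minimal sketch, assuming the model was loaded with device_map="auto" as above; the hf_device_map attribute is only set in that case.

>>> # Inspect how accelerate placed the weights across the available devices
>>> # (this attribute is only present when the model was loaded with a device_map)
>>> print(model.hf_device_map)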

Quantized model

If high-memory GPU availability is an issue, you can load a quantized version of the model. To load the model and the processor in 4-bit precision, pass a BitsAndBytesConfig to the from_pretrained method and the model will be compressed on the fly while loading.

>>> import torch
>>> from transformers import IdeficsForVisionText2Text, AutoProcessor, BitsAndBytesConfig

>>> quantization_config = BitsAndBytesConfig(
...     load_in_4bit=True,
...     bnb_4bit_compute_dtype=torch.float16,
... )

>>> processor = AutoProcessor.from_pretrained(checkpoint)

>>> model = IdeficsForVisionText2Text.from_pretrained(
...     checkpoint,
...     quantization_config=quantization_config,
...     device_map="auto"
... )

Now that you have the model loaded in one of the recommended ways, let's move on to exploring tasks that you can use IDEFICS for.

Image captioning

Image captioning is the task of predicting a caption for a given image. A common application is to aid visually impaired people navigate through different situations, for instance, explore image content online.

To illustrate the task, get an image to be captioned, e.g.:

Image of a puppy in a flower bed

Photo by Hendo Wang.

IDEFICS accepts text and image prompts. However, to caption an image, you do not have to provide a text prompt to the model, only the preprocessed input image. Without a text prompt, the model will start generating text from the BOS (beginning-of-sequence) token, thus creating a caption.

As image input to the model, you can use either an image object (PIL.Image) or a URL from which the image can be retrieved.

>>> prompt = [
...     "https://images.unsplash.com/photo-1583160247711-2191776b4b91?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3542&q=80",
... ]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
A puppy in a flower bed
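
If you would rather pass an image object than a URL, a minimal sketch could look like the following, mirroring the URL-based example above (the requests and PIL libraries are assumed to be available).

>>> import requests
>>> from PIL import Image

>>> # Download the image and open it as a PIL.Image object
>>> url = "https://images.unsplash.com/photo-1583160247711-2191776b4b91?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3542&q=80"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> # The image object takes the place of the URL string in the prompt
>>> inputs = processor([image], return_tensors="pt").to("cuda")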

It is a good idea to include bad_words_ids in the call to generate to avoid errors arising when increasing max_new_tokens: the model will want to generate a new <image> or <fake_token_around_image> token when there is no image being generated by the model. You can set it on-the-fly as in this guide, or store it in a GenerationConfig as described in the Text generation strategies guide.
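
For instance, a minimal sketch of persisting these settings in a GenerationConfig (the specific values are illustrative, not recommendations) could look like:

>>> from transformers import GenerationConfig

>>> # Store generation settings once instead of passing them on every call
>>> generation_config = GenerationConfig(max_new_tokens=10, bad_words_ids=bad_words_ids)
>>> generated_ids = model.generate(**inputs, generation_config=generation_config)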

Prompted image captioning

You can extend image captioning by providing a text prompt, which the model will continue given the image. Let's take another image to illustrate:

Image of the Eiffel Tower at night

Photo by Denys Nevozhai.

Text and image prompts can be passed to the model's processor as a single list to create appropriate inputs.

>>> prompt = [
...     "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80",
...     "This is an image of ",
... ]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
This is an image of the Eiffel Tower in Paris, France.

Few-shot prompting

While IDEFICS demonstrates great zero-shot results, your task may require a certain format of the caption, or come with other restrictions or requirements that increase the task's complexity. Few-shot prompting can be used to enable in-context learning. By providing examples in the prompt, you can steer the model to generate results that mimic the format of the given examples.

Let's use the previous image of the Eiffel Tower as an example for the model and build a prompt that shows the model that, in addition to learning what the object in an image is, we would also like to get some interesting information about it. Then, let's see if we can get the same response format for an image of the Statue of Liberty:

Image of the Statue of Liberty

Photo by Juan Mayobre.

>>> prompt = ["User:",
...            "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80",
...            "Describe this image.\nAssistant: An image of the Eiffel Tower at night. Fun fact: the Eiffel Tower is the same height as an 81-storey building.\n",
...            "User:",
...            "https://images.unsplash.com/photo-1524099163253-32b7f0256868?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3387&q=80",
...            "Describe this image.\nAssistant:"
...            ]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=30, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
User: Describe this image.
Assistant: An image of the Eiffel Tower at night. Fun fact: the Eiffel Tower is the same height as an 81-storey building. 
User: Describe this image.
Assistant: An image of the Statue of Liberty. Fun fact: the Statue of Liberty is 151 feet tall.

Notice that just from a single example (i.e. 1-shot) the model has learned how to perform the task. For more complex tasks, feel free to experiment with a larger number of examples (e.g. 3-shot, 5-shot, etc.).

Visual question answering

Visual Question Answering (VQA) is the task of answering open-ended questions based on an image. Similar to image captioning, it can be used in accessibility applications, but also in education (reasoning about visual materials), customer service (questions about products based on images), and image retrieval.

Let's get a new image for this task:

Image of a couple having a picnic

Photo by Jarritos Mexican Soda.

You can steer the model from image captioning to visual question answering by prompting it with appropriate instructions:

>>> prompt = [
...     "Instruction: Provide an answer to the question. Use the image to answer.\n",
...     "https://images.unsplash.com/photo-1623944889288-cd147dbb517c?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
...     "Question: Where are these people and what's the weather like? Answer:"
... ]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=20, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
Instruction: Provide an answer to the question. Use the image to answer.
 Question: Where are these people and what's the weather like? Answer: They're in a park in New York City, and it's a beautiful day.

Image classification

IDEFICS is capable of classifying images into different categories without being explicitly trained on data containing labeled examples from those specific categories. Given a list of categories and using its image and text understanding capabilities, the model can infer which category the image likely belongs to.

Say we have this image of a vegetable stand:

Image of a vegetable stand

Photo by Peter Wendt.

We can instruct the model to classify the image into one of the categories that we have:

>>> categories = ['animals','vegetables', 'city landscape', 'cars', 'office']
>>> prompt = [f"Instruction: Classify the following image into a single category from the following list: {categories}.\n",
...     "https://images.unsplash.com/photo-1471193945509-9ad0617afabf?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",    
...     "Category: "
... ]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=6, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
Instruction: Classify the following image into a single category from the following list: ['animals', 'vegetables', 'city landscape', 'cars', 'office'].
Category: Vegetables

In the example above we instruct the model to classify the image into a single category; however, you can also prompt the model to do rank classification.
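
As a purely illustrative, untested sketch, a rank-classification prompt might look like this (the wording of the instruction is a hypothetical variation on the example above):

>>> prompt = [f"Instruction: Rank the following categories from the most to the least fitting for the image: {categories}.\n",
...     "https://images.unsplash.com/photo-1471193945509-9ad0617afabf?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
...     "Ranking: "
... ]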

Image-guided text generation

For more creative applications, you can use image-guided text generation to generate text based on an image. This can be useful to create descriptions of products, ads, descriptions of a scene, etc.

Let's prompt IDEFICS to write a story based on a simple image of a red door:

Image of a red door with a pumpkin on the steps

Photo by Craig Tidball.

>>> prompt = ["Instruction: Use the image to write a story. \n",
...     "https://images.unsplash.com/photo-1517086822157-2b0358e7684a?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=2203&q=80",
...     "Story: \n"]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, num_beams=2, max_new_tokens=200, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0]) 
Instruction: Use the image to write a story. 
 Story: 
Once upon a time, there was a little girl who lived in a house with a red door.  She loved her red door.  It was the prettiest door in the whole world.

One day, the little girl was playing in her yard when she noticed a man standing on her doorstep.  He was wearing a long black coat and a top hat.

The little girl ran inside and told her mother about the man.

Her mother said, “Don’t worry, honey.  He’s just a friendly ghost.”

The little girl wasn’t sure if she believed her mother, but she went outside anyway.

When she got to the door, the man was gone.

The next day, the little girl was playing in her yard again when she noticed the man standing on her doorstep.

He was wearing a long black coat and a top hat.

The little girl ran

Looks like IDEFICS noticed the pumpkin on the doorstep and went with a spooky Halloween story about a ghost.

For longer outputs like this, you will greatly benefit from tweaking the text generation strategy. This can help you significantly improve the quality of the generated output. Check out Text generation strategies to learn more.
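
As a hedged example, switching from beam search to nucleus sampling could look like the following; the parameter values are only a starting point, not tuned recommendations.

>>> generated_ids = model.generate(
...     **inputs,
...     do_sample=True,          # sample instead of always picking the most likely token
...     temperature=0.8,         # soften the token distribution
...     top_p=0.95,              # nucleus sampling
...     max_new_tokens=200,
...     bad_words_ids=bad_words_ids,
... )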

Running inference in batch mode

All of the previous sections illustrated IDEFICS for a single example. In a very similar fashion, you can run inference for a batch of examples by passing a list of prompts:

>>> prompts = [
...     [   "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80",
...         "This is an image of ",
...     ],
...     [   "https://images.unsplash.com/photo-1623944889288-cd147dbb517c?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
...         "This is an image of ",
...     ],
...     [   "https://images.unsplash.com/photo-1471193945509-9ad0617afabf?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
...         "This is an image of ",
...     ],
... ]

>>> inputs = processor(prompts, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> for i,t in enumerate(generated_text):
...     print(f"{i}:\n{t}\n") 
0:
This is an image of the Eiffel Tower in Paris, France.

1:
This is an image of a couple on a picnic blanket.

2:
This is an image of a vegetable stand.

IDEFICS instruct for conversational use

For conversational use cases, you can find fine-tuned instructed versions of the model on the 🤗 Hub: HuggingFaceM4/idefics-80b-instruct and HuggingFaceM4/idefics-9b-instruct.

These checkpoints are the result of fine-tuning the respective base models on a mixture of supervised and instruction fine-tuning datasets, which boosts the downstream performance while making the models more usable in conversational settings.

The use and prompting for the conversational use is very similar to using the base models:

>>> import torch
>>> from transformers import IdeficsForVisionText2Text, AutoProcessor

>>> device = "cuda" if torch.cuda.is_available() else "cpu"

>>> checkpoint = "HuggingFaceM4/idefics-9b-instruct"
>>> model = IdeficsForVisionText2Text.from_pretrained(checkpoint, torch_dtype=torch.bfloat16).to(device)
>>> processor = AutoProcessor.from_pretrained(checkpoint)

>>> prompts = [
...     [
...         "User: What is in this image?",
...         "https://upload.wikimedia.org/wikipedia/commons/8/86/Id%C3%A9fix.JPG",
...         "<end_of_utterance>",

...         "\nAssistant: This picture depicts Idefix, the dog of Obelix in Asterix and Obelix. Idefix is running on the ground.<end_of_utterance>",

...         "\nUser:",
...         "https://static.wikia.nocookie.net/asterix/images/2/25/R22b.gif/revision/latest?cb=20110815073052",
...         "And who is that?<end_of_utterance>",

...         "\nAssistant:",
...     ],
... ]

>>> # --batched mode
>>> inputs = processor(prompts, add_end_of_utterance_token=False, return_tensors="pt").to(device)
>>> # --single sample mode
>>> # inputs = processor(prompts[0], return_tensors="pt").to(device)

>>> # Generation args
>>> exit_condition = processor.tokenizer("<end_of_utterance>", add_special_tokens=False).input_ids
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, eos_token_id=exit_condition, bad_words_ids=bad_words_ids, max_length=100)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> for i, t in enumerate(generated_text):
...     print(f"{i}:\n{t}\n")