Smol Multimodal RAG: Building with ColSmolVLM and SmolVLM on Colab's Free-Tier GPU

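Authored by: Sergio Paniego

In this notebook, we go smol 🤏 and demonstrate how to build a Multimodal Retrieval-Augmented Generation (RAG) system that uses ColSmolVLM for document retrieval and SmolVLM as the vision language model (VLM). These lightweight models let us run a fully functional multimodal RAG system on consumer GPUs and even on the Google Colab free tier.

This notebook is the third installment in the Multimodal RAG Recipes series. If you're new to the topic or want to explore it further, check out the previous recipes in the series.

Let's dive in and build a powerful yet compact RAG system! 🚀

[Figure: multimodal RAG using document retrieval and SmolVLM]

1. Install dependencies

Let's start by installing the libraries this project needs! 🚀

For this notebook, we need to install byaldi from a work-in-progress PR. Once the PR is merged, this installation step can be updated accordingly.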
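Because the install comes from a PR branch rather than a tagged release, it can help to confirm what actually ended up in the environment. The snippet below is a minimal sketch that uses only the standard library; `installed_version` is a hypothetical helper, and `"byaldi"` as the distribution name is an assumption.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# "byaldi" is the distribution name assumed here; adjust it if the fork uses another name.
version = installed_version("byaldi")
print(f"byaldi version: {version}" if version else "byaldi is not installed")
```

Run it after the `pip install` cell; if it reports that the package is missing, the install did not succeed in the current runtime.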

!pip install -q git+https://github.com/sergiopaniego/byaldi.git@colsmolvlm-support

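2. Load Dataset 📁

In this notebook, we'll use charts and maps from Our World in Data, an open-access platform that offers a wealth of data and visualizations. We'll focus on the life expectancy data, which shows how life expectancy has changed around the world over time.

To simplify access and keep things smol 🤏, we've curated a subset of this data into a dataset hosted on Hugging Face. This small collection is ideal for demonstration; in a real application, you could scale up to a larger dataset so the system has more documents to draw on.

Citation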

Saloni Dattani, Lucas Rodés-Guirao, Hannah Ritchie, Esteban Ortiz-Ospina and Max Roser (2023) - “Life Expectancy” Published online at OurWorldinData.org. Retrieved from: 'https://ourworldindata.org/life-expectancy' [Online Resource]

from datasets import load_dataset

dataset = load_dataset("sergiopaniego/ourworldindata_example", split="train")

After downloading the visual data, we'll save it locally to prepare it for the RAG system. This step is needed because the document retrieval model (ColSmolVLM) indexes the images from a folder on disk.

import os
from PIL import Image


def save_images_to_local(dataset, output_folder="data/"):
    os.makedirs(output_folder, exist_ok=True)

    for image_id, image_data in enumerate(dataset):
        image = image_data["image"]

        if isinstance(image, str):
            image = Image.open(image)

        output_path = os.path.join(output_folder, f"image_{image_id}.png")

        image.save(output_path, format="PNG")

        print(f"Image saved in: {output_path}")


save_images_to_local(dataset)

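Now, let's load the images to explore the dataset and get a quick overview of the visualizations we'll be working with. This helps us get familiar with the data and confirm that everything is ready for the next stage.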

import os
from PIL import Image


def load_png_images(image_folder):
    png_files = [f for f in os.listdir(image_folder) if f.endswith(".png")]
    all_images = {}

    for image_id, png_file in enumerate(png_files):
        image_path = os.path.join(image_folder, png_file)
        image = Image.open(image_path)
        all_images[image_id] = image

    return all_images


all_images = load_png_images("data/")
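Note that `os.listdir` returns file names in arbitrary order, so the integer keys of `all_images` need not equal the dataset row numbers that were used when saving. If you ever need to trace a loaded file back to its dataset row, the index can be recovered from the file name. The sketch below assumes the `image_{id}.png` naming used by `save_images_to_local`; `dataset_index_from_filename` is a hypothetical helper and does not change how `all_images` is keyed.

```python
import re

# Matches the names written by save_images_to_local, e.g. "image_12.png".
_IMAGE_NAME = re.compile(r"^image_(\d+)\.png$")


def dataset_index_from_filename(filename):
    """Return the dataset row index encoded in a file name, or None if it does not match."""
    match = _IMAGE_NAME.match(filename)
    return int(match.group(1)) if match else None


print(dataset_index_from_filename("image_12.png"))  # prints 12
print(dataset_index_from_filename("notes.txt"))  # prints None
```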

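Let's visualize a few samples to see how the data is structured! This gives us a feel for the format and layout of the content we'll be working with. 👀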

>>> import matplotlib.pyplot as plt

>>> fig, axes = plt.subplots(1, 5, figsize=(20, 15))

>>> for i, ax in enumerate(axes.flat):
...     img = all_images[i]
...     ax.imshow(img)
...     ax.axis("off")

>>> plt.tight_layout()
>>> plt.show()

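3. Initialize the ColSmolVLM Multimodal Document Retrieval Model 🤖

Now that our dataset is ready, it's time to initialize the document retrieval model, which extracts relevant information from the raw images and returns the appropriate documents for our queries. By retrieving the right documents, this model plays a crucial role in the quality of the answers the system can give.

For this task, we'll use Byaldi, a library designed to streamline multimodal RAG pipelines. Byaldi provides APIs for integrating multimodal retrievers and vision language models into efficient retrieval-augmented generation workflows.

In this notebook, we'll focus specifically on ColSmolVLM.

[Figure: ColPali architecture]

Additionally, you can explore ViDoRe (the Visual Document Retrieval Benchmark) to see the top-performing retrievers in action.

First, we'll load the model from its checkpoint.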

from byaldi import RAGMultiModalModel

docs_retrieval_model = RAGMultiModalModel.from_pretrained("vidore/colsmolvlm-alpha")

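Next, we'll index our documents with the document retrieval model by pointing it at the folder where the images are stored. Indexing lets the model organize and process the documents so they can be retrieved quickly for a given query.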

docs_retrieval_model.index(
    input_path="data/", index_name="image_index", store_collection_with_index=False, overwrite=True
)

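4. Retrieving Documents with the Document Retrieval Model 🤔

Now that the document retrieval model is initialized, we can test it by submitting a question and getting back the documents most likely to contain the answer.

The model ranks the results by relevance and returns the most relevant documents first.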
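To give some intuition for how this ranking can work: ColPali-style retrievers, the family ColSmolVLM belongs to, represent both the query and each page as a set of embedding vectors and score a page with a late-interaction (MaxSim) rule. The sketch below illustrates that scoring rule on made-up 2-dimensional vectors in plain Python. It is not Byaldi's implementation, and `maxsim_score` and `rank_pages` are hypothetical names.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vectors, page_vectors):
    """Sum, over query vectors, of the best dot product with any page vector."""
    return sum(max(dot(q, p) for p in page_vectors) for q in query_vectors)


def rank_pages(query_vectors, pages):
    """Return (page_id, score) pairs sorted from most to least relevant."""
    scored = [(page_id, maxsim_score(query_vectors, vecs)) for page_id, vecs in pages.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy embeddings, made up purely for illustration.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    0: [[0.9, 0.1], [0.2, 0.8]],
    1: [[0.1, 0.1], [0.3, 0.2]],
}
print(rank_pages(query, pages))
```

Page 0 ranks first because each query vector finds a closely matching page vector, which is the behavior the `score` field of a search result is meant to capture.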

Let's give it a try and see how it performs!

text_query = "What is the overall trend in life expectancy across different countries and regions?"

results = docs_retrieval_model.search(text_query, k=1)
results

Let's look at the retrieved document and check whether the model matched our query with the best result.

>>> result_image = all_images[results[0]["doc_id"]]
>>> result_image
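The lookup above uses only the top result. If you search with `k` greater than 1, you may want the images for every hit. The sketch below shows one way to collect them; it assumes each result supports `result["doc_id"]` as in the cell above and uses plain dicts and strings as stand-ins for Byaldi results and PIL images. `collect_retrieved_images` is a hypothetical helper.

```python
def collect_retrieved_images(results, all_images):
    """Return the images for every retrieved result, in ranking order.

    `results` only needs to support result["doc_id"], as used in the notebook.
    Results whose doc_id is not a key of `all_images` are skipped.
    """
    images = []
    for result in results:
        doc_id = result["doc_id"]
        if doc_id in all_images:
            images.append(all_images[doc_id])
    return images


# Stand-in data: plain dicts and strings instead of Byaldi results and PIL images.
fake_results = [{"doc_id": 2, "score": 18.5}, {"doc_id": 0, "score": 17.1}]
fake_images = {0: "image_a", 1: "image_b", 2: "image_c"}
print(collect_retrieved_images(fake_results, fake_images))  # prints ['image_c', 'image_a']
```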

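5. Initialize the Vision Language Model for Question Answering 🙋

Next, we'll initialize the vision language model (VLM) used for question answering. For this task, we'll use SmolVLM.

[Figure: SmolVLM architecture]

To keep up with the latest advances in open vision language models, check the OpenVLMLeaderboard.

First, we'll load the model from the pretrained checkpoint and move it to the GPU for best performance. You can also explore the full collection of SmolVLM models.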

from transformers import Idefics3ForConditionalGeneration, AutoProcessor
import torch


model_id = "HuggingFaceTB/SmolVLM-Instruct"
vl_model = Idefics3ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    _attn_implementation="eager",
)
vl_model.eval()

Next, we'll initialize the VLM processor.

vl_model_processor = AutoProcessor.from_pretrained(model_id)

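6. Assembling the VLM Model and Testing the System 🔧

With all the components loaded, we can assemble the system for testing. First, we'll set up the chat structure by giving the system the retrieved image and the user's query. This step is highly customizable, so you can adapt the interaction to your needs and experiment with different inputs and outputs.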

chat_template = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
            },
            {"type": "text", "text": text_query},
        ],
    }
]
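If you want to experiment with this structure, for example by passing more than one retrieved image, the message can be built programmatically. This is a minimal sketch with a hypothetical helper, `build_chat_messages`; it assumes that each `{"type": "image"}` entry is a placeholder matched, in order, with one entry of the `images` list given to the processor.

```python
def build_chat_messages(text_query, num_images=1):
    """Build a single-turn user message with `num_images` image slots followed by the question."""
    if num_images < 1:
        raise ValueError("num_images must be at least 1")
    content = [{"type": "image"} for _ in range(num_images)]
    content.append({"type": "text", "text": text_query})
    return [{"role": "user", "content": content}]


messages = build_chat_messages("What does the chart show?", num_images=2)
print(messages)
```

With `num_images=1` the helper returns the same structure as the `chat_template` defined above.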

Now, let's apply this chat template to build the prompt text for the model.

text = vl_model_processor.apply_chat_template(chat_template, add_generation_prompt=True)

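Next, we'll process the inputs so they are correctly formatted for the VLM: the processor tokenizes the prompt and converts the retrieved image into tensors. The model needs this step to generate a response grounded in the provided data.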

inputs = vl_model_processor(
    text=text,
    images=[result_image],
    return_tensors="pt",
)
inputs = inputs.to("cuda")

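We're now ready to generate the answer! Let's see how the system uses the processed inputs to respond to the user query based on the retrieved image.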

generated_ids = vl_model.generate(**inputs, max_new_tokens=500)

Once the model has generated its output, we post-process it to obtain the final answer.

generated_ids_trimmed = [out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]

output_text = vl_model_processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
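The trimming step is needed because, for decoder-style models, the sequences returned by `generate` start with the prompt tokens and only then continue with the newly generated ones. The sketch below shows the same slicing on plain Python lists with made-up token ids; `trim_prompt_tokens` is a hypothetical name for the list comprehension used above.

```python
def trim_prompt_tokens(input_ids, generated_ids):
    """Drop the echoed prompt from each generated sequence, keeping only new tokens."""
    return [out_ids[len(in_ids):] for in_ids, out_ids in zip(input_ids, generated_ids)]


# Made-up token ids: the prompt is three tokens long and the model adds two more.
prompt_batch = [[101, 7, 8]]
generated_batch = [[101, 7, 8, 42, 43]]
print(trim_prompt_tokens(prompt_batch, generated_batch))  # prints [[42, 43]]
```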

>>> print(output_text[0])

The overall trend in life expectancy across different countries and regions is an increase over time.

As we can see, SmolVLM answers the query correctly! 🎉

Now, let's look at SmolVLM's memory consumption to understand its resource usage.

>>> print(f"GPU allocated memory: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
>>> print(f"GPU reserved memory: {torch.cuda.memory_reserved() / 1024**3:.2f} GB")

GPU allocated memory: 8.32 GB
GPU reserved memory: 10.38 GB

7. Putting It All Together! 🧑‍🏭️

Now, let's write a function that wraps the whole pipeline so we can easily reuse it in future applications.

def answer_with_multimodal_rag(
    vl_model, docs_retrieval_model, vl_model_processor, all_images, text_query, retrival_top_k, max_new_tokens
):
    results = docs_retrieval_model.search(text_query, k=retrival_top_k)
    result_image = all_images[results[0]["doc_id"]]

    chat_template = [
        {
            "role": "user",
            "content": [{"type": "image"}, {"type": "text", "text": text_query}],
        }
    ]

    # Prepare the inputs
    text = vl_model_processor.apply_chat_template(chat_template, add_generation_prompt=True)
    inputs = vl_model_processor(
        text=text,
        images=[result_image],
        return_tensors="pt",
    )
    inputs = inputs.to("cuda")

    # Generate text from the vl_model
    generated_ids = vl_model.generate(**inputs, max_new_tokens=max_new_tokens)
    generated_ids_trimmed = [out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]

    # Decode the generated text
    output_text = vl_model_processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )

    return output_text

Let's see the complete RAG system in action!

>>> output_text = answer_with_multimodal_rag(
...     vl_model=vl_model,
...     docs_retrieval_model=docs_retrieval_model,
...     vl_model_processor=vl_model_processor,
...     all_images=all_images,
...     text_query="What is the overall trend in life expectancy across different countries and regions?",
...     retrival_top_k=1,
...     max_new_tokens=500,
... )
>>> print(output_text[0])

The overall trend in life expectancy across different countries and regions is an increase over time.

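🏆 We now have a fully operational smol RAG pipeline that combines a smol document retrieval model with a smol vision language model and runs on a single consumer GPU! This combination lets us generate answers grounded in the user query and the relevant documents, giving a seamless multimodal experience.

8. Can We Go Smoler? 🤏

We now have a fully operational system, but can we go smoler? The answer is yes! We'll use a quantized version of the SmolVLM model to reduce the system's resource requirements even further.

To see the difference in consumption clearly, I recommend restarting the runtime and re-running every cell except the one that instantiates the VLM model. That way, the memory figures reflect only the quantized model.

Let's start by installing bitsandbytes.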

!pip install -q -U bitsandbytes

Let's create a BitsAndBytesConfig that loads our model in a 4-bit quantized configuration, which reduces the model's memory footprint.

from transformers import BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16
)
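A rough back-of-the-envelope calculation shows why 4-bit loading helps: the memory needed for the weights alone scales with the number of bits per parameter. The sketch below assumes a model of about 2 billion parameters, an illustrative figure rather than one taken from this notebook. It also ignores quantization constants, layers that stay unquantized, activations, and the retrieval model, so the measured totals will be higher.

```python
def estimate_weight_memory_gib(num_params, bits_per_param):
    """Approximate memory for the weights alone, in GiB (1024**3 bytes)."""
    return num_params * bits_per_param / 8 / 1024**3


# Assumption: about 2 billion parameters; check the model card for the real figure.
assumed_params = 2_000_000_000
print(f"bf16 weights:  {estimate_weight_memory_gib(assumed_params, 16):.2f} GiB")
print(f"4-bit weights: {estimate_weight_memory_gib(assumed_params, 4):.2f} GiB")
```

Under these assumptions, the weights shrink to roughly a quarter of their bf16 size.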

Next, we can load the model using the quantization config we just created.

from transformers import Idefics3ForConditionalGeneration, AutoProcessor

model_id = "HuggingFaceTB/SmolVLM-Instruct"
vl_model = Idefics3ForConditionalGeneration.from_pretrained(
    model_id, quantization_config=bnb_config, _attn_implementation="eager", device_map="auto"
)

vl_model_processor = AutoProcessor.from_pretrained(model_id)

Finally, let's test the quantized model.

>>> output_text = answer_with_multimodal_rag(
...     vl_model=vl_model,
...     docs_retrieval_model=docs_retrieval_model,
...     vl_model_processor=vl_model_processor,
...     all_images=all_images,
...     text_query="What is the overall trend in life expectancy across different countries and regions?",
...     retrival_top_k=1,
...     max_new_tokens=500,
... )
>>> print(output_text[0])

The overall trend in life expectancy across different countries and regions is an increase over time.

The model works correctly! 🎉 Now, let's look at the memory consumption. As the results below show, we've reduced memory usage even further! 🚀

>>> print(f"GPU allocated memory: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
>>> print(f"GPU reserved memory: {torch.cuda.memory_reserved() / 1024**3:.2f} GB")

GPU allocated memory: 5.44 GB
GPU reserved memory: 7.86 GB

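The table below compares the memory consumption of the two other multimodal RAG notebooks in the Cookbook with the two versions described here. As you can see, these systems need substantially less memory than the others: the quantized version allocates about a quarter of the memory used by the ColPali notebook (5.44 GB versus 22.63 GB).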

Notebook	GPU Allocated Memory (GB)	GPU Reserved Memory (GB)
Smol Multimodal RAG with Quantization	5.44	7.86
Smol Multimodal RAG	8.32	10.38
Multimodal RAG with ColQwen2, Reranker, and Quantized VLMs on Consumer GPUs	13.93	14.59
Multimodal RAG with Document Retrieval (ColPali) and Vision Language Models (VLMs)	22.63	37.16
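To put the differences in relative terms, the allocated-memory column can be turned into ratios against the quantized Smol setup. The sketch below copies the numbers from the table; the shortened row labels and the helper `memory_ratios` are only for illustration.

```python
# Allocated GPU memory (GB) copied from the table above.
allocated_gb = {
    "Smol RAG (quantized)": 5.44,
    "Smol RAG": 8.32,
    "ColQwen2 + reranker + quantized VLM": 13.93,
    "ColPali + VLM": 22.63,
}


def memory_ratios(table, baseline):
    """Return how many times more memory each entry uses than the baseline entry."""
    base = table[baseline]
    return {name: value / base for name, value in table.items()}


for name, ratio in memory_ratios(allocated_gb, "Smol RAG (quantized)").items():
    print(f"{name}: {ratio:.2f}x")
```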

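9. Continuing the Journey 🧑‍🎓️

If you're eager to keep exploring, be sure to check out the conclusions and insights in our previous guides: Multimodal Retrieval-Augmented Generation (RAG) with Document Retrieval (ColPali) and Vision Language Models (VLMs), and the reranker guide, Multimodal RAG with ColQwen2, Reranker, and Quantized VLMs on Consumer GPUs.

Happy experimenting! 🧑‍🔬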
