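Visual document retrieval

Documents can contain multimodal data when they include charts, tables, and images in addition to text. Retrieving information from such documents is challenging: text retrieval models cannot handle visual data on their own, while image retrieval models lack the granularity and document-processing capabilities that are needed.

Visual document retrieval helps retrieve information from all types of documents, including as part of multimodal retrieval-augmented generation (RAG). These models accept documents (as images) and text, and compute similarity scores between them.

This guide demonstrates how to index and retrieve documents with ColPali.

For large-scale use cases, you may want to index and retrieve documents with a vector database.

Make sure Transformers and Datasets are installed.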

pip install -q datasets transformers

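We will index a dataset of documents related to UFO sightings. The dataset contains several columns; we are interested in the specific_detail_query column, which holds a short summary of each document, and the image column, which holds the document itself. We filter out examples where the column of interest is missing.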

from datasets import load_dataset

dataset = load_dataset("davanstrien/ufo-ColPali")
dataset = dataset["train"]
dataset = dataset.filter(lambda example: example["specific_detail_query"] is not None)
dataset

Dataset({
    features: ['image', 'raw_queries', 'broad_topical_query', 'broad_topical_explanation', 'specific_detail_query', 'specific_detail_explanation', 'visual_element_query', 'visual_element_explanation', 'parsed_into_json'],
    num_rows: 2172
})

Let's load the model and the processor.

import torch
from transformers import ColPaliForRetrieval, ColPaliProcessor

model_name = "vidore/colpali-v1.2-hf"

processor = ColPaliProcessor.from_pretrained(model_name)

model = ColPaliForRetrieval.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
).eval()

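Pass the text query to the processor and get the text embeddings back from the model. For image-to-text search, pass images to ColPaliProcessor with the images parameter instead of the text parameter.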

# Tokenize the query; return_tensors belongs to the processor call, not to the model call.
inputs = processor(text="a document about Mars expedition", return_tensors="pt").to("cuda")
with torch.no_grad():
  text_embeds = model(**inputs).embeddings
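ColPali produces one vector per query token and one vector per image patch, and score_retrieval compares the two sets with a late-interaction (MaxSim) score. To give some intuition for that score, here is a minimal pure-Python sketch. The function name maxsim_score and the toy 2-dimensional vectors are made up for illustration; they are not part of Transformers and not real model output.

```python
def maxsim_score(query_vectors, page_vectors):
    """Late-interaction (MaxSim) score between one query and one page.

    Both arguments are lists of equal-length float vectors: one vector per
    query token and one vector per page patch.
    """
    total = 0.0
    for q in query_vectors:
        # Dot product of this query token with every page vector; keep the best.
        best = max(sum(a * b for a, b in zip(q, p)) for p in page_vectors)
        total += best
    return total


# Toy vectors, purely illustrative (hypothetical values).
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
page_b = [[0.5, 0.5], [0.2, 0.1]]

score_a = maxsim_score(query, page_a)
score_b = maxsim_score(query, page_b)
print(score_a > score_b)  # page_a matches both query tokens more closely
```

Because every query token picks its own best-matching patch, a page can score highly even when the relevant content sits in a small region of the image.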

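Index the images offline. At inference time, compute the embedding of the query text and look up the image embeddings closest to it.

Store the images together with their embeddings by writing them to the dataset with map, as shown below. This adds an embeddings column that holds the indexed embeddings. ColPali embeddings take up a lot of storage, so move them off the GPU and store them on the CPU as NumPy arrays.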

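def embed_image(example):
    # Encode one page image, then move the result to the CPU as a float32 NumPy array.
    inputs = processor(images=example["image"], return_tensors="pt").to("cuda")
    with torch.no_grad():
        embeddings = model(**inputs).embeddings
    return {"embeddings": embeddings.to(torch.float32).cpu().numpy()}

ds_with_embeddings = dataset.map(embed_image)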
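To get a feel for why these embeddings are storage-heavy, the following back-of-the-envelope sketch estimates their size. The shape of 1030 vectors of dimension 128 per page is an assumption based on the figures reported for ColPali, not something this guide states; check the shape of one entry in the embeddings column for the checkpoint you actually use. The helper name embedding_storage_bytes is hypothetical.

```python
def embedding_storage_bytes(num_pages, vectors_per_page, dim, bytes_per_value=4):
    """Bytes needed to store multi-vector embeddings for num_pages pages."""
    return num_pages * vectors_per_page * dim * bytes_per_value


# 2172 is the number of rows in the filtered dataset shown earlier.
# 1030 x 128 is an ASSUMED per-page embedding shape; 4 bytes per float32 value.
per_page = embedding_storage_bytes(1, 1030, 128)
total = embedding_storage_bytes(2172, 1030, 128)
print(f"{per_page / 1024:.0f} KiB per page, {total / 1024**3:.2f} GiB in total")
```

Under these assumptions each page costs roughly half a mebibyte and the whole dataset a little over one gibibyte, which is why the guide keeps the embeddings on the CPU rather than on the GPU.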

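For online inference, create a function that searches the image embeddings in batches and retrieves the k most relevant images. Given the indexed dataset, a text embedding, the number of top results k, and a batch size, the function below returns the matching dataset indices and their scores.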

def find_top_k_indices_batched(dataset, text_embedding, processor, k=10, batch_size=4):
    """Return the dataset indices and scores of the k best-matching images."""
    scores_and_indices = []

    for start_idx in range(0, len(dataset), batch_size):
        end_idx = min(start_idx + batch_size, len(dataset))
        batch = dataset[start_idx:end_idx]
        # Each stored embedding keeps a leading batch dimension of 1; emb[0] drops it.
        batch_embeddings = [torch.tensor(emb[0], dtype=torch.float32) for emb in batch["embeddings"]]
        scores = processor.score_retrieval(text_embedding.to("cpu").to(torch.float32), batch_embeddings)

        # scores has one row per query; keep the row for our single query.
        if hasattr(scores, "tolist"):
            scores = scores.tolist()[0]

        for i, score in enumerate(scores):
            scores_and_indices.append((score, start_idx + i))

    sorted_results = sorted(scores_and_indices, key=lambda x: -x[0])

    topk = sorted_results[:k]
    indices = [idx for _, idx in topk]
    scores = [score for score, _ in topk]

    return indices, scores
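The function above collects every score and sorts the full list at the end. For larger collections, one possible alternative is to keep only the current k best entries in a bounded min-heap while streaming over the batches. The sketch below shows that idea on plain Python floats; top_k_streaming and the toy scores are hypothetical and stand in for the per-batch output of score_retrieval.

```python
import heapq


def top_k_streaming(score_batches, k=3):
    """Return (indices, scores) of the k highest scores across all batches.

    score_batches is an iterable of lists of floats, in dataset order.
    """
    heap = []  # min-heap of (score, index) pairs; never holds more than k items
    next_index = 0
    for batch in score_batches:
        for score in batch:
            item = (score, next_index)
            if len(heap) < k:
                heapq.heappush(heap, item)
            elif item > heap[0]:
                # The new item beats the weakest kept item, so swap it in.
                heapq.heapreplace(heap, item)
            next_index += 1
    best = sorted(heap, key=lambda pair: -pair[0])
    return [idx for _, idx in best], [score for score, _ in best]


# Toy scores, purely illustrative (hypothetical values).
toy_batches = [[0.1, 0.9], [0.4, 0.7], [0.95]]
toy_indices, toy_scores = top_k_streaming(toy_batches, k=3)
print(toy_indices, toy_scores)
```

This keeps memory proportional to k rather than to the dataset size. For a dataset of a few thousand pages, as in this guide, the full sort is perfectly adequate.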

Generate the text embeddings and pass them to the function above to get the dataset indices and scores.

inputs = processor(text="a document about Mars expedition", return_tensors="pt").to("cuda")
with torch.no_grad():
  text_embeds = model(**inputs).embeddings
indices, scores = find_top_k_indices_batched(ds_with_embeddings, text_embeds, processor, k=3, batch_size=4)
indices, scores

([440, 442, 443],
 [14.370786666870117,
  13.675487518310547,
  12.9899320602417])

Display the images to view the Mars-related documents.

from IPython.display import display

for i in indices:
  display(dataset[i]["image"])
