DePlot

Overview

DePlot was proposed in the paper DePlot: One-shot visual language reasoning by plot-to-table translation by Fangyu Liu, Julian Martin Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen, Nigel Collier, and Yasemin Altun.

The abstract of the paper states the following:

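Visual language such as charts and plots is ubiquitous in the human world. Comprehending plots and charts requires strong reasoning skills. Prior state-of-the-art (SOTA) models require at least tens of thousands of training examples, and their reasoning capabilities are still very limited, especially on complex human-written queries. This paper presents the first one-shot solution to visual language reasoning. We decompose the challenge of visual language reasoning into two steps: (1) plot-to-text translation, and (2) reasoning over the translated text. The key in this method is a modality conversion module, named DePlot, which translates the image of a plot or chart to a linearized table. The output of DePlot can then be directly used to prompt a pretrained large language model (LLM), exploiting the few-shot reasoning capabilities of LLMs. To obtain DePlot, we standardize the plot-to-table task by establishing unified task formats and metrics, and train DePlot end-to-end on this task. DePlot can then be used off-the-shelf together with LLMs in a plug-and-play fashion. Compared with a SOTA model fine-tuned on more than 28k data points, DePlot+LLM with just one-shot prompting achieves a 24.0% improvement over fine-tuned SOTA on human-written queries from the task of chart QA.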
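The second step of the decomposition in the abstract (reasoning over the translated text) amounts to placing the linearized table in an LLM prompt next to a single solved example. The following is a minimal sketch of that idea only. The function name `build_one_shot_prompt`, the prompt wording, and the sample tables are all hypothetical and are not the template used in the paper.

```python
def build_one_shot_prompt(example_table, example_question, example_answer,
                          table, question):
    """Assemble a one-shot prompt: one solved example, then the new query.

    The layout below is an illustrative assumption, not the paper's template.
    """
    return (
        "Read the table and answer the question.\n\n"
        f"Table:\n{example_table}\n"
        f"Question: {example_question}\n"
        f"Answer: {example_answer}\n\n"
        f"Table:\n{table}\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Made-up tables for illustration; in practice `table` would be DePlot output.
prompt = build_one_shot_prompt(
    example_table="Year | Units\n2020 | 10\n2021 | 15",
    example_question="In which year were more units sold?",
    example_answer="2021",
    table="City | Rainfall\nOslo | 80\nRome | 60",
    question="Which city has more rainfall?",
)
print(prompt)
```

The resulting string can be passed to any text-completion LLM; the model is expected to continue after the final "Answer:".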

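DePlot is a model trained using the Pix2Struct architecture. You can find more information about Pix2Struct in the Pix2Struct documentation. DePlot is a Visual Question Answering subset of the Pix2Struct architecture: it renders the input question on the image and predicts the answer.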

Usage example

Currently, one checkpoint is available for DePlot:

google/deplot: DePlot fine-tuned on the ChartQA dataset

from transformers import AutoProcessor, Pix2StructForConditionalGeneration
import requests
from PIL import Image

model = Pix2StructForConditionalGeneration.from_pretrained("google/deplot")
processor = AutoProcessor.from_pretrained("google/deplot")
url = "https://raw.githubusercontent.com/vis-nlp/ChartQA/main/ChartQA%20Dataset/val/png/5090.png"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, text="Generate underlying data table of the figure below:", return_tensors="pt")
predictions = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(predictions[0], skip_special_tokens=True))
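The decoded prediction is a single linearized-table string, so a small parsing step is usually needed before the table can be inspected or placed in a prompt. The sketch below assumes that rows are separated by the literal token `<0x0A>` and cells by `|`; this format is an assumption and should be checked against the actual output. The helper name `parse_deplot_table` and the sample string are hypothetical.

```python
def parse_deplot_table(text, row_sep="<0x0A>", cell_sep="|"):
    """Split a linearized table string into a list of rows of stripped cells.

    The default separators are assumptions about the output format.
    """
    rows = []
    for raw_row in text.split(row_sep):
        cells = [cell.strip() for cell in raw_row.split(cell_sep)]
        # Skip rows that are empty after stripping (e.g. a trailing separator).
        if any(cells):
            rows.append(cells)
    return rows


# Made-up string for illustration; it is not real model output.
sample = "TITLE | Sales by year <0x0A> Year | Units <0x0A> 2020 | 10 <0x0A> 2021 | 15"
rows = parse_deplot_table(sample)
for row in rows:
    print(row)
```

Cells are kept as strings; converting numeric columns is left to the caller, since the table contents depend on the chart.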

Fine-tuning

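To fine-tune DePlot, refer to the pix2struct fine-tuning notebook. For Pix2Struct models, we have found that fine-tuning the model with Adafactor and a cosine learning rate scheduler leads to faster convergence: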

from transformers.optimization import Adafactor, get_cosine_schedule_with_warmup

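# `model` is the Pix2StructForConditionalGeneration instance being fine-tuned
# (inside a training-module class, this would be `self.parameters()` instead).
optimizer = Adafactor(model.parameters(), scale_parameter=False, relative_step=False, lr=0.01, weight_decay=1e-05)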
scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=1000, num_training_steps=40000)
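To make the shape of this schedule concrete, the sketch below computes the learning-rate multiplier in plain Python: a linear ramp during warmup, followed by a half-cosine decay to zero. It is an illustrative reimplementation that assumes the scheduler's default half-cosine cycle, not the library code itself, and the function name `cosine_with_warmup_factor` is hypothetical.

```python
import math


def cosine_with_warmup_factor(step, num_warmup_steps=1000, num_training_steps=40000):
    """Return the multiplier applied to the base learning rate at `step`."""
    if step < num_warmup_steps:
        # Linear warmup from 0 to 1.
        return step / max(1, num_warmup_steps)
    # Fraction of the post-warmup phase completed, in [0, 1].
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    # Half-cosine decay from 1 down to 0.
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


base_lr = 0.01  # same base learning rate as the optimizer above
for step in (0, 500, 1000, 20500, 40000):
    print(step, base_lr * cosine_with_warmup_factor(step))
```

With the values used above, the learning rate peaks at the end of warmup (step 1000) and decays to zero at step 40000.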

DePlot is a model trained using the Pix2Struct architecture. For the API reference, see the Pix2Struct documentation.
