TAPAS

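Overview

The TAPAS model was proposed in TAPAS: Weakly Supervised Table Parsing via Pre-training by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno and Julian Martin Eisenschlos. It is a BERT-based model specifically designed (and pre-trained) for answering questions about tabular data. Compared to BERT, TAPAS uses relative position embeddings and has 7 token types that encode tabular structure. TAPAS is pre-trained with the masked language modeling (MLM) objective on a large dataset comprising millions of tables from English Wikipedia and their corresponding text.

For question answering, TAPAS has 2 heads on top: a cell selection head and an aggregation head, which (optionally) performs an aggregation such as counting or summing over the selected cells. TAPAS has been fine-tuned on several datasets:

SQA (Sequential Question Answering by Microsoft)
WTQ (Wiki Table Questions by Stanford University)
WikiSQL (by Salesforce).

It achieves state-of-the-art results on both SQA and WTQ, and performs on par with the state of the art on WikiSQL while using a simpler architecture.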

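The abstract from the paper is the following:

Answering natural language questions over tables is usually seen as a semantic parsing task. To alleviate the collection cost of full logical forms, one popular approach focuses on weak supervision consisting of denotations instead of logical forms. However, training semantic parsers from weak supervision poses difficulties, and in addition, the generated logical forms are only used as an intermediate step prior to retrieving the denotation. In this paper, we present TAPAS, an approach to question answering over tables without generating logical forms. TAPAS trains from weak supervision, and predicts the denotation by selecting table cells and optionally applying a corresponding aggregation operator to such selection. TAPAS extends BERT's architecture to encode tables as input, initializes from an effective joint pre-training of text segments and tables crawled from Wikipedia, and is trained end-to-end. We experiment with three different semantic parsing datasets, and find that TAPAS outperforms or rivals semantic parsing models by improving state-of-the-art accuracy on SQA from 55.1 to 67.2 and performing on par with the state-of-the-art on WIKISQL and WIKITQ, but with a simpler model architecture. We additionally find that transfer learning, which is trivial in our setting, from WIKISQL to WIKITQ, yields 48.7 accuracy, 4.2 points above the state-of-the-art.

In addition, the authors have further pre-trained TAPAS to recognize **table entailment**, by creating a balanced dataset of millions of automatically generated training examples which are learned in an intermediate step prior to fine-tuning. The authors of TAPAS call this further pre-training intermediate pre-training (since TAPAS is first pre-trained on MLM, and then on another dataset). They found that intermediate pre-training further improves performance on SQA, achieving a new state of the art there as well as on TabFact, a large-scale dataset with 16k Wikipedia tables for table entailment (a binary classification task). For more details, see their follow-up paper: Understanding tables with intermediate pre-training by Julian Martin Eisenschlos, Syrine Krichene and Thomas Müller.

TAPAS architecture. Taken from the original blog post.

This model was contributed by nielsr. The TensorFlow version of this model was contributed by kamalkraj. The original code can be found here.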

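Usage tips

TAPAS uses relative position embeddings by default (the position embeddings restart at every cell of the table). Note that this was added after the publication of the original TAPAS paper. According to the authors, this usually results in slightly better performance and allows you to encode longer sequences without running out of embeddings. This is reflected in the reset_position_index_per_cell parameter of TapasConfig, which is set to True by default. The default versions of the models available on the hub all use relative position embeddings. You can still use the ones with absolute position embeddings by passing the additional argument revision="no_reset" when calling the from_pretrained() method. Note that it is usually advised to pad the inputs on the right rather than the left.
TAPAS is based on BERT, so TAPAS-base, for example, corresponds to a BERT-base architecture. Of course, TAPAS-large will result in the best performance (the results reported in the paper are from TAPAS-large). Results of the various model sizes are shown in the original GitHub repository.
TAPAS has checkpoints fine-tuned on SQA, which are capable of answering questions related to a table in a conversational set-up. This means that you can ask follow-up questions such as "what is his age?" related to the previous question. Note that the forward pass of TAPAS is a bit different in a conversational set-up: in that case, you have to feed every table-question pair one by one to the model, such that the prev_labels token type ids can be overwritten by the labels the model predicted for the previous question. See the "Usage" section for more info.
TAPAS is similar to BERT and therefore relies on the masked language modeling (MLM) objective. It is therefore efficient at predicting masked tokens and at NLU in general, but is not optimal for text generation. Models trained with a causal language modeling (CLM) objective are better in that regard. Note that TAPAS can be used as an encoder in the EncoderDecoderModel framework, to combine it with an autoregressive text decoder such as GPT-2.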
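
To make the first tip more concrete, here is a minimal, library-free sketch of what restarting the position index at every cell means. The helper `position_indices` and its input format (a list of token counts, the question first and then one entry per cell) are illustrative assumptions and not part of Transformers.

```python
def position_indices(segment_lengths, reset_per_cell=True):
    """Return one position index per token.

    segment_lengths: token counts, question first, then one entry per cell
    (a hypothetical input format chosen for this illustration).
    """
    positions = []
    offset = 0
    for length in segment_lengths:
        if reset_per_cell:
            # relative scheme: start counting from 0 again inside every segment
            positions.extend(range(length))
        else:
            # absolute scheme: keep counting across the whole sequence
            positions.extend(range(offset, offset + length))
        offset += length
    return positions


# a 4-token question followed by three cells of 2, 3 and 1 tokens
lengths = [4, 2, 3, 1]
relative = position_indices(lengths, reset_per_cell=True)
absolute = position_indices(lengths, reset_per_cell=False)
print(relative)  # [0, 1, 2, 3, 0, 1, 0, 1, 2, 0]
print(absolute)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

With the relative scheme the largest index is bounded by the longest single segment, which is why longer sequences fit without running out of position embeddings.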

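Usage: fine-tuning

Here we explain how you can fine-tune TapasForQuestionAnswering on your own dataset.

Step 1: Choose one of the 3 ways in which you can use TAPAS - or experiment

Basically, there are 3 different ways in which you can fine-tune TapasForQuestionAnswering, corresponding to the different datasets on which TAPAS was fine-tuned:

SQA: if you're interested in asking follow-up questions related to a table, in a conversational set-up. For example, if you first ask "what's the name of the first actor?" then you can ask a follow-up question such as "how old is he?". Here, questions do not involve any aggregation (all questions are cell selection questions).
WTQ: if you're not interested in asking questions in a conversational set-up, but rather just asking questions related to a table, which might involve aggregation, such as counting a number of rows, summing up cell values or averaging cell values. You can then, for example, ask "what's the total number of goals Cristiano Ronaldo made in his career?". This case is also called **weak supervision**, since the model itself must learn the appropriate aggregation operator (SUM/COUNT/AVERAGE/NONE) given only the answer to the question as supervision.
WikiSQL-supervised: this dataset is based on WikiSQL, with the model being given the ground truth aggregation operator during training. This is also called **strong supervision**. Here, learning the appropriate aggregation operator is much easier.

To summarize:

Task	Example dataset	Description
Conversational	SQA	Conversational, only cell selection questions
Weak supervision for aggregation	WTQ	Questions might involve aggregation, and the model must learn this given only the answer as supervision
Strong supervision for aggregation	WikiSQL-supervised	Questions might involve aggregation, and the model must learn this given the gold aggregation operator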

PyTorch

Initializing a model with a pre-trained base from the hub and randomly initialized classification heads can be done as shown below.

>>> from transformers import TapasConfig, TapasForQuestionAnswering

>>> # for example, the base sized model with default SQA configuration
>>> model = TapasForQuestionAnswering.from_pretrained("google/tapas-base")

>>> # or, the base sized model with WTQ configuration
>>> config = TapasConfig.from_pretrained("google/tapas-base-finetuned-wtq")
>>> model = TapasForQuestionAnswering.from_pretrained("google/tapas-base", config=config)

>>> # or, the base sized model with WikiSQL configuration
>>> config = TapasConfig.from_pretrained("google/tapas-base-finetuned-wikisql-supervised")
>>> model = TapasForQuestionAnswering.from_pretrained("google/tapas-base", config=config)

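Of course, you don't necessarily have to follow one of the three ways in which TAPAS was fine-tuned. You can also experiment by defining any hyperparameters you want when initializing TapasConfig, and then create a TapasForQuestionAnswering based on that configuration. For example, if you have a dataset that has both conversational questions and questions that might involve aggregation, you can do it this way. Here's an example: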

>>> from transformers import TapasConfig, TapasForQuestionAnswering

>>> # you can initialize the classification heads any way you want (see docs of TapasConfig)
>>> config = TapasConfig(num_aggregation_labels=3, average_logits_per_cell=True)
>>> # initializing the pre-trained base sized model with our custom classification heads
>>> model = TapasForQuestionAnswering.from_pretrained("google/tapas-base", config=config)

TensorFlow

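Initializing a model with a pre-trained base from the hub and randomly initialized classification heads can be done as shown below. Be sure to have the tensorflow_probability dependency installed.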

>>> from transformers import TapasConfig, TFTapasForQuestionAnswering

>>> # for example, the base sized model with default SQA configuration
>>> model = TFTapasForQuestionAnswering.from_pretrained("google/tapas-base")

>>> # or, the base sized model with WTQ configuration
>>> config = TapasConfig.from_pretrained("google/tapas-base-finetuned-wtq")
>>> model = TFTapasForQuestionAnswering.from_pretrained("google/tapas-base", config=config)

>>> # or, the base sized model with WikiSQL configuration
>>> config = TapasConfig.from_pretrained("google/tapas-base-finetuned-wikisql-supervised")
>>> model = TFTapasForQuestionAnswering.from_pretrained("google/tapas-base", config=config)

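Of course, you don't necessarily have to follow one of the three ways in which TAPAS was fine-tuned. You can also experiment by defining any hyperparameters you want when initializing TapasConfig, and then create a TFTapasForQuestionAnswering based on that configuration. For example, if you have a dataset that has both conversational questions and questions that might involve aggregation, you can do it this way. Here's an example: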

>>> from transformers import TapasConfig, TFTapasForQuestionAnswering

>>> # you can initialize the classification heads any way you want (see docs of TapasConfig)
>>> config = TapasConfig(num_aggregation_labels=3, average_logits_per_cell=True)
>>> # initializing the pre-trained base sized model with our custom classification heads
>>> model = TFTapasForQuestionAnswering.from_pretrained("google/tapas-base", config=config)

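You can also start from an already fine-tuned checkpoint. Note that the checkpoint already fine-tuned on WTQ has some issues, because the L2 loss it relies on is somewhat brittle. See here for more info.

For a list of all pre-trained and fine-tuned TAPAS checkpoints available on HuggingFace's hub, see here.

Step 2: Prepare your data in the SQA format

Second, no matter what you picked above, you should prepare your dataset in the SQA format. This format is a TSV/CSV file with the following columns:

id: optional, id of the table-question pair, for bookkeeping purposes.
annotator: optional, id of the person who annotated the table-question pair, for bookkeeping purposes.
position: integer indicating whether the question is the first, second, third, ... related to the table. Only required in the conversational set-up (SQA). You don't need this column if you're going for WTQ/WikiSQL-supervised.
question: string
table_file: string, name of a csv file containing the tabular data
answer_coordinates: list of one or more tuples (each tuple being a cell coordinate, i.e. a row, column pair that is part of the answer)
answer_text: list of one or more strings (each string being a cell value that is part of the answer)
aggregation_label: index of the aggregation operator. Only required in case of strong supervision for aggregation (the WikiSQL-supervised case)
float_answer: the float answer to the question, if there is one (np.nan if there isn't). Only required in case of weak supervision for aggregation (such as WTQ and WikiSQL)

The tables themselves should be present in a folder, each table being a separate csv file. Note that the authors of the TAPAS algorithm used conversion scripts with some automated logic to convert the other datasets (WTQ, WikiSQL) into the SQA format. The author explains this here. A conversion of this script that works with HuggingFace's implementation can be found here. Interestingly, these conversion scripts are not perfect (the answer_coordinates and float_answer fields are populated based on the answer_text), meaning that WTQ and WikiSQL results could actually be improved.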
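
As a rough illustration of the column layout above, the following sketch reads a small SQA-style TSV with the standard library and turns the list-valued columns into Python objects. The sample rows are made up, and storing answer_coordinates and answer_text as Python literals is an assumption about how the file was written; adapt `parse_row` (a hypothetical helper) if your file uses another encoding.

```python
import ast
import csv
import io

# a two-row TSV in the SQA-style layout described above (made-up data)
tsv_text = (
    "id\tposition\tquestion\ttable_file\tanswer_coordinates\tanswer_text\tfloat_answer\n"
    "q1\t0\tWhat is the name of the first actor?\ttable_0.csv\t[(0, 0)]\t['Brad Pitt']\t\n"
    "q2\t0\tWhat is the total number of movies?\ttable_0.csv\t[(0, 1), (1, 1), (2, 1)]\t['209']\t209.0\n"
)


def parse_row(row):
    """Turn the string fields of one TSV row into Python objects."""
    return {
        "question": row["question"],
        "table_file": row["table_file"],
        "position": int(row["position"]),
        # assumption: list columns were written as Python literals, so parse them safely
        "answer_coordinates": ast.literal_eval(row["answer_coordinates"]),
        "answer_text": ast.literal_eval(row["answer_text"]),
        # an empty cell means "no float answer"; represent it as NaN
        "float_answer": float(row["float_answer"]) if row["float_answer"] else float("nan"),
    }


reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
examples = [parse_row(row) for row in reader]
print(examples[1]["answer_coordinates"])  # [(0, 1), (1, 1), (2, 1)]
```

Parsing the list columns up front means the values can be handed to the tokenizer as real lists of tuples and strings rather than as raw text.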

Step 3: Convert your data into tensors using TapasTokenizer

PyTorch

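Third, given that you've prepared your data in this TSV/CSV format (and the corresponding CSV files containing the tabular data), you can then use TapasTokenizer to convert table-question pairs into input_ids, attention_mask, token_type_ids and so on. Again, based on which of the three cases you picked above, TapasForQuestionAnswering requires different inputs to be fine-tuned:

Task	Required inputs
Conversational	`input_ids`, `attention_mask`, `token_type_ids`, `labels`
Weak supervision for aggregation	`input_ids`, `attention_mask`, `token_type_ids`, `labels`, `numeric_values`, `numeric_values_scale`, `float_answer`
Strong supervision for aggregation	`input_ids`, `attention_mask`, `token_type_ids`, `labels`, `aggregation_labels`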
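
The table above can be turned into a small sanity check that runs before a batch reaches the model. This is only a sketch: the task names used as dictionary keys and the helper `missing_inputs` are hypothetical and not part of Transformers.

```python
BASE_INPUTS = ["input_ids", "attention_mask", "token_type_ids", "labels"]

# required model inputs per fine-tuning set-up, copied from the table above
REQUIRED_INPUTS = {
    "conversational": BASE_INPUTS,
    "weak_supervision": BASE_INPUTS + ["numeric_values", "numeric_values_scale", "float_answer"],
    "strong_supervision": BASE_INPUTS + ["aggregation_labels"],
}


def missing_inputs(task, batch):
    """Return the names of required inputs that are absent from `batch`."""
    return [name for name in REQUIRED_INPUTS[task] if name not in batch]


# a batch as produced by the tokenizer, before float_answer has been added
batch = {
    "input_ids": None,
    "attention_mask": None,
    "token_type_ids": None,
    "labels": None,
    "numeric_values": None,
    "numeric_values_scale": None,
}
print(missing_inputs("weak_supervision", batch))  # ['float_answer']
```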

TapasTokenizer creates the labels, numeric_values and numeric_values_scale based on the answer_coordinates and answer_text columns of the TSV file. The float_answer and aggregation_labels are already present in the TSV file of step 2. Here's an example:

>>> from transformers import TapasTokenizer
>>> import pandas as pd

>>> model_name = "google/tapas-base"
>>> tokenizer = TapasTokenizer.from_pretrained(model_name)

>>> data = {"Actors": ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], "Number of movies": ["87", "53", "69"]}
>>> queries = [
...     "What is the name of the first actor?",
...     "How many movies has George Clooney played in?",
...     "What is the total number of movies?",
... ]
>>> answer_coordinates = [[(0, 0)], [(2, 1)], [(0, 1), (1, 1), (2, 1)]]
>>> answer_text = [["Brad Pitt"], ["69"], ["209"]]
>>> table = pd.DataFrame.from_dict(data)
>>> inputs = tokenizer(
...     table=table,
...     queries=queries,
...     answer_coordinates=answer_coordinates,
...     answer_text=answer_text,
...     padding="max_length",
...     return_tensors="pt",
... )
>>> inputs
{'input_ids': tensor([[ ... ]]), 'attention_mask': tensor([[...]]), 'token_type_ids': tensor([[[...]]]),
'numeric_values': tensor([[ ... ]]), 'numeric_values_scale': tensor([[ ... ]]), 'labels': tensor([[ ... ]])}

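Note that TapasTokenizer expects the data of the table to be **text-only**. You can use .astype(str) on a dataframe to turn it into text-only data. Of course, the example above only shows how to encode a single table with its questions. It is advised to create a dataloader to iterate over batches: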

>>> import torch
>>> import pandas as pd

>>> tsv_path = "your_path_to_the_tsv_file"
>>> table_csv_path = "your_path_to_a_directory_containing_all_csv_files"


>>> class TableDataset(torch.utils.data.Dataset):
...     def __init__(self, data, tokenizer):
...         self.data = data
...         self.tokenizer = tokenizer

...     def __getitem__(self, idx):
...         item = self.data.iloc[idx]
...         table = pd.read_csv(table_csv_path + item.table_file).astype(
...             str
...         )  # be sure to make your table data text only
...         encoding = self.tokenizer(
...             table=table,
...             queries=item.question,
...             answer_coordinates=item.answer_coordinates,
...             answer_text=item.answer_text,
...             truncation=True,
...             padding="max_length",
...             return_tensors="pt",
...         )
...         # remove the batch dimension which the tokenizer adds by default
...         encoding = {key: val.squeeze(0) for key, val in encoding.items()}
...         # add the float_answer which is also required (weak supervision for aggregation case)
...         encoding["float_answer"] = torch.tensor(item.float_answer)
...         return encoding

...     def __len__(self):
...         return len(self.data)


>>> data = pd.read_csv(tsv_path, sep="\t")
>>> train_dataset = TableDataset(data, tokenizer)
>>> train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=32)

TensorFlow

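Third, given that you've prepared your data in this TSV/CSV format (and the corresponding CSV files containing the tabular data), you can then use TapasTokenizer to convert table-question pairs into input_ids, attention_mask, token_type_ids and so on. Again, based on which of the three cases you picked above, TFTapasForQuestionAnswering requires different inputs to be fine-tuned:

Task	Required inputs
Conversational	`input_ids`, `attention_mask`, `token_type_ids`, `labels`
Weak supervision for aggregation	`input_ids`, `attention_mask`, `token_type_ids`, `labels`, `numeric_values`, `numeric_values_scale`, `float_answer`
Strong supervision for aggregation	`input_ids`, `attention_mask`, `token_type_ids`, `labels`, `aggregation_labels`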

>>> from transformers import TapasTokenizer
>>> import pandas as pd

>>> model_name = "google/tapas-base"
>>> tokenizer = TapasTokenizer.from_pretrained(model_name)

>>> data = {"Actors": ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], "Number of movies": ["87", "53", "69"]}
>>> queries = [
...     "What is the name of the first actor?",
...     "How many movies has George Clooney played in?",
...     "What is the total number of movies?",
... ]
>>> answer_coordinates = [[(0, 0)], [(2, 1)], [(0, 1), (1, 1), (2, 1)]]
>>> answer_text = [["Brad Pitt"], ["69"], ["209"]]
>>> table = pd.DataFrame.from_dict(data)
>>> inputs = tokenizer(
...     table=table,
...     queries=queries,
...     answer_coordinates=answer_coordinates,
...     answer_text=answer_text,
...     padding="max_length",
...     return_tensors="tf",
... )
>>> inputs
{'input_ids': tensor([[ ... ]]), 'attention_mask': tensor([[...]]), 'token_type_ids': tensor([[[...]]]),
'numeric_values': tensor([[ ... ]]), 'numeric_values_scale': tensor([[ ... ]]), 'labels': tensor([[ ... ]])}

>>> import tensorflow as tf
>>> import pandas as pd

>>> tsv_path = "your_path_to_the_tsv_file"
>>> table_csv_path = "your_path_to_a_directory_containing_all_csv_files"


>>> class TableDataset:
...     def __init__(self, data, tokenizer):
...         self.data = data
...         self.tokenizer = tokenizer

...     def __iter__(self):
...         for idx in range(self.__len__()):
...             item = self.data.iloc[idx]
...             table = pd.read_csv(table_csv_path + item.table_file).astype(
...                 str
...             )  # be sure to make your table data text only
...             encoding = self.tokenizer(
...                 table=table,
...                 queries=item.question,
...                 answer_coordinates=item.answer_coordinates,
...                 answer_text=item.answer_text,
...                 truncation=True,
...                 padding="max_length",
...                 return_tensors="tf",
...             )
...             # remove the batch dimension which the tokenizer adds by default
...             encoding = {key: tf.squeeze(val, 0) for key, val in encoding.items()}
...             # add the float_answer which is also required (weak supervision for aggregation case)
...             encoding["float_answer"] = tf.convert_to_tensor(item.float_answer, dtype=tf.float32)
...             yield encoding["input_ids"], encoding["attention_mask"], encoding["numeric_values"], encoding[
...                 "numeric_values_scale"
...             ], encoding["token_type_ids"], encoding["labels"], encoding["float_answer"]

...     def __len__(self):
...         return len(self.data)


>>> data = pd.read_csv(tsv_path, sep="\t")
>>> train_dataset = TableDataset(data, tokenizer)
>>> output_signature = (
...     tf.TensorSpec(shape=(512,), dtype=tf.int32),
...     tf.TensorSpec(shape=(512,), dtype=tf.int32),
...     tf.TensorSpec(shape=(512,), dtype=tf.float32),
...     tf.TensorSpec(shape=(512,), dtype=tf.float32),
...     tf.TensorSpec(shape=(512, 7), dtype=tf.int32),
...     tf.TensorSpec(shape=(512,), dtype=tf.int32),
...     tf.TensorSpec(shape=(), dtype=tf.float32),  # float_answer is a scalar per example
... )
>>> # from_generator expects a callable, so wrap the iterable dataset in a lambda
>>> train_dataloader = tf.data.Dataset.from_generator(lambda: train_dataset, output_signature=output_signature).batch(32)

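Note that here, we encode each table-question pair independently. This is fine as long as your dataset is **not conversational**. If your dataset involves conversational questions (such as in SQA), then you should first group together the queries, answer_coordinates and answer_text per table (in the order of their position index) and batch encode each table with its questions. This makes sure that the prev_labels token types (see the docs of TapasTokenizer) are set correctly. See this notebook for more info. See this notebook for more info regarding using the TensorFlow model.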
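
The grouping step described above could look roughly like the sketch below. It uses plain dictionaries with made-up rows, assumes that every table belongs to exactly one conversation, and `group_by_table` is a hypothetical helper rather than a library function.

```python
from collections import defaultdict

# made-up rows in the SQA-style layout (one conversation per table is assumed)
rows = [
    {"table_file": "table_0.csv", "position": 1, "question": "How old is he?",
     "answer_coordinates": [(0, 1)], "answer_text": ["59"]},
    {"table_file": "table_1.csv", "position": 0, "question": "Which city is largest?",
     "answer_coordinates": [(2, 0)], "answer_text": ["Tokyo"]},
    {"table_file": "table_0.csv", "position": 0, "question": "What is the name of the first actor?",
     "answer_coordinates": [(0, 0)], "answer_text": ["Brad Pitt"]},
]


def group_by_table(rows):
    """Collect the rows of each table and order them by their position index."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["table_file"]].append(row)

    batches = []
    for table_file, table_rows in grouped.items():
        ordered = sorted(table_rows, key=lambda row: row["position"])
        batches.append(
            {
                "table_file": table_file,
                "queries": [row["question"] for row in ordered],
                "answer_coordinates": [row["answer_coordinates"] for row in ordered],
                "answer_text": [row["answer_text"] for row in ordered],
            }
        )
    return batches


batches = group_by_table(rows)
print(batches[0]["queries"])
# ['What is the name of the first actor?', 'How old is he?']
```

Each entry of `batches` then holds the queries, answer_coordinates and answer_text lists for one table, ready to be passed to the tokenizer in a single call together with that table.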

Step 4: Train (fine-tune) the model

PyTorch

You can then fine-tune TapasForQuestionAnswering as follows (shown here for the weak supervision for aggregation case):

>>> from torch.optim import AdamW
>>> from transformers import TapasConfig, TapasForQuestionAnswering

>>> # this is the default WTQ configuration
>>> config = TapasConfig(
...     num_aggregation_labels=4,
...     use_answer_as_supervision=True,
...     answer_loss_cutoff=0.664694,
...     cell_selection_preference=0.207951,
...     huber_loss_delta=0.121194,
...     init_cell_selection_weights_to_zero=True,
...     select_one_column=True,
...     allow_empty_column_selection=False,
...     temperature=0.0352513,
... )
>>> model = TapasForQuestionAnswering.from_pretrained("google/tapas-base", config=config)

>>> optimizer = AdamW(model.parameters(), lr=5e-5)

>>> model.train()
>>> for epoch in range(2):  # loop over the dataset multiple times
...     for batch in train_dataloader:
...         # get the inputs;
...         input_ids = batch["input_ids"]
...         attention_mask = batch["attention_mask"]
...         token_type_ids = batch["token_type_ids"]
...         labels = batch["labels"]
...         numeric_values = batch["numeric_values"]
...         numeric_values_scale = batch["numeric_values_scale"]
...         float_answer = batch["float_answer"]

...         # zero the parameter gradients
...         optimizer.zero_grad()

...         # forward + backward + optimize
...         outputs = model(
...             input_ids=input_ids,
...             attention_mask=attention_mask,
...             token_type_ids=token_type_ids,
...             labels=labels,
...             numeric_values=numeric_values,
...             numeric_values_scale=numeric_values_scale,
...             float_answer=float_answer,
...         )
...         loss = outputs.loss
...         loss.backward()
...         optimizer.step()

TensorFlow

You can then fine-tune TFTapasForQuestionAnswering as follows (shown here for the weak supervision for aggregation case):

>>> import tensorflow as tf
>>> from transformers import TapasConfig, TFTapasForQuestionAnswering

>>> # this is the default WTQ configuration
>>> config = TapasConfig(
...     num_aggregation_labels=4,
...     use_answer_as_supervision=True,
...     answer_loss_cutoff=0.664694,
...     cell_selection_preference=0.207951,
...     huber_loss_delta=0.121194,
...     init_cell_selection_weights_to_zero=True,
...     select_one_column=True,
...     allow_empty_column_selection=False,
...     temperature=0.0352513,
... )
>>> model = TFTapasForQuestionAnswering.from_pretrained("google/tapas-base", config=config)

>>> optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)

>>> for epoch in range(2):  # loop over the dataset multiple times
...     for batch in train_dataloader:
...         # get the inputs;
...         input_ids = batch[0]
...         attention_mask = batch[1]
...         token_type_ids = batch[4]
...         labels = batch[5]
...         numeric_values = batch[2]
...         numeric_values_scale = batch[3]
...         float_answer = batch[6]

...         # forward + backward + optimize
...         with tf.GradientTape() as tape:
...             outputs = model(
...                 input_ids=input_ids,
...                 attention_mask=attention_mask,
...                 token_type_ids=token_type_ids,
...                 labels=labels,
...                 numeric_values=numeric_values,
...                 numeric_values_scale=numeric_values_scale,
...                 float_answer=float_answer,
...             )
...         grads = tape.gradient(outputs.loss, model.trainable_weights)
...         optimizer.apply_gradients(zip(grads, model.trainable_weights))

Usage: inference

PyTorch

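Here we explain how you can use TapasForQuestionAnswering or TFTapasForQuestionAnswering for inference (i.e. making predictions on new data). For inference, only input_ids, attention_mask and token_type_ids (which you can obtain using TapasTokenizer) have to be provided to the model to obtain the logits. Next, you can use the handy TapasTokenizer.convert_logits_to_predictions method to convert these logits into predicted coordinates and optional aggregation indices.

However, note that inference is **different** depending on whether or not the set-up is conversational. In a non-conversational set-up, inference can be done in parallel on all table-question pairs of a batch. Here's an example of that: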

>>> from transformers import TapasTokenizer, TapasForQuestionAnswering
>>> import pandas as pd

>>> model_name = "google/tapas-base-finetuned-wtq"
>>> model = TapasForQuestionAnswering.from_pretrained(model_name)
>>> tokenizer = TapasTokenizer.from_pretrained(model_name)

>>> data = {"Actors": ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], "Number of movies": ["87", "53", "69"]}
>>> queries = [
...     "What is the name of the first actor?",
...     "How many movies has George Clooney played in?",
...     "What is the total number of movies?",
... ]
>>> table = pd.DataFrame.from_dict(data)
>>> inputs = tokenizer(table=table, queries=queries, padding="max_length", return_tensors="pt")
>>> outputs = model(**inputs)
>>> predicted_answer_coordinates, predicted_aggregation_indices = tokenizer.convert_logits_to_predictions(
...     inputs, outputs.logits.detach(), outputs.logits_aggregation.detach()
... )

>>> # let's print out the results:
>>> id2aggregation = {0: "NONE", 1: "SUM", 2: "AVERAGE", 3: "COUNT"}
>>> aggregation_predictions_string = [id2aggregation[x] for x in predicted_aggregation_indices]

>>> answers = []
>>> for coordinates in predicted_answer_coordinates:
...     if len(coordinates) == 1:
...         # only a single cell:
...         answers.append(table.iat[coordinates[0]])
...     else:
...         # multiple cells
...         cell_values = []
...         for coordinate in coordinates:
...             cell_values.append(table.iat[coordinate])
...         answers.append(", ".join(cell_values))

>>> display(table)
>>> print("")
>>> for query, answer, predicted_agg in zip(queries, answers, aggregation_predictions_string):
...     print(query)
...     if predicted_agg == "NONE":
...         print("Predicted answer: " + answer)
...     else:
...         print("Predicted answer: " + predicted_agg + " > " + answer)
What is the name of the first actor?
Predicted answer: Brad Pitt
How many movies has George Clooney played in?
Predicted answer: COUNT > 69
What is the total number of movies?
Predicted answer: SUM > 87, 53, 69

TensorFlow

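Here we explain how you can use TFTapasForQuestionAnswering for inference (i.e. making predictions on new data). For inference, only input_ids, attention_mask and token_type_ids (which you can obtain using TapasTokenizer) have to be provided to the model to obtain the logits. Next, you can use the handy TapasTokenizer.convert_logits_to_predictions method to convert these logits into predicted coordinates and optional aggregation indices.

However, note that inference is **different** depending on whether or not the set-up is conversational. In a non-conversational set-up, inference can be done in parallel on all table-question pairs of a batch. Here's an example of that: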

>>> from transformers import TapasTokenizer, TFTapasForQuestionAnswering
>>> import pandas as pd

>>> model_name = "google/tapas-base-finetuned-wtq"
>>> model = TFTapasForQuestionAnswering.from_pretrained(model_name)
>>> tokenizer = TapasTokenizer.from_pretrained(model_name)

>>> data = {"Actors": ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], "Number of movies": ["87", "53", "69"]}
>>> queries = [
...     "What is the name of the first actor?",
...     "How many movies has George Clooney played in?",
...     "What is the total number of movies?",
... ]
>>> table = pd.DataFrame.from_dict(data)
>>> inputs = tokenizer(table=table, queries=queries, padding="max_length", return_tensors="tf")
>>> outputs = model(**inputs)
>>> predicted_answer_coordinates, predicted_aggregation_indices = tokenizer.convert_logits_to_predictions(
...     inputs, outputs.logits, outputs.logits_aggregation
... )

>>> # let's print out the results:
>>> id2aggregation = {0: "NONE", 1: "SUM", 2: "AVERAGE", 3: "COUNT"}
>>> aggregation_predictions_string = [id2aggregation[x] for x in predicted_aggregation_indices]

>>> answers = []
>>> for coordinates in predicted_answer_coordinates:
...     if len(coordinates) == 1:
...         # only a single cell:
...         answers.append(table.iat[coordinates[0]])
...     else:
...         # multiple cells
...         cell_values = []
...         for coordinate in coordinates:
...             cell_values.append(table.iat[coordinate])
...         answers.append(", ".join(cell_values))

>>> display(table)
>>> print("")
>>> for query, answer, predicted_agg in zip(queries, answers, aggregation_predictions_string):
...     print(query)
...     if predicted_agg == "NONE":
...         print("Predicted answer: " + answer)
...     else:
...         print("Predicted answer: " + predicted_agg + " > " + answer)
What is the name of the first actor?
Predicted answer: Brad Pitt
How many movies has George Clooney played in?
Predicted answer: COUNT > 69
What is the total number of movies?
Predicted answer: SUM > 87, 53, 69
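
The printed predictions only name an aggregation operator and the selected cells; turning them into a final value is left to the caller. A minimal sketch of that last step, assuming the four WTQ operators shown above and cell values that parse as floats (`final_answer` is a hypothetical helper, not a Transformers function):

```python
def final_answer(operator, cell_values):
    """Apply a predicted aggregation operator to the selected cell strings."""
    if operator == "NONE":
        # plain cell selection: return the cell texts unchanged
        return ", ".join(cell_values)
    if operator == "COUNT":
        return float(len(cell_values))
    # SUM and AVERAGE need numeric cells; float() raises ValueError otherwise
    numbers = [float(value) for value in cell_values]
    if operator == "SUM":
        return sum(numbers)
    if operator == "AVERAGE":
        return sum(numbers) / len(numbers) if numbers else float("nan")
    raise ValueError(f"unknown aggregation operator: {operator}")


print(final_answer("NONE", ["Brad Pitt"]))  # Brad Pitt
print(final_answer("SUM", ["87", "53", "69"]))  # 209.0
print(final_answer("COUNT", ["69"]))  # 1.0
```

Note that applying COUNT to the single selected cell `69` yields 1.0, so a wrongly predicted operator changes the final value even when the right cell was selected.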

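In a conversational set-up, each table-question pair must be provided **sequentially** to the model, such that the prev_labels token types can be overwritten by the labels predicted for the previous table-question pair. Again, more info can be found in this notebook (for PyTorch) and this notebook (for TensorFlow).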
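
The overwrite step can be sketched without any model. The function below marks every token whose cell was part of the previous prediction. It assumes that per-token column and row ids are 0 for question, special and header tokens and 1-based for table cells, and that predicted coordinates are 0-based (row, column) tuples; `build_prev_labels` is a hypothetical helper, so check the TapasTokenizer docs for the actual token type layout before relying on it.

```python
def build_prev_labels(column_ids, row_ids, previous_coordinates):
    """Return one 0/1 label per token, 1 if its cell was selected previously.

    column_ids / row_ids: per-token ids, 0 outside the table body and
    1-based inside it (assumed layout, see the paragraph above).
    previous_coordinates: 0-based (row, column) tuples predicted for the
    previous question.
    """
    selected = set(previous_coordinates)
    labels = []
    for column_id, row_id in zip(column_ids, row_ids):
        in_table_body = column_id > 0 and row_id > 0
        # shift the 1-based ids to 0-based coordinates before the lookup
        is_selected = in_table_body and (row_id - 1, column_id - 1) in selected
        labels.append(1 if is_selected else 0)
    return labels


# 3 question tokens, 2 header tokens, then cells (1,1) x2 tokens, (1,2), (2,1), (2,2)
column_ids = [0, 0, 0, 1, 2, 1, 1, 2, 1, 2]
row_ids = [0, 0, 0, 0, 0, 1, 1, 1, 2, 2]
prev_labels = build_prev_labels(column_ids, row_ids, [(0, 0)])
print(prev_labels)  # [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]
```

In a real loop the resulting labels would replace the prev_labels channel of token_type_ids before the next question is fed to the model.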

Resources

TAPAS specific outputs

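class transformers.models.tapas.modeling_tapas.TableQuestionAnsweringOutput

( loss: typing.Optional[torch.FloatTensor] = None logits: typing.Optional[torch.FloatTensor] = None logits_aggregation: typing.Optional[torch.FloatTensor] = None hidden_states: typing.Optional[tuple[torch.FloatTensor]] = None attentions: typing.Optional[tuple[torch.FloatTensor]] = None )

Parameters

loss (torch.FloatTensor of shape (1,), optional, returned when labels (and possibly answer, aggregation_labels, numeric_values and numeric_values_scale) are provided): Total loss as the sum of the hierarchical cell selection log-likelihood loss and (optionally) the semi-supervised regression loss and (optionally) the supervised loss for aggregations.
logits (torch.FloatTensor of shape (batch_size, sequence_length)): Prediction scores of the cell selection head, for every token.
logits_aggregation (torch.FloatTensor, optional, of shape (batch_size, num_aggregation_labels)): Prediction scores of the aggregation head, for every aggregation operator.
hidden_states (tuple[torch.FloatTensor], optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple[torch.FloatTensor], optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

Output type of TapasForQuestionAnswering.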

TapasConfig

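class transformers.TapasConfig

( vocab_size = 30522 hidden_size = 768 num_hidden_layers = 12 num_attention_heads = 12 intermediate_size = 3072 hidden_act = 'gelu' hidden_dropout_prob = 0.1 attention_probs_dropout_prob = 0.1 max_position_embeddings = 1024 type_vocab_sizes = [3, 256, 256, 2, 256, 256, 10] initializer_range = 0.02 layer_norm_eps = 1e-12 pad_token_id = 0 positive_label_weight = 10.0 num_aggregation_labels = 0 aggregation_loss_weight = 1.0 use_answer_as_supervision = None answer_loss_importance = 1.0 use_normalized_answer_loss = False huber_loss_delta = None temperature = 1.0 aggregation_temperature = 1.0 use_gumbel_for_cells = False use_gumbel_for_aggregation = False average_approximation_function = 'ratio' cell_selection_preference = None answer_loss_cutoff = None max_num_rows = 64 max_num_columns = 32 average_logits_per_cell = False select_one_column = True allow_empty_column_selection = False init_cell_selection_weights_to_zero = False reset_position_index_per_cell = True disable_per_token_loss = False aggregation_labels = None no_aggregation_label_index = None **kwargs )

Parameters

vocab_size (int, optional, defaults to 30522): Vocabulary size of the TAPAS model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling TapasModel.
hidden_size (int, optional, defaults to 768): Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (int, optional, defaults to 12): Number of hidden layers in the Transformer encoder.
num_attention_heads (int, optional, defaults to 12): Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (int, optional, defaults to 3072): Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
hidden_act (str or Callable, optional, defaults to "gelu"): The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "swish" and "gelu_new" are supported.
hidden_dropout_prob (float, optional, defaults to 0.1): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (float, optional, defaults to 0.1): The dropout ratio for the attention probabilities.
max_position_embeddings (int, optional, defaults to 1024): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
type_vocab_sizes (list[int], optional, defaults to [3, 256, 256, 2, 256, 256, 10]): The vocabulary sizes of the token_type_ids passed when calling TapasModel.
initializer_range (float, optional, defaults to 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (float, optional, defaults to 1e-12): The epsilon used by the layer normalization layers.
positive_label_weight (float, optional, defaults to 10.0): Weight for positive labels.
num_aggregation_labels (int, optional, defaults to 0): The number of aggregation operators to predict.
aggregation_loss_weight (float, optional, defaults to 1.0): Importance weight for the aggregation loss.
use_answer_as_supervision (bool, optional): Whether to use the answer as the only supervision for aggregation examples.
answer_loss_importance (float, optional, defaults to 1.0): Importance weight for the regression loss.
use_normalized_answer_loss (bool, optional, defaults to False): Whether to normalize the answer loss by the maximum of the predicted and expected value.
huber_loss_delta (float, optional): Delta parameter used to calculate the regression loss.
temperature (float, optional, defaults to 1.0): Value used to control (or change) the skewness of cell logits probabilities.
aggregation_temperature (float, optional, defaults to 1.0): Scales aggregation logits to control the skewness of probabilities.
use_gumbel_for_cells (bool, optional, defaults to False): Whether to apply Gumbel-Softmax to cell selection.
use_gumbel_for_aggregation (bool, optional, defaults to False): Whether to apply Gumbel-Softmax to aggregation selection.
average_approximation_function (string, optional, defaults to "ratio"): Method to calculate the expected average of cells in the weak supervision case. Must be one of "ratio", "first_order" or "second_order".
cell_selection_preference (float, optional): Preference for cell selection in ambiguous cases. Only applicable in case of weak supervision for aggregation (WTQ, WikiSQL). If the total mass of the aggregation probabilities (excluding the "NONE" operator) is higher than this hyperparameter, then aggregation is predicted for an example.
answer_loss_cutoff (float, optional): Ignore examples with an answer loss larger than the cutoff.
max_num_rows (int, optional, defaults to 64): Maximum number of rows.
max_num_columns (int, optional, defaults to 32): Maximum number of columns.
average_logits_per_cell (bool, optional, defaults to False): Whether to average logits per cell.
select_one_column (bool, optional, defaults to True): Whether to constrain the model to only select cells from a single column.
allow_empty_column_selection (bool, optional, defaults to False): Whether to allow the model not to select any column.
init_cell_selection_weights_to_zero (bool, optional, defaults to False): Whether to initialize cell selection weights to 0 so that the initial probabilities are 50%.
reset_position_index_per_cell (bool, optional, defaults to True): Whether to restart position indexes at every cell (i.e. use relative position embeddings).
disable_per_token_loss (bool, optional, defaults to False): Whether to disable any (strong or weak) supervision on cells.
aggregation_labels (dict[int, label], optional): The aggregation labels used to aggregate the results. For example, the WTQ models have the following aggregation labels: {0: "NONE", 1: "SUM", 2: "AVERAGE", 3: "COUNT"}
no_aggregation_label_index (int, optional): If the aggregation labels are defined and one of these labels represents "No aggregation", this should be set to its index. For example, the WTQ models have the "NONE" aggregation label at index 0, so that value should be set to 0 for these models.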
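
The rule stated for cell_selection_preference can be illustrated with a few lines of plain Python. This is a sketch of the rule as worded in the parameter description, not the library's internal code; `softmax` and `predicts_aggregation` are hypothetical helpers, and the logits are made up.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shifted = [value - max(logits) for value in logits]
    exps = [math.exp(value) for value in shifted]
    total = sum(exps)
    return [value / total for value in exps]


def predicts_aggregation(aggregation_logits, cell_selection_preference, no_aggregation_index=0):
    """True if the probability mass on all operators except NONE exceeds the threshold."""
    probabilities = softmax(aggregation_logits)
    mass = sum(p for index, p in enumerate(probabilities) if index != no_aggregation_index)
    return mass > cell_selection_preference


# operator order as in the WTQ labels: NONE, SUM, AVERAGE, COUNT
print(predicts_aggregation([0.0, 0.0, 0.0, 0.0], 0.207951))  # True, mass is 0.75
print(predicts_aggregation([10.0, 0.0, 0.0, 0.0], 0.207951))  # False, NONE dominates
```

The threshold 0.207951 is the value used in the WTQ configuration shown in the fine-tuning section.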

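This is the configuration class to store the configuration of a TapasModel. It is used to instantiate a TAPAS model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the TAPAS google/tapas-base-finetuned-sqa architecture.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation of that class for more information.

Hyperparameters additional to BERT are taken from run_task_main.py and hparam_utils.py of the original implementation. The original implementation is available at https://github.com/google-research/tapas/tree/master.

Example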

>>> from transformers import TapasModel, TapasConfig

>>> # Initializing a default (SQA) Tapas configuration
>>> configuration = TapasConfig()
>>> # Initializing a model from the configuration
>>> model = TapasModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
