Processors
Processors can mean two different things in the Transformers library:
- the objects that pre-process inputs for multi-modal models such as Wav2Vec2 (speech and text) or CLIP (text and vision)
- deprecated objects that were used in older versions of the library to preprocess data for GLUE or SQuAD.
Multi-modal processors
Any multi-modal model will require an object to encode or decode the data that groups several modalities (among text, vision and audio). This is handled by objects called processors, which group together two or more processing objects such as tokenizers (for the text modality), image processors (for vision) and feature extractors (for audio).
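As a rough illustration, this grouping can be sketched with stub components. TinyTokenizer, TinyImageProcessor and TinyProcessor below are hypothetical stand-ins, not the real Transformers classes:

```python
# Minimal sketch of how a processor groups modality-specific components.
# TinyTokenizer, TinyImageProcessor and TinyProcessor are hypothetical
# stand-ins, not the real Transformers classes.

class TinyTokenizer:
    def __call__(self, text):
        # Toy "tokenization": one integer id per character.
        return {"input_ids": [[ord(c) for c in t] for t in text]}

class TinyImageProcessor:
    def __call__(self, images):
        # Pretend every image becomes a fixed-size pixel tensor.
        return {"pixel_values": [[0.0] * 4 for _ in images]}

class TinyProcessor:
    """Dispatches text to the tokenizer and images to the image processor."""

    def __init__(self, tokenizer, image_processor):
        self.tokenizer = tokenizer
        self.image_processor = image_processor

    def __call__(self, text=None, images=None):
        outputs = {}
        if text is not None:
            outputs.update(self.tokenizer(text))
        if images is not None:
            outputs.update(self.image_processor(images))
        return outputs

processor = TinyProcessor(TinyTokenizer(), TinyImageProcessor())
batch = processor(text=["a cat"], images=["img.png"])
print(sorted(batch))  # ['input_ids', 'pixel_values']
```

The real processors work the same way at a high level: a single call dispatches each modality to the matching component and merges the results into one batch dictionary.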
Those processors inherit from the following base class, which implements the saving and loading functionality.
This is a mixin used to provide saving/loading functionality for all processor classes.
apply_chat_template
< source >( conversation: list[dict[str, str]] | list[list[dict[str, str]]] chat_template: str | None = None **kwargs: typing_extensions.Unpack[transformers.processing_utils.AllKwargsForChatTemplate] )
Similar to the apply_chat_template method on tokenizers, this method applies a Jinja template to input conversations to turn them into a single tokenizable string.
The input is expected to be in the following format, where each message's content is a list consisting of text and optionally image or video inputs. An image or video can also be provided as a URL or local path, which will be used to form pixel_values when return_dict=True. If not provided, only the formatted text is returned, optionally tokenized.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
            {"type": "text", "text": "Please describe this image in detail."},
        ],
    },
]
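To make the template's role concrete, here is a minimal pure-Python approximation of what a chat template does: flatten the message list into one prompt string, with placeholder tokens for non-text content. Real templates are Jinja and model-specific, so this is only an illustration:

```python
# Simplified stand-in for a chat template: flatten a multimodal
# conversation into a single string, replacing non-text content with
# placeholder tokens. Real templates are Jinja and model-specific.

def render_conversation(conversation):
    lines = []
    for message in conversation:
        parts = []
        for item in message["content"]:
            if item["type"] == "text":
                parts.append(item["text"])
            else:
                # Non-text items become placeholder tokens, e.g. <image>.
                parts.append(f"<{item['type']}>")
        lines.append(f"{message['role']}: " + " ".join(parts))
    return "\n".join(lines)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
            {"type": "text", "text": "Please describe this image in detail."},
        ],
    },
]
prompt = render_conversation(conversation)
print(prompt)  # user: <image> Please describe this image in detail.
```

The real method additionally loads the referenced images or videos and, with return_dict=True, returns pixel_values alongside the rendered text.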
This method forwards all its arguments to PreTrainedTokenizer’s batch_decode(). Please refer to the docstring of this method for more information.
Checks the passed argument's class against the expected Transformers class. If the actual class does not match the expected one, an error is raised; otherwise, the retrieved class is returned.
This method forwards all its arguments to PreTrainedTokenizer’s decode(). Please refer to the docstring of this method for more information.
from_args_and_dict
< source >( args processor_dict: dict **kwargs ) → ~processing_utils.ProcessingMixin
Parameters
- processor_dict (dict[str, Any]) — Dictionary that will be used to instantiate the processor object. Such a dictionary can be retrieved from a pretrained checkpoint by leveraging the ~processing_utils.ProcessingMixin.to_dict method.
- kwargs (dict[str, Any]) — Additional parameters from which to initialize the processor object.
Returns
~processing_utils.ProcessingMixin
The processor object instantiated from those parameters.
Instantiates a type of ~processing_utils.ProcessingMixin from a Python dictionary of parameters.
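The behavior amounts to merging the checkpoint dictionary with any overriding kwargs before calling the constructor. A minimal sketch, where SketchProcessor is a hypothetical class rather than the library's code:

```python
# Rough sketch of from_args_and_dict: kwargs override values loaded
# from the checkpoint's processor dictionary.

class SketchProcessor:
    def __init__(self, size=224, do_resize=True):
        self.size = size
        self.do_resize = do_resize

    @classmethod
    def from_args_and_dict(cls, processor_dict, **kwargs):
        merged = {**processor_dict, **kwargs}  # kwargs win on conflict
        return cls(**merged)

processor_dict = {"size": 384, "do_resize": True}
proc = SketchProcessor.from_args_and_dict(processor_dict, do_resize=False)
print(proc.size, proc.do_resize)  # 384 False
```

This is why passing a keyword argument at load time overrides the value stored in the checkpoint configuration.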
from_pretrained
< source >( pretrained_model_name_or_path: str | os.PathLike cache_dir: str | os.PathLike | None = None force_download: bool = False local_files_only: bool = False token: str | bool | None = None revision: str = 'main' **kwargs )
Parameters
- pretrained_model_name_or_path (str or os.PathLike) — This can be either:
  - a string, the model id of a pretrained feature_extractor hosted inside a model repo on huggingface.co.
  - a path to a directory containing a feature extractor file saved using the save_pretrained() method, e.g., ./my_model_directory/.
  - a path or url to a saved feature extractor JSON file, e.g., ./my_model_directory/preprocessor_config.json.
- **kwargs — Additional keyword arguments passed along to both from_pretrained() and ~tokenization_utils_base.PreTrainedTokenizer.from_pretrained.
Instantiate a processor associated with a pretrained model.
This class method simply calls the feature extractor's from_pretrained(), the image processor's ImageProcessingMixin.from_pretrained() and the tokenizer's ~tokenization_utils_base.PreTrainedTokenizer.from_pretrained methods. Please refer to the docstrings of these methods for more information.
get_processor_dict
< source >( pretrained_model_name_or_path: str | os.PathLike **kwargs ) → tuple[Dict, Dict]
Parameters
- pretrained_model_name_or_path (str or os.PathLike) — The identifier of the pre-trained checkpoint from which we want the dictionary of parameters.
- subfolder (str, optional, defaults to "") — In case the relevant files are located inside a subfolder of the model repo on huggingface.co, you can specify the folder name here.
Returns
tuple[Dict, Dict]
The dictionary(ies) that will be used to instantiate the processor object.
From a pretrained_model_name_or_path, resolve to a dictionary of parameters, to be used for instantiating a processor of type ~processing_utils.ProcessingMixin using from_args_and_dict.
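For a local directory, the resolution step is essentially reading a JSON config file. A simplified sketch, assuming the preprocessor_config.json file name used for locally saved processors (the real method also handles Hub downloads, subfolders and caching):

```python
# Simplified sketch of resolving a local directory to a processor dict,
# assuming the file name "preprocessor_config.json" used for locally
# saved processors. The real method also handles remote checkpoints.
import json
import os
import tempfile

def get_processor_dict(path):
    config_file = os.path.join(path, "preprocessor_config.json")
    with open(config_file, encoding="utf-8") as f:
        return json.load(f)

# Demo: write a tiny config, then resolve it back to a dict.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "preprocessor_config.json"), "w") as f:
        json.dump({"image_mean": [0.5, 0.5, 0.5]}, f)
    cfg = get_processor_dict(tmp)

print(cfg)  # {'image_mean': [0.5, 0.5, 0.5]}
```

The resulting dictionary is what from_args_and_dict consumes to build the processor instance.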
post_process_image_text_to_text
< source >( generated_outputs skip_special_tokens = True **kwargs ) → list[str]
Post-processes the output of a VLM to decode the text.
post_process_multimodal_output
< source >( generated_outputs skip_special_tokens = True generation_mode = None **kwargs ) → list[str]
Parameters
- generated_outputs (torch.Tensor or np.ndarray) — The output of the model's generate function. The output is expected to be a tensor of shape (batch_size, sequence_length) or (sequence_length,).
- skip_special_tokens (bool, optional, defaults to True) — Whether or not to remove special tokens from the output. Argument passed to the tokenizer's batch_decode method.
- generation_mode (str, optional) — The generation mode specifying the modality to output, one of ["text", "image", "audio"].
- **kwargs — Additional arguments passed to the tokenizer's batch_decode method.
Returns
list[str]
The decoded text.
Post-processes the output of a multimodal model to return the output of the requested modality. Raises an error if the model cannot generate the requested modality.
push_to_hub
< source >( repo_id: str commit_message: str | None = None commit_description: str | None = None private: bool | None = None token: bool | str | None = None revision: str | None = None create_pr: bool = False max_shard_size: int | str | None = '50GB' tags: list[str] | None = None )
Parameters
- repo_id (str) — The name of the repository you want to push your processor to. It should contain your organization name when pushing to a given organization.
- commit_message (str, optional) — Message to commit while pushing. Will default to "Upload processor".
- commit_description (str, optional) — The description of the commit that will be created.
- private (bool, optional) — Whether to make the repo private. If None (default), the repo will be public unless the organization's default is private. This value is ignored if the repo already exists.
- token (bool or str, optional) — The token to use as HTTP bearer authorization for remote files. If True (default), will use the token generated when running hf auth login (stored in ~/.huggingface).
- revision (str, optional) — Branch to push the uploaded files to.
- create_pr (bool, optional, defaults to False) — Whether or not to create a PR with the uploaded files or directly commit.
- max_shard_size (int or str, optional, defaults to "50GB") — Only applicable for models. The maximum size for a checkpoint before being sharded. Sharded checkpoints will each be of a size lower than this. If expressed as a string, it needs to be digits followed by a unit (like "5MB").
- tags (list[str], optional) — List of tags to push on the Hub.
Upload the processor files to the 🤗 Model Hub.
Example
from transformers import AutoProcessor
processor = AutoProcessor.from_pretrained("google-bert/bert-base-cased")
# Push the processor to your namespace with the name "my-finetuned-bert".
processor.push_to_hub("my-finetuned-bert")
# Push the processor to an organization with the name "my-finetuned-bert".
processor.push_to_hub("huggingface/my-finetuned-bert")

register_for_auto_class
< source >( auto_class = 'AutoProcessor' )
Register this class with a given auto class. This should only be used for custom feature extractors, as the ones in the library are already mapped with AutoProcessor.
save_pretrained
< source >( save_directory push_to_hub: bool = False **kwargs )
Parameters
- save_directory (str or os.PathLike) — Directory where the feature extractor JSON file and the tokenizer files will be saved (directory will be created if it does not exist).
- push_to_hub (bool, optional, defaults to False) — Whether or not to push your model to the Hugging Face model hub after saving it. You can specify the repository you want to push to with repo_id (will default to the name of save_directory in your namespace).
- kwargs (dict[str, Any], optional) — Additional keyword arguments passed along to the push_to_hub() method.
Saves the attributes of this processor (feature extractor, tokenizer…) in the specified directory so that it can be reloaded using the from_pretrained() method.
This class method simply calls the feature extractor's save_pretrained() and the tokenizer's save_pretrained() methods. Please refer to the docstrings of these methods for more information.
Serializes this instance to a Python dictionary.
to_json_file
< source >( json_file_path: str | os.PathLike )
Saves this instance to a JSON file.
Serializes this instance to a JSON string.
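The three serialization methods fit together as a small pipeline: to_dict feeds to_json_string, which feeds to_json_file. A minimal sketch with a hypothetical class, not the library's implementation:

```python
# Sketch of the to_dict / to_json_string / to_json_file trio on a
# hypothetical processor class: attributes -> dict -> JSON string -> file.
import json

class SerializableProcessor:
    def __init__(self, size=224):
        self.size = size

    def to_dict(self):
        # Serialize this instance to a Python dictionary.
        return dict(self.__dict__)

    def to_json_string(self):
        # Serialize this instance to a JSON string.
        return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"

    def to_json_file(self, json_file_path):
        # Save this instance to a JSON file.
        with open(json_file_path, "w", encoding="utf-8") as f:
            f.write(self.to_json_string())

proc = SerializableProcessor(size=384)
print(proc.to_json_string())
```

A file produced this way is what from_pretrained can later read back to reconstruct the processor's configuration.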
Deprecated processors
All processors follow the same architecture, which is that of the DataProcessor. The processor returns a list of InputExample. These InputExample can be converted into InputFeatures in order to be fed to the model.
Base class for data converters for sequence classification data sets.
Gets a collection of InputExample for the dev set.
get_example_from_tensor_dict
< source >( tensor_dict )
Gets an example from a dict.
Gets the list of labels for this data set.
Gets a collection of InputExample for the test set.
Gets a collection of InputExample for the train set.
Some tensorflow_datasets datasets are not formatted the same way the GLUE datasets are. This method converts the examples to the correct format.
class transformers.InputExample
< source >( guid: str text_a: str text_b: str | None = None label: str | None = None )
参数
- guid — Unique id for the example.
- text_a — string. The untokenized text of the first sequence. For single sequence tasks, only this sequence must be specified.
- text_b — (Optional) string. The untokenized text of the second sequence. Only must be specified for sequence pair tasks.
- label — (Optional) string. The label of the example. This should be specified for train and dev examples, but not for test examples.
A single training/test example for simple sequence classification.
Serializes this instance to a JSON string.
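A minimal sketch of such an example class as a Python dataclass, mirroring the fields documented above (the real class lives in transformers; this is only an illustration):

```python
# Sketch of InputExample as a dataclass with JSON serialization,
# mirroring the documented fields. Illustrative, not the real class.
import dataclasses
import json
from typing import Optional

@dataclasses.dataclass
class InputExample:
    guid: str
    text_a: str
    text_b: Optional[str] = None
    label: Optional[str] = None

    def to_json_string(self):
        # Serialize this instance to a JSON string.
        return json.dumps(dataclasses.asdict(self), indent=2) + "\n"

example = InputExample(guid="train-1", text_a="The cat sat.", label="positive")
print(example.to_json_string())
```

Note that text_b and label are optional, matching the single-sequence and test-set cases described in the parameter list.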
class transformers.InputFeatures
< source >( input_ids: list attention_mask: list[int] | None = None token_type_ids: list[int] | None = None label: int | float | None = None )
参数
- input_ids — Indices of input sequence tokens in the vocabulary.
- attention_mask — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: usually 1 for tokens that are NOT MASKED, 0 for MASKED (padded) tokens.
- token_type_ids — (Optional) Segment token indices to indicate first and second portions of the inputs. Only some models use them.
- label — (Optional) Label corresponding to the input. Int for classification problems, float for regression problems.
A single set of features of data. Property names are the same names as the corresponding inputs to a model.
Serializes this instance to a JSON string.
GLUE
General Language Understanding Evaluation (GLUE) is a benchmark that evaluates the performance of models across a diverse set of NLU tasks. It was released together with the paper GLUE: A multi-task benchmark and analysis platform for natural language understanding.
This library hosts a total of 10 processors for the following tasks: MRPC, MNLI, MNLI (mismatched), CoLA, SST2, STSB, QQP, QNLI, RTE and WNLI.
Those processors are:
- ~data.processors.utils.MrpcProcessor
- ~data.processors.utils.MnliProcessor
- ~data.processors.utils.MnliMismatchedProcessor
- ~data.processors.utils.ColaProcessor
- ~data.processors.utils.Sst2Processor
- ~data.processors.utils.StsbProcessor
- ~data.processors.utils.QqpProcessor
- ~data.processors.utils.QnliProcessor
- ~data.processors.utils.RteProcessor
- ~data.processors.utils.WnliProcessor
Additionally, the following method can be used to load values from a data file and convert them to a list of InputExample.
transformers.glue_convert_examples_to_features
< source >( examples: list tokenizer: PythonBackend max_length: int | None = None task = None label_list = None output_mode = None )
Loads a data file into a list of InputFeatures.
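The conversion logic can be sketched as mapping string labels to integer ids and tokenizing each example. The tokenizer below is a toy stand-in for a real PreTrainedTokenizer, and the function is a simplification of the library's method:

```python
# Sketch of glue_convert_examples_to_features: map string labels to
# integer ids and tokenize each example, truncating to max_length.
# toy_tokenize is a hypothetical stand-in for a real tokenizer.

def convert_examples_to_features(examples, tokenize, label_list, max_length):
    label_map = {label: i for i, label in enumerate(label_list)}
    features = []
    for ex in examples:
        input_ids = tokenize(ex["text_a"], ex.get("text_b"))[:max_length]
        features.append({"input_ids": input_ids, "label": label_map[ex["label"]]})
    return features

def toy_tokenize(text_a, text_b=None):
    # Whitespace split, then a fake integer id per token.
    tokens = text_a.split() + (text_b.split() if text_b else [])
    return [hash(t) % 1000 for t in tokens]

examples = [
    {"text_a": "great movie", "label": "positive"},
    {"text_a": "dull plot", "label": "negative"},
]
features = convert_examples_to_features(
    examples, toy_tokenize, label_list=["negative", "positive"], max_length=128)
print([f["label"] for f in features])  # [1, 0]
```

The real method additionally handles padding, attention masks, token type ids and the regression case (STS-B), where labels are floats instead of ids.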
XNLI
XNLI (The Cross-Lingual NLI Corpus) is a benchmark that evaluates the quality of cross-lingual text representations. XNLI is a crowd-sourced dataset based on MultiNLI: pairs of text are labeled with textual entailment annotations for 15 different languages (including both high-resource languages such as English and low-resource languages such as Swahili).
It was released together with the paper XNLI: Evaluating Cross-lingual Sentence Representations.
This library hosts the processor to load the XNLI data:
~data.processors.utils.XnliProcessor
Please note that since the gold labels are available on the test set, evaluation is performed on the test set.
An example using these processors is given in the run_xnli.py script.
SQuAD
The Stanford Question Answering Dataset (SQuAD) is a benchmark that evaluates the performance of models on question answering. Two versions are available, v1.1 and v2.0. The first version (v1.1) was released together with the paper SQuAD: 100,000+ Questions for Machine Comprehension of Text. The second version (v2.0) was released alongside the paper Know What You Don't Know: Unanswerable Questions for SQuAD.
This library hosts a processor for each of the two versions:
Processors
Those processors are:
- ~data.processors.utils.SquadV1Processor
- ~data.processors.utils.SquadV2Processor
They both inherit from the abstract class ~data.processors.utils.SquadProcessor.
Processor for the SQuAD data set. Overridden by SquadV1Processor and SquadV2Processor, used by version 1.1 and version 2.0 of SQuAD, respectively.
get_dev_examples
< source >( data_dir filename = None )
Returns the evaluation examples from the data directory.
get_examples_from_dataset
< source >( dataset evaluate = False )
Creates a list of SquadExample using a TFDS dataset.
get_train_examples
< source >( data_dir filename = None )
Returns the training examples from the data directory.
Additionally, the following method can be used to convert SQuAD examples into ~data.processors.utils.SquadFeatures that can be used as model inputs.
transformers.squad_convert_examples_to_features
< source >( examples tokenizer max_seq_length doc_stride max_query_length is_training padding_strategy = 'max_length' return_dataset = False threads = 1 tqdm_enabled = True )
Parameters
- examples — list of SquadExample
- tokenizer — an instance of a subclass of PreTrainedTokenizer
- max_seq_length — the maximum sequence length of the inputs.
- doc_stride — the stride used when the context is too large and is split across several features.
- max_query_length — the maximum length of the query.
- is_training — whether to create features for model evaluation or model training.
- padding_strategy — defaults to "max_length". The padding strategy to use.
- return_dataset — defaults to False. Can also be 'pt'. If 'pt': returns a torch.data.TensorDataset.
- threads — number of processing threads.
Converts a list of examples into a list of features that can be directly given as input to a model. It is model-dependent and takes advantage of many of the tokenizer's features to create the model's inputs.
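The role of doc_stride is easy to miss: when the context does not fit, it is split into overlapping windows so that an answer near a window boundary still appears whole in at least one feature. A minimal sketch of that windowing (simplified; the real function also accounts for the query and special tokens):

```python
# Sketch of how doc_stride splits a long context into overlapping
# windows when it exceeds the space available per feature.

def split_context(doc_tokens, max_doc_len, doc_stride):
    spans = []
    start = 0
    while start < len(doc_tokens):
        spans.append(doc_tokens[start:start + max_doc_len])
        if start + max_doc_len >= len(doc_tokens):
            break  # last window already covers the end of the context
        start += doc_stride
    return spans

tokens = list(range(10))  # a 10-token context
spans = split_context(tokens, max_doc_len=6, doc_stride=4)
print(spans)  # [[0, 1, 2, 3, 4, 5], [4, 5, 6, 7, 8, 9]]
```

A smaller doc_stride gives more overlap between consecutive windows (and more features per example), at the cost of more computation.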
Example
processor = SquadV2Processor()
examples = processor.get_dev_examples(data_dir)
features = squad_convert_examples_to_features(
examples=examples,
tokenizer=tokenizer,
max_seq_length=args.max_seq_length,
doc_stride=args.doc_stride,
max_query_length=args.max_query_length,
is_training=not evaluate,
)

These processors, as well as the aforementioned method, can be used with files containing the data as well as with the tensorflow_datasets package. Examples are given below.
Example usage
Here is an example using the processors as well as the conversion method using data files:
# Loading a V2 processor
processor = SquadV2Processor()
examples = processor.get_dev_examples(squad_v2_data_dir)
# Loading a V1 processor
processor = SquadV1Processor()
examples = processor.get_dev_examples(squad_v1_data_dir)
features = squad_convert_examples_to_features(
examples=examples,
tokenizer=tokenizer,
max_seq_length=max_seq_length,
doc_stride=args.doc_stride,
max_query_length=max_query_length,
is_training=not evaluate,
)

Using tensorflow_datasets is as easy as using a data file:
# tensorflow_datasets only handles SQuAD V1.
tfds_examples = tfds.load("squad")
examples = SquadV1Processor().get_examples_from_dataset(tfds_examples, evaluate=evaluate)
features = squad_convert_examples_to_features(
examples=examples,
tokenizer=tokenizer,
max_seq_length=max_seq_length,
doc_stride=args.doc_stride,
max_query_length=max_query_length,
is_training=not evaluate,
)

Another example using these processors is given in the run_squad.py script.