Transformers documentation


This model was released on 2020-10-23 and added to Hugging Face Transformers on 2020-11-27.

PyTorch

BARThez

BARThez is a BART model designed for French-language tasks. Unlike existing French BERT models, BARThez includes a pretrained encoder-decoder, which lets it generate text. The model is also available as a multilingual variant, mBARThez, produced by continuing the pretraining of multilingual BART on a French corpus.

You can find all the original BARThez checkpoints in the BARThez collection.

This model was contributed by moussakam. Refer to the BART documentation for more usage examples.

The examples below demonstrate how to predict the <mask> token with Pipeline, AutoModel, and from the command line.

Pipeline
AutoModel
Transformers CLI
import torch
from transformers import pipeline

pipeline = pipeline(
    task="fill-mask",
    model="moussaKam/barthez",
    dtype=torch.float16,
    device=0
)
pipeline("Les plantes produisent <mask> grâce à un processus appelé photosynthèse.")
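The same prediction can be made with the lower-level AutoModel path. The sketch below assumes the same `moussaKam/barthez` checkpoint as the Pipeline example; it encodes the text, finds the `<mask>` position, and decodes the highest-scoring token at that position.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("moussaKam/barthez")
model = AutoModelForMaskedLM.from_pretrained("moussaKam/barthez")

text = "Les plantes produisent <mask> grâce à un processus appelé photosynthèse."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the <mask> position and take the highest-scoring token there.
mask_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```

Unlike the Pipeline, this path returns raw logits, so you can inspect the full score distribution over the vocabulary at the masked position.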

BarthezTokenizer

class transformers.BarthezTokenizer

( vocab: str | dict | list | None = None bos_token = '<s>' eos_token = '</s>' sep_token = '</s>' cls_token = '<s>' unk_token = '<unk>' pad_token = '<pad>' mask_token = '<mask>' add_prefix_space = True **kwargs )

参数

  • bos_token (str, optional, defaults to "<s>") — The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token.

    When building a sequence using special tokens, this is not the token that is used for the beginning of sequence. The token used is the cls_token.

  • eos_token (str, optional, defaults to "</s>") — The end of sequence token.

    When building a sequence using special tokens, this is not the token that is used for the end of sequence. The token used is the sep_token.

  • sep_token (str, optional, defaults to "</s>") — The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or for a text and a question for question answering. It is also used as the last token of a sequence built with special tokens.
  • cls_token (str, optional, defaults to "<s>") — The classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.
  • unk_token (str, optional, defaults to "<unk>") — The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.
  • pad_token (str, optional, defaults to "<pad>") — The token used for padding, for example when batching sequences of different lengths.
  • mask_token (str, optional, defaults to "<mask>") — The token used for masking values. This is the token used when training this model with masked language modeling. This is the token which the model will try to predict.
  • vocab_file (str, optional) — SentencePiece file (generally with a .spm extension) that contains the vocabulary needed to instantiate the tokenizer.
  • vocab (str, dict or list, optional) — Custom vocabulary dictionary. If not provided, the vocabulary is loaded from vocab_file.
  • add_prefix_space (bool, optional, defaults to True) — Whether or not to add an initial space to the input. This allows the leading word to be treated just like any other word.

Adapted from CamembertTokenizer and BartTokenizer. Constructs a "fast" BARThez tokenizer. Based on SentencePiece.

This tokenizer inherits from PreTrainedTokenizerFast, which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

BarthezTokenizerFast

class transformers.BarthezTokenizerFast

( vocab: str | dict | list | None = None bos_token = '<s>' eos_token = '</s>' sep_token = '</s>' cls_token = '<s>' unk_token = '<unk>' pad_token = '<pad>' mask_token = '<mask>' add_prefix_space = True **kwargs )

参数

  • bos_token (str, optional, defaults to "<s>") — The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token.

    When building a sequence using special tokens, this is not the token that is used for the beginning of sequence. The token used is the cls_token.

  • eos_token (str, optional, defaults to "</s>") — The end of sequence token.

    When building a sequence using special tokens, this is not the token that is used for the end of sequence. The token used is the sep_token.

  • sep_token (str, optional, defaults to "</s>") — The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or for a text and a question for question answering. It is also used as the last token of a sequence built with special tokens.
  • cls_token (str, optional, defaults to "<s>") — The classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.
  • unk_token (str, optional, defaults to "<unk>") — The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.
  • pad_token (str, optional, defaults to "<pad>") — The token used for padding, for example when batching sequences of different lengths.
  • mask_token (str, optional, defaults to "<mask>") — The token used for masking values. This is the token used when training this model with masked language modeling. This is the token which the model will try to predict.
  • vocab_file (str, optional) — SentencePiece file (generally with a .spm extension) that contains the vocabulary needed to instantiate the tokenizer.
  • vocab (str, dict or list, optional) — Custom vocabulary dictionary. If not provided, the vocabulary is loaded from vocab_file.
  • add_prefix_space (bool, optional, defaults to True) — Whether or not to add an initial space to the input. This allows the leading word to be treated just like any other word.

Adapted from CamembertTokenizer and BartTokenizer. Constructs a "fast" BARThez tokenizer. Based on SentencePiece.

This tokenizer inherits from PreTrainedTokenizerFast, which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.
