Transformers documentation
This model was released on 2020-10-23 and added to Hugging Face Transformers on 2020-11-27.
BARThez
BARThez is a BART model designed for French-language tasks. Unlike existing French BERT models, BARThez includes a pretrained encoder-decoder, allowing it to generate text. The model is also available as a multilingual variant, mBARThez, obtained by continuing the pretraining of multilingual BART on a French corpus.
You can find all the original BARThez checkpoints in the BARThez collection.
The examples below demonstrate how to predict the <mask> token with Pipeline, AutoModel, and from the command line.
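Besides the Pipeline example that follows, the <mask> token can be predicted directly with the Auto classes. A minimal sketch using the standard AutoTokenizer and AutoModelForMaskedLM APIs and the same moussaKam/barthez checkpoint:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("moussaKam/barthez")
model = AutoModelForMaskedLM.from_pretrained("moussaKam/barthez")

inputs = tokenizer(
    "Les plantes produisent <mask> grâce à un processus appelé photosynthèse.",
    return_tensors="pt",
)
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the <mask> position and take the highest-scoring token there
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```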
import torch
from transformers import pipeline
pipeline = pipeline(
    task="fill-mask",
    model="moussaKam/barthez",
    dtype=torch.float16,
    device=0
)
pipeline("Les plantes produisent <mask> grâce à un processus appelé photosynthèse.")

BarthezTokenizer
class transformers.BarthezTokenizer
< source > ( vocab: str | dict | list | None = None bos_token = '<s>' eos_token = '</s>' sep_token = '</s>' cls_token = '<s>' unk_token = '<unk>' pad_token = '<pad>' mask_token = '<mask>' add_prefix_space = True **kwargs )
Parameters

- bos_token (str, optional, defaults to "<s>") — The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token. When building a sequence using special tokens, this is not the token that is used for the beginning of sequence; the token used is the cls_token.
- eos_token (str, optional, defaults to "</s>") — The end of sequence token. When building a sequence using special tokens, this is not the token that is used for the end of sequence; the token used is the sep_token.
- sep_token (str, optional, defaults to "</s>") — The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification, or a text and a question for question answering. It is also used as the last token of a sequence built with special tokens.
- cls_token (str, optional, defaults to "<s>") — The classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.
- unk_token (str, optional, defaults to "<unk>") — The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.
- pad_token (str, optional, defaults to "<pad>") — The token used for padding, for example when batching sequences of different lengths.
- mask_token (str, optional, defaults to "<mask>") — The token used for masking values. This is the token used when training this model with masked language modeling; it is the token the model will try to predict.
- vocab_file (str, optional) — SentencePiece file (generally with a .spm extension) that contains the vocabulary needed to instantiate the tokenizer.
- vocab (str, dict or list, optional) — A custom vocabulary. If not provided, the vocabulary is loaded from vocab_file.
- add_prefix_space (bool, optional, defaults to True) — Whether or not to add an initial space to the input. This allows treating the leading word just like any other word.
Adapted from CamembertTokenizer and BartTokenizer. Construct a "fast" BARThez tokenizer. Based on SentencePiece.
This tokenizer inherits from PreTrainedTokenizerFast, which contains most of the main methods. Users should refer to that superclass for more information regarding those methods.
BarthezTokenizerFast
class transformers.BarthezTokenizerFast
< source > ( vocab: str | dict | list | None = None bos_token = '<s>' eos_token = '</s>' sep_token = '</s>' cls_token = '<s>' unk_token = '<unk>' pad_token = '<pad>' mask_token = '<mask>' add_prefix_space = True **kwargs )
Parameters

- bos_token (str, optional, defaults to "<s>") — The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token. When building a sequence using special tokens, this is not the token that is used for the beginning of sequence; the token used is the cls_token.
- eos_token (str, optional, defaults to "</s>") — The end of sequence token. When building a sequence using special tokens, this is not the token that is used for the end of sequence; the token used is the sep_token.
- sep_token (str, optional, defaults to "</s>") — The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification, or a text and a question for question answering. It is also used as the last token of a sequence built with special tokens.
- cls_token (str, optional, defaults to "<s>") — The classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.
- unk_token (str, optional, defaults to "<unk>") — The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.
- pad_token (str, optional, defaults to "<pad>") — The token used for padding, for example when batching sequences of different lengths.
- mask_token (str, optional, defaults to "<mask>") — The token used for masking values. This is the token used when training this model with masked language modeling; it is the token the model will try to predict.
- vocab_file (str, optional) — SentencePiece file (generally with a .spm extension) that contains the vocabulary needed to instantiate the tokenizer.
- vocab (str, dict or list, optional) — A custom vocabulary. If not provided, the vocabulary is loaded from vocab_file.
- add_prefix_space (bool, optional, defaults to True) — Whether or not to add an initial space to the input. This allows treating the leading word just like any other word.
Adapted from CamembertTokenizer and BartTokenizer. Construct a "fast" BARThez tokenizer. Based on SentencePiece.
This tokenizer inherits from PreTrainedTokenizerFast, which contains most of the main methods. Users should refer to that superclass for more information regarding those methods.