Transformers documentation
This model was released on June 20, 2020 and added to Hugging Face Transformers on February 2, 2021.
Wav2Vec2
Overview

The Wav2Vec2 model was proposed in wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed and Michael Auli.

The abstract from the paper is the following:

We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned. Experiments using all labeled data of Librispeech achieve 1.8/3.3 WER on the clean/other test sets. When lowering the amount of labeled data to one hour, wav2vec 2.0 outperforms the previous state of the art on the 100 hour subset while using 100 times less labeled data. Using just ten minutes of labeled data and pre-training on 53k hours of unlabeled data still achieves 4.8/8.2 WER. This demonstrates the feasibility of speech recognition with limited amounts of labeled data.

This model was contributed by patrickvonplaten.

Note: Meta (FAIR) released a new version of Wav2Vec2-BERT 2.0, pretrained on 4.5M hours of audio. We especially recommend using it for fine-tuning tasks, e.g. as per this guide.
Usage tips

- Wav2Vec2 is a speech model that accepts a float array corresponding to the raw waveform of the speech signal.
- Wav2Vec2 was trained using connectionist temporal classification (CTC), so the model output has to be decoded using Wav2Vec2CTCTokenizer.
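The core idea behind the CTC decoding step can be sketched in a few lines of plain Python. This is an illustrative sketch, not the library implementation: greedy CTC decoding collapses consecutive repeated token ids and then drops the blank (padding) id, which is essentially what Wav2Vec2CTCTokenizer does (plus vocabulary handling) when turning per-frame argmax ids into text. The toy vocabulary below is hypothetical.

```python
from itertools import groupby

def greedy_ctc_decode(frame_ids, id_to_char, blank_id=0):
    # 1) merge runs of identical ids (the model emits one prediction per frame)
    collapsed = [key for key, _ in groupby(frame_ids)]
    # 2) drop the CTC blank token, then map the remaining ids to characters
    return "".join(id_to_char[i] for i in collapsed if i != blank_id)

# hypothetical toy vocabulary: 0 is the blank/pad token, '|' the word delimiter
vocab = {0: "<pad>", 1: "C", 2: "A", 3: "T", 4: "|"}
frames = [1, 1, 0, 2, 2, 2, 0, 0, 3, 3]  # per-frame argmax ids
print(greedy_ctc_decode(frames, vocab))  # CAT
```

Note that a blank between two identical ids is what allows genuine double letters to survive the collapsing step.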
Using Flash Attention 2

Flash Attention 2 is a faster, optimized version of the attention computation used inside the model.

Installation

First, check whether your hardware is compatible with Flash Attention 2. The latest list of compatible hardware can be found in the official documentation.

Next, install the latest version of Flash Attention 2:

pip install -U flash-attn --no-build-isolation
Usage

To load a model using Flash Attention 2, pass the argument attn_implementation="flash_attention_2" to .from_pretrained. We'll also load the model in half-precision (e.g. torch.float16), since it results in almost no degradation to audio quality but significantly lower memory usage and faster inference:

>>> import torch
>>> from transformers import Wav2Vec2Model

>>> device = "cuda" if torch.cuda.is_available() else "cpu"
>>> model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large-960h-lv60-self", dtype=torch.float16, attn_implementation="flash_attention_2").to(device)

Expected speedups

Below is an expected speedup diagram comparing the pure inference time between the native implementation in transformers of the facebook/wav2vec2-large-960h-lv60-self model and the flash-attention-2 and sdpa (scaled-dot-product-attention) versions. We show the average speedup obtained on the librispeech_asr clean validation split.

Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with Wav2Vec2. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

- A notebook on how to leverage a pretrained Wav2Vec2 model for emotion classification. 🌎
- Wav2Vec2ForCTC is supported by this example script and notebook.
- Audio classification task guide
- A blog post on boosting Wav2Vec2 with n-grams in 🤗 Transformers.
- A blog post on how to finetune Wav2Vec2 for English ASR with 🤗 Transformers.
- A blog post on finetuning XLS-R for multi-lingual ASR with 🤗 Transformers.
- A notebook on how to create YouTube captions from any video by transcribing audio with Wav2Vec2. 🌎
- Wav2Vec2ForCTC is supported by notebooks on how to finetune a speech recognition model in English, and how to finetune a speech recognition model in any language.
- Automatic speech recognition task guide

🚀 Deploy

- A blog post on how to deploy Wav2Vec2 for Automatic Speech Recognition with Hugging Face's Transformers & Amazon SageMaker.
Wav2Vec2Config
class transformers.Wav2Vec2Config
< source >( vocab_size = 32 hidden_size = 768 num_hidden_layers = 12 num_attention_heads = 12 intermediate_size = 3072 hidden_act = 'gelu' hidden_dropout = 0.1 activation_dropout = 0.1 attention_dropout = 0.1 feat_proj_dropout = 0.0 feat_quantizer_dropout = 0.0 final_dropout = 0.1 layerdrop = 0.1 initializer_range = 0.02 layer_norm_eps = 1e-05 feat_extract_norm = 'group' feat_extract_activation = 'gelu' conv_dim = (512, 512, 512, 512, 512, 512, 512) conv_stride = (5, 2, 2, 2, 2, 2, 2) conv_kernel = (10, 3, 3, 3, 3, 2, 2) conv_bias = False num_conv_pos_embeddings = 128 num_conv_pos_embedding_groups = 16 do_stable_layer_norm = False apply_spec_augment = True mask_time_prob = 0.05 mask_time_length = 10 mask_time_min_masks = 2 mask_feature_prob = 0.0 mask_feature_length = 10 mask_feature_min_masks = 0 num_codevectors_per_group = 320 num_codevector_groups = 2 contrastive_logits_temperature = 0.1 num_negatives = 100 codevector_dim = 256 proj_codevector_dim = 256 diversity_loss_weight = 0.1 ctc_loss_reduction = 'sum' ctc_zero_infinity = False use_weighted_layer_sum = False classifier_proj_size = 256 tdnn_dim = (512, 512, 512, 512, 1500) tdnn_kernel = (5, 3, 3, 1, 1) tdnn_dilation = (1, 2, 3, 1, 1) xvector_output_dim = 512 pad_token_id = 0 bos_token_id = 1 eos_token_id = 2 add_adapter = False adapter_kernel_size = 3 adapter_stride = 2 num_adapter_layers = 3 output_hidden_size = None adapter_attn_dim = None **kwargs )
Parameters

- vocab_size (int, optional, defaults to 32) — Vocabulary size of the Wav2Vec2 model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling Wav2Vec2Model.
- hidden_size (int, optional, defaults to 768) — Dimensionality of the encoder layers and the pooler layer.
- num_hidden_layers (int, optional, defaults to 12) — Number of hidden layers in the Transformer encoder.
- num_attention_heads (int, optional, defaults to 12) — Number of attention heads for each attention layer in the Transformer encoder.
- intermediate_size (int, optional, defaults to 3072) — Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
- hidden_act (str or function, optional, defaults to "gelu") — The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "selu" and "gelu_new" are supported.
- hidden_dropout (float, optional, defaults to 0.1) — The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
- activation_dropout (float, optional, defaults to 0.1) — The dropout ratio for activations inside the fully connected layers.
- attention_dropout (float, optional, defaults to 0.1) — The dropout ratio for the attention probabilities.
- final_dropout (float, optional, defaults to 0.1) — The dropout probability for the final projection layer of Wav2Vec2ForCTC.
- layerdrop (float, optional, defaults to 0.1) — The LayerDrop probability. See the LayerDrop paper (https://huggingface.co/papers/1909.11556) for more details.
- initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated normal initializer for initializing all weight matrices.
- layer_norm_eps (float, optional, defaults to 1e-05) — The epsilon used by the layer normalization layers.
- feat_extract_norm (str, optional, defaults to "group") — The norm to be applied to 1D convolutional layers in the feature encoder. One of "group" for group normalization of only the first 1D convolutional layer, or "layer" for layer normalization of all 1D convolutional layers.
- feat_proj_dropout (float, optional, defaults to 0.0) — The dropout probability for the output of the feature encoder.
- feat_extract_activation (str, optional, defaults to "gelu") — The non-linear activation function (function or string) in the 1D convolutional layers of the feature extractor. If string, "gelu", "relu", "selu" and "gelu_new" are supported.
- feat_quantizer_dropout (float, optional, defaults to 0.0) — The dropout probability for the output of the feature encoder that's used by the quantizer.
- conv_dim (tuple[int] or list[int], optional, defaults to (512, 512, 512, 512, 512, 512, 512)) — A tuple of integers defining the number of input and output channels of each 1D convolutional layer in the feature encoder. The length of conv_dim defines the number of 1D convolutional layers.
- conv_stride (tuple[int] or list[int], optional, defaults to (5, 2, 2, 2, 2, 2, 2)) — A tuple of integers defining the stride of each 1D convolutional layer in the feature encoder. The length of conv_stride defines the number of convolutional layers and has to match the length of conv_dim.
- conv_kernel (tuple[int] or list[int], optional, defaults to (10, 3, 3, 3, 3, 2, 2)) — A tuple of integers defining the kernel size of each 1D convolutional layer in the feature encoder. The length of conv_kernel defines the number of convolutional layers and has to match the length of conv_dim.
- conv_bias (bool, optional, defaults to False) — Whether the 1D convolutional layers have a bias.
- num_conv_pos_embeddings (int, optional, defaults to 128) — Number of convolutional positional embeddings. Defines the kernel size of the 1D convolutional positional embeddings layer.
- num_conv_pos_embedding_groups (int, optional, defaults to 16) — Number of groups of the 1D convolutional positional embeddings layer.
- do_stable_layer_norm (bool, optional, defaults to False) — Whether to apply the stable layer norm architecture of the Transformer encoder. do_stable_layer_norm is True corresponds to applying layer norm before the attention layer, whereas do_stable_layer_norm is False corresponds to applying layer norm after the attention layer.
- apply_spec_augment (bool, optional, defaults to True) — Whether to apply SpecAugment data augmentation to the outputs of the feature encoder. For reference see SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition.
- mask_time_prob (float, optional, defaults to 0.05) — Percentage (between 0 and 1) of all feature vectors along the time axis which will be masked. The masking procedure generates mask_time_prob*len(time_axis)/mask_time_length independent masks over the axis. If reasoning from the probability of each feature vector to be chosen as the start of the vector span to be masked, mask_time_prob should be prob_vector_start*mask_time_length. Note that overlap may decrease the actual percentage of masked vectors. This is only relevant if apply_spec_augment is True.
- mask_time_length (int, optional, defaults to 10) — Length of vector span along the time axis.
- mask_time_min_masks (int, optional, defaults to 2) — The minimum number of masks of length mask_time_length generated along the time axis, each time step, irrespectively of mask_time_prob. Only relevant if mask_time_prob*len(time_axis)/mask_time_length < mask_time_min_masks.
- mask_feature_prob (float, optional, defaults to 0.0) — Percentage (between 0 and 1) of all feature vectors along the feature axis which will be masked. The masking procedure generates mask_feature_prob*len(feature_axis)/mask_feature_length independent masks over the axis. If reasoning from the probability of each feature vector to be chosen as the start of the vector span to be masked, mask_feature_prob should be prob_vector_start*mask_feature_length. Note that overlap may decrease the actual percentage of masked vectors. This is only relevant if apply_spec_augment is True.
- mask_feature_length (int, optional, defaults to 10) — Length of vector span along the feature axis.
- mask_feature_min_masks (int, optional, defaults to 0) — The minimum number of masks of length mask_feature_length generated along the feature axis, each time step, irrespectively of mask_feature_prob. Only relevant if mask_feature_prob*len(feature_axis)/mask_feature_length < mask_feature_min_masks.
- num_codevectors_per_group (int, optional, defaults to 320) — Number of entries in each quantization codebook (group).
- num_codevector_groups (int, optional, defaults to 2) — Number of codevector groups for product codevector quantization.
- contrastive_logits_temperature (float, optional, defaults to 0.1) — The temperature kappa in the contrastive loss.
- num_negatives (int, optional, defaults to 100) — Number of negative samples for the contrastive loss.
- codevector_dim (int, optional, defaults to 256) — Dimensionality of the quantized feature vectors.
- proj_codevector_dim (int, optional, defaults to 256) — Dimensionality of the final projection of both the quantized and the transformer features.
- diversity_loss_weight (float, optional, defaults to 0.1) — The weight of the codebook diversity loss component.
- ctc_loss_reduction (str, optional, defaults to "sum") — Specifies the reduction to apply to the output of torch.nn.CTCLoss. Only relevant when training an instance of Wav2Vec2ForCTC.
- ctc_zero_infinity (bool, optional, defaults to False) — Whether to zero infinite losses and the associated gradients of torch.nn.CTCLoss. Infinite losses mainly occur when the inputs are too short to be aligned to the targets. Only relevant when training an instance of Wav2Vec2ForCTC.
- use_weighted_layer_sum (bool, optional, defaults to False) — Whether to use a weighted average of layer outputs with learned weights. Only relevant when using an instance of Wav2Vec2ForSequenceClassification.
- classifier_proj_size (int, optional, defaults to 256) — Dimensionality of the projection before token mean-pooling for classification.
- tdnn_dim (tuple[int] or list[int], optional, defaults to (512, 512, 512, 512, 1500)) — A tuple of integers defining the number of output channels of each 1D convolutional layer in the TDNN module of the XVector model. The length of tdnn_dim defines the number of TDNN layers.
- tdnn_kernel (tuple[int] or list[int], optional, defaults to (5, 3, 3, 1, 1)) — A tuple of integers defining the kernel size of each 1D convolutional layer in the TDNN module of the XVector model. The length of tdnn_kernel has to match the length of tdnn_dim.
- tdnn_dilation (tuple[int] or list[int], optional, defaults to (1, 2, 3, 1, 1)) — A tuple of integers defining the dilation factor of each 1D convolutional layer in the TDNN module of the XVector model. The length of tdnn_dilation has to match the length of tdnn_dim.
- xvector_output_dim (int, optional, defaults to 512) — Dimensionality of the XVector embedding vectors.
- add_adapter (bool, optional, defaults to False) — Whether a convolutional network should be stacked on top of the Wav2Vec2 Encoder. Can be very useful for warm-starting Wav2Vec2 for SpeechEncoderDecoder models.
- adapter_kernel_size (int, optional, defaults to 3) — Kernel size of the convolutional layers in the adapter network. Only relevant if add_adapter is True.
- adapter_stride (int, optional, defaults to 2) — Stride of the convolutional layers in the adapter network. Only relevant if add_adapter is True.
- num_adapter_layers (int, optional, defaults to 3) — Number of convolutional layers that should be used in the adapter network. Only relevant if add_adapter is True.
- adapter_attn_dim (int, optional) — Dimension of the attention adapter weights to be used in each attention block. An example of a model using attention adapters is facebook/mms-1b-all.
- output_hidden_size (int, optional) — Dimensionality of the encoder output layer. If not defined, this defaults to hidden_size. Only relevant if add_adapter is True.
This is the configuration class to store the configuration of a Wav2Vec2Model. It is used to instantiate a Wav2Vec2 model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the Wav2Vec2 facebook/wav2vec2-base-960h architecture.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.
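The conv_kernel and conv_stride defaults above fully determine how much the convolutional feature encoder downsamples the raw waveform. As a quick arithmetic sketch (illustrative, using the default values from the signature above, not a library API), the number of output frames for a given number of input samples can be computed with the standard unpadded 1D convolution length formula:

```python
# Default Wav2Vec2 feature-encoder geometry (from Wav2Vec2Config above)
conv_kernel = (10, 3, 3, 3, 3, 2, 2)
conv_stride = (5, 2, 2, 2, 2, 2, 2)

def feature_encoder_out_len(num_samples: int) -> int:
    length = num_samples
    for k, s in zip(conv_kernel, conv_stride):
        length = (length - k) // s + 1  # 1D conv output length, no padding
    return length

# One second of 16 kHz audio yields 49 frames (roughly 20 ms per frame);
# the product of the strides gives an overall downsampling ratio of 320.
print(feature_encoder_out_len(16000))  # 49
```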
Example
>>> from transformers import Wav2Vec2Config, Wav2Vec2Model
>>> # Initializing a Wav2Vec2 facebook/wav2vec2-base-960h style configuration
>>> configuration = Wav2Vec2Config()
>>> # Initializing a model (with random weights) from the facebook/wav2vec2-base-960h style configuration
>>> model = Wav2Vec2Model(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config

Wav2Vec2CTCTokenizer
class transformers.Wav2Vec2CTCTokenizer
< source >( vocab_file bos_token = '<s>' eos_token = '</s>' unk_token = '<unk>' pad_token = '<pad>' word_delimiter_token = '|' replace_word_delimiter_char = ' ' do_lower_case = False target_lang = None **kwargs )
Parameters

- vocab_file (str) — File containing the vocabulary.
- bos_token (str, optional, defaults to "<s>") — The beginning of sentence token.
- eos_token (str, optional, defaults to "</s>") — The end of sentence token.
- unk_token (str, optional, defaults to "<unk>") — The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.
- pad_token (str, optional, defaults to "<pad>") — The token used for padding, for example when batching sequences of different lengths.
- word_delimiter_token (str, optional, defaults to "|") — The token used for defining the end of a word.
- do_lower_case (bool, optional, defaults to False) — Whether or not to accept lowercase input and lowercase the output when decoding.
- target_lang (str, optional) — A target language the tokenizer should set by default. target_lang has to be defined for multi-lingual, nested vocabulary such as facebook/mms-1b-all.
- **kwargs — Additional keyword arguments passed along to PreTrainedTokenizer
Constructs a Wav2Vec2CTC tokenizer.
This tokenizer inherits from PreTrainedTokenizer which contains some of the main methods. Users should refer to the superclass for more information regarding such methods.
__call__
< source >( text: TextInput | PreTokenizedInput | list[TextInput] | list[PreTokenizedInput] | None = None text_pair: TextInput | PreTokenizedInput | list[TextInput] | list[PreTokenizedInput] | None = None text_target: TextInput | PreTokenizedInput | list[TextInput] | list[PreTokenizedInput] | None = None text_pair_target: TextInput | PreTokenizedInput | list[TextInput] | list[PreTokenizedInput] | None = None add_special_tokens: bool = True padding: bool | str | PaddingStrategy = False truncation: bool | str | TruncationStrategy | None = None max_length: int | None = None stride: int = 0 is_split_into_words: bool = False pad_to_multiple_of: int | None = None padding_side: str | None = None return_tensors: str | TensorType | None = None return_token_type_ids: bool | None = None return_attention_mask: bool | None = None return_overflowing_tokens: bool = False return_special_tokens_mask: bool = False return_offsets_mapping: bool = False return_length: bool = False verbose: bool = True tokenizer_kwargs: dict[str, Any] | None = None **kwargs ) → BatchEncoding
Parameters

- text (str, list[str], list[list[str]], optional) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
- text_pair (str, list[str], list[list[str]], optional) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
- text_target (str, list[str], list[list[str]], optional) — The sequence or batch of sequences to be encoded as target texts. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
- text_pair_target (str, list[str], list[list[str]], optional) — The sequence or batch of sequences to be encoded as target texts. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
- tokenizer_kwargs (dict[str, Any], optional) — Additional kwargs to pass to the tokenizer. These will be merged with the explicit parameters and other kwargs, with explicit parameters taking precedence.
- add_special_tokens (bool, optional, defaults to True) — Whether or not to add special tokens when encoding the sequences. This will use the underlying PretrainedTokenizerBase.build_inputs_with_special_tokens function, which defines which tokens are automatically added to the input ids. This is useful if you want to add bos or eos tokens automatically.
- padding (bool, str or PaddingStrategy, optional, defaults to False) — Activates and controls padding. Accepts the following values:
  - True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
  - 'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
  - False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).
- truncation (bool, str or TruncationStrategy, optional, defaults to False) — Activates and controls truncation. Accepts the following values:
  - True or 'longest_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will truncate token by token, removing a token from the longest sequence in the pair if a pair of sequences (or a batch of pairs) is provided.
  - 'only_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
  - 'only_second': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
  - False or 'do_not_truncate' (default): No truncation (i.e., can output batch with sequence lengths greater than the model maximum admissible input size).
- max_length (int, optional) — Controls the maximum length to use by one of the truncation/padding parameters. If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet) truncation/padding to a maximum length will be deactivated.
- stride (int, optional, defaults to 0) — If set to a number along with max_length, the overflowing tokens returned when return_overflowing_tokens=True will contain some tokens from the end of the truncated sequence returned to provide some overlap between truncated and overflowing sequences. The value of this argument defines the number of overlapping tokens.
- is_split_into_words (bool, optional, defaults to False) — Whether or not the input is already pre-tokenized (e.g., split into words). If set to True, the tokenizer assumes the input is already split into words (for instance, by splitting on whitespace) which it will tokenize. This is useful for NER or token classification.
- pad_to_multiple_of (int, optional) — If set, will pad the sequence to a multiple of the provided value. Requires padding to be activated. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta).
- padding_side (str, optional) — The side on which the model should have padding applied. Should be selected between ['right', 'left']. The default value is picked from the class attribute of the same name.
- return_tensors (str or TensorType, optional) — If set, will return tensors instead of lists of python integers. Acceptable values are:
  - 'pt': Return PyTorch torch.Tensor objects.
  - 'np': Return Numpy np.ndarray objects.
- return_token_type_ids (bool, optional) — Whether to return token type IDs. If left to the default, will return the token type IDs according to the specific tokenizer's default, defined by the return_outputs attribute.
- return_attention_mask (bool, optional) — Whether to return the attention mask. If left to the default, will return the attention mask according to the specific tokenizer's default, defined by the return_outputs attribute.
- return_overflowing_tokens (bool, optional, defaults to False) — Whether or not to return overflowing token sequences. If a pair of sequences of input ids (or a batch of pairs) is provided with truncation_strategy = longest_first or True, an error is raised instead of returning overflowing tokens.
- return_special_tokens_mask (bool, optional, defaults to False) — Whether or not to return special tokens mask information.
- return_offsets_mapping (bool, optional, defaults to False) — Whether or not to return (char_start, char_end) for each token. This is only available on fast tokenizers inheriting from PreTrainedTokenizerFast; if using a Python tokenizer, this method will raise NotImplementedError.
- return_length (bool, optional, defaults to False) — Whether or not to return the lengths of the encoded inputs.
- verbose (bool, optional, defaults to True) — Whether or not to print more information and warnings.
- **kwargs — passed to the self.tokenize() method

Returns

A BatchEncoding with the following fields:

- input_ids — List of token ids to be fed to a model.
- token_type_ids — List of token type ids to be fed to a model (when return_token_type_ids=True or if "token_type_ids" is in self.model_input_names).
- attention_mask — List of indices specifying which tokens should be attended to by the model (when return_attention_mask=True or if "attention_mask" is in self.model_input_names).
- overflowing_tokens — List of overflowing tokens sequences (when a max_length is specified and return_overflowing_tokens=True).
- num_truncated_tokens — Number of tokens truncated (when a max_length is specified and return_overflowing_tokens=True).
- special_tokens_mask — List of 0s and 1s, with 1 specifying added special tokens and 0 specifying regular sequence tokens (when add_special_tokens=True and return_special_tokens_mask=True).
- length — The length of the inputs (when return_length=True)
Main method to tokenize and prepare for the model one or several sequence(s) or one or several pair(s) of sequences.
decode
< source >( token_ids: typing.Union[int, list[int], numpy.ndarray, ForwardRef('torch.Tensor')] skip_special_tokens: bool = False clean_up_tokenization_spaces: bool | None = None output_char_offsets: bool = False output_word_offsets: bool = False **kwargs ) → str or Wav2Vec2CTCTokenizerOutput
Parameters

- token_ids (Union[int, list[int], np.ndarray, torch.Tensor]) — List of tokenized input ids. Can be obtained using the __call__ method.
- skip_special_tokens (bool, optional, defaults to False) — Whether or not to remove special tokens in the decoding.
- clean_up_tokenization_spaces (bool, optional) — Whether or not to clean up the tokenization spaces.
- output_char_offsets (bool, optional, defaults to False) — Whether or not to output character offsets. Character offsets can be used in combination with the sampling rate and model downsampling rate to compute the time-stamps of transcribed characters. Please take a look at the example below to better understand how to make use of output_char_offsets.
- output_word_offsets (bool, optional, defaults to False) — Whether or not to output word offsets. Word offsets can be used in combination with the sampling rate and model downsampling rate to compute the time-stamps of transcribed words. Please take a look at the example below to better understand how to make use of output_word_offsets.
- kwargs (additional keyword arguments, optional) — Will be passed to the underlying model specific decode method.

Returns

str or Wav2Vec2CTCTokenizerOutput

The list of decoded sentences. Will be a Wav2Vec2CTCTokenizerOutput when output_char_offsets == True or output_word_offsets == True.

Converts a sequence of ids into a string, using the tokenizer and vocabulary, with options to remove special tokens and clean up tokenization spaces.

Similar to doing self.convert_tokens_to_string(self.convert_ids_to_tokens(token_ids)).
Example
>>> # Let's see how to retrieve time steps for a model
>>> from transformers import AutoTokenizer, AutoFeatureExtractor, AutoModelForCTC
>>> from datasets import load_dataset
>>> import datasets
>>> import torch
>>> # import model, feature extractor, tokenizer
>>> model = AutoModelForCTC.from_pretrained("facebook/wav2vec2-base-960h")
>>> tokenizer = AutoTokenizer.from_pretrained("facebook/wav2vec2-base-960h")
>>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
>>> # load first sample of English common_voice
>>> dataset = load_dataset("mozilla-foundation/common_voice_11_0", "en", split="train", streaming=True)
>>> dataset = dataset.cast_column("audio", datasets.Audio(sampling_rate=16_000))
>>> dataset_iter = iter(dataset)
>>> sample = next(dataset_iter)
>>> # forward sample through model to get greedily predicted transcription ids
>>> input_values = feature_extractor(sample["audio"]["array"], return_tensors="pt").input_values
>>> logits = model(input_values).logits[0]
>>> pred_ids = torch.argmax(logits, axis=-1)
>>> # retrieve word stamps (analogous commands for `output_char_offsets`)
>>> outputs = tokenizer.decode(pred_ids, output_word_offsets=True)
>>> # compute `time_offset` in seconds as product of downsampling ratio and sampling_rate
>>> time_offset = model.config.inputs_to_logits_ratio / feature_extractor.sampling_rate
>>> word_offsets = [
... {
... "word": d["word"],
... "start_time": round(d["start_offset"] * time_offset, 2),
... "end_time": round(d["end_offset"] * time_offset, 2),
... }
... for d in outputs.word_offsets
... ]
>>> # compare word offsets with audio `en_train_0/common_voice_en_19121553.mp3` online on the dataset viewer:
>>> # https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0/viewer/en
>>> word_offsets[:3]
[{'word': 'THE', 'start_time': 0.7, 'end_time': 0.78}, {'word': 'TRICK', 'start_time': 0.88, 'end_time': 1.08}, {'word': 'APPEARS', 'start_time': 1.2, 'end_time': 1.64}]

batch_decode
< source >( sequences: typing.Union[list[int], list[list[int]], numpy.ndarray, ForwardRef('torch.Tensor')] skip_special_tokens: bool = False clean_up_tokenization_spaces: bool | None = None output_char_offsets: bool = False output_word_offsets: bool = False **kwargs ) → list[str] or Wav2Vec2CTCTokenizerOutput
Parameters

- sequences (Union[list[int], list[list[int]], np.ndarray, torch.Tensor]) — List of tokenized input ids. Can be obtained using the __call__ method.
- skip_special_tokens (bool, optional, defaults to False) — Whether or not to remove special tokens in the decoding.
- clean_up_tokenization_spaces (bool, optional) — Whether or not to clean up the tokenization spaces.
- output_char_offsets (bool, optional, defaults to False) — Whether or not to output character offsets. Character offsets can be used in combination with the sampling rate and model downsampling rate to compute the time-stamps of transcribed characters. Please take a look at the example of decode() to better understand how to make use of output_char_offsets. batch_decode() works the same way with batched output.
- output_word_offsets (bool, optional, defaults to False) — Whether or not to output word offsets. Word offsets can be used in combination with the sampling rate and model downsampling rate to compute the time-stamps of transcribed words. Please take a look at the example of decode() to better understand how to make use of output_word_offsets. batch_decode() works the same way with batched output.
- kwargs (additional keyword arguments, optional) — Will be passed to the underlying model specific decode method.

Returns

list[str] or Wav2Vec2CTCTokenizerOutput

The list of decoded sentences. Will be a Wav2Vec2CTCTokenizerOutput when output_char_offsets == True or output_word_offsets == True.

Convert a list of lists of token ids into a list of strings by calling decode.
set_target_lang

Sets the target language of a nested multi-lingual dictionary.
Wav2Vec2FeatureExtractor
class transformers.Wav2Vec2FeatureExtractor
< source >( feature_size = 1 sampling_rate = 16000 padding_value = 0.0 return_attention_mask = False do_normalize = True **kwargs )
Parameters

- feature_size (int, optional, defaults to 1) — The feature dimension of the extracted features.
- sampling_rate (int, optional, defaults to 16000) — The sampling rate at which the audio files should be digitalized, expressed in hertz (Hz).
- padding_value (float, optional, defaults to 0.0) — The value that is used to fill the padding values.
- do_normalize (bool, optional, defaults to True) — Whether or not to zero-mean unit-variance normalize the input. Normalizing can help to significantly improve the performance for some models, e.g., wav2vec2-lv60.
- return_attention_mask (bool, optional, defaults to False) — Whether or not __call__() should return attention_mask. If left to the default, will return the attention mask according to the specific feature_extractor's default. Wav2Vec2 models that have set config.feat_extract_norm == "group", such as wav2vec2-base, have not been trained using attention_mask. For such models, input_values should simply be padded with 0 and no attention_mask should be passed. For Wav2Vec2 models that have set config.feat_extract_norm == "layer", such as wav2vec2-lv60, attention_mask should be passed for batched inference.
Constructs a Wav2Vec2 feature extractor.
This feature extractor inherits from SequenceFeatureExtractor which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.
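As a rough, illustrative sketch (not the library implementation) of the two core steps this extractor performs on raw mono waveforms, assuming do_normalize=True and zero padding: per-utterance zero-mean/unit-variance normalization, then padding a batch to a common length with a matching attention mask.

```python
import numpy as np

def normalize(wave: np.ndarray, eps: float = 1e-7) -> np.ndarray:
    # zero-mean, unit-variance normalization of one utterance
    return (wave - wave.mean()) / np.sqrt(wave.var() + eps)

def pad_batch(waves, padding_value=0.0):
    # pad all waveforms to the longest one; mask marks real (1) vs padded (0) samples
    max_len = max(len(w) for w in waves)
    input_values = np.full((len(waves), max_len), padding_value, dtype=np.float32)
    attention_mask = np.zeros((len(waves), max_len), dtype=np.int32)
    for i, w in enumerate(waves):
        input_values[i, : len(w)] = normalize(w)
        attention_mask[i, : len(w)] = 1
    return input_values, attention_mask

batch = [np.random.randn(16000).astype(np.float32),  # 1 s at 16 kHz
         np.random.randn(8000).astype(np.float32)]   # 0.5 s
values, mask = pad_batch(batch)
print(values.shape, mask.sum(axis=-1))
```

As the parameter description above notes, whether the resulting attention_mask should actually be passed to the model depends on how the checkpoint was trained (config.feat_extract_norm "group" vs "layer").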
__call__
< source >( raw_speech: numpy.ndarray | list[float] | list[numpy.ndarray] | list[list[float]] padding: bool | str | transformers.utils.generic.PaddingStrategy = False max_length: int | None = None truncation: bool = False pad_to_multiple_of: int | None = None return_attention_mask: bool | None = None return_tensors: str | transformers.utils.generic.TensorType | None = None sampling_rate: int | None = None **kwargs )
Parameters

- raw_speech (np.ndarray, list[float], list[np.ndarray], list[list[float]]) — The sequence or batch of sequences to be padded. Each sequence can be a numpy array, a list of float values, a list of numpy arrays or a list of lists of float values. Must be mono channel audio, not stereo, i.e. a single float per timestep.
- padding (bool, str or PaddingStrategy, optional, defaults to False) — Select a strategy to pad the returned sequences (according to the model's padding side and padding index) among:
  - True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
  - 'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
  - False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).
- max_length (int, optional) — Maximum length of the returned list and optionally padding length (see above).
- truncation (bool) — Activates truncation to cut input sequences longer than max_length to max_length.
- pad_to_multiple_of (int, optional) — If set, will pad the sequence to a multiple of the provided value. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta), or on TPUs which benefit from having sequence lengths be a multiple of 128.
- return_attention_mask (bool, optional) — Whether to return the attention mask. If left to the default, will return the attention mask according to the specific feature_extractor's default. Wav2Vec2 models that have set config.feat_extract_norm == "group", such as wav2vec2-base, have not been trained using attention_mask. For such models, input_values should simply be padded with 0 and no attention_mask should be passed. For Wav2Vec2 models that have set config.feat_extract_norm == "layer", such as wav2vec2-lv60, attention_mask should be passed for batched inference.
- return_tensors (str or TensorType, optional) — If set, will return tensors instead of lists of python integers. Acceptable values are:
  - 'pt': Return PyTorch torch.Tensor objects.
  - 'np': Return Numpy np.ndarray objects.
- sampling_rate (int, optional) — The sampling rate at which the raw_speech input was sampled. It is strongly recommended to pass sampling_rate at the forward call to prevent silent errors.
- padding_value (float, optional, defaults to 0.0) — The value that is used to fill the padding values / padding vectors.

Main method to featurize and prepare for the model one or several sequence(s).
Wav2Vec2Processor
class transformers.Wav2Vec2Processor
< source >( feature_extractor tokenizer )
Constructs a Wav2Vec2Processor which wraps a feature extractor and a tokenizer into a single processor.
Wav2Vec2Processor offers all the functionalities of Wav2Vec2FeatureExtractor and Wav2Vec2CTCTokenizer. See the ~Wav2Vec2FeatureExtractor and ~Wav2Vec2CTCTokenizer for more information.
__call__
< source >( audio: typing.Union[numpy.ndarray, ForwardRef('torch.Tensor'), list[numpy.ndarray], list['torch.Tensor'], NoneType] = None text: str | list[str] | None = None **kwargs: typing_extensions.Unpack[transformers.models.wav2vec2.processing_wav2vec2.Wav2Vec2ProcessorKwargs] )
Parameters

- audio (Union[numpy.ndarray, torch.Tensor, list, list], optional) — The audio or batch of audio to be prepared. Each audio can be a NumPy array or a PyTorch tensor. In case of a NumPy array/PyTorch tensor, each audio should be of shape (C, T), where C is the number of channels and T the sample length of the audio.
- text (Union[str, list], optional) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
- return_tensors (str or TensorType, optional) — If set, will return tensors of a particular framework. Acceptable values are:
  - 'pt': Return PyTorch torch.Tensor objects.
  - 'np': Return NumPy np.ndarray objects.
pad
< source >( *args **kwargs )
Parameters

- input_features — When the first argument is a dictionary containing a batch of tensors, or when the input_features argument is present, it is passed to Wav2Vec2FeatureExtractor.pad().
- labels — When the label argument is present, it is passed to PreTrainedTokenizer.pad().

This method works on batches of extracted features and/or tokenized text. It forwards all its arguments to Wav2Vec2FeatureExtractor.pad() and/or PreTrainedTokenizer.pad(), depending on the input modality, and returns their output. If both modalities are passed, both Wav2Vec2FeatureExtractor.pad() and PreTrainedTokenizer.pad() are called.
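A minimal sketch of the dispatch logic described above (illustrative only; the lambda padders below are hypothetical stand-ins for the real feature extractor's and tokenizer's pad methods):

```python
def processor_pad(feature_extractor_pad, tokenizer_pad, input_features=None, labels=None):
    # route audio features to the feature extractor's pad and labels to the
    # tokenizer's pad; merge both outputs when both modalities are present
    if input_features is None and labels is None:
        raise ValueError("You need to specify either input_features or labels.")
    out = {}
    if input_features is not None:
        out.update(feature_extractor_pad(input_features))
    if labels is not None:
        out["labels"] = tokenizer_pad(labels)
    return out

# hypothetical stand-in padders for demonstration
fe_pad = lambda feats: {"input_values": [f + [0.0] * (3 - len(f)) for f in feats]}
tok_pad = lambda labs: [l + [-100] * (2 - len(l)) for l in labs]

batch = processor_pad(fe_pad, tok_pad, input_features=[[0.1], [0.2, 0.3]], labels=[[5], [6, 7]])
print(batch)
```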
from_pretrained
< source >( pretrained_model_name_or_path: str | os.PathLike cache_dir: str | os.PathLike | None = None force_download: bool = False local_files_only: bool = False token: str | bool | None = None revision: str = 'main' **kwargs )
Parameters

- pretrained_model_name_or_path (str or os.PathLike) — This can be either:
  - a string, the model id of a pretrained feature_extractor hosted inside a model repo on huggingface.co.
  - a path to a directory containing feature extractor files saved using the save_pretrained() method, e.g., ./my_model_directory/.
  - a path or url to a saved feature extractor JSON file, e.g., ./my_model_directory/preprocessor_config.json.
- **kwargs — Additional keyword arguments passed along to from_pretrained() and ~tokenization_utils_base.PreTrainedTokenizer.from_pretrained.

Instantiate the processor associated with a pretrained model.

This class method is simply calling the feature extractor from_pretrained(), image processor ImageProcessingMixin.from_pretrained and the tokenizer ~tokenization_utils_base.PreTrainedTokenizer.from_pretrained methods. Please refer to the docstrings of the methods above for more information.
save_pretrained
< source >( save_directory push_to_hub: bool = False **kwargs )
Parameters

- save_directory (str or os.PathLike) — Directory where the feature extractor JSON file and the tokenizer files will be saved (directory will be created if it does not exist).
- push_to_hub (bool, optional, defaults to False) — Whether or not to push your model to the Hugging Face model hub after saving it. You can specify the repository you want to push to with repo_id (will default to the name of save_directory in your namespace).
- kwargs (dict[str, Any], optional) — Additional keyword arguments passed along to the push_to_hub() method.
Saves the attributes of this processor (feature extractor, tokenizer…) in the specified directory so that it can be reloaded using the from_pretrained() method.
This class method is simply calling the feature extractor's save_pretrained() and the tokenizer's save_pretrained(). Please refer to the docstrings of the methods above for more information.
batch_decode

This method forwards all its arguments to PreTrainedTokenizer’s batch_decode(). Please refer to the docstring of this method for more information.

decode

This method forwards all its arguments to PreTrainedTokenizer’s decode(). Please refer to the docstring of this method for more information.
Wav2Vec2ProcessorWithLM
class transformers.Wav2Vec2ProcessorWithLM
< source >( feature_extractor: FeatureExtractionMixin tokenizer: PreTrainedTokenizerBase decoder: BeamSearchDecoderCTC )
Constructs a Wav2Vec2ProcessorWithLM, which wraps a feature extractor, a tokenizer and a decoder with language model support into a single processor.

Wav2Vec2ProcessorWithLM offers all the functionalities of Wav2Vec2FeatureExtractor and Wav2Vec2CTCTokenizer. See ~Wav2Vec2FeatureExtractor and ~Wav2Vec2CTCTokenizer for more information.
__call__
< source >( *args **kwargs )
Parameters
- return_tensors (
str or TensorType, optional) — If set, will return tensors of a particular framework. Acceptable values are: 'pt': return PyTorch torch.Tensor objects; 'np': return NumPy np.ndarray objects.
When used in normal mode, this method forwards all its arguments to the feature extractor's ~FeatureExtractionMixin.pad and returns its output. If used in the ~Wav2Vec2ProcessorWithLM.as_target_processor context, this method forwards all its arguments to Wav2Vec2CTCTokenizer's pad(). Please refer to the docstrings of the two methods above for more information.
from_pretrained
< source >( pretrained_model_name_or_path **kwargs )
Parameters
- pretrained_model_name_or_path (
str or os.PathLike) — This can be either:- a string, the *model id* of a pretrained feature_extractor hosted inside a model repo on huggingface.co.
- A path to a *directory* containing a feature extractor file saved using the save_pretrained() method, e.g.,
./my_model_directory/. - A path or URL to a saved feature extractor JSON *file*, e.g.,
./my_model_directory/preprocessor_config.json.
- **kwargs — Additional keyword arguments passed along to both SequenceFeatureExtractor and PreTrainedTokenizer.
Instantiate a Wav2Vec2ProcessorWithLM from a pretrained Wav2Vec2 processor.
This class method is simply calling the feature extractor's from_pretrained(), Wav2Vec2CTCTokenizer's from_pretrained(), and
pyctcdecode.BeamSearchDecoderCTC.load_from_hf_hub. Please refer to the docstrings of the methods above for more information.
batch_decode
< source >( logits: ndarray pool: multiprocessing.pool.Pool | None = None num_processes: int | None = None beam_width: int | None = None beam_prune_logp: float | None = None token_min_logp: float | None = None hotwords: collections.abc.Iterable[str] | None = None hotword_weight: float | None = None alpha: float | None = None beta: float | None = None unk_score_offset: float | None = None lm_score_boundary: bool | None = None output_word_offsets: bool = False n_best: int = 1 )
Parameters
- logits (
np.ndarray) — The logits output vector of the model representing the log probabilities for each token. - pool (
multiprocessing.Pool, optional) — An optional user-managed pool. If not set, one will be automatically created and closed. The pool should be instantiated *after* Wav2Vec2ProcessorWithLM. Otherwise, the LM won't be available to the pool's sub-processes. Currently, only pools created with a 'fork' context can be used. If a 'spawn' pool is passed, it will be ignored and sequential decoding will be used instead.
- num_processes (
int, optional) — If pool is not set, the number of processes the function should be parallelized over. Defaults to the number of available CPUs. - beam_width (
int, optional) — Maximum number of beams at each step in decoding. Defaults to pyctcdecode's DEFAULT_BEAM_WIDTH. - beam_prune_logp (
int, optional) — Beams that are much worse than the best beam will be pruned. Defaults to pyctcdecode's DEFAULT_PRUNE_LOGP. - token_min_logp (
int, optional) — Tokens with log-probs below this value are skipped unless they are the argmax of the frame. Defaults to pyctcdecode's DEFAULT_MIN_TOKEN_LOGP. - hotwords (
list[str], optional) — List of words with extra importance; these can be OOV for the LM. - hotword_weight (
int, optional) — Weight factor for hotword importance. Defaults to pyctcdecode's DEFAULT_HOTWORD_WEIGHT. - alpha (
float, optional) — Weight for the language model during shallow fusion. - beta (
float, optional) — Weight for the length score adjustment during scoring. - unk_score_offset (
float, optional) — Amount of log score offset for unknown tokens. - lm_score_boundary (
bool, optional) — Whether to have kenlm respect boundaries when scoring. - output_word_offsets (
bool, optional, defaults to False) — Whether or not to output word offsets. Word offsets can be used in combination with the sampling rate and model downsampling rate to compute the time-stamps of transcribed words. - n_best (
int, optional, defaults to 1) — Number of best hypotheses to return. If n_best is greater than 1, the returned text will be a list of lists of strings, logit_score will be a list of lists of floats, and lm_score will be a list of lists of floats, where the length of the outer list corresponds to the batch size and the length of the inner list corresponds to the number of returned hypotheses. The value should be >= 1.Please take a look at the example of decode() to better understand how to make use of
output_word_offsets. batch_decode() works the same way with batched output.
Batch decode output logits to audio transcription with language model support.
This function makes use of Python's multiprocessing. Currently, multiprocessing is available only on Unix systems (see this issue).
If you are decoding multiple batches, consider creating a
Pool and passing it to batch_decode. Otherwise, batch_decode will be very slow since it will create a fresh Pool for each call. See the usage example below.
Example: See Decoding multiple audios.
decode
< source >( logits: ndarray beam_width: int | None = None beam_prune_logp: float | None = None token_min_logp: float | None = None hotwords: collections.abc.Iterable[str] | None = None hotword_weight: float | None = None alpha: float | None = None beta: float | None = None unk_score_offset: float | None = None lm_score_boundary: bool | None = None output_word_offsets: bool = False n_best: int = 1 )
Parameters
- logits (
np.ndarray) — The logits output vector of the model representing the log probabilities for each token. - beam_width (
int, optional) — Maximum number of beams at each step in decoding. Defaults to pyctcdecode's DEFAULT_BEAM_WIDTH. - beam_prune_logp (
int, optional) — A threshold to prune beams with log-probs less than best_beam_logp + beam_prune_logp. The value should be <= 0. Defaults to pyctcdecode's DEFAULT_PRUNE_LOGP. - token_min_logp (
int, optional) — Tokens with log-probs below token_min_logp are skipped unless they have the maximum log-prob for an utterance. Defaults to pyctcdecode's DEFAULT_MIN_TOKEN_LOGP. - hotwords (
list[str], optional) — List of words with extra importance which can be missing from the LM's vocabulary, e.g. ["huggingface"]. - hotword_weight (
int, optional) — Weight multiplier that boosts hotword scores. Defaults to pyctcdecode's DEFAULT_HOTWORD_WEIGHT. - alpha (
float, optional) — Weight for the language model during shallow fusion. - beta (
float, optional) — Weight for the length score adjustment during scoring. - unk_score_offset (
float, optional) — Amount of log score offset for unknown tokens - lm_score_boundary (
bool, optional) — Whether to have kenlm respect boundaries when scoring - output_word_offsets (
bool, optional, defaults toFalse) — Whether or not to output word offsets. Word offsets can be used in combination with the sampling rate and model downsampling rate to compute the time-stamps of transcribed words. - n_best (
int, optional, defaults to1) — Number of best hypotheses to return. Ifn_bestis greater than 1, the returnedtextwill be a list of strings,logit_scorewill be a list of floats, andlm_scorewill be a list of floats, where the length of these lists will correspond to the number of returned hypotheses. The value should be >= 1.Please take a look at the example below to better understand how to make use of
output_word_offsets.
Decode output logits to audio transcription with language model support.
Example
>>> # Let's see how to retrieve time steps for a model
>>> from transformers import AutoTokenizer, AutoProcessor, AutoModelForCTC
>>> from datasets import load_dataset
>>> import datasets
>>> import torch
>>> # import model, feature extractor, tokenizer
>>> model = AutoModelForCTC.from_pretrained("patrickvonplaten/wav2vec2-base-100h-with-lm")
>>> processor = AutoProcessor.from_pretrained("patrickvonplaten/wav2vec2-base-100h-with-lm")
>>> # load first sample of English common_voice
>>> dataset = load_dataset("mozilla-foundation/common_voice_11_0", "en", split="train", streaming=True)
>>> dataset = dataset.cast_column("audio", datasets.Audio(sampling_rate=16_000))
>>> dataset_iter = iter(dataset)
>>> sample = next(dataset_iter)
>>> # forward sample through model to get greedily predicted transcription ids
>>> input_values = processor(sample["audio"]["array"], return_tensors="pt").input_values
>>> with torch.no_grad():
... logits = model(input_values).logits[0].cpu().numpy()
>>> # retrieve word stamps (analogous commands for `output_char_offsets`)
>>> outputs = processor.decode(logits, output_word_offsets=True)
>>> # compute `time_offset` in seconds as product of downsampling ratio and sampling_rate
>>> time_offset = model.config.inputs_to_logits_ratio / processor.feature_extractor.sampling_rate
>>> word_offsets = [
... {
... "word": d["word"],
... "start_time": round(d["start_offset"] * time_offset, 2),
... "end_time": round(d["end_offset"] * time_offset, 2),
... }
... for d in outputs.word_offsets
... ]
>>> # compare word offsets with audio `en_train_0/common_voice_en_19121553.mp3` online on the dataset viewer:
>>> # https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0/viewer/en
>>> word_offsets[:4]
[{'word': 'THE', 'start_time': 0.68, 'end_time': 0.78}, {'word': 'TRACK', 'start_time': 0.88, 'end_time': 1.1}, {'word': 'APPEARS', 'start_time': 1.18, 'end_time': 1.66}, {'word': 'ON', 'start_time': 1.86, 'end_time': 1.92}]
Decoding multiple audios
If you are planning to decode multiple batches of audios, you should consider using batch_decode() and passing an instantiated multiprocessing.Pool. Otherwise, batch_decode() performance will be slower than calling decode() for each audio individually, as it internally instantiates a new Pool for every call. See the example below
>>> # Let's see how to use a user-managed pool for batch decoding multiple audios
>>> from multiprocessing import get_context
>>> from transformers import AutoTokenizer, AutoProcessor, AutoModelForCTC
>>> from accelerate import Accelerator
>>> from datasets import load_dataset
>>> import datasets
>>> import torch
>>> device = Accelerator().device
>>> # import model, feature extractor, tokenizer
>>> model = AutoModelForCTC.from_pretrained("patrickvonplaten/wav2vec2-base-100h-with-lm").to(device)
>>> processor = AutoProcessor.from_pretrained("patrickvonplaten/wav2vec2-base-100h-with-lm")
>>> # load example dataset
>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> dataset = dataset.cast_column("audio", datasets.Audio(sampling_rate=16_000))
>>> def map_to_array(example):
... example["speech"] = example["audio"]["array"]
... return example
>>> # prepare speech data for batch inference
>>> dataset = dataset.map(map_to_array, remove_columns=["audio"])
>>> def map_to_pred(batch, pool):
... device = Accelerator().device
... inputs = processor(batch["speech"], sampling_rate=16_000, padding=True, return_tensors="pt")
... inputs = {k: v.to(device) for k, v in inputs.items()}
... with torch.no_grad():
... logits = model(**inputs).logits
... transcription = processor.batch_decode(logits.cpu().numpy(), pool).text
... batch["transcription"] = transcription
... return batch
>>> # note: pool should be instantiated *after* `Wav2Vec2ProcessorWithLM`.
>>> # otherwise, the LM won't be available to the pool's sub-processes
>>> # select number of processes and batch_size based on number of CPU cores available and on dataset size
>>> with get_context("fork").Pool(processes=2) as pool:
... result = dataset.map(
... map_to_pred, batched=True, batch_size=2, fn_kwargs={"pool": pool}, remove_columns=["speech"]
... )
>>> result["transcription"][:2]
['MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL', "NOR IS MISTER COULTER'S MANNER LESS INTERESTING THAN HIS MATTER"]
Wav2Vec2 specific outputs
class transformers.models.wav2vec2_with_lm.processing_wav2vec2_with_lm.Wav2Vec2DecoderWithLMOutput
< source >( text: list[list[str]] | list[str] | str logit_score: list[list[float]] | list[float] | float = None lm_score: list[list[float]] | list[float] | float = None word_offsets: list[list[list[dict[str, int | str]]]] | list[list[dict[str, int | str]]] | list[dict[str, int | str]] = None )
Parameters
- text (list of
str or str) — Decoded logits in text form. Usually the speech transcription. - logit_score (list of
floatorfloat) — Total logit score of the beams associated with produced text. - lm_score (list of
float) — Fused lm_score of the beams associated with produced text. - word_offsets (list of
list[dict[str, Union[int, str]]]orlist[dict[str, Union[int, str]]]) — Offsets of the decoded words. In combination with sampling rate and model downsampling rate word offsets can be used to compute time stamps for each word.
Output type of Wav2Vec2DecoderWithLM, with transcription.
class transformers.modeling_outputs.Wav2Vec2BaseModelOutput
< source >( last_hidden_state: torch.FloatTensor | None = None extract_features: torch.FloatTensor | None = None hidden_states: tuple[torch.FloatTensor, ...] | None = None attentions: tuple[torch.FloatTensor, ...] | None = None )
Parameters
- last_hidden_state (
torch.FloatTensorof shape(batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model. - extract_features (
torch.FloatTensorof shape(batch_size, sequence_length, conv_dim[-1])) — Sequence of extracted feature vectors of the last convolutional layer of the model. - hidden_states (
tuple(torch.FloatTensor), optional, returned whenoutput_hidden_states=Trueis passed or whenconfig.output_hidden_states=True) — Tuple oftorch.FloatTensor(one for the output of the embeddings + one for the output of each layer) of shape(batch_size, sequence_length, hidden_size).Hidden-states of the model at the output of each layer plus the initial embedding outputs.
- attentions (
tuple(torch.FloatTensor), optional, returned whenoutput_attentions=Trueis passed or whenconfig.output_attentions=True) — Tuple oftorch.FloatTensor(one for each layer) of shape(batch_size, num_heads, sequence_length, sequence_length).Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for models that have been trained with the Wav2Vec2 loss objective.
class transformers.models.wav2vec2.modeling_wav2vec2.Wav2Vec2ForPreTrainingOutput
< source >( loss: torch.FloatTensor | None = None projected_states: torch.FloatTensor | None = None projected_quantized_states: torch.FloatTensor | None = None codevector_perplexity: torch.FloatTensor | None = None hidden_states: tuple[torch.FloatTensor] | None = None attentions: tuple[torch.FloatTensor] | None = None contrastive_loss: torch.FloatTensor | None = None diversity_loss: torch.FloatTensor | None = None )
Parameters
- loss (
*optional*, returned whensample_negative_indicesare passed,torch.FloatTensorof shape(1,)) — Total loss as the sum of the contrastive loss (L_m) and the diversity loss (L_d) as stated in the official paper. - projected_states (
torch.FloatTensorof shape(batch_size, sequence_length, config.proj_codevector_dim)) — Hidden-states of the model projected to config.proj_codevector_dim that can be used to predict the masked projected quantized states. - projected_quantized_states (
torch.FloatTensorof shape(batch_size, sequence_length, config.proj_codevector_dim)) — Quantized extracted feature vectors projected to config.proj_codevector_dim representing the positive target vectors for contrastive loss. - codevector_perplexity (
torch.FloatTensorof shape(1,)) — The perplexity of the codevector distribution, used to measure the diversity of the codebook. - hidden_states (
tuple[torch.FloatTensor], optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (
tuple[torch.FloatTensor], optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
- contrastive_loss (
*optional*, returned whensample_negative_indicesare passed,torch.FloatTensorof shape(1,)) — The contrastive loss (L_m) as stated in the official paper. - diversity_loss (
*optional*, returned whensample_negative_indicesare passed,torch.FloatTensorof shape(1,)) — The diversity loss (L_d) as stated in the official paper.
Output type of Wav2Vec2ForPreTraining, with potential hidden states and attentions.
Wav2Vec2Model
class transformers.Wav2Vec2Model
< source >( config: Wav2Vec2Config )
Parameters
- config (Wav2Vec2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The bare Wav2Vec2 Model outputting raw hidden-states without any specific head on top.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
< source >( input_values: torch.Tensor | None attention_mask: torch.Tensor | None = None mask_time_indices: torch.FloatTensor | None = None output_attentions: bool | None = None output_hidden_states: bool | None = None return_dict: bool | None = None **kwargs ) → transformers.modeling_outputs.Wav2Vec2BaseModelOutput or tuple(torch.FloatTensor)
Parameters
- input_values (
torch.Tensorof shape(batch_size, sequence_length), optional) — Float values of input raw speech waveform. Values can be obtained by loading a.flacor.wavaudio file into an array of typelist[float], anumpy.ndarrayor atorch.Tensor, e.g. via the torchcodec library (pip install torchcodec) or the soundfile library (pip install soundfile). To prepare the array intoinput_values, the AutoProcessor should be used for padding and conversion into a tensor of typetorch.FloatTensor. See Wav2Vec2Processor.call() for details. - attention_mask (
torch.Tensorof shape(batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in[0, 1]:- 1 for tokens that are not masked,
- 0 for tokens that are masked.
- mask_time_indices (
torch.BoolTensorof shape(batch_size, sequence_length), optional) — Indices to mask extracted features for contrastive loss. When in training mode, model learns to predict masked extracted features in config.proj_codevector_dim space. - output_attentions (
bool, optional) — Whether or not to return the attentions tensors of all attention layers. Seeattentionsunder returned tensors for more detail. - output_hidden_states (
bool, optional) — Whether or not to return the hidden states of all layers. Seehidden_statesunder returned tensors for more detail. - return_dict (
bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns
transformers.modeling_outputs.Wav2Vec2BaseModelOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.Wav2Vec2BaseModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Wav2Vec2Config) and inputs.
-
last_hidden_state (
torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model. -
extract_features (
torch.FloatTensor of shape (batch_size, sequence_length, conv_dim[-1])) — Sequence of extracted feature vectors of the last convolutional layer of the model. -
hidden_states (
tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
-
attentions (
tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The Wav2Vec2Model forward method, overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the
Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
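The forward pass above can be sketched with a tiny, randomly initialized configuration (the config values below are arbitrary small numbers chosen for illustration, not a real checkpoint): the outputs are meaningless, but the shapes show how the convolutional feature encoder downsamples the raw waveform.

```python
# Minimal forward-pass sketch with a tiny, randomly initialized config
# (illustrative shapes only; use from_pretrained(...) for real features).
import torch
from transformers import Wav2Vec2Config, Wav2Vec2Model

config = Wav2Vec2Config(
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    conv_dim=(32, 32),
    conv_kernel=(10, 3),
    conv_stride=(5, 2),
    num_conv_pos_embeddings=16,
    num_conv_pos_embedding_groups=2,
)
model = Wav2Vec2Model(config).eval()

# 0.5 s of fake 16 kHz audio: (batch_size, sequence_length)
input_values = torch.randn(1, 8000)
with torch.no_grad():
    outputs = model(input_values)

# each conv layer downsamples: (8000 - 10)//5 + 1 = 1599, then (1599 - 3)//2 + 1 = 799
print(outputs.last_hidden_state.shape)  # torch.Size([1, 799, 32])
print(outputs.extract_features.shape)   # torch.Size([1, 799, 32])
```

The same call pattern applies to a pretrained checkpoint; only the config construction changes.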
Wav2Vec2ForCTC
class transformers.Wav2Vec2ForCTC
< source >( config target_lang: str | None = None )
Parameters
- config (Wav2Vec2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
- target_lang (
str, optional) — Language id of adapter weights. Adapter weights are stored in the format adapter.<lang>.safetensors or adapter.<lang>.bin. Only relevant when using an instance of Wav2Vec2ForCTC with adapters. Uses 'eng' by default.
Wav2Vec2 Model with a language modeling head on top for Connectionist Temporal Classification (CTC).
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
< source >( input_values: torch.Tensor | None attention_mask: torch.Tensor | None = None output_attentions: bool | None = None output_hidden_states: bool | None = None return_dict: bool | None = None labels: torch.Tensor | None = None **kwargs ) → transformers.modeling_outputs.CausalLMOutput or tuple(torch.FloatTensor)
Parameters
- input_values (
torch.Tensorof shape(batch_size, sequence_length), optional) — Float values of input raw speech waveform. Values can be obtained by loading a.flacor.wavaudio file into an array of typelist[float], anumpy.ndarrayor atorch.Tensor, e.g. via the torchcodec library (pip install torchcodec) or the soundfile library (pip install soundfile). To prepare the array intoinput_values, the AutoProcessor should be used for padding and conversion into a tensor of typetorch.FloatTensor. See Wav2Vec2Processor.call() for details. - attention_mask (
torch.Tensorof shape(batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in[0, 1]:- 1 for tokens that are not masked,
- 0 for tokens that are masked.
- output_attentions (
bool, optional) — Whether or not to return the attentions tensors of all attention layers. Seeattentionsunder returned tensors for more detail. - output_hidden_states (
bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail. - return_dict (
bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple. - labels (
torch.LongTensor of shape (batch_size, target_length), optional) — Labels for connectionist temporal classification. Note that target_length has to be smaller or equal to the sequence length of the output logits. Indices are selected in [-100, 0, ..., config.vocab_size - 1]. All labels set to -100 are ignored (masked); the loss is only computed for labels in [0, ..., config.vocab_size - 1].
Returns
transformers.modeling_outputs.CausalLMOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.CausalLMOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Wav2Vec2Config) and inputs.
-
loss (
torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction). -
logits (
torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). -
hidden_states (
tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
-
attentions (
tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The Wav2Vec2ForCTC forward method, overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the
Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example
>>> from transformers import AutoProcessor, Wav2Vec2ForCTC
>>> from datasets import load_dataset
>>> import torch
>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
>>> dataset = dataset.sort("id")
>>> sampling_rate = dataset.features["audio"].sampling_rate
>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
>>> model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
>>> # audio file is decoded on the fly
>>> inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
>>> with torch.no_grad():
... logits = model(**inputs).logits
>>> predicted_ids = torch.argmax(logits, dim=-1)
>>> # transcribe speech
>>> transcription = processor.batch_decode(predicted_ids)
>>> transcription[0]
...
>>> inputs["labels"] = processor(text=dataset[0]["text"], return_tensors="pt").input_ids
>>> # compute loss
>>> loss = model(**inputs).loss
>>> round(loss.item(), 2)
...
load_adapter
< source >( target_lang: str force_load = True **kwargs )
Parameters
- target_lang (
str) — Has to be a language id of an existing adapter weight. Adapter weights are stored in the format adapter.<lang>.safetensors or adapter.<lang>.bin. - force_load (
bool, defaults to True) — Whether the weights shall be loaded even if target_lang matches self.target_lang. - cache_dir (
Union[str, os.PathLike], optional) — Path to a directory in which a downloaded pretrained model configuration should be cached if the standard cache should not be used. - force_download (
bool, optional, defaults to False) — Whether or not to force the (re-)download of the model weights and configuration files, overriding the cached versions if they exist. - proxies (
dict[str, str], optional) — A dictionary of proxy servers to use by protocol or endpoint, e.g. {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}. The proxies are used on each request. - local_files_only (
bool, optional, defaults to False) — Whether or not to only look at local files (i.e., do not try to download the model). - token (
str or bool, optional) — The token to use as HTTP bearer authorization for remote files. If True, or not specified, will use the token generated when running hf auth login (stored in ~/.huggingface). - revision (
str, optional, defaults to "main") — The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a git-based system for storing models and other artifacts on huggingface.co, so revision can be any identifier allowed by git.To test a pull request you made on the Hub, you can pass
revision="refs/pr/<pr_number>". - mirror (
str, optional) — Mirror source to accelerate downloads in China. If you are from China and have an accessibility problem, you can set this option to resolve it. Note that we do not guarantee the timeliness or safety. Refer to the mirror site for more information.
Load a language adapter model from a pretrained adapter model.
Activate the special "offline mode" to use this method in a firewalled environment.
Example
>>> from transformers import Wav2Vec2ForCTC, AutoProcessor
>>> ckpt = "facebook/mms-1b-all"
>>> processor = AutoProcessor.from_pretrained(ckpt)
>>> model = Wav2Vec2ForCTC.from_pretrained(ckpt, target_lang="eng")
>>> # set specific language
>>> processor.tokenizer.set_target_lang("spa")
>>> model.load_adapter("spa")
Wav2Vec2ForSequenceClassification
class transformers.Wav2Vec2ForSequenceClassification
< source >( config )
Parameters
- config (Wav2Vec2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
Wav2Vec2 Model with a sequence classification head on top (a linear layer over the pooled output) for tasks like SUPERB Keyword Spotting.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
< source >( input_values: torch.Tensor | None attention_mask: torch.Tensor | None = None output_attentions: bool | None = None output_hidden_states: bool | None = None return_dict: bool | None = None labels: torch.Tensor | None = None **kwargs ) → transformers.modeling_outputs.SequenceClassifierOutput or tuple(torch.FloatTensor)
Parameters
- input_values (
torch.FloatTensor of shape (batch_size, sequence_length)) — Float values of input raw speech waveform. Values can be obtained by loading a .flac or .wav audio file into an array of type list[float], a numpy.ndarray or a torch.Tensor, e.g. via the torchcodec library (pip install torchcodec) or the soundfile library (pip install soundfile). To prepare the array into input_values, the AutoProcessor should be used for padding and conversion into a tensor of type torch.FloatTensor. See Wav2Vec2Processor.call() for details. - attention_mask (
torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
- output_attentions (
bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail. - output_hidden_states (
bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail. - return_dict (
bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple. - labels (
torch.LongTensor of shape (batch_size,), optional) — Labels for computing the sequence classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1 a regression loss is computed (Mean-Square loss); if config.num_labels > 1 a classification loss is computed (Cross-Entropy).
Returns
transformers.modeling_outputs.SequenceClassifierOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.SequenceClassifierOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Wav2Vec2Config) and inputs.
-
loss (
torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification (or regression if config.num_labels==1) loss. -
logits (
torch.FloatTensor of shape (batch_size, config.num_labels)) — Classification (or regression if config.num_labels==1) scores (before SoftMax). -
hidden_states (
tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
-
attentions (
tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The Wav2Vec2ForSequenceClassification forward method, overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the
Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example of single-label classification:
>>> import torch
>>> from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification
>>> from datasets import load_dataset
>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
>>> dataset = dataset.sort("id")
>>> sampling_rate = dataset.features["audio"].sampling_rate
>>> feature_extractor = AutoFeatureExtractor.from_pretrained("superb/wav2vec2-base-superb-ks")
>>> model = Wav2Vec2ForSequenceClassification.from_pretrained("superb/wav2vec2-base-superb-ks")
>>> # audio file is decoded on the fly
>>> inputs = feature_extractor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
>>> with torch.no_grad():
...     logits = model(**inputs).logits
>>> predicted_class_id = logits.argmax().item()
>>> model.config.id2label[predicted_class_id]
...
>>> # To train a model on `num_labels` classes, you can pass `num_labels=num_labels` to `.from_pretrained(...)`
>>> num_labels = len(model.config.id2label)
>>> model = Wav2Vec2ForSequenceClassification.from_pretrained("superb/wav2vec2-base-superb-ks", num_labels=num_labels)
>>> labels = torch.tensor([1])
>>> loss = model(**inputs, labels=labels).loss
>>> round(loss.item(), 2)
...
Example of multi-label classification:
>>> import torch
>>> from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification
>>> from datasets import load_dataset
>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
>>> sampling_rate = dataset.features["audio"].sampling_rate
>>> feature_extractor = AutoFeatureExtractor.from_pretrained("superb/wav2vec2-base-superb-ks")
>>> model = Wav2Vec2ForSequenceClassification.from_pretrained("superb/wav2vec2-base-superb-ks", problem_type="multi_label_classification")
>>> inputs = feature_extractor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
>>> with torch.no_grad():
...     logits = model(**inputs).logits
>>> predicted_class_ids = torch.arange(0, logits.shape[-1])[torch.sigmoid(logits).squeeze(dim=0) > 0.5]
>>> # To train a model on `num_labels` classes, you can pass `num_labels=num_labels` to `.from_pretrained(...)`
>>> num_labels = len(model.config.id2label)
>>> model = Wav2Vec2ForSequenceClassification.from_pretrained(
...     "superb/wav2vec2-base-superb-ks", num_labels=num_labels, problem_type="multi_label_classification"
... )
>>> labels = torch.sum(
...     torch.nn.functional.one_hot(predicted_class_ids[None, :].clone(), num_classes=num_labels), dim=1
... ).to(torch.float)
>>> loss = model(**inputs, labels=labels).loss
Wav2Vec2ForAudioFrameClassification
class transformers.Wav2Vec2ForAudioFrameClassification
< source >( config )
Parameters
- config (Wav2Vec2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
Wav2Vec2 Model with a frame classification head on top for tasks like Speaker Diarization.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
< source >( input_values: torch.Tensor | None attention_mask: torch.Tensor | None = None labels: torch.Tensor | None = None output_attentions: bool | None = None output_hidden_states: bool | None = None return_dict: bool | None = None **kwargs ) → transformers.modeling_outputs.TokenClassifierOutput or tuple(torch.FloatTensor)
Parameters

- input_values (torch.FloatTensor of shape (batch_size, sequence_length)) — Float values of the input raw speech waveform. Values can be obtained by loading a .flac or .wav audio file into an array of type list[float], a numpy.ndarray, or a torch.Tensor, e.g. via the torchcodec library (pip install torchcodec) or the soundfile library (pip install soundfile). To prepare the array into input_values, the AutoProcessor should be used for padding and conversion into a tensor of type torch.FloatTensor. See Wav2Vec2Processor.__call__() for details.
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are **not masked**,
  - 0 for tokens that are **masked**.
- labels (torch.LongTensor of shape (batch_size,), optional) — Labels for computing the sequence classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1, a regression loss is computed (Mean-Square loss); if config.num_labels > 1, a classification loss is computed (Cross-Entropy).
- output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns

transformers.modeling_outputs.TokenClassifierOutput or tuple(torch.FloatTensor)

A transformers.modeling_outputs.TokenClassifierOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Wav2Vec2Config) and inputs.

- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification loss.
- logits (torch.FloatTensor of shape (batch_size, sequence_length, config.num_labels)) — Classification scores (before SoftMax).
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The Wav2Vec2ForAudioFrameClassification forward method, overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example
>>> from transformers import AutoFeatureExtractor, Wav2Vec2ForAudioFrameClassification
>>> from datasets import load_dataset
>>> import torch
>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
>>> dataset = dataset.sort("id")
>>> sampling_rate = dataset.features["audio"].sampling_rate
>>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
>>> model = Wav2Vec2ForAudioFrameClassification.from_pretrained("facebook/wav2vec2-base-960h")
>>> # audio file is decoded on the fly
>>> inputs = feature_extractor(dataset[0]["audio"]["array"], return_tensors="pt", sampling_rate=sampling_rate)
>>> with torch.no_grad():
... logits = model(**inputs).logits
>>> probabilities = torch.sigmoid(logits[0])
>>> # labels is a one-hot array of shape (num_frames, num_speakers)
>>> labels = (probabilities > 0.5).long()
>>> labels[0].tolist()
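The labels above are per feature frame, not per audio sample: with the usual 16 kHz input, each Wav2Vec2 feature frame covers roughly 20 ms (320 samples). A minimal post-processing sketch, assuming that 20 ms stride (the helper name `frames_to_segments` is ours, not part of the library), which merges runs of active frames into timed per-speaker segments:

```python
import torch

def frames_to_segments(frame_labels, frame_stride_s=0.02):
    """Collapse per-frame 0/1 labels of shape (num_frames, num_speakers)
    into (speaker, start_seconds, end_seconds) tuples. The ~20 ms stride
    is the typical wav2vec2 feature rate at 16 kHz (an assumption here,
    not read from the model config)."""
    segments = []
    num_frames, num_speakers = frame_labels.shape
    for spk in range(num_speakers):
        active = frame_labels[:, spk].tolist()
        start = None
        for i, a in enumerate(active + [0]):  # trailing 0 closes any open segment
            if a and start is None:
                start = i
            elif not a and start is not None:
                segments.append((spk, start * frame_stride_s, i * frame_stride_s))
                start = None
    return segments

# toy labels: speaker 0 active in frames 0-1, speaker 1 in frames 2-3
labels = torch.tensor([[1, 0], [1, 0], [0, 1], [0, 1], [0, 0]])
segments = frames_to_segments(labels)
print(segments)
```

Each tuple reads as (speaker index, start time, end time); for the toy input above, speaker 0 spans the first two frames and speaker 1 the next two.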
Wav2Vec2ForXVector
class transformers.Wav2Vec2ForXVector
< source >( config )
Parameters

- config (Wav2Vec2Config) — Model configuration class with all the parameters of the model. Initializing with a configuration file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

Wav2Vec2 Model with an XVector feature extraction head on top for tasks like Speaker Verification.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
< source >( input_values: torch.Tensor | None attention_mask: torch.Tensor | None = None output_attentions: bool | None = None output_hidden_states: bool | None = None return_dict: bool | None = None labels: torch.Tensor | None = None **kwargs ) → transformers.modeling_outputs.XVectorOutput or tuple(torch.FloatTensor)
Parameters

- input_values (torch.FloatTensor of shape (batch_size, sequence_length)) — Float values of the input raw speech waveform. Values can be obtained by loading a .flac or .wav audio file into an array of type list[float], a numpy.ndarray, or a torch.Tensor, e.g. via the torchcodec library (pip install torchcodec) or the soundfile library (pip install soundfile). To prepare the array into input_values, the AutoProcessor should be used for padding and conversion into a tensor of type torch.FloatTensor. See Wav2Vec2Processor.__call__() for details.
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are **not masked**,
  - 0 for tokens that are **masked**.
- output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
- labels (torch.LongTensor of shape (batch_size,), optional) — Labels for computing the sequence classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1, a regression loss is computed (Mean-Square loss); if config.num_labels > 1, a classification loss is computed (Cross-Entropy).
Returns

transformers.modeling_outputs.XVectorOutput or tuple(torch.FloatTensor)

A transformers.modeling_outputs.XVectorOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Wav2Vec2Config) and inputs.

- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification loss.
- logits (torch.FloatTensor of shape (batch_size, config.xvector_output_dim)) — Classification hidden states before AMSoftmax.
- embeddings (torch.FloatTensor of shape (batch_size, config.xvector_output_dim)) — Utterance embeddings used for vector similarity-based retrieval.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The Wav2Vec2ForXVector forward method, overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example
>>> from transformers import AutoFeatureExtractor, Wav2Vec2ForXVector
>>> from datasets import load_dataset
>>> import torch
>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
>>> dataset = dataset.sort("id")
>>> sampling_rate = dataset.features["audio"].sampling_rate
>>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
>>> model = Wav2Vec2ForXVector.from_pretrained("facebook/wav2vec2-base-960h")
>>> # audio file is decoded on the fly
>>> inputs = feature_extractor(
... [d["array"] for d in dataset[:2]["audio"]], sampling_rate=sampling_rate, return_tensors="pt", padding=True
... )
>>> with torch.no_grad():
... embeddings = model(**inputs).embeddings
>>> embeddings = torch.nn.functional.normalize(embeddings, dim=-1).cpu()
>>> # the resulting embeddings can be used for cosine similarity-based retrieval
>>> cosine_sim = torch.nn.CosineSimilarity(dim=-1)
>>> similarity = cosine_sim(embeddings[0], embeddings[1])
>>> threshold = 0.7 # the optimal threshold is dataset-dependent
>>> if similarity < threshold:
... print("Speakers are not the same!")
>>> round(similarity.item(), 2)
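Beyond a single pair, the same idea scores every pair of utterances at once: cosine similarity of L2-normalized embeddings is just a dot product, so the full pairwise matrix is one matmul. A minimal sketch, with random unit vectors standing in for real `model(**inputs).embeddings` (shapes and the 0.7 threshold are illustrative only):

```python
import torch

torch.manual_seed(0)
# stand-ins for x-vector embeddings of 4 utterances, L2-normalized as above
embeddings = torch.nn.functional.normalize(torch.randn(4, 512), dim=-1)

# pairwise cosine similarity: dot products of unit vectors, diagonal ~= 1
similarity = embeddings @ embeddings.T  # shape (4, 4)

threshold = 0.7  # dataset-dependent, as noted in the example above
same_speaker = similarity > threshold  # boolean decision matrix
print(same_speaker)
```

The resulting boolean matrix can feed a simple agglomerative clustering step when the number of speakers is unknown.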
Wav2Vec2ForPreTraining
class transformers.Wav2Vec2ForPreTraining
< source >( config: Wav2Vec2Config )
Parameters

- config (Wav2Vec2Config) — Model configuration class with all the parameters of the model. Initializing with a configuration file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

Wav2Vec2 Model with a quantizer and VQ head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
< source >( input_values: torch.Tensor | None attention_mask: torch.Tensor | None = None mask_time_indices: torch.BoolTensor | None = None sampled_negative_indices: torch.BoolTensor | None = None output_attentions: bool | None = None output_hidden_states: bool | None = None return_dict: bool | None = None **kwargs ) → transformers.models.wav2vec2.modeling_wav2vec2.Wav2Vec2ForPreTrainingOutput or tuple(torch.FloatTensor)
Parameters

- input_values (torch.Tensor of shape (batch_size, sequence_length), optional) — Float values of the input raw speech waveform. Values can be obtained by loading a .flac or .wav audio file into an array of type list[float], a numpy.ndarray, or a torch.Tensor, e.g. via the torchcodec library (pip install torchcodec) or the soundfile library (pip install soundfile). To prepare the array into input_values, the AutoProcessor should be used for padding and conversion into a tensor of type torch.FloatTensor. See Wav2Vec2Processor.__call__() for details.
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are **not masked**,
  - 0 for tokens that are **masked**.
- mask_time_indices (torch.BoolTensor of shape (batch_size, sequence_length), optional) — Indices to mask extracted features for the contrastive loss. When in training mode, the model learns to predict the masked extracted features in config.proj_codevector_dim space.
- sampled_negative_indices (torch.BoolTensor of shape (batch_size, sequence_length, num_negatives), optional) — Indices indicating which quantized target vectors are used as negative sampled vectors in the contrastive loss. Required input for pre-training.
- output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns

transformers.models.wav2vec2.modeling_wav2vec2.Wav2Vec2ForPreTrainingOutput or tuple(torch.FloatTensor)

A transformers.models.wav2vec2.modeling_wav2vec2.Wav2Vec2ForPreTrainingOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Wav2Vec2Config) and inputs.

- loss (torch.FloatTensor of shape (1,), optional, returned when sampled_negative_indices is passed) — Total loss as the sum of the contrastive loss (L_m) and the diversity loss (L_d) as stated in the official paper.
- projected_states (torch.FloatTensor of shape (batch_size, sequence_length, config.proj_codevector_dim)) — Hidden states of the model projected to config.proj_codevector_dim, which can be used to predict the masked projected quantized states.
- projected_quantized_states (torch.FloatTensor of shape (batch_size, sequence_length, config.proj_codevector_dim)) — Quantized extracted feature vectors projected to config.proj_codevector_dim, representing the positive target vectors for the contrastive loss.
- codevector_perplexity (torch.FloatTensor of shape (1,)) — The perplexity of the codevector distribution, used to measure the diversity of the codebook.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
- contrastive_loss (torch.FloatTensor of shape (1,), optional, returned when sampled_negative_indices is passed) — The contrastive loss (L_m) as stated in the official paper.
- diversity_loss (torch.FloatTensor of shape (1,), optional, returned when sampled_negative_indices is passed) — The diversity loss (L_d) as stated in the official paper.
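As a reminder of how these returned terms fit together, the wav2vec 2.0 paper defines the total pre-training objective as a weighted sum of the contrastive and diversity terms, with the contrastive term an InfoNCE loss over one positive quantized target and a set of distractors:

```latex
\mathcal{L} = \mathcal{L}_m + \alpha \mathcal{L}_d, \qquad
\mathcal{L}_m = -\log \frac{\exp\left(\operatorname{sim}(\mathbf{c}_t, \mathbf{q}_t)/\kappa\right)}
{\sum_{\tilde{\mathbf{q}} \sim \mathbf{Q}_t} \exp\left(\operatorname{sim}(\mathbf{c}_t, \tilde{\mathbf{q}})/\kappa\right)}
```

Here sim is cosine similarity, c_t is the context vector at a masked step (projected_states), q_t its true quantized target (projected_quantized_states), Q_t contains q_t plus the sampled negatives, κ is a temperature, and α weights the diversity loss that the codevector_perplexity field measures.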
The Wav2Vec2ForPreTraining forward method, overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example
>>> import torch
>>> from transformers import AutoFeatureExtractor, Wav2Vec2ForPreTraining
>>> from transformers.models.wav2vec2.modeling_wav2vec2 import _compute_mask_indices, _sample_negative_indices
>>> from datasets import load_dataset
>>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
>>> model = Wav2Vec2ForPreTraining.from_pretrained("facebook/wav2vec2-base")
>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> input_values = feature_extractor(ds[0]["audio"]["array"], return_tensors="pt").input_values # Batch size 1
>>> # compute masked indices
>>> batch_size, raw_sequence_length = input_values.shape
>>> sequence_length = model._get_feat_extract_output_lengths(raw_sequence_length).item()
>>> mask_time_indices = _compute_mask_indices(
... shape=(batch_size, sequence_length), mask_prob=0.2, mask_length=2
... )
>>> sampled_negative_indices = _sample_negative_indices(
... features_shape=(batch_size, sequence_length),
... num_negatives=model.config.num_negatives,
... mask_time_indices=mask_time_indices,
... )
>>> mask_time_indices = torch.tensor(data=mask_time_indices, device=input_values.device, dtype=torch.long)
>>> sampled_negative_indices = torch.tensor(
... data=sampled_negative_indices, device=input_values.device, dtype=torch.long
... )
>>> with torch.no_grad():
... outputs = model(input_values, mask_time_indices=mask_time_indices)
>>> # compute cosine similarity between predicted (=projected_states) and target (=projected_quantized_states)
>>> cosine_sim = torch.cosine_similarity(outputs.projected_states, outputs.projected_quantized_states, dim=-1)
>>> # show that cosine similarity is much higher than random
>>> cosine_sim[mask_time_indices.to(torch.bool)].mean() > 0.5
tensor(True)
>>> # for contrastive loss training model should be put into train mode
>>> model = model.train()
>>> loss = model(
... input_values, mask_time_indices=mask_time_indices, sampled_negative_indices=sampled_negative_indices
... ).loss
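The contrastive part of that loss can be sketched in isolation. The following toy reimplementation uses random tensors in place of the model's projected_states / projected_quantized_states (shapes, the number of negatives, and the temperature are illustrative, and this is not the library's internal code): cosine similarities against one positive and K negatives are fed to cross-entropy with the positive fixed at index 0.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, frames, dim, num_negatives, temperature = 2, 6, 8, 4, 0.1

context = torch.randn(batch, frames, dim)      # ~ outputs.projected_states
positives = torch.randn(batch, frames, dim)    # ~ outputs.projected_quantized_states
negatives = torch.randn(num_negatives, batch, frames, dim)  # distractor quantizations

# cosine similarity of each context vector with its positive and K negatives
targets = torch.cat([positives[None], negatives], dim=0)          # (1+K, B, T, D)
logits = F.cosine_similarity(context[None], targets, dim=-1) / temperature

# InfoNCE: cross-entropy with the positive always at class index 0
logits = logits.permute(1, 2, 0).reshape(-1, 1 + num_negatives)   # (B*T, 1+K)
labels = torch.zeros(logits.shape[0], dtype=torch.long)
contrastive_loss = F.cross_entropy(logits, labels)
print(contrastive_loss.item())
```

In the real model this term is only computed over the masked time steps and is combined with the diversity loss; the sketch omits both for brevity.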