Wav2Vec2Phoneme

Overview

The Wav2Vec2Phoneme model was proposed in Simple and Effective Zero-shot Cross-lingual Phoneme Recognition (Xu et al., 2021) by Qiantong Xu, Alexei Baevski, and Michael Auli.

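The abstract from the paper is the following:

Recent progress in self-training, self-supervised pretraining and unsupervised learning enabled well performing speech recognition systems without any labeled data. However, in many cases there is labeled data available for related languages which is not utilized by these methods. This paper extends previous work on zero-shot cross-lingual transfer learning by fine-tuning a multilingually pretrained wav2vec 2.0 model to transcribe unseen languages. This is done by mapping phonemes of the training languages to the target language using articulatory features. Experiments show that this simple method significantly outperforms prior work which introduced task-specific architectures and used only part of a monolingually pretrained model.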

Relevant checkpoints can be found under https://huggingface.co/models?other=phoneme-recognition.

This model was contributed by patrickvonplaten.

The original code can be found here.

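Usage tips

Wav2Vec2Phoneme uses the exact same architecture as Wav2Vec2.
Wav2Vec2Phoneme is a speech model that accepts a float array corresponding to the raw waveform of the speech signal.
The Wav2Vec2Phoneme model was trained using connectionist temporal classification (CTC), so the model output has to be decoded using Wav2Vec2PhonemeCTCTokenizer.
Wav2Vec2Phoneme can be fine-tuned on multiple languages at once and can decode unseen languages into a sequence of phonemes in a single forward pass.
By default, the model outputs a sequence of phonemes. To transform the phonemes into a sequence of words, one should make use of a dictionary and a language model.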
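The CTC decoding and dictionary lookup mentioned in the tips can be illustrated with a minimal, library-free sketch. It is not the tokenizer's actual implementation: the toy vocabulary, the lexicon, and the helper names `greedy_ctc_decode` and `phonemes_to_word` are all hypothetical, and the padding id is assumed to double as the CTC blank.

```python
from itertools import groupby

# Hypothetical toy vocabulary; real checkpoints ship their own vocab file.
ID_TO_PHONEME = {0: "<pad>", 1: "h", 2: "ə", 3: "l", 4: "oʊ"}
PAD_ID = 0  # assumption: the pad token also serves as the CTC blank


def greedy_ctc_decode(frame_ids, id_to_phoneme=ID_TO_PHONEME, blank_id=PAD_ID):
    """Collapse consecutive repeated frame predictions, then drop blanks."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    return [id_to_phoneme[i] for i in collapsed if i != blank_id]


# Hypothetical pronunciation lexicon mapping phoneme tuples to words.
LEXICON = {("h", "ə", "l", "oʊ"): "hello"}


def phonemes_to_word(phonemes, lexicon=LEXICON):
    """Look up a single word; returns None when the lexicon has no entry."""
    return lexicon.get(tuple(phonemes))


# One predicted id per audio frame, e.g. the argmax over the model logits.
frame_ids = [1, 1, 0, 2, 2, 0, 3, 0, 0, 4, 4]
phonemes = greedy_ctc_decode(frame_ids)
print(" ".join(phonemes))
print(phonemes_to_word(phonemes))
```

A real system would replace the exact-match lexicon with a pronunciation dictionary plus a language model that scores competing word sequences.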

Wav2Vec2Phoneme's architecture is based on the Wav2Vec2 model, so for the API reference see the Wav2Vec2 documentation page; only the tokenizer is documented here.
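Because the model classes are shared with Wav2Vec2, inference might look like the sketch below. The checkpoint name `facebook/wav2vec2-lv-60-espeak-cv-ft` and the helper name `transcribe_to_phonemes` are assumptions, not taken from this page; any checkpoint from the phoneme-recognition listing should work the same way. Loading the phoneme tokenizer may additionally require the `phonemizer` package and an espeak backend.

```python
def transcribe_to_phonemes(
    waveform,
    sampling_rate=16_000,
    checkpoint="facebook/wav2vec2-lv-60-espeak-cv-ft",  # assumed checkpoint name
):
    """Return one phoneme string for a mono float waveform.

    Imports are local so that defining the function does not require the
    optional dependencies (torch, transformers) to be installed.
    """
    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained(checkpoint)
    model = Wav2Vec2ForCTC.from_pretrained(checkpoint)
    model.eval()

    # The feature extractor turns the raw waveform into model inputs.
    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits  # (batch, frames, vocab)

    # Greedy CTC decoding: best id per frame, collapsed by the tokenizer.
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0]
```

Calling the function downloads the checkpoint, so it is best run once per process with the processor and model cached if many files are transcribed.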

Wav2Vec2PhonemeCTCTokenizer

class transformers.Wav2Vec2PhonemeCTCTokenizer

( vocab_file bos_token = '<s>' eos_token = '</s>' unk_token = '<unk>' pad_token = '<pad>' phone_delimiter_token = ' ' word_delimiter_token = None do_phonemize = True phonemizer_lang = 'en-us' phonemizer_backend = 'espeak' **kwargs )

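Parameters

vocab_file (str) — File containing the vocabulary.
bos_token (str, optional, defaults to "<s>") — The beginning of sentence token.
eos_token (str, optional, defaults to "</s>") — The end of sentence token.
unk_token (str, optional, defaults to "<unk>") — The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to this token instead.
pad_token (str, optional, defaults to "<pad>") — The token used for padding, for example when batching sequences of different lengths.
do_phonemize (bool, optional, defaults to True) — Whether the tokenizer should phonemize the input. Set do_phonemize to False only if a sequence of phonemes is passed to the tokenizer.
phonemizer_lang (str, optional, defaults to "en-us") — The language of the phoneme set into which the tokenizer should phonemize the input text.
phonemizer_backend (str, optional, defaults to "espeak") — The backend phonemization library that the phonemizer library should use. Defaults to espeak-ng. See the phonemizer package for more information.
**kwargs — Additional keyword arguments passed along to PreTrainedTokenizer.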

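Constructs a Wav2Vec2PhonemeCTC tokenizer.

This tokenizer inherits from PreTrainedTokenizer, which contains some of the main methods. Users should refer to the superclass for more information regarding those methods.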
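To make the roles of the vocabulary, the phone delimiter, and `unk_token` concrete, here is a simplified sketch of what encoding an already phonemized string (the `do_phonemize=False` case) amounts to. The vocabulary contents and the function name `encode_phonemes` are hypothetical; the real tokenizer performs more steps than this.

```python
# Hypothetical toy vocabulary standing in for the contents of vocab_file.
VOCAB = {"<pad>": 0, "<s>": 1, "</s>": 2, "<unk>": 3, "h": 4, "ə": 5, "l": 6, "oʊ": 7}


def encode_phonemes(phoneme_text, vocab=VOCAB, phone_delimiter=" ", unk_token="<unk>"):
    """Split a phonemized string on the delimiter and map each phoneme to an id."""
    phonemes = [p for p in phoneme_text.split(phone_delimiter) if p]
    unk_id = vocab[unk_token]
    # Phonemes missing from the vocabulary fall back to the unknown token id.
    return [vocab.get(p, unk_id) for p in phonemes]


print(encode_phonemes("h ə l oʊ"))  # every phoneme is in the vocabulary
print(encode_phonemes("h ə l ɔ"))   # "ɔ" is not, so it maps to <unk>
```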

__call__

( text: typing.Union[str, typing.List[str], typing.List[typing.List[str]]] = None text_pair: typing.Union[str, typing.List[str], typing.List[typing.List[str]], NoneType] = None text_target: typing.Union[str, typing.List[str], typing.List[typing.List[str]]] = None text_pair_target: typing.Union[str, typing.List[str], typing.List[typing.List[str]], NoneType] = None add_special_tokens: bool = True padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = False truncation: typing.Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy] = None max_length: typing.Optional[int] = None stride: int = 0 is_split_into_words: bool = False pad_to_multiple_of: typing.Optional[int] = None padding_side: typing.Optional[str] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None return_token_type_ids: typing.Optional[bool] = None return_attention_mask: typing.Optional[bool] = None return_overflowing_tokens: bool = False return_special_tokens_mask: bool = False return_offsets_mapping: bool = False return_length: bool = False verbose: bool = True **kwargs ) → BatchEncoding

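Parameters

text (str, List[str], List[List[str]], optional) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as a list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
text_pair (str, List[str], List[List[str]], optional) — The second sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as a list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
text_target (str, List[str], List[List[str]], optional) — The sequence or batch of sequences to be encoded as target texts. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as a list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
text_pair_target (str, List[str], List[List[str]], optional) — The second sequence or batch of sequences to be encoded as target texts. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as a list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
add_special_tokens (bool, optional, defaults to True) — Whether or not to add special tokens when encoding the sequences. This uses the underlying PretrainedTokenizerBase.build_inputs_with_special_tokens function, which defines which tokens are automatically added to the input ids. This is useful if you want to add bos or eos tokens automatically.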
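padding (bool, str or PaddingStrategy, optional, defaults to False) — Activates and controls padding. Accepts the following values:
- True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
- 'max_length': Pad to the length specified with the max_length argument, or to the maximum acceptable input length for the model if that argument is not provided.
- False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).
truncation (bool, str or TruncationStrategy, optional, defaults to False) — Activates and controls truncation. Accepts the following values:
- True or 'longest_first': Truncate to the length specified with the max_length argument, or to the maximum acceptable input length for the model if that argument is not provided. This truncates token by token, removing a token from the longest sequence in the pair if a pair of sequences (or a batch of pairs) is provided.
- 'only_first': Truncate to the length specified with the max_length argument, or to the maximum acceptable input length for the model if that argument is not provided. This only truncates the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
- 'only_second': Truncate to the length specified with the max_length argument, or to the maximum acceptable input length for the model if that argument is not provided. This only truncates the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
- False or 'do_not_truncate' (default): No truncation (i.e., can output a batch with sequence lengths greater than the model maximum admissible input size).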
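max_length (int, optional) — Controls the maximum length to use by one of the truncation/padding parameters. If left unset or set to None, the predefined model maximum length is used whenever one of the truncation/padding parameters requires a maximum length. If the model has no specific maximum input length (like XLNet), truncation/padding to a maximum length is deactivated.
stride (int, optional, defaults to 0) — If set to a number along with max_length, the overflowing tokens returned when return_overflowing_tokens=True will contain some tokens from the end of the truncated sequence, to provide some overlap between the truncated and overflowing sequences. The value of this argument defines the number of overlapping tokens.
is_split_into_words (bool, optional, defaults to False) — Whether or not the input is already pretokenized (e.g., split into words). If set to True, the tokenizer assumes the input is already split into words (for instance, by splitting it on whitespace), which it will then tokenize. This is useful for NER or token classification.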
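pad_to_multiple_of (int, optional) — If set, pads the sequence to a multiple of the provided value. Requires padding to be activated. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.0 (Volta).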
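padding_side (str, optional) — The side on which the model should have padding applied. Should be selected between ['right', 'left']. The default value is picked from the class attribute of the same name.
return_tensors (str or TensorType, optional) — If set, returns tensors instead of lists of Python integers. Acceptable values are:
- 'tf': Return TensorFlow tf.constant objects.
- 'pt': Return PyTorch torch.Tensor objects.
- 'np': Return Numpy np.ndarray objects.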
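return_token_type_ids (bool, optional) — Whether to return token type IDs. If left to the default, token type IDs are returned according to the specific tokenizer's default, defined by the return_outputs attribute.

What are token type IDs?
return_attention_mask (bool, optional) — Whether to return the attention mask. If left to the default, the attention mask is returned according to the specific tokenizer's default, defined by the return_outputs attribute.

What are attention masks?
return_overflowing_tokens (bool, optional, defaults to False) — Whether or not to return overflowing token sequences. If a pair of sequences of input ids (or a batch of pairs) is provided with truncation_strategy = longest_first or True, an error is raised instead of returning overflowing tokens.
return_special_tokens_mask (bool, optional, defaults to False) — Whether or not to return special tokens mask information.
return_offsets_mapping (bool, optional, defaults to False) — Whether or not to return (char_start, char_end) for each token. This is only available on fast tokenizers inheriting from PreTrainedTokenizerFast; if using Python's tokenizer, this method raises NotImplementedError.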
verbose (bool, optional, defaults to True) — Whether or not to print more information and warnings.
**kwargs — passed to the self.tokenize() method
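The interaction between `padding`, `max_length`, `pad_to_multiple_of`, and `padding_side` can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the library's implementation: the helper name `pad_batch` is hypothetical, the pad id is assumed to be 0, and `padding='max_length'` requires an explicit `max_length` here because the sketch has no model to consult.

```python
def pad_batch(batch, pad_id=0, padding="longest", max_length=None,
              pad_to_multiple_of=None, padding_side="right"):
    """Pad a list of id lists and build matching attention masks."""
    if padding in (False, "do_not_pad"):
        return {"input_ids": [list(s) for s in batch],
                "attention_mask": [[1] * len(s) for s in batch]}
    if padding in (True, "longest"):
        target = max(len(s) for s in batch)
    elif padding == "max_length":
        if max_length is None:
            raise ValueError("padding='max_length' needs max_length in this sketch")
        target = max_length
    else:
        raise ValueError(f"unknown padding strategy: {padding!r}")
    if pad_to_multiple_of:
        # Round the target length up to the next multiple.
        target = -(-target // pad_to_multiple_of) * pad_to_multiple_of

    input_ids, attention_mask = [], []
    for seq in batch:
        missing = max(target - len(seq), 0)  # padding never truncates
        pad, zeros = [pad_id] * missing, [0] * missing
        if padding_side == "right":
            input_ids.append(list(seq) + pad)
            attention_mask.append([1] * len(seq) + zeros)
        else:
            input_ids.append(pad + list(seq))
            attention_mask.append(zeros + [1] * len(seq))
    return {"input_ids": input_ids, "attention_mask": attention_mask}


out = pad_batch([[4, 5, 6], [7]], padding="longest", pad_to_multiple_of=4)
print(out["input_ids"])
print(out["attention_mask"])
```

The attention mask marks real tokens with 1 and padding with 0, which is how a model can ignore the padded positions.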

Returns

BatchEncoding

A BatchEncoding with the following fields

input_ids — List of token ids to be fed to a model.

What are input IDs?
token_type_ids — List of token type ids to be fed to a model (when return_token_type_ids=True or if “token_type_ids” is in self.model_input_names).

What are token type IDs?
attention_mask — List of indices specifying which tokens should be attended to by the model (when return_attention_mask=True or if “attention_mask” is in self.model_input_names).

What are attention masks?
overflowing_tokens — List of overflowing tokens sequences (when a max_length is specified and return_overflowing_tokens=True).
num_truncated_tokens — Number of tokens truncated (when a max_length is specified and return_overflowing_tokens=True).
special_tokens_mask — List of 0s and 1s, with 1 specifying added special tokens and 0 specifying regular sequence tokens (when add_special_tokens=True and return_special_tokens_mask=True).
length — The length of the inputs (when return_length=True)

Main method to tokenize and prepare for the model one or several sequence(s) or one or several pair(s) of sequences.
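How `max_length`, `stride`, and the `overflowing_tokens` and `num_truncated_tokens` fields relate can be shown with a small sketch for a single sequence. It ignores special tokens and sequence pairs, and the helper name `truncate_with_stride` is hypothetical; treat it as an illustration of the overlap idea rather than the library's exact behavior.

```python
def truncate_with_stride(ids, max_length, stride=0):
    """Truncate one sequence and report the overflow, with `stride` tokens of overlap."""
    if stride >= max_length:
        raise ValueError("stride must be smaller than max_length")
    if len(ids) <= max_length:
        return {"input_ids": list(ids), "overflowing_tokens": [], "num_truncated_tokens": 0}
    return {
        "input_ids": list(ids[:max_length]),
        # The overflow starts `stride` tokens before the cut, so both parts overlap.
        "overflowing_tokens": list(ids[max_length - stride:]),
        "num_truncated_tokens": len(ids) - max_length,
    }


result = truncate_with_stride(list(range(10)), max_length=6, stride=2)
print(result["input_ids"])
print(result["overflowing_tokens"])
print(result["num_truncated_tokens"])
```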

batch_decode

( sequences: typing.Union[typing.List[int], typing.List[typing.List[int]], ForwardRef('np.ndarray'), ForwardRef('torch.Tensor'), ForwardRef('tf.Tensor')] skip_special_tokens: bool = False clean_up_tokenization_spaces: bool = None output_char_offsets: bool = False **kwargs ) → List[str] or ~models.wav2vec2.tokenization_wav2vec2_phoneme.Wav2Vec2PhonemeCTCTokenizerOutput

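Parameters

sequences (Union[List[int], List[List[int]], np.ndarray, torch.Tensor, tf.Tensor]) — List of tokenized input ids. Can be obtained using the __call__ method.
skip_special_tokens (bool, optional, defaults to False) — Whether or not to remove special tokens in the decoding.
clean_up_tokenization_spaces (bool, optional) — Whether or not to clean up the tokenization spaces.
output_char_offsets (bool, optional, defaults to False) — Whether or not to output character offsets. Character offsets can be used in combination with the sampling rate and the model downsampling rate to compute the timestamps of transcribed characters.

Please take a look at the example of ~models.wav2vec2.tokenization_wav2vec2.decode to better understand how to make use of output_word_offsets. ~model.wav2vec2_phoneme.tokenization_wav2vec2_phoneme.batch_decode works analogously with phonemes and batched output.
kwargs (additional keyword arguments, optional) — Will be passed to the underlying model-specific decode method.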

Returns

List[str] or ~models.wav2vec2.tokenization_wav2vec2_phoneme.Wav2Vec2PhonemeCTCTokenizerOutput

The decoded sentence. Will be a ~models.wav2vec2.tokenization_wav2vec2_phoneme.Wav2Vec2PhonemeCTCTokenizerOutput when output_char_offsets == True.

Converts a list of lists of token ids into a list of strings by calling decode.
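The timestamp computation described for output_char_offsets can be sketched as follows. The dictionary keys (`char`, `start_offset`, `end_offset`) are assumed to mirror the offsets produced by the Wav2Vec2 CTC tokenizer, and the default values of 16 kHz and a downsampling ratio of 320 input samples per output frame are typical wav2vec 2.0 settings assumed for illustration; in practice both should be read from the feature extractor and model configuration. The helper name `offsets_to_timestamps` is hypothetical.

```python
def offsets_to_timestamps(char_offsets, sampling_rate=16_000, inputs_to_logits_ratio=320):
    """Convert frame offsets into start/end times in seconds."""
    # Each output frame covers this many seconds of audio.
    seconds_per_frame = inputs_to_logits_ratio / sampling_rate
    return [
        {
            "char": item["char"],
            "start_time": round(item["start_offset"] * seconds_per_frame, 2),
            "end_time": round(item["end_offset"] * seconds_per_frame, 2),
        }
        for item in char_offsets
    ]


# Hypothetical offsets for two decoded phonemes.
char_offsets = [
    {"char": "h", "start_offset": 10, "end_offset": 12},
    {"char": "ə", "start_offset": 13, "end_offset": 15},
]
for entry in offsets_to_timestamps(char_offsets):
    print(entry)
```

With these assumed values each frame spans 0.02 seconds, so an offset of 10 frames corresponds to 0.2 seconds into the audio.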

decode

( token_ids: typing.Union[int, typing.List[int], ForwardRef('np.ndarray'), ForwardRef('torch.Tensor'), ForwardRef('tf.Tensor')] skip_special_tokens: bool = False clean_up_tokenization_spaces: bool = None output_char_offsets: bool = False **kwargs ) → str or ~models.wav2vec2.tokenization_wav2vec2_phoneme.Wav2Vec2PhonemeCTCTokenizerOutput

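Parameters

token_ids (Union[int, List[int], np.ndarray, torch.Tensor, tf.Tensor]) — List of tokenized input ids. Can be obtained using the __call__ method.
skip_special_tokens (bool, optional, defaults to False) — Whether or not to remove special tokens in the decoding.
clean_up_tokenization_spaces (bool, optional) — Whether or not to clean up the tokenization spaces.
output_char_offsets (bool, optional, defaults to False) — Whether or not to output character offsets. Character offsets can be used in combination with the sampling rate and the model downsampling rate to compute the timestamps of transcribed characters.

Please take a look at the example of ~models.wav2vec2.tokenization_wav2vec2.decode to better understand how to make use of output_word_offsets. ~model.wav2vec2_phoneme.tokenization_wav2vec2_phoneme.batch_decode works the same way with phonemes and batched output.
kwargs (additional keyword arguments, optional) — Will be passed to the underlying model-specific decode method.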

Returns

str or ~models.wav2vec2.tokenization_wav2vec2_phoneme.Wav2Vec2PhonemeCTCTokenizerOutput

The decoded sentence. Will be a ~models.wav2vec2.tokenization_wav2vec2_phoneme.Wav2Vec2PhonemeCTCTokenizerOutput when output_char_offsets == True.

Converts a sequence of ids into a string, using the tokenizer and vocabulary, with options to remove special tokens and clean up tokenization spaces.

Similar to doing self.convert_tokens_to_string(self.convert_ids_to_tokens(token_ids)).

phonemize

( text: str phonemizer_lang: typing.Optional[str] = None )
