Whisper

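Overview

The Whisper model was proposed in Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever.

The abstract from the paper is the following:

We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zero-shot transfer setting without the need for any fine-tuning. When compared to humans, the models approach their accuracy and robustness. We are releasing models and inference code to serve as a foundation for further work on robust speech processing.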

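This model was contributed by Arthur Zucker. The TensorFlow version of this model was contributed by amyeroberts. The original code is available in the openai/whisper repository on GitHub.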

Quick usage

You can run Whisper in just a few lines of code and typically get a transcription in under a minute!

# pip install transformers torch

import torch
from transformers import pipeline

whisper = pipeline("automatic-speech-recognition", "openai/whisper-large-v3", torch_dtype=torch.float16, device="cuda:0")

transcription = whisper("<audio_file.mp3>")

print(transcription["text"])

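Voilà! Depending on your needs, you can swap in any Whisper checkpoint from the Hugging Face Hub and keep the same pipeline.

Bonus: replace "cuda:0" with "mps" in the device argument to run the same code on a Mac.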
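The device swap described above can be automated. The following is a minimal sketch, not part of the official documentation: the helper names `choose_device_and_dtype` and `build_whisper_pipeline` are hypothetical, and falling back to float32 on CPU is an assumption made here because half precision is mainly useful on accelerators.

```python
def choose_device_and_dtype(has_cuda: bool, has_mps: bool) -> tuple[str, str]:
    """Return a (device, dtype name) pair, preferring CUDA, then MPS, then CPU."""
    if has_cuda:
        return "cuda:0", "float16"
    if has_mps:
        return "mps", "float16"
    # Assumption: full precision is the safer default on CPU.
    return "cpu", "float32"


def build_whisper_pipeline(model_id: str = "openai/whisper-large-v3"):
    """Build the same ASR pipeline as in the quick-usage snippet on the best available device."""
    # Imported lazily so the selection helper can be used without torch installed.
    import torch
    from transformers import pipeline

    device, dtype_name = choose_device_and_dtype(
        torch.cuda.is_available(), torch.backends.mps.is_available()
    )
    return pipeline(
        "automatic-speech-recognition",
        model_id,
        torch_dtype=getattr(torch, dtype_name),
        device=device,
    )
```

With this in place, `build_whisper_pipeline("openai/whisper-tiny")` would return a pipeline that can be called on an audio file exactly as in the snippet above.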

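Usage tips

The model usually performs well without requiring any fine-tuning.
The architecture follows a classic encoder-decoder design, which means that it relies on the generate() function for inference.
WhisperProcessor can be used to prepare audio for the model and to decode the predicted IDs back into text.
To convert the model and the processor, we recommend using the following: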

python src/transformers/models/whisper/convert_openai_to_hf.py --checkpoint_path "" --pytorch_dump_folder_path "Arthur/whisper-3" --convert_preprocessor True

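The script automatically determines all necessary parameters from the OpenAI checkpoint. The tiktoken library must be installed to convert the OpenAI tokenizer to the tokenizers version.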
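To illustrate why an encoder-decoder model depends on generate() for inference, here is a toy sketch of greedy autoregressive decoding. It is only an illustration of the idea: `greedy_decode` and `toy_scores` are hypothetical names, the four-token vocabulary is made up, and the real generate() additionally runs the encoder, manages caches and special tokens, and supports other decoding strategies.

```python
from typing import Callable, List


def greedy_decode(
    next_token_scores: Callable[[List[int]], List[float]],
    start_id: int,
    eos_id: int,
    max_new_tokens: int = 8,
) -> List[int]:
    """Repeatedly append the highest-scoring next token until EOS or the length limit."""
    tokens = [start_id]
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)
        next_id = max(range(len(scores)), key=lambda i: scores[i])
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens


def toy_scores(tokens: List[int]) -> List[float]:
    """Hypothetical stand-in for the decoder: always favors 'last id + 1', capped at id 3."""
    scores = [0.0] * 4
    scores[min(tokens[-1] + 1, 3)] = 1.0
    return scores


decoded = greedy_decode(toy_scores, start_id=0, eos_id=3)
print(decoded)  # [0, 1, 2, 3]
```

Each step feeds all tokens produced so far back into the scorer, which is why a single forward pass is not enough to produce a transcription.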

Inference

Here is a step-by-step guide to transcribing an audio sample using a pretrained Whisper model:

>>> from datasets import load_dataset
>>> from transformers import WhisperProcessor, WhisperForConditionalGeneration

>>> # Select an audio file and read it:
>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> audio_sample = ds[0]["audio"]

>>> # Load the Whisper model in Hugging Face format:
>>> processor = WhisperProcessor.from_pretrained("openai/whisper-tiny.en")
>>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en")

>>> # Use the model and processor to transcribe the audio:
>>> input_features = processor(
...     audio_sample["array"], sampling_rate=audio_sample["sampling_rate"], return_tensors="pt"
... ).input_features

>>> # Generate token ids
>>> predicted_ids = model.generate(input_features)

>>> # Decode token ids to text
>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

>>> transcription[0]
' Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.'
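As a sanity check on the processor step above, the shape of `input_features` can be predicted with simple arithmetic. The sketch below assumes the feature extractor settings commonly used by checkpoints such as openai/whisper-tiny.en (80 mel bins, 16 kHz audio, a hop length of 160 samples, and padding or truncation to 30-second windows). These values are assumptions here and should be checked against the configuration of the checkpoint you load; `expected_feature_shape` is a hypothetical helper.

```python
def expected_feature_shape(
    batch_size: int = 1,
    num_mel_bins: int = 80,       # assumed default; some checkpoints use a different value
    sampling_rate: int = 16_000,  # samples per second
    hop_length: int = 160,        # samples between consecutive spectrogram frames
    chunk_length_s: int = 30,     # audio is padded or truncated to this many seconds
) -> tuple[int, int, int]:
    """Return the (batch, mel bins, frames) shape of the log-mel input features."""
    n_samples = chunk_length_s * sampling_rate
    n_frames = n_samples // hop_length
    return (batch_size, num_mel_bins, n_frames)


shape = expected_feature_shape()
print(shape)  # (1, 80, 3000)
```

Comparing this result with `input_features.shape` from the example is a quick way to confirm that the audio was resampled and padded as expected.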

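Whisper is compatible with the following optimizations for both short-form and long-form generation:

PyTorch Scaled Dot Product Attention (SDPA): flash attention and memory-efficient attention kernels. Enabled by default for torch>=2.1.1.
Flash Attention 2: an improved implementation of flash attention through better parallelism and work partitioning.
torch.compile: JIT-compiles the forward pass to dispatch to efficient fused kernels.

For example, the following code snippet enables SDPA and torch.compile for up to 5x faster inference: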

>>> import torch
>>> from datasets import load_dataset
>>> from transformers import WhisperProcessor, WhisperForConditionalGeneration

>>> # Select an audio file and read it:
>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> audio_sample = ds[0]["audio"]

>>> # Load the Whisper model with SDPA attention
>>> processor = WhisperProcessor.from_pretrained("openai/whisper-tiny.en")
>>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en", attn_implementation="sdpa")

>>> # Enable static cache and compile the forward pass
>>> model.generation_config.cache_implementation = "static"
>>> model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

>>> # Use the model and processor to transcribe the audio:
>>> input_features = processor(
...     audio_sample["array"], sampling_rate=audio_sample["sampling_rate"], return_tensors="pt"
... ).input_features

>>> # Warm up: the first calls trigger compilation of the forward pass
>>> for _ in range(2):
...     model.generate(input_features)

>>> # Generate token ids using compiled graph (fast!)
>>> predicted_ids = model.generate(input_features)

>>> # Decode token ids to text
>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

>>> transcription[0]
' Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.'
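To see why the snippet above calls generate() twice before the "fast" call, it helps to time each call separately: with torch.compile, the first calls include compilation and are expected to be much slower than later ones. The helper below is a minimal, framework-agnostic sketch; `time_calls` is a hypothetical name, and the lambda is a stand-in workload so that the example runs without a model.

```python
import time


def time_calls(fn, n_calls: int = 3) -> list[float]:
    """Call fn n_calls times and return the wall-clock duration of each call in seconds."""
    durations = []
    for _ in range(n_calls):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


# Stand-in workload; replace with the generate() call from the snippet above.
durations = time_calls(lambda: sum(range(100_000)))
print([f"{d:.4f}s" for d in durations])
```

With the compiled model, this could be used as `time_calls(lambda: model.generate(input_features))`. On a GPU, timings are only meaningful if the device is synchronized before each clock read, which this sketch does not do.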

For more details on each optimization, refer to the documentation linked above.
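Switching between attention implementations comes down to the `attn_implementation` argument already used in the SDPA example. The sketch below is an illustration under stated assumptions: `attention_load_kwargs` and `load_whisper` are hypothetical helpers, and Flash Attention 2 additionally needs the separate flash-attn package, a supported CUDA GPU, and half-precision weights.

```python
def attention_load_kwargs(implementation: str = "sdpa") -> dict:
    """Build the from_pretrained keyword arguments for a given attention implementation."""
    allowed = {"eager", "sdpa", "flash_attention_2"}
    if implementation not in allowed:
        raise ValueError(f"Unknown attention implementation: {implementation!r}")
    return {"attn_implementation": implementation}


def load_whisper(model_id: str = "openai/whisper-tiny.en", implementation: str = "sdpa"):
    """Load a Whisper checkpoint with the requested attention implementation."""
    # Imported lazily so the helper above can be used without torch installed.
    import torch
    from transformers import WhisperForConditionalGeneration

    kwargs = attention_load_kwargs(implementation)
    if implementation == "flash_attention_2":
        # Flash Attention 2 kernels only run in half precision.
        kwargs["torch_dtype"] = torch.float16
    return WhisperForConditionalGeneration.from_pretrained(model_id, **kwargs)
```

A model loaded with `load_whisper(implementation="flash_attention_2")` still has to be moved to a CUDA device before calling generate().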

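Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with Whisper. If you are interested in submitting a resource to be included here, feel free to open a Pull Request and we will review it! The resource should ideally demonstrate something new rather than duplicate an existing resource.

Fine-tune Whisper on your own dataset for better downstream performance.
Distil-Whisper: distilled Whisper models for English that are up to 6x faster and 2x smaller. The model checkpoints and distillation code are released.
A fork with a script to convert a Whisper model in Hugging Face format to OpenAI format. 🌎 Usage example: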

pip install -U openai-whisper
python convert_hf_to_openai.py \
    --checkpoint openai/whisper-tiny \
    --whisper_dump_path whisper-tiny-openai.pt
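Once the conversion has produced a checkpoint file, it can presumably be loaded with the openai-whisper package installed above. The following is a hedged sketch: it assumes the converted file has the layout that `whisper.load_model` accepts for a local checkpoint path, and `transcribe_with_openai_whisper` is a hypothetical helper name.

```python
from pathlib import Path


def transcribe_with_openai_whisper(checkpoint_path: str, audio_path: str) -> str:
    """Load a converted checkpoint with openai-whisper and return the transcribed text."""
    if not Path(checkpoint_path).is_file():
        raise FileNotFoundError(f"No converted checkpoint found at {checkpoint_path}")

    # Imported lazily; this module is provided by the openai-whisper package.
    import whisper

    # Assumption: load_model accepts a path to a local checkpoint in OpenAI format.
    model = whisper.load_model(checkpoint_path)
    result = model.transcribe(audio_path)
    return result["text"]
```

For the command above, this would be called as `transcribe_with_openai_whisper("whisper-tiny-openai.pt", "<audio_file.mp3>")`.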

