Transcribe a meeting

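In this final section, we'll use the Whisper model to generate a transcription for a conversation or meeting between two or more speakers. We'll then pair it with a speaker diarization model to predict "who spoke when". By matching the timestamps from the Whisper transcription with the timestamps from the speaker diarization model, we can predict an end-to-end meeting transcription with fully formatted start / end times for each speaker. This is a basic version of the meeting transcription services you might have seen online from the likes of Otter.ai.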

Speaker Diarization

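Speaker diarization (or diarisation) is the task of taking an unlabelled audio input and predicting "who spoke when". In doing so, we can predict start / end timestamps for each speaker turn, corresponding to when each speaker starts speaking and when they finish.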

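🤗 Transformers currently does not include a speaker diarization model in the library, but there are checkpoints on the Hub that can be used with relative ease. In this example, we'll use the pre-trained speaker diarization model from pyannote.audio. Let's get started and install the package with pip: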

pip install --upgrade pyannote.audio

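Great! The weights for this model are hosted on the Hugging Face Hub. To access them, we first have to agree to the terms of use of the speaker diarization model: pyannote/speaker-diarization. We then have to agree to the terms of use of the segmentation model: pyannote/segmentation.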

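Once complete, we can load the pre-trained speaker diarization pipeline locally on our device. Passing use_auth_token=True assumes you have already logged in to the Hub (for example with huggingface-cli login), so that your access token is stored locally: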

from pyannote.audio import Pipeline

diarization_pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization@2.1", use_auth_token=True
)

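Let's try it out on a sample audio file! For this, we'll load a sample from the LibriSpeech ASR dataset that consists of two different speakers concatenated together to give a single audio file: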

from datasets import load_dataset

concatenated_librispeech = load_dataset(
    "sanchit-gandhi/concatenated_librispeech", split="train", streaming=True
)
sample = next(iter(concatenated_librispeech))

We can listen to the audio to hear what it sounds like:

from IPython.display import Audio

Audio(sample["audio"]["array"], rate=sample["audio"]["sampling_rate"])

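Cool! We can clearly hear two different speakers, with a transition roughly 15 seconds in. Let's pass this audio file to the diarization model to get the speaker start / end times. Note that pyannote.audio expects the audio input to be a PyTorch tensor of shape (channels, seq_len), so we need to perform this conversion before running the model: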

import torch

input_tensor = torch.from_numpy(sample["audio"]["array"][None, :]).float()
outputs = diarization_pipeline(
    {"waveform": input_tensor, "sample_rate": sample["audio"]["sampling_rate"]}
)

outputs.for_json()["content"]

[{'segment': {'start': 0.4978125, 'end': 14.520937500000002},
  'track': 'B',
  'label': 'SPEAKER_01'},
 {'segment': {'start': 15.364687500000002, 'end': 21.3721875},
  'track': 'A',
  'label': 'SPEAKER_00'}]

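This looks pretty good! We can see that the first speaker is predicted as speaking up until the 14.5 second mark, and the second speaker from 15.4 seconds onwards. Now we need to get our transcription!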
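The diarization output above is a plain list of dictionaries, so it is easy to post-process. As a minimal sketch, the snippet below copies the two segments printed above into a list and sums the speaking time per speaker label. The names `diarization_segments` and `total_speaking_time` are our own for illustration and are not part of pyannote.audio.

```python
from collections import defaultdict

# Segments copied from the diarization output printed above
diarization_segments = [
    {"segment": {"start": 0.4978125, "end": 14.520937500000002}, "track": "B", "label": "SPEAKER_01"},
    {"segment": {"start": 15.364687500000002, "end": 21.3721875}, "track": "A", "label": "SPEAKER_00"},
]


def total_speaking_time(segments):
    """Sum the duration (in seconds) of every segment, grouped by speaker label."""
    totals = defaultdict(float)
    for entry in segments:
        totals[entry["label"]] += entry["segment"]["end"] - entry["segment"]["start"]
    return dict(totals)


speaking_time = total_speaking_time(diarization_segments)
for speaker, seconds in sorted(speaking_time.items()):
    print(f"{speaker}: {seconds:.1f}s")
```

For longer recordings with many turns per speaker, the same loop gives a quick sanity check on whether the predicted speaker balance looks plausible.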

Speech transcription

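For the third time in this Unit, we'll use the Whisper model as our speech transcription system. Specifically, we'll load the Whisper Base checkpoint, since it's small enough to give good inference speed with reasonable transcription accuracy. As before, feel free to use any speech recognition checkpoint on the Hub, including Wav2Vec2, MMS ASR or other Whisper checkpoints: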

from transformers import pipeline

asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-base",
)

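Let's get the transcription for our sample audio, returning the segment-level timestamps as well so that we know the start / end times of each segment. You'll remember from Unit 5 that we need to pass the argument return_timestamps=True to activate the timestamp prediction task for Whisper: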

asr_pipeline(
    sample["audio"].copy(),
    generate_kwargs={"max_new_tokens": 256},
    return_timestamps=True,
)

{
    "text": " The second and importance is as follows. Sovereignty may be defined to be the right of making laws. In France, the king really exercises a portion of the sovereign power, since the laws have no weight. He was in a favored state of mind, owing to the blight his wife's action threatened to cast upon his entire future.",
    "chunks": [
        {"timestamp": (0.0, 3.56), "text": " The second and importance is as follows."},
        {
            "timestamp": (3.56, 7.84),
            "text": " Sovereignty may be defined to be the right of making laws.",
        },
        {
            "timestamp": (7.84, 13.88),
            "text": " In France, the king really exercises a portion of the sovereign power, since the laws have",
        },
        {"timestamp": (13.88, 15.48), "text": " no weight."},
        {
            "timestamp": (15.48, 19.44),
            "text": " He was in a favored state of mind, owing to the blight his wife's action threatened to",
        },
        {"timestamp": (19.44, 21.28), "text": " cast upon his entire future."},
    ],
}

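Alright! We see that each segment of the transcript has a start and end time, with the speakers changing at the 15.48 second mark. We can now pair this transcription with the speaker timestamps that we got from our diarization model to get our final transcription.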
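A quick way to inspect the segment boundaries is to collect the end time of each chunk. The sketch below copies only the timestamps from the "chunks" field printed above (the text is omitted for brevity) and checks that the segments follow one another without gaps. The variable names are illustrative.

```python
# Timestamps copied from the "chunks" field of the Whisper output above
chunk_timestamps = [
    (0.0, 3.56),
    (3.56, 7.84),
    (7.84, 13.88),
    (13.88, 15.48),
    (15.48, 19.44),
    (19.44, 21.28),
]

# Each chunk should start exactly where the previous one ended
is_contiguous = all(
    prev[1] == cur[0] for prev, cur in zip(chunk_timestamps, chunk_timestamps[1:])
)

# The end times are the candidate points at which the transcript can be split
boundaries = [end for _, end in chunk_timestamps]

print(is_contiguous)
print(boundaries)
```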

Speechbox

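To get the final transcription, we'll align the timestamps from the diarization model with those from the Whisper model. The diarization model predicted that the first speaker ends at 14.5 seconds and the second speaker starts at 15.4 seconds, whereas Whisper predicted segment boundaries at 13.88, 15.48 and 19.44 seconds. Since the timestamps from Whisper don't match those from the diarization model exactly, we need to find which of these boundaries are closest to 14.5 and 15.4 seconds, and segment the transcription by speaker accordingly. Specifically, we'll find the closest alignment between diarization and transcription timestamps by minimising the absolute distance between the two.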
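To make the minimum-distance idea concrete, here is a simplified sketch in plain Python. It is an illustration under stated assumptions, not the Speechbox code: the function name `align_segments` is hypothetical, consecutive diarization segments are assumed to belong to different speakers, and a speaker's turn is assumed to end where the next speaker's turn starts (the last turn keeps its own end time). The input data is copied from the two model outputs shown earlier.

```python
def align_segments(diarization_segments, asr_chunks):
    """Split ASR chunks by speaker using nearest end-timestamp matching."""
    aligned = []
    remaining = list(asr_chunks)
    for i, turn in enumerate(diarization_segments):
        if not remaining:
            break
        # Assumption: a turn ends where the next speaker's turn starts
        if i + 1 < len(diarization_segments):
            cut_time = diarization_segments[i + 1]["segment"]["start"]
        else:
            cut_time = turn["segment"]["end"]
        # Index of the chunk whose end time is closest to the cut point
        closest = min(
            range(len(remaining)),
            key=lambda j: abs(remaining[j]["timestamp"][1] - cut_time),
        )
        taken, remaining = remaining[: closest + 1], remaining[closest + 1 :]
        aligned.append(
            {
                "speaker": turn["label"],
                "text": "".join(chunk["text"] for chunk in taken),
                "timestamp": (taken[0]["timestamp"][0], taken[-1]["timestamp"][1]),
            }
        )
    return aligned


# Data copied from the diarization and Whisper outputs shown earlier
diarization_segments = [
    {"segment": {"start": 0.4978125, "end": 14.520937500000002}, "label": "SPEAKER_01"},
    {"segment": {"start": 15.364687500000002, "end": 21.3721875}, "label": "SPEAKER_00"},
]
asr_chunks = [
    {"timestamp": (0.0, 3.56), "text": " The second and importance is as follows."},
    {"timestamp": (3.56, 7.84), "text": " Sovereignty may be defined to be the right of making laws."},
    {"timestamp": (7.84, 13.88), "text": " In France, the king really exercises a portion of the sovereign power, since the laws have"},
    {"timestamp": (13.88, 15.48), "text": " no weight."},
    {"timestamp": (15.48, 19.44), "text": " He was in a favored state of mind, owing to the blight his wife's action threatened to"},
    {"timestamp": (19.44, 21.28), "text": " cast upon his entire future."},
]

aligned = align_segments(diarization_segments, asr_chunks)
for entry in aligned:
    print(entry["speaker"], entry["timestamp"])
```

With this data, the cut point for the first speaker (15.36 seconds) is nearest to the Whisper boundary at 15.48 seconds, so the first four chunks are assigned to SPEAKER_01 and the remaining two to SPEAKER_00.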

Luckily for us, we can use the 🤗 Speechbox package to perform this alignment. First, let's install speechbox from main with pip:

pip install git+https://github.com/huggingface/speechbox

We can now instantiate our combined diarization plus transcription pipeline by passing the diarization model and the ASR model to the ASRDiarizationPipeline class:

from speechbox import ASRDiarizationPipeline

pipeline = ASRDiarizationPipeline(
    asr_pipeline=asr_pipeline, diarization_pipeline=diarization_pipeline
)

You can also instantiate the ASRDiarizationPipeline directly from pre-trained by specifying the model id of an ASR model on the Hub:

pipeline = ASRDiarizationPipeline.from_pretrained("openai/whisper-base")

Let's pass the audio file to the composite pipeline and see what we get out:

pipeline(sample["audio"].copy())

[{'speaker': 'SPEAKER_01',
  'text': ' The second and importance is as follows. Sovereignty may be defined to be the right of making laws. In France, the king really exercises a portion of the sovereign power, since the laws have no weight.',
  'timestamp': (0.0, 15.48)},
 {'speaker': 'SPEAKER_00',
  'text': " He was in a favored state of mind, owing to the blight his wife's action threatened to cast upon his entire future.",
  'timestamp': (15.48, 21.28)}]

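Excellent! The first speaker is segmented as speaking from 0 to 15.48 seconds, and the second speaker from 15.48 to 21.28 seconds, with the corresponding transcription for each.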

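We can format the timestamps a little more nicely by defining two helper functions. The first converts a tuple of timestamps to a string, rounded to a set number of decimal places. The second combines the speaker id, timestamp and text into one line, and puts each speaker on their own line for ease of reading: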

def tuple_to_string(start_end_tuple, ndigits=1):
    return str((round(start_end_tuple[0], ndigits), round(start_end_tuple[1], ndigits)))


def format_as_transcription(raw_segments):
    return "\n\n".join(
        [
            chunk["speaker"] + " " + tuple_to_string(chunk["timestamp"]) + chunk["text"]
            for chunk in raw_segments
        ]
    )

Let's re-run the pipeline, this time formatting the transcription with the function we've just defined:

outputs = pipeline(sample["audio"].copy())

print(format_as_transcription(outputs))

SPEAKER_01 (0.0, 15.5) The second and importance is as follows. Sovereignty may be defined to be the right of making laws.
In France, the king really exercises a portion of the sovereign power, since the laws have no weight.

SPEAKER_00 (15.5, 21.3) He was in a favored state of mind, owing to the blight his wife's action threatened to cast upon
his entire future.
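For longer meetings, clock-style timestamps can be easier to scan than raw seconds. The sketch below is one possible variation on the formatting helpers: `seconds_to_clock` and `format_with_clock` are hypothetical names introduced here, and the example segments are copied from the composite pipeline output shown earlier.

```python
def seconds_to_clock(seconds):
    """Convert a time in seconds to a zero-padded MM:SS string (nearest second)."""
    minutes, secs = divmod(int(round(seconds)), 60)
    return f"{minutes:02d}:{secs:02d}"


def format_with_clock(raw_segments):
    """Render one block per speaker turn as 'SPEAKER [MM:SS - MM:SS] text'."""
    lines = []
    for chunk in raw_segments:
        start, end = chunk["timestamp"]
        # The chunk text already starts with a space, so none is added here
        lines.append(
            f"{chunk['speaker']} [{seconds_to_clock(start)} - {seconds_to_clock(end)}]{chunk['text']}"
        )
    return "\n\n".join(lines)


# Segments copied from the composite pipeline output shown earlier
example_segments = [
    {
        "speaker": "SPEAKER_01",
        "text": " The second and importance is as follows. Sovereignty may be defined to be the right of making laws. In France, the king really exercises a portion of the sovereign power, since the laws have no weight.",
        "timestamp": (0.0, 15.48),
    },
    {
        "speaker": "SPEAKER_00",
        "text": " He was in a favored state of mind, owing to the blight his wife's action threatened to cast upon his entire future.",
        "timestamp": (15.48, 21.28),
    },
]

formatted = format_with_clock(example_segments)
print(formatted)
```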

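There we go! With that, we've both diarized and transcribed our input audio and returned speaker-segmented transcriptions. While the minimum distance algorithm used to align the diarization timestamps with the transcription timestamps is simple, it works well in practice. If you want to explore more advanced methods for combining the timestamps, the source code for the ASRDiarizationPipeline is a good place to start: speechbox/diarize.py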
