Deploy MusicGen in No Time with Inference Endpoints

Published August 4, 2023

Vaibhav Srivastav

reach-vb

merve

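MusicGen is a powerful music generation model that takes a text prompt and an optional melody and outputs music. This blog post walks you through generating music with MusicGen using Inference Endpoints.

Inference Endpoints let us write custom inference functions called custom handlers. These are particularly useful when a model is not supported out of the box by the `transformers` high-level abstraction `pipeline`.

`transformers` pipelines offer powerful abstractions for running inference with `transformers`-based models. Inference Endpoints build on the pipeline API to deploy models in just a few clicks. However, Inference Endpoints can also deploy models that have no pipeline, and even non-Transformer models! This is done with a custom inference function that we call a custom handler.

Let's demonstrate the process using MusicGen as an example. To implement a custom handler function for MusicGen and deploy it, we need to: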

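1. Duplicate the MusicGen repository we want to serve,
2. Write a custom handler in `handler.py` and list all dependencies in `requirements.txt`, then add both files to the duplicated repository,
3. Create an Inference Endpoint for that repository.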

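Or simply use the final result and deploy our custom MusicGen model repository, which we created by following exactly the steps above :)

Let's go!

First, we duplicate the facebook/musicgen-large repository to our own profile using the repository duplicator.

Then, we add `handler.py` and `requirements.txt` to the duplicated repository. Before that, let's take a look at how to run inference with MusicGen.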

from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-large")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-large")

inputs = processor(
    text=["80s pop track with bassy drums and synth"],
    padding=True,
    return_tensors="pt",
)
audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=256)

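Let's hear what it sounds like.

You can also optionally condition the output on an audio snippet, i.e. generate a complementary snippet that combines the text-generated audio with the input audio.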

from transformers import AutoProcessor, MusicgenForConditionalGeneration
from datasets import load_dataset

processor = AutoProcessor.from_pretrained("facebook/musicgen-large")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-large")

dataset = load_dataset("sanchit-gandhi/gtzan", split="train", streaming=True)
sample = next(iter(dataset))["audio"]

# take the first half of the audio sample
sample["array"] = sample["array"][: len(sample["array"]) // 2]

inputs = processor(
    audio=sample["array"],
    sampling_rate=sample["sampling_rate"],
    text=["80s blues track with groovy saxophone"],
    padding=True,
    return_tensors="pt",
)
audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=256)

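Let's give it a listen.

In both cases, the `model.generate` method produces the audio and follows the same principles as text generation. You can read more about this in our "How to generate" blog post.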
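Because generation proceeds token by token, `max_new_tokens` also sets the length of the clip. The sketch below only illustrates the arithmetic. It assumes MusicGen's audio codec produces about 50 token frames per second of audio (an assumption taken from the `transformers` MusicGen documentation, not from this post), and the helper names are hypothetical.

```python
import math

# Assumption: roughly 50 codec frames (token steps) per second of audio.
FRAME_RATE_HZ = 50


def seconds_for_tokens(max_new_tokens: int, frame_rate_hz: int = FRAME_RATE_HZ) -> float:
    """Approximate clip length in seconds for a given max_new_tokens."""
    return max_new_tokens / frame_rate_hz


def tokens_for_seconds(seconds: float, frame_rate_hz: int = FRAME_RATE_HZ) -> int:
    """Smallest max_new_tokens that covers the requested duration."""
    return math.ceil(seconds * frame_rate_hz)


print(seconds_for_tokens(256))  # the setting used in the snippets above
print(tokens_for_seconds(10))   # tokens needed for about ten seconds
```

Under that assumption, the `max_new_tokens=256` used above corresponds to a clip of roughly five seconds.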

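Alright! With the basic usage covered, let's deploy MusicGen for fun and profit!

First, we define a custom handler in `handler.py`. We can start from the Inference Endpoints template and override the `__init__` and `__call__` methods with our custom inference code. `__init__` initializes the model and the processor, and `__call__` takes the request data and returns the generated music. You can find the modified `EndpointHandler` class below. 👇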

from typing import Dict, List, Any
from transformers import AutoProcessor, MusicgenForConditionalGeneration
import torch

class EndpointHandler:
    def __init__(self, path=""):
        # load model and processor from path
        self.processor = AutoProcessor.from_pretrained(path)
        self.model = MusicgenForConditionalGeneration.from_pretrained(path, torch_dtype=torch.float16).to("cuda")

    def __call__(self, data: Dict[str, Any]) -> List[Dict[str, Any]]:
        """
        Args:
            data (:dict:):
                The payload with the text prompt and generation parameters.
        Returns:
            A list with one dict holding the generated waveform under "generated_audio".
        """
        # split the payload into the text prompt and optional generation parameters
        inputs = data.pop("inputs", data)
        parameters = data.pop("parameters", None) or {}

        # tokenize the prompt and move the tensors to the GPU
        inputs = self.processor(
            text=[inputs],
            padding=True,
            return_tensors="pt",
        ).to("cuda")

        # forward any generation parameters to generate()
        with torch.autocast("cuda"):
            outputs = self.model.generate(**inputs, **parameters)

        # convert the first generated waveform to a JSON-serializable nested list
        prediction = outputs[0].cpu().numpy().tolist()

        return [{"generated_audio": prediction}]
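To see what the handler expects in a request, it can help to replay its payload handling without loading any model. The sketch below mirrors the `data.pop` logic of `__call__`; `split_payload` is a hypothetical helper written for illustration and is not part of the handler or of any library.

```python
from typing import Any, Dict, Tuple


def split_payload(data: Dict[str, Any]) -> Tuple[Any, Dict[str, Any]]:
    """Mirror the handler's input handling: prompt under "inputs", kwargs under "parameters"."""
    data = dict(data)  # shallow copy so the caller's dict is left untouched
    inputs = data.pop("inputs", data)
    parameters = data.pop("parameters", None) or {}
    return inputs, parameters


# A request that also carries generation parameters for model.generate
prompt, params = split_payload({
    "inputs": "lo-fi beat with soft piano",
    "parameters": {"do_sample": True, "guidance_scale": 3, "max_new_tokens": 256},
})
print(prompt)
print(params)

# A request without "parameters": generate() would run with its defaults
print(split_payload({"inputs": "happy folk song"}))
```

Anything placed under `"parameters"` is passed straight to `model.generate`, so a client can control sampling and clip length per request.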

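To keep things simple, this example only generates audio from text and does not condition on a melody. Next, we create a `requirements.txt` file listing all the dependencies needed to run the inference code.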

transformers==4.31.0
accelerate>=0.20.3

Uploading these two files to our repository is all it takes to serve the model.
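The files can be added through the Hub web interface, or programmatically. The following is a minimal sketch of the programmatic route, assuming the `huggingface_hub` package is installed and you are logged in with a token that has write access. The function names and the repository id in the example call are placeholders, not names from this post.

```python
from pathlib import Path
from typing import List

REQUIRED_FILES = ("handler.py", "requirements.txt")


def missing_files(folder: str) -> List[str]:
    """Return the required files that are not present in `folder`."""
    root = Path(folder)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


def upload_handler_files(folder: str, repo_id: str) -> None:
    """Upload handler.py and requirements.txt to the root of a model repository."""
    # Imported lazily so the check above works without huggingface_hub installed.
    from huggingface_hub import HfApi

    missing = missing_files(folder)
    if missing:
        raise FileNotFoundError(f"Missing files: {missing}")

    api = HfApi()
    for name in REQUIRED_FILES:
        api.upload_file(
            path_or_fileobj=str(Path(folder) / name),
            path_in_repo=name,
            repo_id=repo_id,
        )


# Example call (placeholder repository id):
# upload_handler_files(".", "your-username/musicgen-large")
```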

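We can now create the Inference Endpoint. Head to the Inference Endpoints page and click "Deploy your first model". In the "Model repository" field, enter the identifier of your duplicated repository. Then select the hardware you want and create the endpoint. Any instance with at least 16 GB of RAM should work for musicgen-large.

Once created, the endpoint starts automatically and is ready to receive requests.

We can query the endpoint with the snippet below.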

curl URL_OF_ENDPOINT \
-X POST \
-d '{"inputs":"happy folk song, cheerful and lively"}' \
-H "Authorization: Bearer {YOUR_TOKEN_HERE}" \
-H "Content-Type: application/json"

The response contains the generated waveform as a sequence of samples.

[{"generated_audio":[[-0.024490159,-0.03154691,-0.0079551935,-0.003828604, ...]]}]
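The same request can be made from Python with only the standard library. The sketch below builds the request without sending it. The URL is a placeholder, the `Bearer` prefix on the token is an assumption about how the endpoint authenticates, and `build_request` and `parse_response` are hypothetical helper names. It also shows how generation parameters can be sent under `"parameters"`, which the handler forwards to `model.generate`.

```python
import json
import urllib.request


def build_request(endpoint_url: str, token: str, prompt: str, **parameters) -> urllib.request.Request:
    """Build a POST request carrying the prompt and optional generation parameters."""
    payload = {"inputs": prompt}
    if parameters:
        payload["parameters"] = parameters
    return urllib.request.Request(
        endpoint_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",  # assumption: bearer-token auth
            "Content-Type": "application/json",
        },
        method="POST",
    )


def parse_response(body: bytes) -> list:
    """Extract the nested list of samples from the endpoint's JSON response."""
    return json.loads(body)[0]["generated_audio"]


request = build_request(
    "https://example.invalid/endpoint",  # placeholder for URL_OF_ENDPOINT
    "YOUR_TOKEN_HERE",
    "happy folk song, cheerful and lively",
    do_sample=True,
    max_new_tokens=256,
)
print(request.get_method(), json.loads(request.data))

# To actually send the request:
# with urllib.request.urlopen(request) as response:
#     audio = parse_response(response.read())
```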

Here's how it sounds.

You can also call the endpoint with the `InferenceClient` class from the `huggingface-hub` Python library.

import json

from huggingface_hub import InferenceClient

client = InferenceClient(model=URL_OF_ENDPOINT)
response = client.post(json={"inputs": "an alt rock song"})
# response looks like this: b'[{"generated_audio":[[-0.182352,-0.17802449, ...]]}]'

# parse the JSON bytes instead of calling eval() on data received over the network
output = json.loads(response)[0]["generated_audio"]

You can convert the generated sequence to audio however you like. For example, you can use `scipy` in Python to write it to a .wav file.

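import numpy as np
import scipy.io.wavfile  # import the submodule explicitly; `import scipy` alone may not load it

# output is [[-0.182352,-0.17802449, ...]]; output[0] holds the samples at 32 kHz
scipy.io.wavfile.write("musicgen_out.wav", rate=32000, data=np.array(output[0], dtype=np.float32))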
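If you would rather avoid the `scipy` and `numpy` dependencies on the client side, the samples can also be written with Python's built-in `wave` module. This is a minimal sketch, assuming mono output with float samples in the range [-1, 1] at 32 kHz. `write_wav_pcm16` is a hypothetical helper, and the four sample values are stand-ins for `output[0]`.

```python
import struct
import wave


def write_wav_pcm16(path: str, samples, sample_rate: int = 32000) -> None:
    """Write float samples in [-1, 1] as a mono 16-bit PCM .wav file."""
    frames = bytearray()
    for value in samples:
        clipped = max(-1.0, min(1.0, float(value)))  # guard against out-of-range values
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))


# Stand-in samples; in practice pass output[0] from the endpoint response.
write_wav_pcm16("musicgen_out_pcm16.wav", [-0.182352, -0.17802449, 0.0, 0.25])
```

Converting to 16-bit PCM loses a little precision compared with writing floats, but the resulting file plays in practically any audio player.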

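Voila!

Try the endpoint in the demo below.

Conclusion

In this blog post, we showed how to deploy MusicGen using Inference Endpoints with a custom inference handler. The same technique works for any other model on the Hub that has no associated pipeline. You only need to override the `EndpointHandler` class in `handler.py` and add a `requirements.txt` that reflects your project's dependencies.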
