Qwen2.5-Omni

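Overview

The Qwen2.5-Omni model is a unified multimodal model proposed in the Qwen2.5-Omni Technical Report by the Qwen team, Alibaba Group.

The abstract from the technical report is the following:

We present Qwen2.5-Omni, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. To enable the streaming of multimodal information inputs, both audio and visual encoders utilize a block-wise processing approach. This strategy effectively decouples the handling of long sequences of multimodal data, assigning the perceptual responsibilities to the multimodal encoder and entrusting the modeling of extended sequences to a large language model. Such a division of labor enhances the fusion of different modalities via the shared attention mechanism. To synchronize the timestamps of video inputs with audio, we organize the audio and video sequentially in an interleaved manner and propose a novel position embedding approach, named TMRoPE (Time-aligned Multimodal RoPE). To concurrently generate text and speech while avoiding interference between the two modalities, we propose the Thinker-Talker architecture. In this framework, Thinker functions as a large language model tasked with text generation, while Talker is a dual-track autoregressive model that directly utilizes the hidden representations from the Thinker to produce audio tokens as output. Both the Thinker and Talker models are designed to be trained and inferred in an end-to-end manner. For decoding audio tokens in a streaming manner, we introduce a sliding-window DiT that restricts the receptive field, aiming to reduce the initial package delay. Qwen2.5-Omni outperforms the similarly sized Qwen2-VL and Qwen2-Audio in both image and audio capabilities. Furthermore, Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks like Omni-Bench. Notably, Qwen2.5-Omni is the first open-source model to achieve a level of performance in end-to-end speech instruction following that is comparable to its capabilities with text inputs, as evidenced by benchmarks such as MMLU and GSM8K. As for speech generation, Qwen2.5-Omni's streaming Talker outperforms most existing streaming and non-streaming alternatives in robustness and naturalness.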

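Notes

Use Qwen2_5OmniForConditionalGeneration to generate both audio and text output. To generate only one output type, use Qwen2_5OmniThinkerForConditionalGeneration for text only and Qwen2_5OmniTalkerForConditionalGeneration for audio only.
Audio generation with Qwen2_5OmniForConditionalGeneration currently supports only a batch size of 1.
If you run out of memory when processing video input, decrease processor.max_pixels. The default maximum is very large, so high-resolution visuals are not resized unless their resolution exceeds processor.max_pixels.
The processor has its own apply_chat_template() method that converts chat messages into model inputs.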

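Usage example

Qwen2.5-Omni can be found on the Hugging Face Hub.

Single Media inference

The model accepts text, images, audio, and video as input. The following code shows an inference example.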

import soundfile as sf
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    torch_dtype="auto",
    device_map="auto"
)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")

conversations = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "/path/to/video.mp4"},
            {"type": "text", "text": "What can you hear and see in this video?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    conversations,
    load_audio_from_video=True,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    video_fps=1,

    # kwargs to be passed to `Qwen2_5OmniProcessor`
    padding=True,
    use_audio_in_video=True,
).to(model.device)

# Generation params for audio or text can be different and have to be prefixed with `thinker_` or `talker_`
text_ids, audio = model.generate(**inputs, use_audio_in_video=True, thinker_do_sample=False, talker_do_sample=True)
text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)

sf.write(
    "output.wav",
    audio.reshape(-1).detach().cpu().numpy(),
    samplerate=24000,
)
print(text)
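
The prefix convention mentioned in the comment before the generate call can be pictured with a small routing helper. The following is a minimal sketch, not the library's implementation: `split_generation_kwargs` is a hypothetical name, and the sketch only assumes that arguments starting with `thinker_` or `talker_` are stripped of their prefix and handed to the matching sub-model, while unprefixed arguments are treated as shared.

```python
def split_generation_kwargs(kwargs):
    """Route prefixed generation arguments to per-model dictionaries (illustrative only)."""
    routed = {"thinker": {}, "talker": {}, "shared": {}}
    for key, value in kwargs.items():
        for target in ("thinker", "talker"):
            prefix = target + "_"
            if key.startswith(prefix):
                # Strip the prefix so the sub-model sees the plain argument name
                routed[target][key[len(prefix):]] = value
                break
        else:
            # No known prefix: treat the argument as shared (an assumption of this sketch)
            routed["shared"][key] = value
    return routed


routed = split_generation_kwargs(
    {"use_audio_in_video": True, "thinker_do_sample": False, "talker_do_sample": True}
)
print(routed)
```

Under this reading, the text model in the example above decodes greedily while the speech model samples.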

Text-only generation

To generate only text output and save compute by not loading the audio generation model, use the Qwen2_5OmniThinkerForConditionalGeneration model.

from transformers import Qwen2_5OmniThinkerForConditionalGeneration, Qwen2_5OmniProcessor

model = Qwen2_5OmniThinkerForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    torch_dtype="auto",
    device_map="auto",
)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")

conversations = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "/path/to/video.mp4"},
            {"type": "text", "text": "What can you hear and see in this video?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    conversations,
    load_audio_from_video=True,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    video_fps=1,

    # kwargs to be passed to `Qwen2_5OmniProcessor`
    padding=True,
    use_audio_in_video=True,
).to(model.device)


# The Thinker model returns only text token IDs, so there is no audio to save
text_ids = model.generate(**inputs, use_audio_in_video=True)
text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)

print(text)

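Batch Mixed Media Inference

The model can batch inputs composed of mixed samples of various types, such as text, images, audio, and video, as long as only text is generated, for example when using the Qwen2_5OmniThinkerForConditionalGeneration model. Here is an example.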

import soundfile as sf
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    torch_dtype="auto",
    device_map="auto"
)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")

# Conversation with video only
conversation1 = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "/path/to/video.mp4"},
        ]
    }
]

# Conversation with audio only
conversation2 = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "audio", "path": "/path/to/audio.wav"},
        ]
    }
]

# Conversation with pure text
conversation3 = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": [{"type": "text", "text": "who are you?"}],
    }
]


# Conversation with mixed media
conversation4 = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "path": "/path/to/image.jpg"},
            {"type": "video", "path": "/path/to/video.mp4"},
            {"type": "audio", "path": "/path/to/audio.wav"},
            {"type": "text", "text": "What elements can you see and hear in these media?"},
        ],
    }
]

conversations = [conversation1, conversation2, conversation3, conversation4]

inputs = processor.apply_chat_template(
    conversations,
    load_audio_from_video=True,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    video_fps=1,

    # kwargs to be passed to `Qwen2_5OmniProcessor`
    padding=True,
    use_audio_in_video=True,
).to(model.thinker.device)

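# Audio generation supports only a batch size of 1, so request text output only
text_ids = model.generate(**inputs, use_audio_in_video=True, return_audio=False)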
text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)

print(text)

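Usage Tips

Image Resolution trade-off

The model supports a wide range of input resolutions. By default, it uses the native resolution of the input; higher resolutions can improve performance at the cost of more computation. You can set the minimum and maximum number of pixels to find the configuration that best fits your needs.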

from transformers import AutoProcessor

# Lower and upper bounds on the number of pixels per image
min_pixels = 128*28*28
max_pixels = 768*28*28
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B", min_pixels=min_pixels, max_pixels=max_pixels)
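
To get a feel for what these budgets mean, the sketch below estimates the size an image would be scaled to under a given `min_pixels`/`max_pixels` range. It is an approximation under stated assumptions: `fit_to_pixel_budget` is a hypothetical helper, and rounding each side to a multiple of 28 is assumed from the `28*28` factor in the values above, so the processor's actual resizing logic may differ.

```python
import math


def fit_to_pixel_budget(height, width, min_pixels, max_pixels, factor=28):
    """Estimate a resized (height, width) that respects the pixel budget (illustrative only)."""
    # Snap each side to the nearest multiple of `factor` (assumed patch granularity)
    new_h = max(factor, round(height / factor) * factor)
    new_w = max(factor, round(width / factor) * factor)
    if new_h * new_w > max_pixels:
        # Too large: shrink both sides by the same ratio, rounding down
        scale = math.sqrt((height * width) / max_pixels)
        new_h = math.floor(height / scale / factor) * factor
        new_w = math.floor(width / scale / factor) * factor
    elif new_h * new_w < min_pixels:
        # Too small: enlarge both sides by the same ratio, rounding up
        scale = math.sqrt(min_pixels / (height * width))
        new_h = math.ceil(height * scale / factor) * factor
        new_w = math.ceil(width * scale / factor) * factor
    return new_h, new_w


min_pixels = 128 * 28 * 28  # 100,352 pixels
max_pixels = 768 * 28 * 28  # 602,112 pixels

large = fit_to_pixel_budget(1080, 1920, min_pixels, max_pixels)
small = fit_to_pixel_budget(100, 100, min_pixels, max_pixels)
print(large, large[0] * large[1] <= max_pixels)
print(small, small[0] * small[1] >= min_pixels)
```

A tighter `max_pixels` shrinks large frames more aggressively, which is the lever to reach for when video inputs exhaust GPU memory.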

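Prompt for audio output

If you need audio output, the system prompt must be set to "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.", otherwise the audio output may not work as expected.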

{
    "role": "system",
    "content": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
}
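
Because the exact wording matters, it can help to verify the system prompt before requesting audio. The following is a small sketch: `has_audio_system_prompt` is a hypothetical helper, and it assumes the system message content is either a plain string (as shown above) or a list of text items (as in the inference examples earlier on this page).

```python
AUDIO_SYSTEM_PROMPT = (
    "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, "
    "capable of perceiving auditory and visual inputs, as well as generating text and speech."
)


def has_audio_system_prompt(conversation):
    """Return True if the first system message carries the prompt required for audio output."""
    for message in conversation:
        if message.get("role") != "system":
            continue
        content = message.get("content")
        if isinstance(content, str):
            text = content
        else:
            # List-style content: join the text items
            text = "".join(item.get("text", "") for item in content if item.get("type") == "text")
        return text.strip() == AUDIO_SYSTEM_PROMPT
    return False


good = [{"role": "system", "content": AUDIO_SYSTEM_PROMPT}]
also_good = [{"role": "system", "content": [{"type": "text", "text": AUDIO_SYSTEM_PROMPT}]}]
bad = [{"role": "system", "content": "You are a helpful assistant."}]
print(has_audio_system_prompt(good), has_audio_system_prompt(also_good), has_audio_system_prompt(bad))
```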

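Use audio output or not

The model supports both text and audio output. If you do not need audio output, set enable_audio_output=False in the from_pretrained function. This saves about 2GB of GPU memory, but the return_audio option of the generate function can then only be set to False.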

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    torch_dtype="auto",
    device_map="auto",
    enable_audio_output=False,
)

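For a more flexible experience, we recommend setting enable_audio_output to True when initializing the model through the from_pretrained function, and then deciding whether to return audio when calling the generate function. When return_audio is set to False, the model returns only text output, which yields text responses faster.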

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    torch_dtype="auto",
    device_map="auto",
    enable_audio_output=True,
)
...
text_ids = model.generate(**inputs, return_audio=False)

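Change voice type of output audio

Qwen2.5-Omni supports changing the voice of the output audio. Use the spk parameter of the generate function to specify the voice type. The "Qwen/Qwen2.5-Omni-7B" checkpoint supports two voice types: Chelsie, a female voice, and Ethan, a male voice. If spk is not specified, the default voice type is Chelsie.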

text_ids, audio = model.generate(**inputs, spk="Chelsie")

text_ids, audio = model.generate(**inputs, spk="Ethan")

Flash-Attention 2 to speed up generation

First, make sure to install the latest version of Flash Attention 2:

pip install -U flash-attn --no-build-isolation

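You also need hardware that is compatible with FlashAttention 2. For more information, see the official documentation of the Flash Attention repository. FlashAttention-2 can only be used when the model is loaded in torch.float16 or torch.bfloat16.

To load and run a model using FlashAttention-2, add attn_implementation="flash_attention_2" when loading the model: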

import torch
from transformers import Qwen2_5OmniForConditionalGeneration

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
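
The two software requirements above (the package being installed and a half-precision dtype) can be checked before loading. This is a hedged sketch: `pick_attn_implementation` is a hypothetical helper, the fallback value `"sdpa"` is an assumption about what you would want otherwise, and the check does not verify that the GPU itself is compatible.

```python
import importlib.util


def pick_attn_implementation(dtype_name):
    """Choose an attention backend name from a dtype name such as "bfloat16" (illustrative only)."""
    flash_installed = importlib.util.find_spec("flash_attn") is not None
    half_precision = dtype_name in {"float16", "bfloat16"}
    if flash_installed and half_precision:
        return "flash_attention_2"
    # Assumed fallback for this sketch
    return "sdpa"


print(pick_attn_implementation("float32"))
```

The returned string could then be passed as attn_implementation when calling from_pretrained.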

Qwen2_5OmniConfig

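class transformers.Qwen2_5OmniConfig

( thinker_config = None, talker_config = None, token2wav_config = None, enable_audio_output: bool = True, **kwargs )

Parameters

thinker_config (dict, optional) — Configuration of the underlying thinker sub-model.
talker_config (dict, optional) — Configuration of the underlying talker sub-model.
token2wav_config (dict, optional) — Configuration of the underlying codec sub-model.
enable_audio_output (bool, optional, defaults to True) — Whether to enable audio output and load the talker and token2wav modules.

This is the configuration class to store the configuration of a Qwen2_5OmniForConditionalGeneration. It is used to instantiate a Qwen2.5Omni model according to the specified sub-model configurations, defining the model architecture.

Instantiating a configuration with the defaults yields a configuration similar to that of the Qwen/Qwen2.5-Omni-7B architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation of PretrainedConfig for more information.

Example: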

>>> from transformers import (
...     Qwen2_5OmniThinkerConfig,
...     Qwen2_5OmniTalkerConfig,
...     Qwen2_5OmniToken2WavConfig,
...     Qwen2_5OmniForConditionalGeneration,
...     Qwen2_5OmniConfig,
... )

>>> # Initializing sub-modules configurations.
>>> thinker_config = Qwen2_5OmniThinkerConfig()
>>> talker_config = Qwen2_5OmniTalkerConfig()
>>> token2wav_config = Qwen2_5OmniToken2WavConfig()


>>> # Initializing a module style configuration
>>> configuration = Qwen2_5OmniConfig.from_sub_model_configs(
...     thinker_config, talker_config, token2wav_config
... )

>>> # Initializing a model (with random weights)
>>> model = Qwen2_5OmniForConditionalGeneration(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

Qwen2_5OmniProcessor

class transformers.Qwen2_5OmniProcessor

( image_processor = None, video_processor = None, feature_extractor = None, tokenizer = None, chat_template = None )

Parameters

image_processor (Qwen2VLImageProcessor, optional) — The image processor.
video_processor (Qwen2VLVideoProcessor, optional) — The video processor.
feature_extractor (WhisperFeatureExtractor, optional) — The audio feature extractor.
tokenizer (Qwen2TokenizerFast, optional) — The text tokenizer.
chat_template (Optional[str], optional) — The Jinja template used to format conversations. If not provided, the default chat template is used.

Constructs a Qwen2.5Omni processor. Qwen2_5OmniProcessor offers all the functionalities of Qwen2VLImageProcessor, WhisperFeatureExtractor, and Qwen2TokenizerFast. See __call__() and decode() for more information.

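batch_decode

( *args, **kwargs )

This method forwards all its arguments to Qwen2TokenizerFast's batch_decode(). Refer to the docstring of that method for more information.

decode

( *args, **kwargs )

This method forwards all its arguments to Qwen2TokenizerFast's decode(). Refer to the docstring of that method for more information.

get_chunked_index

( token_indices: ndarray, tokens_per_chunk: int ) → list[tuple[int, int]]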

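Parameters

token_indices (np.ndarray) — A monotonically increasing list of token index values.
tokens_per_chunk (int) — Number of tokens per chunk (used as the chunk size threshold).

Returns

list[tuple[int, int]]

A list of tuples, each representing the start (inclusive) and end (exclusive) indices of a chunk in token_indices.

Splits a list of token indices into chunks based on token value ranges.

Given a list of token indices, returns a list of (start, end) index tuples representing slices of the list whose token values fall within successive ranges of tokens_per_chunk.

For example, if tokens_per_chunk is 1000, the function creates chunks such that:

the first chunk contains token values < 1000,
the second chunk contains values >= 1000 and < 2000, and so on.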
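
The chunking rule described above can be sketched as follows. This is an illustrative re-implementation written from the description, not the library's source: `chunk_by_value_range` is a hypothetical name, and the sketch assumes the values increase monotonically without skipping an entire range. A plain list stands in for the np.ndarray input.

```python
def chunk_by_value_range(token_indices, tokens_per_chunk):
    """Return (start, end) slices of `token_indices` grouped by successive value ranges."""
    chunks = []
    start = 0
    current_chunk = 1
    for i, value in enumerate(token_indices):
        # A value at or past the current range's upper bound closes the chunk
        if value >= current_chunk * tokens_per_chunk:
            chunks.append((start, i))
            start = i
            current_chunk += 1
    # The remaining positions form the final chunk
    chunks.append((start, len(token_indices)))
    return chunks


indices = [0, 400, 800, 1200, 1600, 2100]
chunks = chunk_by_value_range(indices, 1000)
print(chunks)  # [(0, 3), (3, 5), (5, 6)]
```

Each tuple indexes into the input, so `indices[0:3]` holds the values below 1000, `indices[3:5]` the values from 1000 up to 2000, and `indices[5:6]` the rest.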

Qwen2_5OmniForConditionalGeneration

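class transformers.Qwen2_5OmniForConditionalGeneration

( config )

Parameters

config (Qwen2_5OmniConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The full Qwen2.5Omni model, a multimodal model composed of 3 sub-models:

Qwen2_5OmniThinkerForConditionalGeneration: a causal autoregressive transformer that takes text, audio, image, and video as input and predicts text tokens.
Qwen2_5OmniTalkerForConditionalGeneration: a causal autoregressive transformer that takes the thinker's hidden states and response as input and predicts speech tokens.
Qwen2_5OmniToken2WavModel: a DiT model that takes speech tokens as input and predicts a mel spectrogram, plus a BigVGAN vocoder that takes the mel spectrogram as input and predicts the waveform.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch module and refer to the PyTorch documentation for all matters related to general usage and behavior.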
