Qwen2Audio

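Overview

Qwen2-Audio is the new series of large audio-language models from the Qwen team. Qwen2-Audio accepts various audio signal inputs and performs audio analysis or responds directly in text, following speech instructions. We introduce two distinct audio interaction modes:

Voice chat: users can freely engage in voice interactions with Qwen2-Audio without text input
Audio analysis: users can provide audio and text instructions for analysis during the interaction

The model was proposed in the Qwen2-Audio Technical Report by Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, Chang Zhou, Jingren Zhou.

The abstract from the paper is the following:

We introduce the latest progress of Qwen-Audio, a large-scale audio-language model called Qwen2-Audio, which is capable of accepting various audio signal inputs and performing audio analysis or direct textual responses with regard to speech instructions. In contrast to complex hierarchical tags, we have simplified the pre-training process by utilizing natural language prompts for different data and tasks, and have further expanded the data volume. We have boosted the instruction-following capability of Qwen2-Audio and implemented two distinct audio interaction modes for voice chat and audio analysis. In the voice chat mode, users can freely engage in voice interactions with Qwen2-Audio without text input. In the audio analysis mode, users can provide audio and text instructions for analysis during the interaction. Note that we do not use any system prompts to switch between voice chat and audio analysis modes. Qwen2-Audio is capable of intelligently comprehending the content within audio and following voice commands to respond appropriately. For instance, in an audio segment that simultaneously contains sounds, multi-speaker conversations, and a voice command, Qwen2-Audio can directly understand the command and provide an interpretation and response to the audio. Additionally, DPO has optimized the model's performance in terms of factuality and adherence to desired behavior. According to the evaluation results from AIR-Bench, Qwen2-Audio outperformed previous SOTAs, such as Gemini-1.5-pro, in tests focused on audio-centric instruction-following capabilities. Qwen2-Audio is open-sourced with the aim of fostering the advancement of the multi-modal language community.

Usage tips

Qwen2-Audio-7B and Qwen2-Audio-7B-Instruct can be found on the Hugging Face Hub.

[!NOTE] The `head_mask` argument is ignored when using any attention implementation other than "eager". If you have a `head_mask` and want it to take effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`.

Inference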

from io import BytesIO
from urllib.request import urlopen
import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B", trust_remote_code=True, device_map="auto")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B", trust_remote_code=True)

prompt = "<|audio_bos|><|AUDIO|><|audio_eos|>Generate the caption in English:"
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Audio/glass-breaking-151256.mp3"
audio, sr = librosa.load(BytesIO(urlopen(url).read()), sr=processor.feature_extractor.sampling_rate)
inputs = processor(text=prompt, audios=audio, return_tensors="pt").to(model.device)

generate_ids = model.generate(**inputs, max_length=256)
generate_ids = generate_ids[:, inputs.input_ids.size(1):]

response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

# We can also omit the audio_bos and audio_eos tokens
prompt = "<|AUDIO|>Generate the caption in English:"
inputs = processor(text=prompt, audios=audio, return_tensors="pt").to(model.device)

generate_ids = model.generate(**inputs, max_length=256)
generate_ids = generate_ids[:, inputs.input_ids.size(1):]

response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

Below, we demonstrate how to use Qwen2-Audio-7B-Instruct for inference in both voice chat and audio analysis modes. Conversations use the ChatML format, and this demo builds them with apply_chat_template.
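
To make the ChatML layout concrete, here is a minimal pure-Python sketch of how a conversation might be flattened into ChatML-style text. It is not the template that ships with the processor: the function name `render_chatml`, the "Audio N:" prefix, and the placeholder URL are assumptions for illustration, and the real template may differ (for example by inserting a default system message). In practice, rely on `processor.apply_chat_template`.

```python
# Audio placeholder built from the special tokens shown earlier on this page.
AUDIO_PLACEHOLDER = "<|audio_bos|><|AUDIO|><|audio_eos|>"


def render_chatml(conversation, add_generation_prompt=True):
    """Flatten a conversation into ChatML-style text (simplified, hypothetical)."""
    parts = []
    audio_count = 0
    for message in conversation:
        parts.append(f"<|im_start|>{message['role']}\n")
        content = message["content"]
        if isinstance(content, str):
            # Plain-string turns (typically assistant replies) are copied as-is.
            parts.append(content)
        else:
            for ele in content:
                if ele["type"] == "audio":
                    # Each audio entry becomes one numbered placeholder (assumed layout).
                    audio_count += 1
                    parts.append(f"Audio {audio_count}: {AUDIO_PLACEHOLDER}\n")
                elif ele["type"] == "text":
                    parts.append(ele["text"])
        parts.append("<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://example.com/clip.wav"},  # placeholder URL
        {"type": "text", "text": "What's that sound?"},
    ]},
]
chatml_text = render_chatml(conversation)
print(chatml_text)
```

The point of the sketch is only that each turn is wrapped in `<|im_start|>` and `<|im_end|>` markers and that every audio entry contributes exactly one `<|AUDIO|>` placeholder to the text.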

Voice Chat Inference

In voice chat mode, users can freely engage in voice interactions with Qwen2-Audio without text input:

from io import BytesIO
from urllib.request import urlopen
import librosa
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct", device_map="auto")

conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/guess_age_gender.wav"},
    ]},
    {"role": "assistant", "content": "Yes, the speaker is female and in her twenties."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/translate_to_chinese.wav"},
    ]},
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios = []
for message in conversation:
    if isinstance(message["content"], list):
        for ele in message["content"]:
            if ele["type"] == "audio":
                audios.append(librosa.load(
                    BytesIO(urlopen(ele['audio_url']).read()),
                    sr=processor.feature_extractor.sampling_rate)[0]
                )

inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
# Move every tensor in the batch (not only input_ids) to the model's device
inputs = inputs.to(model.device)

generate_ids = model.generate(**inputs, max_length=256)
generate_ids = generate_ids[:, inputs.input_ids.size(1):]

response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
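
The audio-gathering loop in the example above can be factored into a small helper. The sketch below only collects the URLs, in conversation order, so it runs without network access; the helper name `collect_audio_urls` is hypothetical and not part of Transformers. Each URL would then be loaded with `librosa.load` as in the example.

```python
def collect_audio_urls(conversation):
    """Return the audio URLs of a conversation in the order they appear."""
    urls = []
    for message in conversation:
        content = message["content"]
        if not isinstance(content, list):
            continue  # plain-string turns (e.g. assistant replies) carry no audio
        for ele in content:
            if ele["type"] == "audio":
                urls.append(ele["audio_url"])
    return urls


conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/guess_age_gender.wav"},
    ]},
    {"role": "assistant", "content": "Yes, the speaker is female and in her twenties."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/translate_to_chinese.wav"},
    ]},
]
audio_urls = collect_audio_urls(conversation)
print(len(audio_urls))
```

Keeping the order matters because the clips are matched to the audio placeholders in the rendered text one by one.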

Audio Analysis Inference

In audio analysis mode, users can provide both audio and text instructions for analysis:

from io import BytesIO
from urllib.request import urlopen
import librosa
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct", device_map="auto")

conversation = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/glass-breaking-151256.mp3"},
        {"type": "text", "text": "What's that sound?"},
    ]},
    {"role": "assistant", "content": "It is the sound of glass shattering."},
    {"role": "user", "content": [
        {"type": "text", "text": "What can you do when you hear that?"},
    ]},
    {"role": "assistant", "content": "Stay alert and cautious, and check if anyone is hurt or if there is any damage to property."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/1272-128104-0000.flac"},
        {"type": "text", "text": "What does the person say?"},
    ]},
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios = []
for message in conversation:
    if isinstance(message["content"], list):
        for ele in message["content"]:
            if ele["type"] == "audio":
                audios.append(
                    librosa.load(
                        BytesIO(urlopen(ele['audio_url']).read()),
                        sr=processor.feature_extractor.sampling_rate)[0]
                )

inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
# Move every tensor in the batch (not only input_ids) to the model's device
inputs = inputs.to(model.device)

generate_ids = model.generate(**inputs, max_length=256)
generate_ids = generate_ids[:, inputs.input_ids.size(1):]

response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
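
When a conversation mixes several clips with text-only turns, a quick sanity check before calling the processor is to confirm that the rendered text contains one `<|AUDIO|>` placeholder per loaded clip. The sketch below is a precaution under that assumption, not a documented requirement on this page. The helper name `check_audio_alignment`, the hand-written text, and the dummy sample lists are all illustrative.

```python
AUDIO_TOKEN = "<|AUDIO|>"


def check_audio_alignment(text, audios):
    """Return the placeholder count, or raise if it differs from the clip count."""
    n_tokens = text.count(AUDIO_TOKEN)
    if n_tokens != len(audios):
        raise ValueError(
            f"found {n_tokens} {AUDIO_TOKEN} placeholders but {len(audios)} audio clips"
        )
    return n_tokens


# Hand-written stand-in for the output of apply_chat_template (two audio turns).
text = (
    "<|im_start|>user\nAudio 1: <|audio_bos|><|AUDIO|><|audio_eos|>\n"
    "What's that sound?<|im_end|>\n"
    "<|im_start|>user\nAudio 2: <|audio_bos|><|AUDIO|><|audio_eos|>\n"
    "What does the person say?<|im_end|>\n"
)
# Dummy waveforms; real ones would come from librosa.load.
audios = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

n_placeholders = check_audio_alignment(text, audios)
print(n_placeholders)  # 2
```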

Batch Inference

Batch inference is also supported:

from io import BytesIO
from urllib.request import urlopen
import librosa
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct", device_map="auto")

conversation1 = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/glass-breaking-151256.mp3"},
        {"type": "text", "text": "What's that sound?"},
    ]},
    {"role": "assistant", "content": "It is the sound of glass shattering."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/f2641_0_throatclearing.wav"},
        {"type": "text", "text": "What can you hear?"},
    ]}
]

conversation2 = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/1272-128104-0000.flac"},
        {"type": "text", "text": "What does the person say?"},
    ]},
]

conversations = [conversation1, conversation2]

text = [processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False) for conversation in conversations]

audios = []
for conversation in conversations:
    for message in conversation:
        if isinstance(message["content"], list):
            for ele in message["content"]:
                if ele["type"] == "audio":
                    audios.append(
                        librosa.load(
                            BytesIO(urlopen(ele['audio_url']).read()),
                            sr=processor.feature_extractor.sampling_rate)[0]
                    )

inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
# Move every tensor in the batch (not only input_ids) to the model's device
inputs = inputs.to(model.device)

generate_ids = model.generate(**inputs, max_length=256)
generate_ids = generate_ids[:, inputs.input_ids.size(1):]

response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
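
The slice `generate_ids[:, inputs.input_ids.size(1):]` used in these examples drops the prompt from every row so that only newly generated tokens are decoded. The sketch below mimics that step with plain Python lists. The token ids are made up, `0` stands in for a pad id, and the padding side shown is an assumption for illustration only.

```python
def strip_prompt(generated, prompt_length):
    """Drop the first prompt_length token ids from every row of a batch."""
    return [row[prompt_length:] for row in generated]


# Two prompts padded to the same width of 4 (0 is a stand-in pad id).
prompt_batch = [[0, 11, 12, 13], [21, 22, 23, 24]]
# Generation output: each row starts with its padded prompt, then the new tokens.
generated = [[0, 11, 12, 13, 101, 102], [21, 22, 23, 24, 201, 202]]

new_tokens = strip_prompt(generated, prompt_length=len(prompt_batch[0]))
print(new_tokens)  # [[101, 102], [201, 202]]
```

Because all prompts in a batch are padded to one width, a single column offset is enough to separate prompt from response for every row.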

Qwen2AudioConfig

class transformers.Qwen2AudioConfig

< source >

( audio_config = None text_config = None audio_token_index = 151646 **kwargs )

Parameters

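audio_config (Union[AutoConfig, dict], optional, defaults to Qwen2AudioEncoderConfig) — The config object or dictionary of the audio backbone.
text_config (Union[AutoConfig, dict], optional, defaults to Qwen2Config) — The config object or dictionary of the text backbone.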
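audio_token_index (int, optional, defaults to 151646) — The audio token index used to encode the audio prompt.

This is the configuration class to store the configuration of a Qwen2AudioForConditionalGeneration. It is used to instantiate a Qwen2-Audio model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of Qwen2-Audio.

e.g. Qwen/Qwen2-Audio-7B

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example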

>>> from transformers import Qwen2AudioForConditionalGeneration, Qwen2AudioConfig, Qwen2AudioEncoderConfig, Qwen2Config

>>> # Initializing a Qwen2AudioEncoder config
>>> audio_config = Qwen2AudioEncoderConfig()

>>> # Initializing a Qwen2 config
>>> text_config = Qwen2Config()

>>> # Initializing a Qwen2Audio configuration
>>> configuration = Qwen2AudioConfig(audio_config, text_config)

>>> # Initializing a model from the qwen2-audio style configuration
>>> model = Qwen2AudioForConditionalGeneration(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

Qwen2AudioEncoderConfig

class transformers.Qwen2AudioEncoderConfig

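< source >

( num_mel_bins = 128 encoder_layers = 32 encoder_attention_heads = 20 encoder_ffn_dim = 5120 encoder_layerdrop = 0.0 d_model = 1280 dropout = 0.0 attention_dropout = 0.0 activation_function = 'gelu' activation_dropout = 0.0 scale_embedding = False initializer_range = 0.02 max_source_positions = 1500 **kwargs )

Parameters

num_mel_bins (int, optional, defaults to 128) — Number of mel features used per input feature. Should correspond to the value used in the Qwen2AudioProcessor class.
encoder_layers (int, optional, defaults to 32) — Number of encoder layers.
encoder_attention_heads (int, optional, defaults to 20) — Number of attention heads for each attention layer in the Transformer encoder.
encoder_ffn_dim (int, optional, defaults to 5120) — Dimensionality of the "intermediate" (often named feed-forward) layer in the encoder.
encoder_layerdrop (float, optional, defaults to 0.0) — The LayerDrop probability for the encoder. See the LayerDrop paper (https://huggingface.co/papers/1909.11556) for more details.
d_model (int, optional, defaults to 1280) — Dimensionality of the layers.
dropout (float, optional, defaults to 0.0) — The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_dropout (float, optional, defaults to 0.0) — The dropout ratio for the attention probabilities.
activation_function (str, optional, defaults to "gelu") — The non-linear activation function (function or string) in the encoder and pooler. If a string, "gelu", "relu", "silu" and "gelu_new" are supported.
activation_dropout (float, optional, defaults to 0.0) — The dropout ratio for activations inside the fully connected layer.
scale_embedding (bool, optional, defaults to False) — Scale embeddings by dividing by sqrt(d_model).
initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
max_source_positions (int, optional, defaults to 1500) — The maximum sequence length of log-mel filter-bank features that this model might ever be used with.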
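
A little arithmetic shows how these defaults relate to each other. The sketch below derives the per-head dimension and the feed-forward expansion ratio from the signature above, then estimates how much audio `max_source_positions` covers. The 16 kHz sampling rate, the 160-sample hop, and the factor of 2 between mel frames and encoder positions are assumptions (typical of Whisper-style front ends) and are not stated on this page.

```python
# Defaults copied from the Qwen2AudioEncoderConfig signature above.
d_model = 1280
encoder_attention_heads = 20
encoder_ffn_dim = 5120
max_source_positions = 1500

head_dim = d_model // encoder_attention_heads  # width of each attention head
ffn_ratio = encoder_ffn_dim // d_model         # feed-forward expansion factor

# Assumptions, not taken from this page:
sampling_rate = 16_000    # samples per second
hop_length = 160          # samples between consecutive mel frames
frames_per_position = 2   # mel frames merged into one encoder position

mel_frames = max_source_positions * frames_per_position
seconds = mel_frames * hop_length / sampling_rate

print(head_dim, ffn_ratio, mel_frames, seconds)  # 64 4 3000 30.0
```

Under those assumptions, the default position budget corresponds to roughly 30 seconds of audio per clip.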

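This is the configuration class to store the configuration of a Qwen2AudioEncoder. It is used to instantiate a Qwen2-Audio audio encoder according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the audio encoder of the Qwen2-Audio architecture.

e.g. Qwen/Qwen2-Audio-7B

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example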

>>> from transformers import Qwen2AudioEncoderConfig, Qwen2AudioEncoder

>>> # Initializing a Qwen2AudioEncoderConfig
>>> configuration = Qwen2AudioEncoderConfig()

>>> # Initializing a Qwen2AudioEncoder (with random weights)
>>> model = Qwen2AudioEncoder(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

Qwen2AudioProcessor

class transformers.Qwen2AudioProcessor

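< source >

Parameters

feature_extractor (WhisperFeatureExtractor, optional) — The feature extractor is a required input.
tokenizer (Qwen2TokenizerFast, optional) — The tokenizer is a required input.
chat_template (Optional[str], optional) — The Jinja template to use for formatting the conversation. If not provided, the default chat template is used.
audio_token (str, optional, defaults to "<|AUDIO|>") — The token to use for audio tokens.
audio_bos_token (str, optional, defaults to "<|audio_bos|>") — The token to use for audio bos tokens.
audio_eos_token (str, optional, defaults to "<|audio_eos|>") — The token to use for audio eos tokens.

Constructs a Qwen2Audio processor which wraps a Qwen2Audio feature extractor and a Qwen2Audio tokenizer into a single processor.

Qwen2AudioProcessor offers all the functionalities of WhisperFeatureExtractor and Qwen2TokenizerFast. See __call__() and decode() for more information.

batch_decode

< source >

( *args **kwargs )

This method forwards all its arguments to Qwen2TokenizerFast's batch_decode(). Please refer to the docstring of this method for more information.

decode

< source >

( *args **kwargs )

This method forwards all its arguments to Qwen2TokenizerFast's decode(). Please refer to the docstring of this method for more information.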

Qwen2AudioEncoder

class transformers.Qwen2AudioEncoder

< source >

( config: Qwen2AudioEncoderConfig )

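Parameters

config (Qwen2AudioEncoderConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare Qwen2Audio audio model, without any head or projection on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

< source >

( input_features attention_mask = None head_mask = None output_attentions = None output_hidden_states = None return_dict = None )

Parameters

input_features (torch.FloatTensor of shape (batch_size, feature_size, sequence_length)) — Float values of mel features extracted from the raw speech waveform. The raw speech waveform can be obtained by loading a .flac or .wav audio file into an array of type list[float] or a numpy.ndarray, e.g. via the soundfile library (pip install soundfile). To prepare the array into input_features, the AutoFeatureExtractor should be used for extracting the mel features, padding, and conversion into a tensor of type torch.FloatTensor. See __call__().
attention_mask (torch.Tensor, optional) — Qwen2Audio does not support masking of the input_features; this argument is preserved for compatibility but is not used. By default, the silence in the input log-mel spectrogram is ignored.
head_mask (torch.Tensor of shape (encoder_layers, encoder_attention_heads), optional) — Mask to nullify selected heads of the attention modules. Mask values selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.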
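
As an illustration of the `head_mask` layout, the sketch below builds a mask of shape (encoder_layers, encoder_attention_heads) as nested Python lists and switches off two heads in the first layer. The choice of heads is arbitrary. Before being passed to the model, the list would need to be converted to a tensor (for example with `torch.tensor(head_mask)`), and, per the note in the usage tips, it only takes effect with the "eager" attention implementation.

```python
# Default sizes from Qwen2AudioEncoderConfig.
encoder_layers = 32
encoder_attention_heads = 20

# Start with every head active: 1.0 means "not masked".
head_mask = [[1.0] * encoder_attention_heads for _ in range(encoder_layers)]

# Arbitrary example: disable heads 0 and 3 in the first encoder layer (0.0 means "masked").
for head in (0, 3):
    head_mask[0][head] = 0.0

active_in_first_layer = sum(head_mask[0])
print(len(head_mask), len(head_mask[0]), active_in_first_layer)  # 32 20 18.0
```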

Qwen2AudioForConditionalGeneration

class transformers.Qwen2AudioForConditionalGeneration

< source >

( config: Qwen2AudioConfig )

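Parameters

config (Qwen2AudioConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The Qwen2Audio model, which consists of an audio backbone and a language model.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch module and refer to the PyTorch documentation for all matters related to general usage and behavior.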

forward

< source >

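( input_ids: typing.Optional[torch.LongTensor] = None input_features: typing.Optional[torch.FloatTensor] = None attention_mask: typing.Optional[torch.Tensor] = None feature_attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.LongTensor] = None past_key_values: typing.Optional[transformers.cache_utils.Cache] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None labels: typing.Optional[torch.LongTensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.models.qwen2_audio.modeling_qwen2_audio.Qwen2AudioCausalLMOutputWithPast or tuple(torch.FloatTensor)

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default.

Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

What are input IDs?
input_features (torch.FloatTensor of shape (batch_size, feature_size, feature_sequence_length)) — Float values of mel features extracted from the raw speech waveform. The raw speech waveform can be obtained by loading a .flac or .wav audio file into an array of type list[float] or a numpy.ndarray, e.g. via the soundfile library (pip install soundfile). To prepare the array into input_features, the AutoFeatureExtractor should be used for extracting the mel features, padding, and conversion into a tensor of type torch.FloatTensor. See __call__().
attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
What are attention masks?
feature_attention_mask (torch.Tensor of shape (batch_size, feature_sequence_length)) — Mask to avoid performing attention on padding feature indices. Mask values selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].

What are position IDs?
past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden states (keys and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists of the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True.

Two formats are allowed:
- a Cache instance, see our kv cache guide;
- a tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head). This is also known as the legacy cache format.
The model will output the same cache format that is fed as input. If no past_key_values are passed, the legacy cache format will be returned.

If past_key_values are used, the user can optionally input only the last input_ids (those that don't have their past key value states given to this model) of shape (batch_size, 1) instead of all input_ids of shape (batch_size, sequence_length).
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.vocab_size] or -100 (see the input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.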
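
The `-100` convention for `labels` can be illustrated with plain lists. The sketch below masks the prompt portion of a sequence so that only the response tokens would contribute to the loss. Masking the prompt is a common fine-tuning choice rather than something this page prescribes, and the helper name `build_labels` and the token ids are made up for the example.

```python
IGNORE_INDEX = -100  # label value that the loss skips


def build_labels(input_ids, prompt_length):
    """Copy input_ids as labels, replacing the first prompt_length entries with -100."""
    return [IGNORE_INDEX] * prompt_length + list(input_ids[prompt_length:])


# Made-up ids: the first four tokens are the prompt, the last two are the response.
input_ids = [5, 6, 7, 8, 9, 10]
labels = build_labels(input_ids, prompt_length=4)
print(labels)  # [-100, -100, -100, -100, 9, 10]
```

In a real training loop, the resulting list would be converted to a torch.LongTensor with the same shape as input_ids.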

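transformers.models.qwen2_audio.modeling_qwen2_audio.Qwen2AudioCausalLMOutputWithPast or tuple(torch.FloatTensor)

A transformers.models.qwen2_audio.modeling_qwen2_audio.Qwen2AudioCausalLMOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Qwen2AudioConfig) and inputs.

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction).
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — Pre-computed hidden states (the key and value states in the self-attention blocks) that can be used to speed up autoregressive (sequential) decoding. It is a Cache instance.

If past_key_values are used, the user can optionally input only the last input_ids (those that don't have their past key value states given to this model) of shape (batch_size, 1) instead of all input_ids of shape (batch_size, sequence_length).
hidden_states (tuple[torch.FloatTensor], optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple[torch.FloatTensor], optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
attention_mask (torch.FloatTensor, optional) — Attention mask, used to update the attention mask and position_ids.

The Qwen2AudioForConditionalGeneration forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example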

>>> from io import BytesIO
>>> from urllib.request import urlopen
>>> import librosa
>>> from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

>>> model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B")
>>> processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B")

>>> prompt = "<|audio_bos|><|AUDIO|><|audio_eos|>Generate the caption in English:"
>>> url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/glass-breaking-151256.mp3"
>>> audio, _ = librosa.load(BytesIO(urlopen(url).read()), sr=processor.feature_extractor.sampling_rate)

>>> inputs = processor(text=prompt, audios=audio, return_tensors="pt")

>>> # Generate
>>> generate_ids = model.generate(**inputs, max_length=30)
>>> processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
"Generate the caption in English: Glass is breaking."
