Streaming Output with Transformers
Community Article · Published March 15, 2025
Introduction
As AI-powered chatbots advance, interactive learning is becoming more engaging. In this blog, we will explore how to build **streaming output** using Python, Gradio, and a **Qwen-based language model**.
Prerequisites
Before you begin, make sure the following packages are installed:
pip install gradio transformers torch
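As a quick sanity check before running the main script, a small stdlib-only snippet can report which of these dependencies are not yet importable (the helper name `check_missing` is illustrative, not part of the original post):

```python
from importlib.util import find_spec  # stdlib module lookup, no imports triggered

def check_missing(packages):
    """Return the subset of top-level package names that cannot be imported."""
    return [name for name in packages if find_spec(name) is None]

# The three dependencies used in this post
missing = check_missing(["gradio", "transformers", "torch"])
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All dependencies are available.")
```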
Code Implementation
import gradio as gr # Import the Gradio library for creating user interfaces
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer # Import necessary classes from the transformers library
from threading import Thread # Import Thread for concurrent execution
import time # Import time for adding delays
model_name = "unsloth/DeepSeek-R1-Distill-Qwen-1.5B-unsloth-bnb-4bit" # Define the model name or path
# Load the pre-trained model with automatic data type and device mapping
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype="auto",
device_map="auto"
)
# Load the tokenizer associated with the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
def QwenChat(message, history): # Define the QwenChat function
# Construct the messages list with system, history, and user message
messages = [
{"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
]
messages.extend(history) # Add chat history to the messages list
messages.append({"role": "user", "content": message}) # Append the user's message
# Apply chat template to format the messages for the model
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
# Set up the streamer for token generation
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
# Prepare model inputs by tokenizing the text and moving it to the model's device
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
# Set up generation arguments including max tokens and streamer
generation_args = {
"max_new_tokens": 512,
"streamer": streamer,
**model_inputs
}
# Start a separate thread for model generation to allow streaming output
thread = Thread(
target=model.generate,
kwargs=generation_args,
)
thread.start()
# Accumulate and yield text tokens as they are generated
acc_text = ""
for text_token in streamer:
time.sleep(0.01) # Simulate real-time output with a short delay
acc_text += text_token # Append the generated token to the accumulated text
yield acc_text # Yield the accumulated text
# Ensure the generation thread completes
thread.join()
# Create a Gradio chat interface with the QwenChat function
demo = gr.ChatInterface(fn=QwenChat, type="messages")
# Launch the Gradio interface on all available network interfaces
demo.launch(server_name="0.0.0.0")
Features of This AI Tutor
- Real-time responses: words are generated dynamically as the model processes the input.
- Interactive learning: users can practice conversation with the AI tutor.
- Customizable: modify the system prompt to adjust the teaching style.
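To make that customization concrete, the hard-coded system prompt inside `QwenChat` could be pulled out into a parameter. A minimal sketch, where the helper name `build_messages` and the tutor prompt text are illustrative assumptions rather than part of the original code:

```python
def build_messages(message, history, system_prompt):
    """Assemble the messages list used by QwenChat, with a configurable system prompt."""
    messages = [{"role": "system", "content": system_prompt}]  # teaching style lives here
    messages.extend(history)  # prior turns, already in {"role": ..., "content": ...} form
    messages.append({"role": "user", "content": message})  # the new user turn
    return messages

# Swap in a teaching persona instead of the default assistant prompt
tutor_prompt = "You are a patient language tutor. Correct mistakes gently and give short examples."
msgs = build_messages("How do I use 'since' vs 'for'?", [], tutor_prompt)
print(msgs[0]["content"])
```

The resulting list can be passed straight to `tokenizer.apply_chat_template` exactly as in the listing above.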
How It Works
- The user enters a message.
- The system builds a **chat template** that includes the previous conversation.
- The AI model processes the input and generates the response **token by token** in real time.
- The response appears gradually, simulating a natural conversation.
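The accumulate-and-yield pattern behind the last two steps can be isolated from the model entirely. Here is a minimal sketch that streams from a plain Python iterator standing in for the `TextIteratorStreamer`:

```python
def stream_accumulate(token_iter):
    """Yield the growing text after each token, mirroring QwenChat's streaming loop."""
    acc_text = ""
    for token in token_iter:
        acc_text += token  # append the newly generated token
        yield acc_text     # Gradio re-renders the message with each yield

# Fake token stream standing in for the model's streamer
tokens = ["Hel", "lo", ", ", "world", "!"]
for partial in stream_accumulate(tokens):
    print(partial)  # each line shows one progressively longer snapshot
```

Because `gr.ChatInterface` consumes each yielded value as the current state of the reply, yielding the full accumulated string (rather than just the new token) is what makes the message grow in place.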
Conclusion
This approach offers an **engaging way to learn with AI**. By integrating **streaming output**, students experience dynamic, **lifelike interactions** instead of static responses.