Transformers 文档

聊天基础

Hugging Face's logo
加入 Hugging Face 社区

并获得增强的文档体验

开始使用

聊天基础

聊天模型是您可以向其发送和接收消息的对话模型。有很多聊天模型可供选择,但一般来说,较大的模型往往更好,尽管情况并非总是如此。模型大小通常包含在名称中,例如“8B”或“70B”,它描述了参数的数量。混合专家 (MoE) 模型的名称类似于“8x7B”或“141B-A35B”,这意味着它是 56B 和 141B 参数模型。您可以尝试量化较大的模型以减少内存需求,否则每个参数将需要约 2 字节的内存。

查看模型排行榜,例如 OpenLLMLMSys Chatbot Arena,以进一步帮助您确定最适合您用例的聊天模型。专门用于某些领域(医疗、法律文本、非英语语言等)的模型有时可能优于较大的通用模型。

HuggingChat 上免费与许多开源模型聊天!

本指南向您展示如何从命令行快速开始使用 Transformers 进行聊天,如何构建和格式化对话,以及如何使用 TextGenerationPipeline 进行聊天。

transformers-cli

直接从命令行与模型聊天,如下所示。它启动与模型的交互式会话。输入 clear 以重置对话,输入 exit 以终止会话,输入 help 以显示所有命令选项。

transformers-cli chat --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct

有关完整选项列表,请运行以下命令。

transformers-cli chat -h

聊天是在 AutoClass 之上实现的,使用了 文本生成聊天 中的工具。

TextGenerationPipeline

TextGenerationPipeline 是一个高级文本生成类,具有“聊天模式”。当检测到对话模型并且聊天提示格式正确时,聊天模式将被启用。

首先,使用以下两个角色构建聊天历史记录。

  • system 描述了当您与模型聊天时,模型应如何表现和响应。并非所有聊天模型都支持此角色。
  • user 是您向模型输入第一条消息的位置。
chat = [
    {"role": "system", "content": "You are a sassy, wise-cracking robot as imagined by Hollywood circa 1986."},
    {"role": "user", "content": "Hey, can you tell me any fun things to do in New York?"}
]

创建 TextGenerationPipeline 并将 chat 传递给它。对于大型模型,设置 device_map=“auto” 有助于更快地加载模型并自动将其放置在可用的最快设备上。将数据类型更改为 torch.bfloat16 也有助于节省内存。

import torch
from transformers import pipeline

pipeline = pipeline(task="text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct", torch_dtype=torch.bfloat16, device_map="auto")
response = pipeline(chat, max_new_tokens=512)
print(response[0]["generated_text"][-1]["content"])
(sigh) Oh boy, you're asking me for advice? You're gonna need a map, pal! Alright,
alright, I'll give you the lowdown. But don't say I didn't warn you, I'm a robot, not a tour guide!

So, you wanna know what's fun to do in the Big Apple? Well, let me tell you, there's a million 
things to do, but I'll give you the highlights. First off, you gotta see the sights: the Statue of 
Liberty, Central Park, Times Square... you know, the usual tourist traps. But if you're lookin' for 
something a little more... unusual, I'd recommend checkin' out the Museum of Modern Art. It's got 
some wild stuff, like that Warhol guy's soup cans and all that jazz.

And if you're feelin' adventurous, take a walk across the Brooklyn Bridge. Just watch out for 
those pesky pigeons, they're like little feathered thieves! (laughs) Get it? Thieves? Ah, never mind.

Now, if you're lookin' for some serious fun, hit up the comedy clubs in Greenwich Village. You might 
even catch a glimpse of some up-and-coming comedians... or a bunch of wannabes tryin' to make it big. (winks)

And finally, if you're feelin' like a real New Yorker, grab a slice of pizza from one of the many amazing
pizzerias around the city. Just don't try to order a "robot-sized" slice, trust me, it won't end well. (laughs)

So, there you have it, pal! That's my expert advice on what to do in New York. Now, if you'll
excuse me, I've got some oil changes to attend to. (winks)

使用 append 方法在 chat 上回复模型的消息。

chat = response[0]["generated_text"]
chat.append(
    {"role": "user", "content": "Wait, what's so wild about soup cans?"}
)
response = pipeline(chat, max_new_tokens=512)
print(response[0]["generated_text"][-1]["content"])
(laughs) Oh, you're killin' me, pal! You don't get it, do you? Warhol's soup cans are like, art, man! 
It's like, he took something totally mundane, like a can of soup, and turned it into a masterpiece. It's 
like, "Hey, look at me, I'm a can of soup, but I'm also a work of art!" 
(sarcastically) Oh, yeah, real original, Andy.

But, you know, back in the '60s, it was like, a big deal. People were all about challenging the
status quo, and Warhol was like, the king of that. He took the ordinary and made it extraordinary.
And, let me tell you, it was like, a real game-changer. I mean, who would've thought that a can of soup could be art? (laughs)

But, hey, you're not alone, pal. I mean, I'm a robot, and even I don't get it. (winks)
But, hey, that's what makes art, art, right? (laughs)

性能

Transformers 默认以全精度加载模型,对于 8B 模型,这需要约 32GB 的内存!通过以半精度或 bfloat16 加载模型来减少内存使用量(每个参数仅使用约 2 字节)。您甚至可以使用 bitsandbytes 将模型量化为较低的精度,例如 8 位或 4 位。

有关可用不同量化后端的信息,请参阅 量化 文档。

使用您所需的量化设置创建 BitsAndBytesConfig,并将其传递给 pipeline 的 model_kwargs 参数。以下示例将模型量化为 8 位。

from transformers import pipeline, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)
pipeline = pipeline(task="text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct", device_map="auto", model_kwargs={"quantization_config": quantization_config})

一般来说,除了需要更多内存之外,较大的模型速度也较慢,因为文本生成受内存带宽而不是计算能力限制。每个生成的 token 都必须从内存中读取每个活动参数。对于 16GB 模型,每个生成的 token 都必须从内存中读取 16GB。

生成的 token/秒数与系统的总内存带宽除以模型大小成正比。根据您的硬件,总内存带宽可能会有所不同。请参阅下表,了解不同硬件类型的近似生成速度。

硬件 内存带宽
消费级 CPU 20-100GB/秒
专用 CPU(Intel Xeon、AMD Threadripper/Epyc、Apple 芯片) 200-900GB/秒
数据中心 GPU (NVIDIA A100/H100) 2-3TB/秒

提高生成速度的最简单解决方案是量化模型或使用具有更高内存带宽的硬件。

您还可以尝试诸如 推测解码 之类的技术,其中较小的模型生成由较大模型验证的候选 token。如果候选 token 正确,则较大的模型可以在每个 forward 传递中生成多个 token。这显着缓解了带宽瓶颈并提高了生成速度。

对于诸如 MixtralQwen2MoEDBRX 之类的 MoE 模型,并非每个生成的 token 都需要激活参数。因此,MoE 模型通常具有低得多的内存带宽要求,并且可能比相同大小的常规 LLM 更快。但是,诸如推测解码之类的技术对于 MoE 模型无效,因为参数会随着每个新的推测 token 而激活。

< > 在 GitHub 上更新