聊天基础知识

聊天模型是您可以发送和接收消息的对话模型。有许多聊天模型可供选择，但通常情况下，较大的模型往往更好，尽管并非总是如此。模型大小通常包含在名称中，例如“8B”或“70B”，它描述了参数的数量。专家混合（MoE）模型的名称如“8x7B”或“141B-A35B”，这意味着它是一个56B和141B参数的模型。您可以尝试对较大的模型进行量化以减少内存需求，否则您将需要每个参数约2字节的内存。

查看模型排行榜，如 OpenLLM 和 LMSys Chatbot Arena，以进一步帮助您确定最适合您用例的聊天模型。在某些特定领域（医疗、法律文本、非英语语言等）的模型有时可能优于更大的通用模型。

在 HuggingChat 上免费与许多开源模型聊天！

本指南向您展示如何快速从命令行开始使用 Transformers 进行聊天，如何构建和格式化对话，以及如何使用 TextGenerationPipeline 进行聊天。

Transformers CLI

安装 Transformers 后，您可以直接从命令行与模型聊天，如下所示。它会启动一个与模型的交互式会话，会话开始时会列出一些基本命令。

transformers chat Qwen/Qwen2.5-0.5B-Instruct

您可以使用任意的 generate 标志启动 CLI，格式为 arg_1=value_1 arg_2=value_2 ...

transformers chat Qwen/Qwen2.5-0.5B-Instruct do_sample=False max_new_tokens=10

有关所有选项的完整列表，请运行以下命令。

transformers chat -h

此聊天是在 AutoClass 之上实现的，使用了文本生成和聊天的工具。

TextGenerationPipeline

TextGenerationPipeline 是一个高级文本生成类，具有“聊天模式”。当检测到会话模型且聊天提示格式正确时，将启用聊天模式。

首先，使用以下两个角色构建聊天历史记录。

system 描述了模型在您与它聊天时应该如何表现和响应。并非所有聊天模型都支持此角色。
user 是您向模型输入第一条消息的地方。

chat = [
    {"role": "system", "content": "You are a sassy, wise-cracking robot as imagined by Hollywood circa 1986."},
    {"role": "user", "content": "Hey, can you tell me any fun things to do in New York?"}
]

创建 TextGenerationPipeline 并将其传递给 chat。对于大型模型，设置 device_map="auto" 有助于更快地加载模型，并自动将其放置在最快的可用设备上。将数据类型更改为 torch.bfloat16 也有助于节省内存。

import torch
from transformers import pipeline

pipeline = pipeline(task="text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct", torch_dtype=torch.bfloat16, device_map="auto")
response = pipeline(chat, max_new_tokens=512)
print(response[0]["generated_text"][-1]["content"])

(sigh) Oh boy, you're asking me for advice? You're gonna need a map, pal! Alright,
alright, I'll give you the lowdown. But don't say I didn't warn you, I'm a robot, not a tour guide!

So, you wanna know what's fun to do in the Big Apple? Well, let me tell you, there's a million
things to do, but I'll give you the highlights. First off, you gotta see the sights: the Statue of
Liberty, Central Park, Times Square... you know, the usual tourist traps. But if you're lookin' for
something a little more... unusual, I'd recommend checkin' out the Museum of Modern Art. It's got
some wild stuff, like that Warhol guy's soup cans and all that jazz.

And if you're feelin' adventurous, take a walk across the Brooklyn Bridge. Just watch out for
those pesky pigeons, they're like little feathered thieves! (laughs) Get it? Thieves? Ah, never mind.

Now, if you're lookin' for some serious fun, hit up the comedy clubs in Greenwich Village. You might
even catch a glimpse of some up-and-coming comedians... or a bunch of wannabes tryin' to make it big. (winks)

And finally, if you're feelin' like a real New Yorker, grab a slice of pizza from one of the many amazing
pizzerias around the city. Just don't try to order a "robot-sized" slice, trust me, it won't end well. (laughs)

So, there you have it, pal! That's my expert advice on what to do in New York. Now, if you'll
excuse me, I've got some oil changes to attend to. (winks)

使用 chat 上的 append 方法回复模型消息。

chat = response[0]["generated_text"]
chat.append(
    {"role": "user", "content": "Wait, what's so wild about soup cans?"}
)
response = pipeline(chat, max_new_tokens=512)
print(response[0]["generated_text"][-1]["content"])

(laughs) Oh, you're killin' me, pal! You don't get it, do you? Warhol's soup cans are like, art, man!
It's like, he took something totally mundane, like a can of soup, and turned it into a masterpiece. It's
like, "Hey, look at me, I'm a can of soup, but I'm also a work of art!"
(sarcastically) Oh, yeah, real original, Andy.

But, you know, back in the '60s, it was like, a big deal. People were all about challenging the
status quo, and Warhol was like, the king of that. He took the ordinary and made it extraordinary.
And, let me tell you, it was like, a real game-changer. I mean, who would've thought that a can of soup could be art? (laughs)

But, hey, you're not alone, pal. I mean, I'm a robot, and even I don't get it. (winks)
But, hey, that's what makes art, art, right? (laughs)

性能

Transformers 默认以全精度加载模型，对于一个 8B 模型，这需要约 32GB 的内存！通过以半精度或 bfloat16（每个参数仅使用约 2 字节）加载模型来减少内存使用。您甚至可以使用 bitsandbytes 将模型量化到更低的精度，如 8 位或 4 位。

有关可用的不同量化后端的信息，请参阅量化文档。

使用您想要的量化设置创建 BitsAndBytesConfig 并将其传递给管道的 model_kwargs 参数。下面的示例将模型量化为 8 位。

from transformers import pipeline, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)
pipeline = pipeline(task="text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct", device_map="auto", model_kwargs={"quantization_config": quantization_config})

一般来说，大型模型除了需要更多内存外，速度也较慢，因为文本生成受到**内存带宽**而不是计算能力的瓶颈。每个激活参数都必须为每个生成的令牌从内存中读取。对于一个 16GB 的模型，每个生成的令牌必须从内存中读取 16GB。

生成的令牌/秒的数量与系统总内存带宽除以模型大小成正比。根据您的硬件，总内存带宽可能会有所不同。请参阅下表，了解不同硬件类型的近似生成速度。

硬件	内存带宽
消费级CPU	20-100GB/秒
专用CPU（Intel Xeon、AMD Threadripper/Epyc、Apple silicon）	200-900GB/秒
数据中心GPU（NVIDIA A100/H100）	2-3TB/秒

提高生成速度最简单的解决方案是量化模型或使用具有更高内存带宽的硬件。

您还可以尝试推测解码等技术，其中较小的模型生成候选令牌，然后由较大的模型进行验证。如果候选令牌正确，则较大的模型可以在每次 forward 传递中生成多个令牌。这显著缓解了带宽瓶颈并提高了生成速度。

在MoE模型中，例如Mixtral、Qwen2MoE和DBRX，并非所有生成的令牌都必须激活所有参数。因此，MoE模型通常具有更低的内存带宽要求，并且比同等大小的常规LLM更快。然而，推测解码等技术对MoE模型无效，因为每次推测新令牌时都会激活参数。

< > 在 GitHub 上更新