Optimized Inference Deployment
In this section, we will explore advanced frameworks for optimizing LLM deployments: Text Generation Inference (TGI), vLLM, and llama.cpp. These applications are primarily used in production environments to serve LLMs to users. This section focuses on how to deploy these frameworks in production, rather than how to use them for single-machine inference.
We will cover how these tools maximize inference efficiency and simplify production deployments of Large Language Models.
Framework Selection Guide
TGI, vLLM, and llama.cpp serve similar purposes, but each has distinct characteristics that make it better suited to different use cases. Let's look at the key differences between them, focusing on performance and integration.
Memory Management and Performance
TGI is designed to be stable and predictable in production, keeping memory usage consistent by using fixed sequence lengths. It manages memory with Flash Attention 2 and continuous batching: attention is computed very efficiently, and the GPU is kept busy by constantly feeding it work. When needed, the system can move parts of the model between CPU and GPU, which helps with serving larger models.
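To make the continuous-batching idea concrete, here is a minimal toy scheduler (an illustrative sketch, not TGI's actual implementation): sequences that finish free their batch slot immediately, and waiting requests join at the next decode step instead of waiting for the whole batch to drain:
from collections import deque
def continuous_batching(requests, max_batch_size=4):
    """Toy simulation: each request is (request_id, tokens_to_generate)."""
    waiting = deque(requests)
    running = {}  # request_id -> tokens still to generate
    step = 0
    while waiting or running:
        # Admit waiting requests into any free batch slots
        while waiting and len(running) < max_batch_size:
            req_id, length = waiting.popleft()
            running[req_id] = length
        # One decode step advances every running sequence by one token
        for req_id in list(running):
            running[req_id] -= 1
            if running[req_id] == 0:
                print(f"step {step}: request {req_id!r} finished, slot freed")
                del running[req_id]
        step += 1
continuous_batching([("a", 3), ("b", 8), ("c", 2), ("d", 5), ("e", 4)])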

Flash Attention's key innovation is how it manages memory transfers between high-bandwidth memory (HBM) and the much faster SRAM cache. Traditional attention implementations repeatedly shuttle data between HBM and SRAM, creating bottlenecks that leave the GPU idle. Flash Attention loads data into SRAM once and performs all computations there, minimizing expensive memory transfers.
While the benefits are most significant during training, Flash Attention's lower VRAM usage and improved efficiency make it valuable for inference as well, enabling faster and more scalable LLM serving.
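The tiling idea can be sketched in a few lines of NumPy: attention is computed one block of keys/values at a time with a running ("online") softmax, so the full score matrix is never materialized. This is a simplified illustration of the algorithm, not the fused GPU kernel Flash Attention actually uses:
import numpy as np
def blocked_attention(Q, K, V, block_size=64):
    """Single-head attention computed tile by tile with an online softmax."""
    n, d = Q.shape
    out = np.zeros_like(Q)
    row_max = np.full(n, -np.inf)  # running max of scores per query
    row_sum = np.zeros(n)          # running softmax denominator per query
    for start in range(0, K.shape[0], block_size):
        Kb, Vb = K[start:start + block_size], V[start:start + block_size]
        scores = Q @ Kb.T / np.sqrt(d)                  # scores for this tile only
        new_max = np.maximum(row_max, scores.max(axis=1))
        correction = np.exp(row_max - new_max)          # rescale previously accumulated state
        probs = np.exp(scores - new_max[:, None])
        out = out * correction[:, None] + probs @ Vb
        row_sum = row_sum * correction + probs.sum(axis=1)
        row_max = new_max
    return out / row_sum[:, None]
# Matches naive attention while only ever holding one tile of scores in memory
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 128, 64))
weights = np.exp(Q @ K.T / np.sqrt(64))
naive = (weights / weights.sum(axis=1, keepdims=True)) @ V
print(np.allclose(blocked_attention(Q, K, V), naive))  # True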
vLLM takes a different approach, using PagedAttention. Just as a computer manages its memory in pages, vLLM splits the model's KV cache into smaller blocks. This scheme lets it handle requests of different sizes flexibly without wasting memory. It is particularly good at sharing memory between requests and reducing fragmentation, which makes the whole system more efficient.
vLLM's key innovation is in how it manages this cache (a toy sketch follows the list below):
- Memory paging: instead of treating the KV cache as one large block, it is divided into fixed-size "pages" (similar to virtual memory in operating systems).
- Non-contiguous storage: pages do not need to be stored contiguously in GPU memory, allowing more flexible allocation.
- Page table management: a page table tracks which pages belong to which sequence, enabling efficient lookup and access.
- Memory sharing: for operations such as parallel sampling, the pages holding the prompt's KV cache can be shared across multiple sequences.
Compared with traditional approaches, PagedAttention can improve throughput by up to 24x, making it a game changer for production LLM deployments. If you want to dig deeper into how PagedAttention works, see the guide in the vLLM documentation.
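The mechanics are easiest to see in a toy model. The sketch below (purely illustrative, not vLLM's actual code) keeps a pool of fixed-size physical blocks plus a per-sequence page table that maps logical token positions to whichever blocks happen to be free:
BLOCK_SIZE = 16  # tokens per physical block (vLLM's default block_size)
class ToyBlockManager:
    """Maps each sequence's logical KV-cache positions to physical blocks."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # pool of free physical block ids
        self.tables = {}                     # seq_id -> page table (list of block ids)
        self.lengths = {}                    # seq_id -> tokens cached so far
    def append_token(self, seq_id):
        """Reserve KV-cache space for one more generated token."""
        length = self.lengths.get(seq_id, 0)
        table = self.tables.setdefault(seq_id, [])
        if length % BLOCK_SIZE == 0:       # current block is full (or the sequence is new)
            table.append(self.free.pop())  # any free block will do: no contiguity required
        self.lengths[seq_id] = length + 1
    def lookup(self, seq_id, position):
        """Translate a logical token position into (physical block, offset)."""
        return self.tables[seq_id][position // BLOCK_SIZE], position % BLOCK_SIZE
manager = ToyBlockManager(num_blocks=8)
for _ in range(20):  # a 20-token sequence only needs two 16-token blocks
    manager.append_token(seq_id=0)
print(manager.tables[0], manager.lookup(0, position=17))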
llama.cpp is a highly optimized C/C++ implementation originally designed for running LLaMA models on consumer hardware. It focuses on CPU efficiency with optional GPU acceleration, which makes it ideal for resource-constrained environments. llama.cpp uses quantization to reduce model size and memory requirements while maintaining good performance. It implements optimized kernels for a range of CPU architectures and supports basic KV-cache management for efficient token generation.
Key quantization features in llama.cpp include:
- Multiple quantization levels: supports 8-bit, 4-bit, 3-bit, and even 2-bit quantization
- GGML/GGUF format: uses custom tensor formats optimized for quantized inference
- Mixed precision: different quantization levels can be applied to different parts of the model
- Hardware-specific optimizations: includes optimized code paths for various CPU architectures (AVX2, AVX-512, NEON)
This approach makes it possible to run billion-parameter models on consumer hardware with limited memory, making it well suited to local deployments and edge devices.
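To see why quantization shrinks memory so dramatically, here is a toy block-wise 4-bit quantizer in NumPy. It follows the spirit of GGML's Q4 formats (one shared scale per small block of weights) but is a deliberate simplification; the real formats differ in packing, offsets, and per-type details:
import numpy as np
def quantize_q4_blocks(weights, block_size=32):
    """Toy symmetric 4-bit quantization: one fp16 scale per block of 32 weights."""
    blocks = weights.reshape(-1, block_size)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / 7.0  # map values into the int4 range [-8, 7]
    q = np.clip(np.round(blocks / scale), -8, 7).astype(np.int8)
    return q, scale.astype(np.float16)
def dequantize(q, scale):
    return q.astype(np.float32) * scale
w = np.random.randn(1024, 1024).astype(np.float32)
q, scale = quantize_q4_blocks(w)
fp32_bytes = w.nbytes
q4_bytes = w.size // 2 + scale.size * 2  # 4 bits per weight + one fp16 scale per block
print(f"fp32: {fp32_bytes / 1e6:.2f} MB -> ~Q4: {q4_bytes / 1e6:.2f} MB ({fp32_bytes / q4_bytes:.1f}x smaller)")
print("max abs reconstruction error:", np.abs(dequantize(q, scale) - w.reshape(-1, 32)).max())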
Deployment and Integration
Let's move on to how the frameworks differ in deployment and integration.
TGI excels at enterprise-grade deployments thanks to its production-ready feature set. It ships with built-in Kubernetes support and everything you need to run in production, including monitoring via Prometheus and Grafana, automatic scaling, and comprehensive safety features. It also provides enterprise-grade logging and protective measures such as content filtering and rate limiting to keep your deployment secure and stable.
vLLM takes a more flexible, developer-oriented approach to deployment. It is built with Python at its core and can act as a drop-in replacement for the OpenAI API in existing applications. The framework focuses on raw performance and can be tuned to your specific needs. It works particularly well with Ray for cluster management, making it a great choice when you need both high performance and adaptability.
llama.cpp prioritizes simplicity and portability. Its server implementation is lightweight and runs on a wide range of hardware, from powerful servers to consumer laptops and even some high-end mobile devices. With minimal dependencies and a simple C/C++ core, it is easy to deploy in environments where installing Python frameworks would be difficult. The server offers an OpenAI-compatible API with a smaller resource footprint than the other solutions.
Getting Started
Let's walk through how to deploy LLMs with each framework, starting with installation and basic setup.
Installation and Basic Setup
TGI is easy to install and use, with deep integration into the Hugging Face ecosystem.
First, launch the TGI server with Docker:
docker run --gpus all \
    --shm-size 1g \
    -p 8080:80 \
    -v ~/.cache/huggingface:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id HuggingFaceTB/SmolLM2-360M-Instruct
Then interact with it using Hugging Face's InferenceClient:
from huggingface_hub import InferenceClient
# Initialize client pointing to TGI endpoint
client = InferenceClient(
model="https://:8080", # URL to the TGI server
)
# Text generation
response = client.text_generation(
"Tell me a story",
max_new_tokens=100,
temperature=0.7,
top_p=0.95,
details=True,
stop_sequences=[],
)
print(response.generated_text)
# For chat format
response = client.chat_completion(
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Tell me a story"},
],
max_tokens=100,
temperature=0.7,
top_p=0.95,
)
print(response.choices[0].message.content)
Alternatively, you can use the OpenAI client:
from openai import OpenAI
# Initialize client pointing to TGI endpoint
client = OpenAI(
base_url="https://:8080/v1", # Make sure to include /v1
api_key="not-needed", # TGI doesn't require an API key by default
)
# Chat completion
response = client.chat.completions.create(
model="HuggingFaceTB/SmolLM2-360M-Instruct",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Tell me a story"},
],
max_tokens=100,
temperature=0.7,
top_p=0.95,
)
print(response.choices[0].message.content)
llama.cpp is easy to install and use, requires minimal dependencies, and supports both CPU and GPU inference.
First, install and build llama.cpp:
# Clone the repository
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Build the project
make
# Download the SmolLM2-1.7B-Instruct-GGUF model
curl -L -O https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct-GGUF/resolve/main/smollm2-1.7b-instruct.Q4_K_M.gguf
Then, launch the server (with OpenAI API compatibility):
# Start the server
./server \
-m smollm2-1.7b-instruct.Q4_K_M.gguf \
--host 0.0.0.0 \
--port 8080 \
-c 4096 \
--n-gpu-layers 0 # Set to a higher number to use GPU
Interact with the server using Hugging Face's InferenceClient:
from huggingface_hub import InferenceClient
# Initialize client pointing to llama.cpp server
client = InferenceClient(
model="https://:8080/v1", # URL to the llama.cpp server
token="sk-no-key-required", # llama.cpp server requires this placeholder
)
# Text generation
response = client.text_generation(
"Tell me a story",
max_new_tokens=100,
temperature=0.7,
top_p=0.95,
details=True,
)
print(response.generated_text)
# For chat format
response = client.chat_completion(
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Tell me a story"},
],
max_tokens=100,
temperature=0.7,
top_p=0.95,
)
print(response.choices[0].message.content)
Alternatively, you can use the OpenAI client:
from openai import OpenAI
# Initialize client pointing to llama.cpp server
client = OpenAI(
base_url="https://:8080/v1",
api_key="sk-no-key-required", # llama.cpp server requires this placeholder
)
# Chat completion
response = client.chat.completions.create(
model="smollm2-1.7b-instruct", # Model identifier can be anything as server only loads one model
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Tell me a story"},
],
max_tokens=100,
temperature=0.7,
top_p=0.95,
)
print(response.choices[0].message.content)
vLLM is easy to install and use, offers OpenAI API compatibility, and provides a native Python interface.
First, launch the vLLM OpenAI-compatible server:
python -m vllm.entrypoints.openai.api_server \
    --model HuggingFaceTB/SmolLM2-360M-Instruct \
    --host 0.0.0.0 \
    --port 8000
Then interact with it using Hugging Face's InferenceClient:
from huggingface_hub import InferenceClient
# Initialize client pointing to vLLM endpoint
client = InferenceClient(
model="https://:8000/v1", # URL to the vLLM server
)
# Text generation
response = client.text_generation(
"Tell me a story",
max_new_tokens=100,
temperature=0.7,
top_p=0.95,
details=True,
)
print(response.generated_text)
# For chat format
response = client.chat_completion(
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Tell me a story"},
],
max_tokens=100,
temperature=0.7,
top_p=0.95,
)
print(response.choices[0].message.content)
Alternatively, you can use the OpenAI client:
from openai import OpenAI
# Initialize client pointing to vLLM endpoint
client = OpenAI(
base_url="https://:8000/v1",
api_key="not-needed", # vLLM doesn't require an API key by default
)
# Chat completion
response = client.chat.completions.create(
model="HuggingFaceTB/SmolLM2-360M-Instruct",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Tell me a story"},
],
max_tokens=100,
temperature=0.7,
top_p=0.95,
)
print(response.choices[0].message.content)
Basic Text Generation
Let's look at text generation examples with each of these frameworks.
First, deploy TGI with advanced parameters:
docker run --gpus all \
    --shm-size 1g \
    -p 8080:80 \
    -v ~/.cache/huggingface:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id HuggingFaceTB/SmolLM2-360M-Instruct \
    --max-total-tokens 4096 \
    --max-input-length 3072 \
    --max-batch-total-tokens 8192 \
    --waiting-served-ratio 1.2
Use the InferenceClient for flexible text generation:
from huggingface_hub import InferenceClient
client = InferenceClient(model="http://localhost:8080")
# Advanced parameters example
response = client.chat_completion(
messages=[
{"role": "system", "content": "You are a creative storyteller."},
{"role": "user", "content": "Write a creative story"},
],
temperature=0.8,
max_tokens=200,
top_p=0.95,
)
print(response.choices[0].message.content)
# Raw text generation
response = client.text_generation(
"Write a creative story about space exploration",
max_new_tokens=200,
temperature=0.8,
top_p=0.95,
repetition_penalty=1.1,
do_sample=True,
details=True,
)
print(response.generated_text)
Or use the OpenAI client:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
# Advanced parameters example
response = client.chat.completions.create(
model="HuggingFaceTB/SmolLM2-360M-Instruct",
messages=[
{"role": "system", "content": "You are a creative storyteller."},
{"role": "user", "content": "Write a creative story"},
],
temperature=0.8, # Higher for more creativity
)
print(response.choices[0].message.content)
For llama.cpp, you can set advanced parameters when launching the server:
# -c: context size, --threads: CPU threads to use,
# --batch-size: batch size for prompt evaluation, --n-gpu-layers: GPU layers (0 = CPU only)
./server \
    -m smollm2-1.7b-instruct.Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -c 4096 \
    --threads 8 \
    --batch-size 512 \
    --n-gpu-layers 0
Using the InferenceClient:
from huggingface_hub import InferenceClient
client = InferenceClient(model="http://localhost:8080/v1", token="sk-no-key-required")
# Advanced parameters example
response = client.chat_completion(
messages=[
{"role": "system", "content": "You are a creative storyteller."},
{"role": "user", "content": "Write a creative story"},
],
temperature=0.8,
max_tokens=200,
top_p=0.95,
)
print(response.choices[0].message.content)
# For direct text generation
response = client.text_generation(
"Write a creative story about space exploration",
max_new_tokens=200,
temperature=0.8,
top_p=0.95,
repetition_penalty=1.1,
details=True,
)
print(response.generated_text)
Or use the OpenAI client for generation with control over the sampling parameters:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")
# Advanced parameters example
response = client.chat.completions.create(
model="smollm2-1.7b-instruct",
messages=[
{"role": "system", "content": "You are a creative storyteller."},
{"role": "user", "content": "Write a creative story"},
],
temperature=0.8, # Higher for more creativity
top_p=0.95, # Nucleus sampling probability
frequency_penalty=0.5, # Reduce repetition of frequent tokens
presence_penalty=0.5, # Reduce repetition by penalizing tokens already present
max_tokens=200, # Maximum generation length
)
print(response.choices[0].message.content)
You can also use llama.cpp's native library for even more control:
# Using llama-cpp-python package for direct model access
from llama_cpp import Llama
# Load the model
llm = Llama(
model_path="smollm2-1.7b-instruct.Q4_K_M.gguf",
n_ctx=4096, # Context window size
n_threads=8, # CPU threads
n_gpu_layers=0, # GPU layers (0 = CPU only)
)
# Format prompt according to the model's expected format
prompt = """<|im_start|>system
You are a creative storyteller.
<|im_end|>
<|im_start|>user
Write a creative story
<|im_end|>
<|im_start|>assistant
"""
# Generate response with precise parameter control
output = llm(
prompt,
max_tokens=200,
temperature=0.8,
top_p=0.95,
frequency_penalty=0.5,
presence_penalty=0.5,
stop=["<|im_end|>"],
)
print(output["choices"][0]["text"])
For advanced usage with vLLM, you can use the InferenceClient:
from huggingface_hub import InferenceClient
client = InferenceClient(model="http://localhost:8000/v1")
# Advanced parameters example
response = client.chat_completion(
messages=[
{"role": "system", "content": "You are a creative storyteller."},
{"role": "user", "content": "Write a creative story"},
],
temperature=0.8,
max_tokens=200,
top_p=0.95,
)
print(response.choices[0].message.content)
# For direct text generation
response = client.text_generation(
"Write a creative story about space exploration",
max_new_tokens=200,
temperature=0.8,
top_p=0.95,
details=True,
)
print(response.generated_text)
You can also use the OpenAI client:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
# Advanced parameters example
response = client.chat.completions.create(
model="HuggingFaceTB/SmolLM2-360M-Instruct",
messages=[
{"role": "system", "content": "You are a creative storyteller."},
{"role": "user", "content": "Write a creative story"},
],
temperature=0.8,
top_p=0.95,
max_tokens=200,
)
print(response.choices[0].message.content)
vLLM also provides a native Python interface with fine-grained control:
from vllm import LLM, SamplingParams
# Initialize the model with advanced parameters
llm = LLM(
model="HuggingFaceTB/SmolLM2-360M-Instruct",
gpu_memory_utilization=0.85,
max_num_batched_tokens=8192,
max_num_seqs=256,
block_size=16,
)
# Configure sampling parameters
sampling_params = SamplingParams(
temperature=0.8, # Higher for more creativity
top_p=0.95, # Consider top 95% probability mass
max_tokens=100, # Maximum length
presence_penalty=1.1, # Reduce repetition
frequency_penalty=1.1, # Reduce repetition
stop=["\n\n", "###"], # Stop sequences
)
# Generate text
prompt = "Write a creative story"
outputs = llm.generate(prompt, sampling_params)
print(outputs[0].outputs[0].text)
# For chat-style interactions
chat_prompt = [
{"role": "system", "content": "You are a creative storyteller."},
{"role": "user", "content": "Write a creative story"},
]
outputs = llm.chat(chat_prompt, sampling_params)  # Applies the model's chat template (recent vLLM versions)
print(outputs[0].outputs[0].text)
Advanced Generation Control
Token Selection and Sampling
Text generation involves selecting the next token at each step. This selection process can be controlled through several parameters:
- Raw logits: the model's initial output scores for each token
- Temperature: controls randomness in selection (higher = more creative)
- Top-p (nucleus) sampling: filters to the top tokens that together make up X% of the probability mass
- Top-k filtering: limits selection to the k most likely tokens
Here's how to configure these parameters:
# Via the Hugging Face InferenceClient (TGI)
client.text_generation(
"Write a creative story",
temperature=0.8, # Higher for more creativity
top_p=0.95, # Consider top 95% probability mass
top_k=50, # Consider top 50 tokens
max_new_tokens=100, # Maximum length
repetition_penalty=1.1, # Reduce repetition
)
# Via OpenAI API compatibility
response = client.completions.create(
model="smollm2-1.7b-instruct", # Model name (can be any string for llama.cpp server)
prompt="Write a creative story",
temperature=0.8, # Higher for more creativity
top_p=0.95, # Consider top 95% probability mass
frequency_penalty=1.1, # Reduce repetition
presence_penalty=0.1, # Reduce repetition
max_tokens=100, # Maximum length
)
# Via llama-cpp-python direct access
output = llm(
"Write a creative story",
temperature=0.8,
top_p=0.95,
top_k=50,
max_tokens=100,
repeat_penalty=1.1,
)
# Via vLLM SamplingParams
params = SamplingParams(
temperature=0.8, # Higher for more creativity
top_p=0.95, # Consider top 95% probability mass
top_k=50, # Consider top 50 tokens
max_tokens=100, # Maximum length
presence_penalty=0.1, # Reduce repetition
)
llm.generate("Write a creative story", sampling_params=params)
Controlling Repetition
These frameworks provide several ways to prevent repetitive text generation:
# Via OpenAI API
response = client.completions.create(
model="smollm2-1.7b-instruct",
prompt="Write a varied text",
frequency_penalty=1.1, # Penalize frequent tokens
presence_penalty=0.8, # Penalize tokens already present
)
# Via direct library
output = llm(
"Write a varied text",
repeat_penalty=1.1, # Penalize repeated tokens
frequency_penalty=0.5, # Additional frequency penalty
presence_penalty=0.5, # Additional presence penalty
)
# Via vLLM SamplingParams
params = SamplingParams(
presence_penalty=0.1, # Penalize token presence
frequency_penalty=0.1, # Penalize token frequency
)
Length Control and Stop Sequences
You can control the generation length and specify when to stop:
# Via OpenAI API
response = client.completions.create(
model="smollm2-1.7b-instruct",
prompt="Generate a short paragraph",
max_tokens=100,
stop=["\n\n", "###"],
)
# Via direct library
output = llm("Generate a short paragraph", max_tokens=100, stop=["\n\n", "###"])
# Via vLLM SamplingParams
params = SamplingParams(
max_tokens=100,
min_tokens=10,
stop=["###", "\n\n"],
ignore_eos=False,
skip_special_tokens=True,
)
Memory Management
These frameworks all implement advanced memory-management techniques for efficient inference. TGI uses Flash Attention 2 and continuous batching, configured through its launcher flags:
# Docker deployment with memory optimization
docker run --gpus all -p 8080:80 \
--shm-size 1g \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id HuggingFaceTB/SmolLM2-1.7B-Instruct \
--max-batch-total-tokens 8192 \
--max-input-length 4096
llama.cpp uses quantization and optimized memory layouts:
# Server with memory optimizations
# -c: context size, --threads: CPU threads, --n-gpu-layers: layers offloaded to the GPU,
# --mlock: lock memory to prevent swapping, --cont-batching: enable continuous batching
./server \
    -m smollm2-1.7b-instruct.Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -c 2048 \
    --threads 4 \
    --n-gpu-layers 32 \
    --mlock \
    --cont-batching
For models too large to fit on the GPU, you can use CPU offloading:
# --n-gpu-layers 20: keep the first 20 layers on the GPU; the remaining layers run on the CPU,
# so give them more threads with --threads 8
./server \
    -m smollm2-1.7b-instruct.Q4_K_M.gguf \
    --n-gpu-layers 20 \
    --threads 8
vLLM uses PagedAttention for optimized memory management:
from vllm import LLM
# Configure the engine with memory-focused parameters
llm = LLM(
    model="HuggingFaceTB/SmolLM2-1.7B-Instruct",
    gpu_memory_utilization=0.85,  # Fraction of GPU memory vLLM may use for weights and KV cache
    max_num_batched_tokens=8192,  # Upper bound on tokens processed per scheduling step
    block_size=16,                # PagedAttention block size (tokens per page)
)
Resources
- Text Generation Inference Documentation
- TGI GitHub Repository
- vLLM Documentation
- vLLM GitHub Repository
- PagedAttention Paper
- llama.cpp GitHub Repository
- llama-cpp-python Repository