
Optimized Inference Deployment

In this section, we'll explore advanced frameworks for optimizing LLM deployments: Text Generation Inference (TGI), vLLM, and llama.cpp. These applications are primarily used in production environments to serve LLMs to users. This section focuses on how to deploy these frameworks in production, not on how to use them for inference on a single machine.

We'll cover how these tools maximize inference efficiency and simplify production deployments of Large Language Models.

Framework Selection Guide

TGI, vLLM, and llama.cpp serve similar purposes, but each has distinct characteristics that make it better suited to different use cases. Let's look at the key differences between them, focusing on performance and integration.

Memory Management and Performance

TGI is designed to be stable and predictable in production, using fixed sequence lengths to keep memory usage consistent. TGI manages memory with Flash Attention 2 and continuous batching, which means it can perform attention computations very efficiently and keep the GPU busy by continuously feeding it work. The system can also move parts of the model between CPU and GPU when needed, which helps with handling larger models.
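
To build intuition for continuous batching, here is a deliberately simplified Python sketch (not TGI's actual scheduler): requests join the running batch as soon as a slot frees up instead of waiting for the whole batch to finish.

from collections import deque

# Toy continuous-batching loop: each step generates one token for every active request.
# Finished requests leave the batch immediately and queued requests take their place,
# so the accelerator never sits idle waiting for the longest sequence in a batch.
queue = deque([("req-1", 3), ("req-2", 6), ("req-3", 2), ("req-4", 4)])  # (id, tokens left to generate)
active = {}
max_batch_size = 2
step = 0

while queue or active:
    # Admit new requests whenever the batch has a free slot
    while queue and len(active) < max_batch_size:
        req_id, remaining = queue.popleft()
        active[req_id] = remaining
    # One decoding step for every active request
    for req_id in list(active):
        active[req_id] -= 1
        if active[req_id] == 0:
            print(f"step {step}: {req_id} finished")
            del active[req_id]
    step += 1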

Flash Attention
Flash Attention is a technique that optimizes the attention mechanism in transformer models by addressing memory bandwidth bottlenecks. As discussed in [Chapter 1.8](/course/chapter1/8), the attention mechanism has quadratic complexity and memory usage, making it inefficient for long sequences.

The key innovation lies in how it manages memory transfers between high-bandwidth memory (HBM) and the faster SRAM cache. Traditional attention repeatedly transfers data between HBM and SRAM, creating bottlenecks by leaving the GPU idle. Flash Attention loads data into SRAM once and performs all computations there, minimizing expensive memory transfers.

While the benefits are most significant during training, Flash Attention's reduced VRAM usage and improved efficiency make it valuable for inference as well, enabling faster and more scalable LLM serving.
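
Outside of TGI, you can also request Flash Attention 2 directly when loading a model with Hugging Face Transformers. A minimal sketch, assuming a CUDA GPU, a half-precision dtype, and an installed flash-attn package:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM2-360M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# attn_implementation="flash_attention_2" asks Transformers to use the fused
# Flash Attention 2 kernels instead of the default attention implementation.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

inputs = tokenizer("Tell me a story", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))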

vLLM takes a different approach, using PagedAttention. Just as a computer manages its memory in pages, vLLM splits the model's memory into smaller blocks. This system can handle requests of different sizes more flexibly and avoids wasting memory. It is particularly good at sharing memory between different requests and reducing memory fragmentation, which makes the whole system much more efficient.

PagedAttention is a technique that addresses another critical bottleneck in LLM inference: KV cache memory management. As discussed in [Chapter 1.8](/course/chapter1/8), during text generation the model stores attention keys and values (the KV cache) for each generated token to reduce redundant computation. The KV cache can become enormous, especially with long sequences or multiple concurrent requests.
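
To see why the KV cache dominates memory, here is a rough back-of-the-envelope estimate in Python; the layer and head counts below are illustrative placeholders, not the configuration of any particular model:

# Rough KV-cache size: 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes per value
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

# Illustrative 7B-class configuration (placeholder values) at fp16
per_request = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
print(f"~{per_request / 1e9:.1f} GB per 4K-token request")              # roughly 2.1 GB
print(f"~{16 * per_request / 1e9:.1f} GB for 16 concurrent requests")   # grows linearly with batch size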

vLLM's key innovation lies in how it manages this cache:

  1. Memory Paging: Instead of treating the KV cache as one large block, it is divided into fixed-size "pages" (similar to virtual memory in operating systems).
  2. Non-contiguous Storage: Pages don't need to be stored contiguously in GPU memory, allowing more flexible memory allocation.
  3. Page Table Management: A page table tracks which pages belong to which sequence, enabling efficient lookup and access.
  4. Memory Sharing: For operations like parallel sampling, pages storing the prompt's KV cache can be shared across multiple sequences.

Compared to traditional approaches, PagedAttention can deliver up to 24x higher throughput, making it a game-changer for production LLM deployments. If you want to go deeper into how PagedAttention works, you can read the guide in the vLLM documentation.
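
As an illustration of the paging idea (a toy sketch, not vLLM's actual implementation), the following Python class hands out fixed-size pages from a pool and keeps a per-sequence page table, so sequences can grow without contiguous memory and can share prompt pages:

BLOCK_SIZE = 16  # tokens per page, similar in spirit to vLLM's block_size

class PagedKVCache:
    """Toy page-table bookkeeping for a paged KV cache (illustration only)."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))  # pool of physical pages
        self.page_tables = {}                       # seq_id -> list of physical block ids
        self.seq_lens = {}                          # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve space for one more KV entry, allocating a new page on demand."""
        length = self.seq_lens.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:  # current page is full (or this is the first token)
            self.page_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def fork(self, parent_id, child_id):
        """Share the parent's prompt pages with a child sequence (parallel sampling)."""
        self.page_tables[child_id] = list(self.page_tables[parent_id])
        self.seq_lens[child_id] = self.seq_lens[parent_id]

cache = PagedKVCache(num_blocks=64)
for _ in range(40):                            # a 40-token prompt occupies 3 pages
    cache.append_token("request-A")
cache.fork("request-A", "request-A/sample-2")  # both samples reuse the prompt pages
print(cache.page_tables)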

llama.cpp is a highly optimized C/C++ implementation originally designed for running LLaMA models on consumer hardware. It focuses on CPU efficiency with optional GPU acceleration, making it ideal for resource-constrained environments. llama.cpp uses quantization techniques to reduce model size and memory requirements while maintaining good performance. It implements optimized kernels for various CPU architectures and supports basic KV cache management for efficient token generation.

Quantization in llama.cpp reduces model weights from 32-bit or 16-bit floating point to lower-precision formats such as 8-bit integers (INT8), 4-bit, or even lower. This significantly reduces memory usage and improves inference speed with minimal quality loss.

Key quantization features in llama.cpp include:

  1. Multiple quantization levels: Supports 8-bit, 4-bit, 3-bit, and even 2-bit quantization
  2. GGML/GGUF format: Uses custom tensor formats optimized for quantized inference
  3. Mixed precision: Can apply different quantization levels to different parts of the model
  4. Hardware-specific optimizations: Includes optimized code paths for various CPU architectures (AVX2, AVX-512, NEON)

This approach makes it possible to run billion-parameter models on consumer hardware with limited memory, making it perfect for local deployments and edge devices.
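
For intuition about what quantization does, here is a minimal sketch of symmetric quantization in NumPy; real GGUF types such as Q4_K_M use more elaborate block-wise schemes with per-block scales:

import numpy as np

def quantize(weights, bits=8):
    """Symmetric per-tensor quantization: map floats to signed integers plus one scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    scale = np.abs(weights).max() / qmax
    # The 4-bit case is stored in an int8 container here purely for illustration
    q = np.clip(np.round(weights / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

weights = np.random.randn(4096).astype(np.float32)
for bits in (8, 4):
    q, scale = quantize(weights, bits)
    error = np.abs(weights - dequantize(q, scale)).mean()
    print(f"{bits}-bit: {32 // bits}x smaller than fp32 in principle, mean abs error {error:.4f}")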

Deployment and Integration

Let's move on to the deployment and integration differences between the frameworks.

TGI excels at enterprise-grade deployments with its production-ready features. It comes with built-in Kubernetes support and includes everything you need for running in production, such as monitoring through Prometheus and Grafana, automatic scaling, and comprehensive safety features. The system also includes enterprise-grade logging and various protective measures such as content filtering and rate limiting to keep your deployment secure and stable.

vLLM takes a more flexible, developer-friendly approach to deployment. It is built with Python at its core and can easily act as a drop-in replacement for the OpenAI API in your existing applications. The framework focuses on delivering raw performance and can be customized to your specific needs. It works particularly well with Ray for managing clusters, making it a great choice when you need high performance and adaptability.

llama.cpp prioritizes simplicity and portability. Its server implementation is lightweight and can run on a wide range of hardware, from powerful servers to consumer laptops and even some high-end mobile devices. With minimal dependencies and a simple C/C++ core, it is easy to deploy in environments where installing Python frameworks would be difficult. The server provides an OpenAI-compatible API while keeping a smaller resource footprint than the other solutions.

Getting Started

Let's explore how to deploy LLMs with these frameworks, starting with installation and basic setup.

Installation and Basic Setup

<hfoption value="tgi" label="TGI">

TGI is easy to install and use, with deep integration into the Hugging Face ecosystem.

First, launch the TGI server using Docker:

docker run --gpus all \
    --shm-size 1g \
    -p 8080:80 \
    -v ~/.cache/huggingface:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id HuggingFaceTB/SmolLM2-360M-Instruct

Then interact with it using Hugging Face's InferenceClient:

from huggingface_hub import InferenceClient

# Initialize client pointing to TGI endpoint
client = InferenceClient(
    model="https://:8080",  # URL to the TGI server
)

# Text generation
response = client.text_generation(
    "Tell me a story",
    max_new_tokens=100,
    temperature=0.7,
    top_p=0.95,
    details=True,
    stop_sequences=[],
)
print(response.generated_text)

# For chat format
response = client.chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a story"},
    ],
    max_tokens=100,
    temperature=0.7,
    top_p=0.95,
)
print(response.choices[0].message.content)

Alternatively, you can use the OpenAI client:

from openai import OpenAI

# Initialize client pointing to TGI endpoint
client = OpenAI(
    base_url="https://:8080/v1",  # Make sure to include /v1
    api_key="not-needed",  # TGI doesn't require an API key by default
)

# Chat completion
response = client.chat.completions.create(
    model="HuggingFaceTB/SmolLM2-360M-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a story"},
    ],
    max_tokens=100,
    temperature=0.7,
    top_p=0.95,
)
print(response.choices[0].message.content)
</hfoption> <hfoption value="llama.cpp" label="llama.cpp">

llama.cpp is easy to install and use, requiring minimal dependencies and supporting both CPU and GPU inference.

First, install and build llama.cpp:

# Clone the repository
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build the project
make

# Download the SmolLM2-1.7B-Instruct-GGUF model
curl -L -O https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct-GGUF/resolve/main/smollm2-1.7b-instruct.Q4_K_M.gguf

Then, start the server (which is OpenAI API compatible):

# Start the server
./server \
    -m smollm2-1.7b-instruct.Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -c 4096 \
    --n-gpu-layers 0  # Set to a higher number to use GPU

Interact with the server using Hugging Face's InferenceClient:

from huggingface_hub import InferenceClient

# Initialize client pointing to llama.cpp server
client = InferenceClient(
    model="https://:8080/v1",  # URL to the llama.cpp server
    token="sk-no-key-required",  # llama.cpp server requires this placeholder
)

# Text generation
response = client.text_generation(
    "Tell me a story",
    max_new_tokens=100,
    temperature=0.7,
    top_p=0.95,
    details=True,
)
print(response.generated_text)

# For chat format
response = client.chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a story"},
    ],
    max_tokens=100,
    temperature=0.7,
    top_p=0.95,
)
print(response.choices[0].message.content)

Alternatively, you can use the OpenAI client:

from openai import OpenAI

# Initialize client pointing to llama.cpp server
client = OpenAI(
    base_url="https://:8080/v1",
    api_key="sk-no-key-required",  # llama.cpp server requires this placeholder
)

# Chat completion
response = client.chat.completions.create(
    model="smollm2-1.7b-instruct",  # Model identifier can be anything as server only loads one model
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a story"},
    ],
    max_tokens=100,
    temperature=0.7,
    top_p=0.95,
)
print(response.choices[0].message.content)
</hfoption> <hfoption value="vllm" label="vLLM">

vLLM is easy to install and use, with OpenAI API compatibility and a native Python interface.

First, launch the vLLM OpenAI-compatible server:

python -m vllm.entrypoints.openai.api_server \
    --model HuggingFaceTB/SmolLM2-360M-Instruct \
    --host 0.0.0.0 \
    --port 8000

Then interact with it using Hugging Face's InferenceClient:

from huggingface_hub import InferenceClient

# Initialize client pointing to vLLM endpoint
client = InferenceClient(
    model="https://:8000/v1",  # URL to the vLLM server
)

# Text generation
response = client.text_generation(
    "Tell me a story",
    max_new_tokens=100,
    temperature=0.7,
    top_p=0.95,
    details=True,
)
print(response.generated_text)

# For chat format
response = client.chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a story"},
    ],
    max_tokens=100,
    temperature=0.7,
    top_p=0.95,
)
print(response.choices[0].message.content)

Alternatively, you can use the OpenAI client:

from openai import OpenAI

# Initialize client pointing to vLLM endpoint
client = OpenAI(
    base_url="https://:8000/v1",
    api_key="not-needed",  # vLLM doesn't require an API key by default
)

# Chat completion
response = client.chat.completions.create(
    model="HuggingFaceTB/SmolLM2-360M-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a story"},
    ],
    max_tokens=100,
    temperature=0.7,
    top_p=0.95,
)
print(response.choices[0].message.content)
</hfoption>

Basic Text Generation

Let's look at examples of text generation with these frameworks:

<hfoption value="tgi" label="TGI">

First, deploy TGI with advanced parameters:

docker run --gpus all \
    --shm-size 1g \
    -p 8080:80 \
    -v ~/.cache/huggingface:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id HuggingFaceTB/SmolLM2-360M-Instruct \
    --max-total-tokens 4096 \
    --max-input-length 3072 \
    --max-batch-total-tokens 8192 \
    --waiting-served-ratio 1.2

Use the InferenceClient for flexible text generation:

from huggingface_hub import InferenceClient

client = InferenceClient(model="http://localhost:8080")

# Advanced parameters example
response = client.chat_completion(
    messages=[
        {"role": "system", "content": "You are a creative storyteller."},
        {"role": "user", "content": "Write a creative story"},
    ],
    temperature=0.8,
    max_tokens=200,
    top_p=0.95,
)
print(response.choices[0].message.content)

# Raw text generation
response = client.text_generation(
    "Write a creative story about space exploration",
    max_new_tokens=200,
    temperature=0.8,
    top_p=0.95,
    repetition_penalty=1.1,
    do_sample=True,
    details=True,
)
print(response.generated_text)

Or use the OpenAI client:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

# Advanced parameters example
response = client.chat.completions.create(
    model="HuggingFaceTB/SmolLM2-360M-Instruct",
    messages=[
        {"role": "system", "content": "You are a creative storyteller."},
        {"role": "user", "content": "Write a creative story"},
    ],
    temperature=0.8,  # Higher for more creativity
)
print(response.choices[0].message.content)
</hfoption> <hfoption value="llama.cpp" label="llama.cpp">

For llama.cpp, you can set advanced parameters when launching the server:

./server \
    -m smollm2-1.7b-instruct.Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -c 4096 \
    --threads 8 \
    --batch-size 512 \
    --n-gpu-layers 0
# -c 4096: context size; --threads 8: CPU threads to use;
# --batch-size 512: batch size for prompt evaluation; --n-gpu-layers 0: GPU layers (0 = CPU only)

Use the InferenceClient:

from huggingface_hub import InferenceClient

client = InferenceClient(model="http://localhost:8080/v1", token="sk-no-key-required")

# Advanced parameters example
response = client.chat_completion(
    messages=[
        {"role": "system", "content": "You are a creative storyteller."},
        {"role": "user", "content": "Write a creative story"},
    ],
    temperature=0.8,
    max_tokens=200,
    top_p=0.95,
)
print(response.choices[0].message.content)

# For direct text generation
response = client.text_generation(
    "Write a creative story about space exploration",
    max_new_tokens=200,
    temperature=0.8,
    top_p=0.95,
    repetition_penalty=1.1,
    details=True,
)
print(response.generated_text)

Or use the OpenAI client for generation with control over the sampling parameters:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")

# Advanced parameters example
response = client.chat.completions.create(
    model="smollm2-1.7b-instruct",
    messages=[
        {"role": "system", "content": "You are a creative storyteller."},
        {"role": "user", "content": "Write a creative story"},
    ],
    temperature=0.8,  # Higher for more creativity
    top_p=0.95,  # Nucleus sampling probability
    frequency_penalty=0.5,  # Reduce repetition of frequent tokens
    presence_penalty=0.5,  # Reduce repetition by penalizing tokens already present
    max_tokens=200,  # Maximum generation length
)
print(response.choices[0].message.content)

You can also use llama.cpp's native library for even more control:

# Using llama-cpp-python package for direct model access
from llama_cpp import Llama

# Load the model
llm = Llama(
    model_path="smollm2-1.7b-instruct.Q4_K_M.gguf",
    n_ctx=4096,  # Context window size
    n_threads=8,  # CPU threads
    n_gpu_layers=0,  # GPU layers (0 = CPU only)
)

# Format prompt according to the model's expected format
prompt = """<|im_start|>system
You are a creative storyteller.
<|im_end|>
<|im_start|>user
Write a creative story
<|im_end|>
<|im_start|>assistant
"""

# Generate response with precise parameter control
output = llm(
    prompt,
    max_tokens=200,
    temperature=0.8,
    top_p=0.95,
    frequency_penalty=0.5,
    presence_penalty=0.5,
    stop=["<|im_end|>"],
)

print(output["choices"][0]["text"])
</hfoption> <hfoption value="vllm" label="vLLM">

For advanced usage with vLLM, you can use the InferenceClient:

from huggingface_hub import InferenceClient

client = InferenceClient(model="http://localhost:8000/v1")

# Advanced parameters example
response = client.chat_completion(
    messages=[
        {"role": "system", "content": "You are a creative storyteller."},
        {"role": "user", "content": "Write a creative story"},
    ],
    temperature=0.8,
    max_tokens=200,
    top_p=0.95,
)
print(response.choices[0].message.content)

# For direct text generation
response = client.text_generation(
    "Write a creative story about space exploration",
    max_new_tokens=200,
    temperature=0.8,
    top_p=0.95,
    details=True,
)
print(response.generated_text)

You can also use the OpenAI client:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Advanced parameters example
response = client.chat.completions.create(
    model="HuggingFaceTB/SmolLM2-360M-Instruct",
    messages=[
        {"role": "system", "content": "You are a creative storyteller."},
        {"role": "user", "content": "Write a creative story"},
    ],
    temperature=0.8,
    top_p=0.95,
    max_tokens=200,
)
print(response.choices[0].message.content)

vLLM also provides a native Python interface with fine-grained control:

from vllm import LLM, SamplingParams

# Initialize the model with advanced parameters
llm = LLM(
    model="HuggingFaceTB/SmolLM2-360M-Instruct",
    gpu_memory_utilization=0.85,
    max_num_batched_tokens=8192,
    max_num_seqs=256,
    block_size=16,
)

# Configure sampling parameters
sampling_params = SamplingParams(
    temperature=0.8,  # Higher for more creativity
    top_p=0.95,  # Consider top 95% probability mass
    max_tokens=100,  # Maximum length
    presence_penalty=1.1,  # Reduce repetition
    frequency_penalty=1.1,  # Reduce repetition
    stop=["\n\n", "###"],  # Stop sequences
)

# Generate text
prompt = "Write a creative story"
outputs = llm.generate(prompt, sampling_params)
print(outputs[0].outputs[0].text)

# For chat-style interactions
chat_prompt = [
    {"role": "system", "content": "You are a creative storyteller."},
    {"role": "user", "content": "Write a creative story"},
]
tokenizer = llm.get_tokenizer()  # Apply the model's chat template via its tokenizer
formatted_prompt = tokenizer.apply_chat_template(
    chat_prompt, tokenize=False, add_generation_prompt=True
)
outputs = llm.generate(formatted_prompt, sampling_params)
print(outputs[0].outputs[0].text)
</hfoption>

Advanced Generation Control

Token Selection and Sampling

Text generation involves selecting the next token at each step. This selection can be controlled through several parameters (a small numerical sketch follows the list below):

  1. Raw Logits: The initial output scores for each token
  2. Temperature: Controls randomness in selection (higher = more creative)
  3. Top-p (Nucleus) Sampling: Filters to the top tokens that make up X% of the probability mass
  4. Top-k Filtering: Limits selection to the k most likely tokens
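
Before looking at the framework-specific flags, here is a small NumPy sketch of how these parameters reshape the next-token distribution (toy logits, not real model output):

import numpy as np

logits = np.array([2.0, 1.5, 0.4, 0.1, -1.0])  # toy raw scores for 5 candidate tokens

def next_token_distribution(logits, temperature=1.0, top_k=None, top_p=None):
    scores = logits / temperature                 # temperature rescales the logits
    if top_k is not None:                         # keep only the k highest-scoring tokens
        cutoff = np.sort(scores)[-top_k]
        scores = np.where(scores >= cutoff, scores, -np.inf)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    if top_p is not None:                         # nucleus sampling: keep tokens until top_p mass is covered
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        keep = cumulative - probs[order] < top_p  # a token stays if the mass before it is below top_p
        mask = np.zeros_like(probs, dtype=bool)
        mask[order[keep]] = True
        probs = np.where(mask, probs, 0.0)
        probs /= probs.sum()
    return probs

print(next_token_distribution(logits, temperature=0.8, top_k=3, top_p=0.95))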

Here's how to configure these parameters:

<hfoption value="tgi" label="TGI">
client.generate(
    "Write a creative story",
    temperature=0.8,  # Higher for more creativity
    top_p=0.95,  # Consider top 95% probability mass
    top_k=50,  # Consider top 50 tokens
    max_new_tokens=100,  # Maximum length
    repetition_penalty=1.1,  # Reduce repetition
)
</hfoption> <hfoption value="llama.cpp" label="llama.cpp">
# Via OpenAI API compatibility
response = client.completions.create(
    model="smollm2-1.7b-instruct",  # Model name (can be any string for llama.cpp server)
    prompt="Write a creative story",
    temperature=0.8,  # Higher for more creativity
    top_p=0.95,  # Consider top 95% probability mass
    frequency_penalty=1.1,  # Reduce repetition
    presence_penalty=0.1,  # Reduce repetition
    max_tokens=100,  # Maximum length
)

# Via llama-cpp-python direct access
output = llm(
    "Write a creative story",
    temperature=0.8,
    top_p=0.95,
    top_k=50,
    max_tokens=100,
    repeat_penalty=1.1,
)
</hfoption> <hfoption value="vllm" label="vLLM">
params = SamplingParams(
    temperature=0.8,  # Higher for more creativity
    top_p=0.95,  # Consider top 95% probability mass
    top_k=50,  # Consider top 50 tokens
    max_tokens=100,  # Maximum length
    presence_penalty=0.1,  # Reduce repetition
)
llm.generate("Write a creative story", sampling_params=params)
</hfoption>

控制重复

These frameworks provide ways to prevent repetitive text generation:

<hfoption value="tgi" label="TGI">
client.generate(
    "Write a varied text",
    repetition_penalty=1.1,  # Penalize repeated tokens
    no_repeat_ngram_size=3,  # Prevent 3-gram repetition
)
</hfoption> <hfoption value="llama.cpp" label="llama.cpp">
# Via OpenAI API
response = client.completions.create(
    model="smollm2-1.7b-instruct",
    prompt="Write a varied text",
    frequency_penalty=1.1,  # Penalize frequent tokens
    presence_penalty=0.8,  # Penalize tokens already present
)

# Via direct library
output = llm(
    "Write a varied text",
    repeat_penalty=1.1,  # Penalize repeated tokens
    frequency_penalty=0.5,  # Additional frequency penalty
    presence_penalty=0.5,  # Additional presence penalty
)
</hfoption> <hfoption value="vllm" label="vLLM">
params = SamplingParams(
    presence_penalty=0.1,  # Penalize token presence
    frequency_penalty=0.1,  # Penalize token frequency
)
</hfoption>

Length Control and Stop Sequences

You can control the generation length and specify when to stop:

<hfoption value="tgi" label="TGI">
client.generate(
    "Generate a short paragraph",
    max_new_tokens=100,
    min_new_tokens=10,
    stop_sequences=["\n\n", "###"],
)
</hfoption> <hfoption value="llama.cpp" label="llama.cpp">
# Via OpenAI API
response = client.completions.create(
    model="smollm2-1.7b-instruct",
    prompt="Generate a short paragraph",
    max_tokens=100,
    stop=["\n\n", "###"],
)

# Via direct library
output = llm("Generate a short paragraph", max_tokens=100, stop=["\n\n", "###"])
</hfoption> <hfoption value="vllm" label="vLLM">
params = SamplingParams(
    max_tokens=100,
    min_tokens=10,
    stop=["###", "\n\n"],
    ignore_eos=False,
    skip_special_tokens=True,
)
</hfoption>

Memory Management

These frameworks all implement advanced memory management techniques for efficient inference.

<hfoption value="tgi" label="TGI">

TGI uses Flash Attention 2 and continuous batching:

# Docker deployment with memory optimization
docker run --gpus all -p 8080:80 \
    --shm-size 1g \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id HuggingFaceTB/SmolLM2-1.7B-Instruct \
    --max-batch-total-tokens 8192 \
    --max-input-length 4096
</hfoption> <hfoption value="llama.cpp" label="llama.cpp">

llama.cpp uses quantization and optimized memory layouts:

# Server with memory optimizations
./server \
    -m smollm2-1.7b-instruct.Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -c 2048 \
    --threads 4 \
    --n-gpu-layers 32 \
    --mlock \
    --cont-batching
# -c 2048: context size; --threads 4: CPU threads; --n-gpu-layers 32: offload more layers to the GPU for larger models;
# --mlock: lock memory to prevent swapping; --cont-batching: enable continuous batching

For models that are too large for your GPU, you can use CPU offloading:

./server \
    -m smollm2-1.7b-instruct.Q4_K_M.gguf \
    --n-gpu-layers 20 \
    --threads 8
# --n-gpu-layers 20: keep the first 20 layers on the GPU; --threads 8: use more CPU threads for the CPU layers
</hfoption> <hfoption value="vllm" label="vLLM">

vLLM uses PagedAttention for optimized memory management:

from vllm import LLM

# Memory-related engine arguments are passed directly to the LLM constructor
llm = LLM(
    model="HuggingFaceTB/SmolLM2-1.7B-Instruct",
    gpu_memory_utilization=0.85,  # Fraction of GPU memory vLLM is allowed to use
    max_num_batched_tokens=8192,  # Token budget per scheduling step
    block_size=16,  # PagedAttention page size, in tokens
)
</hfoption>
