Pipeline

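Pipeline is a simple but powerful inference API that works with any model from the Hugging Face Hub across a wide range of machine learning tasks.

Tailor Pipeline to your task with task-specific parameters, such as adding timestamps to an automatic speech recognition (ASR) pipeline to transcribe meeting notes. Pipeline supports GPUs, Apple Silicon, and half-precision weights to accelerate inference and save memory.

Transformers has two kinds of pipeline classes: a generic Pipeline and many individual task-specific pipelines such as TextGenerationPipeline or VisualQuestionAnsweringPipeline. Load one of these individual pipelines by setting its task identifier in the task parameter of Pipeline. You can find the task identifier for each pipeline in its API documentation.

Each task is configured to use a default pretrained model and preprocessor, but you can override the default with the model parameter if you want to use a different model.

For example, to use TextGenerationPipeline with Gemma 2, set task="text-generation" and model="google/gemma-2-2b".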

from transformers import pipeline

pipeline = pipeline(task="text-generation", model="google/gemma-2-2b")
pipeline("the secret to baking a really good cake is ")
[{'generated_text': 'the secret to baking a really good cake is 1. the right ingredients 2. the'}]

When you have more than one input, pass them as a list.

from transformers import pipeline

pipeline = pipeline(task="text-generation", model="google/gemma-2-2b", device="cuda")
pipeline(["the secret to baking a really good cake is ", "a baguette is "])
[[{'generated_text': 'the secret to baking a really good cake is 1. the right ingredients 2. the'}],
 [{'generated_text': 'a baguette is 100% bread.\n\na baguette is 100%'}]]

This guide introduces Pipeline, demonstrates its features, and shows how to configure its various parameters.

Tasks

Pipeline is compatible with many machine learning tasks across different modalities. Pass an appropriate input to the pipeline and it handles the rest.

Tasks that Pipeline can be used for across these modalities include summarization, automatic speech recognition, image classification, and visual question answering.

Parameters

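At a minimum, Pipeline only requires a task identifier, a model, and the appropriate input. Beyond that, many parameters are available to configure the pipeline, from task-specific parameters to performance optimizations.

This section introduces some of the more important parameters.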

Device

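Pipeline is compatible with many hardware types, including GPUs, CPUs, Apple Silicon, and more. Configure the hardware type with the device parameter. By default, Pipeline runs on the CPU, which corresponds to device=-1.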

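To run on a GPU, set device to a CUDA device, for example device="cuda" or a device index such as device=0. To run on Apple silicon, set device="mps".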
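As a rough sketch of choosing a value for device at runtime, the helper below checks which PyTorch backends report themselves as available and falls back to the CPU otherwise. The name pick_device is hypothetical and not part of Transformers, and the fallback when PyTorch is missing is an assumption made so the snippet runs anywhere.

```python
def pick_device():
    """Return "cuda", "mps", or "cpu" depending on what PyTorch reports as available."""
    try:
        import torch
    except ImportError:
        # Assumption: without PyTorch installed, fall back to the CPU.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # Older PyTorch builds may not expose the MPS backend at all.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
print(device)
```

The returned string could then be passed along as pipeline(..., device=device).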

Batch inference

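Pipeline can also process batches of inputs with the batch_size parameter. Batch inference may improve speed, especially on a GPU, but this isn't guaranteed. Other variables such as the hardware, the data, and the model itself affect whether batch inference improves speed. For this reason, batch inference is disabled by default.

In the example below, there are 4 inputs and batch_size is set to 2, so Pipeline passes one batch of 2 inputs to the model at a time.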

from transformers import pipeline

pipeline = pipeline(task="text-generation", model="google/gemma-2-2b", device="cuda", batch_size=2)
pipeline(["the secret to baking a really good cake is", "a baguette is", "paris is the", "hotdogs are"])
[[{'generated_text': 'the secret to baking a really good cake is to use a good cake mix.\n\ni’'}],
 [{'generated_text': 'a baguette is'}],
 [{'generated_text': 'paris is the most beautiful city in the world.\n\ni’ve been to paris 3'}],
 [{'generated_text': 'hotdogs are a staple of the american diet. they are a great source of protein and can'}]]
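To make the grouping concrete, here is a minimal pure-Python sketch of how four inputs split into batches of two. The batched helper is hypothetical and only illustrates the idea; it is not a description of how Pipeline batches inputs internally.

```python
def batched(inputs, batch_size):
    """Yield consecutive slices of `inputs` with at most `batch_size` items each."""
    for start in range(0, len(inputs), batch_size):
        yield inputs[start:start + batch_size]


prompts = [
    "the secret to baking a really good cake is",
    "a baguette is",
    "paris is the",
    "hotdogs are",
]
batches = list(batched(prompts, batch_size=2))
for batch in batches:
    print(batch)  # each printed list holds the 2 prompts that would be sent together
```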

Another good use case for batch inference is streaming data through Pipeline.

from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset
import datasets

# KeyDataset is a utility that returns the item in the dict returned by the dataset
dataset = datasets.load_dataset("imdb", name="plain_text", split="unsupervised")
pipeline = pipeline(task="text-classification", model="distilbert/distilbert-base-uncased-finetuned-sst-2-english", device="cuda")
for out in pipeline(KeyDataset(dataset, "text"), batch_size=8, truncation="only_first"):
    print(out)

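Keep the following general rules of thumb in mind when deciding whether batch inference can help improve performance.

The only way to know for sure is to measure performance on your model, data, and hardware.
Don't use batch inference if you're constrained by latency (a live inference product, for example).
Don't use batch inference if you're using a CPU.
Don't use batch inference if you don't know the sequence_length of your data. Measure performance, iteratively increase sequence_length, and include out-of-memory (OOM) checks to recover from failures.
Do use batch inference if your sequence_length is regular, and keep increasing the batch size until you reach an OOM error. The larger the GPU, the more helpful batch inference is.
Do make sure you can handle OOM errors if you decide to use batch inference.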
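One way to handle OOM errors, sketched under assumptions, is to halve the batch size and retry whenever a batch fails. The helper run_with_backoff and the stand-in fake_run_batch are hypothetical names, not part of Transformers. The sketch catches the built-in MemoryError so that it runs anywhere; with PyTorch on a GPU the exception to pass as oom_error would typically be torch.cuda.OutOfMemoryError.

```python
def run_with_backoff(run_batch, inputs, batch_size, oom_error=MemoryError):
    """Process `inputs` in batches, halving the batch size whenever `oom_error` is raised."""
    results = []
    start = 0
    while start < len(inputs):
        batch = inputs[start:start + batch_size]
        try:
            results.extend(run_batch(batch))
        except oom_error:
            if batch_size == 1:
                raise  # even a single input does not fit; nothing left to shrink
            batch_size = max(1, batch_size // 2)
            continue  # retry from the same position with a smaller batch
        start += len(batch)
    return results, batch_size


def fake_run_batch(batch):
    """Stand-in for a model call: pretends batches larger than 2 run out of memory."""
    if len(batch) > 2:
        raise MemoryError("simulated OOM")
    return [text.upper() for text in batch]


outputs, final_batch_size = run_with_backoff(
    fake_run_batch, ["a", "b", "c", "d", "e"], batch_size=8
)
print(outputs, final_batch_size)
```

In real use, run_batch would wrap a pipeline call, and the batch size that survives gives a starting point for later runs on the same hardware.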

Task-specific parameters

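Pipeline accepts any parameter supported by the individual task pipelines. Check each task pipeline's documentation to see which parameters are available. If you can't find a parameter that is useful for your use case, feel free to open a GitHub issue to request it!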

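Two examples of task-specific parameters: for automatic speech recognition, return_timestamps="word" returns when each word was spoken; for text generation, return_full_text=False returns only the generated text instead of the prompt plus the generated text.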
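The sketch below shows how task-specific parameters for these two tasks might be passed. The wrapper function names are hypothetical, and the default model choices (openai/whisper-large-v3 for speech recognition, google/gemma-2-2b for generation) are assumptions for illustration; the functions load models only when called.

```python
def transcribe_with_timestamps(audio, model="openai/whisper-large-v3"):
    """Transcribe `audio` (a file path or URL) and request word-level timestamps."""
    # Imported lazily so that defining this helper does not load the library.
    from transformers import pipeline

    asr = pipeline(task="automatic-speech-recognition", model=model)
    # return_timestamps is specific to the speech recognition pipeline.
    return asr(audio, return_timestamps="word")


def generate_continuation_only(prompt, model="google/gemma-2-2b"):
    """Generate text and return only the continuation, without the prompt."""
    from transformers import pipeline

    generator = pipeline(task="text-generation", model=model)
    # return_full_text is specific to the text generation pipeline.
    return generator(prompt, return_full_text=False)
```

Calling either helper downloads the corresponding model on first use, so the calls are left out of the snippet.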

Chunk batching

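In some cases, you need to process data in chunks.

For some data types, a single input (for example, a very long audio file) may need to be split into multiple chunks before it can be processed.
For some tasks, such as zero-shot classification or question answering, a single input may need multiple forward passes, which can cause issues with the batch_size parameter.

The ChunkPipeline class is designed to handle these use cases. Both pipeline classes are used in the same way, but because ChunkPipeline handles batching automatically, you don't need to worry about the number of forward passes your inputs trigger. Instead, you can optimize batch_size independently of the inputs.

The example below shows how it differs from Pipeline.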

# ChunkPipeline
all_model_outputs = []
for preprocessed in pipeline.preprocess(inputs):
    model_outputs = pipeline.forward(preprocessed)
    all_model_outputs.append(model_outputs)
outputs = pipeline.postprocess(all_model_outputs)

# Pipeline
preprocessed = pipeline.preprocess(inputs)
model_outputs = pipeline.forward(preprocessed)
outputs = pipeline.postprocess(model_outputs)
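The chunked flow can be simulated without any model. In this toy sketch, preprocess, forward, and postprocess are hypothetical stand-ins: one input is split into several chunks, each chunk gets its own forward pass, and the per-chunk outputs are merged back into a single result. It only mirrors the control flow above and does not use the real ChunkPipeline class.

```python
def preprocess(text, chunk_size=3):
    """Split one input into several chunks (a stand-in for chunking long audio)."""
    words = text.split()
    for start in range(0, len(words), chunk_size):
        yield words[start:start + chunk_size]


def forward(chunk):
    """Stand-in for one model forward pass: counts the words in the chunk."""
    return len(chunk)


def postprocess(all_model_outputs):
    """Merge the per-chunk outputs back into a single result for the input."""
    return sum(all_model_outputs)


def run_chunked(text):
    """Run the chunked flow and also report how many forward passes it took."""
    all_model_outputs = [forward(chunk) for chunk in preprocess(text)]
    return postprocess(all_model_outputs), len(all_model_outputs)


total_words, forward_passes = run_chunked("one two three four five six seven")
print(total_words, forward_passes)
```

A single input here triggers several forward passes, which is why the number of inputs alone does not determine how much work reaches the model.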

Large datasets

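For inference on large datasets, you can iterate directly over the dataset itself. This avoids allocating memory for the entire dataset up front, and you don't need to worry about creating batches yourself. Try batch inference with the batch_size parameter to see whether it improves performance.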

from transformers.pipelines.pt_utils import KeyDataset
from transformers import pipeline
from datasets import load_dataset

dataset = load_dataset("imdb", name="plain_text", split="unsupervised")
pipeline = pipeline(task="text-classification", model="distilbert/distilbert-base-uncased-finetuned-sst-2-english", device="cuda")
for out in pipeline(KeyDataset(dataset, "text"), batch_size=8, truncation="only_first"):
    print(out)

Other ways to run inference on large datasets with Pipeline include using an iterator or a generator.

from transformers import pipeline

def data():
    for i in range(1000):
        yield f"My example {i}"

pipeline = pipeline(model="openai-community/gpt2", device=0)
generated_characters = 0
for out in pipeline(data()):
    generated_characters += len(out[0]["generated_text"])
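A generator does not have to synthesize its inputs. As a minimal sketch, the hypothetical read_prompts helper below lazily yields one line at a time from a text file, so the whole file never needs to be held in memory. The temporary file is only there to keep the snippet self-contained.

```python
import os
import tempfile


def read_prompts(path):
    """Lazily yield one non-empty, stripped line at a time from a text file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield line


# Small demonstration with a temporary file standing in for a large corpus.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("first prompt\n\nsecond prompt\n")
    prompts_path = tmp.name

prompts = list(read_prompts(prompts_path))
os.remove(prompts_path)
print(prompts)
```

In real use, the generator would be passed directly, as in pipeline(read_prompts(path)), so that lines are read only as they are consumed.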

Large models

Accelerate enables several optimizations for running large models with Pipeline. Make sure Accelerate is installed first.

!pip install -U accelerate

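The device_map="auto" setting automatically distributes the model across the fastest devices (GPUs) first, before dispatching to slower devices (CPU, hard drive) if they are available.

Pipeline supports half-precision weights (torch.float16), which can significantly improve speed and save memory. The performance loss is negligible for most models, especially larger ones. If your hardware supports it, you can enable torch.bfloat16 instead for more range.

Inputs are converted to torch.float16 internally, and this only works for models with a PyTorch backend.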
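A back-of-the-envelope calculation shows why lower precision saves memory. The sketch below estimates the memory for the weights alone from the number of bytes per parameter. The 7-billion parameter count is an assumed round number for illustration, and the estimate ignores activations, caches, and any per-layer overhead.

```python
# Bytes used to store a single parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def weight_memory_gb(num_params, dtype):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


num_params = 7_000_000_000  # assumed round parameter count, for illustration only
for dtype in ("float32", "float16", "int8"):
    print(dtype, weight_memory_gb(num_params, dtype))
```

Under these assumptions, half precision halves the weight memory relative to float32, and 8-bit weights halve it again.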

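Lastly, Pipeline also accepts quantized models to reduce memory usage even further. Make sure the bitsandbytes library is installed first, then pass a BitsAndBytesConfig with load_in_8bit=True as the quantization_config entry in the pipeline's model_kwargs.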

import torch
from transformers import pipeline, BitsAndBytesConfig

pipeline = pipeline(model="google/gemma-7b", torch_dtype=torch.bfloat16, device_map="auto", model_kwargs={"quantization_config": BitsAndBytesConfig(load_in_8bit=True)})
pipeline("the secret to baking a good cake is ")
[{'generated_text': 'the secret to baking a good cake is 1. the right ingredients 2. the right'}]
