Advanced RAG on Hugging Face documentation using LangChain
This notebook demonstrates how you can build an advanced RAG (Retrieval Augmented Generation) pipeline with LangChain to answer a user's questions about a specific knowledge base (here, the Hugging Face documentation).
For an introduction to RAG, you can check out this other guide!
RAG systems are complex, with many moving parts: here is a RAG diagram, where we noted in blue all possibilities for system enhancement.

💡 As you can see, there are many steps to tune in this architecture: tuning the system properly will yield significant performance gains.
In this notebook, we will take a look at many of these blue notes to see how to tune your RAG system and get the best performance.
Let's dig into the model building! First, we install the required model dependencies.
!pip install -q torch transformers accelerate bitsandbytes langchain sentence-transformers faiss-cpu openpyxl pacmap datasets langchain-community ragatouille
from tqdm.notebook import tqdm
import pandas as pd
from typing import Optional, List, Tuple
from datasets import Dataset
import matplotlib.pyplot as plt
pd.set_option("display.max_colwidth", None) # This will be helpful when visualizing retriever outputs
Load your knowledge base
import datasets
ds = datasets.load_dataset("m-ric/huggingface_doc", split="train")
from langchain.docstore.document import Document as LangchainDocument
RAW_KNOWLEDGE_BASE = [
LangchainDocument(page_content=doc["text"], metadata={"source": doc["source"]}) for doc in tqdm(ds)
]
1. Retriever - embeddings 🗂️
The retriever acts like an internal search engine: given the user query, it returns a few relevant snippets from your knowledge base.
These snippets will then be fed to the Reader Model to help it generate its answer.
So our objective here is, given a user question, to find the most relevant snippets from our knowledge base to answer that question.
This is a wide objective, and it leaves open some questions. How many snippets should we retrieve? This parameter is named `top_k`.
How long should these snippets be? This is called the `chunk size`. There is no one-size-fits-all answer, but here are a few key points:
- 🔀 Your `chunk size` is allowed to vary from one snippet to the other.
- Since there will always be some noise in your retrieval, increasing `top_k` increases the chance of getting relevant elements among the retrieved snippets. 🎯 Shooting more arrows increases your probability of hitting the target.
- Meanwhile, the summed length of your retrieved documents should not be too high: for instance, for most current models, 16k tokens will probably drown your reader model in information due to the Lost-in-the-middle phenomenon. 🎯 Give your reader model only the most relevant insights, not a huge pile of books!
In this notebook, we use the LangChain library because it offers a wide variety of options for vector databases and lets us keep document metadata throughout the processing.
1.1 Split the documents into chunks
- In this part, we split the documents from our knowledge base into smaller chunks: these will be the snippets on which the reader LLM bases its answer.
- The goal is to prepare a collection of semantically relevant snippets, so their size should be adapted to precise ideas: too small will truncate ideas, too large will dilute them.
💡 Many options exist for text splitting: splitting on words, on sentence boundaries, recursive chunking that processes documents in a tree-like way to preserve structure information… To learn more about chunking, I recommend you read this great notebook by Greg Kamradt.
Recursive chunking breaks the text down into smaller parts step by step, using a given list of separators sorted from the most important to the least important separator. If the first split does not give chunks of the right size or shape, the method repeats itself on the resulting chunks with a different separator. For instance, with the separator list `["\n\n", "\n", ".", ""]`:
- The method first breaks the document wherever there is a double line break `"\n\n"`.
- The resulting documents are split again on single line breaks `"\n"`, then on sentence ends `"."`.
- Finally, if some chunks are still too big, they are split whenever they exceed the maximum size.
With this approach, the global structure is well preserved, at the cost of slight variations in chunk size.
This Space lets you visualize how different splitting options affect the chunks you get.
🔬 Let's experiment a bit with chunk sizes, beginning with an arbitrary size, and see how the splits work. We use LangChain's `RecursiveCharacterTextSplitter` implementation of recursive chunking.
- The parameter `chunk_size` controls the length of individual chunks: by default, this length is counted in number of characters.
- The parameter `chunk_overlap` lets adjacent chunks overlap a bit. This reduces the chance that an idea gets cut in half by the split between two adjacent chunks. We set it to roughly 1/10th of the chunk size; you can try different values!
from langchain.text_splitter import RecursiveCharacterTextSplitter
# We use a hierarchical list of separators specifically tailored for splitting Markdown documents
# This list is taken from LangChain's MarkdownTextSplitter class
MARKDOWN_SEPARATORS = [
"\n#{1,6} ",
"```\n",
"\n\\*\\*\\*+\n",
"\n---+\n",
"\n___+\n",
"\n\n",
"\n",
" ",
"",
]
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000, # The maximum number of characters in a chunk: we selected this value arbitrarily
chunk_overlap=100, # The number of characters to overlap between chunks
add_start_index=True, # If `True`, includes chunk's start index in metadata
strip_whitespace=True, # If `True`, strips whitespace from the start and end of every document
separators=MARKDOWN_SEPARATORS,
)
docs_processed = []
for doc in RAW_KNOWLEDGE_BASE:
docs_processed += text_splitter.split_documents([doc])
We also have to keep in mind that, when embedding documents, we will use an embedding model that accepts a certain maximum sequence length `max_seq_length`.
So we should make sure that our chunk sizes stay below this limit, because any longer chunk will be truncated before processing, losing relevance.
>>> from sentence_transformers import SentenceTransformer
>>> # To get the value of the max sequence length, we query the underlying `SentenceTransformer` object of our embedding model
>>> print(f"Model's maximum sequence length: {SentenceTransformer('thenlper/gte-small').max_seq_length}")
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-small")
>>> lengths = [len(tokenizer.encode(doc.page_content)) for doc in tqdm(docs_processed)]
>>> # Plot the distribution of document lengths, counted as the number of tokens
>>> fig = pd.Series(lengths).hist()
>>> plt.title("Distribution of document lengths in the knowledge base (in count of tokens)")
>>> plt.show()
Model's maximum sequence length: 512
👀 As you can see, the chunk lengths are not aligned with our limit of 512 tokens, and some documents exceed it, so part of them will be lost to truncation!
- So we should change the `RecursiveCharacterTextSplitter` class to count length in number of tokens instead of number of characters.
- Then we can choose a specific chunk size; here we pick a threshold below 512.
- Smaller documents could allow the split to focus more on specific ideas.
- But chunks that are too small split sentences in half, losing meaning again: the proper tuning is a matter of balance.
>>> from langchain.text_splitter import RecursiveCharacterTextSplitter
>>> from transformers import AutoTokenizer
>>> EMBEDDING_MODEL_NAME = "thenlper/gte-small"
>>> def split_documents(
... chunk_size: int,
... knowledge_base: List[LangchainDocument],
... tokenizer_name: Optional[str] = EMBEDDING_MODEL_NAME,
... ) -> List[LangchainDocument]:
... """
... Split documents into chunks of maximum size `chunk_size` tokens and return a list of documents.
... """
... text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
... AutoTokenizer.from_pretrained(tokenizer_name),
... chunk_size=chunk_size,
... chunk_overlap=int(chunk_size / 10),
... add_start_index=True,
... strip_whitespace=True,
... separators=MARKDOWN_SEPARATORS,
... )
... docs_processed = []
... for doc in knowledge_base:
... docs_processed += text_splitter.split_documents([doc])
... # Remove duplicates
... unique_texts = {}
... docs_processed_unique = []
... for doc in docs_processed:
... if doc.page_content not in unique_texts:
... unique_texts[doc.page_content] = True
... docs_processed_unique.append(doc)
... return docs_processed_unique
>>> docs_processed = split_documents(
... 512, # We choose a chunk size adapted to our model
... RAW_KNOWLEDGE_BASE,
... tokenizer_name=EMBEDDING_MODEL_NAME,
... )
>>> # Let's visualize the chunk sizes we would have in tokens from a common model
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained(EMBEDDING_MODEL_NAME)
>>> lengths = [len(tokenizer.encode(doc.page_content)) for doc in tqdm(docs_processed)]
>>> fig = pd.Series(lengths).hist()
>>> plt.title("Distribution of document lengths in the knowledge base (in count of tokens)")
>>> plt.show()
➡️ Now the chunk length distribution looks much better!
1.2 Building the vector database
We want to compute embeddings for all the chunks of our knowledge base: to learn more about sentence embeddings, we recommend reading this guide.
How does retrieval work?
Once the chunks are all embedded, we store them in a vector database. When the user types in a query, it gets embedded by the same model previously used, and a similarity search returns the closest documents from the vector database.
The technical challenge is thus, given a query vector, to quickly find the nearest neighbors of this vector in the database. To do this we need to choose two things: a distance metric, and a search algorithm that finds the nearest neighbors quickly within a database of thousands of records.
Nearest Neighbor search algorithm
There are plenty of choices for a nearest neighbor search algorithm: we go with Facebook's FAISS since its performance is sufficient for most use cases, and it is well known, and thus widely implemented.
Distances
Regarding distances, you can find a good guide here. In short:
- Cosine similarity computes the similarity between two vectors as the cosine of their relative angle: it lets us compare vector directions regardless of their magnitude. Using it requires normalizing all vectors, rescaling them to unit norm.
- The dot product takes magnitude into account, with the sometimes undesirable effect that increasing a vector's length makes it more similar to all the others.
- Euclidean distance is the distance between the endpoints of the vectors.
You can try this small exercise to check your understanding of these concepts. But once vectors are normalized, the choice of a specific distance does not matter much.
Our particular model works well with cosine similarity, so we choose this distance and set it up both in the embedding model and in the `distance_strategy` argument of our FAISS index. With cosine similarity, we have to normalize our embeddings.
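To make the claim above concrete, here is a small self-contained check (not part of the original notebook) showing that once vectors are unit-normalized, cosine similarity, the dot product, and Euclidean distance all produce the same nearest-neighbor ranking:
import numpy as np
# Standalone illustration: for unit-norm vectors u and v, ||u - v||^2 = 2 - 2 * (u . v),
# so the ranking by dot product (== cosine similarity) and by Euclidean distance coincide.
rng = np.random.default_rng(0)
query = rng.normal(size=8)
docs = rng.normal(size=(5, 8))
query /= np.linalg.norm(query)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
cosine_scores = docs @ query  # dot product equals cosine similarity after normalization
euclidean_distances = np.linalg.norm(docs - query, axis=1)
print("Ranking by cosine similarity: ", np.argsort(-cosine_scores))
print("Ranking by Euclidean distance:", np.argsort(euclidean_distances))  # same ordering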
🚨👇 The cell below takes a few minutes to run on an A10G!
from langchain.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores.utils import DistanceStrategy
embedding_model = HuggingFaceEmbeddings(
model_name=EMBEDDING_MODEL_NAME,
multi_process=True,
model_kwargs={"device": "cuda"},
encode_kwargs={"normalize_embeddings": True}, # Set `True` for cosine similarity
)
KNOWLEDGE_VECTOR_DATABASE = FAISS.from_documents(
docs_processed, embedding_model, distance_strategy=DistanceStrategy.COSINE
)
👀 To visualize the search for the closest documents, let's project our embeddings from 384 dimensions down to 2 dimensions using PaCMAP.
💡 *We chose PaCMAP rather than other techniques such as t-SNE or UMAP because it is efficient (preserves local and global structure), robust to initialization parameters, and fast.*
# Embed a user query in the same space
user_query = "How to create a pipeline object?"
query_vector = embedding_model.embed_query(user_query)
import pacmap
import numpy as np
import plotly.express as px
embedding_projector = pacmap.PaCMAP(n_components=2, n_neighbors=None, MN_ratio=0.5, FP_ratio=2.0, random_state=1)
embeddings_2d = [
list(KNOWLEDGE_VECTOR_DATABASE.index.reconstruct_n(idx, 1)[0]) for idx in range(len(docs_processed))
] + [query_vector]
# Fit the data (the index of transformed data corresponds to the index of the original data)
documents_projected = embedding_projector.fit_transform(np.array(embeddings_2d), init="pca")
df = pd.DataFrame.from_dict(
[
{
"x": documents_projected[i, 0],
"y": documents_projected[i, 1],
"source": docs_processed[i].metadata["source"].split("/")[1],
"extract": docs_processed[i].page_content[:100] + "...",
"symbol": "circle",
"size_col": 4,
}
for i in range(len(docs_processed))
]
+ [
{
"x": documents_projected[-1, 0],
"y": documents_projected[-1, 1],
"source": "User query",
"extract": user_query,
"size_col": 100,
"symbol": "star",
}
]
)
# Visualize the embedding
fig = px.scatter(
df,
x="x",
y="y",
color="source",
hover_data="extract",
size="size_col",
symbol="symbol",
color_discrete_map={"User query": "black"},
width=1000,
height=700,
)
fig.update_traces(
marker=dict(opacity=1, line=dict(width=0, color="DarkSlateGrey")),
selector=dict(mode="markers"),
)
fig.update_layout(
legend_title_text="<b>Chunk source</b>",
title="<b>2D Projection of Chunk Embeddings via PaCMAP</b>",
)
fig.show()

➡️ On the graph above, you can see a spatial representation of the knowledge base documents. Since vector embeddings represent a document's meaning, closeness in meaning should be reflected in closeness of the embeddings.
The user query's embedding is also shown: we want to find the `k` documents that have the closest meaning, so we pick the `k` closest vectors.
In the LangChain vector database implementation, this search operation is performed by the method `vector_database.similarity_search(query)`.
Here is the result:
>>> print(f"\nStarting retrieval for {user_query=}...")
>>> retrieved_docs = KNOWLEDGE_VECTOR_DATABASE.similarity_search(query=user_query, k=5)
>>> print("\n==================================Top document==================================")
>>> print(retrieved_docs[0].page_content)
>>> print("==================================Metadata==================================")
>>> print(retrieved_docs[0].metadata)
Starting retrieval for user_query='How to create a pipeline object?'...
==================================Top document==================================
```
## Available Pipelines:
==================================Metadata==================================
{'source': 'huggingface/diffusers/blob/main/docs/source/en/api/pipelines/deepfloyd_if.md', 'start_index': 16887}
2. Reader - LLM 💬
In this part, the LLM Reader reads the retrieved context to formulate its answer.
There are substeps that can all be tuned:
- The content of the retrieved documents is aggregated together into the "context", with many processing options such as *prompt compression* (a crude illustration follows after this list).
- The context and the user query are aggregated into a prompt, which is then given to the LLM to generate its answer.
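As a crude illustration of the first point (this is not the notebook's method, and dedicated prompt-compression tools go much further), one could cap every retrieved chunk at a token budget before aggregating it into the context. The sketch below reuses the `retrieved_docs` from the retrieval cell above and the `thenlper/gte-small` tokenizer:
from transformers import AutoTokenizer
# Illustrative only: a naive form of "prompt compression" that truncates each retrieved
# chunk to a fixed token budget before it is aggregated into the context.
compression_tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-small")
def truncate_chunk(text: str, max_tokens: int = 256) -> str:
    token_ids = compression_tokenizer.encode(text, add_special_tokens=False)
    return compression_tokenizer.decode(token_ids[:max_tokens])
compressed_context = "\nExtracted documents:\n" + "".join(
    f"Document {i}:::\n{truncate_chunk(doc.page_content)}" for i, doc in enumerate(retrieved_docs)
)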
2.1. Reader model
The choice of the reader model is important in a few respects:
- The reader model's `max_seq_length` must accommodate our prompt, which includes the context output by the retriever call: the context consists of 5 documents of 512 tokens each, so we aim for a context length of at least 4k tokens.
- The reader model itself: in this example we chose `HuggingFaceH4/zephyr-7b-beta`, a small but powerful model.
With many models being released every week, you may want to substitute this model with the latest and greatest. The best way to keep track of open source LLMs is to check the Open-source LLM leaderboard.
To make inference faster, we will load the quantized version of the model:
from transformers import pipeline
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
READER_MODEL_NAME = "HuggingFaceH4/zephyr-7b-beta"
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(READER_MODEL_NAME, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(READER_MODEL_NAME)
READER_LLM = pipeline(
model=model,
tokenizer=tokenizer,
task="text-generation",
do_sample=True,
temperature=0.2,
repetition_penalty=1.1,
return_full_text=False,
max_new_tokens=500,
)
READER_LLM("What is 4+4? Answer:")
2.2. Prompt
The RAG prompt template below is what we will feed to the Reader LLM: it is important to have it formatted in the Reader LLM's chat template.
We give it our context and the user's question.
>>> prompt_in_chat_format = [
... {
... "role": "system",
... "content": """Using the information contained in the context,
... give a comprehensive answer to the question.
... Respond only to the question asked, response should be concise and relevant to the question.
... Provide the number of the source document when relevant.
... If the answer cannot be deduced from the context, do not give an answer.""",
... },
... {
... "role": "user",
... "content": """Context:
... {context}
... ---
... Now here is the question you need to answer.
... Question: {question}""",
... },
... ]
>>> RAG_PROMPT_TEMPLATE = tokenizer.apply_chat_template(
... prompt_in_chat_format, tokenize=False, add_generation_prompt=True
... )
>>> print(RAG_PROMPT_TEMPLATE)
<|system|>
Using the information contained in the context,
give a comprehensive answer to the question.
Respond only to the question asked, response should be concise and relevant to the question.
Provide the number of the source document when relevant.
If the answer cannot be deduced from the context, do not give an answer.
<|user|>
Context:
{context}
---
Now here is the question you need to answer.
Question: {question}
<|assistant|>
Let's test our Reader on the documents we retrieved earlier!
>>> retrieved_docs_text = [doc.page_content for doc in retrieved_docs] # We only need the text of the documents
>>> context = "\nExtracted documents:\n"
>>> context += "".join([f"Document {str(i)}:::\n" + doc for i, doc in enumerate(retrieved_docs_text)])
>>> final_prompt = RAG_PROMPT_TEMPLATE.format(question="How to create a pipeline object?", context=context)
>>> # Generate an answer
>>> answer = READER_LLM(final_prompt)[0]["generated_text"]
>>> print(answer)
To create a pipeline object, follow these steps:
1. Define the inputs and outputs of your pipeline. These could be strings, dictionaries, or any other format that best suits your use case.
2. Inherit the `Pipeline` class from the `transformers` module and implement the following methods:
- `preprocess`: This method takes the raw inputs and returns a preprocessed dictionary that can be passed to the model.
- `_forward`: This method performs the actual inference using the model and returns the output tensor.
- `postprocess`: This method takes the output tensor and returns the final output in the desired format.
- `_sanitize_parameters`: This method is used to sanitize the input parameters before passing them to the model.
3. Load the necessary components, such as the model and scheduler, into the pipeline object.
4. Instantiate the pipeline object and return it.
Here's an example implementation based on the given context:
```python
from transformers import Pipeline
import torch
from diffusers import StableDiffusionPipeline

class MyPipeline(Pipeline):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.pipe = StableDiffusionPipeline.from_pretrained("my_model")

    def preprocess(self, inputs):
        # Preprocess the inputs as needed
        return {"input_ids":...}

    def _forward(self, inputs):
        # Run the forward pass of the model
        return self.pipe(**inputs).images[0]

    def postprocess(self, outputs):
        # Postprocess the outputs as needed
        return outputs["sample"]

    def _sanitize_parameters(self, params):
        # Sanitize the input parameters
        return params

my_pipeline = MyPipeline()
result = my_pipeline("My input string")
print(result)
```
Note that this implementation assumes that the model and scheduler are already loaded into memory. If they need to be loaded dynamically, you can modify the `__init__` method accordingly.
2.3. Reranking
A good option for RAG is to retrieve more documents than you want in the end, then rerank the results with a more powerful retrieval model before keeping only the `top_k`.
For this, ColBERTv2 is a great choice: unlike our classical embedding models (bi-encoders), it uses late interaction to compute more fine-grained matches between the query tokens and each document's tokens.
It is easy to use thanks to the RAGatouille library.
from ragatouille import RAGPretrainedModel
RERANKER = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
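As a quick, illustrative sanity check (not part of the original notebook), you can rerank the chunks retrieved earlier for `user_query`; `rerank` returns a list of dicts whose "content" key holds the text, which is how it is consumed in `answer_with_rag` below:
# Illustrative only: rerank the previously retrieved chunks and keep the top 3.
reranked_docs = RERANKER.rerank(user_query, [doc.page_content for doc in retrieved_docs], k=3)
for rank, doc in enumerate(reranked_docs):
    print(f"Rank {rank}: {doc['content'][:80]}...")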
3. Assembling it all!
from transformers import Pipeline
def answer_with_rag(
question: str,
llm: Pipeline,
knowledge_index: FAISS,
reranker: Optional[RAGPretrainedModel] = None,
num_retrieved_docs: int = 30,
num_docs_final: int = 5,
) -> Tuple[str, List[LangchainDocument]]:
# Gather documents with retriever
print("=> Retrieving documents...")
relevant_docs = knowledge_index.similarity_search(query=question, k=num_retrieved_docs)
relevant_docs = [doc.page_content for doc in relevant_docs] # Keep only the text
# Optionally rerank results
if reranker:
print("=> Reranking documents...")
relevant_docs = reranker.rerank(question, relevant_docs, k=num_docs_final)
relevant_docs = [doc["content"] for doc in relevant_docs]
relevant_docs = relevant_docs[:num_docs_final]
# Build the final prompt
context = "\nExtracted documents:\n"
context += "".join([f"Document {str(i)}:::\n" + doc for i, doc in enumerate(relevant_docs)])
final_prompt = RAG_PROMPT_TEMPLATE.format(question=question, context=context)
# Redact an answer
print("=> Generating answer...")
answer = llm(final_prompt)[0]["generated_text"]
return answer, relevant_docs
Let's see how our RAG pipeline answers a user query.
>>> question = "how to create a pipeline object?"
>>> answer, relevant_docs = answer_with_rag(question, READER_LLM, KNOWLEDGE_VECTOR_DATABASE, reranker=RERANKER)
=> Retrieving documents...
>>> print("==================================Answer==================================")
>>> print(f"{answer}")
>>> print("==================================Source docs==================================")
>>> for i, doc in enumerate(relevant_docs):
... print(f"Document {i}------------------------------------------------------------")
... print(doc)
==================================Answer================================== To create a pipeline object, follow these steps: 1. Import the `pipeline` function from the `transformers` module: ```python from transformers import pipeline ``` 2. Choose the task you want to perform, such as object detection, sentiment analysis, or image generation, and pass it as an argument to the `pipeline` function: - For object detection: ```python >>> object_detector = pipeline('object-detection') >>> object_detector(image) [{'score': 0.9982201457023621, 'label':'remote', 'box': {'xmin': 40, 'ymin': 70, 'xmax': 175, 'ymax': 117}}, ...] ``` - For sentiment analysis: ```python >>> classifier = pipeline("sentiment-analysis") >>> classifier("This is a great product!") {'labels': ['POSITIVE'],'scores': tensor([0.9999], device='cpu', dtype=torch.float32)} ``` - For image generation: ```python >>> image = pipeline( ... "stained glass of darth vader, backlight, centered composition, masterpiece, photorealistic, 8k" ... ).images[0] >>> image PILImage mode RGB size 7680x4320 at 0 DPI ``` Note that the exact syntax may vary depending on the specific pipeline being used. Refer to the documentation for more details on how to use each pipeline. In general, the process involves importing the necessary modules, selecting the desired pipeline task, and passing it to the `pipeline` function along with any required arguments. The resulting pipeline object can then be used to perform the selected task on input data. ==================================Source docs================================== Document 0------------------------------------------------------------ # Allocate a pipeline for object detection >>> object_detector = pipeline('object-detection') >>> object_detector(image) [{'score': 0.9982201457023621, 'label': 'remote', 'box': {'xmin': 40, 'ymin': 70, 'xmax': 175, 'ymax': 117}}, {'score': 0.9960021376609802, 'label': 'remote', 'box': {'xmin': 333, 'ymin': 72, 'xmax': 368, 'ymax': 187}}, {'score': 0.9954745173454285, 'label': 'couch', 'box': {'xmin': 0, 'ymin': 1, 'xmax': 639, 'ymax': 473}}, {'score': 0.9988006353378296, 'label': 'cat', 'box': {'xmin': 13, 'ymin': 52, 'xmax': 314, 'ymax': 470}}, {'score': 0.9986783862113953, 'label': 'cat', 'box': {'xmin': 345, 'ymin': 23, 'xmax': 640, 'ymax': 368}}] Document 1------------------------------------------------------------ # Allocate a pipeline for object detection >>> object_detector = pipeline('object_detection') >>> object_detector(image) [{'score': 0.9982201457023621, 'label': 'remote', 'box': {'xmin': 40, 'ymin': 70, 'xmax': 175, 'ymax': 117}}, {'score': 0.9960021376609802, 'label': 'remote', 'box': {'xmin': 333, 'ymin': 72, 'xmax': 368, 'ymax': 187}}, {'score': 0.9954745173454285, 'label': 'couch', 'box': {'xmin': 0, 'ymin': 1, 'xmax': 639, 'ymax': 473}}, {'score': 0.9988006353378296, 'label': 'cat', 'box': {'xmin': 13, 'ymin': 52, 'xmax': 314, 'ymax': 470}}, {'score': 0.9986783862113953, 'label': 'cat', 'box': {'xmin': 345, 'ymin': 23, 'xmax': 640, 'ymax': 368}}] Document 2------------------------------------------------------------ Start by creating an instance of [`pipeline`] and specifying a task you want to use it for. 
In this guide, you'll use the [`pipeline`] for sentiment analysis as an example: ```py >>> from transformers import pipeline >>> classifier = pipeline("sentiment-analysis") Document 3------------------------------------------------------------ ``` ## Add the pipeline to 🤗 Transformers If you want to contribute your pipeline to 🤗 Transformers, you will need to add a new module in the `pipelines` submodule with the code of your pipeline, then add it to the list of tasks defined in `pipelines/__init__.py`. Then you will need to add tests. Create a new file `tests/test_pipelines_MY_PIPELINE.py` with examples of the other tests. The `run_pipeline_test` function will be very generic and run on small random models on every possible architecture as defined by `model_mapping` and `tf_model_mapping`. This is very important to test future compatibility, meaning if someone adds a new model for `XXXForQuestionAnswering` then the pipeline test will attempt to run on it. Because the models are random it's impossible to check for actual values, that's why there is a helper `ANY` that will simply attempt to match the output of the pipeline TYPE. You also *need* to implement 2 (ideally 4) tests. - `test_small_model_pt` : Define 1 small model for this pipeline (doesn't matter if the results don't make sense) and test the pipeline outputs. The results should be the same as `test_small_model_tf`. - `test_small_model_tf` : Define 1 small model for this pipeline (doesn't matter if the results don't make sense) and test the pipeline outputs. The results should be the same as `test_small_model_pt`. - `test_large_model_pt` (`optional`): Tests the pipeline on a real pipeline where the results are supposed to make sense. These tests are slow and should be marked as such. Here the goal is to showcase the pipeline and to make sure there is no drift in future releases. - `test_large_model_tf` (`optional`): Tests the pipeline on a real pipeline where the results are supposed to make sense. These tests are slow and should be marked as such. Here the goal is to showcase the pipeline and to make sure there is no drift in future releases. Document 4------------------------------------------------------------ ``` 2. Pass a prompt to the pipeline to generate an image: ```py image = pipeline( "stained glass of darth vader, backlight, centered composition, masterpiece, photorealistic, 8k" ).images[0] image
✅ We now have a fully functional, performant RAG system. That's it for today! Congratulations for making it to the end 🥳
To go further 🗺️
This is not the end of the journey! There are many steps you can try to improve your RAG system. We recommend doing so in an iterative way: bring small changes to the system and see what improves performance.
Setting up an evaluation pipeline
- 💬 "You cannot improve the model performance that you do not measure", said Gandhi… or at least Llama2 told me he said it. Anyway, you should absolutely start by measuring performance: this means building a small evaluation dataset, and then monitoring the performance of your RAG system on this evaluation dataset (a minimal sketch follows below).
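A minimal sketch of such an evaluation loop, assuming a handful of hand-written question/reference pairs (the example pair and the LLM-as-judge scoring rule below are placeholders, not part of the original notebook):
# Minimal evaluation sketch: run the full pipeline on a few questions with reference
# answers and ask the reader LLM for a rough 1-5 agreement score.
eval_set = [
    {"question": "How to create a pipeline object?", "reference": "Call transformers.pipeline() with a task name."},
    # ... add more question/reference pairs covering your knowledge base
]
for sample in eval_set:
    answer, _ = answer_with_rag(sample["question"], READER_LLM, KNOWLEDGE_VECTOR_DATABASE, reranker=RERANKER)
    judge_prompt = (
        "Rate from 1 to 5 how well the answer matches the reference.\n"
        f"Question: {sample['question']}\nReference: {sample['reference']}\nAnswer: {answer}\nRating:"
    )
    rating = READER_LLM(judge_prompt)[0]["generated_text"]
    print(f"{sample['question']} -> {rating.strip()[:40]}")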
Improving the retriever
🛠️ You can use these options to tune the results:
- Tune the chunking method:
- Size of the chunks
- Method: split on different separators, use semantic chunking…
- Change the embedding model
👷‍♀️ More could be considered:
- Try another chunking method, like semantic chunking
- Change the index used (here, FAISS)
- Query expansion: reformulate the user query in slightly different ways to retrieve more documents (see the sketch below).
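A minimal sketch of query expansion, reusing the `READER_LLM` and `KNOWLEDGE_VECTOR_DATABASE` objects defined above (the prompt and the helper below are hypothetical, not part of the original notebook):
# Hypothetical helper: paraphrase the question with the reader LLM, retrieve for each
# rephrasing, and merge the deduplicated results (they could then be reranked as before).
def expand_and_retrieve(question: str, n_rephrasings: int = 2, k_per_query: int = 10):
    prompt = (
        f"Rewrite the following question in {n_rephrasings} different ways, one per line, "
        f"keeping the same meaning.\nQuestion: {question}\nRewrites:"
    )
    generated = READER_LLM(prompt)[0]["generated_text"]
    rephrasings = [question] + [line.strip() for line in generated.split("\n") if line.strip()][:n_rephrasings]
    seen, merged_docs = set(), []
    for query in rephrasings:
        for doc in KNOWLEDGE_VECTOR_DATABASE.similarity_search(query=query, k=k_per_query):
            if doc.page_content not in seen:
                seen.add(doc.page_content)
                merged_docs.append(doc)
    return merged_docs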
Improving the reader
🛠️ Here you can try the following options to improve results:
- Tune the prompt
- Switch reranking on or off
- Choose a more powerful reader model
💡 Many options could be considered here to further improve the results:
- Compress the retrieved context to keep only the most relevant parts for answering the query.
- Extend the RAG system to make it more user-friendly:
- Cite sources (a minimal sketch follows at the end)
- Make it conversational
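For instance, here is a minimal sketch of citing sources: keep the LangChain documents (rather than only their text) so that the `source` metadata survives, and append it to the generated answer. The helper below is hypothetical and only meant as a starting point:
# Hypothetical variant of `answer_with_rag` that returns the answer together with the
# source paths of the chunks it was grounded on.
def answer_with_sources(question: str, k: int = 5) -> str:
    docs = KNOWLEDGE_VECTOR_DATABASE.similarity_search(query=question, k=k)
    context = "\nExtracted documents:\n" + "".join(
        f"Document {i}:::\n{doc.page_content}" for i, doc in enumerate(docs)
    )
    answer = READER_LLM(RAG_PROMPT_TEMPLATE.format(question=question, context=context))[0]["generated_text"]
    sources = "\n".join(f"[{i}] {doc.metadata['source']}" for i, doc in enumerate(docs))
    return f"{answer}\n\nSources:\n{sources}"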