Implementing semantic cache to improve a RAG system with FAISS.
Authored by: Pere Martra
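In this notebook, we will explore a typical RAG solution that uses an open-source model and the vector database Chroma DB. **However, we will integrate a semantic cache system that stores user queries and decides whether to build the enriched prompt with information from the vector database or with information from the cache.**
A semantic cache system aims to identify similar or identical user requests. When a matching request is found, the system retrieves the corresponding information from the cache, reducing the need to fetch it from the original source.
Because the comparison takes the semantic meaning of the requests into account, they don't have to be identical for the system to recognize them as the same question. They can be worded differently or contain inaccuracies, whether typographical or in sentence structure, and we can still identify that the user is requesting the same information.
For example, queries like **What is the capital of France?**, **Tell me the name of the capital of France?**, and **What the capital of France is?** all convey the same intent and should be identified as the same question.
The model's response may differ depending on how each request is phrased (the second example asks only for the name), but the information retrieved from the vector database should be the same. This is why I place the cache system between the user and the vector database, not between the user and the Large Language Model.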
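Most tutorials that guide you through creating a RAG system are designed for single-user use and are meant to run in a testing environment. In other words, they run in a notebook, interact with a local vector database, and either make API calls or use a locally stored model.
This architecture quickly becomes insufficient when one of these systems moves to production, where it may receive anywhere from tens to thousands of recurring requests.
One way to improve performance is to use one or several semantic caches. The cache keeps the results of previous requests and, before resolving a new request, checks whether a similar one has been received before. If so, instead of re-executing the process, it retrieves the information from the cache.
In a RAG system, there are two time-consuming points:
- Retrieving the information used to build the enriched prompt.
- Calling the Large Language Model to obtain the response.
A semantic cache can be implemented at either point, and we could even have two caches, one for each.
Placing it at the model's response point may reduce the user's influence over the response obtained. Our cache system could treat "Explain the French Revolution in 10 words" and "Explain the French Revolution in 100 words" as the same query. If the cache stored the model's response, users might feel that their instructions are not being followed accurately.
Both requests, however, require the same information to enrich the prompt. This is the main reason I chose to place the semantic cache system between the user's request and the retrieval of information from the vector database.
This is a design decision, though. Depending on the type of responses and requests the system handles, the cache can be placed at one point or the other. Caching the model's responses would clearly yield the greatest time savings but, as explained above, at the cost of the user's influence over the response.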
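The trade-off can be illustrated with a small sketch. Everything in it is hypothetical: `normalize`, `retrieve_context`, and `build_prompt` are stand-ins, and the "semantic" matching is faked by stripping the length instruction. It only aims to show that two requests with different instructions can share one cached retrieval while still producing different prompts for the model.

```python
import re

retrieval_cache = {}   # cached context, keyed by the topic of the request
retrieval_calls = 0    # counts how often the (slow) retrieval step runs


def normalize(question: str) -> str:
    """Toy stand-in for semantic matching: drop the 'in N words' instruction."""
    return re.sub(r"\s+in \d+ words", "", question.lower()).strip(" .?")


def retrieve_context(question: str) -> str:
    """Hypothetical slow retrieval step (the vector database in this notebook)."""
    global retrieval_calls
    retrieval_calls += 1
    return "The French Revolution began in 1789."


def build_prompt(question: str) -> str:
    key = normalize(question)
    if key not in retrieval_cache:          # cache miss: query the database
        retrieval_cache[key] = retrieve_context(question)
    # The full question, including its instructions, still reaches the model.
    return f"Relevant context: {retrieval_cache[key]}\n\nThe user's question: {question}"


short_prompt = build_prompt("Explain the French Revolution in 10 words")
long_prompt = build_prompt("Explain the French Revolution in 100 words")
```

Both prompts reuse a single retrieval, yet each keeps its own length instruction, which is what a cache placed after the model would lose.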
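Import and load the libraries.
To start, we need to install the necessary Python packages.
- **sentence transformers**. This library is needed to transform sentences into fixed-length vectors, also known as embeddings.
- **xformers**. A package that provides libraries and utilities to simplify working with transformer models. We install it to avoid errors when working with the model and the embeddings.
- **chromadb**. This is our vector database. ChromaDB is easy to use, open source, and one of the most widely used vector databases for storing embeddings.
- **accelerate**. Necessary to run the model on a GPU.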
!pip install -q transformers==4.38.1
!pip install -q accelerate==0.27.2
!pip install -q sentence-transformers==2.5.1
!pip install -q xformers==0.0.24
!pip install -q chromadb==0.4.24
!pip install -q datasets==2.17.1
import numpy as np
import pandas as pd
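Load the Dataset
Since we are working in a free and limited environment with only a few GB of memory, I have limited the number of rows used from the dataset with the variable MAX_ROWS.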
# Log in to Hugging Face. This is mandatory to use the Gemma model,
# and recommended to access public models and datasets.
from getpass import getpass
if 'hf_key' not in locals():
    hf_key = getpass("Your Hugging Face API Key: ")
!huggingface-cli login --token $hf_key
from datasets import load_dataset
data = load_dataset("keivalya/MedQuad-MedicalQnADataset", split="train")
ChromaDB requires that the data has a unique identifier. We can create it with the statements below, which add a new column called **id**.
data = data.to_pandas()
data["id"] = data.index
data.head(10)
MAX_ROWS = 15000
DOCUMENT = "Answer"
TOPIC = "qtype"
# Because it is just a sample, we select a small portion of the dataset.
subset_data = data.head(MAX_ROWS)
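Import and configure the Vector Database
To store the information, I chose ChromaDB, one of the best-known and most widely used open-source vector databases.
First we need to import ChromaDB.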
import chromadb
Now we only need to indicate the path where the vector database will be stored.
chroma_client = chromadb.PersistentClient(path="/path/to/persist/directory")
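Filling and Querying the ChromaDB Database
Data in ChromaDB is stored in collections. If the collection already exists, we need to delete it.
In the next lines, we create the collection by calling the create_collection function of the chroma_client created above.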
collection_name = "news_collection"
# Delete the collection if it already exists, checking every collection name
existing_names = [c.name for c in chroma_client.list_collections()]
if collection_name in existing_names:
    chroma_client.delete_collection(name=collection_name)
collection = chroma_client.create_collection(name=collection_name)
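We are now ready to add the data to the collection using the add function. This function requires three key pieces of information.
- In **documents** we store the content of the Answer column of the dataset.
- In **metadatas** we can provide a list of topics. I used the value of the qtype column.
- In **ids** we need to provide a unique identifier for each row. I create the IDs from the range of MAX_ROWS.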
collection.add(
    documents=subset_data[DOCUMENT].tolist(),
    metadatas=[{TOPIC: topic} for topic in subset_data[TOPIC].tolist()],
    ids=[f"id{x}" for x in range(MAX_ROWS)],
)
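Once the information is in the database, we can query it and ask for data that matches our needs. The search is performed on the content of the documents and does not look for exact words or phrases. The results are based on the similarity between the search terms and the content of the documents.
The metadata is not directly involved in the initial search. It can be used to filter or refine the results after retrieval, allowing further customization and precision.
Let's define a function to query the ChromaDB database.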
def query_database(query_text, n_results=10):
    results = collection.query(query_texts=query_text, n_results=n_results)
    return results
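The dictionary returned by collection.query holds nested lists, with one inner list per query text. The sketch below uses a hand-written dictionary with illustrative values (not real output) to show how the best-matching document is extracted; `first_document` is a hypothetical helper that does not appear elsewhere in this notebook.

```python
# Hand-written stand-in for the dictionary returned by collection.query
# when called with one query text and n_results=2 (values are illustrative).
mock_results = {
    "ids": [["id12", "id7"]],
    "distances": [[0.41, 0.58]],
    "documents": [["First matching answer.", "Second matching answer."]],
    "metadatas": [[{"qtype": "information"}, {"qtype": "symptoms"}]],
}


def first_document(results: dict) -> str:
    """Return the best-matching document for the first (and only) query text."""
    # Outer index: which query text; inner index: rank of the match.
    return results["documents"][0][0]


best = first_document(mock_results)
```

This is the same double indexing that the cache class applies later when it reads `answer["documents"][0][0]`.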
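Creating the semantic cache system
To implement the cache system, we will use Faiss, a library that allows storing embeddings in memory. It is quite similar to what Chroma does, but without its persistence.
For this purpose, we will create a class called semantic_cache that works with its own encoder and provides the functions the user needs to perform queries.
In this class, we first query the cache implemented with Faiss, which contains the previous requests. If the closest cached question is within the specified distance threshold, the class returns the cached content. Otherwise, it fetches the result from the Chroma database.
The cache is stored in a .json file.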
!pip install -q faiss-cpu==1.8.0
import faiss
from sentence_transformers import SentenceTransformer
import time
import json
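The init_cache() function below initializes the semantic cache.
It uses the FlatL2 index, which may not be the fastest but is well suited to small datasets. Depending on the characteristics of the data intended for the cache and the expected dataset size, another index such as HNSW or IVF could be used.
I chose this index because it aligns well with the example. It can be used with high-dimensional vectors, consumes little memory, and performs well on small datasets.
I outline the key features of the various indices available in Faiss.
- FlatL2 or FlatIP. Well suited to small datasets. It may not be the fastest, but its memory consumption is modest.
- LSH. It works effectively with small datasets and is recommended for vectors of up to 128 dimensions.
- HNSW. Very fast, but demands a substantial amount of RAM.
- IVF. Works well with large datasets without consuming much memory or compromising performance.
More information about the different indices available in Faiss can be found at this link: https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index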
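As a rough illustration of what a flat index does, here is a pure-Python sketch of an exhaustive nearest-neighbour search. `flat_l2_search` is a hypothetical helper, not part of Faiss; the library performs the same comparison far more efficiently, and its IndexFlatL2 reports squared Euclidean distances, which the sketch mirrors.

```python
def flat_l2_search(stored, query, k=1):
    """Exhaustive nearest-neighbour search over squared L2 distances."""
    scored = []
    for position, vector in enumerate(stored):
        squared_distance = sum((a - b) ** 2 for a, b in zip(vector, query))
        scored.append((squared_distance, position))
    scored.sort()          # smallest distance first
    return scored[:k]      # list of (squared distance, row position)


stored_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
nearest = flat_l2_search(stored_vectors, [0.9, 0.1])
```

Every stored vector is compared with the query, which is why this kind of index is simple and exact but scales linearly with the number of cached entries.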
def init_cache():
    # Flat (exhaustive) L2 index; 768 is the embedding size of all-mpnet-base-v2
    index = faiss.IndexFlatL2(768)
    if index.is_trained:
        print("Index trained")
    # Initialize Sentence Transformer model
    encoder = SentenceTransformer("all-mpnet-base-v2")
    return index, encoder
The retrieve_cache function loads the .json file from disk, in case the cache needs to be reused across sessions.
def retrieve_cache(json_file):
    try:
        with open(json_file, "r") as file:
            cache = json.load(file)
    except FileNotFoundError:
        # No cache file yet: start with an empty cache
        cache = {"questions": [], "embeddings": [], "answers": [], "response_text": []}
    return cache
The store_cache function saves the file containing the cache data to disk.
def store_cache(json_file, cache):
    with open(json_file, "w") as file:
        json.dump(cache, file)
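The round trip between the two functions can be sketched in isolation. The example below is a self-contained approximation: `load_or_empty` mirrors the idea of retrieve_cache, the file name is hypothetical, and the stored values are made up. It also shows why embeddings are kept as plain lists of floats, since the json module cannot serialize NumPy arrays directly.

```python
import json
import os
import tempfile


def load_or_empty(path):
    """Same idea as retrieve_cache: fall back to an empty cache if no file exists."""
    try:
        with open(path, "r") as file:
            return json.load(file)
    except FileNotFoundError:
        return {"questions": [], "embeddings": [], "answers": [], "response_text": []}


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "demo_cache.json")   # hypothetical file name
    first_load = load_or_empty(path)                 # no file yet: empty cache

    # Add one made-up entry; the embedding is a plain list of floats.
    first_load["questions"].append("How do vaccines work?")
    first_load["embeddings"].append([0.1, 0.2, 0.3])
    first_load["answers"].append({"documents": [["Vaccines trigger the immune system."]]})
    first_load["response_text"].append("Vaccines trigger the immune system.")
    with open(path, "w") as file:
        json.dump(first_load, file)

    second_load = load_or_empty(path)                # simulates a new session
```

After the second load, the cache contains exactly what was written, so a later session can start from the previous state.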
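These functions are used in the semantic_cache class, which contains the search function and its initialization function.
Although the ask function contains a substantial amount of code, its purpose is quite simple. It looks in the cache for the question closest to the one the user has just asked.
It then checks whether that question is within the specified threshold. If so, it returns the response directly from the cache; otherwise, it calls the query_database function to retrieve the data from ChromaDB.
I used Euclidean distance instead of cosine similarity, which is widely employed in vector comparisons. This choice is based on the fact that Euclidean distance is the default metric used by Faiss. Cosine distance can also be calculated, but doing so adds complexity that may not contribute significantly to the final result.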
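The two metrics are closely related. For vectors of unit length, the squared Euclidean distance and the cosine similarity satisfy ||a - b||^2 = 2 - 2 cos(a, b). The sketch below checks this identity on two made-up vectors. It assumes unit-normalized embeddings, which is worth verifying for the encoder in use, and it relies on Faiss's flat L2 index reporting squared distances. Under those assumptions, a distance threshold of 0.35 corresponds to a cosine similarity of about 0.825.

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector]


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Two made-up vectors, normalized to unit length
a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 1.0, 3.5])

distance = squared_l2(a, b)
similarity = cosine_similarity(a, b)

# Squared-distance threshold expressed as an equivalent cosine similarity
threshold = 0.35
equivalent_cosine = 1 - threshold / 2
```

With normalized embeddings, ranking by Euclidean distance and ranking by cosine similarity give the same order, which supports the point that switching metrics would add little here.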
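I have included a FIFO eviction policy in the semantic_cache class to improve its efficiency and flexibility. An eviction policy lets users control how the cache behaves when it reaches its maximum capacity. This is important for keeping the cache performing well and for handling situations where available memory is constrained.
Given the structure of the cache, implementing FIFO is straightforward. Whenever a new question-answer pair is added to the cache, it is appended to the end of the lists, so the oldest (first-in) items sit at the front. When the cache reaches its maximum size and an item must be evicted, the first item is removed (popped) from each list. This is the FIFO eviction policy.
Another eviction policy is Least Recently Used (LRU), which is more complex because it requires knowing when each item in the cache was last accessed. This policy is not yet available and will be implemented later.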
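To make the difference with FIFO concrete, here is one possible shape for an LRU policy, sketched with the standard library's OrderedDict. It is not the notebook's implementation: `LRUStore` is a hypothetical container keyed by question text, and wiring it into semantic_cache would also require keeping the Faiss index rows aligned with the stored items.

```python
from collections import OrderedDict


class LRUStore:
    """Hypothetical LRU container keyed by question text."""

    def __init__(self, max_items):
        self.max_items = max_items
        self.items = OrderedDict()

    def get(self, question):
        if question not in self.items:
            return None
        self.items.move_to_end(question)      # mark as most recently used
        return self.items[question]

    def put(self, question, response_text):
        self.items[question] = response_text
        self.items.move_to_end(question)
        while len(self.items) > self.max_items:
            self.items.popitem(last=False)    # drop the least recently used entry


store = LRUStore(max_items=2)
store.put("q1", "a1")
store.put("q2", "a2")
store.get("q1")          # q1 becomes the most recently used entry
store.put("q3", "a3")    # evicts q2; FIFO would have evicted q1
remaining = list(store.items)
```

The read of "q1" is what saves it from eviction, which is exactly the access-time bookkeeping that FIFO does not need.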
class semantic_cache:
    def __init__(self, json_file="cache_file.json", thresold=0.35, max_response=100, eviction_policy=None):
        """Initializes the semantic cache.

        Args:
            json_file (str): The name of the JSON file where the cache is stored.
            thresold (float): The threshold for the Euclidean distance to determine if a question is similar.
            max_response (int): The maximum number of responses the cache can store.
            eviction_policy (str): The policy for evicting items from the cache.
                This can be any policy, but 'FIFO' (First In First Out) has been implemented for now.
                If None, no eviction policy will be applied.
        """
        # Initialize Faiss index with Euclidean distance
        self.index, self.encoder = init_cache()
        # Set Euclidean distance threshold.
        # A distance of 0 means identical sentences.
        # We only return cached answers whose distance is under this threshold.
        self.euclidean_threshold = thresold
        self.json_file = json_file
        self.cache = retrieve_cache(self.json_file)
        # Re-index embeddings loaded from disk so Faiss row ids match the cache lists
        if self.cache["embeddings"]:
            self.index.add(np.array(self.cache["embeddings"], dtype="float32"))
        self.max_response = max_response
        self.eviction_policy = eviction_policy

    def evict(self):
        """Evicts items from the cache based on the eviction policy."""
        if self.eviction_policy and len(self.cache["questions"]) > self.max_response:
            for _ in range(len(self.cache["questions"]) - self.max_response):
                if self.eviction_policy == "FIFO":
                    self.cache["questions"].pop(0)
                    self.cache["embeddings"].pop(0)
                    self.cache["answers"].pop(0)
                    self.cache["response_text"].pop(0)
            # Rebuild the Faiss index so its row ids stay aligned with the cache lists
            self.index.reset()
            if self.cache["embeddings"]:
                self.index.add(np.array(self.cache["embeddings"], dtype="float32"))

    def ask(self, question: str) -> str:
        # Method to retrieve an answer from the cache or generate a new one
        start_time = time.time()
        try:
            # First we obtain the embeddings corresponding to the user question
            embedding = self.encoder.encode([question])
            # Search for the nearest neighbor in the index
            # (IndexFlatL2 is exhaustive, so there is no nprobe to tune)
            D, I = self.index.search(embedding, 1)
            # Faiss returns index -1 when no neighbor is available (empty index)
            if I[0][0] >= 0 and D[0][0] <= self.euclidean_threshold:
                row_id = int(I[0][0])
                print("Answer recovered from Cache. ")
                print(f"{D[0][0]:.3f} smaller than {self.euclidean_threshold}")
                print(f"Found cache in row: {row_id} with score {D[0][0]:.3f}")
                print("response_text: " + self.cache["response_text"][row_id])
                end_time = time.time()
                elapsed_time = end_time - start_time
                print(f"Time taken: {elapsed_time:.3f} seconds")
                return self.cache["response_text"][row_id]
            # Handle the case when there are not enough results
            # or the Euclidean distance threshold is not met: ask ChromaDB.
            answer = query_database([question], 1)
            response_text = answer["documents"][0][0]
            self.cache["questions"].append(question)
            self.cache["embeddings"].append(embedding[0].tolist())
            self.cache["answers"].append(answer)
            self.cache["response_text"].append(response_text)
            print("Answer recovered from ChromaDB. ")
            print(f"response_text: {response_text}")
            self.index.add(embedding)
            self.evict()
            store_cache(self.json_file, self.cache)
            end_time = time.time()
            elapsed_time = end_time - start_time
            print(f"Time taken: {elapsed_time:.3f} seconds")
            return response_text
        except Exception as e:
            raise RuntimeError(f"Error during 'ask' method: {e}")
Testing the semantic_cache class.
>>> # Initialize the cache.
>>> cache = semantic_cache("4cache.json")
Index trained
>>> results = cache.ask("How do vaccines work?")
Answer recovered from ChromaDB. response_text: Summary : Shots may hurt a little, but the diseases they can prevent are a lot worse. Some are even life-threatening. Immunization shots, or vaccinations, are essential. They protect against things like measles, mumps, rubella, hepatitis B, polio, tetanus, diphtheria, and pertussis (whooping cough). Immunizations are important for adults as well as children. Your immune system helps your body fight germs by producing substances to combat them. Once it does, the immune system "remembers" the germ and can fight it again. Vaccines contain germs that have been killed or weakened. When given to a healthy person, the vaccine triggers the immune system to respond and thus build immunity. Before vaccines, people became immune only by actually getting a disease and surviving it. Immunizations are an easier and less risky way to become immune. NIH: National Institute of Allergy and Infectious Diseases Time taken: 0.057 seconds
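As expected, this response was obtained from ChromaDB. The class then stores it in the cache.
Now, if we send a second, completely different question, the response should also be retrieved from ChromaDB. The previously stored question is so dissimilar that its Euclidean distance exceeds the specified threshold.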
>>> results = cache.ask("Explain briefly what is a Sydenham chorea")
Answer recovered from ChromaDB. response_text: Sydenham chorea (SD) is a neurological disorder of childhood resulting from infection via Group A beta-hemolytic streptococcus (GABHS), the bacterium that causes rheumatic fever. SD is characterized by rapid, irregular, and aimless involuntary movements of the arms and legs, trunk, and facial muscles. It affects girls more often than boys and typically occurs between 5 and 15 years of age. Some children will have a sore throat several weeks before the symptoms begin, but the disorder can also strike up to 6 months after the fever or infection has cleared. Symptoms can appear gradually or all at once, and also may include uncoordinated movements, muscular weakness, stumbling and falling, slurred speech, difficulty concentrating and writing, and emotional instability. The symptoms of SD can vary from a halting gait and slight grimacing to involuntary movements that are frequent and severe enough to be incapacitating. The random, writhing movements of chorea are caused by an auto-immune reaction to the bacterium that interferes with the normal function of a part of the brain (the basal ganglia) that controls motor movements. Due to better sanitary conditions and the use of antibiotics to treat streptococcal infections, rheumatic fever, and consequently SD, are rare in North America and Europe. The disease can still be found in developing nations. Time taken: 0.082 seconds
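Perfect, the semantic cache system is behaving as expected.
Let's now test it with a question very similar to the one we just asked.
In this case, the response should come directly from the cache, without accessing the ChromaDB database.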
>>> results = cache.ask("Briefly explain me what is a Sydenham chorea.")
Answer recovered from Cache. 0.028 smaller than 0.35 Found cache in row: 1 with score 0.028 response_text: Sydenham chorea (SD) is a neurological disorder of childhood resulting from infection via Group A beta-hemolytic streptococcus (GABHS), the bacterium that causes rheumatic fever. SD is characterized by rapid, irregular, and aimless involuntary movements of the arms and legs, trunk, and facial muscles. It affects girls more often than boys and typically occurs between 5 and 15 years of age. Some children will have a sore throat several weeks before the symptoms begin, but the disorder can also strike up to 6 months after the fever or infection has cleared. Symptoms can appear gradually or all at once, and also may include uncoordinated movements, muscular weakness, stumbling and falling, slurred speech, difficulty concentrating and writing, and emotional instability. The symptoms of SD can vary from a halting gait and slight grimacing to involuntary movements that are frequent and severe enough to be incapacitating. The random, writhing movements of chorea are caused by an auto-immune reaction to the bacterium that interferes with the normal function of a part of the brain (the basal ganglia) that controls motor movements. Due to better sanitary conditions and the use of antibiotics to treat streptococcal infections, rheumatic fever, and consequently SD, are rare in North America and Europe. The disease can still be found in developing nations. Time taken: 0.019 seconds
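The two questions are so similar that their Euclidean distance is very small, almost as if they were identical.
Now let's try another question, this time a bit more different, and observe how the system behaves.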
>>> question_def = "Write in 20 words what is a Sydenham chorea."
>>> results = cache.ask(question_def)
Answer recovered from Cache. 0.228 smaller than 0.35 Found cache in row: 1 with score 0.228 response_text: Sydenham chorea (SD) is a neurological disorder of childhood resulting from infection via Group A beta-hemolytic streptococcus (GABHS), the bacterium that causes rheumatic fever. SD is characterized by rapid, irregular, and aimless involuntary movements of the arms and legs, trunk, and facial muscles. It affects girls more often than boys and typically occurs between 5 and 15 years of age. Some children will have a sore throat several weeks before the symptoms begin, but the disorder can also strike up to 6 months after the fever or infection has cleared. Symptoms can appear gradually or all at once, and also may include uncoordinated movements, muscular weakness, stumbling and falling, slurred speech, difficulty concentrating and writing, and emotional instability. The symptoms of SD can vary from a halting gait and slight grimacing to involuntary movements that are frequent and severe enough to be incapacitating. The random, writhing movements of chorea are caused by an auto-immune reaction to the bacterium that interferes with the normal function of a part of the brain (the basal ganglia) that controls motor movements. Due to better sanitary conditions and the use of antibiotics to treat streptococcal infections, rheumatic fever, and consequently SD, are rare in North America and Europe. The disease can still be found in developing nations. Time taken: 0.016 seconds
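We observe that the Euclidean distance has increased, but it remains within the specified threshold. The system therefore continues to return the response directly from the cache.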
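Loading the model and creating the prompt
Time to use the transformers library, the best-known library from Hugging Face for working with language models.
We are importing:
- AutoTokenizer: a utility class for tokenizing text inputs, compatible with a wide range of pre-trained language models.
- AutoModelForCausalLM: provides an interface to pre-trained language models designed for text generation with causal language modeling (e.g., GPT models), including the one used in this notebook, Gemma-2b-it.
Feel free to test different models. You need to search for NLP models trained for text generation.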
!pip install torch
import torch
from torch import cuda
# On Apple Silicon (Mac) the device must be 'mps'
# device = torch.device('mps')  # to use with Apple Silicon
device = f"cuda:{cuda.current_device()}" if cuda.is_available() else "cpu"
from transformers import AutoTokenizer, AutoModelForCausalLM
model_id = "google/gemma-2b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load the model on the device selected above
model = AutoModelForCausalLM.from_pretrained(model_id, device_map=device, torch_dtype=torch.bfloat16)
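Creating the extended prompt
To create the prompt, we use the result of querying the semantic_cache class and the question entered by the user.
The prompt has two parts: the relevant context, which is the information retrieved from the database, and the user's question.
We only need to combine the two parts to create the prompt and then send it to the model.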
prompt_template = f"Relevant context: {results}\n\n The user's question: {question_def}"
prompt_template
# Move the tokenized prompt to the device selected earlier
input_ids = tokenizer(prompt_template, return_tensors="pt").to(device)
Now all that remains is to send the prompt to the model and wait for its response!
>>> outputs = model.generate(**input_ids, max_new_tokens=256)
>>> print(tokenizer.decode(outputs[0]))
Relevant context: Sydenham chorea (SD) is a neurological disorder of childhood resulting from infection via Group A beta-hemolytic streptococcus (GABHS), the bacterium that causes rheumatic fever. SD is characterized by rapid, irregular, and aimless involuntary movements of the arms and legs, trunk, and facial muscles. It affects girls more often than boys and typically occurs between 5 and 15 years of age. Some children will have a sore throat several weeks before the symptoms begin, but the disorder can also strike up to 6 months after the fever or infection has cleared. Symptoms can appear gradually or all at once, and also may include uncoordinated movements, muscular weakness, stumbling and falling, slurred speech, difficulty concentrating and writing, and emotional instability. The symptoms of SD can vary from a halting gait and slight grimacing to involuntary movements that are frequent and severe enough to be incapacitating. The random, writhing movements of chorea are caused by an auto-immune reaction to the bacterium that interferes with the normal function of a part of the brain (the basal ganglia) that controls motor movements. Due to better sanitary conditions and the use of antibiotics to treat streptococcal infections, rheumatic fever, and consequently SD, are rare in North America and Europe. The disease can still be found in developing nations. The user's question: Write in 20 words what is a Sydenham chorea. Sure, here is a 20-word answer: Sydenham chorea is a neurological disorder of childhood resulting from infection via Group A beta-hemolytic streptococcus (GABHS).
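The decoded text above repeats the prompt because, for causal language models, generate returns the prompt tokens followed by the newly generated ones. If only the answer is wanted, the prompt tokens can be sliced off before decoding. The sketch below shows the idea with plain Python lists standing in for the token id tensors; the ids are made up.

```python
# Made-up token ids standing in for the real tensors.
prompt_ids = [101, 7592, 2088]            # plays the role of input_ids["input_ids"][0]
output_ids = prompt_ids + [2023, 2003]    # plays the role of outputs[0]

# Keep only the tokens produced after the prompt.
new_token_ids = output_ids[len(prompt_ids):]

# With the real objects, the equivalent slice would be:
#   outputs[0][input_ids["input_ids"].shape[-1]:]
# and the result can then be passed to tokenizer.decode.
```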
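Conclusion.
Data retrieval time decreased by at least 50% when answering from the cache instead of ChromaDB: in the runs above, the ChromaDB lookups took 0.057 and 0.082 seconds, while the cache hits took 0.019 and 0.016 seconds. In larger projects this difference grows, and the performance gain can reach 90-95%.
We have very little data in Chroma and only one instance of the cache class. Typically, the data behind the cache system is much larger and involves more than a query to a vector database, since it may come from a variety of sources.
It is common to have several instances of the cache class, usually based on user typology, because questions tend to repeat among users who share common characteristics.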
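One simple way to organize several cache instances is a registry keyed by user type, created lazily on first use. The sketch below is only an illustration: `ToyCache`, `cache_for`, and the user types are hypothetical, and `ToyCache` stands in for a real cache class such as semantic_cache.

```python
class ToyCache:
    """Hypothetical stand-in for a cache instance (not the semantic_cache class)."""

    def __init__(self, name):
        self.name = name
        self.entries = {}


caches = {}   # one cache per user type, created on first use


def cache_for(user_type: str) -> ToyCache:
    """Return the cache assigned to a user type, creating it if needed."""
    if user_type not in caches:
        caches[user_type] = ToyCache(name=f"{user_type}_cache.json")
    return caches[user_type]


# The same question can map to different cached content per user type.
cache_for("clinician").entries["dosage question"] = "cached context A"
cache_for("patient").entries["dosage question"] = "cached context B"
same_instance = cache_for("clinician") is cache_for("clinician")
```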
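In summary, we have built a very simple RAG (Retrieval-Augmented Generation) system and enhanced it with a semantic cache layer placed between the user's question and the retrieval of the information needed to create the enriched prompt.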