RAG using Hugging Face tools

Community article, published July 7, 2024

Definitions

Tools

Embed the original dataset

Search the dataset

RAG chatbot

Demo

Acknowledgments

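Definitions

Let's start by defining RAG: Retrieval-Augmented Generation. It is a natural language processing (NLP) technique that improves a language model's performance by integrating external knowledge sources, such as databases or search engines. The basic idea is to retrieve relevant information from an external source based on the input query, then hand that information to the model as context when it generates its answer.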
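To make the idea concrete, here is a minimal sketch of the retrieve-then-augment step. It assumes a toy in-memory corpus and a naive word-overlap score in place of a real embedding model. The names `DOCS`, `tokenize`, `retrieve`, and `augment` are hypothetical and do not come from any library.

```python
import re

# Toy corpus standing in for an external knowledge source (hypothetical data).
DOCS = [
    "Anarchism is a political philosophy that is skeptical of authority.",
    "Faiss is a library for efficient similarity search.",
    "Tunis is the capital of Tunisia.",
]

def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Rank documents by how many words they share with the query."""
    q = tokenize(query)
    ranked = sorted(docs, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]

def augment(query: str, context: list[str]) -> str:
    """Build the prompt that a language model would receive."""
    return f"Question: {query}\nContext: " + "\n".join(context)

prompt = augment("What is Faiss?", retrieve("What is Faiss?", DOCS))
print(prompt)
```

The rest of this post follows the same shape, with the word-overlap score replaced by embedding similarity and the printed prompt passed to an LLM.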

Tools

This blog post requires the following libraries:

pip install -q datasets sentence-transformers faiss-cpu accelerate

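Embed the original dataset

This is a very slow process, so we recommend using a GPU.

This step is necessary, and it is also the slowest one in this walkthrough. We recommend embedding the dataset once and saving it or pushing it to the Hub, so you don't have to recompute the embeddings every time.

Let's start by loading the original dataset: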

from datasets import load_dataset

dataset = load_dataset("not-lain/wikipedia")

dataset # Let's checkout our dataset
>>> DatasetDict({
    train: Dataset({
        features: ['id', 'url', 'title', 'text'],
        num_rows: 3000
    })
})

Next, we load the embedding model. I'll use mixedbread-ai/mxbai-embed-large-v1:

from sentence_transformers import SentenceTransformer
ST = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

Now let's embed the dataset:

def embed(batch):
    """
    adds a column to the dataset called 'embeddings'
    """
    # or you can combine multiple columns here
    # For example the title and the text
    information = batch["text"]
    return {"embeddings" : ST.encode(information)}

dataset = dataset.map(embed,batched=True,batch_size=16)
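The comment in `embed` mentions combining several columns. Below is a minimal sketch of one way to do that, assuming the title and text are simply joined with a newline. `build_inputs`, `embed_with_title`, and `fake_encode` are hypothetical names, and the stand-in encoder exists only so the sketch runs without downloading a model.

```python
def build_inputs(batch: dict) -> list[str]:
    """Join each title with its text so both get embedded together."""
    return [f"{title}\n{text}" for title, text in zip(batch["title"], batch["text"])]

def embed_with_title(batch: dict, encode) -> dict:
    """Same output shape as `embed`, but the encoder is passed in."""
    return {"embeddings": encode(build_inputs(batch))}

# Stand-in encoder (an assumption for illustration only): maps each string
# to a one-element vector holding its length.
fake_encode = lambda texts: [[float(len(t))] for t in texts]

batch = {"title": ["Anarchism"], "text": ["Anarchism is a political philosophy."]}
print(build_inputs(batch))
print(embed_with_title(batch, fake_encode))
```

With the real model, this could be plugged into the same call as above, for example `dataset.map(lambda b: embed_with_title(b, ST.encode), batched=True, batch_size=16)`.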

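We recommend saving the dataset so you don't have to repeat this step every time.

To keep the original dataset intact for all users, I will push the embedded dataset to a new branch, which is easy to do with the `revision` parameter: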

dataset.push_to_hub("not-lain/wikipedia", revision="embedded")

Search the dataset

You can load the embedded dataset from the Hub:

from datasets import load_dataset

dataset = load_dataset("not-lain/wikipedia",revision = "embedded")

Then add a Faiss index on the `embeddings` column we created.

data = dataset["train"]
data = data.add_faiss_index("embeddings")

Let's define a search function:

def search(query: str, k: int = 3 ):
    """a function that embeds a new query and returns the most probable results"""
    embedded_query = ST.encode(query) # embed new query
    scores, retrieved_examples = data.get_nearest_examples( # retrieve results
        "embeddings", embedded_query, # compare our new embedded query with the dataset embeddings
        k=k # get only top k results
    )
    return scores, retrieved_examples

# search for word anarchy and get the best 4 matching values from the dataset
scores , result = search("anarchy", 4 ) 
result['title']
>>> ['Anarchism', 'Anarcho-capitalism', 'Community', 'Capitalism']

print(result["text"][0])
>>>"Anarchism is a political philosophy and movement that is skeptical of all justifications for authority and (...)"
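Conceptually, the search above compares the query vector with every stored vector and keeps the closest ones. The sketch below does this by brute force in plain Python using squared Euclidean distance. That a default flat Faiss index computes this distance is my understanding rather than something stated in this post, so treat it as an assumption. `squared_l2`, `nearest`, and the toy vectors are hypothetical.

```python
def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(query: list[float], vectors: list[list[float]], k: int = 3):
    """Return (distances, indices) of the k vectors closest to the query."""
    scored = sorted((squared_l2(query, v), i) for i, v in enumerate(vectors))
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

# Toy 2-dimensional "embeddings" (hypothetical data).
vectors = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
distances, indices = nearest([0.9, 0.1], vectors, k=2)
print(indices)    # closest vector first
print(distances)  # smaller distance means a closer match
```

If the index is distance-based as assumed here, a lower value in `scores` indicates a better match.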

RAG chatbot

Here is a rough outline of a RAG chatbot:

embed (only once)
│
└── new query
    │
    └── retrieve
        │
        └─── format prompt
            │
            └── GenAI
                │
                └── generate response

Now that the embeddings are ready, let's put everything together in a new session.

pip install -q datasets sentence-transformers faiss-cpu accelerate bitsandbytes

from sentence_transformers import SentenceTransformer
from datasets import load_dataset

ST = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

dataset = load_dataset("not-lain/wikipedia",revision = "embedded")

data = dataset["train"]
data = data.add_faiss_index("embeddings") # column name that has the embeddings of the dataset

def search(query: str, k: int = 3 ):
    """a function that embeds a new query and returns the most probable results"""
    embedded_query = ST.encode(query) # embed new query
    scores, retrieved_examples = data.get_nearest_examples( # retrieve results
        "embeddings", embedded_query, # compare our new embedded query with the dataset embeddings
        k=k # get only top k results
    )
    return scores, retrieved_examples

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# use quantization to lower GPU usage
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=bnb_config
)
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]
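To see why 4-bit quantization helps, here is a back-of-envelope estimate of the memory taken by the weights alone. It assumes a round figure of 8 billion parameters (inferred from the "8B" in the model name) and ignores quantization constants, any layers left unquantized, activations, and the KV cache, so real usage will be higher. `weight_memory_gb` and `N_PARAMS` are hypothetical names.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 8e9  # assumed round figure for an "8B" model

bf16_gb = weight_memory_gb(N_PARAMS, 16)  # bfloat16 weights
int4_gb = weight_memory_gb(N_PARAMS, 4)   # 4-bit quantized weights
print(bf16_gb, int4_gb)
```

Under these assumptions the weights shrink to roughly a quarter of their bfloat16 size.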

We recommend setting a system prompt to guide the large language model (LLM) when it generates responses.

SYS_PROMPT = """You are an assistant for answering questions.
You are given the extracted parts of a long document and a question. Provide a conversational answer.
If you don't know the answer, just say "I do not know." Don't make up an answer."""

def format_prompt(prompt,retrieved_documents,k):
  """using the retrieved documents we will prompt the model to generate our responses"""
  PROMPT = f"Question:{prompt}\nContext:"
  for idx in range(k) :
    PROMPT+= f"{retrieved_documents['text'][idx]}\n"
  return PROMPT

def generate(formatted_prompt):
  formatted_prompt = formatted_prompt[:2000] # to avoid GPU OOM
  messages = [{"role":"system","content":SYS_PROMPT},{"role":"user","content":formatted_prompt}]
  # tell the model to generate
  input_ids = tokenizer.apply_chat_template(
      messages,
      add_generation_prompt=True,
      return_tensors="pt"
  ).to(model.device)
  outputs = model.generate(
      input_ids,
      max_new_tokens=1024,
      eos_token_id=terminators,
      do_sample=True,
      temperature=0.6,
      top_p=0.9,
  )
  response = outputs[0][input_ids.shape[-1]:]
  return tokenizer.decode(response, skip_special_tokens=True)

def rag_chatbot(prompt:str,k:int=2):
  scores , retrieved_documents = search(prompt, k)
  formatted_prompt = format_prompt(prompt,retrieved_documents,k)
  return generate(formatted_prompt)

rag_chatbot("what's anarchy ?", k = 2)
>>>"So, anarchism is a political philosophy that questions the need for authority and hierarchy, and (...)"
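One thing to keep in mind: `generate` cuts the whole prompt at 2000 characters, so a long first document can push later ones out entirely, and `format_prompt` assumes at least `k` documents were retrieved. A possible variation, sketched under the assumption that a per-document character budget is acceptable, is shown below. `format_prompt_budgeted` and `max_chars_per_doc` are hypothetical names and are not part of the original code.

```python
def format_prompt_budgeted(prompt: str, retrieved_documents: dict,
                           k: int, max_chars_per_doc: int = 900) -> str:
    """Like format_prompt, but trims each document separately."""
    texts = retrieved_documents["text"][:k]  # tolerates fewer than k results
    parts = [f"Question:{prompt}\nContext:"]
    for text in texts:
        # Keep only the start of each document so every one gets some space.
        parts.append(text[:max_chars_per_doc] + "\n")
    return "".join(parts)

# Hypothetical retrieval result: one very long document and one short one.
docs = {"text": ["a" * 5000, "second document"]}
out = format_prompt_budgeted("what's anarchy ?", docs, k=3, max_chars_per_doc=10)
print(out)
```

The budget is a character count, not a token count, so it is only a rough guard against running out of GPU memory.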

Demo

You can find a demo application here to try it out.

Acknowledgments

In loving memory of Rayner V. Giuret, a friend, a brother, and an idol to all of us at LowRes.
Your legacy lives on in our hearts and minds. Thanks for everything.

Rest in peace, Rayner.
