Building a RAG System with Gemma, MongoDB, and Open Source Models

Author: Richmond Alake

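Step 1: Installing Libraries

The shell commands below install the libraries needed to work with open-source large language models (LLMs), embedding models, and the database. These libraries simplify the development of a RAG system and keep the amount of code required to a minimum.

  • PyMongo: a Python library for interacting with MongoDB that provides functionality for connecting to a cluster and querying data stored in collections and documents.
  • Pandas: provides data structures for efficient data processing and analysis in Python.
  • Hugging Face Datasets: provides access to audio, vision, and text datasets.
  • Hugging Face Accelerate: abstracts away the complexity of writing code that uses hardware accelerators such as GPUs. This implementation uses Accelerate to run the Gemma model on GPU resources.
  • Hugging Face Transformers: provides access to a large collection of pre-trained models.
  • Hugging Face Sentence Transformers: provides access to sentence, text, and image embeddings.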
!pip install datasets pandas pymongo sentence_transformers
!pip install -U transformers
# Install below if using GPU
!pip install accelerate
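
Before moving on, it can help to confirm that the packages are importable in the current environment. The following is a minimal sketch, not part of the original notebook; the helper name missing_packages is hypothetical, and the lists simply mirror the pip commands above.

```python
from importlib.util import find_spec

# Import names of the packages installed above; accelerate is only needed on GPU.
REQUIRED = ["datasets", "pandas", "pymongo", "sentence_transformers", "transformers"]
OPTIONAL = ["accelerate"]


def missing_packages(names):
    """Return the import names from `names` that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


print("Missing required packages:", missing_packages(REQUIRED))
print("Missing optional packages:", missing_packages(OPTIONAL))
```

An empty list for the required packages suggests the installation step succeeded.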

Step 2: Data Sourcing and Preparation

The data used in this tutorial comes from Hugging Face Datasets, specifically the AIatMongoDB/embedded_movies dataset.

# Load Dataset
from datasets import load_dataset
import pandas as pd

# https://huggingface.co/datasets/AIatMongoDB/embedded_movies
dataset = load_dataset("AIatMongoDB/embedded_movies")

# Convert the dataset to a pandas dataframe
dataset_df = pd.DataFrame(dataset["train"])

dataset_df.head(5)

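The operations in the following code snippet focus on enforcing data integrity and quality.

  1. The first operation ensures that the fullplot attribute of each data point is not empty, since this is the primary data used in the embedding process.
  2. This step also removes the plot_embedding attribute from every data point, since it will be replaced by new embeddings created with a different embedding model (gte-large).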
>>> # Data Preparation

>>> # Remove data points where the fullplot column is missing
>>> dataset_df = dataset_df.dropna(subset=["fullplot"])
>>> print("\nNumber of missing values in each column after removal:")
>>> print(dataset_df.isnull().sum())

>>> # Remove the plot_embedding from each data point in the dataset as we are going to create new embeddings with an open source embedding model from Hugging Face
>>> dataset_df = dataset_df.drop(columns=["plot_embedding"])
>>> dataset_df.head(5)
Number of missing values in each column after removal:
num_mflix_comments      0
genres                  0
countries               0
directors              12
fullplot                0
writers                13
awards                  0
runtime                14
type                    0
rated                 279
metacritic            893
poster                 78
languages               1
imdb                    0
plot                    0
cast                    1
plot_embedding          1
title                   0
dtype: int64

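Step 3: Generating Embeddings

The steps in the code snippet are as follows:

  1. Import the SentenceTransformer class to access the embedding model.
  2. Load the embedding model with the SentenceTransformer constructor to instantiate the gte-large embedding model.
  3. Define the get_embedding function, which takes a text string as input and returns a list of floats representing the embedding. The function first checks whether the input text is empty (after stripping whitespace). If it is, the function returns an empty list. Otherwise, it generates an embedding with the loaded model.
  4. Generate embeddings by applying the get_embedding function to the "fullplot" column of the dataset_df DataFrame, producing an embedding for each movie's plot. The resulting list of embeddings is assigned to a new column named embedding.

Note: There is no need to chunk the text of the full plots, as we can ensure the text length stays within a manageable range.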

from sentence_transformers import SentenceTransformer

# https://huggingface.co/thenlper/gte-large
embedding_model = SentenceTransformer("thenlper/gte-large")


def get_embedding(text: str) -> list[float]:
    if not text.strip():
        print("Attempted to get embedding for empty text.")
        return []

    embedding = embedding_model.encode(text)

    return embedding.tolist()


dataset_df["embedding"] = dataset_df["fullplot"].apply(get_embedding)

dataset_df.head()
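
The note above says chunking is unnecessary for this dataset. If the plots were longer than the embedding model's maximum input length, one option would be to split each text into overlapping windows and embed each window separately. The following is a rough sketch under that assumption; chunk_words and its default sizes are hypothetical, and word counts are only an approximation of the token counts the model actually sees.

```python
def chunk_words(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split text into overlapping windows of at most max_words words."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
print(chunk_words(sample, max_words=4, overlap=1))
```

Each chunk could then be passed to get_embedding in place of the full plot.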

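Step 4: Database Setup and Connection

MongoDB acts as both an operational and a vector database. It offers a database solution that efficiently stores, queries, and retrieves vector embeddings; its advantages lie in the simplicity of database maintenance, management, and cost.

To create a new MongoDB database, set up a database cluster:

  1. Go to the official MongoDB website and register for a free MongoDB Atlas account, or, for existing users, sign in to MongoDB Atlas.

  2. Select the "Database" option in the left-hand pane, which navigates to the Database Deployment page, where the deployment specifications of any existing clusters are listed. Create a new database cluster by clicking the "+Create" button.

  3. Select all the applicable configurations for the database cluster. Once all the configuration options are selected, click the "Create Cluster" button to deploy the newly created cluster. MongoDB also supports creating free clusters on the "Shared Tab".

    Note: When creating a proof of concept, don't forget to add the IP address of the Python host to the IP access list, or add 0.0.0.0/0 to allow any IP.

  4. After the cluster is created and deployed, it becomes accessible on the "Database Deployment" page.

  5. Click the cluster's "Connect" button to view the options for setting up a connection to the cluster through various language drivers.

  6. This tutorial only requires the cluster's URI (uniform resource identifier). Copy the URI into a variable named MONGO_URI in the Google Colab Secrets environment, or place it in a .env file or equivalent.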
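
Outside Colab, the URI can be read from an environment variable instead of Colab Secrets. The following is a minimal sketch, assuming the variable has already been exported in the shell or loaded from a .env file by a separate tool; the helper name load_mongo_uri is hypothetical.

```python
import os


def load_mongo_uri(env_var: str = "MONGO_URI") -> str:
    """Return the connection string stored in the given environment variable."""
    uri = os.environ.get(env_var, "").strip()
    if not uri:
        raise RuntimeError(f"{env_var} is not set; add it to your environment or .env file")
    return uri
```

Failing early with a clear message avoids passing an empty URI to the database client later on.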

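4.1 Database and Collection Setup

Before moving forward, ensure the following prerequisites are met:

  • A database cluster is set up on MongoDB Atlas
  • The URI of the cluster has been obtained

For help with setting up a database cluster and obtaining the URI, refer to the guides on setting up a MongoDB cluster and getting your connection string.

Once the cluster is created, create the database and collection within the MongoDB Atlas cluster by clicking + Create Database on the cluster overview page.

Here is a guide on how to create a database and collection.

The database will be named movies.

The collection will be named movie_collection_2.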

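Step 5: Creating a Vector Search Index

At this point, make sure a vector index has been created through MongoDB Atlas. The search pipeline in Step 7 refers to the index by the name vector_index.

This step is required for efficient and accurate vector-based search over the vector embeddings stored in the documents of the movie_collection_2 collection.

A vector search index makes it possible to traverse the documents efficiently and retrieve those whose embeddings match the query embedding based on vector similarity.

Go here to read more about MongoDB Vector Search Index.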

{
 "fields": [{
     "numDimensions": 1024,
     "path": "embedding",
     "similarity": "cosine",
     "type": "vector"
   }]
}

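The numDimensions value of 1024 corresponds to the dimension of the vectors generated by the gte-large embedding model. If you use the gte-base or gte-small embedding model, the numDimensions value in the vector search index must be set to 768 or 384, respectively.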
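
To keep the index definition in step with the chosen embedding model, the definition can be generated rather than edited by hand. The following is a small sketch; the helper build_index_definition and the MODEL_DIMENSIONS table are hypothetical names, and the dimensions are the ones stated above.

```python
import json

# Embedding sizes stated in the text above.
MODEL_DIMENSIONS = {"gte-large": 1024, "gte-base": 768, "gte-small": 384}


def build_index_definition(model_name: str, path: str = "embedding") -> dict:
    """Build a vector search index definition for one of the gte models."""
    return {
        "fields": [
            {
                "numDimensions": MODEL_DIMENSIONS[model_name],
                "path": path,
                "similarity": "cosine",
                "type": "vector",
            }
        ]
    }


print(json.dumps(build_index_definition("gte-base"), indent=2))
```

The printed JSON can be pasted into the index editor in MongoDB Atlas.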

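Step 6: Establishing the Data Connection

The code snippet below uses PyMongo to create a MongoDB client object that represents the connection to the cluster and provides access to its databases and collections.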

>>> import pymongo
>>> from google.colab import userdata


>>> def get_mongo_client(mongo_uri):
...     """Establish connection to the MongoDB."""
...     try:
...         client = pymongo.MongoClient(mongo_uri)
...         print("Connection to MongoDB successful")
...         return client
...     except pymongo.errors.ConnectionFailure as e:
...         print(f"Connection failed: {e}")
...         return None


>>> mongo_uri = userdata.get("MONGO_URI")
>>> if not mongo_uri:
...     print("MONGO_URI not set in environment variables")

>>> mongo_client = get_mongo_client(mongo_uri)

>>> # Ingest data into MongoDB
>>> db = mongo_client["movies"]
>>> collection = db["movie_collection_2"]
Connection to MongoDB successful
# Delete any existing records in the collection
collection.delete_many({})
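
PyMongo's MongoClient typically connects lazily, so constructing the client can succeed even when the cluster is unreachable. A ping command is one way to confirm the connection before ingesting data. The following is a sketch; check_connection is a hypothetical helper, and it catches a broad Exception only to keep the example short.

```python
def check_connection(client) -> bool:
    """Return True if the server answers a ping, False otherwise."""
    try:
        # "ping" is a lightweight command that forces a round trip to the server
        client.admin.command("ping")
        return True
    except Exception as exc:  # broad on purpose for this sketch
        print(f"Ping failed: {exc}")
        return False
```

Calling check_connection(mongo_client) right after creating the client would surface connection problems early.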

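Ingesting data from a pandas DataFrame into a MongoDB collection is a straightforward process: convert the DataFrame to a list of dictionaries, then pass the converted records to the collection's insert_many method.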

>>> documents = dataset_df.to_dict("records")
>>> collection.insert_many(documents)

>>> print("Data ingestion into MongoDB completed")
Data ingestion into MongoDB completed
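
For larger datasets, it may be convenient to send the records in smaller batches, for example to report progress or to retry a single failed batch. The following is a sketch under that assumption; insert_in_batches and the batch size of 500 are hypothetical choices, not part of the original notebook.

```python
def insert_in_batches(collection, documents, batch_size=500):
    """Insert documents in fixed-size batches and return how many were sent."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    inserted = 0
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        collection.insert_many(batch)  # same call as above, on a slice of the records
        inserted += len(batch)
    return inserted
```

With the objects defined above, this would be called as insert_in_batches(collection, documents).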

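Step 7: Performing Vector Search on User Queries

The following step implements a function that returns vector search results by generating a query embedding and defining a MongoDB aggregation pipeline.

The pipeline consists of the $vectorSearch and $project stages. It runs the query with the generated vector and formats the results to include only the required information, such as the plot, title, and genres, along with a search score for each result.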

def vector_search(user_query, collection):
    """
    Perform a vector search in the MongoDB collection based on the user query.

    Args:
    user_query (str): The user's query string.
    collection (MongoCollection): The MongoDB collection to search.

    Returns:
    list: A list of matching documents.
    """

    # Generate embedding for the user query
    query_embedding = get_embedding(user_query)

    if not query_embedding:
        # get_embedding returns an empty list for empty input
        return []

    # Define the vector search pipeline
    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",
                "queryVector": query_embedding,
                "path": "embedding",
                "numCandidates": 150,  # Number of candidate matches to consider
                "limit": 4,  # Return top 4 matches
            }
        },
        {
            "$project": {
                "_id": 0,  # Exclude the _id field
                "fullplot": 1,  # Include the fullplot field
                "title": 1,  # Include the title field
                "genres": 1,  # Include the genres field
                "score": {"$meta": "vectorSearchScore"},  # Include the search score
            }
        },
    ]

    # Execute the search
    results = collection.aggregate(pipeline)
    return list(results)
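
To build intuition for what the $vectorSearch stage computes, the same ranking can be sketched as an exact, brute-force cosine search in plain Python. This is only an illustration: Atlas uses an approximate search over the index and reports a normalized score, so its numbers will not match raw cosine values. The function names and the toy 2-dimensional embeddings are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def brute_force_search(query_vector, documents, limit=4):
    """Rank documents (dicts with an 'embedding' key) by cosine similarity to the query."""
    scored = [
        {"title": doc["title"], "score": cosine_similarity(query_vector, doc["embedding"])}
        for doc in documents
    ]
    return sorted(scored, key=lambda item: item["score"], reverse=True)[:limit]


# Toy 2-dimensional embeddings, made up for illustration.
toy_docs = [
    {"title": "A", "embedding": [1.0, 0.0]},
    {"title": "B", "embedding": [0.0, 1.0]},
    {"title": "C", "embedding": [1.0, 1.0]},
]
print(brute_force_search([1.0, 0.1], toy_docs, limit=2))
```

A brute-force scan like this compares the query against every document, which is why an index becomes important as the collection grows.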

Step 8: Handling User Queries and Loading Gemma

def get_search_result(query, collection):

    get_knowledge = vector_search(query, collection)

    search_result = ""
    for result in get_knowledge:
        search_result += f"Title: {result.get('title', 'N/A')}, Plot: {result.get('fullplot', 'N/A')}\n"

    return search_result
>>> # Conduct query with retrieval of sources
>>> query = "What is the best romantic movie to watch and why?"
>>> source_information = get_search_result(query, collection)
>>> combined_information = (
...     f"Query: {query}\nContinue to answer the query by using the Search Results:\n{source_information}."
... )

>>> print(combined_information)
Query: What is the best romantic movie to watch and why?
Continue to answer the query by using the Search Results:
Title: Shut Up and Kiss Me!, Plot: Ryan and Pete are 27-year old best friends in Miami, born on the same day and each searching for the perfect woman. Ryan is a rookie stockbroker living with his psychic Mom. Pete is a slick surfer dude yet to find commitment. Each meets the women of their dreams on the same day. Ryan knocks heads in an elevator with the gorgeous Jessica, passing out before getting her number. Pete falls for the insatiable Tiara, but Tiara's uncle is mob boss Vincent Bublione, charged with her protection. This high-energy romantic comedy asks to what extent will you go for true love?
Title: Pearl Harbor, Plot: Pearl Harbor is a classic tale of romance set during a war that complicates everything. It all starts when childhood friends Rafe and Danny become Army Air Corps pilots and meet Evelyn, a Navy nurse. Rafe falls head over heels and next thing you know Evelyn and Rafe are hooking up. Then Rafe volunteers to go fight in Britain and Evelyn and Danny get transferred to Pearl Harbor. While Rafe is off fighting everything gets completely whack and next thing you know everybody is in the middle of an air raid we now know as "Pearl Harbor."
Title: Titanic, Plot: The plot focuses on the romances of two couples upon the doomed ship's maiden voyage. Isabella Paradine (Catherine Zeta-Jones) is a wealthy woman mourning the loss of her aunt, who reignites a romance with former flame Wynn Park (Peter Gallagher). Meanwhile, a charming ne'er-do-well named Jamie Perse (Mike Doyle) steals a ticket for the ship, and falls for a sweet innocent Irish girl on board. But their romance is threatened by the villainous Simon Doonan (Tim Curry), who has discovered about the ticket and makes Jamie his unwilling accomplice, as well as having sinister plans for the girl.
Title: China Girl, Plot: A modern day Romeo & Juliet story is told in New York when an Italian boy and a Chinese girl become lovers, causing a tragic conflict between ethnic gangs.
.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")
# CPU Enabled uncomment below 👇🏽
# model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it")
# GPU Enabled use below 👇🏽
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it", device_map="auto")
>>> # Moving tensors to GPU
>>> input_ids = tokenizer(combined_information, return_tensors="pt").to("cuda")
>>> response = model.generate(**input_ids, max_new_tokens=500)
>>> print(tokenizer.decode(response[0]))
Query: What is the best romantic movie to watch and why?
Continue to answer the query by using the Search Results:
Title: Shut Up and Kiss Me!, Plot: Ryan and Pete are 27-year old best friends in Miami, born on the same day and each searching for the perfect woman. Ryan is a rookie stockbroker living with his psychic Mom. Pete is a slick surfer dude yet to find commitment. Each meets the women of their dreams on the same day. Ryan knocks heads in an elevator with the gorgeous Jessica, passing out before getting her number. Pete falls for the insatiable Tiara, but Tiara's uncle is mob boss Vincent Bublione, charged with her protection. This high-energy romantic comedy asks to what extent will you go for true love?
Title: Pearl Harbor, Plot: Pearl Harbor is a classic tale of romance set during a war that complicates everything. It all starts when childhood friends Rafe and Danny become Army Air Corps pilots and meet Evelyn, a Navy nurse. Rafe falls head over heels and next thing you know Evelyn and Rafe are hooking up. Then Rafe volunteers to go fight in Britain and Evelyn and Danny get transferred to Pearl Harbor. While Rafe is off fighting everything gets completely whack and next thing you know everybody is in the middle of an air raid we now know as "Pearl Harbor."
Title: Titanic, Plot: The plot focuses on the romances of two couples upon the doomed ship's maiden voyage. Isabella Paradine (Catherine Zeta-Jones) is a wealthy woman mourning the loss of her aunt, who reignites a romance with former flame Wynn Park (Peter Gallagher). Meanwhile, a charming ne'er-do-well named Jamie Perse (Mike Doyle) steals a ticket for the ship, and falls for a sweet innocent Irish girl on board. But their romance is threatened by the villainous Simon Doonan (Tim Curry), who has discovered about the ticket and makes Jamie his unwilling accomplice, as well as having sinister plans for the girl.
Title: China Girl, Plot: A modern day Romeo & Juliet story is told in New York when an Italian boy and a Chinese girl become lovers, causing a tragic conflict between ethnic gangs.
.

Based on the search results, the best romantic movie to watch is **Shut Up and Kiss Me!** because it is a romantic comedy that explores the complexities of love and relationships. The movie is funny, heartwarming, and thought-provoking.
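
The decoded output above repeats the prompt before the answer, because a causal language model returns the prompt tokens followed by the newly generated ones. One way to show only the answer is to slice off the prompt portion before decoding. The following sketch uses plain lists with made-up token IDs to stand in for the tensors; strip_prompt_tokens is a hypothetical helper.

```python
def strip_prompt_tokens(generated_ids, prompt_length):
    """Return only the token IDs produced after the prompt."""
    return generated_ids[prompt_length:]


# Made-up token IDs: the first three stand for the prompt, the rest for the answer.
prompt_ids = [2, 101, 102]
generated_ids = prompt_ids + [201, 202, 203]
print(strip_prompt_tokens(generated_ids, len(prompt_ids)))
```

With the tensors from the code above, the corresponding slice would be response[0][input_ids["input_ids"].shape[1]:], and passing skip_special_tokens=True to tokenizer.decode drops special marker tokens from the printed text.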