Building a RAG System with Gemma, MongoDB and Open Source Models

Author: Richmond Alake

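Step 1: Installing Libraries

The shell commands below install the libraries used to work with open-source large language models (LLMs), embedding models, and the database. These libraries simplify the development of a RAG system, reducing it to a small amount of code.

  • PyMongo: A Python library for interacting with MongoDB. It provides functions to connect to a cluster and query the data stored in collections and documents.
  • Pandas: Provides data structures for efficient data processing and analysis in Python.
  • Hugging Face datasets: A library for loading audio, vision, and text datasets.
  • Hugging Face Accelerate: Abstracts the complexity of writing code that uses hardware accelerators such as GPUs. This tutorial uses Accelerate to run the Gemma model on GPU resources.
  • Hugging Face Transformers: Provides access to a large collection of pre-trained models.
  • Hugging Face Sentence Transformers: Provides access to sentence, text, and image embeddings.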
!pip install datasets pandas pymongo sentence_transformers
!pip install -U transformers
# Install below if using GPU
!pip install accelerate
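
Before moving on, it can help to confirm that the libraries listed above are importable in the current environment. The following is a minimal sketch using only the standard library; the `check_installed` helper and the `REQUIRED_MODULES` list are hypothetical names introduced here for illustration, and the import names are assumed to match the pip packages installed above.

```python
from importlib.util import find_spec

# Import names of the libraries installed above
# (assumption: each pip package exposes a module of the same name).
REQUIRED_MODULES = [
    "datasets",
    "pandas",
    "pymongo",
    "sentence_transformers",
    "transformers",
    "accelerate",
]


def check_installed(modules=REQUIRED_MODULES):
    """Return a mapping of module name -> True if the module can be found."""
    return {name: find_spec(name) is not None for name in modules}


# Print a short report; nothing is imported, only looked up.
for name, ok in check_installed().items():
    print(f"{name}: {'installed' if ok else 'MISSING'}")
```

On a CPU-only setup, `accelerate` may legitimately show as missing, since the install step treats it as optional.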

Step 2: Data Sourcing and Preparation

The data used in this tutorial comes from Hugging Face datasets, specifically the AIatMongoDB/embedded_movies dataset.

# Load Dataset
from datasets import load_dataset
import pandas as pd

# https://huggingface.co/datasets/AIatMongoDB/embedded_movies
dataset = load_dataset("AIatMongoDB/embedded_movies")

# Convert the dataset to a pandas dataframe
dataset_df = pd.DataFrame(dataset["train"])

dataset_df.head(5)

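The operations in the following code snippet focus on enforcing data integrity and quality.

  1. The first step ensures that the fullplot attribute of every data point is non-empty, since this is the primary data used in the embedding process.
  2. The second step removes the plot_embedding attribute from all data points, because it will be replaced by new embeddings created with a different embedding model, gte-large.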
>>> # Data Preparation

>>> # Remove data points where the fullplot column is missing
>>> dataset_df = dataset_df.dropna(subset=["fullplot"])
>>> print("\nNumber of missing values in each column after removal:")
>>> print(dataset_df.isnull().sum())

>>> # Remove the plot_embedding from each data point in the dataset as we are going to create new embeddings with an open source embedding model from Hugging Face
>>> dataset_df = dataset_df.drop(columns=["plot_embedding"])
>>> dataset_df.head(5)
Number of missing values in each column after removal:
num_mflix_comments      0
genres                  0
countries               0
directors              12
fullplot                0
writers                13
awards                  0
runtime                14
type                    0
rated                 279
metacritic            893
poster                 78
languages               1
imdb                    0
plot                    0
cast                    1
plot_embedding          1
title                   0
dtype: int64

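Step 3: Generating Embeddings

The steps in the code snippet are as follows:

  1. Import the SentenceTransformer class to access the embedding models.
  2. Load the embedding model by passing its name to the SentenceTransformer constructor, which instantiates the gte-large embedding model.
  3. Define the get_embedding function, which takes a text string as input and returns a list of floats representing the embedding. The function first checks whether the input text is non-empty after stripping whitespace. If the text is empty, it returns an empty list. Otherwise, it generates an embedding using the loaded model.
  4. Generate an embedding for each movie plot by applying the get_embedding function to the "fullplot" column of the dataset_df DataFrame. The resulting list of embeddings is assigned to a new column named embedding.

Note: There is no need to chunk the text of the full plots, because the plot texts in this dataset are short enough to embed as a whole.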

from sentence_transformers import SentenceTransformer

# https://huggingface.co/thenlper/gte-large
embedding_model = SentenceTransformer("thenlper/gte-large")


def get_embedding(text: str) -> list[float]:
    if not text.strip():
        print("Attempted to get embedding for empty text.")
        return []

    embedding = embedding_model.encode(text)

    return embedding.tolist()


dataset_df["embedding"] = dataset_df["fullplot"].apply(get_embedding)

dataset_df.head()
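
The note above says chunking is unnecessary for this dataset. For longer texts, embedding models typically truncate input beyond a fixed length, so splitting the text first may be worthwhile. The following is a minimal sketch of a word-window chunker, not part of the original tutorial; `chunk_text`, `max_words`, and `overlap` are hypothetical names, and counting words is only a rough stand-in for the model's token count.

```python
def chunk_text(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split text into overlapping windows of at most max_words words.

    Assumption: word count is used as a rough proxy for token count.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be in the range [0, max_words)")

    words = text.split()
    step = max_words - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current window has reached the end of the text
        if start + max_words >= len(words):
            break
    return chunks


print(chunk_text("a b c d e", max_words=3, overlap=1))
```

Each chunk could then be passed to get_embedding and stored as its own document, at the cost of more records per movie.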

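Step 4: Database Setup and Connection

MongoDB acts as both an operational database and a vector database. It offers a database solution that efficiently stores, queries, and retrieves vector embeddings, with the advantage of simpler database maintenance, management, and cost.

To create a new MongoDB database, set up a database cluster.

  1. Go to the official MongoDB website and register for a free MongoDB Atlas account, or, for existing users, sign in to MongoDB Atlas.

  2. Select the "Database" option in the left-hand pane. This navigates to the Database Deployments page, which shows the deployment specification of any existing cluster. Create a new database cluster by clicking the "+Create" button.

  3. Select all the applicable configurations for the database cluster. Once all the configuration options are selected, click the "Create Cluster" button to deploy the newly created cluster. MongoDB also allows the creation of free clusters on the "Shared Tab".

    Note: When building a proof of concept, don't forget to whitelist the IP address of the Python host, or use 0.0.0.0/0 to allow access from any IP address.

  4. After the cluster is successfully created and deployed, it becomes accessible on the "Database Deployment" page.

  5. Click the cluster's "Connect" button to view the options for setting up a connection to the cluster through various language drivers.

  6. This tutorial only requires the cluster's URI (uniform resource identifier). Copy the URI into your Google Colab Secrets under the name MONGO_URI, or place it in a .env file or equivalent.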
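
Outside Google Colab, the same URI can be read from an environment variable. The following is a minimal sketch using only the standard library; `load_mongo_uri` is a hypothetical helper, and it assumes MONGO_URI has already been exported in the shell or loaded from a .env file by a tool of your choice.

```python
import os


def load_mongo_uri(env=None, var_name: str = "MONGO_URI") -> str:
    """Read the MongoDB connection string from an environment mapping.

    Raises RuntimeError when the variable is missing or blank, so a
    misconfiguration is caught before any connection attempt.
    """
    env = os.environ if env is None else env
    uri = env.get(var_name, "").strip()
    if not uri:
        raise RuntimeError(f"{var_name} is not set")
    return uri


# Usage (not executed here, because MONGO_URI may not be set):
# mongo_uri = load_mongo_uri()
```

Failing early with an exception avoids passing an empty URI on to the client constructor later in the tutorial.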

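4.1 Database and Collection Setup

Before moving forward, ensure the following prerequisites are met:

  • A database cluster is set up on MongoDB Atlas
  • You have obtained the URI to your cluster

For help with setting up a database cluster and obtaining the URI, refer to our guides on setting up a MongoDB cluster and getting your connection string.

Once the cluster is created, create the database and collection within the MongoDB Atlas cluster by clicking + Create Database on the cluster overview page.

A guide to creating a database and collection is available in the MongoDB documentation.

The database will be named movies.

The collection will be named movie_collection_2.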

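Step 5: Create a Vector Search Index

At this point, make sure that your vector index has been created via MongoDB Atlas.

This step is mandatory for running efficient and accurate vector-based searches over the vector embeddings stored in the documents of the movie_collection_2 collection. Name the index vector_index, since the search code in Step 7 refers to it by that name.

A vector search index lets MongoDB traverse the documents efficiently and retrieve those whose embeddings are most similar to the query embedding.

See the MongoDB documentation to read more about vector search indexes.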

{
 "fields": [{
     "numDimensions": 1024,
     "path": "embedding",
     "similarity": "cosine",
     "type": "vector"
   }]
}

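The value 1024 in the numDimensions field corresponds to the dimension of the vectors generated by the gte-large embedding model. If you use the gte-base or gte-small embedding models, the numDimensions value in the vector search index must be set to 768 and 384, respectively.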
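
The relationship between the embedding model and numDimensions can be captured in a small helper that builds the index definition shown above. This is a sketch only; `build_vector_index_definition` and `EMBEDDING_DIMENSIONS` are hypothetical names, and the dimensions are the ones stated in the text. The output is plain JSON that can be pasted into the Atlas index editor.

```python
import json

# Vector dimensions stated in the text for each embedding model.
EMBEDDING_DIMENSIONS = {"gte-large": 1024, "gte-base": 768, "gte-small": 384}


def build_vector_index_definition(
    model_name: str, path: str = "embedding", similarity: str = "cosine"
) -> dict:
    """Build a vector search index definition matching the given model."""
    if model_name not in EMBEDDING_DIMENSIONS:
        raise ValueError(f"Unknown embedding model: {model_name}")
    return {
        "fields": [
            {
                "numDimensions": EMBEDDING_DIMENSIONS[model_name],
                "path": path,
                "similarity": similarity,
                "type": "vector",
            }
        ]
    }


print(json.dumps(build_vector_index_definition("gte-large"), indent=2))
```

Keeping the mapping in one place makes it harder to switch embedding models and forget to update the index dimension.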

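Step 6: Establish Data Connection

The code snippet below uses PyMongo to create a MongoDB client object, which represents the connection to the cluster and gives access to its databases and collections.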

>>> import pymongo
>>> from google.colab import userdata


>>> def get_mongo_client(mongo_uri):
...     """Establish connection to the MongoDB."""
...     try:
...         client = pymongo.MongoClient(mongo_uri)
...         print("Connection to MongoDB successful")
...         return client
...     except pymongo.errors.ConnectionFailure as e:
...         print(f"Connection failed: {e}")
...         return None


>>> mongo_uri = userdata.get("MONGO_URI")
>>> if not mongo_uri:
...     print("MONGO_URI not set in environment variables")

>>> mongo_client = get_mongo_client(mongo_uri)

>>> # Ingest data into MongoDB
>>> db = mongo_client["movies"]
>>> collection = db["movie_collection_2"]
Connection to MongoDB successful
# Delete any existing records in the collection
collection.delete_many({})

Ingesting data from a pandas DataFrame into a MongoDB collection is straightforward: convert the DataFrame into a list of dictionaries, then pass those records to the collection's insert_many method.

>>> documents = dataset_df.to_dict("records")
>>> collection.insert_many(documents)

>>> print("Data ingestion into MongoDB completed")
Data ingestion into MongoDB completed
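
For datasets much larger than this one, a single insert_many call with every record may be unwieldy. One option is to insert in batches. The following is a minimal sketch, not part of the original tutorial; `batched` is a hypothetical helper and the batch size of 500 is an arbitrary assumption.

```python
from typing import Iterable, Iterator


def batched(records: Iterable[dict], batch_size: int = 500) -> Iterator[list[dict]]:
    """Yield successive lists of at most batch_size records."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch


# Toy demonstration with made-up records
toy_documents = [{"i": i} for i in range(5)]
print([len(b) for b in batched(toy_documents, batch_size=2)])

# With the tutorial's objects this would look like (not executed here):
# for batch in batched(documents, batch_size=500):
#     collection.insert_many(batch)
```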

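Step 7: Perform Vector Search on User Queries

The next step implements a function that returns vector search results by generating a query embedding and defining a MongoDB aggregation pipeline.

The pipeline consists of the $vectorSearch and $project stages. It runs the query using the generated vector and formats the results to include only the required information, such as the plot, title, and genres, along with a search score for each result.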

def vector_search(user_query, collection):
    """
    Perform a vector search in the MongoDB collection based on the user query.

    Args:
    user_query (str): The user's query string.
    collection (MongoCollection): The MongoDB collection to search.

    Returns:
    list: A list of matching documents.
    """

    # Generate embedding for the user query
    query_embedding = get_embedding(user_query)

    # get_embedding returns an empty list for blank input, so check for emptiness
    if not query_embedding:
        return []

    # Define the vector search pipeline
    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",
                "queryVector": query_embedding,
                "path": "embedding",
                "numCandidates": 150,  # Number of candidate matches to consider
                "limit": 4,  # Return top 4 matches
            }
        },
        {
            "$project": {
                "_id": 0,  # Exclude the _id field
                "fullplot": 1,  # Include the fullplot field
                "title": 1,  # Include the title field
                "genres": 1,  # Include the genres field
                "score": {"$meta": "vectorSearchScore"},  # Include the search score
            }
        },
    ]

    # Execute the search
    results = collection.aggregate(pipeline)
    return list(results)
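
To build intuition for what the $vectorSearch stage computes with the cosine similarity configured in the index, the following sketch ranks a few toy documents by cosine similarity in plain Python. It is an illustration only: it scans every document, whereas Atlas searches an index over numCandidates candidates, and the vectorSearchScore reported by Atlas may be a normalized form of the raw cosine value, so the numbers need not match. The function names and toy vectors are made up for this example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def brute_force_search(query_vector, documents, limit=4):
    """Score every document against the query and return the top matches."""
    scored = [
        {"title": doc["title"], "score": cosine_similarity(query_vector, doc["embedding"])}
        for doc in documents
    ]
    return sorted(scored, key=lambda d: d["score"], reverse=True)[:limit]


# Made-up two-dimensional embeddings for illustration
toy_docs = [
    {"title": "A", "embedding": [1.0, 0.0]},
    {"title": "B", "embedding": [0.0, 1.0]},
    {"title": "C", "embedding": [1.0, 1.0]},
]
for hit in brute_force_search([1.0, 0.1], toy_docs, limit=2):
    print(hit["title"], round(hit["score"], 3))
```

The query vector points almost along the first axis, so document A ranks first and C second, while B is cut off by the limit.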

Step 8: Handling User Queries and Loading Gemma

def get_search_result(query, collection):

    get_knowledge = vector_search(query, collection)

    search_result = ""
    for result in get_knowledge:
        search_result += f"Title: {result.get('title', 'N/A')}, Plot: {result.get('fullplot', 'N/A')}\n"

    return search_result
>>> # Conduct query with retrieval of sources
>>> query = "What is the best romantic movie to watch and why?"
>>> source_information = get_search_result(query, collection)
>>> combined_information = (
...     f"Query: {query}\nContinue to answer the query by using the Search Results:\n{source_information}."
... )

>>> print(combined_information)
Query: What is the best romantic movie to watch and why?
Continue to answer the query by using the Search Results:
Title: Shut Up and Kiss Me!, Plot: Ryan and Pete are 27-year old best friends in Miami, born on the same day and each searching for the perfect woman. Ryan is a rookie stockbroker living with his psychic Mom. Pete is a slick surfer dude yet to find commitment. Each meets the women of their dreams on the same day. Ryan knocks heads in an elevator with the gorgeous Jessica, passing out before getting her number. Pete falls for the insatiable Tiara, but Tiara's uncle is mob boss Vincent Bublione, charged with her protection. This high-energy romantic comedy asks to what extent will you go for true love?
Title: Pearl Harbor, Plot: Pearl Harbor is a classic tale of romance set during a war that complicates everything. It all starts when childhood friends Rafe and Danny become Army Air Corps pilots and meet Evelyn, a Navy nurse. Rafe falls head over heels and next thing you know Evelyn and Rafe are hooking up. Then Rafe volunteers to go fight in Britain and Evelyn and Danny get transferred to Pearl Harbor. While Rafe is off fighting everything gets completely whack and next thing you know everybody is in the middle of an air raid we now know as "Pearl Harbor."
Title: Titanic, Plot: The plot focuses on the romances of two couples upon the doomed ship's maiden voyage. Isabella Paradine (Catherine Zeta-Jones) is a wealthy woman mourning the loss of her aunt, who reignites a romance with former flame Wynn Park (Peter Gallagher). Meanwhile, a charming ne'er-do-well named Jamie Perse (Mike Doyle) steals a ticket for the ship, and falls for a sweet innocent Irish girl on board. But their romance is threatened by the villainous Simon Doonan (Tim Curry), who has discovered about the ticket and makes Jamie his unwilling accomplice, as well as having sinister plans for the girl.
Title: China Girl, Plot: A modern day Romeo & Juliet story is told in New York when an Italian boy and a Chinese girl become lovers, causing a tragic conflict between ethnic gangs.
.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")
# CPU Enabled uncomment below 👇🏽
# model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it")
# GPU Enabled use below 👇🏽
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it", device_map="auto")
>>> # Move the input tensors to the same device as the model
>>> input_ids = tokenizer(combined_information, return_tensors="pt").to(model.device)
>>> response = model.generate(**input_ids, max_new_tokens=500)
>>> print(tokenizer.decode(response[0]))
Query: What is the best romantic movie to watch and why?
Continue to answer the query by using the Search Results:
Title: Shut Up and Kiss Me!, Plot: Ryan and Pete are 27-year old best friends in Miami, born on the same day and each searching for the perfect woman. Ryan is a rookie stockbroker living with his psychic Mom. Pete is a slick surfer dude yet to find commitment. Each meets the women of their dreams on the same day. Ryan knocks heads in an elevator with the gorgeous Jessica, passing out before getting her number. Pete falls for the insatiable Tiara, but Tiara's uncle is mob boss Vincent Bublione, charged with her protection. This high-energy romantic comedy asks to what extent will you go for true love?
Title: Pearl Harbor, Plot: Pearl Harbor is a classic tale of romance set during a war that complicates everything. It all starts when childhood friends Rafe and Danny become Army Air Corps pilots and meet Evelyn, a Navy nurse. Rafe falls head over heels and next thing you know Evelyn and Rafe are hooking up. Then Rafe volunteers to go fight in Britain and Evelyn and Danny get transferred to Pearl Harbor. While Rafe is off fighting everything gets completely whack and next thing you know everybody is in the middle of an air raid we now know as "Pearl Harbor."
Title: Titanic, Plot: The plot focuses on the romances of two couples upon the doomed ship's maiden voyage. Isabella Paradine (Catherine Zeta-Jones) is a wealthy woman mourning the loss of her aunt, who reignites a romance with former flame Wynn Park (Peter Gallagher). Meanwhile, a charming ne'er-do-well named Jamie Perse (Mike Doyle) steals a ticket for the ship, and falls for a sweet innocent Irish girl on board. But their romance is threatened by the villainous Simon Doonan (Tim Curry), who has discovered about the ticket and makes Jamie his unwilling accomplice, as well as having sinister plans for the girl.
Title: China Girl, Plot: A modern day Romeo & Juliet story is told in New York when an Italian boy and a Chinese girl become lovers, causing a tragic conflict between ethnic gangs.
.

Based on the search results, the best romantic movie to watch is **Shut Up and Kiss Me!** because it is a romantic comedy that explores the complexities of love and relationships. The movie is funny, heartwarming, and thought-provoking.
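
The decoded output above repeats the whole prompt because generate returns the prompt tokens followed by the newly generated tokens. To show only the answer, the prompt portion can be sliced off before decoding. The sketch below shows the idea on plain lists with made-up token IDs; `new_token_ids` is a hypothetical helper, and the commented lines show how it might map onto the tutorial's objects.

```python
def new_token_ids(output_ids: list[int], prompt_length: int) -> list[int]:
    """Return only the tokens generated after the prompt."""
    return output_ids[prompt_length:]


# Toy example with made-up token IDs:
# three prompt tokens followed by two generated tokens.
prompt_ids = [2, 101, 102]
output_ids = prompt_ids + [201, 202]
print(new_token_ids(output_ids, len(prompt_ids)))

# With the tutorial's objects this would look like (not executed here):
# prompt_length = input_ids["input_ids"].shape[-1]
# answer = tokenizer.decode(response[0][prompt_length:], skip_special_tokens=True)
# print(answer)
```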