开源 AI 食谱文档

使用 Gemma、MongoDB 和开源模型构建 RAG 系统

开源 AI 食谱

加入 Hugging Face 社区

并获得增强的文档体验

在模型、数据集和 Spaces 上进行协作

通过加速推理获得更快的示例

在文档主题之间切换

开始使用

使用 Gemma、MongoDB 和开源模型构建 RAG 系统

作者：Richmond Alake

步骤 1：安装库

下面的 shell 命令序列安装了用于利用开源大型语言模型 (LLM)、嵌入模型和数据库交互功能的库。这些库简化了 RAG 系统的开发，将复杂性降低到少量代码

PyMongo：一个用于与 MongoDB 交互的 Python 库，它启用了连接到集群和查询存储在集合和文档中的数据的功能。
Pandas：提供了一种数据结构，用于使用 Python 进行高效的数据处理和分析
Hugging Face datasets：包含音频、视觉和文本数据集
Hugging Face Accelerate：抽象了编写利用硬件加速器（如 GPU）的代码的复杂性。在该实现中利用 Accelerate 在 GPU 资源上使用 Gemma 模型。
Hugging Face Transformers：访问大量预训练模型
Hugging Face Sentence Transformers：提供对句子、文本和图像嵌入的访问。

!pip install datasets pandas pymongo sentence_transformers
!pip install -U transformers
# Install below if using GPU
!pip install accelerate

步骤 2：数据来源和准备

本教程中使用的数据来源于 Hugging Face datasets，特别是 AIatMongoDB/embedded_movies 数据集。

# Load Dataset
from datasets import load_dataset
import pandas as pd

# https://huggingface.co/datasets/AIatMongoDB/embedded_movies
dataset = load_dataset("AIatMongoDB/embedded_movies")

# Convert the dataset to a pandas dataframe
dataset_df = pd.DataFrame(dataset["train"])

dataset_df.head(5)

以下代码片段中的操作侧重于强制执行数据完整性和质量。

第一个过程确保每个数据点的 fullplot 属性不为空，因为这是我们在嵌入过程中使用的主要数据。
此步骤还确保我们从所有数据点中删除 plot_embedding 属性，因为这将由使用不同嵌入模型 gte-large 创建的新嵌入替换。

>>> # Data Preparation

>>> # Remove data point where plot coloumn is missing
>>> dataset_df = dataset_df.dropna(subset=["fullplot"])
>>> print("\nNumber of missing values in each column after removal:")
>>> print(dataset_df.isnull().sum())

>>> # Remove the plot_embedding from each data point in the dataset as we are going to create new embeddings with an open source embedding model from Hugging Face
>>> dataset_df = dataset_df.drop(columns=["plot_embedding"])
>>> dataset_df.head(5)

Number of missing values in each column after removal:
num_mflix_comments      0
genres                  0
countries               0
directors              12
fullplot                0
writers                13
awards                  0
runtime                14
type                    0
rated                 279
metacritic            893
poster                 78
languages               1
imdb                    0
plot                    0
cast                    1
plot_embedding          1
title                   0
dtype: int64

步骤 3：生成嵌入

代码片段中的步骤如下

导入 SentenceTransformer 类以访问嵌入模型。
使用 SentenceTransformer 构造函数加载嵌入模型，以实例化 gte-large 嵌入模型。
定义 get_embedding 函数，该函数接受一个文本字符串作为输入，并返回一个表示嵌入的浮点数列表。该函数首先检查输入文本是否为空（去除空格后）。如果文本为空，则返回一个空列表。否则，它使用加载的模型生成嵌入。
通过将 get_embedding 函数应用于 dataset_df DataFrame 的“fullplot”列来生成嵌入，从而为每个电影的情节生成嵌入。生成的嵌入列表被分配给一个名为 embedding 的新列。

注意：没有必要对完整情节中的文本进行分块，因为我们可以确保文本长度保持在可管理的范围内。

from sentence_transformers import SentenceTransformer

# https://huggingface.co/thenlper/gte-large
embedding_model = SentenceTransformer("thenlper/gte-large")


def get_embedding(text: str) -> list[float]:
    if not text.strip():
        print("Attempted to get embedding for empty text.")
        return []

    embedding = embedding_model.encode(text)

    return embedding.tolist()


dataset_df["embedding"] = dataset_df["fullplot"].apply(get_embedding)

dataset_df.head()

步骤 4：数据库设置和连接

MongoDB 既充当操作数据库，又充当向量数据库。它提供了一种数据库解决方案，可以高效地存储、查询和检索向量嵌入——其优势在于数据库维护、管理和成本的简单性。

要创建新的 MongoDB 数据库，请设置数据库集群

前往 MongoDB 官方网站并注册一个免费 MongoDB Atlas 帐户，或者对于现有用户，登录 MongoDB Atlas。
在左侧窗格中选择“数据库”选项，这将导航到“数据库部署”页面，其中包含任何现有集群的部署规范。单击“+创建”按钮创建一个新的数据库集群。
选择数据库集群的所有适用配置。选择所有配置选项后，单击“创建集群”按钮以部署新创建的集群。MongoDB 还允许在“共享选项卡”上创建免费集群。

注意：创建概念验证时，不要忘记将 Python 主机的 IP 或 0.0.0.0/0 列入任何 IP 的白名单。
成功创建和部署集群后，可以在“数据库部署”页面上访问该集群。
单击集群的“连接”按钮以查看通过各种语言驱动程序设置与集群连接的选项。
本教程仅需要集群的 URI（唯一资源标识符）。获取 URI 并将其复制到名为 MONGO_URI 的 Google Colabs Secrets 环境中，或将其放置在 .env 文件或等效文件中。

4.1 数据库和集合设置

在继续之前，请确保满足以下先决条件

在 MongoDB Atlas 上设置数据库集群
获取集群的 URI

如需数据库集群设置和获取 URI 的帮助，请参阅我们的设置 MongoDB 集群和获取连接字符串指南

创建集群后，通过单击集群概览页面中的 + 创建数据库，在 MongoDB Atlas 集群中创建数据库和集合。

这是创建数据库和集合的指南

数据库将命名为 movies。

集合将命名为 movie_collection_2。

步骤 5：创建向量搜索索引

此时，请确保通过 MongoDB Atlas 创建了向量索引。

下一步对于基于 movie_collection_2 集合中文档内存储的向量嵌入执行高效且准确的基于向量的搜索至关重要。

创建向量搜索索引能够有效地遍历文档，以检索具有与基于向量相似度查询嵌入匹配的嵌入的文档。

点击此处阅读更多关于 MongoDB 向量搜索索引的信息。

{
 "fields": [{
     "numDimensions": 1024,
     "path": "embedding",
     "similarity": "cosine",
     "type": "vector"
   }]
}

numDimension 字段的 1024 值对应于 gte-large 嵌入模型生成的向量的维度。如果您使用 gte-base 或 gte-small 嵌入模型，则向量搜索索引中的 numDimension 值必须分别设置为 768 和 384。

步骤 6：建立数据连接

下面的代码片段还利用 PyMongo 创建一个 MongoDB 客户端对象，表示与集群的连接，并允许访问其数据库和集合。

>>> import pymongo
>>> from google.colab import userdata


>>> def get_mongo_client(mongo_uri):
...     """Establish connection to the MongoDB."""
...     try:
...         client = pymongo.MongoClient(mongo_uri)
...         print("Connection to MongoDB successful")
...         return client
...     except pymongo.errors.ConnectionFailure as e:
...         print(f"Connection failed: {e}")
...         return None


... mongo_uri = userdata.get("MONGO_URI")
... if not mongo_uri:
...     print("MONGO_URI not set in environment variables")

... mongo_client = get_mongo_client(mongo_uri)

... # Ingest data into MongoDB
... db = mongo_client["movies"]
... collection = db["movie_collection_2"]

Connection to MongoDB successful

# Delete any existing records in the collection
collection.delete_many({})

将数据从 pandas DataFrame 摄取到 MongoDB 集合中是一个简单的过程，可以通过将 DataFrame 转换为字典，然后利用集合上的 insert_many 方法传递转换后的数据集记录来高效地完成。

>>> documents = dataset_df.to_dict("records")
>>> collection.insert_many(documents)

>>> print("Data ingestion into MongoDB completed")

Data ingestion into MongoDB completed

步骤 7：对用户查询执行向量搜索

以下步骤实现了一个函数，该函数通过生成查询嵌入和定义 MongoDB 聚合管道来返回向量搜索结果。

该管道由 $vectorSearch 和 $project 阶段组成，使用生成的向量执行查询并格式化结果，使其仅包含所需的信息，例如情节、标题和类型，同时为每个结果合并搜索分数。

def vector_search(user_query, collection):
    """
    Perform a vector search in the MongoDB collection based on the user query.

    Args:
    user_query (str): The user's query string.
    collection (MongoCollection): The MongoDB collection to search.

    Returns:
    list: A list of matching documents.
    """

    # Generate embedding for the user query
    query_embedding = get_embedding(user_query)

    if query_embedding is None:
        return "Invalid query or embedding generation failed."

    # Define the vector search pipeline
    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",
                "queryVector": query_embedding,
                "path": "embedding",
                "numCandidates": 150,  # Number of candidate matches to consider
                "limit": 4,  # Return top 4 matches
            }
        },
        {
            "$project": {
                "_id": 0,  # Exclude the _id field
                "fullplot": 1,  # Include the plot field
                "title": 1,  # Include the title field
                "genres": 1,  # Include the genres field
                "score": {"$meta": "vectorSearchScore"},  # Include the search score
            }
        },
    ]

    # Execute the search
    results = collection.aggregate(pipeline)
    return list(results)

步骤 8：处理用户查询和加载 Gemma

def get_search_result(query, collection):

    get_knowledge = vector_search(query, collection)

    search_result = ""
    for result in get_knowledge:
        search_result += f"Title: {result.get('title', 'N/A')}, Plot: {result.get('fullplot', 'N/A')}\n"

    return search_result

>>> # Conduct query with retrival of sources
>>> query = "What is the best romantic movie to watch and why?"
>>> source_information = get_search_result(query, collection)
>>> combined_information = (
...     f"Query: {query}\nContinue to answer the query by using the Search Results:\n{source_information}."
... )

>>> print(combined_information)

Query: What is the best romantic movie to watch and why?
Continue to answer the query by using the Search Results:
Title: Shut Up and Kiss Me!, Plot: Ryan and Pete are 27-year old best friends in Miami, born on the same day and each searching for the perfect woman. Ryan is a rookie stockbroker living with his psychic Mom. Pete is a slick surfer dude yet to find commitment. Each meets the women of their dreams on the same day. Ryan knocks heads in an elevator with the gorgeous Jessica, passing out before getting her number. Pete falls for the insatiable Tiara, but Tiara's uncle is mob boss Vincent Bublione, charged with her protection. This high-energy romantic comedy asks to what extent will you go for true love?
Title: Pearl Harbor, Plot: Pearl Harbor is a classic tale of romance set during a war that complicates everything. It all starts when childhood friends Rafe and Danny become Army Air Corps pilots and meet Evelyn, a Navy nurse. Rafe falls head over heels and next thing you know Evelyn and Rafe are hooking up. Then Rafe volunteers to go fight in Britain and Evelyn and Danny get transferred to Pearl Harbor. While Rafe is off fighting everything gets completely whack and next thing you know everybody is in the middle of an air raid we now know as "Pearl Harbor."
Title: Titanic, Plot: The plot focuses on the romances of two couples upon the doomed ship's maiden voyage. Isabella Paradine (Catherine Zeta-Jones) is a wealthy woman mourning the loss of her aunt, who reignites a romance with former flame Wynn Park (Peter Gallagher). Meanwhile, a charming ne'er-do-well named Jamie Perse (Mike Doyle) steals a ticket for the ship, and falls for a sweet innocent Irish girl on board. But their romance is threatened by the villainous Simon Doonan (Tim Curry), who has discovered about the ticket and makes Jamie his unwilling accomplice, as well as having sinister plans for the girl.
Title: China Girl, Plot: A modern day Romeo & Juliet story is told in New York when an Italian boy and a Chinese girl become lovers, causing a tragic conflict between ethnic gangs.
.

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")
# CPU Enabled uncomment below 👇🏽
# model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it")
# GPU Enabled use below 👇🏽
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it", device_map="auto")

>>> # Moving tensors to GPU
>>> input_ids = tokenizer(combined_information, return_tensors="pt").to("cuda")
>>> response = model.generate(**input_ids, max_new_tokens=500)
>>> print(tokenizer.decode(response[0]))

Based on the search results, the best romantic movie to watch is **Shut Up and Kiss Me!** because it is a romantic comedy that explores the complexities of love and relationships. The movie is funny, heartwarming, and thought-provoking.

< > 在 GitHub 上更新

←使用 Gemma、Elasticsearch 和开源模型构建 RAG 系统使用 Hugging Face Zephyr 和 LangChain 的简单 RAG→