Semantic reranking with Elasticsearch and Hugging Face

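Authored by: Liam Thompson

In this notebook we will learn how to implement semantic reranking in Elasticsearch by uploading a model from Hugging Face into an Elasticsearch cluster. We will use the `retriever` abstraction, a simpler Elasticsearch syntax for crafting queries and combining different search operations.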

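You will:

  • Choose a cross-encoder model from Hugging Face to perform semantic reranking
  • Upload the model to your Elasticsearch deployment using Eland, a Python client for machine learning with Elasticsearch
  • Create an inference endpoint to manage your `rerank` task
  • Query your data using the `text_similarity_reranker` retriever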

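🧰 Requirements

For this example, you will need:

  • An Elastic deployment on version 8.15.0 or above (for non-serverless deployments)
    • We'll be using Elastic Cloud for this example (available with a free trial).
    • See our other deployment options.
  • Your deployment's Cloud ID and an API key. The comments in the client setup code below link to instructions for finding both.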

Install and import packages

ℹ️ The `eland` installation will take a couple of minutes.

!pip install -qU elasticsearch
!pip install eland[pytorch]
from elasticsearch import Elasticsearch, helpers

Initialize the Elasticsearch Python client

First, you need to connect to your Elasticsearch instance.

>>> from getpass import getpass

>>> # https://elastic.ac.cn/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id
>>> ELASTIC_CLOUD_ID = getpass("Elastic Cloud ID: ")

>>> # https://elastic.ac.cn/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key
>>> ELASTIC_API_KEY = getpass("Elastic Api Key: ")

>>> # Create the client instance
>>> client = Elasticsearch(
...     # For local development
...     # hosts=["https://127.0.0.1:9200"]
...     cloud_id=ELASTIC_CLOUD_ID,
...     api_key=ELASTIC_API_KEY,
... )
Elastic Cloud ID: ··········
Elastic Api Key: ··········

Test the connection

Run this test to confirm that the Python client has connected to your Elasticsearch instance.

print(client.info())
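
The version requirement listed earlier can also be checked programmatically. The sketch below is not part of the original notebook: the helper names `parse_version` and `meets_minimum_version` are hypothetical, and it assumes the version string is read from `client.info()["version"]["number"]`. On serverless projects the reported number may not be meaningful, so treat this as an optional sanity check.

```python
import re

# Minimum version stated in the requirements for non-serverless deployments.
MINIMUM_VERSION = (8, 15, 0)


def parse_version(version_number: str) -> tuple:
    """Turn a string such as '8.15.0' or '8.16.0-SNAPSHOT' into (8, 15, 0)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version_number)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version_number!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum_version(version_number: str, minimum=MINIMUM_VERSION) -> bool:
    """Return True if the parsed version is at least `minimum`."""
    return parse_version(version_number) >= minimum


# Usage against a live cluster (not executed here):
# info = client.info()
# print(meets_minimum_version(info["version"]["number"]))
```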

This example uses a small dataset of movies.

>>> from urllib.request import urlopen
>>> import json
>>> import time

>>> url = "https://huggingface.co/datasets/leemthompo/small-movies/raw/main/small-movies.json"
>>> response = urlopen(url)

>>> # Load the response data into a JSON object
>>> data_json = json.loads(response.read())

>>> # Prepare the documents to be indexed
>>> documents = []
>>> for doc in data_json:
...     documents.append(
...         {
...             "_index": "movies",
...             "_source": doc,
...         }
...     )

>>> # Use helpers.bulk to index
>>> helpers.bulk(client, documents)

>>> print("Done indexing documents into `movies` index!")
>>> time.sleep(3)
Done indexing documents into `movies` index!
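
For larger datasets, the list of bulk actions can be replaced with a generator so that documents are not all held in memory twice. This is a minimal sketch, not the notebook's own code: `build_actions` is a hypothetical helper, and the one-document `sample` list stands in for the `data_json` loaded above.

```python
def build_actions(docs, index_name="movies"):
    """Yield one bulk action per document, in the shape helpers.bulk expects."""
    for doc in docs:
        yield {"_index": index_name, "_source": doc}


# Placeholder input for illustration; in the notebook this would be data_json.
sample = [
    {
        "title": "The Godfather",
        "plot": "An organized crime dynasty's aging patriarch transfers control "
        "of his clandestine empire to his reluctant son.",
    }
]
actions = list(build_actions(sample))

# With a live client, helpers.bulk accepts the generator directly:
# success_count, errors = helpers.bulk(client, build_actions(data_json))
```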

Upload a Hugging Face model using Eland

Now we'll use Eland's `eland_import_hub_model` command to upload the model to Elasticsearch. For this example we've chosen the `cross-encoder/ms-marco-MiniLM-L-6-v2` text similarity model.

>>> !eland_import_hub_model \
...   --cloud-id $ELASTIC_CLOUD_ID \
...   --es-api-key $ELASTIC_API_KEY \
...   --hub-model-id cross-encoder/ms-marco-MiniLM-L-6-v2 \
...   --task-type text_similarity \
...   --clear-previous \
...   --start
2024-08-13 17:04:12,386 INFO : Establishing connection to Elasticsearch
2024-08-13 17:04:12,567 INFO : Connected to serverless cluster 'bd8c004c050e4654ad32fb86ab159889'
2024-08-13 17:04:12,568 INFO : Loading HuggingFace transformer tokenizer and model 'cross-encoder/ms-marco-MiniLM-L-6-v2'
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
tokenizer_config.json: 100% 316/316 [00:00<00:00, 1.81MB/s]
config.json: 100% 794/794 [00:00<00:00, 4.09MB/s]
vocab.txt: 100% 232k/232k [00:00<00:00, 2.37MB/s]
special_tokens_map.json: 100% 112/112 [00:00<00:00, 549kB/s]
pytorch_model.bin: 100% 90.9M/90.9M [00:00<00:00, 135MB/s]
STAGE:2024-08-13 17:04:15 1454:1454 ActivityProfilerController.cpp:312] Completed Stage: Warm Up
STAGE:2024-08-13 17:04:15 1454:1454 ActivityProfilerController.cpp:318] Completed Stage: Collection
STAGE:2024-08-13 17:04:15 1454:1454 ActivityProfilerController.cpp:322] Completed Stage: Post Processing
2024-08-13 17:04:18,789 INFO : Creating model with id 'cross-encoder__ms-marco-minilm-l-6-v2'
2024-08-13 17:04:21,123 INFO : Uploading model definition
100% 87/87 [00:55<00:00,  1.57 parts/s]
2024-08-13 17:05:16,416 INFO : Uploading model vocabulary
2024-08-13 17:05:16,987 INFO : Starting model deployment
2024-08-13 17:05:18,238 INFO : Model successfully imported with id 'cross-encoder__ms-marco-minilm-l-6-v2'
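
Note that the model ID reported in the log differs from the Hugging Face Hub ID: the slash becomes a double underscore and the name is lowercased. The sketch below reproduces that mapping as inferred from the log output above. The function name is hypothetical, and Eland may apply further rules (for example a length limit) that are not modeled here.

```python
def hub_id_to_es_model_id(hub_model_id: str) -> str:
    """Approximate the Elasticsearch model ID seen in the import log.

    Inferred from the log output only; not a copy of Eland's implementation.
    """
    return hub_model_id.replace("/", "__").lower()


print(hub_id_to_es_model_id("cross-encoder/ms-marco-MiniLM-L-6-v2"))
# prints: cross-encoder__ms-marco-minilm-l-6-v2
```

This is the value to pass as `model_id` when creating the inference endpoint.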

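Create the inference endpoint

Next we'll create an inference endpoint for the `rerank` task. The endpoint deploys and manages our model and, if necessary, spins up the required ML resources behind the scenes.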

client.inference.put(
    task_type="rerank",
    inference_id="my-msmarco-minilm-model",
    inference_config={
        "service": "elasticsearch",
        "service_settings": {
            "model_id": "cross-encoder__ms-marco-minilm-l-6-v2",
            "num_allocations": 1,
            "num_threads": 1,
        },
    },
)

Run the following command to confirm your inference endpoint is deployed.

client.inference.get()
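
Because `client.inference.get()` lists every endpoint, it can help to pick out the one just created. The sketch below assumes the response body has the shape `{"endpoints": [{"inference_id": ..., "task_type": ...}, ...]}`; that shape is an assumption and may differ between Elasticsearch versions. `find_endpoint` is a hypothetical helper, and `sample_response` is a hand-written stand-in, not real cluster output.

```python
def find_endpoint(response_body, inference_id):
    """Return the endpoint entry with the given ID, or None if it is absent."""
    # Assumed shape: a top-level "endpoints" list of dicts.
    for endpoint in response_body.get("endpoints", []):
        if endpoint.get("inference_id") == inference_id:
            return endpoint
    return None


# Hand-written stand-in for a response body (illustrative only).
sample_response = {
    "endpoints": [
        {
            "inference_id": "my-msmarco-minilm-model",
            "task_type": "rerank",
            "service": "elasticsearch",
        }
    ]
}

endpoint = find_endpoint(sample_response, "my-msmarco-minilm-model")

# Against a live cluster (not executed here):
# endpoint = find_endpoint(client.inference.get().body, "my-msmarco-minilm-model")
```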

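⚠️ When you deploy your model, you might need to sync your ML saved objects in the Kibana (or Serverless) UI. Go to **Trained Models** and select **Synchronize saved objects**.

Lexical queries

First let's use a `standard` retriever to test out some lexical (or full-text) searches. Then we'll compare the improvements when we layer in semantic reranking.

Lexical match with query_string query

Let's say we vaguely remember a famous movie about a killer who eats his victims. For the sake of argument, pretend we've momentarily forgotten the word "cannibal".

Let's perform a `query_string` query to find the phrase "flesh-eating bad guy" in the `plot` field of our Elasticsearch documents.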

>>> resp = client.search(
...     index="movies",
...     retriever={
...         "standard": {
...             "query": {
...                 "query_string": {
...                     "query": "flesh-eating bad guy",
...                     "default_field": "plot",
...                 }
...             }
...         }
...     },
... )

>>> if resp["hits"]["hits"]:
...     for hit in resp["hits"]["hits"]:
...         title = hit["_source"]["title"]
...         plot = hit["_source"]["plot"]
...         print(f"Title: {title}\nPlot: {plot}\n")
... else:
...     print("No search results found")
No search results found

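No results! Unfortunately, nothing in the index comes close to matching "flesh-eating bad guy". Because we don't know the exact wording used in the Elasticsearch data, we'll need to cast our search net wider.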
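
One way to see why the lexical query came back empty is to compare the query terms with the plot text of the movie we had in mind. The sketch below uses a crude lowercase word splitter as a rough stand-in for Elasticsearch's text analysis (an assumption; the real analyzer behaves differently in its details) and shows that the query and the plot share no terms at all.

```python
import re


def tokenize(text):
    """Very rough approximation of text analysis: lowercase alphanumeric runs."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


query = "flesh-eating bad guy"
plot = (
    "A young F.B.I. cadet must receive the help of an incarcerated and "
    "manipulative cannibal killer to help catch another serial killer, "
    "a madman who skins his victims."
)

# Terms that appear in both the query and the plot.
shared_terms = tokenize(query) & tokenize(plot)
print(shared_terms)  # prints: set()
```

With no overlapping terms, a purely lexical query has nothing to match on, even though the plot describes exactly the character we are thinking of.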

Simple multi_match query

This lexical query performs a standard keyword search for the term "crime" within the `plot` and `genre` fields of our Elasticsearch documents.

>>> resp = client.search(
...     index="movies",
...     retriever={"standard": {"query": {"multi_match": {"query": "crime", "fields": ["plot", "genre"]}}}},
... )

>>> for hit in resp["hits"]["hits"]:
...     title = hit["_source"]["title"]
...     plot = hit["_source"]["plot"]
...     print(f"Title: {title}\nPlot: {plot}\n")
Title: The Godfather
Plot: An organized crime dynasty's aging patriarch transfers control of his clandestine empire to his reluctant son.

Title: Goodfellas
Plot: The story of Henry Hill and his life in the mob, covering his relationship with his wife Karen Hill and his mob partners Jimmy Conway and Tommy DeVito in the Italian-American crime syndicate.

Title: The Silence of the Lambs
Plot: A young F.B.I. cadet must receive the help of an incarcerated and manipulative cannibal killer to help catch another serial killer, a madman who skins his victims.

Title: Pulp Fiction
Plot: The lives of two mob hitmen, a boxer, a gangster and his wife, and a pair of diner bandits intertwine in four tales of violence and redemption.

Title: Se7en
Plot: Two detectives, a rookie and a veteran, hunt a serial killer who uses the seven deadly sins as his motives.

Title: The Departed
Plot: An undercover cop and a mole in the police attempt to identify each other while infiltrating an Irish gang in South Boston.

Title: The Usual Suspects
Plot: A sole survivor tells of the twisty events leading up to a horrific gun battle on a boat, which began when five criminals met at a seemingly random police lineup.

Title: The Dark Knight
Plot: When the menace known as the Joker wreaks havoc and chaos on the people of Gotham, Batman must accept one of the greatest psychological and physical tests of his ability to fight injustice.

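That's better! At least we've got some results now. We broadened our search criteria to increase the chances of finding relevant results.

But these results aren't very precise in the context of our original query "flesh-eating bad guy". With this generic `multi_match` query, "The Silence of the Lambs" is returned only third among the eight results. Let's see if we can use our semantic reranking model to get closer to the searcher's original intent.

Semantic reranker

In the following `retriever` syntax, we wrap our standard query retriever in a `text_similarity_reranker`. This allows us to leverage the NLP model we deployed to Elasticsearch to rerank the results based on the phrase "flesh-eating bad guy".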

>>> resp = client.search(
...     index="movies",
...     retriever={
...         "text_similarity_reranker": {
...             "retriever": {"standard": {"query": {"multi_match": {"query": "crime", "fields": ["plot", "genre"]}}}},
...             "field": "plot",
...             "inference_id": "my-msmarco-minilm-model",
...             "inference_text": "flesh-eating bad guy",
...         }
...     },
... )

>>> for hit in resp["hits"]["hits"]:
...     title = hit["_source"]["title"]
...     plot = hit["_source"]["plot"]
...     print(f"Title: {title}\nPlot: {plot}\n")
Title: The Silence of the Lambs
Plot: A young F.B.I. cadet must receive the help of an incarcerated and manipulative cannibal killer to help catch another serial killer, a madman who skins his victims.

Title: Pulp Fiction
Plot: The lives of two mob hitmen, a boxer, a gangster and his wife, and a pair of diner bandits intertwine in four tales of violence and redemption.

Title: Se7en
Plot: Two detectives, a rookie and a veteran, hunt a serial killer who uses the seven deadly sins as his motives.

Title: Goodfellas
Plot: The story of Henry Hill and his life in the mob, covering his relationship with his wife Karen Hill and his mob partners Jimmy Conway and Tommy DeVito in the Italian-American crime syndicate.

Title: The Dark Knight
Plot: When the menace known as the Joker wreaks havoc and chaos on the people of Gotham, Batman must accept one of the greatest psychological and physical tests of his ability to fight injustice.

Title: The Godfather
Plot: An organized crime dynasty's aging patriarch transfers control of his clandestine empire to his reluctant son.

Title: The Departed
Plot: An undercover cop and a mole in the police attempt to identify each other while infiltrating an Irish gang in South Boston.

Title: The Usual Suspects
Plot: A sole survivor tells of the twisty events leading up to a horrific gun battle on a boat, which began when five criminals met at a seemingly random police lineup.

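Success! "The Silence of the Lambs" is now our top result. Semantic reranking helped us find the most relevant result by interpreting a natural language query, overcoming the limitations of lexical search, which relies more on exact term matching.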
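
The change in position can be made explicit by comparing the two result orderings. The title lists below are copied from the outputs printed earlier in this notebook; `rank_of` is a hypothetical helper added for illustration.

```python
# Titles in the order returned by the lexical multi_match query.
lexical_titles = [
    "The Godfather",
    "Goodfellas",
    "The Silence of the Lambs",
    "Pulp Fiction",
    "Se7en",
    "The Departed",
    "The Usual Suspects",
    "The Dark Knight",
]

# Titles in the order returned after semantic reranking.
reranked_titles = [
    "The Silence of the Lambs",
    "Pulp Fiction",
    "Se7en",
    "Goodfellas",
    "The Dark Knight",
    "The Godfather",
    "The Departed",
    "The Usual Suspects",
]


def rank_of(titles, target):
    """Return the 1-based position of `target` in `titles`, or None if absent."""
    return titles.index(target) + 1 if target in titles else None


print(rank_of(lexical_titles, "The Silence of the Lambs"))   # prints: 3
print(rank_of(reranked_titles, "The Silence of the Lambs"))  # prints: 1

# With a live response, the title list can be built like this:
# titles = [hit["_source"]["title"] for hit in resp["hits"]["hits"]]
```

Both lists contain the same eight movies: the reranker only reorders the candidates returned by the first-stage query, it does not retrieve new ones.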

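Semantic reranking enables semantic search in a few steps, without the need to generate and store embeddings. Being able to use open source models hosted on Hugging Face natively in your Elasticsearch cluster is great for prototyping, testing, and building search experiences.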
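
To experiment with other queries, the retriever body used in this notebook can be wrapped in a small helper. This is a sketch under stated assumptions: `build_rerank_retriever` is a hypothetical function, not part of the Elasticsearch client, and the optional `rank_window_size` setting (the number of first-stage hits passed to the reranker) is not used in the original notebook.

```python
def build_rerank_retriever(
    first_stage, field, inference_id, inference_text, rank_window_size=None
):
    """Wrap a first-stage retriever in a text_similarity_reranker body."""
    reranker = {
        "retriever": first_stage,
        "field": field,
        "inference_id": inference_id,
        "inference_text": inference_text,
    }
    # Only include the optional setting when the caller asks for it.
    if rank_window_size is not None:
        reranker["rank_window_size"] = rank_window_size
    return {"text_similarity_reranker": reranker}


first_stage = {
    "standard": {
        "query": {"multi_match": {"query": "crime", "fields": ["plot", "genre"]}}
    }
}

retriever = build_rerank_retriever(
    first_stage,
    field="plot",
    inference_id="my-msmarco-minilm-model",
    inference_text="flesh-eating bad guy",
)

# With a live client (not executed here):
# resp = client.search(index="movies", retriever=retriever)
```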
