Semantic reranking with Elasticsearch and Hugging Face

Author: Liam Thompson

In this notebook we'll learn how to implement semantic reranking in Elasticsearch by uploading a model from Hugging Face into an Elasticsearch cluster. We'll use the `retriever` abstraction, a simpler Elasticsearch syntax for crafting queries and combining different search operations.

You will:

  • Choose a cross-encoder model from Hugging Face to perform semantic reranking
  • Upload the model to your Elasticsearch deployment using Eland, Elasticsearch's machine-learning Python client
  • Create an inference endpoint to manage your `rerank` task
  • Query your data using the `text_similarity_rerank` retriever

🧰 Requirements

For this example you will need:

  • An Elastic deployment on version 8.15.0 or later (for non-serverless deployments)
    • We'll use Elastic Cloud for this example (available with a free trial)
    • See our other deployment options
  • You'll need to find your deployment's Cloud ID and create an API key. Learn more
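The version requirement can also be checked programmatically. A minimal sketch, assuming a hypothetical helper that you'd feed the `version.number` string returned by `client.info()` on non-serverless deployments (serverless clusters don't report a version this way):

```python
# Hypothetical helper (not part of the tutorial): compare a deployment's
# version string against the 8.15.0 minimum. For non-serverless deployments
# you could pass client.info()["version"]["number"].
def meets_minimum(version_string: str, minimum=(8, 15, 0)) -> bool:
    # Compare only the numeric major.minor.patch components.
    parts = tuple(int(p) for p in version_string.split(".")[:3])
    return parts >= minimum

print(meets_minimum("8.15.0"))  # True
print(meets_minimum("8.14.3"))  # False
```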

Install and import packages

ℹ️ The `eland` install will take a couple of minutes.

!pip install -qU elasticsearch
!pip install -qU "eland[pytorch]"
from elasticsearch import Elasticsearch, helpers

Initialize the Elasticsearch Python client

First you need to connect to your Elasticsearch instance.

>>> from getpass import getpass

>>> # https://elastic.ac.cn/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id
>>> ELASTIC_CLOUD_ID = getpass("Elastic Cloud ID: ")

>>> # https://elastic.ac.cn/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key
>>> ELASTIC_API_KEY = getpass("Elastic Api Key: ")

>>> # Create the client instance
>>> client = Elasticsearch(
...     # For local development
...     # hosts=["http://localhost:9200"]
...     cloud_id=ELASTIC_CLOUD_ID,
...     api_key=ELASTIC_API_KEY,
... )
Elastic Cloud ID: ··········
Elastic Api Key: ··········

Test connection

Confirm that the Python client has connected to your Elasticsearch instance with this test.

print(client.info())

Index some test data

This example uses a small dataset of movies.

>>> from urllib.request import urlopen
>>> import json
>>> import time

>>> url = "https://huggingface.co/datasets/leemthompo/small-movies/raw/main/small-movies.json"
>>> response = urlopen(url)

>>> # Load the response data into a JSON object
>>> data_json = json.loads(response.read())

>>> # Prepare the documents to be indexed
>>> documents = []
>>> for doc in data_json:
...     documents.append(
...         {
...             "_index": "movies",
...             "_source": doc,
...         }
...     )

>>> # Use helpers.bulk to index
>>> helpers.bulk(client, documents)

>>> print("Done indexing documents into `movies` index!")
>>> time.sleep(3)
Done indexing documents into `movies` index!
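The `documents` list above follows the bulk-action shape that `helpers.bulk` consumes: each action names the target index in `_index` and carries the document body under `_source`. A standalone sketch of that shape (the sample titles here are placeholders, not entries from the dataset):

```python
# Each helpers.bulk action is a plain dict: `_index` routes the document,
# `_source` holds its body. Metadata fields such as `_id` are optional and
# omitted here, so Elasticsearch would auto-generate ids.
sample_docs = [
    {"title": "Example Movie A", "genre": "Crime"},
    {"title": "Example Movie B", "genre": "Thriller"},
]
actions = [{"_index": "movies", "_source": doc} for doc in sample_docs]

print(actions[0]["_index"])  # movies
print(actions[1]["_source"]["genre"])  # Thriller
```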

Upload a Hugging Face model using Eland

Now we'll use Eland's `eland_import_hub_model` command to upload the model to Elasticsearch. For this example we've chosen the `cross-encoder/ms-marco-MiniLM-L-6-v2` text similarity model.

>>> !eland_import_hub_model \
...   --cloud-id $ELASTIC_CLOUD_ID \
...   --es-api-key $ELASTIC_API_KEY \
...   --hub-model-id cross-encoder/ms-marco-MiniLM-L-6-v2 \
...   --task-type text_similarity \
...   --clear-previous \
...   --start
2024-08-13 17:04:12,386 INFO : Establishing connection to Elasticsearch
2024-08-13 17:04:12,567 INFO : Connected to serverless cluster 'bd8c004c050e4654ad32fb86ab159889'
2024-08-13 17:04:12,568 INFO : Loading HuggingFace transformer tokenizer and model 'cross-encoder/ms-marco-MiniLM-L-6-v2'
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
tokenizer_config.json: 100% 316/316 [00:00<00:00, 1.81MB/s]
config.json: 100% 794/794 [00:00<00:00, 4.09MB/s]
vocab.txt: 100% 232k/232k [00:00<00:00, 2.37MB/s]
special_tokens_map.json: 100% 112/112 [00:00<00:00, 549kB/s]
pytorch_model.bin: 100% 90.9M/90.9M [00:00<00:00, 135MB/s]
STAGE:2024-08-13 17:04:15 1454:1454 ActivityProfilerController.cpp:312] Completed Stage: Warm Up
STAGE:2024-08-13 17:04:15 1454:1454 ActivityProfilerController.cpp:318] Completed Stage: Collection
STAGE:2024-08-13 17:04:15 1454:1454 ActivityProfilerController.cpp:322] Completed Stage: Post Processing
2024-08-13 17:04:18,789 INFO : Creating model with id 'cross-encoder__ms-marco-minilm-l-6-v2'
2024-08-13 17:04:21,123 INFO : Uploading model definition
100% 87/87 [00:55<00:00,  1.57 parts/s]
2024-08-13 17:05:16,416 INFO : Uploading model vocabulary
2024-08-13 17:05:16,987 INFO : Starting model deployment
2024-08-13 17:05:18,238 INFO : Model successfully imported with id 'cross-encoder__ms-marco-minilm-l-6-v2'
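Note the id in the final log line: Eland derives the Elasticsearch model id from the Hub model id by replacing `/` with `__` and lowercasing. A small sketch of that convention, inferred from the log output above (treat it as an observed convention, not a guaranteed API):

```python
# Reproduce the model-id mapping seen in the Eland import logs:
# "/" becomes "__" and the whole id is lowercased. This is the id you
# reference in service_settings when creating the inference endpoint.
def es_model_id(hub_model_id: str) -> str:
    return hub_model_id.replace("/", "__").lower()

print(es_model_id("cross-encoder/ms-marco-MiniLM-L-6-v2"))
# cross-encoder__ms-marco-minilm-l-6-v2
```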

Create an inference endpoint

Next we'll create an inference endpoint for the `rerank` task, to deploy and manage our model and, if necessary, spin up the required machine-learning resources behind the scenes.

client.inference.put(
    task_type="rerank",
    inference_id="my-msmarco-minilm-model",
    inference_config={
        "service": "elasticsearch",
        "service_settings": {
            "model_id": "cross-encoder__ms-marco-minilm-l-6-v2",
            "num_allocations": 1,
            "num_threads": 1,
        },
    },
)

Run the following command to confirm your inference endpoint is deployed.

client.inference.get()

⚠️ When deploying your model you might need to sync your machine-learning saved objects in the Kibana (or Serverless) UI. Go to **Trained Models** and select **Synchronize saved objects**.

Lexical queries

First let's use a `standard` retriever to test out some lexical (or full-text) searches, and then we'll compare the improvements when we layer in semantic reranking.

Lexical match with a `query_string` query

Say we vaguely remember a famous movie about a man-eating villain. For the sake of argument, suppose we've momentarily forgotten the word "cannibal".

Let's perform a `query_string` query to find the phrase "flesh-eating bad guy" in the `plot` fields of our Elasticsearch documents.

>>> resp = client.search(
...     index="movies",
...     retriever={
...         "standard": {
...             "query": {
...                 "query_string": {
...                     "query": "flesh-eating bad guy",
...                     "default_field": "plot",
...                 }
...             }
...         }
...     },
... )

>>> if resp["hits"]["hits"]:
...     for hit in resp["hits"]["hits"]:
...         title = hit["_source"]["title"]
...         plot = hit["_source"]["plot"]
...         print(f"Title: {title}\nPlot: {plot}\n")
... else:
...     print("No search results found")
No search results found

No results! Unfortunately we don't have any near matches for "flesh-eating bad guy". Because we don't have any more specific information about the exact phrasing in the Elasticsearch data, we'll need to cast our search net wider.

Simple `multi_match` query

This lexical query performs a standard keyword search for the term "crime" within the "plot" and "genre" fields of our Elasticsearch documents.

>>> resp = client.search(
...     index="movies",
...     retriever={"standard": {"query": {"multi_match": {"query": "crime", "fields": ["plot", "genre"]}}}},
... )

>>> for hit in resp["hits"]["hits"]:
...     title = hit["_source"]["title"]
...     plot = hit["_source"]["plot"]
...     print(f"Title: {title}\nPlot: {plot}\n")
Title: The Godfather
Plot: An organized crime dynasty's aging patriarch transfers control of his clandestine empire to his reluctant son.

Title: Goodfellas
Plot: The story of Henry Hill and his life in the mob, covering his relationship with his wife Karen Hill and his mob partners Jimmy Conway and Tommy DeVito in the Italian-American crime syndicate.

Title: The Silence of the Lambs
Plot: A young F.B.I. cadet must receive the help of an incarcerated and manipulative cannibal killer to help catch another serial killer, a madman who skins his victims.

Title: Pulp Fiction
Plot: The lives of two mob hitmen, a boxer, a gangster and his wife, and a pair of diner bandits intertwine in four tales of violence and redemption.

Title: Se7en
Plot: Two detectives, a rookie and a veteran, hunt a serial killer who uses the seven deadly sins as his motives.

Title: The Departed
Plot: An undercover cop and a mole in the police attempt to identify each other while infiltrating an Irish gang in South Boston.

Title: The Usual Suspects
Plot: A sole survivor tells of the twisty events leading up to a horrific gun battle on a boat, which began when five criminals met at a seemingly random police lineup.

Title: The Dark Knight
Plot: When the menace known as the Joker wreaks havoc and chaos on the people of Gotham, Batman must accept one of the greatest psychological and physical tests of his ability to fight injustice.

Better! At least we've got some results now. We broadened our search criteria to improve the chances of finding relevant results.

But these results aren't very precise in the context of our original query "flesh-eating bad guy". We can see that "The Silence of the Lambs" is returned in the middle of the result set with this generic `match` query. Let's see if we can use our semantic reranking model to get closer to the searcher's original intent.

Semantic reranker

In the following `retriever` syntax, we wrap our standard query retriever in a `text_similarity_reranker`. This allows us to leverage the NLP model we deployed to Elasticsearch to rerank the results based on the phrase "flesh-eating bad guy".

>>> resp = client.search(
...     index="movies",
...     retriever={
...         "text_similarity_reranker": {
...             "retriever": {"standard": {"query": {"multi_match": {"query": "crime", "fields": ["plot", "genre"]}}}},
...             "field": "plot",
...             "inference_id": "my-msmarco-minilm-model",
...             "inference_text": "flesh-eating bad guy",
...         }
...     },
... )

>>> for hit in resp["hits"]["hits"]:
...     title = hit["_source"]["title"]
...     plot = hit["_source"]["plot"]
...     print(f"Title: {title}\nPlot: {plot}\n")
Title: The Silence of the Lambs
Plot: A young F.B.I. cadet must receive the help of an incarcerated and manipulative cannibal killer to help catch another serial killer, a madman who skins his victims.

Title: Pulp Fiction
Plot: The lives of two mob hitmen, a boxer, a gangster and his wife, and a pair of diner bandits intertwine in four tales of violence and redemption.

Title: Se7en
Plot: Two detectives, a rookie and a veteran, hunt a serial killer who uses the seven deadly sins as his motives.

Title: Goodfellas
Plot: The story of Henry Hill and his life in the mob, covering his relationship with his wife Karen Hill and his mob partners Jimmy Conway and Tommy DeVito in the Italian-American crime syndicate.

Title: The Dark Knight
Plot: When the menace known as the Joker wreaks havoc and chaos on the people of Gotham, Batman must accept one of the greatest psychological and physical tests of his ability to fight injustice.

Title: The Godfather
Plot: An organized crime dynasty's aging patriarch transfers control of his clandestine empire to his reluctant son.

Title: The Departed
Plot: An undercover cop and a mole in the police attempt to identify each other while infiltrating an Irish gang in South Boston.

Title: The Usual Suspects
Plot: A sole survivor tells of the twisty events leading up to a horrific gun battle on a boat, which began when five criminals met at a seemingly random police lineup.

Success! "The Silence of the Lambs" is our top hit. Semantic reranking helped us find the most relevant result by parsing a natural-language query, overcoming the limitations of lexical search, which relies more on exact matching.

Semantic reranking enables semantic search in a few steps, without generating and storing embeddings. Being able to use open-source models hosted on Hugging Face natively in your Elasticsearch cluster is great for prototyping, testing, and building search experiences.
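To reuse this pattern beyond a single query, the reranking retriever can be built by a small helper. A minimal sketch: `build_reranker` is a hypothetical function (not part of the Elasticsearch client), and `rank_window_size` (how many first-stage hits get reranked) is an optional tuning field of the `text_similarity_reranker` retriever in recent Elasticsearch versions; check your version's docs before relying on it.

```python
# Hypothetical helper: wrap any first-stage retriever in a
# text_similarity_reranker. Only the dict is built here; you would pass it
# as the `retriever` argument of client.search().
def build_reranker(inner_retriever, field, inference_id, inference_text, rank_window_size=10):
    return {
        "text_similarity_reranker": {
            "retriever": inner_retriever,
            "field": field,
            "inference_id": inference_id,
            "inference_text": inference_text,
            "rank_window_size": rank_window_size,
        }
    }

retriever = build_reranker(
    {"standard": {"query": {"multi_match": {"query": "crime", "fields": ["plot", "genre"]}}}},
    field="plot",
    inference_id="my-msmarco-minilm-model",
    inference_text="flesh-eating bad guy",
)
print(retriever["text_similarity_reranker"]["rank_window_size"])  # 10
```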

Learn more
