notebooks/zh-CN/semantic_reranking_elasticsearch.ipynb

{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "tn-NTlKI2EYR" }, "source": [ "# Semantic reranking with Elasticsearch and Hugging Face\n", "\n", "_Author: [Liam Thompson](https://github.com/leemthompo)_\n", "\n", "In this notebook you'll learn how to implement semantic reranking by uploading a model from Hugging Face into your Elasticsearch cluster. You'll use the `retriever` abstraction, a simpler Elasticsearch query syntax for crafting queries and combining different search operations.\n", "\n", "You will:\n", "\n", "- Choose a cross-encoder model from Hugging Face to perform semantic reranking\n", "- Upload the model to your Elasticsearch deployment using [Eland](https://www.elastic.co/guide/en/elasticsearch/client/eland/current/machine-learning.html), a Python client for machine learning with Elasticsearch\n", "- Create an inference endpoint to manage your `rerank` task\n", "- Query your data using the `text_similarity_rerank` retriever" ] }, { "cell_type": "markdown", "metadata": { "id": "-tir9w4Sz80v" }, "source": [ "## 🧰 Prerequisites\n", "\n", "To run this example, you'll need:\n", "\n", "- An Elastic deployment on version 8.15.0 or above (for non-serverless deployments)\n", "\n", " - We'll use Elastic Cloud in this example (available with a [free trial](https://cloud.elastic.co/registration))\n", " - See other [deployment options](https://www.elastic.co/guide/en/elasticsearch/reference/current/elasticsearch-intro.html#elasticsearch-intro-deploy)\n", "- You'll need to find your deployment's Cloud ID and create an API key. [Learn more](https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id).\n" ] }, { "cell_type": "markdown", "metadata": { "id": "Ut-7TfSB2EYS" }, "source": [ "## Install and import packages\n", "\n", "ℹ️ The `eland` install can take a few minutes." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "AzuZInsj2EYS" }, "outputs": [], "source": [ "!pip install -qU elasticsearch\n", "!pip install eland[pytorch]\n", "from elasticsearch import Elasticsearch, helpers" ] }, { "cell_type": "markdown", "metadata": { "id": "FlJXqZkJ2EYT" }, "source": [ "## Initialize the Elasticsearch Python client\n", "\n", "First, you need to connect to your Elasticsearch instance." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "frMDUtzt2EYT", "outputId": "6339d445-e5ef-41e4-83f2-100b58875c27" }, "outputs": [ { "name": "stdout", "output_type": "stream",
"text": [ "Elastic Cloud ID: ··········\n", "Elastic Api Key: ··········\n" ] } ], "source": [ "from getpass import getpass\n", "\n", "# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id\n", "ELASTIC_CLOUD_ID = getpass(\"Elastic Cloud ID: \")\n", "\n", "# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key\n", "ELASTIC_API_KEY = getpass(\"Elastic Api Key: \")\n", "\n", "# Create the client instance\n", "client = Elasticsearch(\n", " # For local development\n", " # hosts=[\"http://localhost:9200\"]\n", " cloud_id=ELASTIC_CLOUD_ID,\n", " api_key=ELASTIC_API_KEY,\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "rL5AEaDrzj9t" }, "source": [ "## Test connection\n", "\n", "Confirm that the Python client has connected to your Elasticsearch instance with this test." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "t_lhez7Tznhp" }, "outputs": [], "source": [ "print(client.info())" ] }, { "cell_type": "markdown", "metadata": { "id": "yR-zE2ii2EYU" }, "source": [ "This example uses a small dataset of movies." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "bWEpw_X52EYU", "outputId": "e4faa762-cb9d-487f-9055-54b2e4476a53" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done indexing documents into `movies` index!\n" ] } ], "source": [ "from urllib.request import urlopen\n", "import json\n", "import time\n", "\n", "url = \"https://huggingface.co/datasets/leemthompo/small-movies/raw/main/small-movies.json\"\n", "response = urlopen(url)\n", "\n", "# Load the response data into a JSON object\n", "data_json = json.loads(response.read())\n", "\n", "# Prepare the documents to be indexed\n", "documents = []\n", "for doc in data_json:\n", " documents.append(\n", " {\n", " \"_index\": \"movies\",\n", " \"_source\": doc,\n", " }\n", " )\n", "\n", "# Use helpers.bulk to index\n", "helpers.bulk(client, documents)\n", "\n",
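"# (Illustrative addition, not from the original notebook.) An explicit refresh\n", "# makes the bulk-indexed documents searchable immediately, and a count of the\n", "# `movies` index gives a quick sanity check that indexing succeeded.\n", "client.indices.refresh(index=\"movies\")\n", "doc_count = client.count(index=\"movies\")[\"count\"]\n", "print(f\"Indexed {doc_count} documents\")\n", "\n",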
"print(\"Done indexing documents into `movies` index!\")\n", "time.sleep(3)" ] }, { "cell_type": "markdown", "metadata": { "id": "AWd_YOZh2EYU" }, "source": [ "## Upload a Hugging Face model using Eland\n", "\n", "Now we'll use Eland's `eland_import_hub_model` command to upload the model to Elasticsearch. For this example we've chosen the `cross-encoder/ms-marco-MiniLM-L-6-v2` text similarity model." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "sra93pWm2EYV", "outputId": "a7bb15dc-bb43-492e-f92e-dd1d78711fb8" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2024-08-13 17:04:12,386 INFO : Establishing connection to Elasticsearch\n", "2024-08-13 17:04:12,567 INFO : Connected to serverless cluster 'bd8c004c050e4654ad32fb86ab159889'\n", "2024-08-13 17:04:12,568 INFO : Loading HuggingFace transformer tokenizer and model 'cross-encoder/ms-marco-MiniLM-L-6-v2'\n", "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. 
If you want to force a new download, use `force_download=True`.\n", " warnings.warn(\n", "tokenizer_config.json: 100% 316/316 [00:00<00:00, 1.81MB/s]\n", "config.json: 100% 794/794 [00:00<00:00, 4.09MB/s]\n", "vocab.txt: 100% 232k/232k [00:00<00:00, 2.37MB/s]\n", "special_tokens_map.json: 100% 112/112 [00:00<00:00, 549kB/s]\n", "pytorch_model.bin: 100% 90.9M/90.9M [00:00<00:00, 135MB/s]\n", "STAGE:2024-08-13 17:04:15 1454:1454 ActivityProfilerController.cpp:312] Completed Stage: Warm Up\n", "STAGE:2024-08-13 17:04:15 1454:1454 ActivityProfilerController.cpp:318] Completed Stage: Collection\n", "STAGE:2024-08-13 17:04:15 1454:1454 ActivityProfilerController.cpp:322] Completed Stage: Post Processing\n", "2024-08-13 17:04:18,789 INFO : Creating model with id 'cross-encoder__ms-marco-minilm-l-6-v2'\n", "2024-08-13 17:04:21,123 INFO : Uploading model definition\n", "100% 87/87 [00:55<00:00, 1.57 parts/s]\n", "2024-08-13 17:05:16,416 INFO : Uploading model vocabulary\n", "2024-08-13 17:05:16,987 INFO : Starting model deployment\n", "2024-08-13 17:05:18,238 INFO : Model successfully imported with id 'cross-encoder__ms-marco-minilm-l-6-v2'\n" ] } ], "source": [ "!eland_import_hub_model \\\n", " --cloud-id $ELASTIC_CLOUD_ID \\\n", " --es-api-key $ELASTIC_API_KEY \\\n", " --hub-model-id cross-encoder/ms-marco-MiniLM-L-6-v2 \\\n", " --task-type text_similarity \\\n", " --clear-previous \\\n", " --start" ] }, { "cell_type": "markdown", "metadata": { "id": "DRIABkGAgV_Q" }, "source": [ "## Create an inference endpoint\n", "\n", "Next, we'll create an inference endpoint for the `rerank` task to deploy and manage our model, spinning up the necessary machine learning resources as needed." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "DiKsd3YygV_Q", "outputId": "c3c46c6b-b502-4167-c98c-d2e2e0a4613c" }, "outputs": [ { "data": { "text/plain": [ "ObjectApiResponse({'endpoints': [{'model_id': 'my-msmarco-minilm-model', 'inference_id': 'my-msmarco-minilm-model', 'task_type': 'rerank', 'service': 'elasticsearch', 
'service_settings': {'num_allocations': 1, 'num_threads': 1, 'model_id': 'cross-encoder__ms-marco-minilm-l-6-v2'}, 'task_settings': {'return_documents': True}}]})" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "client.inference.put(\n", " task_type=\"rerank\",\n", " inference_id=\"my-msmarco-minilm-model\",\n", " inference_config={\n", " \"service\": \"elasticsearch\",\n", " \"service_settings\": {\n", " \"model_id\": \"cross-encoder__ms-marco-minilm-l-6-v2\",\n", " \"num_allocations\": 1,\n", " \"num_threads\": 1,\n", " },\n", " },\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run the following command to confirm that your inference endpoint is deployed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "client.inference.get()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "⚠️ When you deploy your model, you might need to sync your machine learning saved objects in the Kibana (or Serverless) UI. \n", "Go to **Trained Models** and select **Synchronize saved objects**." ] }, { "cell_type": "markdown", "metadata": { "id": "pDAB9pX3VGKE" }, "source": [ "## Lexical queries\n", "\n", "First, let's test out some lexical (or full-text) searches with a `standard` retriever, and then we'll compare the improvements when we layer in semantic reranking.\n", "\n", "### Lexical match with a `query_string` query\n", "\n", "Say we vaguely remember a famous movie about a killer who eats his victims. For the sake of argument, pretend we've momentarily forgotten the word \"cannibal\".\n", "\n", "Let's perform a [`query_string` query](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html) to find the phrase \"flesh-eating bad guy\" in the `plot` field of our Elasticsearch documents." ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "EAG0yZ_bVX-j", "outputId": "a8446bbf-ece5-47db-c7cc-ce9856ab9811" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "No search results found\n" ] } ], "source": [ "resp = client.search(\n", " index=\"movies\",\n", " retriever={\n", " \"standard\": {\n", " \"query\": {\n", " \"query_string\": {\n", " \"query\": \"flesh-eating bad guy\",\n", " \"default_field\": \"plot\",\n", " }\n", " }\n", " }\n", " },\n", ")\n", "\n", "if 
resp[\"hits\"][\"hits\"]:\n", " for hit in resp[\"hits\"][\"hits\"]:\n", " title = hit[\"_source\"][\"title\"]\n", " plot = hit[\"_source\"][\"plot\"]\n", " print(f\"Title: {title}\\nPlot: {plot}\\n\")\n", "else:\n", " print(\"No search results found\")" ] }, { "cell_type": "markdown", "metadata": { "id": "48_nf56d2EYV" }, "source": [ "No results! Unfortunately, we don't have any exact matches for \"flesh-eating bad guy\". Because we don't have any more information about the exact phrasing in the Elasticsearch data, we'll need to cast our search net wider.\n", "\n", "### Simple `multi_match` query\n", "\n", "This lexical query performs a standard keyword search for the term \"crime\" within the `plot` and `genre` fields of our Elasticsearch documents." ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "XkAlzHW82EYV", "outputId": "f01c6ca5-f2fd-4cba-f5bf-7b8f6914eef6" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Title: The Godfather\n", "Plot: An organized crime dynasty's aging patriarch transfers control of his clandestine empire to his reluctant son.\n", "\n", "Title: Goodfellas\n", "Plot: The story of Henry Hill and his life in the mob, covering his relationship with his wife Karen Hill and his mob partners Jimmy Conway and Tommy DeVito in the Italian-American crime syndicate.\n", "\n", "Title: The Silence of the Lambs\n", "Plot: A young F.B.I. 
cadet must receive the help of an incarcerated and manipulative cannibal killer to help catch another serial killer, a madman who skins his victims.\n", "\n", "Title: Pulp Fiction\n", "Plot: The lives of two mob hitmen, a boxer, a gangster and his wife, and a pair of diner bandits intertwine in four tales of violence and redemption.\n", "\n", "Title: Se7en\n", "Plot: Two detectives, a rookie and a veteran, hunt a serial killer who uses the seven deadly sins as his motives.\n", "\n", "Title: The Departed\n", "Plot: An undercover cop and a mole in the police attempt to identify each other while infiltrating an Irish gang in South Boston.\n", "\n", "Title: The Usual Suspects\n", "Plot: A sole survivor tells of the twisty events leading up to a horrific gun battle on a boat, which began when five criminals met at a seemingly random police lineup.\n", "\n", "Title: The Dark Knight\n", "Plot: When the menace known as the Joker wreaks havoc and chaos on the people of Gotham, Batman must accept one of the greatest psychological and physical tests of his ability to fight injustice.\n", "\n" ] } ], "source": [ "resp = client.search(\n", " index=\"movies\",\n", " retriever={\n", " \"standard\": {\n", " \"query\": {\"multi_match\": {\"query\": \"crime\", \"fields\": [\"plot\", \"genre\"]}}\n", " }\n", " },\n", ")\n", "\n", "for hit in resp[\"hits\"][\"hits\"]:\n", " title = hit[\"_source\"][\"title\"]\n", " plot = hit[\"_source\"][\"plot\"]\n", " print(f\"Title: {title}\\nPlot: {plot}\\n\")" ] }, { "cell_type": "markdown", "metadata": { "id": "2K1C_Pmzmrxw" }, "source": [ "That's better! At least now we have some results. We broadened our search criteria to increase the chances of finding relevant results.\n", "\n", "But these results aren't very precise in the context of our original query \"flesh-eating bad guy\". We can see that \"The Silence of the Lambs\" appears in the middle of the results list with this generic `match` query. Let's see if we can use our semantic reranking model to get closer to the searcher's original intent." ] }, { "cell_type": "markdown", "metadata": { "id": "Nd8gAE6H2EYV" }, "source": [ "## Semantic reranking\n", "\n", "In the following `retriever` syntax, we wrap our standard query retriever in a `text_similarity_reranker`. This allows us to leverage the NLP model we deployed to Elasticsearch to rerank the results based on the phrase \"flesh-eating bad guy\"." ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Z9DegKqb2EYV", "outputId": "fc3f8827-4ddd-46d0-cccd-5376e0a81bbf" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Title: The Silence of the Lambs\n", "Plot: A young F.B.I. cadet must receive the help of an incarcerated and manipulative cannibal killer to help catch another serial killer, a madman who skins his victims.\n", "\n", "Title: Pulp Fiction\n", "Plot: The lives of two mob hitmen, a boxer, a gangster and his wife, and a pair of diner bandits intertwine in four tales of violence and redemption.\n", "\n", "Title: Se7en\n", "Plot: Two detectives, a rookie and a veteran, hunt a serial killer who uses the seven deadly sins as his motives.\n", "\n", "Title: Goodfellas\n", "Plot: The story of Henry Hill and his life in the mob, covering his relationship with his wife Karen Hill and his mob partners Jimmy Conway and Tommy DeVito in the Italian-American crime syndicate.\n", "\n", "Title: The Dark Knight\n", "Plot: When the menace known as the Joker wreaks havoc and chaos on the people of Gotham, Batman must accept one of the greatest psychological and physical tests of his ability to fight injustice.\n", "\n", "Title: The Godfather\n", "Plot: An organized crime dynasty's aging patriarch transfers control of his clandestine empire to his reluctant son.\n", "\n", "Title: The Departed\n", "Plot: An undercover cop and a mole in the police attempt to identify each other while infiltrating an Irish gang in South Boston.\n", "\n", "Title: The Usual Suspects\n", "Plot: A sole survivor tells of the twisty events leading up to a horrific gun battle on a boat, which began when five criminals met at a seemingly random police lineup.\n", "\n" ] } ], "source": [ "resp = client.search(\n", " index=\"movies\",\n", " retriever={\n", " \"text_similarity_reranker\": {\n", " \"retriever\": {\n", " 
\"standard\": {\n", " \"query\": {\n", " \"multi_match\": {\"query\": \"crime\", \"fields\": [\"plot\", \"genre\"]}\n", " }\n", " }\n", " },\n", " \"field\": \"plot\",\n", " \"inference_id\": \"my-msmarco-minilm-model\",\n", " \"inference_text\": \"flesh-eating bad guy\",\n", " }\n", " },\n", ")\n", "\n", "for hit in resp[\"hits\"][\"hits\"]:\n", " title = hit[\"_source\"][\"title\"]\n", " plot = hit[\"_source\"][\"plot\"]\n", " print(f\"Title: {title}\\nPlot: {plot}\\n\")" ] }, { "cell_type": "markdown", "metadata": { "id": "LqVh3qt82EYW" }, "source": [ "Success! \"The Silence of the Lambs\" is our top result. Semantic reranking helped us find the most relevant result by parsing a natural language query, overcoming the limitations of lexical search that relies on exact matching.\n", "\n", "Semantic reranking enables semantic search in a few steps, without the need for generating and storing embeddings. Being able to use open source models hosted on Hugging Face natively in your Elasticsearch cluster is great for prototyping, testing, and building search experiences." ] }, { "cell_type": "markdown", "metadata": { "id": "6yYfJjjstxwK" }, "source": [ "## Learn more\n", "\n", "- In this example we chose the [`cross-encoder/ms-marco-MiniLM-L-6-v2`](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2) text similarity model. Refer to the [Elastic NLP model reference](https://www.elastic.co/guide/en/machine-learning/8.15/ml-nlp-model-ref.html#ml-nlp-model-ref-text-similarity) for a list of third-party text similarity models supported by Elasticsearch.\n", "- Learn more about [integrating Hugging Face with Elasticsearch](https://www.elastic.co/search-labs/integrations/hugging-face).\n", "- See Elastic's catalog of Python notebooks in the [`elasticsearch-labs` repo](https://github.com/elastic/elasticsearch-labs/tree/main/notebooks).\n", "- Learn more about [retrievers and reranking](https://www.elastic.co/guide/en/elasticsearch/reference/current/retrievers-reranking-overview.html) in Elasticsearch." ] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 0 }