supporting-blog-content/multilingual-e5/multilingual-e5.ipynb

{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "s49gpkvZ7q53" }, "source": [ "# Multilingual vector search with E5 embedding models\n", "\n", "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/supporting-blog-content/multilingual-e5/multilingual-e5.ipynb)\n", "\n", "In this example we'll use a multilingual embedding model\n", "[multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) to perform search on a toy dataset of mixed\n", "language documents. The examples in this notebook follow the blog post of the same title: Multilingual vector search with E5 embedding models." ] }, { "cell_type": "markdown", "metadata": { "id": "Y01AXpELkygt" }, "source": [ "# 🧰 Requirements\n", "\n", "For this example, you will need:\n", "\n", "- An Elastic Cloud deployment with an ML node (min. 8 GB memory)\n", " - We'll be using [Elastic Cloud](https://www.elastic.co/guide/en/cloud/current/ec-getting-started.html) for this example (available with a [free trial](https://cloud.elastic.co/registration?onboarding_token=vectorsearch&utm_source=github&utm_content=elasticsearch-labs-notebook))\n" ] }, { "cell_type": "markdown", "metadata": { "id": "N4pI1-eIvWrI" }, "source": [ "## Create Elastic Cloud deployment\n", "\n", "If you don't have an Elastic Cloud deployment, sign up [here](https://cloud.elastic.co/registration?onboarding_token=vectorsearch&utm_source=github&utm_content=elasticsearch-labs-notebook) for a free trial.\n", "\n", "- Go to the [Create deployment](https://cloud.elastic.co/deployments/create) page\n", " - Select **Create deployment**\n", " - Use the default node types for Elasticsearch and Kibana\n", " - Add an ML node with **8 GB memory** (the multilingual E5 base model is larger than most)" ] }, { "cell_type": "markdown", "metadata": { "id": "gaTFHLJC-Mgi" }, "source": [ "# Setup Elasticsearch environment\n", "\n", "To get started, we'll need to connect to our Elastic deployment using the Python client.\n", "Because we're using an Elastic Cloud deployment, we'll use the **Cloud ID** to identify our deployment.\n", "\n", "First we need to `pip` install the packages we need for this example." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "K9Q1p2C9-wce", "outputId": "9745cf6b-d8ae-4c85-9992-3b096645e52c" }, "outputs": [], "source": [ "!pip install elasticsearch eland[pytorch]" ] }, { "cell_type": "markdown", "metadata": { "id": "gEzq2Z1wBs3M" }, "source": [ "Next we need to import the `elasticsearch` module and the `getpass` module.\n", "`getpass` is part of the Python standard library and is used to securely prompt for credentials." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "id": "uP_GTVRi-d96" }, "outputs": [], "source": [ "import getpass\n", "\n", "from elasticsearch import Elasticsearch" ] }, { "cell_type": "markdown", "metadata": { "id": "AMSePFiZCRqX" }, "source": [ "Now we can instantiate the Python Elasticsearch client.\n", "First we prompt the user for their password and Cloud ID.\n", "\n", "🔐 NOTE: `getpass` enables us to securely prompt the user for credentials without echoing them to the terminal, or storing it in memory.\n", "\n", "Then we create a `client` object that instantiates an instance of the `Elasticsearch` class." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "h0MdAZ53CdKL", "outputId": "eac211ff-8172-45af-898c-993f7389d557" }, "outputs": [], "source": [ "# Found in the \"Manage Deployment\" page\n", "CLOUD_ID = getpass.getpass(\"Enter Elastic Cloud ID: \")\n", "\n", "# Password for the \"elastic\" user generated by Elasticsearch\n", "ELASTIC_PASSWORD = getpass.getpass(\"Enter Elastic password: \")\n", "\n", "# Create the client instance\n", "client = Elasticsearch(cloud_id=CLOUD_ID, basic_auth=(\"elastic\", ELASTIC_PASSWORD))\n", "\n", "client.info()" ] }, { "cell_type": "markdown", "metadata": { "id": "rw80JKiZVrPE" }, "source": [ "# Setup emebdding model\n", "\n", "Next we upload the E5 multilingual embedding model into Elasticsearch and create an ingest pipeline to automatically create embeddings when ingesting documents. For more details on this process, please see the blog post: [How to deploy NLP: Text Embeddings and Vector Search](https://www.elastic.co/blog/how-to-deploy-nlp-text-embeddings-and-vector-search)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "I2MQhlQKVrPF", "outputId": "fd2e7798-2be6-4e36-9999-a7a29bd1c537" }, "outputs": [], "source": [ "MODEL_ID = \"multilingual-e5-base\"\n", "\n", "!eland_import_hub_model \\\n", " --cloud-id $CLOUD_ID \\\n", " --es-username elastic \\\n", " --es-password $ELASTIC_PASSWORD \\\n", " --hub-model-id intfloat/$MODEL_ID \\\n", " --es-model-id $MODEL_ID \\\n", " --task-type text_embedding \\\n", " --start" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bMil5myQVrPF" }, "outputs": [], "source": [ "client.ingest.put_pipeline(\n", " id=\"pipeline\",\n", " processors=[\n", " {\n", " \"inference\": {\n", " \"model_id\": MODEL_ID,\n", " \"field_map\": {\"passage\": \"text_field\"}, # field to embed: passage\n", " \"target_field\": \"passage_embedding\", # embedded field: passage_embedding\n", " }\n", " }\n", " ],\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "TF_wxIAhD07a" }, "source": [ "# Index documents\n", "\n", "We need to add a field to support dense vector storage and search.\n", "Note the `passage_embedding.predicted_value` field below, which is used to store the dense vector representation of the `passage` field, and will be automatically populated by the inference processor in the pipeline created above. The `passage_embedding` field will also store metadata from the inference process." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "cvYECABJJs_2" }, "outputs": [], "source": [ "# Define the mapping and settings\n", "mapping = {\n", " \"properties\": {\n", " \"id\": {\"type\": \"keyword\"},\n", " \"language\": {\"type\": \"keyword\"},\n", " \"passage\": {\"type\": \"text\"},\n", " \"passage_embedding.predicted_value\": {\n", " \"type\": \"dense_vector\",\n", " \"dims\": 768,\n", " \"index\": \"true\",\n", " \"similarity\": \"cosine\",\n", " },\n", " },\n", " \"_source\": {\"excludes\": [\"passage_embedding.predicted_value\"]},\n", "}\n", "\n", "settings = {\n", " \"index\": {\n", " \"number_of_replicas\": \"1\",\n", " \"number_of_shards\": \"1\",\n", " \"default_pipeline\": \"pipeline\",\n", " }\n", "}\n", "\n", "# Create the index (deleting any existing index)\n", "client.indices.delete(index=\"passages\", ignore_unavailable=True)\n", "client.indices.create(index=\"passages\", mappings=mapping, settings=settings)" ] }, { "cell_type": "markdown", "metadata": { "id": "_QUZr2gwVrPF" }, "source": [ "Now that we have the pipeline and mappings ready, we can index our documents. This is of course just a demo so we only index the few toy examples from the blog post." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "48GUdoH1VrPF" }, "outputs": [], "source": [ "passages = [\n", " {\n", " \"id\": \"doc1\",\n", " \"language\": \"en\",\n", " \"passage\": \"\"\"I sat on the bank of the river today.\"\"\",\n", " },\n", " {\n", " \"id\": \"doc2\",\n", " \"language\": \"de\",\n", " \"passage\": \"\"\"Ich bin heute zum Flussufer gegangen.\"\"\",\n", " },\n", " {\n", " \"id\": \"doc3\",\n", " \"language\": \"en\",\n", " \"passage\": \"\"\"I walked to the bank today to deposit money.\"\"\",\n", " },\n", " {\n", " \"id\": \"doc4\",\n", " \"language\": \"de\",\n", " \"passage\": \"\"\"Ich saß heute bei der Bank und wartete auf mein Geld.\"\"\",\n", " },\n", "]\n", "\n", "# Index passages, adding first the \"passage: \" instruction for E5\n", "for doc in passages:\n", " doc[\"passage\"] = f\"\"\"passage: {doc[\"passage\"]}\"\"\"\n", " client.index(index=\"passages\", document=doc)" ] }, { "cell_type": "markdown", "metadata": { "id": "MrBCHdH1u8Wd" }, "source": [ "# Multilingual semantic search" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "f2fHLYxnVrPF" }, "outputs": [], "source": [ "def query(q):\n", " \"\"\"Query with embeddings, adding first the \"query: \" instruction for E5.\"\"\"\n", "\n", " return client.search(\n", " index=\"passages\",\n", " knn={\n", " \"field\": \"passage_embedding.predicted_value\",\n", " \"query_vector_builder\": {\n", " \"text_embedding\": {\n", " \"model_id\": MODEL_ID,\n", " \"model_text\": f\"query: {q}\",\n", " }\n", " },\n", " \"k\": 2, # for the demo, we're always just searching for pairs of passages\n", " \"num_candidates\": 5,\n", " },\n", " )\n", "\n", "\n", "def pretty_response(response):\n", " \"\"\"Pretty print search responses.\"\"\"\n", "\n", " for hit in response[\"hits\"][\"hits\"]:\n", " score = hit[\"_score\"]\n", " id = hit[\"_source\"][\"id\"]\n", " language = hit[\"_source\"][\"language\"]\n", " passage = hit[\"_source\"][\"passage\"]\n", " print()\n", " print(f\"ID: {id}\")\n", " print(f\"Language: {language}\")\n", " print(f\"Passage: {passage}\")\n", " print(f\"Score: {score}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "e6ssY6dfVrPG", "outputId": "01625e8c-bef8-485e-e5b8-118f7386d79a" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "ID: doc1\n", "Language: en\n", "Passage: passage: I sat on the bank of the river today.\n", "Score: 0.88001645\n", "\n", "ID: doc2\n", "Language: de\n", "Passage: passage: Ich bin heute zum Flussufer gegangen.\n", "Score: 0.87662137\n" ] } ], "source": [ "# Example 1\n", "pretty_response(query(\"riverside\"))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ifFbSGkSVrPG", "outputId": "a14f0ff6-62a2-4ed2-cec3-2dc2bb22e969" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "ID: doc4\n", "Language: de\n", "Passage: passage: Ich saß heute bei der Bank und wartete auf mein Geld.\n", "Score: 0.8967148\n", "\n", "ID: doc3\n", "Language: en\n", "Passage: passage: I walked to the bank today to deposit money.\n", "Score: 0.8863925\n" ] } ], "source": [ "# Example 2\n", "pretty_response(query(\"Geldautomat\"))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "YqcLx5fSVrPH", "outputId": "5a0e2b19-24dd-4ee6-c887-fad26cb24538" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "ID: doc3\n", "Language: en\n", "Passage: passage: I walked to the bank today to deposit money.\n", "Score: 0.87475425\n", "\n", "ID: doc2\n", "Language: de\n", "Passage: passage: Ich bin heute zum Flussufer gegangen.\n", "Score: 0.8741033\n" ] } ], "source": [ "# Example 3a\n", "pretty_response(query(\"movement\"))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "SoIXRY-jVrPH", "outputId": "2285cd2f-7d79-4553-dbea-dc8844841622" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "ID: doc4\n", "Language: de\n", "Passage: passage: Ich saß heute bei der Bank und wartete auf mein Geld.\n", "Score: 0.85991657\n", "\n", "ID: doc1\n", "Language: en\n", "Passage: passage: I sat on the bank of the river today.\n", "Score: 0.8561436\n" ] } ], "source": [ "# Example 3b\n", "pretty_response(query(\"stillness\"))" ] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3.11.4 64-bit", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" }, "vscode": { "interpreter": { "hash": "b0fa6594d8f4cbf19f97940f81e996739fb7646882a419484c72d19e05852a7e" } } }, "nbformat": 4, "nbformat_minor": 0 }

supporting-blog-content/multilingual-e5/multilingual-e5.ipynb (491 lines of code) (raw):