supporting-blog-content/lexical-and-semantic-search-with-elasticsearch/ecommerce_dense_sparse_project.ipynb
{
"cells": [
{
"cell_type": "markdown",
"id": "r8OKk3QOGBXl",
"metadata": {
"id": "r8OKk3QOGBXl"
},
"source": [
"# **Lexical and Semantic Search with Elasticsearch**\n",
"\n",
"In this example, you will explore various approaches to retrieving information using Elasticsearch, focusing on text analysis, lexical search, and semantic search.\n",
"\n",
"To accomplish this, the example demonstrates various search scenarios on a dataset generated to simulate e-commerce product information.\n",
"\n",
"This dataset contains over 2,500 products, each with a description. These products are categorized into 76 distinct product categories, with each category containing a varying number of products.\n",
"\n",
"## **🧰 Requirements**\n",
"\n",
"For this example, you will need:\n",
"\n",
"- Python 3.6 or later\n",
"- The Elastic Python client\n",
"- Elastic 8.8 deployment or later, with 8GB memory machine learning node\n",
"- The Elastic Learned Sparse EncodeR (ELSER) model, which comes pre-loaded with Elastic, installed and started on your deployment\n",
"\n",
"We'll be using [Elastic Cloud](https://www.elastic.co/guide/en/cloud/current/ec-getting-started.html), a [free trial](https://cloud.elastic.co/registration?onboarding_token=vectorsearch&utm_source=github&utm_content=elasticsearch-labs-notebook) is available."
]
},
{
"cell_type": "markdown",
"id": "hmMWo2e-IkTB",
"metadata": {
"id": "hmMWo2e-IkTB"
},
"source": [
"## Set up the Elasticsearch environment\n",
"\n",
"To get started, we'll need to connect to our Elastic deployment using the Python client.\n",
"\n",
"Because we're using an Elastic Cloud deployment, we'll use the **Cloud ID** to identify our deployment.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e8d24cd8-a437-4bd2-a1f0-93e535ccf8a9",
"metadata": {
"id": "e8d24cd8-a437-4bd2-a1f0-93e535ccf8a9"
},
"outputs": [],
"source": [
"!pip -q install elasticsearch==8.8  # Elasticsearch Python client"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8c36e9b5-8f2b-4734-9213-1350caa7f837",
"metadata": {
"id": "8c36e9b5-8f2b-4734-9213-1350caa7f837"
},
"outputs": [],
"source": [
"!pip -q install eland elasticsearch sentence_transformers transformers torch==1.11  # Eland Python client and model dependencies"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "eaf90bc8-647e-4ada-9aa9-5cb9e60762b7",
"metadata": {
"id": "eaf90bc8-647e-4ada-9aa9-5cb9e60762b7"
},
"outputs": [],
"source": [
"from elasticsearch import (\n",
" Elasticsearch,\n",
" helpers,\n",
") # Import the Elasticsearch client and helpers module\n",
"from urllib.request import urlopen # library for opening URLs\n",
"import json # module for handling JSON data\n",
"from pathlib import Path # module for working with file paths\n",
"\n",
"# Python client and toolkit for machine learning in Elasticsearch\n",
"from eland.ml.pytorch import PyTorchModel\n",
"from eland.ml.pytorch.transformers import TransformerModel\n",
"from elasticsearch.client import MlClient # Elastic module for ml\n",
"import getpass # handling password input"
]
},
{
"cell_type": "markdown",
"id": "ea1VkDBXJIQR",
"metadata": {
"id": "ea1VkDBXJIQR"
},
"source": [
"Now we can instantiate the Python Elasticsearch client.\n",
"\n",
"First we prompt the user for their password and Cloud ID.\n",
"\n",
"🔐 NOTE: `getpass` enables us to securely prompt the user for credentials without echoing them to the terminal.\n",
"\n",
"Then we create a `client` object, an instance of the `Elasticsearch` class."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6907a2bf-4927-428e-9ca8-9df3dd35a2cc",
"metadata": {
"id": "6907a2bf-4927-428e-9ca8-9df3dd35a2cc"
},
"outputs": [],
"source": [
"# Found in the 'Manage Deployment' page\n",
"CLOUD_ID = getpass.getpass(\"Enter Elastic Cloud ID: \")\n",
"\n",
"# Password for the 'elastic' user generated by Elasticsearch\n",
"ELASTIC_PASSWORD = getpass.getpass(\"Enter Elastic password: \")\n",
"\n",
"# Create the client instance\n",
"client = Elasticsearch(\n",
" cloud_id=CLOUD_ID, basic_auth=(\"elastic\", ELASTIC_PASSWORD), request_timeout=3600\n",
")"
]
},
{
"cell_type": "markdown",
"id": "BH-N6epTJarM",
"metadata": {
"id": "BH-N6epTJarM"
},
"source": [
"## Set up the embedding model\n",
"\n",
"Next, we upload the `all-mpnet-base-v2` embedding model into Elasticsearch and create an ingest pipeline with inference processors for text embedding (dense) and text expansion (sparse), using the `description` field as input for both. This field contains the description of each product."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7f6f3f5a-2b93-4a0c-93c8-c887ca80f687",
"metadata": {
"id": "7f6f3f5a-2b93-4a0c-93c8-c887ca80f687"
},
"outputs": [],
"source": [
"# Set the model name from Hugging Face and task type\n",
"# sentence-transformers model\n",
"hf_model_id = \"sentence-transformers/all-mpnet-base-v2\"\n",
"tm = TransformerModel(hf_model_id, \"text_embedding\")\n",
"\n",
"# set the modelID as it is named in Elasticsearch\n",
"es_model_id = tm.elasticsearch_model_id()\n",
"\n",
"# Download the model from Hugging Face\n",
"tmp_path = \"models\"\n",
"Path(tmp_path).mkdir(parents=True, exist_ok=True)\n",
"model_path, config, vocab_path = tm.save(tmp_path)\n",
"\n",
"# Load the model into Elasticsearch\n",
"ptm = PyTorchModel(client, es_model_id)\n",
"ptm.import_model(\n",
" model_path=model_path, config_path=None, vocab_path=vocab_path, config=config\n",
")\n",
"\n",
"# Start the model\n",
"s = MlClient.start_trained_model_deployment(client, model_id=es_model_id)\n",
"s.body"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6739f55b-6983-4b48-9349-6e0111b313fe",
"metadata": {
"id": "6739f55b-6983-4b48-9349-6e0111b313fe"
},
"outputs": [],
"source": [
"# Creating an ingest pipeline with inference processors to use ELSER (sparse) and all-mpnet-base-v2 (dense) to infer against data that will be ingested in the pipeline.\n",
"\n",
"client.ingest.put_pipeline(\n",
" id=\"ecommerce-pipeline\",\n",
" processors=[\n",
" {\n",
" \"inference\": {\n",
" \"model_id\": \"elser_model\",\n",
" \"target_field\": \"ml\",\n",
" \"field_map\": {\"description\": \"text_field\"},\n",
" \"inference_config\": {\n",
" \"text_expansion\": { # text_expansion inference type (ELSER)\n",
" \"results_field\": \"tokens\"\n",
" }\n",
" },\n",
" }\n",
" },\n",
" {\n",
" \"inference\": {\n",
" \"model_id\": \"sentence-transformers__all-mpnet-base-v2\",\n",
" \"target_field\": \"description_vector\", # Target field for the inference results\n",
" \"field_map\": {\n",
" \"description\": \"text_field\" # Field matching our configured trained model input. Typically for NLP models, the field name is text_field.\n",
" },\n",
" }\n",
" },\n",
" ],\n",
")"
]
},
{
"cell_type": "markdown",
"id": "QUQ1nCaiKIQr",
"metadata": {
"id": "QUQ1nCaiKIQr"
},
"source": [
"## Index documents\n",
"\n",
"Then we create two indices: a source index, `ecommerce`, into which we load `products-ecommerce.json`, and a destination index, `ecommerce-search`, into which we will reindex those documents.\n",
"\n",
"For the `ecommerce-search` index, we add `description_vector.predicted_value`, the target field for inference results, to support dense vector storage and search. Its field type is `dense_vector`; the `all-mpnet-base-v2` model has an embedding size of 768, so `dims` is set to 768. We also add an `ml.tokens` field with the `rank_features` field type to support the text expansion output."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6e115bd0-e758-44db-b5b9-96217af472c1",
"metadata": {
"id": "6e115bd0-e758-44db-b5b9-96217af472c1"
},
"outputs": [],
"source": [
"# Index to load products-ecommerce.json docs\n",
"\n",
"client.indices.create(\n",
" index=\"ecommerce\",\n",
" mappings={\n",
" \"properties\": {\n",
" \"product\": {\n",
" \"type\": \"text\",\n",
" \"fields\": {\"keyword\": {\"type\": \"keyword\", \"ignore_above\": 256}},\n",
" },\n",
" \"description\": {\n",
" \"type\": \"text\",\n",
" \"fields\": {\"keyword\": {\"type\": \"keyword\", \"ignore_above\": 256}},\n",
" },\n",
" \"category\": {\n",
" \"type\": \"text\",\n",
" \"fields\": {\"keyword\": {\"type\": \"keyword\", \"ignore_above\": 256}},\n",
" },\n",
" }\n",
" },\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9b53b39e-d74e-4fa8-a364-e2c3caf37418",
"metadata": {
"id": "9b53b39e-d74e-4fa8-a364-e2c3caf37418"
},
"outputs": [],
"source": [
"# Reindex dest index\n",
"\n",
"INDEX = \"ecommerce-search\"\n",
"client.indices.create(\n",
" index=INDEX,\n",
" settings={\"index\": {\"number_of_shards\": 1, \"number_of_replicas\": 1}},\n",
" mappings={\n",
" # Saving disk space by excluding the ELSER tokens and the dense_vector field from document source.\n",
" # Note: That should only be applied if you are certain that reindexing will not be required in the future.\n",
" \"_source\": {\"excludes\": [\"ml.tokens\", \"description_vector.predicted_value\"]},\n",
" \"properties\": {\n",
" \"product\": {\n",
" \"type\": \"text\",\n",
" \"fields\": {\"keyword\": {\"type\": \"keyword\", \"ignore_above\": 256}},\n",
" },\n",
" \"description\": {\n",
" \"type\": \"text\",\n",
" \"fields\": {\"keyword\": {\"type\": \"keyword\", \"ignore_above\": 256}},\n",
" },\n",
" \"category\": {\n",
" \"type\": \"text\",\n",
" \"fields\": {\"keyword\": {\"type\": \"keyword\", \"ignore_above\": 256}},\n",
" },\n",
" \"ml.tokens\": { # The name of the field to contain the generated tokens.\n",
" \"type\": \"rank_features\" # ELSER output must be ingested into a field with the rank_features field type.\n",
" },\n",
" \"description_vector.predicted_value\": { # Inference results field, target_field.predicted_value\n",
" \"type\": \"dense_vector\",\n",
" \"dims\": 768, # The all-mpnet-base-v2 model has embedding_size of 768, so dims is set to 768.\n",
" \"index\": \"true\",\n",
" \"similarity\": \"dot_product\", # When indexing vectors for approximate kNN search, you need to specify the similarity function for comparing the vectors.\n",
" },\n",
" },\n",
" },\n",
")"
]
},
{
"cell_type": "markdown",
"id": "Vo-LKu8TOT5j",
"metadata": {
"id": "Vo-LKu8TOT5j"
},
"source": [
"## Load documents\n",
"\n",
"Then we load `products-ecommerce.json` into the `ecommerce` index."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3cfdc3b7-7e4f-4111-997b-c333ac8938ba",
"metadata": {
"id": "3cfdc3b7-7e4f-4111-997b-c333ac8938ba"
},
"outputs": [],
"source": [
"# dataset\n",
"\n",
"url = \"https://raw.githubusercontent.com/elastic/elasticsearch-labs/02c01b3450e8ddc72ccec85d559eee5280c185ac/supporting-blog-content/lexical-and-semantic-search-with-elasticsearch/products-ecommerce.json\"  # raw JSON dataset\n",
"\n",
"response = urlopen(url)\n",
"\n",
"# Load the response data into a JSON object\n",
"data_json = json.loads(response.read())\n",
"\n",
"\n",
"def create_index_body(doc):\n",
" \"\"\"Generate the body for an Elasticsearch document.\"\"\"\n",
" return {\n",
" \"_index\": \"ecommerce\",\n",
" \"_source\": doc,\n",
" }\n",
"\n",
"\n",
"# Prepare the documents to be indexed\n",
"documents = [create_index_body(doc) for doc in data_json]\n",
"\n",
"# Use helpers.bulk to index\n",
"helpers.bulk(client, documents)\n",
"\n",
"print(\"Done indexing documents into `ecommerce` index\")"
]
},
{
"cell_type": "markdown",
"id": "3dShN9W4Opl8",
"metadata": {
"id": "3dShN9W4Opl8"
},
"source": [
"## Reindex\n",
"\n",
"Now we can reindex data from the `source` index `ecommerce` to the `dest` index `ecommerce-search` with the ingest pipeline `ecommerce-pipeline` we created.\n",
"\n",
"After this step our `dest` index will have the fields we need to perform Semantic Search."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4297cb0b-ae2e-44f9-811d-27a41c43a858",
"metadata": {
"id": "4297cb0b-ae2e-44f9-811d-27a41c43a858"
},
"outputs": [],
"source": [
"# Reindex data from one index 'source' to another 'dest' with the 'ecommerce-pipeline' pipeline.\n",
"\n",
"client.reindex(\n",
" wait_for_completion=True,\n",
" source={\"index\": \"ecommerce\"},\n",
" dest={\"index\": \"ecommerce-search\", \"pipeline\": \"ecommerce-pipeline\"},\n",
")"
]
},
{
"cell_type": "markdown",
"id": "-qUXNuOvPDsI",
"metadata": {
"id": "-qUXNuOvPDsI"
},
"source": [
"## Text Analysis with Standard Analyzer"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "829ae6e8-807d-4f0d-ada6-fee86748b91a",
"metadata": {
"id": "829ae6e8-807d-4f0d-ada6-fee86748b91a"
},
"outputs": [],
"source": [
"# Performs text analysis on a string and returns the resulting tokens.\n",
"\n",
"# Define the text to be analyzed\n",
"text = \"Comfortable furniture for a large balcony\"\n",
"\n",
"# Define the analyze request\n",
"request_body = {\"analyzer\": \"standard\", \"text\": text} # Standard Analyzer\n",
"\n",
"# Perform the analyze request\n",
"response = client.indices.analyze(\n",
" analyzer=request_body[\"analyzer\"], text=request_body[\"text\"]\n",
")\n",
"\n",
"# Extract and display the analyzed tokens\n",
"tokens = [token[\"token\"] for token in response[\"tokens\"]]\n",
"print(\"Analyzed Tokens:\", tokens)"
]
},
{
"cell_type": "markdown",
"id": "12u70NLmPyNV",
"metadata": {
"id": "12u70NLmPyNV"
},
"source": [
"## Text Analysis with Stop Analyzer"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "55b602d1-f1e4-4b70-9273-5fc701ac9039",
"metadata": {
"id": "55b602d1-f1e4-4b70-9273-5fc701ac9039"
},
"outputs": [],
"source": [
"# Performs text analysis on a string and returns the resulting tokens.\n",
"\n",
"# Define the text to be analyzed\n",
"text = \"Comfortable furniture for a large balcony\"\n",
"\n",
"# Define the analyze request\n",
"request_body = {\"analyzer\": \"stop\", \"text\": text} # Stop Analyzer\n",
"\n",
"# Perform the analyze request\n",
"response = client.indices.analyze(\n",
" analyzer=request_body[\"analyzer\"], text=request_body[\"text\"]\n",
")\n",
"\n",
"# Extract and display the analyzed tokens\n",
"tokens = [token[\"token\"] for token in response[\"tokens\"]]\n",
"print(\"Analyzed Tokens:\", tokens)"
]
},
{
"cell_type": "markdown",
"id": "8G8MKcUvP0zs",
"metadata": {
"id": "8G8MKcUvP0zs"
},
"source": [
"## Lexical Search"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f4984f6c-ceec-46a4-b64c-f749e6b1b04f",
"metadata": {
"id": "f4984f6c-ceec-46a4-b64c-f749e6b1b04f"
},
"outputs": [],
"source": [
"# BM25\n",
"\n",
"response = client.search(\n",
" size=2,\n",
" index=\"ecommerce-search\",\n",
" query={\n",
" \"match\": {\n",
" \"description\": {\n",
" \"query\": \"Comfortable furniture for a large balcony\",\n",
" \"analyzer\": \"stop\",\n",
" }\n",
" }\n",
" },\n",
")\n",
"hits = response[\"hits\"][\"hits\"]\n",
"\n",
"if not hits:\n",
" print(\"No matches found\")\n",
"else:\n",
" for hit in hits:\n",
" score = hit[\"_score\"]\n",
" product = hit[\"_source\"][\"product\"]\n",
" category = hit[\"_source\"][\"category\"]\n",
" description = hit[\"_source\"][\"description\"]\n",
" print(\n",
" f\"\\nScore: {score}\\nProduct: {product}\\nCategory: {category}\\nDescription: {description}\\n\"\n",
" )"
]
},
{
"cell_type": "markdown",
"id": "xiywcf_-P39a",
"metadata": {
"id": "xiywcf_-P39a"
},
"source": [
"## Semantic Search with Dense Vector"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "72187c9a-14c1-4084-a080-4e5c1e614f22",
"metadata": {
"id": "72187c9a-14c1-4084-a080-4e5c1e614f22"
},
"outputs": [],
"source": [
"# KNN\n",
"\n",
"response = client.search(\n",
" index=\"ecommerce-search\",\n",
" size=2,\n",
" knn={\n",
" \"field\": \"description_vector.predicted_value\",\n",
" \"k\": 50, # Number of nearest neighbors to return as top hits.\n",
" \"num_candidates\": 500, # Number of nearest neighbor candidates to consider per shard. Increasing num_candidates tends to improve the accuracy of the final k results.\n",
" \"query_vector_builder\": { # Object indicating how to build a query_vector. kNN search enables you to perform semantic search by using a previously deployed text embedding model.\n",
" \"text_embedding\": {\n",
" \"model_id\": \"sentence-transformers__all-mpnet-base-v2\", # Text embedding model id\n",
" \"model_text\": \"Comfortable furniture for a large balcony\", # Query\n",
" }\n",
" },\n",
" },\n",
")\n",
"\n",
"for hit in response[\"hits\"][\"hits\"]:\n",
"\n",
" score = hit[\"_score\"]\n",
" product = hit[\"_source\"][\"product\"]\n",
" category = hit[\"_source\"][\"category\"]\n",
" description = hit[\"_source\"][\"description\"]\n",
" print(\n",
" f\"\\nScore: {score}\\nProduct: {product}\\nCategory: {category}\\nDescription: {description}\\n\"\n",
" )"
]
},
{
"cell_type": "markdown",
"id": "QlWFdngRQFbv",
"metadata": {
"id": "QlWFdngRQFbv"
},
"source": [
"## Semantic Search with Sparse Vector"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2c0bf5fc-ab32-4f33-8f26-904ff10635a5",
"metadata": {
"id": "2c0bf5fc-ab32-4f33-8f26-904ff10635a5"
},
"outputs": [],
"source": [
"# Elastic Learned Sparse Encoder - ELSER\n",
"\n",
"response = client.search(\n",
" index=\"ecommerce-search\",\n",
" size=2,\n",
" query={\n",
" \"text_expansion\": {\n",
" \"ml.tokens\": {\n",
" \"model_id\": \"elser_model\",\n",
" \"model_text\": \"Comfortable furniture for a large balcony\",\n",
" }\n",
" }\n",
" },\n",
")\n",
"\n",
"for hit in response[\"hits\"][\"hits\"]:\n",
"\n",
" score = hit[\"_score\"]\n",
" product = hit[\"_source\"][\"product\"]\n",
" category = hit[\"_source\"][\"category\"]\n",
" description = hit[\"_source\"][\"description\"]\n",
" print(\n",
" f\"\\nScore: {score}\\nProduct: {product}\\nCategory: {category}\\nDescription: {description}\\n\"\n",
" )"
]
},
{
"cell_type": "markdown",
"id": "kz9deDBYQJxr",
"metadata": {
"id": "kz9deDBYQJxr"
},
"source": [
"## Hybrid Search - BM25+KNN linear combination"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f84aa16b-49c5-4abf-a049-d556c225542e",
"metadata": {
"id": "f84aa16b-49c5-4abf-a049-d556c225542e"
},
"outputs": [],
"source": [
"# BM25 + KNN (Linear Combination)\n",
"\n",
"response = client.search(\n",
" index=\"ecommerce-search\",\n",
" size=2,\n",
" query={\n",
" \"bool\": {\n",
" \"should\": [\n",
" {\n",
" \"match\": {\n",
" \"description\": {\n",
" \"query\": \"A dining table and comfortable chairs for a large balcony\",\n",
" \"boost\": 1, # You can adjust the boost value\n",
" }\n",
" }\n",
" }\n",
" ]\n",
" }\n",
" },\n",
" knn={\n",
" \"field\": \"description_vector.predicted_value\",\n",
" \"k\": 50,\n",
" \"num_candidates\": 500,\n",
" \"boost\": 1, # You can adjust the boost value\n",
" \"query_vector_builder\": {\n",
" \"text_embedding\": {\n",
" \"model_id\": \"sentence-transformers__all-mpnet-base-v2\",\n",
" \"model_text\": \"A dining table and comfortable chairs for a large balcony\",\n",
" }\n",
" },\n",
" },\n",
")\n",
"\n",
"for hit in response[\"hits\"][\"hits\"]:\n",
"\n",
" score = hit[\"_score\"]\n",
" product = hit[\"_source\"][\"product\"]\n",
" category = hit[\"_source\"][\"category\"]\n",
" description = hit[\"_source\"][\"description\"]\n",
" print(\n",
" f\"\\nScore: {score}\\nProduct: {product}\\nCategory: {category}\\nDescription: {description}\\n\"\n",
" )"
]
},
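{
"cell_type": "markdown",
"id": "linear-combination-note",
"metadata": {
"id": "linear-combination-note"
},
"source": [
"With this linear combination, the final `_score` of a document that matches both sections is simply the sum of its boosted BM25 score and its boosted kNN similarity score. A minimal sketch of the arithmetic (the helper below is illustrative, not part of the Elasticsearch API):\n",
"\n",
"```python\n",
"bm25_boost, knn_boost = 1.0, 1.0  # the boost values set in the query above\n",
"\n",
"def combined_score(bm25_score, knn_score):\n",
"    # Elasticsearch sums the boosted scores of the query and knn sections\n",
"    return bm25_boost * bm25_score + knn_boost * knn_score\n",
"```\n",
"\n",
"Adjusting the boosts changes the relative weight of lexical vs. semantic relevance, but note that BM25 scores and vector similarities are not on the same scale."
]
},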
{
"cell_type": "markdown",
"id": "cybkWjmpQV8g",
"metadata": {
"id": "cybkWjmpQV8g"
},
"source": [
"## Hybrid Search - BM25+KNN RRF"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "aa2e072d-37bb-43fd-a83f-e1cb55a24861",
"metadata": {
"id": "aa2e072d-37bb-43fd-a83f-e1cb55a24861"
},
"outputs": [],
"source": [
"# BM25 + KNN (RRF)\n",
"# RRF functionality is in technical preview and may be changed or removed in a future release. The syntax will likely change before GA.\n",
"\n",
"response = client.search(\n",
" index=\"ecommerce-search\",\n",
" size=2,\n",
" query={\n",
" \"bool\": {\n",
" \"should\": [\n",
" {\n",
" \"match\": {\n",
" \"description\": {\n",
" \"query\": \"A dining table and comfortable chairs for a large balcony\"\n",
" }\n",
" }\n",
" }\n",
" ]\n",
" }\n",
" },\n",
" knn={\n",
" \"field\": \"description_vector.predicted_value\",\n",
" \"k\": 50,\n",
" \"num_candidates\": 500,\n",
" \"query_vector_builder\": {\n",
" \"text_embedding\": {\n",
" \"model_id\": \"sentence-transformers__all-mpnet-base-v2\",\n",
" \"model_text\": \"A dining table and comfortable chairs for a large balcony\",\n",
" }\n",
" },\n",
" },\n",
" rank={\n",
" \"rrf\": { # Reciprocal rank fusion\n",
" \"window_size\": 50, # This value determines the size of the individual result sets per query.\n",
" \"rank_constant\": 20, # This value determines how much influence documents in individual result sets per query have over the final ranked result set.\n",
" }\n",
" },\n",
")\n",
"\n",
"for hit in response[\"hits\"][\"hits\"]:\n",
"\n",
" rank = hit[\"_rank\"]\n",
" category = hit[\"_source\"][\"category\"]\n",
" product = hit[\"_source\"][\"product\"]\n",
" description = hit[\"_source\"][\"description\"]\n",
" print(\n",
" f\"\\nRank: {rank}\\nProduct: {product}\\nCategory: {category}\\nDescription: {description}\\n\"\n",
" )"
]
},
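{
"cell_type": "markdown",
"id": "rrf-note",
"metadata": {
"id": "rrf-note"
},
"source": [
"Reciprocal rank fusion combines the two result sets using only each document's rank in each set, not its raw score: `score(d) = sum(1 / (rank_constant + rank_i(d)))` over the result sets in which `d` appears. A minimal sketch of that computation (this helper is illustrative, not part of the Elasticsearch API):\n",
"\n",
"```python\n",
"def rrf_scores(rankings, rank_constant=20):\n",
"    # rankings: list of dicts mapping doc_id -> 1-based rank in one result set\n",
"    scores = {}\n",
"    for ranking in rankings:\n",
"        for doc_id, rank in ranking.items():\n",
"            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank_constant + rank)\n",
"    return sorted(scores, key=scores.get, reverse=True)\n",
"```\n",
"\n",
"Because ranks rather than raw scores are fused, BM25 and kNN scores do not need to be on comparable scales."
]
},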
{
"cell_type": "markdown",
"id": "LyKI2Z-XQbI6",
"metadata": {
"id": "LyKI2Z-XQbI6"
},
"source": [
"## Hybrid Search - BM25+ELSER linear combination"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bd842732-b20a-4c7a-b735-e1f558a9b922",
"metadata": {
"id": "bd842732-b20a-4c7a-b735-e1f558a9b922"
},
"outputs": [],
"source": [
"# BM25 + Elastic Learned Sparse Encoder (Linear Combination)\n",
"\n",
"response = client.search(\n",
" index=\"ecommerce-search\",\n",
" size=2,\n",
" query={\n",
" \"bool\": {\n",
" \"should\": [\n",
" {\n",
" \"match\": {\n",
" \"description\": {\n",
" \"query\": \"A dining table and comfortable chairs for a large balcony\",\n",
" \"boost\": 1, # You can adjust the boost value\n",
" }\n",
" }\n",
" },\n",
" {\n",
" \"text_expansion\": {\n",
" \"ml.tokens\": {\n",
" \"model_id\": \"elser_model\",\n",
" \"model_text\": \"A dining table and comfortable chairs for a large balcony\",\n",
" \"boost\": 1, # You can adjust the boost value\n",
" }\n",
" }\n",
" },\n",
" ]\n",
" }\n",
" },\n",
")\n",
"\n",
"for hit in response[\"hits\"][\"hits\"]:\n",
"\n",
" score = hit[\"_score\"]\n",
" product = hit[\"_source\"][\"product\"]\n",
" category = hit[\"_source\"][\"category\"]\n",
" description = hit[\"_source\"][\"description\"]\n",
" print(\n",
" f\"\\nScore: {score}\\nProduct: {product}\\nCategory: {category}\\nDescription: {description}\\n\"\n",
" )"
]
}
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}