# **Lexical and Semantic Search with Elasticsearch**

In this example, you will explore various approaches to retrieving information using Elasticsearch, focusing specifically on text, lexical and semantic search.

To accomplish this, the example demonstrates several search scenarios on a dataset generated to simulate e-commerce product information.

This dataset contains over 2,500 products, each with a description. These products are categorized into 76 distinct product categories, with each category containing a varying number of products.

## **🧰 Requirements**

For this example, you will need:

- Python 3.6 or later
- The Elastic Python client
- An Elastic deployment on version 8.8 or later, with a machine learning node that has 8GB of memory
- The Elastic Learned Sparse EncodeR (ELSER) model, which comes pre-loaded into Elastic, installed and started on your deployment

We'll be using [Elastic Cloud](https://www.elastic.co/guide/en/cloud/current/ec-getting-started.html). A [free trial](https://cloud.elastic.co/registration?onboarding_token=vectorsearch&utm_source=github&utm_content=elasticsearch-labs-notebook) is available.

## Set up the Elasticsearch environment

To get started, we'll need to connect to our Elastic deployment using the Python client.

Because we're using an Elastic Cloud deployment, we'll use the **Cloud ID** to identify our deployment.


In [None]:
!pip install elasticsearch==8.8 #Elasticsearch

In [None]:
!pip -q install eland elasticsearch sentence_transformers transformers torch==1.11  # Eland Python client and model dependencies

In [None]:
from elasticsearch import (
    Elasticsearch,
    helpers,
)  # Import the Elasticsearch client and helpers module
from urllib.request import urlopen  # library for opening URLs
import json  # module for handling JSON data
from pathlib import Path  # module for working with file paths

# Python client and toolkit for machine learning in Elasticsearch
from eland.ml.pytorch import PyTorchModel
from eland.ml.pytorch.transformers import TransformerModel
from elasticsearch.client import MlClient  # Elastic module for ml
import getpass  # handling password input

Now we can instantiate the Python Elasticsearch client.

First we prompt the user for their password and Cloud ID.

🔐 NOTE: `getpass` lets us prompt the user for credentials without echoing them to the terminal, so they never appear in the notebook output.

Then we create a `client` object, which is an instance of the `Elasticsearch` class.

In [None]:
# Found in the 'Manage Deployment' page
CLOUD_ID = getpass.getpass("Enter Elastic Cloud ID:  ")

# Password for the 'elastic' user generated by Elasticsearch
ELASTIC_PASSWORD = getpass.getpass("Enter Elastic password:  ")

# Create the client instance
client = Elasticsearch(
    cloud_id=CLOUD_ID, basic_auth=("elastic", ELASTIC_PASSWORD), request_timeout=3600
)

## Set up the embedding model

Next we upload the all-mpnet-base-v2 embedding model into Elasticsearch and create an ingest pipeline with inference processors for text embedding and text expansion, using the description field for both. This field contains the description of each product.

In [None]:
# Set the model name from Hugging Face and task type
# sentence-transformers model
hf_model_id = "sentence-transformers/all-mpnet-base-v2"
tm = TransformerModel(hf_model_id, "text_embedding")

# set the modelID as it is named in Elasticsearch
es_model_id = tm.elasticsearch_model_id()

# Download the model from Hugging Face
tmp_path = "models"
Path(tmp_path).mkdir(parents=True, exist_ok=True)
model_path, config, vocab_path = tm.save(tmp_path)

# Load the model into Elasticsearch
ptm = PyTorchModel(client, es_model_id)
ptm.import_model(
    model_path=model_path, config_path=None, vocab_path=vocab_path, config=config
)

# Start the model deployment through the client's ML namespace
s = client.ml.start_trained_model_deployment(model_id=es_model_id)
s.body

In [None]:
# Create an ingest pipeline with two inference processors:
# ELSER (sparse) and all-mpnet-base-v2 (dense).
# Both run inference on every document ingested through the pipeline.

client.ingest.put_pipeline(
    id="ecommerce-pipeline",
    processors=[
        {
            "inference": {
                "model_id": "elser_model",
                "target_field": "ml",
                "field_map": {"description": "text_field"},
                "inference_config": {
                    "text_expansion": {  # text_expansion inference type (ELSER)
                        "results_field": "tokens"
                    }
                },
            }
        },
        {
            "inference": {
                "model_id": "sentence-transformers__all-mpnet-base-v2",
                "target_field": "description_vector",  # Target field for the inference results
                "field_map": {
                    "description": "text_field"  # Field matching our configured trained model input. Typically for NLP models, the field name is text_field.
                },
            }
        },
    ],
)

## Index documents

Next, we create two indices: a source index, `ecommerce`, into which we load `products-ecommerce.json`, and a destination index, `ecommerce-search`, which will receive the documents reindexed from the source.

For the `ecommerce-search` index, we add the field `description_vector.predicted_value` to support dense vector storage and search. This is the target field for the inference results. Its type is `dense_vector`, and because the `all-mpnet-base-v2` model has an embedding size of 768, `dims` is set to 768. We also add a field of type `rank_features` to hold the text expansion output.

In [None]:
# Index to load products-ecommerce.json docs

client.indices.create(
    index="ecommerce",
    mappings={
        "properties": {
            "product": {
                "type": "text",
                "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
            },
            "description": {
                "type": "text",
                "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
            },
            "category": {
                "type": "text",
                "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
            },
        }
    },
)

In [None]:
# Reindex dest index

INDEX = "ecommerce-search"
client.indices.create(
    index=INDEX,
    settings={"index": {"number_of_shards": 1, "number_of_replicas": 1}},
    mappings={
        # Saving disk space by excluding the ELSER tokens and the dense_vector field from document source.
        # Note: That should only be applied if you are certain that reindexing will not be required in the future.
        "_source": {"excludes": ["ml.tokens", "description_vector.predicted_value"]},
        "properties": {
            "product": {
                "type": "text",
                "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
            },
            "description": {
                "type": "text",
                "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
            },
            "category": {
                "type": "text",
                "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
            },
            "ml.tokens": {  # The name of the field to contain the generated tokens.
                "type": "rank_features"  # ELSER output must be ingested into a field with the rank_features field type.
            },
            "description_vector.predicted_value": {  # Inference results field, target_field.predicted_value
                "type": "dense_vector",
                "dims": 768,  # The all-mpnet-base-v2 model has embedding_size of 768, so dims is set to 768.
                "index": True,  # Index the vector so it can be used for approximate kNN search.
                "similarity": "dot_product",  #  When indexing vectors for approximate kNN search, you need to specify the similarity function for comparing the vectors.
            },
        },
    },
)
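
The `dot_product` similarity chosen above expects vectors of unit length. The sentence-transformers model used here is expected to produce normalized embeddings, but it can be worth checking when swapping in a different model. The following is a minimal sketch of such a check, not part of the original notebook. The helper names `l2_norm`, `is_unit_length` and `normalize`, and the sample vector, are hypothetical.

```python
import math


def l2_norm(vector):
    """Return the Euclidean (L2) length of a vector."""
    return math.sqrt(sum(x * x for x in vector))


def is_unit_length(vector, tolerance=1e-3):
    """Hypothetical helper: True when the vector's length is close to 1."""
    return abs(l2_norm(vector) - 1.0) <= tolerance


def normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = l2_norm(vector)
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


raw = [3.0, 4.0]  # toy vector of length 5, not usable with dot_product as-is
unit = normalize(raw)  # rescaled to length 1

print(is_unit_length(raw), is_unit_length(unit))
```

If a model does not normalize its output, `cosine` similarity is the usual alternative to `dot_product`.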

## Load documents

Then we load `products-ecommerce.json` into the `ecommerce` index.

In [None]:
#  dataset

url = "https://raw.githubusercontent.com/elastic/elasticsearch-labs/02c01b3450e8ddc72ccec85d559eee5280c185ac/supporting-blog-content/lexical-and-semantic-search-with-elasticsearch/products-ecommerce.json"  # json raw file - update the link here

response = urlopen(url)

# Load the response data into a JSON object
data_json = json.loads(response.read())


def create_index_body(doc):
    """Generate the body for an Elasticsearch document."""
    return {
        "_index": "ecommerce",
        "_source": doc,
    }


# Prepare the documents to be indexed
documents = [create_index_body(doc) for doc in data_json]

# Use helpers.bulk to index
helpers.bulk(client, documents)

print("Done indexing documents into `ecommerce` index")
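
`helpers.bulk` accepts any iterable of actions, so for larger datasets the actions can be produced lazily rather than collected into a list first. The sketch below shows that variation under stated assumptions: `generate_actions` and `sample_docs` are hypothetical names, and the sample documents are stand-ins rather than rows from the real dataset.

```python
def generate_actions(docs, index_name):
    """Yield one bulk action per document instead of building a full list."""
    for doc in docs:
        yield {"_index": index_name, "_source": doc}


# Illustrative stand-in for the parsed JSON file (not the real dataset).
sample_docs = [
    {"product": "Chair", "description": "A folding chair", "category": "Outdoor"},
    {"product": "Table", "description": "A dining table", "category": "Indoor"},
]

actions = list(generate_actions(sample_docs, "ecommerce"))
print(len(actions), actions[0]["_index"])

# With a live client, the generator could be passed straight to the helper:
# helpers.bulk(client, generate_actions(data_json, "ecommerce"))
```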

## Reindex

Now we can reindex data from the `source` index `ecommerce` to the `dest` index `ecommerce-search` with the ingest pipeline `ecommerce-pipeline` we created.

After this step our `dest` index will have the fields we need to perform Semantic Search.

In [None]:
# Reindex data from one index 'source' to another 'dest' with the 'ecommerce-pipeline' pipeline.

client.reindex(
    wait_for_completion=True,
    source={"index": "ecommerce"},
    dest={"index": "ecommerce-search", "pipeline": "ecommerce-pipeline"},
)

## Text Analysis with Standard Analyzer

In [None]:
# Performs text analysis on a string and returns the resulting tokens.

# Define the text to be analyzed
text = "Comfortable furniture for a large balcony"

# Define the analyze request
request_body = {"analyzer": "standard", "text": text}  # Standard Analyzer

# Perform the analyze request
response = client.indices.analyze(
    analyzer=request_body["analyzer"], text=request_body["text"]
)

# Extract and display the analyzed tokens
tokens = [token["token"] for token in response["tokens"]]
print("Analyzed Tokens:", tokens)

## Text Analysis with Stop Analyzer

In [None]:
# Performs text analysis on a string and returns the resulting tokens.

# Define the text to be analyzed
text = "Comfortable furniture for a large balcony"

# Define the analyze request
request_body = {"analyzer": "stop", "text": text}  # Stop Analyzer

# Perform the analyze request
response = client.indices.analyze(
    analyzer=request_body["analyzer"], text=request_body["text"]
)

# Extract and display the analyzed tokens
tokens = [token["token"] for token in response["tokens"]]
print("Analyzed Tokens:", tokens)
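
To make the difference between the two analyzers concrete, here is a rough pure-Python approximation that runs without a cluster. It is only a sketch: the real analyzers use Lucene tokenizers with more detailed rules. The function names `standard_like` and `stop_like` are hypothetical, and the stop word set is assumed to match the default English list.

```python
import re

# Assumed to mirror the default English stop word list.
ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}


def standard_like(text):
    """Approximate the standard analyzer: split on word boundaries, lowercase."""
    return [token.lower() for token in re.findall(r"\w+", text)]


def stop_like(text):
    """Approximate the stop analyzer: letters only, lowercase, drop stop words."""
    tokens = [token.lower() for token in re.findall(r"[^\W\d_]+", text)]
    return [token for token in tokens if token not in ENGLISH_STOP_WORDS]


text = "Comfortable furniture for a large balcony"
print("standard-like:", standard_like(text))
print("stop-like:    ", stop_like(text))
```

In this approximation the stop variant drops "for" and "a", leaving only the terms that carry meaning for matching.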

## Lexical Search

In [None]:
# BM25

response = client.search(
    size=2,
    index="ecommerce-search",
    query={
        "match": {
            "description": {
                "query": "Comfortable furniture for a large balcony",
                "analyzer": "stop",
            }
        }
    },
)
hits = response["hits"]["hits"]

if not hits:
    print("No matches found")
else:
    for hit in hits:
        score = hit["_score"]
        product = hit["_source"]["product"]
        category = hit["_source"]["category"]
        description = hit["_source"]["description"]
        print(
            f"\nScore: {score}\nProduct: {product}\nCategory: {category}\nDescription: {description}\n"
        )
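
The `match` query above is scored with BM25. As a hedged illustration of what that score rewards, the sketch below implements a simplified BM25 over a toy corpus. The corpus, the whitespace tokenizer and the function names are hypothetical. The parameters `k1 = 1.2` and `b = 0.75` are the commonly used defaults, and real scores from Elasticsearch will differ because they depend on shard-level statistics and the configured analyzer.

```python
import math
from collections import Counter

K1, B = 1.2, 0.75  # commonly used BM25 defaults


def tokenize(text):
    """Toy tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def bm25_scores(query, docs):
    """Return one simplified BM25 score per document for the query."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = list(dict.fromkeys(tokenize(query)))  # unique, order kept

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        length_norm = K1 * (1 - B + B * len(tokens) / avg_len)
        score = 0.0
        for term in query_terms:
            freq = term_freq.get(term, 0)
            if freq == 0:
                continue
            n = doc_freq[term]
            idf = math.log(1 + (n_docs - n + 0.5) / (n + 0.5))
            score += idf * freq / (freq + length_norm)
        scores.append(score)
    return scores


docs = [
    "comfortable outdoor furniture set for a large balcony",
    "wooden dining table for indoor use",
    "large umbrella for the garden",
]
scores = bm25_scores("comfortable furniture large balcony", docs)
for doc, score in sorted(zip(docs, scores), key=lambda pair: -pair[1]):
    print(f"{score:.3f}  {doc}")
```

Rare terms contribute more than common ones through the `idf` factor, and long documents are penalized through `length_norm`. That is why a purely lexical search can miss relevant products whose descriptions use different words.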

## Semantic Search with Dense Vector

In [None]:
# KNN

response = client.search(
    index="ecommerce-search",
    size=2,
    knn={
        "field": "description_vector.predicted_value",
        "k": 50,  # Number of nearest neighbors to return as top hits.
        "num_candidates": 500,  # Number of nearest neighbor candidates to consider per shard. Increasing num_candidates tends to improve the accuracy of the final k results.
        "query_vector_builder": {  # Object indicating how to build a query_vector. kNN search enables you to perform semantic search by using a previously deployed text embedding model.
            "text_embedding": {
                "model_id": "sentence-transformers__all-mpnet-base-v2",  # Text embedding model id
                "model_text": "Comfortable furniture for a large balcony",  # Query
            }
        },
    },
)

for hit in response["hits"]["hits"]:

    score = hit["_score"]
    product = hit["_source"]["product"]
    category = hit["_source"]["category"]
    description = hit["_source"]["description"]
    print(
        f"\nScore: {score}\nProduct: {product}\nCategory: {category}\nDescription: {description}\n"
    )
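
Conceptually, the kNN search compares the query embedding with every document embedding and keeps the closest ones. Elasticsearch does this approximately with an HNSW graph; the sketch below instead does an exact brute-force comparison on three-dimensional toy vectors, purely for illustration. The vectors and function names are hypothetical. The score transformation `(1 + dot) / 2` follows the documented `_score` for `dot_product` similarity on unit-length float vectors.

```python
import math


def normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def knn_search(query_vector, doc_vectors, k):
    """Exact nearest-neighbor search; returns the top k (name, score) pairs."""
    query = normalize(query_vector)
    scored = [
        (name, (1 + dot(query, normalize(vector))) / 2)
        for name, vector in doc_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings; real ones from all-mpnet-base-v2 have 768 dimensions.
doc_vectors = {
    "patio sofa": [0.9, 0.1, 0.0],
    "desk lamp": [0.0, 0.2, 0.9],
    "garden chair": [0.7, 0.6, 0.1],
}
top_hits = knn_search([1.0, 0.2, 0.0], doc_vectors, k=2)
for name, score in top_hits:
    print(f"{score:.3f}  {name}")
```

In the real request, `num_candidates` controls how many candidates the approximate search examines per shard before the best `k` are kept, trading speed for accuracy.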

## Semantic Search with Sparse Vector

In [None]:
# Elastic Learned Sparse Encoder - ELSER

response = client.search(
    index="ecommerce-search",
    size=2,
    query={
        "text_expansion": {
            "ml.tokens": {
                "model_id": "elser_model",
                "model_text": "Comfortable furniture for a large balcony",
            }
        }
    },
)

for hit in response["hits"]["hits"]:

    score = hit["_score"]
    product = hit["_source"]["product"]
    category = hit["_source"]["category"]
    description = hit["_source"]["description"]
    print(
        f"\nScore: {score}\nProduct: {product}\nCategory: {category}\nDescription: {description}\n"
    )
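
ELSER represents both the query and each document as a set of weighted tokens, and a document scores well when it shares highly weighted tokens with the query. The sketch below illustrates that idea as a weighted overlap. It is a simplification, not the exact scoring Elasticsearch applies, and all token weights and names are made up for the example.

```python
def sparse_score(query_weights, doc_weights):
    """Sum the products of weights for tokens present in both query and document."""
    return sum(
        weight * doc_weights[token]
        for token, weight in query_weights.items()
        if token in doc_weights
    )


# Hypothetical expansion of "Comfortable furniture for a large balcony".
# Note "outdoor": an expanded token that does not appear in the query text.
query_weights = {"furniture": 1.8, "balcony": 1.5, "outdoor": 0.9, "comfortable": 1.1}

# Hypothetical token weights stored per document in a rank_features field.
doc_tokens = {
    "patio set": {"furniture": 1.2, "outdoor": 1.4, "patio": 1.6},
    "office chair": {"furniture": 0.9, "office": 1.7, "comfortable": 0.8},
    "phone case": {"phone": 2.0, "case": 1.5},
}

ranking = sorted(
    doc_tokens,
    key=lambda name: sparse_score(query_weights, doc_tokens[name]),
    reverse=True,
)
print(ranking)
```

Because the expansion can add related tokens such as "outdoor", a document can match even when it shares few literal words with the query.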

## Hybrid Search - BM25+KNN linear combination

In [None]:
# BM25 + KNN (Linear Combination)

response = client.search(
    index="ecommerce-search",
    size=2,
    query={
        "bool": {
            "should": [
                {
                    "match": {
                        "description": {
                            "query": "A dining table and comfortable chairs for a large balcony",
                            "boost": 1,  # You can adjust the boost value
                        }
                    }
                }
            ]
        }
    },
    knn={
        "field": "description_vector.predicted_value",
        "k": 50,
        "num_candidates": 500,
        "boost": 1,  # You can adjust the boost value
        "query_vector_builder": {
            "text_embedding": {
                "model_id": "sentence-transformers__all-mpnet-base-v2",
                "model_text": "A dining table and comfortable chairs for a large balcony",
            }
        },
    },
)

for hit in response["hits"]["hits"]:

    score = hit["_score"]
    product = hit["_source"]["product"]
    category = hit["_source"]["category"]
    description = hit["_source"]["description"]
    print(
        f"\nScore: {score}\nProduct: {product}\nCategory: {category}\nDescription: {description}\n"
    )
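
When `query` and `knn` are combined in one request, each document's final score is the boosted sum of its scores from the two parts. The following sketch shows that arithmetic on made-up scores; the function name and the numbers are hypothetical. It also shows why the boosts matter: BM25 scores are unbounded, whereas the kNN scores here lie between 0 and 1.

```python
def linear_combination(bm25_scores, knn_scores, bm25_boost=1.0, knn_boost=1.0):
    """Combine two score maps as boost-weighted sums; missing scores count as 0."""
    doc_ids = set(bm25_scores) | set(knn_scores)
    combined = {
        doc_id: bm25_boost * bm25_scores.get(doc_id, 0.0)
        + knn_boost * knn_scores.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    return sorted(combined.items(), key=lambda pair: pair[1], reverse=True)


# Made-up scores for illustration only.
bm25 = {"doc_a": 8.0, "doc_b": 3.0}
knn = {"doc_b": 0.9, "doc_c": 0.8}

equal_boosts = linear_combination(bm25, knn)
damped_bm25 = linear_combination(bm25, knn, bm25_boost=0.05, knn_boost=1.0)

print([doc_id for doc_id, _ in equal_boosts])
print([doc_id for doc_id, _ in damped_bm25])
```

With equal boosts the lexical score dominates the ranking. Scaling it down lets the semantic score influence the order.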

## Hybrid Search - BM25+KNN RRF

In [None]:
# BM25 + KNN (RRF)
# RRF functionality is in technical preview and may be changed or removed in a future release. The syntax will likely change before GA.

response = client.search(
    index="ecommerce-search",
    size=2,
    query={
        "bool": {
            "should": [
                {
                    "match": {
                        "description": {
                            "query": "A dining table and comfortable chairs for a large balcony"
                        }
                    }
                }
            ]
        }
    },
    knn={
        "field": "description_vector.predicted_value",
        "k": 50,
        "num_candidates": 500,
        "query_vector_builder": {
            "text_embedding": {
                "model_id": "sentence-transformers__all-mpnet-base-v2",
                "model_text": "A dining table and comfortable chairs for a large balcony",
            }
        },
    },
    rank={
        "rrf": {  # Reciprocal rank fusion
            "window_size": 50,  # This value determines the size of the individual result sets per query.
            "rank_constant": 20,  # This value determines how much influence documents in individual result sets per query have over the final ranked result set.
        }
    },
)

for hit in response["hits"]["hits"]:

    rank = hit["_rank"]
    category = hit["_source"]["category"]
    product = hit["_source"]["product"]
    description = hit["_source"]["description"]
    print(
        f"\nRank: {rank}\nProduct: {product}\nCategory: {category}\nDescription: {description}\n"
    )
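
Reciprocal rank fusion ignores the raw scores and works only with each document's position in the individual result lists. Each list contributes `1 / (rank_constant + rank)` to a document's fused score, with ranks starting at 1. The sketch below applies this formula to two made-up result lists; the function name and document IDs are hypothetical.

```python
def reciprocal_rank_fusion(result_lists, rank_constant=20, window_size=50):
    """Fuse ranked lists of document IDs into one list ordered by RRF score."""
    fused = {}
    for results in result_lists:
        # Only the top window_size entries of each list take part in the fusion.
        for rank, doc_id in enumerate(results[:window_size], start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (rank_constant + rank)
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)


# Made-up result lists, best match first.
lexical_hits = ["doc_a", "doc_b", "doc_c"]
knn_hits = ["doc_b", "doc_d", "doc_a"]

fused_ranking = reciprocal_rank_fusion([lexical_hits, knn_hits], rank_constant=20)
for position, (doc_id, score) in enumerate(fused_ranking, start=1):
    print(position, doc_id, round(score, 4))
```

Documents that appear near the top of both lists rise to the top of the fused ranking. Because no score normalization is needed, RRF avoids the boost tuning that the linear combination requires.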

## Hybrid Search - BM25+ELSER linear combination

In [None]:
# BM25 + Elastic Learned Sparse Encoder (Linear Combination)

response = client.search(
    index="ecommerce-search",
    size=2,
    query={
        "bool": {
            "should": [
                {
                    "match": {
                        "description": {
                            "query": "A dining table and comfortable chairs for a large balcony",
                            "boost": 1,  # You can adjust the boost value
                        }
                    }
                },
                {
                    "text_expansion": {
                        "ml.tokens": {
                            "model_id": "elser_model",
                            "model_text": "A dining table and comfortable chairs for a large balcony",
                            "boost": 1,  # You can adjust the boost value
                        }
                    }
                },
            ]
        }
    },
)

for hit in response["hits"]["hits"]:

    score = hit["_score"]
    product = hit["_source"]["product"]
    category = hit["_source"]["category"]
    description = hit["_source"]["description"]
    print(
        f"\nScore: {score}\nProduct: {product}\nCategory: {category}\nDescription: {description}\n"
    )