# <font color='red'>Introduction to Retrievers</font>  Supporting Notebook

This notebook allows you to run the examples from the [Search Labs blog - Introducing Retrievers -  Search All the Things!](https://www.elastic.co/search-labs/blog/elasticsearch-retrievers)

In this notebook you will:
- Download IMDB dataset from Kaggle
- Create a new Elasticsearch Serverless Search Project
- Create two inference services
- Deploy ELSER
- Deploy e5-small
- Create ingest pipeline
- Create mapping
- Ingest the IMDB data, creating embeddings as part of ingest
- Scale down models for query load
- Run example retrievers

# <font color='Green'>Setup</font>  

In [1]:
!pip install -qqq pandas elasticsearch

In [2]:
import os
import zipfile
import pandas as pd
from elasticsearch import Elasticsearch, helpers
from elasticsearch.exceptions import ConnectionTimeout
from elastic_transport import ConnectionError
from time import sleep
import time
import logging

# Get the logger for 'elastic_transport.node_pool'
logger = logging.getLogger("elastic_transport.node_pool")

# Set its level to ERROR
logger.setLevel(logging.ERROR)

# Suppress warnings from the elastic_transport module
logging.getLogger("elastic_transport").setLevel(logging.ERROR)

## Data Set Download

In [3]:
!kaggle datasets download -d ashpalsingh1525/imdb-movies-dataset

Dataset URL: https://www.kaggle.com/datasets/ashpalsingh1525/imdb-movies-dataset
License(s): Community Data License Agreement - Permissive - Version 1.0
Downloading imdb-movies-dataset.zip to /content
  0% 0.00/2.84M [00:00<?, ?B/s]
100% 2.84M/2.84M [00:00<00:00, 117MB/s]


In [4]:
with zipfile.ZipFile("/content/imdb-movies-dataset.zip", "r") as zip_ref:
    zip_ref.extractall("/content/")

## Create Elasticsearch Serverless project

### Create Trial Account (if you don't already have an Elastic Cloud account)

[Follow this link to create a free trial Elastic Cloud account](
https://cloud.elastic.co/registration?onboarding_token=vectorsearch&cta=cloud-registration&tech=trial&plcmt=article%20content&pg=search-labs
)

### Create an Elastic Cloud API Key

[Follow the steps in the guide here](https://www.elastic.co/guide/en/cloud/current/ec-api-authentication.html)

When you create the key, ensure you select the "Admin" access level.

Copy the key someplace safe; you will use it in the next cell.

## Elasticsearch Setup
When you run the cell below, you will be prompted to enter your Cloud API key.

In [5]:
from getpass import getpass

ec_api_key = getpass("Enter your Elastic Cloud API key: ")

Enter your Elastic Cloud API key: ··········


In [6]:
import requests

headers = {
    "Authorization": f"ApiKey {ec_api_key}",
}

data = {
    "name": "Retrievers Demo",
    "alias": "retrievers-demo",
    "region_id": "aws-us-east-1",
    "optimized_for": "vector",
    "search_lake": {"search_power": 5, "boost_window": 0},
}


cloud_url = "https://api.elastic-cloud.com"
project_endpoint = "/api/v1/serverless/projects/elasticsearch"

response = requests.post(cloud_url + project_endpoint, headers=headers, json=data)

# Print the response
print(f"{response.status_code} - {response.text}")

201 - {"alias":"retrievers-demo-d1ff7a","cloud_id":"Retrievers_Demo:dXMtZWFzdC0xLmF3cy5lbGFzdGljLmNsb3VkJGQxZmY3YWZmMWE5YTQ1ODZhNDBkMzk1ZjZlMGJhMDk3LmVzJGQxZmY3YWZmMWE5YTQ1ODZhNDBkMzk1ZjZlMGJhMDk3Lmti","id":"d1ff7aff1a9a4586a40d395f6e0ba097","metadata":{"created_at":"2024-05-22T00:57:10.550790086Z","created_by":"3953873479","organization_id":"3953873479"},"name":"Retrievers Demo","region_id":"aws-us-east-1","endpoints":{"elasticsearch":"https://retrievers-demo-d1ff7a.es.us-east-1.aws.elastic.cloud","kibana":"https://retrievers-demo-d1ff7a.kb.us-east-1.aws.elastic.cloud"},"optimized_for":"vector","search_lake":{"boost_window":0,"search_power":5},"type":"elasticsearch","credentials":{"password":"p4Y1hd8j4qF5gl5E5d0S5Ya5","username":"admin"}}



Set project connection credentials

In [7]:
rj = response.json()
cloud_id = rj["cloud_id"]
cloud_username = rj["credentials"]["username"]
cloud_password = rj["credentials"]["password"]

### Create Elasticsearch connection

In [8]:
es = Elasticsearch(cloud_id=cloud_id, basic_auth=(cloud_username, cloud_password))

# Wait for project to be created and available
while True:
    sleep(0.5)
    try:
        if es.ping():
            print("project created")
            break
    except ConnectionError:
        pass

project created
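
The loop above retries forever, which can hang the notebook if the project never becomes reachable. A minimal sketch of a bounded polling helper follows. `wait_until`, its parameters, and the stand-in `fake_ping` are hypothetical names for illustration and are not part of the Elasticsearch client.

```python
import time


def wait_until(check, timeout=300.0, interval=0.5, retry_on=(Exception,)):
    """Call `check()` until it returns a truthy value or `timeout` seconds pass.

    Exceptions listed in `retry_on` are treated as "not ready yet".
    Returns True on success and False if the deadline is reached.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if check():
                return True
        except retry_on:
            pass  # service not reachable yet; try again after a pause
        time.sleep(interval)
    return False


# Stand-in for `es.ping` (illustration only): fails twice, then succeeds.
attempts = {"count": 0}


def fake_ping():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("not ready")
    return True


ready = wait_until(fake_ping, timeout=5.0, interval=0.01, retry_on=(ConnectionError,))
print("project created" if ready else "timed out waiting for project")
```

In this notebook the helper could plausibly be called as `wait_until(es.ping, retry_on=(ConnectionError,))`, using the `ConnectionError` already imported from `elastic_transport`.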


### Deploy ELSER and e5
The two blocks below will deploy the embedding models and auto-scale ML capacity.

#### Deploy and start ELSER

In [9]:
try:
    resp = es.options(request_timeout=5).inference.put_model(
        task_type="sparse_embedding",
        inference_id="my-elser-model",
        body={
            "service": "elser",
            "service_settings": {"num_allocations": 64, "num_threads": 1},
        },
    )
except ConnectionTimeout:
    pass

#### Deploy and start e5-small

In [10]:
try:
    resp = es.inference.put_model(
        task_type="text_embedding",
        inference_id="my-e5-model",
        body={
            "service": "elasticsearch",
            "service_settings": {
                "num_allocations": 8,
                "num_threads": 1,
                "model_id": ".multilingual-e5-small_linux-x86_64",
            },
        },
    )
except ConnectionTimeout:
    pass

#### Check model deployment state
This loops, checking until both ELSER and e5 have been fully deployed.

This can take a couple of minutes if additional capacity needs to be allocated to run the models.

In [11]:
from time import sleep
from elasticsearch.exceptions import ConnectionTimeout


def wait_for_models_to_start(es, models):
    model_status_map = {model: False for model in models}

    while not all(model_status_map.values()):
        try:
            model_status = es.ml.get_trained_models_stats()
        except ConnectionTimeout:
            print("A connection timeout error occurred.")
            sleep(0.5)  # brief pause so repeated timeouts do not cause a tight retry loop
            continue

        for x in model_status["trained_model_stats"]:
            model_id = x["model_id"]
            # Skip this model if it's not in our list or it has already started
            if model_id not in models or model_status_map[model_id]:
                continue
            if "deployment_stats" in x:
                if (
                    "nodes" in x["deployment_stats"]
                    and len(x["deployment_stats"]["nodes"]) > 0
                ):
                    if (
                        x["deployment_stats"]["nodes"][0]["routing_state"][
                            "routing_state"
                        ]
                        == "started"
                    ):
                        print(f"{model_id} model deployed and started")
                        model_status_map[model_id] = True

        if not all(model_status_map.values()):
            sleep(0.5)


models = [".elser_model_2_linux-x86_64", ".multilingual-e5-small_linux-x86_64"]
wait_for_models_to_start(es, models)

.multilingual-e5-small_linux-x86_64 model deployed and started
.elser_model_2_linux-x86_64 model deployed and started
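
The check above inspects only the first node of each deployment. A hedged sketch of the same test applied to every node is shown below. The response dictionary is a hand-written sample that mirrors only the keys the notebook reads (`trained_model_stats`, `deployment_stats`, `nodes`, `routing_state`), and `started_models` is a hypothetical helper.

```python
def started_models(stats, wanted):
    """Return the model IDs in `wanted` whose deployment nodes have all started."""
    started = set()
    for entry in stats.get("trained_model_stats", []):
        model_id = entry.get("model_id")
        if model_id not in wanted:
            continue
        nodes = entry.get("deployment_stats", {}).get("nodes", [])
        states = [n.get("routing_state", {}).get("routing_state") for n in nodes]
        # Require at least one node, and every node must report "started".
        if states and all(state == "started" for state in states):
            started.add(model_id)
    return started


# Hand-written sample, not a captured API response.
sample_stats = {
    "trained_model_stats": [
        {
            "model_id": ".elser_model_2_linux-x86_64",
            "deployment_stats": {
                "nodes": [{"routing_state": {"routing_state": "started"}}]
            },
        },
        {
            "model_id": ".multilingual-e5-small_linux-x86_64",
            "deployment_stats": {
                "nodes": [
                    {"routing_state": {"routing_state": "started"}},
                    {"routing_state": {"routing_state": "starting"}},
                ]
            },
        },
        {"model_id": "some-other-model"},  # no deployment yet
    ]
}

wanted = {".elser_model_2_linux-x86_64", ".multilingual-e5-small_linux-x86_64"}
print(started_models(sample_stats, wanted))
```

With this sample only ELSER counts as started, because one of the e5 nodes is still starting.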



### Create index template and link to ingest pipeline

In [12]:
template_body = {
    "index_patterns": ["imdb_movies*"],
    "template": {
        "settings": {"index": {"default_pipeline": "elser_e5_embed"}},
        "mappings": {
            "properties": {
                "budget_x": {"type": "double"},
                "country": {"type": "keyword"},
                "crew": {"type": "text"},
                "date_x": {"type": "date", "format": "MM/dd/yyyy||MM/dd/yyyy[ ]"},
                "genre": {"type": "keyword"},
                "names": {"type": "text"},
                "names_sparse": {"type": "sparse_vector"},
                "names_dense": {"type": "dense_vector"},
                "orig_lang": {"type": "keyword"},
                "orig_title": {"type": "text"},
                "overview": {"type": "text"},
                "overview_sparse": {"type": "sparse_vector"},
                "overview_dense": {"type": "dense_vector"},
                "revenue": {"type": "double"},
                "score": {"type": "double"},
                "status": {"type": "keyword"},
            }
        },
    },
}

# Create the template
es.indices.put_index_template(name="imdb_movies", body=template_body)

ObjectApiResponse({'acknowledged': True})

### Create ingest pipeline

In [13]:
# Define the pipeline configuration
pipeline_body = {
    "processors": [
        {
            "inference": {
                "model_id": ".multilingual-e5-small_linux-x86_64",
                "description": "embed names with e5 to names_dense nested field",
                "input_output": [
                    {"input_field": "names", "output_field": "names_dense"}
                ],
            }
        },
        {
            "inference": {
                "model_id": ".multilingual-e5-small_linux-x86_64",
                "description": "embed overview with e5 to overview_dense nested field",
                "input_output": [
                    {"input_field": "overview", "output_field": "overview_dense"}
                ],
            }
        },
        {
            "inference": {
                "model_id": ".elser_model_2_linux-x86_64",
                "description": "embed overview with .elser_model_2_linux-x86_64 to overview_sparse nested field",
                "input_output": [
                    {"input_field": "overview", "output_field": "overview_sparse"}
                ],
            }
        },
        {
            "inference": {
                "model_id": ".elser_model_2_linux-x86_64",
                "description": "embed names with .elser_model_2_linux-x86_64 to names_sparse nested field",
                "input_output": [
                    {"input_field": "names", "output_field": "names_sparse"}
                ],
            }
        },
    ],
    "on_failure": [
        {
            "append": {
                "field": "_source._ingest.inference_errors",
                "value": [
                    {
                        "message": "{{ _ingest.on_failure_message }}",
                        "pipeline": "{{_ingest.pipeline}}",
                        "timestamp": "{{{ _ingest.timestamp }}}",
                    }
                ],
            }
        }
    ],
}


# Create the pipeline
es.ingest.put_pipeline(id="elser_e5_embed", body=pipeline_body)

ObjectApiResponse({'acknowledged': True})
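
The four inference processors above differ only in model, input field, and output field, so they can be generated from a small table. This is a sketch of one way to do that. `inference_processor` and `targets` are hypothetical names, and the generated descriptions are simplified compared with the ones written by hand above.

```python
E5_MODEL = ".multilingual-e5-small_linux-x86_64"
ELSER_MODEL = ".elser_model_2_linux-x86_64"


def inference_processor(model_id, input_field, output_field):
    """Build one ingest `inference` processor in the shape used by the notebook."""
    return {
        "inference": {
            "model_id": model_id,
            "description": f"embed {input_field} with {model_id} to {output_field}",
            "input_output": [
                {"input_field": input_field, "output_field": output_field}
            ],
        }
    }


# (model, source field, embedding field) for each processor, in pipeline order.
targets = [
    (E5_MODEL, "names", "names_dense"),
    (E5_MODEL, "overview", "overview_dense"),
    (ELSER_MODEL, "overview", "overview_sparse"),
    (ELSER_MODEL, "names", "names_sparse"),
]

processors = [inference_processor(*target) for target in targets]
for processor in processors:
    print(processor["inference"]["description"])
```

The resulting list could be used as the `"processors"` value of `pipeline_body`. Generating it this way helps avoid copy-and-paste mismatches between a description and its output field.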

## Ingest Docs
This will:
- Do a bit of pre-processing
- Bulk ingest the 10,178 IMDB records
- Generate sparse vector embeddings using the ELSER model for the `overview` and `names` fields
- Generate dense vector embeddings using the e5 model for the `overview` and `names` fields

It generally takes 2 to 3 minutes to complete with the above allocation settings.

In [14]:
# Load CSV data into a pandas DataFrame
df = pd.read_csv("/content/imdb_movies.csv")

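# Replace all NaN values in DataFrame with None
# (cast to object first; otherwise recent pandas versions keep NaN in numeric columns)
df = df.astype(object).where(pd.notnull(df), None)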

# Convert DataFrame into a list of dictionaries
# Each dictionary represents a document to be indexed
documents = df.to_dict(orient="records")


# Define a function to generate actions for bulk API
def generate_bulk_actions(documents):
    for doc in documents:
        yield {
            "_index": "imdb_movies",
            "_source": doc,
        }


# Use the bulk helper to insert documents, 200 at a time
start_time = time.time()
helpers.bulk(es, generate_bulk_actions(documents), chunk_size=200)
end_time = time.time()

print(f"The function took {end_time - start_time} seconds to run")

The function took 180.07549405097961 seconds to run
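
To make the batching concrete, the sketch below splits 10,178 placeholder documents into chunks of 200, which is roughly what `chunk_size=200` asks the bulk helper to do. The `chunked` function and the placeholder documents are assumptions for illustration. The real helper also caps each request by size in bytes (`max_chunk_bytes`), so the actual number of requests may differ.

```python
from itertools import islice


def generate_bulk_actions(documents, index="imdb_movies"):
    """Yield one bulk action per document, as in the cell above."""
    for doc in documents:
        yield {"_index": index, "_source": doc}


def chunked(iterable, size):
    """Yield lists of at most `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Placeholder documents; only the count matches the dataset described above.
documents = [{"names": f"movie {i}"} for i in range(10_178)]

batch_sizes = [len(batch) for batch in chunked(generate_bulk_actions(documents), 200)]
print(f"{len(batch_sizes)} batches, last batch has {batch_sizes[-1]} documents")
```

10,178 documents at 200 per chunk gives 50 full batches plus a final batch of 178.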


## Scale down ELSER and e5 models
We don't need a large number of model allocations for test querying, so we will scale each down to 1 allocation.

In [15]:
for model_id in ["my-elser-model", "my-e5-model"]:
    result = es.perform_request(
        "POST",
        f"/_ml/trained_models/{model_id}/deployment/_update",
        headers={"content-type": "application/json", "accept": "application/json"},
        body={"number_of_allocations": 1},
    )

# <font color='Green'>Retriever tests</font>

We are going to search the `overview` field (either the text or its embeddings) in the dataset for movies, using the search input <font color='orange'>clueless slackers</font>.

Feel free to change the `movie_search` variable below to something else.

In [16]:
movie_search = "clueless slackers"

## Standard - Search All the Text! - bm25

In [17]:
response = es.search(
    index="imdb_movies",
    body={
        "query": {"match": {"overview": movie_search}},
        "size": 3,
        "fields": ["names", "overview"],
        "_source": False,
    },
)

for hit in response["hits"]["hits"]:
    print(f"{hit['fields']['names'][0]}\n- {hit['fields']['overview'][0]}\n")

Beavis and Butt-Head Do America
- Slacker duo Beavis and Butt-Head wake to discover their TV has been stolen. Their search for a new one takes them on a clueless adventure across America, during which they manage to accidentally become America's most wanted.

Mr. Popper's Penguins
- Jim Carrey stars as Tom Popper, a successful businessman who’s clueless when it comes to the really important things in life...until he inherits six “adorable” penguins, each with its own unique personality. Soon Tom’s rambunctious roommates turn his swank New York apartment into a snowy winter wonderland — and the rest of his world upside-down.

Spaceballs
- When the nefarious Dark Helmet hatches a plan to snatch Princess Vespa and steal her planet's air, space-bum-for-hire Lone Starr and his clueless sidekick fly to the rescue. Along the way, they meet Yogurt, who puts Lone Starr wise to the power of "The Schwartz." Can he master it in time to save the day?
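
The print loop above assumes every hit has both `names` and `overview` in `fields`. A hit whose document lacks one of them would raise a `KeyError`. Below is a small, hypothetical `format_hit` helper that tolerates missing fields. The sample hits are made up and contain only the keys the loop reads.

```python
def format_hit(hit, fields=("names", "overview"), missing="(not available)"):
    """Format a search hit as a title line followed by '- value' lines."""
    values = []
    for name in fields:
        # `fields` values are lists; fall back to a placeholder when absent.
        field_values = hit.get("fields", {}).get(name) or [missing]
        values.append(field_values[0])
    title, *rest = values
    return "\n".join([str(title)] + [f"- {value}" for value in rest])


# Made-up hits for illustration.
sample_hits = [
    {"fields": {"names": ["Spaceballs"], "overview": ["A space parody."]}},
    {"fields": {"names": ["Untitled"]}},  # no overview stored
]

for hit in sample_hits:
    print(format_hit(hit) + "\n")
```

The same helper could replace the repeated `print` loops in the retriever cells that follow.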



## kNN - Search all the Dense Vectors!

In [18]:
response = es.search(
    index="imdb_movies",
    body={
        "retriever": {
            "knn": {
                "field": "overview_dense",
                "query_vector_builder": {
                    "text_embedding": {
                        "model_id": "my-e5-model",
                        "model_text": movie_search,
                    }
                },
                "k": 5,
                "num_candidates": 5,
            }
        },
        "size": 3,
        "fields": ["names", "overview"],
        "_source": False,
    },
)

for hit in response["hits"]["hits"]:
    print(f"{hit['fields']['names'][0]}\n- {hit['fields']['overview'][0]}\n")

Beavis and Butt-Head Do America
- Slacker duo Beavis and Butt-Head wake to discover their TV has been stolen. Their search for a new one takes them on a clueless adventure across America, during which they manage to accidentally become America's most wanted.

Uncharted
- A young street-smart, Nathan Drake and his wisecracking partner Victor “Sully” Sullivan embark on a dangerous pursuit of “the greatest treasure never found” while also tracking clues that may lead to Nathan’s long-lost brother.

Crystal Skulls
- A millionaire philanthropist collects the famous Crystal Skulls trying to tap into their ancient powers. It is up to a team lead by a college professor whose father disappeared searching for the 13th skull to save the world when the first 12 skulls are united and reek havoc on the earth without the control of the 13th skull.
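
As a mental model for the kNN retriever, the sketch below ranks a few toy vectors by cosine similarity to a query vector and keeps the top `k`. It is an exact brute-force search over invented three-dimensional vectors. Elasticsearch instead runs an approximate search over the indexed `dense_vector` field, where `num_candidates` bounds how many candidates each shard considers before the top `k` are returned, and the similarity it uses depends on the field mapping.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def knn(query_vector, vectors, k):
    """Return the IDs of the `k` vectors most similar to `query_vector`."""
    scored = [(cosine(query_vector, vec), doc_id) for doc_id, vec in vectors.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Invented embeddings; real e5 vectors have far more dimensions.
toy_vectors = {
    "slacker comedy": [0.9, 0.1, 0.0],
    "treasure hunt": [0.4, 0.8, 0.1],
    "space opera": [0.0, 0.2, 0.9],
}
query_vector = [1.0, 0.2, 0.0]

nearest = knn(query_vector, toy_vectors, k=2)
print(nearest)
```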



## text_expansion - Search all the Sparse Vectors! - elser


In [19]:
response = es.search(
    index="imdb_movies",
    body={
        "retriever": {
            "standard": {
                "query": {
                    "text_expansion": {
                        "overview_sparse": {
                            "model_id": "my-elser-model",
                            "model_text": movie_search,
                        }
                    }
                }
            }
        },
        "size": 3,
        "fields": ["names", "overview"],
        "_source": False,
    },
)

for hit in response["hits"]["hits"]:
    print(f"{hit['fields']['names'][0]}\n- {hit['fields']['overview'][0]}\n")

Bill & Ted's Bogus Journey
- Amiable slackers Bill and Ted are once again roped into a fantastical adventure when De Nomolos, a villain from the future, sends evil robot duplicates of the two lads to terminate and replace them. The robot doubles actually succeed in killing Bill and Ted, but the two are determined to escape the afterlife, challenging the Grim Reaper to a series of games in order to return to the land of the living.

Beavis and Butt-Head Do America
- Slacker duo Beavis and Butt-Head wake to discover their TV has been stolen. Their search for a new one takes them on a clueless adventure across America, during which they manage to accidentally become America's most wanted.

Knocked Up
- A slacker and a career-driven woman accidentally conceive a child after a one-night stand. As they try to make the relationship work, they must navigate the challenges of parenthood and their differences in lifestyle and maturity.
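
ELSER represents text as a set of weighted tokens rather than a dense vector. As a rough mental model (an assumption for illustration, not the exact Elasticsearch scoring), relevance can be thought of as a dot product over the tokens shared by the expanded query and the document. The tokens and weights below are invented.

```python
def sparse_dot(query_tokens, doc_tokens):
    """Sum of weight products over tokens present in both sparse vectors."""
    return sum(
        weight * doc_tokens[token]
        for token, weight in query_tokens.items()
        if token in doc_tokens
    )


# Invented expansion of a query like "clueless slackers".
query_expansion = {"slacker": 2.0, "lazy": 1.0, "clueless": 1.5}

# Invented document expansions.
docs = {
    "doc_a": {"slacker": 1.0, "lazy": 0.5},
    "doc_b": {"clueless": 1.0, "penguin": 2.0},
    "doc_c": {"treasure": 1.5},
}

ranked = sorted(docs, key=lambda d: sparse_dot(query_expansion, docs[d]), reverse=True)
for doc_id in ranked:
    print(doc_id, sparse_dot(query_expansion, docs[doc_id]))
```

This also suggests why a document can rank well without containing the literal query words: expansion tokens such as the invented `lazy` above can contribute to the score.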



## rrf - Combine All the Things!


In [20]:
response = es.search(
    index="imdb_movies",
    body={
        "retriever": {
            "rrf": {
                "retrievers": [
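                    # NOTE: `term` is not analyzed, so this multi-word string is unlikely to match
                    # tokens in the `overview` text field; a `match` query would add BM25 scoring.
                    {"standard": {"query": {"term": {"overview": movie_search}}}},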
                    {
                        "knn": {
                            "field": "overview_dense",
                            "query_vector_builder": {
                                "text_embedding": {
                                    "model_id": "my-e5-model",
                                    "model_text": movie_search,
                                }
                            },
                            "k": 5,
                            "num_candidates": 5,
                        }
                    },
                    {
                        "standard": {
                            "query": {
                                "text_expansion": {
                                    "overview_sparse": {
                                        "model_id": "my-elser-model",
                                        "model_text": movie_search,
                                    }
                                }
                            }
                        }
                    },
                ],
                "rank_window_size": 5,
                "rank_constant": 1,
            }
        },
        "size": 3,
        "fields": ["names", "overview"],
        "_source": False,
    },
)

for hit in response["hits"]["hits"]:
    print(f"{hit['fields']['names'][0]}\n- {hit['fields']['overview'][0]}\n")

Beavis and Butt-Head Do America
- Slacker duo Beavis and Butt-Head wake to discover their TV has been stolen. Their search for a new one takes them on a clueless adventure across America, during which they manage to accidentally become America's most wanted.

Bill & Ted's Bogus Journey
- Amiable slackers Bill and Ted are once again roped into a fantastical adventure when De Nomolos, a villain from the future, sends evil robot duplicates of the two lads to terminate and replace them. The robot doubles actually succeed in killing Bill and Ted, but the two are determined to escape the afterlife, challenging the Grim Reaper to a series of games in order to return to the land of the living.

Uncharted
- A young street-smart, Nathan Drake and his wisecracking partner Victor “Sully” Sullivan embark on a dangerous pursuit of “the greatest treasure never found” while also tracking clues that may lead to Nathan’s long-lost brother.
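
Reciprocal rank fusion gives each document a score of `1 / (rank_constant + rank)` for every result list it appears in and sums those scores. The sketch below applies that formula to the top three kNN and ELSER results printed earlier in this notebook, with `rank_constant=1` as in the query above. It rests on two assumptions: only the three visible hits per list are used (the real window is 5), and the `term` retriever is treated as contributing no hits, since an unanalyzed multi-word term is unlikely to match a `text` field.

```python
from collections import defaultdict


def rrf(rankings, rank_constant=1, rank_window_size=5):
    """Fuse ranked lists with reciprocal rank fusion; returns (doc, score) pairs."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking[:rank_window_size], start=1):
            scores[doc] += 1.0 / (rank_constant + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


term_top = []  # assumption: the `term` retriever returns nothing
knn_top = ["Beavis and Butt-Head Do America", "Uncharted", "Crystal Skulls"]
elser_top = [
    "Bill & Ted's Bogus Journey",
    "Beavis and Butt-Head Do America",
    "Knocked Up",
]

fused = rrf([term_top, knn_top, elser_top], rank_constant=1)
for name, score in fused[:3]:
    print(f"{score:.3f}  {name}")
```

Under these assumptions the fused top three come out in the same order as the notebook output: a document ranked by both retrievers (1/2 + 1/3) beats one ranked first by a single retriever (1/2).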

