supporting-blog-content/introducing-retrievers/retrievers_intro_notebook.ipynb (1,050 lines of code) (raw):
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": [],
"toc_visible": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"source": [
"# <font color='red'>Introduction to Retrievers</font> Supporting Notebook"
],
"metadata": {
"id": "QxlZPY-WO6oD"
}
},
{
"cell_type": "markdown",
"source": [
"This notebook allows you to run the exampls from the [Search Labs blog - Introducing Retrievers - Search All the Things!](https://www.elastic.co/search-labs/blog/elasticsearch-retrievers)\n",
"\n",
"In this notebook you will:\n",
"- Download IMDB dataset from Kaggle\n",
"- Create a new Elasticsearch Serverless Search Project\n",
"- Create two inference services\n",
"- Deploy ELSER\n",
"- Deploy e5-small\n",
"- Create ingest pipeline\n",
"- Create mapping\n",
"- Ingest the IMDB data, creating embedding as part of ingest\n",
"- Scale down models for query load\n",
"- Run example retrievers"
],
"metadata": {
"id": "pcBRkp3TEw2b"
}
},
{
"cell_type": "markdown",
"source": [
"# <font color='Green'>Setup</font> "
],
"metadata": {
"id": "JFlexKFMy4n5"
}
},
{
"cell_type": "code",
"source": [
"!pip install -qqq pandas elasticsearch"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "d3oRV05q7p6_",
"outputId": "dda20a96-6969-4f02-f942-158c540ae309"
},
"execution_count": 1,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m477.5/477.5 kB\u001b[0m \u001b[31m3.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m64.3/64.3 kB\u001b[0m \u001b[31m2.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25h"
]
}
]
},
{
"cell_type": "code",
"source": [
"import os\n",
"import zipfile\n",
"import pandas as pd\n",
"from elasticsearch import Elasticsearch, helpers\n",
"from elasticsearch.exceptions import ConnectionTimeout\n",
"from elastic_transport import ConnectionError\n",
"from time import sleep\n",
"import time\n",
"import logging\n",
"\n",
"# Get the logger for 'elastic_transport.node_pool'\n",
"logger = logging.getLogger(\"elastic_transport.node_pool\")\n",
"\n",
"# Set its level to ERROR\n",
"logger.setLevel(logging.ERROR)\n",
"\n",
"# Suppress warnings from the elastic_transport module\n",
"logging.getLogger(\"elastic_transport\").setLevel(logging.ERROR)"
],
"metadata": {
"id": "R8aownZy7f9e"
},
"execution_count": 2,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Data Set Download"
],
"metadata": {
"id": "K7lJkY0QzIt3"
}
},
{
"cell_type": "code",
"source": [
"!kaggle datasets download -d ashpalsingh1525/imdb-movies-dataset"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "X9GwAcZFv62y",
"outputId": "f54a5f25-de2f-4bf2-ab40-8492fd9f63b9"
},
"execution_count": 3,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Dataset URL: https://www.kaggle.com/datasets/ashpalsingh1525/imdb-movies-dataset\n",
"License(s): Community Data License Agreement - Permissive - Version 1.0\n",
"Downloading imdb-movies-dataset.zip to /content\n",
" 0% 0.00/2.84M [00:00<?, ?B/s]\n",
"100% 2.84M/2.84M [00:00<00:00, 117MB/s]\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"with zipfile.ZipFile(\"/content/imdb-movies-dataset.zip\", \"r\") as zip_ref:\n",
" zip_ref.extractall(\"/content/\")"
],
"metadata": {
"id": "2RcWHGd3v-U3"
},
"execution_count": 4,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Create Elasticsearch Serverless project"
],
"metadata": {
"id": "Se4__RKk8DcM"
}
},
{
"cell_type": "markdown",
"source": [
"### Create Trial Account (if you don't already have an Elastic Cloud account)"
],
"metadata": {
"id": "qh2KgCk-RCFs"
}
},
{
"cell_type": "markdown",
"source": [
"[Follow this link to create a free trial Elastic Cloud account](\n",
"https://cloud.elastic.co/registration?onboarding_token=vectorsearch&cta=cloud-registration&tech=trial&plcmt=article%20content&pg=search-labs\n",
")"
],
"metadata": {
"id": "FzuvaSvDRtWK"
}
},
{
"cell_type": "markdown",
"source": [
"### Create an Elastic Cloud API Key"
],
"metadata": {
"id": "5Vb1es6hmKZd"
}
},
{
"cell_type": "markdown",
"source": [
"[Follow the steps in the guide here](https://www.elastic.co/guide/en/cloud/current/ec-api-authentication.html)\n",
"\n",
"When you create the key, ensure you select \"Admin\" access level\n",
"\n",
"Copy the key someplace safe, you will use it in the next cell"
],
"metadata": {
"id": "RdGOJGfeRWgv"
}
},
{
"cell_type": "markdown",
"source": [
"## Elasticsearch Setup\n",
"When you run the cell below you will be prompted to enter your Cloud API Key"
],
"metadata": {
"id": "RsgZevjOzFp8"
}
},
{
"cell_type": "code",
"source": [
"from getpass import getpass\n",
"\n",
"ec_api_key = getpass(\"Enter your Elastic Cloud API key: \")"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "Lz8fwc3dtHUM",
"outputId": "fca753fe-b0b3-4767-9f47-607189004a74"
},
"execution_count": 5,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Enter your Elastic Cloud API key: ··········\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"import requests\n",
"\n",
"headers = {\n",
" \"Authorization\": f\"ApiKey {ec_api_key}\",\n",
"}\n",
"\n",
"data = {\n",
" \"name\": \"Retrievers Demo\",\n",
" \"alias\": \"retrievers-demo\",\n",
" \"region_id\": \"aws-us-east-1\",\n",
" \"optimized_for\": \"vector\",\n",
" \"search_lake\": {\"search_power\": 5, \"boost_window\": 0},\n",
"}\n",
"\n",
"\n",
"cloud_url = \"https://api.elastic-cloud.com\"\n",
"project_endpoint = \"/api/v1/serverless/projects/elasticsearch\"\n",
"\n",
"response = requests.post(cloud_url + project_endpoint, headers=headers, json=data)\n",
"\n",
"# Print the response\n",
"print(f\"{response.status_code} - {response.text}\")"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "h-lOUnts-d8a",
"outputId": "e9b7c525-f014-43b1-da77-4b409fcb58fb"
},
"execution_count": 6,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"201 - {\"alias\":\"retrievers-demo-d1ff7a\",\"cloud_id\":\"Retrievers_Demo:dXMtZWFzdC0xLmF3cy5lbGFzdGljLmNsb3VkJGQxZmY3YWZmMWE5YTQ1ODZhNDBkMzk1ZjZlMGJhMDk3LmVzJGQxZmY3YWZmMWE5YTQ1ODZhNDBkMzk1ZjZlMGJhMDk3Lmti\",\"id\":\"d1ff7aff1a9a4586a40d395f6e0ba097\",\"metadata\":{\"created_at\":\"2024-05-22T00:57:10.550790086Z\",\"created_by\":\"3953873479\",\"organization_id\":\"3953873479\"},\"name\":\"Retrievers Demo\",\"region_id\":\"aws-us-east-1\",\"endpoints\":{\"elasticsearch\":\"https://retrievers-demo-d1ff7a.es.us-east-1.aws.elastic.cloud\",\"kibana\":\"https://retrievers-demo-d1ff7a.kb.us-east-1.aws.elastic.cloud\"},\"optimized_for\":\"vector\",\"search_lake\":{\"boost_window\":0,\"search_power\":5},\"type\":\"elasticsearch\",\"credentials\":{\"password\":\"p4Y1hd8j4qF5gl5E5d0S5Ya5\",\"username\":\"admin\"}}\n",
"\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"Set project connection credentials"
],
"metadata": {
"id": "Mr2fj6zHfDIx"
}
},
{
"cell_type": "code",
"source": [
"rj = response.json()\n",
"cloud_id = rj[\"cloud_id\"]\n",
"cloud_username = rj[\"credentials\"][\"username\"]\n",
"cloud_password = rj[\"credentials\"][\"password\"]"
],
"metadata": {
"id": "NohlsUagch98"
},
"execution_count": 7,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### Create Elasticsearch connection"
],
"metadata": {
"id": "Tq_UuqxPsxlJ"
}
},
{
"cell_type": "code",
"source": [
"es = Elasticsearch(cloud_id=cloud_id, basic_auth=(cloud_username, cloud_password))\n",
"\n",
"# Wait for project to be created and available\n",
"while True:\n",
" sleep(0.5)\n",
" try:\n",
" if es.ping():\n",
" print(\"project created\")\n",
" break\n",
" except ConnectionError:\n",
" pass"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "AQepIFokswoR",
"outputId": "5230db4e-e474-4f8d-ad93-2c2fa4fc21f5"
},
"execution_count": 8,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"project created\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"### Deploy Elser and e5\n",
"The two blocks below will deploy the embedding models and auto-scale ML capacity"
],
"metadata": {
"id": "2qeuIvZhFjyH"
}
},
{
"cell_type": "markdown",
"source": [
"#### Deploy and start ELSER"
],
"metadata": {
"id": "EbKE5-11F-HA"
}
},
{
"cell_type": "code",
"source": [
"try:\n",
" resp = es.options(request_timeout=5).inference.put_model(\n",
" task_type=\"sparse_embedding\",\n",
" inference_id=\"my-elser-model\",\n",
" body={\n",
" \"service\": \"elser\",\n",
" \"service_settings\": {\"num_allocations\": 64, \"num_threads\": 1},\n",
" },\n",
" )\n",
"except ConnectionTimeout:\n",
" pass"
],
"metadata": {
"id": "PLvy4pY6FnPF"
},
"execution_count": 9,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"#### Deploy and start e5-small"
],
"metadata": {
"id": "4DVmAYW8GBAo"
}
},
{
"cell_type": "code",
"source": [
"try:\n",
" resp = es.inference.put_model(\n",
" task_type=\"text_embedding\",\n",
" inference_id=\"my-e5-model\",\n",
" body={\n",
" \"service\": \"elasticsearch\",\n",
" \"service_settings\": {\n",
" \"num_allocations\": 8,\n",
" \"num_threads\": 1,\n",
" \"model_id\": \".multilingual-e5-small_linux-x86_64\",\n",
" },\n",
" },\n",
" )\n",
"except ConnectionTimeout:\n",
" pass"
],
"metadata": {
"id": "9cp9b_E9GCpM"
},
"execution_count": 10,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"#### Check model deployment state\n",
"This will loop checking until both ELSER and e5 have been fully deployed\n",
"\n",
"This can take a couple minutes if additional capacity needs to be allocated to run the models"
],
"metadata": {
"id": "Kgkx7cK0Stku"
}
},
{
"cell_type": "code",
"source": [
"from time import sleep\n",
"from elasticsearch.exceptions import ConnectionTimeout\n",
"\n",
"\n",
"def wait_for_models_to_start(es, models):\n",
" model_status_map = {model: False for model in models}\n",
"\n",
" while not all(model_status_map.values()):\n",
" try:\n",
" model_status = es.ml.get_trained_models_stats()\n",
" except ConnectionTimeout:\n",
" print(\"A connection timeout error occurred.\")\n",
" continue\n",
"\n",
" for x in model_status[\"trained_model_stats\"]:\n",
" model_id = x[\"model_id\"]\n",
" # Skip this model if it's not in our list or it has already started\n",
" if model_id not in models or model_status_map[model_id]:\n",
" continue\n",
" if \"deployment_stats\" in x:\n",
" if (\n",
" \"nodes\" in x[\"deployment_stats\"]\n",
" and len(x[\"deployment_stats\"][\"nodes\"]) > 0\n",
" ):\n",
" if (\n",
" x[\"deployment_stats\"][\"nodes\"][0][\"routing_state\"][\n",
" \"routing_state\"\n",
" ]\n",
" == \"started\"\n",
" ):\n",
" print(f\"{model_id} model deployed and started\")\n",
" model_status_map[model_id] = True\n",
"\n",
" if not all(model_status_map.values()):\n",
" sleep(0.5)\n",
"\n",
"\n",
"models = [\".elser_model_2_linux-x86_64\", \".multilingual-e5-small_linux-x86_64\"]\n",
"wait_for_models_to_start(es, models)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "QDCjzqEwysKa",
"outputId": "cd526dec-e89a-4024-e5fb-dd7cee4fb8c3"
},
"execution_count": 11,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
".multilingual-e5-small_linux-x86_64 model deployed and started\n",
".elser_model_2_linux-x86_64 model deployed and started\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"List Inference Endpoints"
],
"metadata": {
"id": "f_G_5YFpyjHF"
}
},
{
"cell_type": "markdown",
"source": [
"### Create index template and link to ingest pipeline"
],
"metadata": {
"id": "BxdJbREAxp8n"
}
},
{
"cell_type": "code",
"source": [
"template_body = {\n",
" \"index_patterns\": [\"imdb_movies*\"],\n",
" \"template\": {\n",
" \"settings\": {\"index\": {\"default_pipeline\": \"elser_e5_embed\"}},\n",
" \"mappings\": {\n",
" \"properties\": {\n",
" \"budget_x\": {\"type\": \"double\"},\n",
" \"country\": {\"type\": \"keyword\"},\n",
" \"crew\": {\"type\": \"text\"},\n",
" \"date_x\": {\"type\": \"date\", \"format\": \"MM/dd/yyyy||MM/dd/yyyy[ ]\"},\n",
" \"genre\": {\"type\": \"keyword\"},\n",
" \"names\": {\"type\": \"text\"},\n",
" \"names_sparse\": {\"type\": \"sparse_vector\"},\n",
" \"names_dense\": {\"type\": \"dense_vector\"},\n",
" \"orig_lang\": {\"type\": \"keyword\"},\n",
" \"orig_title\": {\"type\": \"text\"},\n",
" \"overview\": {\"type\": \"text\"},\n",
" \"overview_sparse\": {\"type\": \"sparse_vector\"},\n",
" \"overview_dense\": {\"type\": \"dense_vector\"},\n",
" \"revenue\": {\"type\": \"double\"},\n",
" \"score\": {\"type\": \"double\"},\n",
" \"status\": {\"type\": \"keyword\"},\n",
" }\n",
" },\n",
" },\n",
"}\n",
"\n",
"# Create the template\n",
"es.indices.put_index_template(name=\"imdb_movies\", body=template_body)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "WvrQ2tOjxssp",
"outputId": "d1465aab-22d1-44dc-a02c-c5e00ad517a5"
},
"execution_count": 12,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"ObjectApiResponse({'acknowledged': True})"
]
},
"metadata": {},
"execution_count": 12
}
]
},
{
"cell_type": "markdown",
"source": [
"### Create ingest pipeline"
],
"metadata": {
"id": "4_Fx84iJsrF6"
}
},
{
"cell_type": "code",
"source": [
"# Define the pipeline configuration\n",
"pipeline_body = {\n",
" \"processors\": [\n",
" {\n",
" \"inference\": {\n",
" \"model_id\": \".multilingual-e5-small_linux-x86_64\",\n",
" \"description\": \"embed names with e5 to names_dense nested field\",\n",
" \"input_output\": [\n",
" {\"input_field\": \"names\", \"output_field\": \"names_dense\"}\n",
" ],\n",
" }\n",
" },\n",
" {\n",
" \"inference\": {\n",
" \"model_id\": \".multilingual-e5-small_linux-x86_64\",\n",
" \"description\": \"embed overview with e5 to names_dense nested field\",\n",
" \"input_output\": [\n",
" {\"input_field\": \"overview\", \"output_field\": \"overview_dense\"}\n",
" ],\n",
" }\n",
" },\n",
" {\n",
" \"inference\": {\n",
" \"model_id\": \".elser_model_2_linux-x86_64\",\n",
" \"description\": \"embed overview with .elser_model_2_linux-x86_64 to overview_sparse nested field\",\n",
" \"input_output\": [\n",
" {\"input_field\": \"overview\", \"output_field\": \"overview_sparse\"}\n",
" ],\n",
" }\n",
" },\n",
" {\n",
" \"inference\": {\n",
" \"model_id\": \".elser_model_2_linux-x86_64\",\n",
" \"description\": \"embed names with .elser_model_2_linux-x86_64 to names_sparse nested field\",\n",
" \"input_output\": [\n",
" {\"input_field\": \"names\", \"output_field\": \"names_sparse\"}\n",
" ],\n",
" }\n",
" },\n",
" ],\n",
" \"on_failure\": [\n",
" {\n",
" \"append\": {\n",
" \"field\": \"_source._ingest.inference_errors\",\n",
" \"value\": [\n",
" {\n",
" \"message\": \"{{ _ingest.on_failure_message }}\",\n",
" \"pipeline\": \"{{_ingest.pipeline}}\",\n",
" \"timestamp\": \"{{{ _ingest.timestamp }}}\",\n",
" }\n",
" ],\n",
" }\n",
" }\n",
" ],\n",
"}\n",
"\n",
"\n",
"# Create the pipeline\n",
"es.ingest.put_pipeline(id=\"elser_e5_embed\", body=pipeline_body)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "B8Gxm6mGstH0",
"outputId": "f955e1f3-5b79-4fe0-8e02-476691b3fb0f"
},
"execution_count": 13,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"ObjectApiResponse({'acknowledged': True})"
]
},
"metadata": {},
"execution_count": 13
}
]
},
{
"cell_type": "markdown",
"source": [
"## Ingest Docs\n",
"This will\n",
"- Do a bit of pre-processing\n",
"- Bulk ingest the 10,178 IMDB records\n",
"- Generate sparse vector embedings using the ELSER model for `overview` and `names` fields\n",
"- Generate dense vector embedings using the ELSER model for `overview` and `names` fields\n",
"\n",
"It generally takes around ~2 minutes to complete with the above allocation settings"
],
"metadata": {
"id": "JSJxIiaYEkJF"
}
},
{
"cell_type": "code",
"source": [
"# Load CSV data into a pandas DataFrame\n",
"df = pd.read_csv(\"/content/imdb_movies.csv\")\n",
"\n",
"# Replace all NaN values in DataFrame with None\n",
"df = df.where(pd.notnull(df), None)\n",
"\n",
"# Convert DataFrame into a list of dictionaries\n",
"# Each dictionary represents a document to be indexed\n",
"documents = df.to_dict(orient=\"records\")\n",
"\n",
"\n",
"# Define a function to generate actions for bulk API\n",
"def generate_bulk_actions(documents):\n",
" for doc in documents:\n",
" yield {\n",
" \"_index\": \"imdb_movies\",\n",
" \"_source\": doc,\n",
" }\n",
"\n",
"\n",
"# Use the bulk helper to insert documents, 200 at a time\n",
"start_time = time.time()\n",
"helpers.bulk(es, generate_bulk_actions(documents), chunk_size=200)\n",
"end_time = time.time()\n",
"\n",
"print(f\"The function took {end_time - start_time} seconds to run\")"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "NgYlVtRaEEJQ",
"outputId": "c39fca25-e964-4325-83fe-8d83d7bf42cf"
},
"execution_count": 14,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"The function took 180.07549405097961 seconds to run\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"## Scale down ELSER and e5 models\n",
"We don't need a large number of model allocations for test querying so we will scale each down to 1 allocation"
],
"metadata": {
"id": "O3GqVZcjLt-h"
}
},
{
"cell_type": "code",
"source": [
"for model_id in [\"my-elser-model\", \"my-e5-model\"]:\n",
" result = es.perform_request(\n",
" \"POST\",\n",
" f\"/_ml/trained_models/{model_id}/deployment/_update\",\n",
" headers={\"content-type\": \"application/json\", \"accept\": \"application/json\"},\n",
" body={\"number_of_allocations\": 1},\n",
" )"
],
"metadata": {
"id": "bfcjA76VqRMD"
},
"execution_count": 15,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"# <font color='Green'>Retriever tests</font>"
],
"metadata": {
"id": "k_GPitWQzdIB"
}
},
{
"cell_type": "markdown",
"source": [
"We are going to search the `overview` field (either the text or embedding) in the dataset for movies using the search input <font color='orange'>clueless slackers</font>\n",
"\n",
"Feel free to change the `movie_search` variable below to something else"
],
"metadata": {
"id": "Nc2jogPBc3UZ"
}
},
{
"cell_type": "code",
"source": [
"movie_search = \"clueless slackers\""
],
"metadata": {
"id": "tb-M-FqSdHa0"
},
"execution_count": 16,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Standard - Search All the Text! - bm25"
],
"metadata": {
"id": "e1In-IVeERev"
}
},
{
"cell_type": "code",
"source": [
"response = es.search(\n",
" index=\"imdb_movies\",\n",
" body={\n",
" \"query\": {\"match\": {\"overview\": movie_search}},\n",
" \"size\": 3,\n",
" \"fields\": [\"names\", \"overview\"],\n",
" \"_source\": False,\n",
" },\n",
")\n",
"\n",
"for hit in response[\"hits\"][\"hits\"]:\n",
" print(f\"{hit['fields']['names'][0]}\\n- {hit['fields']['overview'][0]}\\n\")"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "MSCPBLRWEEjH",
"outputId": "7d9e5d1e-ad63-4165-94ae-6e6949112974"
},
"execution_count": 17,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Beavis and Butt-Head Do America\n",
"- Slacker duo Beavis and Butt-Head wake to discover their TV has been stolen. Their search for a new one takes them on a clueless adventure across America, during which they manage to accidentally become America's most wanted.\n",
"\n",
"Mr. Popper's Penguins\n",
"- Jim Carrey stars as Tom Popper, a successful businessman who’s clueless when it comes to the really important things in life...until he inherits six “adorable” penguins, each with its own unique personality. Soon Tom’s rambunctious roommates turn his swank New York apartment into a snowy winter wonderland — and the rest of his world upside-down.\n",
"\n",
"Spaceballs\n",
"- When the nefarious Dark Helmet hatches a plan to snatch Princess Vespa and steal her planet's air, space-bum-for-hire Lone Starr and his clueless sidekick fly to the rescue. Along the way, they meet Yogurt, who puts Lone Starr wise to the power of \"The Schwartz.\" Can he master it in time to save the day?\n",
"\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"## kNN - Search all the Dense Vectors!"
],
"metadata": {
"id": "HNppwaFgzlyo"
}
},
{
"cell_type": "code",
"source": [
"response = es.search(\n",
" index=\"imdb_movies\",\n",
" body={\n",
" \"retriever\": {\n",
" \"knn\": {\n",
" \"field\": \"overview_dense\",\n",
" \"query_vector_builder\": {\n",
" \"text_embedding\": {\n",
" \"model_id\": \"my-e5-model\",\n",
" \"model_text\": movie_search,\n",
" }\n",
" },\n",
" \"k\": 5,\n",
" \"num_candidates\": 5,\n",
" }\n",
" },\n",
" \"size\": 3,\n",
" \"fields\": [\"names\", \"overview\"],\n",
" \"_source\": False,\n",
" },\n",
")\n",
"\n",
"for hit in response[\"hits\"][\"hits\"]:\n",
" print(f\"{hit['fields']['names'][0]}\\n- {hit['fields']['overview'][0]}\\n\")"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "M5LjVfhhELQB",
"outputId": "c621f71c-b8aa-42d8-d839-c2a33646024e"
},
"execution_count": 18,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Beavis and Butt-Head Do America\n",
"- Slacker duo Beavis and Butt-Head wake to discover their TV has been stolen. Their search for a new one takes them on a clueless adventure across America, during which they manage to accidentally become America's most wanted.\n",
"\n",
"Uncharted\n",
"- A young street-smart, Nathan Drake and his wisecracking partner Victor “Sully” Sullivan embark on a dangerous pursuit of “the greatest treasure never found” while also tracking clues that may lead to Nathan’s long-lost brother.\n",
"\n",
"Crystal Skulls\n",
"- A millionaire philanthropist collects the famous Crystal Skulls trying to tap into their ancient powers. It is up to a team lead by a college professor whose father disappeared searching for the 13th skull to save the world when the first 12 skulls are united and reek havoc on the earth without the control of the 13th skull.\n",
"\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"## text_expansion - Search all the Sparse Vectors! - elser\n"
],
"metadata": {
"id": "rK8e9kPPzqLV"
}
},
{
"cell_type": "code",
"source": [
"response = es.search(\n",
" index=\"imdb_movies\",\n",
" body={\n",
" \"retriever\": {\n",
" \"standard\": {\n",
" \"query\": {\n",
" \"text_expansion\": {\n",
" \"overview_sparse\": {\n",
" \"model_id\": \"my-elser-model\",\n",
" \"model_text\": movie_search,\n",
" }\n",
" }\n",
" }\n",
" }\n",
" },\n",
" \"size\": 3,\n",
" \"fields\": [\"names\", \"overview\"],\n",
" \"_source\": False,\n",
" },\n",
")\n",
"\n",
"for hit in response[\"hits\"][\"hits\"]:\n",
" print(f\"{hit['fields']['names'][0]}\\n- {hit['fields']['overview'][0]}\\n\")"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "DBME24ECENT_",
"outputId": "868c950c-e568-4436-cca5-a6b0f30be70d"
},
"execution_count": 19,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Bill & Ted's Bogus Journey\n",
"- Amiable slackers Bill and Ted are once again roped into a fantastical adventure when De Nomolos, a villain from the future, sends evil robot duplicates of the two lads to terminate and replace them. The robot doubles actually succeed in killing Bill and Ted, but the two are determined to escape the afterlife, challenging the Grim Reaper to a series of games in order to return to the land of the living.\n",
"\n",
"Beavis and Butt-Head Do America\n",
"- Slacker duo Beavis and Butt-Head wake to discover their TV has been stolen. Their search for a new one takes them on a clueless adventure across America, during which they manage to accidentally become America's most wanted.\n",
"\n",
"Knocked Up\n",
"- A slacker and a career-driven woman accidentally conceive a child after a one-night stand. As they try to make the relationship work, they must navigate the challenges of parenthood and their differences in lifestyle and maturity.\n",
"\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"## rrf - Combine All the Things!\n"
],
"metadata": {
"id": "tQcz94Xwzs5k"
}
},
{
"cell_type": "code",
"source": [
"response = es.search(\n",
" index=\"imdb_movies\",\n",
" body={\n",
" \"retriever\": {\n",
" \"rrf\": {\n",
" \"retrievers\": [\n",
" {\"standard\": {\"query\": {\"term\": {\"overview\": movie_search}}}},\n",
" {\n",
" \"knn\": {\n",
" \"field\": \"overview_dense\",\n",
" \"query_vector_builder\": {\n",
" \"text_embedding\": {\n",
" \"model_id\": \"my-e5-model\",\n",
" \"model_text\": movie_search,\n",
" }\n",
" },\n",
" \"k\": 5,\n",
" \"num_candidates\": 5,\n",
" }\n",
" },\n",
" {\n",
" \"standard\": {\n",
" \"query\": {\n",
" \"text_expansion\": {\n",
" \"overview_sparse\": {\n",
" \"model_id\": \"my-elser-model\",\n",
" \"model_text\": movie_search,\n",
" }\n",
" }\n",
" }\n",
" }\n",
" },\n",
" ],\n",
" \"rank_window_size\": 5,\n",
" \"rank_constant\": 1,\n",
" }\n",
" },\n",
" \"size\": 3,\n",
" \"fields\": [\"names\", \"overview\"],\n",
" \"_source\": False,\n",
" },\n",
")\n",
"\n",
"for hit in response[\"hits\"][\"hits\"]:\n",
" print(f\"{hit['fields']['names'][0]}\\n- {hit['fields']['overview'][0]}\\n\")"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "uS7iGJZFEP72",
"outputId": "cc983b10-cbd3-4fc4-9844-8ed59b00d803"
},
"execution_count": 20,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Beavis and Butt-Head Do America\n",
"- Slacker duo Beavis and Butt-Head wake to discover their TV has been stolen. Their search for a new one takes them on a clueless adventure across America, during which they manage to accidentally become America's most wanted.\n",
"\n",
"Bill & Ted's Bogus Journey\n",
"- Amiable slackers Bill and Ted are once again roped into a fantastical adventure when De Nomolos, a villain from the future, sends evil robot duplicates of the two lads to terminate and replace them. The robot doubles actually succeed in killing Bill and Ted, but the two are determined to escape the afterlife, challenging the Grim Reaper to a series of games in order to return to the land of the living.\n",
"\n",
"Uncharted\n",
"- A young street-smart, Nathan Drake and his wisecracking partner Victor “Sully” Sullivan embark on a dangerous pursuit of “the greatest treasure never found” while also tracking clues that may lead to Nathan’s long-lost brother.\n",
"\n"
]
}
]
}
]
}