{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "CepGq3Kvtdxi"
},
"source": [
"# How to implement Image search using Elasticsearch"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "oMu1SW_TQQrU"
},
"source": [
"The workbook shows how to implement an Image search using Elasticsearch. You will index documents with image embeddings (generated or pre-generated) and then using NLP model be able to search using natural language description of the image.\n",
"\n",
"## Prerequisities\n",
"Before we begin, create an elastic cloud deployment and [autoscale](https://www.elastic.co/guide/en/cloud/current/ec-autoscaling.html) to have least one machine learning (ML) node with enough (4GB) memory. Also ensure that the Elasticsearch cluster is running. \n",
"\n",
"If you don't already have an Elastic deployment, you can sign up for a free [Elastic Cloud trial](https://cloud.elastic.co/registration?onboarding_token=vectorsearch&utm_source=github&utm_content=elasticsearch-labs-notebook)."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "VFcdr8IDQE_H"
},
"source": [
"### Install Python requirements\n",
"Before you start you need to install all required Python dependencies."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "6WosfR55npKU",
"outputId": "033767ff-0eef-48cc-c9e7-efbf73c9cb67"
},
"outputs": [],
"source": [
"!pip install sentence-transformers==2.7.0 eland elasticsearch transformers torch tqdm Pillow streamlit"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"id": "I0pRCbYMuMVn"
},
"outputs": [],
"source": [
"from elasticsearch import Elasticsearch\n",
"from elasticsearch.helpers import parallel_bulk\n",
"import requests\n",
"import os\n",
"import sys\n",
"\n",
"import zipfile\n",
"from tqdm.auto import tqdm\n",
"import pandas as pd\n",
"from PIL import Image\n",
"from sentence_transformers import SentenceTransformer\n",
"import urllib.request\n",
"\n",
"# import urllib.error\n",
"import json\n",
"from getpass import getpass"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "eIV5lAnVt9L7"
},
"source": [
"### Upload NLP model for querying\n",
"\n",
"Using the [`eland_import_hub_model`](https://www.elastic.co/guide/en/elasticsearch/client/eland/current/machine-learning.html#ml-nlp-pytorch) script, download and install the [clip-ViT-B-32-multilingual-v1](https://huggingface.co/sentence-transformers/clip-ViT-B-32-multilingual-v1) model, will transfer your search query into vector which will be used for the search over the set of images stored in Elasticsearch.\n",
"\n",
"To get your cloud id, go to [Elastic cloud](https://cloud.elastic.co) and `On the deployment overview page, copy down the Cloud ID.`\n",
"\n",
"To authenticate your request, You could use [API key](https://www.elastic.co/guide/en/kibana/current/api-keys.html#create-api-key). Alternatively, you can use your cloud deployment username and password."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id\n",
"ELASTIC_CLOUD_ID = getpass(\"Elastic Cloud ID: \")\n",
"\n",
"# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key\n",
"ELASTIC_API_KEY = getpass(\"Elastic Api Key: \")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "tVhL9jBnuAAQ"
},
"outputs": [],
"source": [
"!eland_import_hub_model --cloud-id $ELASTIC_CLOUD_ID --hub-model-id sentence-transformers/clip-ViT-B-32-multilingual-v1 --task-type text_embedding --es-api-key $ELASTIC_API_KEY --start --clear-previous"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Klv3rywdUJBN"
},
"source": [
"### Connect to Elasticsearch cluster\n",
"Use your own cluster details `ELASTIC_CLOUD_ID`, `API_KEY`."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "YwN8RmFY3FQI",
"outputId": "d0d0e31e-2ad2-46fe-ef8c-8c8bce7e1c48"
},
"outputs": [
{
"data": {
"text/plain": [
"ObjectApiResponse({'name': 'instance-0000000001', 'cluster_name': 'a72482be54904952ba46d53c3def7740', 'cluster_uuid': 'g8BE52TtT32pGBbRzP_oKA', 'version': {'number': '8.12.2', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '48a287ab9497e852de30327444b0809e55d46466', 'build_date': '2024-02-19T10:04:32.774273190Z', 'build_snapshot': False, 'lucene_version': '9.9.2', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'})"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"es = Elasticsearch(\n",
" cloud_id=ELASTIC_CLOUD_ID,\n",
" # basic_auth=(ELASTIC_CLOUD_USER, ELASTIC_CLOUD_PASSWORD),\n",
" api_key=ELASTIC_API_KEY,\n",
" request_timeout=600,\n",
")\n",
"\n",
"es.info() # should return cluster info"
]
},
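{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check, confirm that the text-embedding model uploaded earlier is deployed before using it in queries. This is a minimal sketch; the model ID below is the one eland derives from the Hugging Face model name."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check: confirm the uploaded NLP model is deployed\n",
"MODEL_ID = \"sentence-transformers__clip-vit-b-32-multilingual-v1\"\n",
"\n",
"stats = es.ml.get_trained_models_stats(model_id=MODEL_ID)\n",
"for model_stats in stats[\"trained_model_stats\"]:\n",
"    state = model_stats.get(\"deployment_stats\", {}).get(\"state\", \"not deployed\")\n",
"    print(model_stats[\"model_id\"], \"->\", state)"
]
},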
{
"cell_type": "markdown",
"metadata": {
"id": "IW-GIlH2OxB4"
},
"source": [
"### Create Index and mappings for Images\n",
"Befor you can index documents into Elasticsearch, you need to create an Index with correct mappings."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"id": "xAkc1OVcOxy3"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Creating index images\n"
]
},
{
"data": {
"text/plain": [
"ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'images'})"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Destination Index name\n",
"INDEX_NAME = \"images\"\n",
"\n",
"# flag to check if index has to be deleted before creating\n",
"SHOULD_DELETE_INDEX = True\n",
"\n",
"INDEX_MAPPING = {\n",
" \"properties\": {\n",
" \"image_embedding\": {\n",
" \"type\": \"dense_vector\",\n",
" \"dims\": 512,\n",
" \"index\": True,\n",
" \"similarity\": \"cosine\",\n",
" },\n",
" \"photo_id\": {\"type\": \"keyword\"},\n",
" \"photo_image_url\": {\"type\": \"keyword\"},\n",
" \"ai_description\": {\"type\": \"text\"},\n",
" \"photo_description\": {\"type\": \"text\"},\n",
" \"photo_url\": {\"type\": \"keyword\"},\n",
" \"photographer_first_name\": {\"type\": \"keyword\"},\n",
" \"photographer_last_name\": {\"type\": \"keyword\"},\n",
" \"photographer_username\": {\"type\": \"keyword\"},\n",
" \"exif_camera_make\": {\"type\": \"keyword\"},\n",
" \"exif_camera_model\": {\"type\": \"keyword\"},\n",
" \"exif_iso\": {\"type\": \"integer\"},\n",
" }\n",
"}\n",
"\n",
"# Index settings\n",
"INDEX_SETTINGS = {\n",
" \"index\": {\n",
" \"number_of_replicas\": \"1\",\n",
" \"number_of_shards\": \"1\",\n",
" \"refresh_interval\": \"5s\",\n",
" }\n",
"}\n",
"\n",
"# check if we want to delete index before creating the index\n",
"if SHOULD_DELETE_INDEX:\n",
" if es.indices.exists(index=INDEX_NAME):\n",
" print(\"Deleting existing %s\" % INDEX_NAME)\n",
" es.indices.delete(index=INDEX_NAME, ignore=[400, 404])\n",
"\n",
"print(\"Creating index %s\" % INDEX_NAME)\n",
"es.indices.create(\n",
" index=INDEX_NAME, mappings=INDEX_MAPPING, settings=INDEX_SETTINGS, ignore=[400, 404]\n",
")"
]
},
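{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, read the mapping back from the cluster to confirm that `image_embedding` was registered as a 512-dimensional `dense_vector` field. This is just a quick check and is not required for the rest of the notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: verify the dense_vector mapping of the embedding field\n",
"mapping = es.indices.get_mapping(index=INDEX_NAME)\n",
"print(\n",
"    json.dumps(\n",
"        mapping[INDEX_NAME][\"mappings\"][\"properties\"][\"image_embedding\"], indent=2\n",
"    )\n",
")"
]
},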
{
"cell_type": "markdown",
"metadata": {
"id": "NKE-j0kPUMn_"
},
"source": [
"### Get image dataset and embeddings\n",
"Download:\n",
"- The example image dataset is from [Unsplash](https://github.com/unsplash/datasets)\n",
"- The [Image embeddings](https://github.com/radoondas/flask-elastic-nlp/blob/main/embeddings/blogs/blogs-no-embeddings.json.zip) are pre-generated using CLIP model\n",
"\n",
"Then unzip both files."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "zFGaPDRR5mqT",
"outputId": "0114cdd6-a714-41ab-9b46-3013bd36698a"
},
"outputs": [],
"source": [
"!curl -L https://unsplash.com/data/lite/1.2.0 -o unsplash-research-dataset-lite-1.2.0.zip\n",
"!curl -L https://raw.githubusercontent.com/radoondas/flask-elastic-nlp/main/embeddings/images/image-embeddings.json.zip -o image-embeddings.json.zip"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "MBh4AQ8i7C0-",
"outputId": "17a50b7f-f052-4b72-daa8-0e8fc630326f"
},
"outputs": [],
"source": [
"# Unzip downloaded files\n",
"UNSPLASH_ZIP_FILE = \"unsplash-research-dataset-lite-1.2.0.zip\"\n",
"EMBEDDINGS_ZIP_FILE = \"image-embeddings.json.zip\"\n",
"\n",
"with zipfile.ZipFile(UNSPLASH_ZIP_FILE, \"r\") as zip_ref:\n",
" print(\"Extracting file \", UNSPLASH_ZIP_FILE, \".\")\n",
" zip_ref.extractall(\"data/unsplash/\")\n",
"\n",
"with zipfile.ZipFile(EMBEDDINGS_ZIP_FILE, \"r\") as zip_ref:\n",
" print(\"Extracting file \", EMBEDDINGS_ZIP_FILE, \".\")\n",
" zip_ref.extractall(\"data/embeddings/\")"
]
},
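{
"cell_type": "markdown",
"metadata": {},
"source": [
"The embeddings downloaded above are pre-generated, but you can also generate your own. The cell below is a minimal sketch that assumes the `clip-ViT-B-32` SentenceTransformer model, a CLIP model that outputs 512-dimensional vectors matching the `dims` of the index mapping. It encodes a placeholder in-memory image; swap in `Image.open()` with a path to one of your own photos to embed real data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch: generate an image embedding locally with a CLIP model.\n",
"# clip-ViT-B-32 outputs 512-dimensional vectors, matching the index mapping.\n",
"img_model = SentenceTransformer(\"clip-ViT-B-32\")\n",
"\n",
"# placeholder image; replace with Image.open(\"path/to/photo.jpg\") for real data\n",
"sample_image = Image.new(\"RGB\", (224, 224), color=\"white\")\n",
"embedding = img_model.encode(sample_image)\n",
"print(len(embedding))  # 512"
]
},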
{
"cell_type": "markdown",
"metadata": {
"id": "qhZRdUyAQd-s"
},
"source": [
"# Import all pregenerated image embeddings\n",
"In this section you will import ~19k documents worth of pregenenerated image embeddings with metadata.\n",
"\n",
"The process downloads files with images information, merge them and index into Elasticsearch."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"id": "32xrbSUXTODQ"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Indexed 1000 documents\n",
"Indexed 2000 documents\n",
"Indexed 3000 documents\n",
"Indexed 4000 documents\n",
"Indexed 5000 documents\n",
"Indexed 6000 documents\n",
"Indexed 7000 documents\n",
"Indexed 8000 documents\n",
"Indexed 9000 documents\n",
"Indexed 10000 documents\n",
"Indexed 11000 documents\n",
"Indexed 12000 documents\n",
"Indexed 13000 documents\n",
"Indexed 14000 documents\n",
"Indexed 15000 documents\n",
"Indexed 16000 documents\n",
"Indexed 17000 documents\n",
"Indexed 18000 documents\n",
"Indexed 19000 documents\n",
"Indexed 19833 image embeddings documents\n"
]
}
],
"source": [
"df_unsplash = pd.read_csv(\"data/unsplash/\" + \"photos.tsv000\", sep=\"\\t\", header=0)\n",
"\n",
"# follwing 8 lines are fix for inconsistent/incorrect data\n",
"df_unsplash[\"photo_description\"].fillna(\"\", inplace=True)\n",
"df_unsplash[\"ai_description\"].fillna(\"\", inplace=True)\n",
"df_unsplash[\"photographer_first_name\"].fillna(\"\", inplace=True)\n",
"df_unsplash[\"photographer_last_name\"].fillna(\"\", inplace=True)\n",
"df_unsplash[\"photographer_username\"].fillna(\"\", inplace=True)\n",
"df_unsplash[\"exif_camera_make\"].fillna(\"\", inplace=True)\n",
"df_unsplash[\"exif_camera_model\"].fillna(\"\", inplace=True)\n",
"df_unsplash[\"exif_iso\"].fillna(0, inplace=True)\n",
"## end of fix\n",
"\n",
"# read subset of columns from the original/downloaded dataset\n",
"df_unsplash_subset = df_unsplash[\n",
" [\n",
" \"photo_id\",\n",
" \"photo_url\",\n",
" \"photo_image_url\",\n",
" \"photo_description\",\n",
" \"ai_description\",\n",
" \"photographer_first_name\",\n",
" \"photographer_last_name\",\n",
" \"photographer_username\",\n",
" \"exif_camera_make\",\n",
" \"exif_camera_model\",\n",
" \"exif_iso\",\n",
" ]\n",
"]\n",
"\n",
"# read all pregenerated embeddings\n",
"df_embeddings = pd.read_json(\"data/embeddings/\" + \"image-embeddings.json\", lines=True)\n",
"\n",
"df_merged = pd.merge(df_unsplash_subset, df_embeddings, on=\"photo_id\", how=\"inner\")\n",
"\n",
"count = 0\n",
"for success, info in parallel_bulk(\n",
" client=es,\n",
" actions=gen_rows(df_merged),\n",
" thread_count=5,\n",
" chunk_size=1000,\n",
" index=INDEX_NAME,\n",
"):\n",
" if success:\n",
" count += 1\n",
" if count % 1000 == 0:\n",
" print(\"Indexed %s documents\" % str(count), flush=True)\n",
" sys.stdout.flush()\n",
" else:\n",
" print(\"Doc failed\", info)\n",
"\n",
"print(\"Indexed %s image embeddings documents\" % str(count), flush=True)\n",
"sys.stdout.flush()"
]
},
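{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before querying, you can refresh the index and verify that the document count matches the number of indexed embedding documents (a quick consistency check)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# make all indexed documents searchable, then check the document count\n",
"es.indices.refresh(index=INDEX_NAME)\n",
"print(es.count(index=INDEX_NAME)[\"count\"], \"documents in index\", INDEX_NAME)"
]
},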
{
"cell_type": "markdown",
"metadata": {
"id": "-_i2CIpSz9vw"
},
"source": [
"# Query the image dataset\n",
"The next step is to run a query to search for images. The example query searches for `\"model_text\": \"Valentine day flowers\"` using the model `sentence-transformers__clip-vit-b-32-multilingual-v1` that we uploaded to Elasticsearch earlier.\n",
"\n",
"The process is carried out with a single query, even though internaly it consists of two tasks. One is to tramsform your search text into a vector using the NLP model and the second task is to run the vector search over the image dataset.\n",
"\n",
"```\n",
"POST images/_search\n",
"{\n",
" \"knn\": {\n",
" \"field\": \"image_embedding\",\n",
" \"k\": 5,\n",
" \"num_candidates\": 10,\n",
" \"query_vector_builder\": {\n",
" \"text_embedding\": {\n",
" \"model_id\": \"sentence-transformers__clip-vit-b-32-multilingual-v1\",\n",
" \"model_text\": \"Valentine day flowers\"\n",
" }\n",
" }\n",
" },\n",
" \"fields\": [\n",
" \"photo_description\",\n",
" \"ai_description\",\n",
" \"photo_url\"\n",
" ],\n",
" \"_source\": false\n",
"}\n",
"```\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 375
},
"id": "wdicpvRlzmXG",
"outputId": "00550041-0aed-4f51-ccd3-18eb705ff7ed"
},
"outputs": [],
"source": [
"# Search queary\n",
"WHAT_ARE_YOU_LOOKING_FOR = \"Valentine day flowers\"\n",
"\n",
"source_fields = [\n",
" \"photo_description\",\n",
" \"ai_description\",\n",
" \"photo_url\",\n",
" \"photo_image_url\",\n",
" \"photographer_first_name\",\n",
" \"photographer_username\",\n",
" \"photographer_last_name\",\n",
" \"photo_id\",\n",
"]\n",
"query = {\n",
" \"field\": \"image_embedding\",\n",
" \"k\": 5,\n",
" \"num_candidates\": 100,\n",
" \"query_vector_builder\": {\n",
" \"text_embedding\": {\n",
" \"model_id\": \"sentence-transformers__clip-vit-b-32-multilingual-v1\",\n",
" \"model_text\": WHAT_ARE_YOU_LOOKING_FOR,\n",
" }\n",
" },\n",
"}\n",
"\n",
"response = es.search(index=INDEX_NAME, fields=source_fields, knn=query, source=False)\n",
"\n",
"print(response.body)\n",
"\n",
"# the code writes the response into a file for the streamlit UI used in the optional step.\n",
"with open(\"json_data.json\", \"w\") as outfile:\n",
" json.dump(response.body[\"hits\"][\"hits\"], outfile)\n",
"\n",
"# Use the `loads()` method to load the JSON data\n",
"dfr = json.loads(json.dumps(response.body[\"hits\"][\"hits\"]))\n",
"# Pass the generated JSON data into a pandas dataframe\n",
"dfr = pd.DataFrame(dfr)\n",
"# Print the data frame\n",
"dfr\n",
"\n",
"results = pd.json_normalize(json.loads(json.dumps(response.body[\"hits\"][\"hits\"])))\n",
"# results\n",
"results[\n",
" [\n",
" \"_id\",\n",
" \"_score\",\n",
" \"fields.photo_id\",\n",
" \"fields.photo_image_url\",\n",
" \"fields.photo_description\",\n",
" \"fields.photographer_first_name\",\n",
" \"fields.photographer_last_name\",\n",
" \"fields.ai_description\",\n",
" \"fields.photo_url\",\n",
" ]\n",
"]"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Ry62sfHFHFi9"
},
"source": [
"# [Optional] Simple streamlit UI\n",
"In the following section, you will view the response in a simple UI for better visualisation.\n",
"\n",
"The query in the previous step did write down a file response `json_data.json` for the UI to load and visualise.\n",
"\n",
"Follow the steps below to see the results in a table."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "iUAbRqr8II-x"
},
"source": [
"### Install tunnel library"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "RGEmAt2DjtN7",
"outputId": "f6c37d54-7e09-4e59-fc21-8a3db4fa840d"
},
"outputs": [],
"source": [
"!npm install localtunnel"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "KUAfucnYITka"
},
"source": [
"### Create application"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "9Wb7GOWMXFnF",
"outputId": "6db23ef3-b25e-4f80-a3cb-6d08c1c78c16"
},
"outputs": [],
"source": [
"%%writefile app.py\n",
"\n",
"import streamlit as st\n",
"import json\n",
"import pandas as pd\n",
"\n",
"\n",
"def get_image_preview(image_url):\n",
" \"\"\"Returns an HTML <img> tag with preview of the image.\"\"\"\n",
" return f\"\"\"<img src=\"{image_url}\" width=\"400\" />\"\"\"\n",
"\n",
"\n",
"def get_url_link(photo_url):\n",
" \"\"\"Returns an HTML <a> tag to the image page.\"\"\"\n",
" return f\"\"\"<a href=\"{photo_url}\" target=\"_blank\"> {photo_url} </a>\"\"\"\n",
"\n",
"\n",
"def main():\n",
" \"\"\"Creates a Streamlit app with a table of images.\"\"\"\n",
" data = json.load(open(\"json_data.json\"))\n",
" table = []\n",
" for image in data:\n",
" image_url = image[\"fields\"][\"photo_image_url\"][0]\n",
" image_preview = get_image_preview(image_url)\n",
" photo_url = image[\"fields\"][\"photo_url\"][0]\n",
" photo_url_link = get_url_link(photo_url)\n",
" table.append([image_preview, image[\"fields\"][\"photo_id\"][0],\n",
" image[\"fields\"][\"photographer_first_name\"][0],\n",
" image[\"fields\"][\"photographer_last_name\"][0],\n",
" image[\"fields\"][\"photographer_username\"][0],\n",
" photo_url_link])\n",
"\n",
" st.write(pd.DataFrame(table, columns=[\"Image\", \"ID\", \"First Name\", \"Last Name\",\n",
" \"Photographer username\", \"Photo url\"]).to_html(escape = False),\n",
" unsafe_allow_html=True)\n",
"\n",
"\n",
"if __name__ == \"__main__\":\n",
" main()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "CjDhvbGhHuiz"
},
"source": [
"### Run app\n",
"Run the application and check your IP for the tunneling"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "851CeYi8jvuF",
"outputId": "46a64023-e990-4900-f482-5558237f08cc"
},
"outputs": [],
"source": [
"!streamlit run app.py &>/content/logs.txt & curl ipv4.icanhazip.com"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "4OuSLFHyHy5M"
},
"source": [
"### Create the tunnel\n",
"Run the tunnel and use the link below to connect to the tunnel.\n",
"\n",
"Use the IP from the previous step to connect to the application"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "inF7ceBmjyE3",
"outputId": "559ce180-3f0f-4475-c9a9-46dc91389276"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[K\u001b[?25hnpx: installed 22 in 2.186s\n",
"your url is: https://nine-facts-act.loca.lt\n",
"^C\n"
]
}
],
"source": [
"!npx localtunnel --port 8501"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "SbxbVzvQ7caR"
},
"source": [
"# Resources\n",
"\n",
"Blog: https://www.elastic.co/blog/implement-image-similarity-search-elastic\n",
"\n",
"GH : https://github.com/radoondas/flask-elastic-image-search\n"
]
}
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
}
},
"nbformat": 4,
"nbformat_minor": 4
}