# How to implement Image search using Elasticsearch

This notebook shows how to implement image search using Elasticsearch. You will index documents with image embeddings (generated or pre-generated) and then use an NLP model to search the images with a natural-language description.

## Prerequisites
Before we begin, create an Elastic Cloud deployment and [autoscale](https://www.elastic.co/guide/en/cloud/current/ec-autoscaling.html) it so that it has at least one machine learning (ML) node with enough memory (4GB). Also ensure that the Elasticsearch cluster is running.

If you don't already have an Elastic deployment, you can sign up for a free [Elastic Cloud trial](https://cloud.elastic.co/registration?onboarding_token=vectorsearch&utm_source=github&utm_content=elasticsearch-labs-notebook).

### Install Python requirements
Before you start you need to install all required Python dependencies.

In [None]:
!pip install sentence-transformers==2.7.0 eland elasticsearch transformers torch tqdm Pillow streamlit

In [3]:
from elasticsearch import Elasticsearch
from elasticsearch.helpers import parallel_bulk
import requests
import os
import sys

import zipfile
from tqdm.auto import tqdm
import pandas as pd
from PIL import Image
from sentence_transformers import SentenceTransformer
import urllib.request

# import urllib.error
import json
from getpass import getpass

### Upload NLP model for querying

Using the [`eland_import_hub_model`](https://www.elastic.co/guide/en/elasticsearch/client/eland/current/machine-learning.html#ml-nlp-pytorch) script, download and install the [clip-ViT-B-32-multilingual-v1](https://huggingface.co/sentence-transformers/clip-ViT-B-32-multilingual-v1) model. It converts your search query into a vector, which is then used to search the set of images stored in Elasticsearch.

To get your Cloud ID, go to [Elastic Cloud](https://cloud.elastic.co) and copy the Cloud ID from the deployment overview page.

To authenticate your requests, you can use an [API key](https://www.elastic.co/guide/en/kibana/current/api-keys.html#create-api-key). Alternatively, you can use your cloud deployment username and password.
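
Each authentication method maps onto a different keyword argument of the `Elasticsearch` client. The following is a minimal sketch, assuming a hypothetical helper `build_client_kwargs` (not part of the original notebook) that prefers `api_key` when one is supplied and falls back to `basic_auth` otherwise.

```python
def build_client_kwargs(cloud_id, api_key=None, username=None, password=None):
    """Return keyword arguments for Elasticsearch(...).

    Hypothetical helper for illustration: prefers an API key and falls back
    to username/password (basic auth) when no key is given.
    """
    if not cloud_id:
        raise ValueError("cloud_id is required")

    kwargs = {"cloud_id": cloud_id, "request_timeout": 600}
    if api_key:
        kwargs["api_key"] = api_key
    elif username and password:
        kwargs["basic_auth"] = (username, password)
    else:
        raise ValueError("provide an API key or a username and password")
    return kwargs


# Example with placeholder credentials (no connection is made here).
print(sorted(build_client_kwargs("my-cloud-id", api_key="my-api-key")))
```

With the variables collected in this notebook, the result could be passed on as `Elasticsearch(**build_client_kwargs(ELASTIC_CLOUD_ID, api_key=ELASTIC_API_KEY))`.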

In [None]:
# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id
ELASTIC_CLOUD_ID = getpass("Elastic Cloud ID: ")

# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key
ELASTIC_API_KEY = getpass("Elastic Api Key: ")

In [None]:
!eland_import_hub_model --cloud-id $ELASTIC_CLOUD_ID --hub-model-id sentence-transformers/clip-ViT-B-32-multilingual-v1 --task-type text_embedding --es-api-key $ELASTIC_API_KEY --start --clear-previous
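
The search query later in this notebook refers to the model as `sentence-transformers__clip-vit-b-32-multilingual-v1` rather than by its Hugging Face name. The sketch below reproduces the naming pattern visible in this notebook (the slash becomes a double underscore and the name is lower-cased). `to_es_model_id` is a hypothetical helper, and the exact normalization rules are an assumption, so confirm the ID in your deployment if you import a different model.

```python
def to_es_model_id(hub_model_id: str) -> str:
    """Derive the Elasticsearch model ID from a Hugging Face Hub model ID.

    Assumption: mirrors the pattern seen in this notebook only
    ("/" -> "__", lower-case); other normalization rules may apply.
    """
    return hub_model_id.replace("/", "__").lower()


HUB_MODEL_ID = "sentence-transformers/clip-ViT-B-32-multilingual-v1"
ES_MODEL_ID = to_es_model_id(HUB_MODEL_ID)
print(ES_MODEL_ID)
```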

### Connect to Elasticsearch cluster
Use your own cluster details: `ELASTIC_CLOUD_ID` and `ELASTIC_API_KEY`.

In [9]:
es = Elasticsearch(
    cloud_id=ELASTIC_CLOUD_ID,
    # basic_auth=(ELASTIC_CLOUD_USER, ELASTIC_CLOUD_PASSWORD),
    api_key=ELASTIC_API_KEY,
    request_timeout=600,
)

es.info()  # should return cluster info

ObjectApiResponse({'name': 'instance-0000000001', 'cluster_name': 'a72482be54904952ba46d53c3def7740', 'cluster_uuid': 'g8BE52TtT32pGBbRzP_oKA', 'version': {'number': '8.12.2', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '48a287ab9497e852de30327444b0809e55d46466', 'build_date': '2024-02-19T10:04:32.774273190Z', 'build_snapshot': False, 'lucene_version': '9.9.2', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'})

### Create Index and mappings for Images
Before you can index documents into Elasticsearch, you need to create an index with the correct mappings.

In [10]:
# Destination Index name
INDEX_NAME = "images"

# flag to check if index has to be deleted before creating
SHOULD_DELETE_INDEX = True

INDEX_MAPPING = {
    "properties": {
        "image_embedding": {
            "type": "dense_vector",
            "dims": 512,
            "index": True,
            "similarity": "cosine",
        },
        "photo_id": {"type": "keyword"},
        "photo_image_url": {"type": "keyword"},
        "ai_description": {"type": "text"},
        "photo_description": {"type": "text"},
        "photo_url": {"type": "keyword"},
        "photographer_first_name": {"type": "keyword"},
        "photographer_last_name": {"type": "keyword"},
        "photographer_username": {"type": "keyword"},
        "exif_camera_make": {"type": "keyword"},
        "exif_camera_model": {"type": "keyword"},
        "exif_iso": {"type": "integer"},
    }
}

# Index settings
INDEX_SETTINGS = {
    "index": {
        "number_of_replicas": "1",
        "number_of_shards": "1",
        "refresh_interval": "5s",
    }
}

# check if we want to delete index before creating the index
if SHOULD_DELETE_INDEX:
    if es.indices.exists(index=INDEX_NAME):
        print("Deleting existing %s" % INDEX_NAME)
        es.options(ignore_status=[400, 404]).indices.delete(index=INDEX_NAME)

print("Creating index %s" % INDEX_NAME)
# options(ignore_status=...) replaces the deprecated per-request `ignore` argument
es.options(ignore_status=[400, 404]).indices.create(
    index=INDEX_NAME, mappings=INDEX_MAPPING, settings=INDEX_SETTINGS
)

Creating index images


ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'images'})
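
The `cosine` similarity in the mapping determines how a query vector is compared with each stored `image_embedding`. As an illustration only, the sketch below computes cosine similarity for toy vectors in plain Python and converts it to a relevance score with `(1 + cosine) / 2`, the transformation the Elasticsearch documentation describes for `cosine` kNN search. The helper names and the 3-dimensional vectors are assumptions; real embeddings from this model have 512 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def knn_score(cosine):
    """Map a cosine value in [-1, 1] to a score in [0, 1]."""
    return (1 + cosine) / 2


# Toy vectors, for illustration only.
query_vector = [0.2, 0.9, 0.1]
image_vector = [0.1, 0.8, 0.3]
print(knn_score(cosine_similarity(query_vector, image_vector)))
```

Vectors pointing in the same direction score close to 1, while unrelated (orthogonal) vectors score 0.5.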

### Get image dataset and embeddings
Download:
- The example image dataset from [Unsplash](https://github.com/unsplash/datasets)
- The [image embeddings](https://github.com/radoondas/flask-elastic-nlp/blob/main/embeddings/images/image-embeddings.json.zip), pre-generated using the CLIP model

Then unzip both files.

In [None]:
!curl -L https://unsplash.com/data/lite/1.2.0 -o unsplash-research-dataset-lite-1.2.0.zip
!curl -L https://raw.githubusercontent.com/radoondas/flask-elastic-nlp/main/embeddings/images/image-embeddings.json.zip -o image-embeddings.json.zip

In [None]:
# Unzip downloaded files
UNSPLASH_ZIP_FILE = "unsplash-research-dataset-lite-1.2.0.zip"
EMBEDDINGS_ZIP_FILE = "image-embeddings.json.zip"

with zipfile.ZipFile(UNSPLASH_ZIP_FILE, "r") as zip_ref:
    print("Extracting file ", UNSPLASH_ZIP_FILE, ".")
    zip_ref.extractall("data/unsplash/")

with zipfile.ZipFile(EMBEDDINGS_ZIP_FILE, "r") as zip_ref:
    print("Extracting file ", EMBEDDINGS_ZIP_FILE, ".")
    zip_ref.extractall("data/embeddings/")
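
Before moving on, it can help to confirm that extraction produced the two files the next section reads. This is a minimal sketch, assuming a hypothetical `missing_files` helper; the paths are the ones used later in this notebook.

```python
from pathlib import Path

# Files read in the import step of this notebook.
EXPECTED_FILES = [
    "data/unsplash/photos.tsv000",
    "data/embeddings/image-embeddings.json",
]


def missing_files(paths, base_dir="."):
    """Return the subset of `paths` that do not exist under `base_dir`."""
    base = Path(base_dir)
    return [p for p in paths if not (base / p).is_file()]


missing = missing_files(EXPECTED_FILES)
if missing:
    print("Missing files:", missing)
else:
    print("All expected files are present.")
```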

# Import all pregenerated image embeddings
In this section you will import ~19k documents' worth of pregenerated image embeddings with metadata.

The process loads the files with image information, merges them, and indexes the result into Elasticsearch.

In [20]:
df_unsplash = pd.read_csv("data/unsplash/" + "photos.tsv000", sep="\t", header=0)

# the following lines fix inconsistent/incorrect data by filling missing values
# (assigning the result back avoids pandas' chained-assignment warning)
text_columns = [
    "photo_description",
    "ai_description",
    "photographer_first_name",
    "photographer_last_name",
    "photographer_username",
    "exif_camera_make",
    "exif_camera_model",
]
df_unsplash[text_columns] = df_unsplash[text_columns].fillna("")
df_unsplash["exif_iso"] = df_unsplash["exif_iso"].fillna(0)
## end of fix

# read subset of columns from the original/downloaded dataset
df_unsplash_subset = df_unsplash[
    [
        "photo_id",
        "photo_url",
        "photo_image_url",
        "photo_description",
        "ai_description",
        "photographer_first_name",
        "photographer_last_name",
        "photographer_username",
        "exif_camera_make",
        "exif_camera_model",
        "exif_iso",
    ]
]

# read all pregenerated embeddings
df_embeddings = pd.read_json("data/embeddings/" + "image-embeddings.json", lines=True)

df_merged = pd.merge(df_unsplash_subset, df_embeddings, on="photo_id", how="inner")

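# Minimal row generator (assumed definition; the bulk helper accepts
# plain dicts as documents and uses the `index` argument below)
def gen_rows(df):
    """Yield each merged DataFrame row as a dict."""
    for doc in df.to_dict(orient="records"):
        yield doc


count = 0
for success, info in parallel_bulk(
    client=es,
    actions=gen_rows(df_merged),
    thread_count=5,
    chunk_size=1000,
    index=INDEX_NAME,
):
    if success:
        count += 1
        if count % 1000 == 0:
            print("Indexed %s documents" % count, flush=True)
    else:
        print("Doc failed", info)

print("Indexed %s image embeddings documents" % count, flush=True)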

Indexed 1000 documents
Indexed 2000 documents
Indexed 3000 documents
Indexed 4000 documents
Indexed 5000 documents
Indexed 6000 documents
Indexed 7000 documents
Indexed 8000 documents
Indexed 9000 documents
Indexed 10000 documents
Indexed 11000 documents
Indexed 12000 documents
Indexed 13000 documents
Indexed 14000 documents
Indexed 15000 documents
Indexed 16000 documents
Indexed 17000 documents
Indexed 18000 documents
Indexed 19000 documents
Indexed 19833 image embeddings documents
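
The bulk helpers also accept actions that carry metadata such as `_index` and `_id` alongside the document in `_source`. One optional refinement, sketched here as an assumption rather than something this notebook does, is to set each document's `_id` to its `photo_id` so that re-running the import overwrites existing documents instead of adding duplicates. `gen_actions` and the sample record are hypothetical.

```python
def gen_actions(records, index_name):
    """Yield bulk actions whose _id is the record's photo_id (illustrative)."""
    for record in records:
        yield {
            "_index": index_name,
            "_id": record["photo_id"],
            "_source": record,
        }


# Made-up record with a shortened embedding, for illustration only.
sample_records = [
    {
        "photo_id": "abc123",
        "photo_url": "https://example.com/photos/abc123",
        "image_embedding": [0.1, 0.2, 0.3],
    }
]

actions = list(gen_actions(sample_records, "images"))
print(actions[0]["_id"])
```

In this notebook, `gen_actions(df_merged.to_dict(orient="records"), INDEX_NAME)` could be passed as the `actions` argument of `parallel_bulk`.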


# Query the image dataset
The next step is to run a query to search for images. The example query searches for `"model_text": "Valentine day flowers"` using the model `sentence-transformers__clip-vit-b-32-multilingual-v1` that we uploaded to Elasticsearch earlier.

The process is carried out with a single query, even though internally it consists of two tasks. The first transforms your search text into a vector using the NLP model; the second runs the vector search over the image dataset.

```
POST images/_search
{
  "knn": {
    "field": "image_embedding",
    "k": 5,
    "num_candidates": 10,
    "query_vector_builder": {
      "text_embedding": {
        "model_id": "sentence-transformers__clip-vit-b-32-multilingual-v1",
        "model_text": "Valentine day flowers"
      }
    }
  },
  "fields": [
    "photo_description",
    "ai_description",
    "photo_url"
  ],
  "_source": false
}
```
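
The same `knn` clause can be assembled programmatically. The sketch below wraps it in a hypothetical `build_knn_query` helper so that the search text, `k`, and `num_candidates` are easy to vary. The check that `num_candidates` is not smaller than `k` reflects a constraint of Elasticsearch kNN search; the default values are taken from this notebook.

```python
MODEL_ID = "sentence-transformers__clip-vit-b-32-multilingual-v1"


def build_knn_query(text, k=5, num_candidates=100,
                    model_id=MODEL_ID, field="image_embedding"):
    """Build the `knn` section of a search request (illustrative helper)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if num_candidates < k:
        raise ValueError("num_candidates must not be smaller than k")
    return {
        "field": field,
        "k": k,
        "num_candidates": num_candidates,
        # Elasticsearch embeds the text with the deployed model at query time.
        "query_vector_builder": {
            "text_embedding": {
                "model_id": model_id,
                "model_text": text,
            }
        },
    }


query = build_knn_query("Valentine day flowers")
print(query["query_vector_builder"]["text_embedding"]["model_text"])
```

A larger `num_candidates` considers more candidates per shard, which tends to improve accuracy at the cost of slower searches.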



In [None]:
# Search query
WHAT_ARE_YOU_LOOKING_FOR = "Valentine day flowers"

source_fields = [
    "photo_description",
    "ai_description",
    "photo_url",
    "photo_image_url",
    "photographer_first_name",
    "photographer_username",
    "photographer_last_name",
    "photo_id",
]
query = {
    "field": "image_embedding",
    "k": 5,
    "num_candidates": 100,
    "query_vector_builder": {
        "text_embedding": {
            "model_id": "sentence-transformers__clip-vit-b-32-multilingual-v1",
            "model_text": WHAT_ARE_YOU_LOOKING_FOR,
        }
    },
}

response = es.search(index=INDEX_NAME, fields=source_fields, knn=query, source=False)

print(response.body)

# the code writes the response into a file for the streamlit UI used in the optional step.
with open("json_data.json", "w") as outfile:
    json.dump(response.body["hits"]["hits"], outfile)

# Load the hits into a pandas DataFrame (one row per hit)
dfr = pd.DataFrame(response.body["hits"]["hits"])
# Show the data frame
dfr

# Flatten the nested "fields" object into columns such as "fields.photo_id"
results = pd.json_normalize(response.body["hits"]["hits"])
# results
results[
    [
        "_id",
        "_score",
        "fields.photo_id",
        "fields.photo_image_url",
        "fields.photo_description",
        "fields.photographer_first_name",
        "fields.photographer_last_name",
        "fields.ai_description",
        "fields.photo_url",
    ]
]
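
Values requested through the `fields` option come back as lists, even for single-valued fields, which is why the Streamlit app in the optional section indexes them with `[0]`. The sketch below shows one way to flatten a hit into a plain dictionary; `flatten_hit` and the sample hit are hypothetical and only mimic the shape of the response.

```python
def flatten_hit(hit):
    """Flatten one search hit: unwrap single-element `fields` lists."""
    flat = {"_id": hit.get("_id"), "_score": hit.get("_score")}
    for name, values in hit.get("fields", {}).items():
        if isinstance(values, list) and len(values) == 1:
            flat[name] = values[0]
        else:
            flat[name] = values
    return flat


# Made-up hit in the shape returned when `fields` is requested.
sample_hit = {
    "_index": "images",
    "_id": "doc-1",
    "_score": 0.64,
    "fields": {
        "photo_id": ["abc123"],
        "photo_url": ["https://example.com/photos/abc123"],
    },
}

print(flatten_hit(sample_hit))
```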

# [Optional] Simple streamlit UI
In the following section, you will view the response in a simple UI for better visualisation.

The query in the previous step wrote the response to the file `json_data.json`, which the UI loads and visualises.

Follow the steps below to see the results in a table.

### Install tunnel library

In [None]:
!npm install localtunnel

### Create application

In [None]:
%%writefile app.py

import streamlit as st
import json
import pandas as pd


def get_image_preview(image_url):
    """Returns an HTML <img> tag with preview of the image."""
    return f"""<img src="{image_url}" width="400" />"""


def get_url_link(photo_url):
    """Returns an HTML <a> tag to the image page."""
    return f"""<a href="{photo_url}"  target="_blank"> {photo_url} </a>"""


def main():
    """Creates a Streamlit app with a table of images."""
    # Read the hits saved by the query step; the file is closed automatically
    with open("json_data.json") as f:
        data = json.load(f)
    table = []
    for image in data:
        image_url = image["fields"]["photo_image_url"][0]
        image_preview = get_image_preview(image_url)
        photo_url = image["fields"]["photo_url"][0]
        photo_url_link = get_url_link(photo_url)
        table.append([image_preview, image["fields"]["photo_id"][0],
                      image["fields"]["photographer_first_name"][0],
                      image["fields"]["photographer_last_name"][0],
                      image["fields"]["photographer_username"][0],
                      photo_url_link])

    st.write(pd.DataFrame(table, columns=["Image", "ID", "First Name", "Last Name",
                                          "Photographer username", "Photo url"]).to_html(escape = False),
             unsafe_allow_html=True)


if __name__ == "__main__":
    main()

### Run app
Run the application and check your IP address, which you will need for the tunnel.

In [None]:
!streamlit run app.py &>/content/logs.txt & curl ipv4.icanhazip.com

### Create the tunnel
Run the tunnel and use the link below to connect to it.

Use the IP address from the previous step to connect to the application.

In [38]:
!npx localtunnel --port 8501

npx: installed 22 in 2.186s
your url is: https://nine-facts-act.loca.lt
^C


# Resources

Blog: https://www.elastic.co/blog/implement-image-similarity-search-elastic

GitHub: https://github.com/radoondas/flask-elastic-image-search
