

# Multimodal RAG with Elasticsearch: The Gotham City Case



This notebook implements the Multimodal RAG (Retrieval-Augmented Generation) pipeline with Elasticsearch described in the accompanying blog post. It follows the same structure as the article: each section is explained and then implemented in code.

## Environment Setup

First, we need to clone the repository that contains the complete project code.

In [None]:
!git clone https://github.com/elastic/elasticsearch-labs.git

In [None]:
import getpass

Let's navigate to the project directory where the necessary files are located:


In [None]:
%cd elasticsearch-labs/supporting-blog-content/building-multimodal-rag-with-elasticsearch-gotham

Now let's configure the environment variables needed to connect to Elasticsearch and OpenAI. This is necessary for indexing and searching content, as well as generating the final report.


In [None]:
ELASTICSEARCH_URL = input("Enter the Elasticsearch endpoint url: ")
ELASTICSEARCH_API_KEY = getpass.getpass("Enter the Elasticsearch API key: ")
OPENAI_API_KEY = getpass.getpass("Enter the OpenAI API key: ")

In [None]:
import os

os.environ["ELASTICSEARCH_API_KEY"] = ELASTICSEARCH_API_KEY
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
os.environ["ELASTICSEARCH_URL"] = ELASTICSEARCH_URL
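Before moving on, it can help to confirm that none of the values were left blank. The following is a minimal sketch; `missing_env_vars` and `REQUIRED_VARS` are helper names introduced here for illustration and are not part of the project code.

```python
import os

# Names of the variables set in the cell above.
REQUIRED_VARS = ("ELASTICSEARCH_URL", "ELASTICSEARCH_API_KEY", "OPENAI_API_KEY")


def missing_env_vars(names, env=None):
    """Return the names that are unset or blank in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name, "").strip()]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing or empty:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Passing the environment as an optional argument keeps the helper easy to try out with a plain dictionary.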


## Installing Dependencies

As mentioned in the blog, we need to install the specific dependencies, including the custom ImageBind fork:


In [None]:
# Install base dependencies
!pip install "torch>=2.1.0" "torchvision>=0.16.0" "torchaudio>=2.1.0"
!pip install opencv-python-headless pillow numpy

# Install the specific ImageBind fork
!pip install git+https://github.com/hkchengrex/ImageBind.git

In [None]:
!pip -q install elasticsearch

In [None]:
!pip install python-dotenv

In [None]:
!pip install openai

In [None]:
!pip install soundfile

## Stage 1 - Collecting Crime Scene Clues

As explained in the blog, the first step is to verify that we have the correct directory structure and that the evidence files are present. We use `files_check.py` for this.

In [None]:
!python stages/01-stage/files_check.py
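The kind of check performed in this stage can be sketched in a few lines. This is not the content of `files_check.py`; the helper `find_missing_dirs` and the folder names (`data`, `images`, `audios`, `texts`, `depths`) are assumptions chosen only to illustrate the idea of one folder per modality.

```python
from pathlib import Path


def find_missing_dirs(base_dir, subdirs):
    """Return the subdirectory names that do not exist under base_dir."""
    base = Path(base_dir)
    return [name for name in subdirs if not (base / name).is_dir()]


# Hypothetical layout: one folder per modality under "data".
expected = ["images", "audios", "texts", "depths"]
missing = find_missing_dirs("data", expected)
print("Missing directories:", missing if missing else "none")
```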

## Stage 2 - Generating Embeddings with ImageBind

Now we test the embedding generation for an image using ImageBind. As the blog explains, ImageBind allows us to generate embeddings for different modalities (image, audio, text) in a shared vector space.


In [None]:
!python stages/02-stage/test_embedding_generation.py

This script generates a 1024-dimensional embedding for a test image, confirming that the ImageBind model is working correctly.
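To make the idea of a shared vector space more concrete, here is a minimal sketch of how two embeddings can be compared with cosine similarity. The vectors are tiny made-up stand-ins rather than real ImageBind outputs, and `cosine_similarity` is a helper defined here for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 4-dimensional stand-ins for embeddings from different modalities.
image_vec = [0.9, 0.1, 0.0, 0.2]
audio_vec = [0.8, 0.2, 0.1, 0.3]
unrelated_vec = [-0.1, 0.9, -0.7, 0.0]

print(cosine_similarity(image_vec, audio_vec))      # close to 1: similar direction
print(cosine_similarity(image_vec, unrelated_vec))  # near 0: unrelated direction
```

Because all modalities are embedded into the same space, the same comparison applies whether the two vectors come from an image, an audio clip, or a text snippet.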



## Stage 3 - Storage and Search in Elasticsearch

### Content Indexing

The next step is to index all multimodal evidence in Elasticsearch. This includes images, audio, text, and depth maps as described in the blog.

In [None]:
!python stages/03-stage/index_all_modalities.py


Each piece of evidence is now indexed in Elasticsearch together with its embedding, which makes similarity search possible.
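For orientation, an index that stores one embedding per document typically needs a `dense_vector` field. The sketch below builds such a mapping as a plain Python dictionary. The function `build_index_mapping` and the field names (`embedding`, `modality`, `description`, `content_path`) are assumptions for illustration and may differ from what `index_all_modalities.py` actually uses.

```python
EMBEDDING_DIMS = 1024  # matches the embedding size reported in Stage 2


def build_index_mapping(dims=EMBEDDING_DIMS):
    """Build an index mapping with one dense_vector field (field names are hypothetical)."""
    return {
        "mappings": {
            "properties": {
                # Vector field used for similarity search.
                "embedding": {
                    "type": "dense_vector",
                    "dims": dims,
                    "index": True,
                    "similarity": "cosine",
                },
                # Metadata fields, assumed here for illustration.
                "modality": {"type": "keyword"},
                "description": {"type": "text"},
                "content_path": {"type": "keyword"},
            }
        }
    }


mapping = build_index_mapping()
print(mapping["mappings"]["properties"]["embedding"])
```

A dictionary like this could be supplied when the index is created, for example through the `mappings` argument of `indices.create` in the 8.x Python client.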

### Searching by Similarity Across Different Modalities

Now we can test searching for evidence by similarity using different modalities as queries. The blog describes how an input from one modality can retrieve results from all modalities.
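A similarity search of this kind is commonly expressed as an Elasticsearch kNN query. The following is a minimal sketch of building the request body, assuming a `dense_vector` field named `embedding` and metadata fields named `modality` and `description`; all three names, and the helper `build_knn_search`, are hypothetical and not taken from the project scripts.

```python
def build_knn_search(query_vector, k=5, num_candidates=50, field="embedding"):
    """Build a kNN search body; `field` is a hypothetical dense_vector field name."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if num_candidates < k:
        raise ValueError("num_candidates must be at least k")
    return {
        "knn": {
            "field": field,
            "query_vector": list(query_vector),
            "k": k,
            "num_candidates": num_candidates,
        },
        # Return only the (assumed) metadata fields, not the full vector.
        "_source": ["modality", "description"],
    }


# A short toy vector; a real query would use the full embedding of the input.
body = build_knn_search([0.1, 0.2, 0.3], k=3, num_candidates=10)
print(body["knn"]["k"], len(body["knn"]["query_vector"]))
```

Since the query vector can come from any modality, the same body shape works for audio, text, image, and depth queries; only the embedding step differs.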

#### Search by Audio


In [None]:
!python stages/03-stage/search_by_audio.py


This command uses an audio file as a query and retrieves the most similar evidence. In the case of Gotham, this helps identify connections between the audio of a sinister laugh and other evidence.

#### Search by Text

In [None]:
!python stages/03-stage/search_by_text.py


Here we use a text query ("Why so serious?") to find related evidence.

#### Search by Image


In [None]:
!python stages/03-stage/search_by_image.py

This script uses an image from the crime scene to find similar visual evidence.

#### Search by Depth Map


In [None]:
!python stages/03-stage/search_by_depth.py

As explained in the blog, depth maps can provide information about the 3D structure of the scene or objects, complementing the other modalities.

## Stage 4 - Evidence Analysis with LLM

Finally, we bring together all the retrieved evidence and use an LLM (GPT-4) to generate a forensic report that identifies the suspect based on the connections between the different modalities.


In [None]:
!python stages/04-stage/rag_crime_analyze.py


This is the final step of the Multimodal RAG pipeline, where the LLM analyzes the evidence retrieved from Elasticsearch and synthesizes it into a coherent report that identifies the Joker as the main suspect.
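The step of turning retrieved evidence into LLM input can be sketched as simple prompt assembly. This is an illustration under stated assumptions, not the logic of `rag_crime_analyze.py`: the helper `build_forensic_prompt`, the dictionary keys, the prompt wording, and the example hits with their scores are all made up for demonstration.

```python
def build_forensic_prompt(evidence):
    """Turn retrieved hits into a single prompt string for an LLM."""
    if not evidence:
        raise ValueError("no evidence to analyze")
    lines = [
        "You are a forensic analyst. Review the evidence below and",
        "identify the most likely suspect, explaining your reasoning.",
        "",
        "Evidence:",
    ]
    for i, item in enumerate(evidence, start=1):
        lines.append(
            f"{i}. [{item['modality']}] {item['description']} "
            f"(score: {item['score']:.2f})"
        )
    return "\n".join(lines)


# Made-up hits standing in for results retrieved from Elasticsearch.
retrieved = [
    {"modality": "audio", "description": "A sinister laugh recorded near the scene", "score": 0.91},
    {"modality": "text", "description": "A note reading 'Why so serious?'", "score": 0.87},
]
print(build_forensic_prompt(retrieved))
```

The resulting string would then be sent to the model as the user message, with the model's reply serving as the forensic report.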

## Conclusion

This completes the Multimodal RAG pipeline with Elasticsearch, following the steps described in the blog. The pipeline shows how different types of media can be analyzed together to surface connections between pieces of evidence that would be difficult to identify manually.
