# BM25 and Self-querying retriever with Elasticsearch and LangChain
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/langchain/notebooks/langchain/self-query-retriever-examples/chatbot-with-bm25-only-example.ipynb)

This workbook demonstrates how to use LangChain's [Self-query retriever](https://api.python.langchain.com/en/latest/retrievers/langchain.retrievers.self_query.base.SelfQueryRetriever.html) with Elasticsearch to convert an unstructured question into a structured query, which is then run as a BM25 search.

In this example, we will:
- Ingest a sample dataset of movies outside of LangChain
- Customise the retrieval strategy in ElasticsearchStore to use just BM25
- Use the self-query retriever to transform a question into a structured query
- Use the retrieved documents and a RAG strategy to answer the question

## Install packages


In [58]:
!python3 -m pip install -qU lark elasticsearch langchain langchain-elasticsearch openai



## Sample Dataset

In [59]:
docs = [
    {
        "text": "A bunch of scientists bring back dinosaurs and mayhem breaks loose",
        "metadata": {
            "year": 1993,
            "rating": 7.7,
            "genre": "science fiction",
            "director": "Steven Spielberg",
            "title": "Jurassic Park",
        },
    },
    {
        "text": "Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
        "metadata": {
            "year": 2010,
            "director": "Christopher Nolan",
            "rating": 8.2,
            "title": "Inception",
        },
    },
    {
        "text": "A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
        "metadata": {
            "year": 2006,
            "director": "Satoshi Kon",
            "rating": 8.6,
            "title": "Paprika",
        },
    },
    {
        "text": "A bunch of normal-sized women are supremely wholesome and some men pine after them",
        "metadata": {
            "year": 2019,
            "director": "Greta Gerwig",
            "rating": 8.3,
            "title": "Little Women",
        },
    },
    {
        "text": "Toys come alive and have a blast doing so",
        "metadata": {
            "year": 1995,
            "genre": "animated",
            "director": "John Lasseter",
            "rating": 8.3,
            "title": "Toy Story",
        },
    },
    {
        "text": "Three men walk into the Zone, three men walk out of the Zone",
        "metadata": {
            "year": 1979,
            "director": "Andrei Tarkovsky",
            "genre": "science fiction",
            "rating": 9.9,
            "title": "Stalker",
        },
    },
]
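
Only some of the documents above carry a `genre` field, so a filter on `metadata.genre` can only ever match those. A minimal sketch for checking field coverage, using a trimmed copy of the metadata (the helper name `field_coverage` is hypothetical and not part of the notebook):

```python
from collections import Counter

# Trimmed copy of the sample metadata (titles and genres only).
sample_metadata = [
    {"title": "Jurassic Park", "genre": "science fiction"},
    {"title": "Inception"},
    {"title": "Paprika"},
    {"title": "Little Women"},
    {"title": "Toy Story", "genre": "animated"},
    {"title": "Stalker", "genre": "science fiction"},
]


def field_coverage(records):
    """Count how many records define each metadata field."""
    counts = Counter()
    for record in records:
        counts.update(record.keys())
    return dict(counts)


coverage = field_coverage(sample_metadata)
missing_genre = [m["title"] for m in sample_metadata if "genre" not in m]
print(coverage)
print("No genre:", missing_genre)
```

If the LLM adds a genre condition to the filter, the three movies without a genre are excluded before BM25 scoring is applied.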

## Connect to Elasticsearch

ℹ️ We're using an Elastic Cloud deployment of Elasticsearch for this notebook. If you don't have an Elastic Cloud deployment, sign up [here](https://cloud.elastic.co/registration?onboarding_token=vectorsearch&utm_source=github&utm_content=elasticsearch-labs-notebook) for a free trial. 

Because we are using an Elastic Cloud deployment, we'll use the **Cloud ID** to identify it. To find the Cloud ID for your deployment, go to https://cloud.elastic.co/deployments and select your deployment.


We connect to the Elastic Cloud deployment with the Elasticsearch Python client. Later, [ElasticsearchStore](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.elasticsearch.ElasticsearchStore.html) will reuse this connection for retrieval. The list of documents that we created in the previous step is indexed in the next section.

In [60]:
from elasticsearch import Elasticsearch
from getpass import getpass

# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id
ELASTIC_CLOUD_ID = getpass("Elastic Cloud ID: ")

# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key
ELASTIC_API_KEY = getpass("Elastic API key: ")

# https://platform.openai.com/api-keys
OPENAI_API_KEY = getpass("OpenAI API key: ")

client = Elasticsearch(
    cloud_id=ELASTIC_CLOUD_ID,
    api_key=ELASTIC_API_KEY,
)

## Indexing data into Elasticsearch

We have chosen to index the data outside of LangChain to demonstrate how it's possible to use LangChain for RAG and self-query retrieval on any Elasticsearch index.

In [61]:
from elasticsearch import helpers

# create the index
client.indices.create(index="movies_self_query")

operations = [
    {
        "_index": "movies_self_query",
        "_id": i,
        "text": doc["text"],
        "metadata": doc["metadata"],
    }
    for i, doc in enumerate(docs)
]

# Add the documents to the index directly
response = helpers.bulk(
    client,
    operations,
)
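
`helpers.bulk` accepts any iterable of action dictionaries. Keys such as `_index` and `_id` are treated as action metadata, and the remaining keys become the document body. A minimal sketch of the same shaping as a reusable generator, so the actions can be inspected without a cluster; `build_bulk_actions` is a hypothetical helper, not part of the Elasticsearch client, and the two sample documents are a trimmed copy of the dataset:

```python
def build_bulk_actions(docs, index_name):
    """Yield one bulk index action per document (hypothetical helper)."""
    for i, doc in enumerate(docs):
        yield {
            "_index": index_name,  # target index for this action
            "_id": i,              # position in the list doubles as the document ID
            "text": doc["text"],
            "metadata": doc["metadata"],
        }


sample_docs = [
    {
        "text": "Toys come alive and have a blast doing so",
        "metadata": {"year": 1995, "title": "Toy Story"},
    },
    {
        "text": "Three men walk into the Zone, three men walk out of the Zone",
        "metadata": {"year": 1979, "title": "Stalker"},
    },
]

actions = list(build_bulk_actions(sample_docs, "movies_self_query"))
print(actions[0])
```

The generator could be passed directly as the second argument to `helpers.bulk` in place of the `operations` list.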

## Set up the self-query retriever

Next, we prepare the self-query retriever by providing some information about our document attributes and a short description of the document contents.

We will then instantiate the retriever with [SelfQueryRetriever.from_llm](https://api.python.langchain.com/en/latest/retrievers/langchain.retrievers.self_query.base.SelfQueryRetriever.html) in the next section.

In [62]:
from typing import List, Union
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.llms import OpenAI
from langchain_elasticsearch import ApproxRetrievalStrategy, ElasticsearchStore

# Add details about metadata fields
metadata_field_info = [
    AttributeInfo(
        name="genre",
        description="The genre of the movie. Can be either 'science fiction' or 'animated'.",
        type="string or list[string]",
    ),
    AttributeInfo(
        name="year",
        description="The year the movie was released",
        type="integer",
    ),
    AttributeInfo(
        name="director",
        description="The name of the movie director",
        type="string",
    ),
    AttributeInfo(
        name="rating", description="A 1-10 rating for the movie", type="float"
    ),
]

document_content_description = "Brief summary of a movie"

# Set up the OpenAI LLM with sampling temperature 0
llm = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)


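class BM25RetrievalStrategy(ApproxRetrievalStrategy):
    """Retrieval strategy that sends a plain BM25 text query instead of a vector search."""

    def __init__(self):
        # Deliberately skip the parent initialiser: no vector search settings are needed.
        pass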

    def query(
        self,
        query: Union[str, None],
        filter: List[dict],
        **kwargs,
    ):

        if query:
            query_clause = [
                {
                    "multi_match": {
                        "query": query,
                        "fields": ["text"],
                        "fuzziness": "AUTO",
                    }
                }
            ]
        else:
            query_clause = []

        bm25_query = {
            "query": {"bool": {"filter": filter, "must": query_clause}},
        }

        print("query", bm25_query)

        return bm25_query


vectorstore = ElasticsearchStore(
    index_name="movies_self_query",
    es_connection=client,
    strategy=BM25RetrievalStrategy(),
)

## BM25 Only Retriever 
One option is to customise the query so that only BM25 retrieval is used. We did this above by subclassing the retrieval strategy and overriding its `query` method so that the request uses only `multi_match`.

In the example below, the self-query retriever uses the LLM to transform the question into a keyword query and a filter (query: dinosaur, filter: genre and year range). The custom strategy then runs a BM25 query that matches the keyword against the `text` field and applies the filter alongside it.

This means that you don't have to vectorise all the documents if you want to build a question answering use case on an existing Elasticsearch index.
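
To make the shape of that request concrete, here is a minimal standalone sketch of the query body, mirroring the `query` method of `BM25RetrievalStrategy` defined earlier. The function name `build_bm25_query` and the hand-written filter are illustrative assumptions; in the notebook the filter is produced by the self-query retriever.

```python
import json


def build_bm25_query(query, filters):
    """Build a bool query: a scored text match plus non-scoring filters."""
    must = []
    if query:
        must.append(
            {"multi_match": {"query": query, "fields": ["text"], "fuzziness": "AUTO"}}
        )
    return {"query": {"bool": {"filter": filters, "must": must}}}


# Hand-written stand-in for the filter the LLM would generate (assumption).
year_filter = [
    {"range": {"metadata.year": {"gt": 1992}}},
    {"range": {"metadata.year": {"lt": 2007}}},
]

body = build_bm25_query("dinosaur", year_filter)
print(json.dumps(body, indent=2))
```

Clauses under `filter` restrict the result set without contributing to the score, so the ranking comes only from the `multi_match` clause.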

In [63]:
from langchain.schema.runnable import RunnableParallel, RunnablePassthrough
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain.schema import format_document

retriever = SelfQueryRetriever.from_llm(
    llm, vectorstore, document_content_description, metadata_field_info, verbose=True
)

LLM_CONTEXT_PROMPT = ChatPromptTemplate.from_template(
    """
Use the following movies, which matched the user's question, as context. Use only the movies below to answer the user's question.

If you don't know the answer, just say that you don't know, don't try to make up an answer.

----
{context}
----
Question: {question}
Answer:
"""
)

DOCUMENT_PROMPT = PromptTemplate.from_template(
    """
---
title: {title}
year: {year}
director: {director}
---
"""
)


def _combine_documents(
    docs, document_prompt=DOCUMENT_PROMPT, document_separator="\n\n"
):
    print("docs:", docs)
    doc_strings = [format_document(doc, document_prompt) for doc in docs]
    return document_separator.join(doc_strings)


_context = RunnableParallel(
    context=retriever | _combine_documents,
    question=RunnablePassthrough(),
)

chain = _context | LLM_CONTEXT_PROMPT | llm

chain.invoke(
    "Which director directed movies about dinosaurs that was released after the year 1992 but before 2007?"
)

query {'query': {'bool': {'filter': [{'bool': {'must': [{'match': {'metadata.genre': {'query': 'science fiction'}}}, {'range': {'metadata.year': {'gt': 1992}}}, {'range': {'metadata.year': {'lt': 2007}}}]}}], 'must': [{'multi_match': {'query': 'dinosaur', 'fields': ['text'], 'fuzziness': 'AUTO'}}]}}}
docs: [Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'year': 1993, 'rating': 7.7, 'genre': 'science fiction', 'director': 'Steven Spielberg', 'title': 'Jurassic Park'})]


'Steven Spielberg directed Jurassic Park in 1993.'
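
The `docs:` line in the output shows the single document that passed the filter. As a rough plain-Python approximation of what `_combine_documents` then hands to the LLM, the sketch below uses `str.format` in place of LangChain's `format_document` and a trimmed copy of the document template; `render_context` and `DOCUMENT_TEMPLATE` are hypothetical names.

```python
# Trimmed stand-in for DOCUMENT_PROMPT (assumption: surrounding blank lines dropped).
DOCUMENT_TEMPLATE = "---\ntitle: {title}\nyear: {year}\ndirector: {director}\n---"


def render_context(metadata_list, separator="\n\n"):
    """Render each document's metadata with the template and join the results."""
    return separator.join(DOCUMENT_TEMPLATE.format(**m) for m in metadata_list)


# Metadata of the document returned for the dinosaur question.
retrieved = [
    {
        "year": 1993,
        "rating": 7.7,
        "genre": "science fiction",
        "director": "Steven Spielberg",
        "title": "Jurassic Park",
    },
]

context = render_context(retrieved)
print(context)
```

Only the title, year and director reach the prompt. The summary text is used for retrieval but is not part of the context that the LLM sees.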

In [64]:
client.indices.delete(index="movies_self_query")

ObjectApiResponse({'acknowledged': True})