# Similarity search with LangChain and SparseVectorRetrievalStrategy (ELSER model)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/langchain/langchain-vector-store-using-elser.ipynb)


This workbook demonstrates similarity search using [SparseVectorRetrievalStrategy](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.elasticsearch.SparseRetrievalStrategy.html#langchain.vectorstores.elasticsearch.SparseRetrievalStrategy) (ELSER). First, we split the documents into chunks using `langchain`, then index them into Elasticsearch through [`ElasticsearchStore.from_documents`](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.elasticsearch.ElasticsearchStore.html#langchain.vectorstores.elasticsearch.ElasticsearchStore.from_documents).

With the [SparseVectorRetrievalStrategy](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.elasticsearch.SparseRetrievalStrategy.html#langchain.vectorstores.elasticsearch.SparseRetrievalStrategy), each document is converted into tokens, which are stored in the `vector` field with the datatype `rank_features`. Inference is therefore handled within Elasticsearch.
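
To make the idea of a sparse, token-based representation more concrete, here is a toy sketch. It assumes that each text is represented as a mapping from tokens to weights, and that relevance grows with the weights of the tokens a query and a document share. The function name, tokens and weights are made up for illustration, and the way Elasticsearch actually scores `rank_features` fields may differ in detail.

```python
def sparse_dot(query_tokens, doc_tokens):
    """Sum the products of weights for tokens present in both mappings."""
    return sum(
        weight * doc_tokens[token]
        for token, weight in query_tokens.items()
        if token in doc_tokens
    )


# Hypothetical token weights, for illustration only (not real ELSER output).
query = {"salary": 2.0, "pay": 1.0, "bonus": 0.5}
doc_a = {"salary": 1.0, "band": 1.0, "pay": 0.5}
doc_b = {"vacation": 2.0, "leave": 1.0}

score_a = sparse_dot(query, doc_a)  # shares "salary" and "pay" with the query
score_b = sparse_dot(query, doc_b)  # shares no tokens with the query
print(score_a, score_b)
```

A document that shares no tokens with the query contributes nothing to the sum, which is why the model's expansion of both sides into related tokens matters for recall.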

The `similarity_search` method performs semantic search by building a `text_expansion` query from the query text.
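
As a rough illustration of what such a request can look like, the sketch below builds a `text_expansion` query body as a plain Python dictionary. The helper name `build_text_expansion_query` is hypothetical, and the field name `vector.tokens` is an assumption about where the tokens live in the index; check your own mapping before relying on it.

```python
import json


def build_text_expansion_query(
    query_text, model_id=".elser_model_2", field="vector.tokens"
):
    """Return a search body that asks the given model to expand the query text.

    `field` is an assumed field name; adjust it to match your index mapping.
    """
    return {
        "query": {
            "text_expansion": {
                field: {
                    "model_id": model_id,
                    "model_text": query_text,
                }
            }
        }
    }


body = build_text_expansion_query("How does the compensation work?")
print(json.dumps(body, indent=2))
```

The dictionary is only printed here; nothing is sent to a cluster.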

We will then see how to filter on metadata within the query.




## Install packages and import modules


In [None]:
# install packages
!python3 -m pip install -qU langchain langchain-elasticsearch "elasticsearch<9" openai tiktoken

# import modules
from getpass import getpass
from langchain_elasticsearch import ElasticsearchStore
from urllib.request import urlopen
from langchain.text_splitter import RecursiveCharacterTextSplitter
import json

## Connect to Elasticsearch

ℹ️ We're using an Elastic Cloud deployment of Elasticsearch for this notebook. If you don't have an Elastic Cloud deployment, sign up [here](https://cloud.elastic.co/registration?onboarding_token=vectorsearch&utm_source=github&utm_content=elasticsearch-labs-notebook) for a free trial. 

We'll use the **Cloud ID** to identify our deployment. To find the Cloud ID for your deployment, go to https://cloud.elastic.co/deployments and select your deployment.


We will use [ElasticsearchStore](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.elasticsearch.ElasticsearchStore.html) to connect to our Elastic Cloud deployment, which makes it easy to create an index and index data. We will use text embeddings from the ELSER model `.elser_model_2`.

In [2]:
# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id
ELASTIC_CLOUD_ID = getpass("Elastic Cloud ID: ")

# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key
ELASTIC_API_KEY = getpass("Elastic API Key: ")

vector_store = ElasticsearchStore(
    es_cloud_id=ELASTIC_CLOUD_ID,
    es_api_key=ELASTIC_API_KEY,
    index_name="workplace_index",
)

## Download the dataset 

Let's download the sample dataset and deserialize the JSON into a list of documents.

In [3]:
url = "https://raw.githubusercontent.com/elastic/elasticsearch-labs/main/example-apps/chatbot-rag-app/data/data.json"

response = urlopen(url)

workplace_docs = json.loads(response.read())

## Split Documents into Passages


We will chunk these documents into passages of up to 512 tokens, with an overlap of 256 tokens, using a simple splitter.

In [4]:
metadata = []
content = []

for doc in workplace_docs:
    content.append(doc["content"])
    metadata.append(
        {
            "name": doc["name"],
            "summary": doc["summary"],
            "rolePermissions": doc["rolePermissions"],
        }
    )

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=512, chunk_overlap=256
)
docs = text_splitter.create_documents(content, metadatas=metadata)
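
To see how a chunk size and an overlap interact, the toy sketch below slides a fixed-size window over a list of stand-in tokens. It only illustrates the size and overlap arithmetic: the function `window_tokens` is hypothetical, and `RecursiveCharacterTextSplitter` itself splits on separators and merges the pieces, so its real chunk boundaries will differ.

```python
def window_tokens(tokens, chunk_size, chunk_overlap):
    """Split `tokens` into windows of `chunk_size` that share `chunk_overlap` items."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start : start + chunk_size])
        # Stop once a window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Ten stand-in "tokens"; integers keep the overlap easy to see.
tokens = list(range(10))
chunks = window_tokens(tokens, chunk_size=4, chunk_overlap=2)
for chunk in chunks:
    print(chunk)
```

Each window starts `chunk_size - chunk_overlap` tokens after the previous one, so a larger overlap produces more chunks for the same input.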

## Index data into Elasticsearch

Next, we will index data into Elasticsearch using [ElasticsearchStore.from_documents](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.elasticsearch.ElasticsearchStore.html#langchain.vectorstores.elasticsearch.ElasticsearchStore.from_documents). We will use the Cloud ID, API key and index name values set in the `Connect to Elasticsearch` step.

In this call, we will set `strategy` to [ElasticsearchStore.SparseVectorRetrievalStrategy()](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.elasticsearch.SparseRetrievalStrategy.html#langchain.vectorstores.elasticsearch.SparseRetrievalStrategy).

Note: Before we begin indexing, ensure you have [downloaded and deployed the ELSER model](https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-elser.html#download-deploy-elser) in your deployment and that it is running on an ML node.


In [6]:
documents = vector_store.from_documents(
    docs,
    es_cloud_id=ELASTIC_CLOUD_ID,
    es_api_key=ELASTIC_API_KEY,
    index_name="workplace_index",
    strategy=ElasticsearchStore.SparseVectorRetrievalStrategy(
        model_id=".elser_model_2"
    ),
    bulk_kwargs={
        "request_timeout": 60,
    },
)

## Results function
Next, we will create a small function that prints the results of our query in a human-readable form. We will use this function in our examples to display the results.

In [7]:
def showResults(output):
    # Print the number of hits, then each returned document.
    print("Total results: ", len(output))
    for doc in output:
        print(doc)
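
If the full document dump is too verbose, a more compact display is possible. The sketch below is one option, assuming each result exposes `page_content` and `metadata` the way LangChain documents do, and that the metadata carries the `name` key set during splitting. `summarize_results` and `FakeDocument` are hypothetical names; the stand-in class exists only so the example runs without a cluster.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Stand-in for a LangChain document, used only so the sketch runs offline."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def summarize_results(output, width=80):
    """Return one line per result: rank, document name, and a short snippet."""
    lines = []
    for position, doc in enumerate(output, start=1):
        name = doc.metadata.get("name", "<unnamed>")
        # Collapse newlines and repeated spaces, then truncate to `width`.
        snippet = " ".join(doc.page_content.split())[:width]
        lines.append(f"{position}. {name}: {snippet}")
    return lines


sample = [
    FakeDocument(
        "Compensation Bands:\nBased on the job levels",
        {"name": "Example Doc"},
    )
]
for line in summarize_results(sample):
    print(line)
```

The same function could be called with the list returned by `similarity_search` in place of `sample`.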

## Querying the dataset with similarity_search

Now that we have indexed our sample data into Elasticsearch, we will perform a similarity search for the query `How does the compensation work?`. By default, `similarity_search` returns the top `4` documents.

In [8]:
results = documents.similarity_search("How does the compensation work?")
showResults(results)

Total results:  4
page_content='Compensation Bands:
Based on the job levels, the following compensation bands have been established:
a. Entry-Level Band: This band encompasses salary ranges for employees in entry-level positions. It aims to provide competitive compensation for individuals starting their careers within the company.

b. Intermediate-Level Band: This band covers salary ranges for employees who have gained moderate experience and expertise in their respective roles. It rewards employees for their growing skill set and contributions.

c. Senior-Level Band: The senior-level band includes salary ranges for experienced employees who have attained advanced skills and have a proven track record of delivering results. It reflects the increased responsibilities and expectations placed upon these individuals.

d. Leadership-Level Band: This band comprises salary ranges for managers and team leaders responsible for guiding and overseeing their respective teams. It considers their le


## Querying the dataset to show the top 10 documents

Now we will set `k=10` and run the same query to see the top 10 documents.


In [9]:
results = documents.similarity_search("How does the compensation work?", k=10)
showResults(results)

Total results:  10
page_content='Compensation Bands:
Based on the job levels, the following compensation bands have been established:
a. Entry-Level Band: This band encompasses salary ranges for employees in entry-level positions. It aims to provide competitive compensation for individuals starting their careers within the company.

b. Intermediate-Level Band: This band covers salary ranges for employees who have gained moderate experience and expertise in their respective roles. It rewards employees for their growing skill set and contributions.

c. Senior-Level Band: The senior-level band includes salary ranges for experienced employees who have attained advanced skills and have a proven track record of delivering results. It reflects the increased responsibilities and expectations placed upon these individuals.

d. Leadership-Level Band: This band comprises salary ranges for managers and team leaders responsible for guiding and overseeing their respective teams. It considers their l

## Querying the dataset with metadata filtering
We will now add a metadata filter at query time, so that only documents whose `rolePermissions` matches `manager` are returned.

In [10]:
results = documents.similarity_search(
    "How does the compensation work",
    filter=[{"match": {"metadata.rolePermissions": "manager"}}],
)
showResults(results)

Total results:  4
page_content='Compensation Bands:
Based on the job levels, the following compensation bands have been established:
a. Entry-Level Band: This band encompasses salary ranges for employees in entry-level positions. It aims to provide competitive compensation for individuals starting their careers within the company.

b. Intermediate-Level Band: This band covers salary ranges for employees who have gained moderate experience and expertise in their respective roles. It rewards employees for their growing skill set and contributions.

c. Senior-Level Band: The senior-level band includes salary ranges for experienced employees who have attained advanced skills and have a proven track record of delivering results. It reflects the increased responsibilities and expectations placed upon these individuals.

d. Leadership-Level Band: This band comprises salary ranges for managers and team leaders responsible for guiding and overseeing their respective teams. It considers their le
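
One way to sanity-check a filtered search is to inspect the metadata of the returned documents. The sketch below assumes `rolePermissions` is stored either as a list of role names or as a single string; `all_match_role` is a hypothetical helper, and `SimpleNamespace` objects stand in for real results so the example runs offline. Note that this exact-membership check is stricter than an analyzed `match` query.

```python
from types import SimpleNamespace


def all_match_role(results, role):
    """Return True if every result lists `role` in its rolePermissions metadata."""
    for doc in results:
        roles = doc.metadata.get("rolePermissions", [])
        # Accept either a single string or a list of role names.
        if isinstance(roles, str):
            roles = [roles]
        if role not in roles:
            return False
    return True


# Stand-ins for search results; only the `metadata` attribute is needed here.
fake_results = [
    SimpleNamespace(metadata={"rolePermissions": ["demo", "manager"]}),
    SimpleNamespace(metadata={"rolePermissions": "manager"}),
]
ok = all_match_role(fake_results, "manager")
print(ok)
```

With real data, `fake_results` would be replaced by the list returned from the filtered `similarity_search` call.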