notebooks/langchain/self-query-retriever-examples/langchain-self-query-retriever.ipynb (409 lines of code) (raw):
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Self-querying retriever with elasticsearch and langchain\n",
"[](https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/langchain/notebooks/langchain/self-query-retriever-examples/langchain-self-query-retriever.ipynb)\n",
"\n",
"This workbook demonstrates example of Elasticsearch's [Self-query retriever](https://api.python.langchain.com/en/latest/retrievers/langchain.retrievers.self_query.base.SelfQueryRetriever.html) to convert unstructured query into a structured query and apply structured query to a vectorstore. \n",
"\n",
"Before we begin, we first split the documents into chunks with `langchain` and then using [`ElasticsearchStore.from_documents`](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.elasticsearch.ElasticsearchStore.html#langchain.vectorstores.elasticsearch.ElasticsearchStore.from_documents), we create a `vectorstore` and index data to elasticsearch.\n",
"\n",
"\n",
"We will then see few examples query demonstrating full power of elasticsearch powered self-query retriever.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Install packages and import modules\n"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.2\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.3.1\u001b[0m\n",
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n"
]
}
],
"source": [
"!python3 -m pip install -qU lark langchain langchain-elasticsearch openai\n",
"\n",
"from langchain.schema import Document\n",
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
"from langchain_elasticsearch import ElasticsearchStore\n",
"from langchain.llms import OpenAI\n",
"from langchain.retrievers.self_query.base import SelfQueryRetriever\n",
"from langchain.chains.query_constructor.base import AttributeInfo\n",
"from getpass import getpass"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create documents \n",
"Next, we will create list of documents with summary of movies using [langchain Schema Document](https://api.python.langchain.com/en/latest/schema/langchain.schema.document.Document.html), containing each document's `page_content` and `metadata` .\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [],
"source": [
"docs = [\n",
" Document(\n",
" page_content=\"A bunch of scientists bring back dinosaurs and mayhem breaks loose\",\n",
" metadata={\n",
" \"year\": 1993,\n",
" \"rating\": 7.7,\n",
" \"genre\": \"science fiction\",\n",
" \"director\": \"Steven Spielberg\",\n",
" \"title\": \"Jurassic Park\",\n",
" },\n",
" ),\n",
" Document(\n",
" page_content=\"Leo DiCaprio gets lost in a dream within a dream within a dream within a ...\",\n",
" metadata={\n",
" \"year\": 2010,\n",
" \"director\": \"Christopher Nolan\",\n",
" \"rating\": 8.2,\n",
" \"title\": \"Inception\",\n",
" },\n",
" ),\n",
" Document(\n",
" page_content=\"A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea\",\n",
" metadata={\n",
" \"year\": 2006,\n",
" \"director\": \"Satoshi Kon\",\n",
" \"rating\": 8.6,\n",
" \"title\": \"Paprika\",\n",
" },\n",
" ),\n",
" Document(\n",
" page_content=\"A bunch of normal-sized women are supremely wholesome and some men pine after them\",\n",
" metadata={\n",
" \"year\": 2019,\n",
" \"director\": \"Greta Gerwig\",\n",
" \"rating\": 8.3,\n",
" \"title\": \"Little Women\",\n",
" },\n",
" ),\n",
" Document(\n",
" page_content=\"Toys come alive and have a blast doing so\",\n",
" metadata={\n",
" \"year\": 1995,\n",
" \"genre\": \"animated\",\n",
" \"director\": \"John Lasseter\",\n",
" \"rating\": 8.3,\n",
" \"title\": \"Toy Story\",\n",
" },\n",
" ),\n",
" Document(\n",
" page_content=\"Three men walk into the Zone, three men walk out of the Zone\",\n",
" metadata={\n",
" \"year\": 1979,\n",
" \"rating\": 9.9,\n",
" \"director\": \"Andrei Tarkovsky\",\n",
" \"genre\": \"science fiction\",\n",
" \"rating\": 9.9,\n",
" \"title\": \"Stalker\",\n",
" },\n",
" ),\n",
"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Connect to Elasticsearch\n",
"\n",
"ℹ️ We're using an Elastic Cloud deployment of Elasticsearch for this notebook. If you don't have an Elastic Cloud deployment, sign up [here](https://cloud.elastic.co/registration?onboarding_token=vectorsearch&utm_source=github&utm_content=elasticsearch-labs-notebook) for a free trial. \n",
"\n",
"We'll use the **Cloud ID** to identify our deployment, because we are using Elastic Cloud deployment. To find the Cloud ID for your deployment, go to https://cloud.elastic.co/deployments and select your deployment.\n",
"\n",
"\n",
"We will use [ElasticsearchStore](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.elasticsearch.ElasticsearchStore.html) to connect to our elastic cloud deployment, This would help create and index data easily. We would also send list of documents that we created in the previous step."
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {},
"outputs": [],
"source": [
"# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id\n",
"ELASTIC_CLOUD_ID = getpass(\"Elastic Cloud ID: \")\n",
"\n",
"# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key\n",
"ELASTIC_API_KEY = getpass(\"Elastic Api Key: \")\n",
"\n",
"# https://platform.openai.com/api-keys\n",
"OPENAI_API_KEY = getpass(\"OpenAI API key: \")\n",
"\n",
"embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)\n",
"\n",
"\n",
"vectorstore = ElasticsearchStore.from_documents(\n",
" docs,\n",
" embeddings,\n",
" index_name=\"elasticsearch-self-query-demo\",\n",
" es_cloud_id=ELASTIC_CLOUD_ID,\n",
" es_api_key=ELASTIC_API_KEY,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup query retriever\n",
"\n",
"Next we will instantiate self-query retriever by providing a bit information about our document attributes and a short description about the document. \n",
"\n",
"We will then instantiate retriever with [SelfQueryRetriever.from_llm](https://api.python.langchain.com/en/latest/retrievers/langchain.retrievers.self_query.base.SelfQueryRetriever.html)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Add details about metadata fields\n",
"metadata_field_info = [\n",
" AttributeInfo(\n",
" name=\"genre\",\n",
" description=\"The genre of the movie. Can be either 'science fiction' or 'animated'.\",\n",
" type=\"string or list[string]\",\n",
" ),\n",
" AttributeInfo(\n",
" name=\"year\",\n",
" description=\"The year the movie was released\",\n",
" type=\"integer\",\n",
" ),\n",
" AttributeInfo(\n",
" name=\"director\",\n",
" description=\"The name of the movie director\",\n",
" type=\"string\",\n",
" ),\n",
" AttributeInfo(\n",
" name=\"rating\", description=\"A 1-10 rating for the movie\", type=\"float\"\n",
" ),\n",
"]\n",
"\n",
"document_content_description = \"Brief summary of a movie\"\n",
"\n",
"# Set up openAI llm with sampling temperature 0\n",
"llm = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)\n",
"\n",
"# instantiate retriever\n",
"retriever = SelfQueryRetriever.from_llm(\n",
" llm, vectorstore, document_content_description, metadata_field_info, verbose=True\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Test retriever with simple query\n",
"\n",
"We will test the retriever with a simple query: `What are some movies about dream`. \n",
"\n",
"The output shows all the relevant documents to the query."
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='Leo DiCaprio gets lost in a dream within a dream within a dream within a ...', metadata={'year': 2010, 'director': 'Christopher Nolan', 'rating': 8.2}),\n",
" Document(page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea', metadata={'year': 2006, 'director': 'Satoshi Kon', 'rating': 8.6}),\n",
" Document(page_content='Toys come alive and have a blast doing so', metadata={'year': 1995, 'genre': 'animated'}),\n",
" Document(page_content='A bunch of normal-sized women are supremely wholesome and some men pine after them', metadata={'year': 2019, 'director': 'Greta Gerwig', 'rating': 8.3})]"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# This example only specifies a relevant query\n",
"retriever.get_relevant_documents(\"What are some movies about dream\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Test retriever with simple query and filter\n",
"\n",
"We will now test the retriever with a query: `Has Andrei Tarkovsky directed any science fiction movies`. \n",
"\n",
"This query has a filter on the metadata `genre` and `director`. \n"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'year': 1979, 'rating': 9.9, 'director': 'Andrei Tarkovsky', 'genre': 'science fiction'})]"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"retriever.get_relevant_documents(\n",
" \"Has Andrei Tarkovsky directed any science fiction movies\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Instantiate retriever to filter k documents\n",
"\n",
"We will now instantiate retriever again to fetch k number of documents. We can do this my setting `enable_limit=True` when instantiating the retriever. \n",
"\n",
"We will then test retriever to filter k documents."
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
"retriever = SelfQueryRetriever.from_llm(\n",
" llm,\n",
" vectorstore,\n",
" document_content_description,\n",
" metadata_field_info,\n",
" enable_limit=True,\n",
" verbose=True,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Test the retriever to filter k documents\n",
"\n",
"We will now test the retriever with a query: `what are two movies about dream`. \n",
"\n",
"The output would show exactly `2` documents. "
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='Leo DiCaprio gets lost in a dream within a dream within a dream within a ...', metadata={'year': 2010, 'director': 'Christopher Nolan', 'rating': 8.2}),\n",
" Document(page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea', metadata={'year': 2006, 'director': 'Satoshi Kon', 'rating': 8.6})]"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"retriever.get_relevant_documents(\"what are two movies about dream\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Test retriever for complex queries\n",
"\n",
"We will try some complex queries with filters and `1 limit`.\n",
"\n",
"\n",
"Query: `Show that one movie which was about dream and was released after the year 1992 but before 2007?`. \n"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea', metadata={'year': 2006, 'director': 'Satoshi Kon', 'rating': 8.6})]"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"retriever.get_relevant_documents(\n",
" \"Show that one movie which was about dream and was released after the year 1992 but before 2007?\"\n",
")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.11.4 64-bit",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.3"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "b0fa6594d8f4cbf19f97940f81e996739fb7646882a419484c72d19e05852a7e"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}