{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Self-querying retriever with Elasticsearch and LangChain\n", "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/langchain/notebooks/langchain/self-query-retriever-examples/langchain-self-query-retriever.ipynb)\n", "\n", "This notebook demonstrates how to use LangChain's [Self-query retriever](https://api.python.langchain.com/en/latest/retrievers/langchain.retrievers.self_query.base.SelfQueryRetriever.html) with Elasticsearch to convert an unstructured query into a structured query and apply that structured query to a vectorstore. \n", "\n", "Before we begin, we create a list of documents with `langchain` and then use [`ElasticsearchStore.from_documents`](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.elasticsearch.ElasticsearchStore.html#langchain.vectorstores.elasticsearch.ElasticsearchStore.from_documents) to create a `vectorstore` and index the data into Elasticsearch.\n", "\n", "\n", "We will then run a few example queries that show what the Elasticsearch-powered self-query retriever can do.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Install packages and import modules\n" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "!python3 -m pip install -qU lark langchain langchain-elasticsearch openai\n", "\n", "from langchain.schema import Document\n", "from 
langchain.embeddings.openai import OpenAIEmbeddings\n", "from langchain_elasticsearch import ElasticsearchStore\n", "from langchain.llms import OpenAI\n", "from langchain.retrievers.self_query.base import SelfQueryRetriever\n", "from langchain.chains.query_constructor.base import AttributeInfo\n", "from getpass import getpass" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create documents \n", "Next, we create a list of documents, each summarizing a movie, using the [LangChain schema `Document`](https://api.python.langchain.com/en/latest/schema/langchain.schema.document.Document.html). Each document has a `page_content` and `metadata`.\n", "\n" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [], "source": [ "docs = [\n", "    Document(\n", "        page_content=\"A bunch of scientists bring back dinosaurs and mayhem breaks loose\",\n", "        metadata={\n", "            \"year\": 1993,\n", "            \"rating\": 7.7,\n", "            \"genre\": \"science fiction\",\n", "            \"director\": \"Steven Spielberg\",\n", "            \"title\": \"Jurassic Park\",\n", "        },\n", "    ),\n", "    Document(\n", "        page_content=\"Leo DiCaprio gets lost in a dream within a dream within a dream within a ...\",\n", "        metadata={\n", "            \"year\": 2010,\n", "            \"director\": \"Christopher Nolan\",\n", "            \"rating\": 8.2,\n", "            \"title\": \"Inception\",\n", "        },\n", "    ),\n", "    Document(\n", "        page_content=\"A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea\",\n", "        metadata={\n", "            \"year\": 2006,\n", "            \"director\": \"Satoshi Kon\",\n", "            \"rating\": 8.6,\n", "            \"title\": \"Paprika\",\n", "        },\n", "    ),\n", "    Document(\n", "        page_content=\"A bunch of normal-sized women are supremely wholesome and some men pine after them\",\n", "        metadata={\n", "            \"year\": 2019,\n", "            \"director\": \"Greta Gerwig\",\n", "            \"rating\": 8.3,\n", "            \"title\": \"Little Women\",\n", "        },\n", "    ),\n", "    Document(\n", "        page_content=\"Toys come alive and have a 
blast doing so\",\n", "        metadata={\n", "            \"year\": 1995,\n", "            \"genre\": \"animated\",\n", "            \"director\": \"John Lasseter\",\n", "            \"rating\": 8.3,\n", "            \"title\": \"Toy Story\",\n", "        },\n", "    ),\n", "    Document(\n", "        page_content=\"Three men walk into the Zone, three men walk out of the Zone\",\n", "        metadata={\n", "            \"year\": 1979,\n", "            \"rating\": 9.9,\n", "            \"director\": \"Andrei Tarkovsky\",\n", "            \"genre\": \"science fiction\",\n", "            \"title\": \"Stalker\",\n", "        },\n", "    ),\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Connect to Elasticsearch\n", "\n", "ℹ️ We're using an Elastic Cloud deployment of Elasticsearch for this notebook. If you don't have an Elastic Cloud deployment, sign up [here](https://cloud.elastic.co/registration?onboarding_token=vectorsearch&utm_source=github&utm_content=elasticsearch-labs-notebook) for a free trial. \n", "\n", "Because we are using an Elastic Cloud deployment, we'll use the **Cloud ID** to identify it. To find the Cloud ID for your deployment, go to https://cloud.elastic.co/deployments and select your deployment.\n", "\n", "\n", "We will use [ElasticsearchStore](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.elasticsearch.ElasticsearchStore.html) to connect to our Elastic Cloud deployment, which makes it easy to create an index and index data into it. We will also pass in the list of documents that we created in the previous step." 
] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [], "source": [ "# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id\n", "ELASTIC_CLOUD_ID = getpass(\"Elastic Cloud ID: \")\n", "\n", "# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key\n", "ELASTIC_API_KEY = getpass(\"Elastic API key: \")\n", "\n", "# https://platform.openai.com/api-keys\n", "OPENAI_API_KEY = getpass(\"OpenAI API key: \")\n", "\n", "embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)\n", "\n", "\n", "vectorstore = ElasticsearchStore.from_documents(\n", "    docs,\n", "    embeddings,\n", "    index_name=\"elasticsearch-self-query-demo\",\n", "    es_cloud_id=ELASTIC_CLOUD_ID,\n", "    es_api_key=ELASTIC_API_KEY,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Set up query retriever\n", "\n", "Next, we describe the metadata attributes of our documents and give a short description of the document content. \n", "\n", "We then instantiate the retriever with [SelfQueryRetriever.from_llm](https://api.python.langchain.com/en/latest/retrievers/langchain.retrievers.self_query.base.SelfQueryRetriever.html)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Add details about metadata fields\n", "metadata_field_info = [\n", "    AttributeInfo(\n", "        name=\"genre\",\n", "        description=\"The genre of the movie. 
Can be either 'science fiction' or 'animated'.\",\n", "        type=\"string or list[string]\",\n", "    ),\n", "    AttributeInfo(\n", "        name=\"year\",\n", "        description=\"The year the movie was released\",\n", "        type=\"integer\",\n", "    ),\n", "    AttributeInfo(\n", "        name=\"director\",\n", "        description=\"The name of the movie director\",\n", "        type=\"string\",\n", "    ),\n", "    AttributeInfo(\n", "        name=\"rating\", description=\"A 1-10 rating for the movie\", type=\"float\"\n", "    ),\n", "]\n", "\n", "document_content_description = \"Brief summary of a movie\"\n", "\n", "# Set up the OpenAI LLM with sampling temperature 0\n", "llm = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)\n", "\n", "# Instantiate the retriever\n", "retriever = SelfQueryRetriever.from_llm(\n", "    llm, vectorstore, document_content_description, metadata_field_info, verbose=True\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Test retriever with simple query\n", "\n", "We will test the retriever with a simple query: `What are some movies about dream`. \n", "\n", "This query has no metadata filter, so the output shows the documents that are semantically closest to the query." 
] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Document(page_content='Leo DiCaprio gets lost in a dream within a dream within a dream within a ...', metadata={'year': 2010, 'director': 'Christopher Nolan', 'rating': 8.2}),\n", " Document(page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea', metadata={'year': 2006, 'director': 'Satoshi Kon', 'rating': 8.6}),\n", " Document(page_content='Toys come alive and have a blast doing so', metadata={'year': 1995, 'genre': 'animated'}),\n", " Document(page_content='A bunch of normal-sized women are supremely wholesome and some men pine after them', metadata={'year': 2019, 'director': 'Greta Gerwig', 'rating': 8.3})]" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# This example only specifies a relevant query\n", "retriever.get_relevant_documents(\"What are some movies about dream\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Test retriever with simple query and filter\n", "\n", "We will now test the retriever with a query: `Has Andrei Tarkovsky directed any science fiction movies`. \n", "\n", "This query has a filter on the metadata `genre` and `director`. 
\n" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'year': 1979, 'rating': 9.9, 'director': 'Andrei Tarkovsky', 'genre': 'science fiction'})]" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "retriever.get_relevant_documents(\n", "    \"Has Andrei Tarkovsky directed any science fiction movies\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Instantiate retriever to limit results to k documents\n", "\n", "We will now instantiate the retriever again so that a query can specify the number of documents (k) to return. We can do this by setting `enable_limit=True` when instantiating the retriever. \n", "\n", "We will then test the retriever with a query that asks for k documents." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "retriever = SelfQueryRetriever.from_llm(\n", "    llm,\n", "    vectorstore,\n", "    document_content_description,\n", "    metadata_field_info,\n", "    enable_limit=True,\n", "    verbose=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Test the retriever with a limit of k documents\n", "\n", "We will now test the retriever with a query: `what are two movies about dream`. \n", "\n", "The output should contain exactly `2` documents. 
" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Document(page_content='Leo DiCaprio gets lost in a dream within a dream within a dream within a ...', metadata={'year': 2010, 'director': 'Christopher Nolan', 'rating': 8.2}),\n", " Document(page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea', metadata={'year': 2006, 'director': 'Satoshi Kon', 'rating': 8.6})]" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "retriever.get_relevant_documents(\"what are two movies about dream\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Test retriever for complex queries\n", "\n", "We will try some complex queries with filters and `1 limit`.\n", "\n", "\n", "Query: `Show that one movie which was about dream and was released after the year 1992 but before 2007?`. \n" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Document(page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea', metadata={'year': 2006, 'director': 'Satoshi Kon', 'rating': 8.6})]" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "retriever.get_relevant_documents(\n", " \"Show that one movie which was about dream and was released after the year 1992 but before 2007?\"\n", ")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.11.4 64-bit", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.3" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": 
"b0fa6594d8f4cbf19f97940f81e996739fb7646882a419484c72d19e05852a7e" } } }, "nbformat": 4, "nbformat_minor": 2 }