notebooks/langchain/self-query-retriever-examples/chatbot-with-bm25-only-example.ipynb (425 lines of code) (raw):

{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# BM25 and Self-querying retriever with elasticsearch and LangChain\n", "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/langchain/notebooks/langchain/self-query-retriever-examples/chatbot-with-bm25-only-example.ipynb)\n", "\n", "This workbook demonstrates example of Elasticsearch's [Self-query retriever](https://api.python.langchain.com/en/latest/retrievers/langchain.retrievers.self_query.base.SelfQueryRetriever.html) to convert unstructured query into a structured query and we use this for a BM25 example. \n", "\n", "In this example:\n", "- we are going to ingest a sample dataset of movies outside of LangChain\n", "- Customise the retrieval strategy in ElasticsearchStore to use just BM25\n", "- use the self-query retrieval to transform question into a structured query\n", "- Use the documents and RAG strategy to answer the question " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Install packages\n" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.2\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.3.1\u001b[0m\n", "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n" ] } ], "source": [ "!python3 -m pip install -qU lark elasticsearch langchain langchain-elasticsearch openai" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sample Dataset" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [], "source": [ "docs = [\n", " {\n", " \"text\": \"A bunch of scientists bring back dinosaurs and mayhem breaks loose\",\n", " \"metadata\": {\n", " \"year\": 1993,\n", " \"rating\": 7.7,\n", " \"genre\": \"science fiction\",\n", " \"director\": \"Steven Spielberg\",\n", " \"title\": \"Jurassic Park\",\n", " },\n", " },\n", " {\n", " \"text\": \"Leo DiCaprio gets lost in a dream within a dream within a dream within a ...\",\n", " \"metadata\": {\n", " \"year\": 2010,\n", " \"director\": \"Christopher Nolan\",\n", " \"rating\": 8.2,\n", " \"title\": \"Inception\",\n", " },\n", " },\n", " {\n", " \"text\": \"A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea\",\n", " \"metadata\": {\n", " \"year\": 2006,\n", " \"director\": \"Satoshi Kon\",\n", " \"rating\": 8.6,\n", " \"title\": \"Paprika\",\n", " },\n", " },\n", " {\n", " \"text\": \"A bunch of normal-sized women are supremely wholesome and some men pine after them\",\n", " \"metadata\": {\n", " \"year\": 2019,\n", " \"director\": \"Greta Gerwig\",\n", " \"rating\": 8.3,\n", " \"title\": \"Little Women\",\n", " },\n", " },\n", " {\n", " \"text\": \"Toys come alive and have a blast doing so\",\n", " \"metadata\": {\n", " \"year\": 1995,\n", " \"genre\": \"animated\",\n", " \"director\": \"John Lasseter\",\n", " \"rating\": 8.3,\n", " \"title\": \"Toy Story\",\n", " },\n", " },\n", " {\n", " \"text\": \"Three men walk into the Zone, three men walk out of the Zone\",\n", " \"metadata\": {\n", " \"year\": 1979,\n", " \"rating\": 9.9,\n", " \"director\": \"Andrei Tarkovsky\",\n", " \"genre\": \"science fiction\",\n", " \"rating\": 9.9,\n", " \"title\": \"Stalker\",\n", " },\n", " },\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Connect to Elasticsearch\n", "\n", "ℹ️ We're using an Elastic Cloud deployment of Elasticsearch for this notebook. If you don't have an Elastic Cloud deployment, sign up [here](https://cloud.elastic.co/registration?onboarding_token=vectorsearch&utm_source=github&utm_content=elasticsearch-labs-notebook) for a free trial. \n", "\n", "We'll use the **Cloud ID** to identify our deployment, because we are using Elastic Cloud deployment. To find the Cloud ID for your deployment, go to https://cloud.elastic.co/deployments and select your deployment.\n", "\n", "\n", "We will use [ElasticsearchStore](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.elasticsearch.ElasticsearchStore.html) to connect to our elastic cloud deployment, This would help create and index data easily. We would also send list of documents that we created in the previous step." ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [], "source": [ "from elasticsearch import Elasticsearch\n", "from getpass import getpass\n", "\n", "# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id\n", "ELASTIC_CLOUD_ID = getpass(\"Elastic Cloud ID: \")\n", "\n", "# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key\n", "ELASTIC_API_KEY = getpass(\"Elastic Api Key: \")\n", "\n", "# https://platform.openai.com/api-keys\n", "OPENAI_API_KEY = getpass(\"OpenAI API key: \")\n", "\n", "client = Elasticsearch(\n", " cloud_id=ELASTIC_CLOUD_ID,\n", " api_key=ELASTIC_API_KEY,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Indexing data into Elasticsearch\n", "\n", "We have chosen to index the data outside of Langchain to demonstrate how its possible to use Langchain for RAG and use the self-query retrieveral on any Elasticsearch index." ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [], "source": [ "from elasticsearch import helpers\n", "\n", "# create the index\n", "client.indices.create(index=\"movies_self_query\")\n", "\n", "operations = [\n", " {\n", " \"_index\": \"movies_self_query\",\n", " \"_id\": i,\n", " \"text\": doc[\"text\"],\n", " \"metadata\": doc[\"metadata\"],\n", " }\n", " for i, doc in enumerate(docs)\n", "]\n", "\n", "# Add the documents to the index directly\n", "response = helpers.bulk(\n", " client,\n", " operations,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup query retriever\n", "\n", "Next we will instantiate self-query retriever by providing a bit information about our document attributes and a short description about the document. \n", "\n", "We will then instantiate retriever with [SelfQueryRetriever.from_llm](https://api.python.langchain.com/en/latest/retrievers/langchain.retrievers.self_query.base.SelfQueryRetriever.html)" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [], "source": [ "from typing import List, Union\n", "from langchain.retrievers.self_query.base import SelfQueryRetriever\n", "from langchain.chains.query_constructor.base import AttributeInfo\n", "from langchain.llms import OpenAI\n", "from langchain_elasticsearch import ApproxRetrievalStrategy, ElasticsearchStore\n", "\n", "# Add details about metadata fields\n", "metadata_field_info = [\n", " AttributeInfo(\n", " name=\"genre\",\n", " description=\"The genre of the movie. Can be either 'science fiction' or 'animated'.\",\n", " type=\"string or list[string]\",\n", " ),\n", " AttributeInfo(\n", " name=\"year\",\n", " description=\"The year the movie was released\",\n", " type=\"integer\",\n", " ),\n", " AttributeInfo(\n", " name=\"director\",\n", " description=\"The name of the movie director\",\n", " type=\"string\",\n", " ),\n", " AttributeInfo(\n", " name=\"rating\", description=\"A 1-10 rating for the movie\", type=\"float\"\n", " ),\n", "]\n", "\n", "document_content_description = \"Brief summary of a movie\"\n", "\n", "# Set up openAI llm with sampling temperature 0\n", "llm = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)\n", "\n", "\n", "class BM25RetrievalStrategy(ApproxRetrievalStrategy):\n", "\n", " def __init__(self):\n", " pass\n", "\n", " def query(\n", " self,\n", " query: Union[str, None],\n", " filter: List[dict],\n", " **kwargs,\n", " ):\n", "\n", " if query:\n", " query_clause = [\n", " {\n", " \"multi_match\": {\n", " \"query\": query,\n", " \"fields\": [\"text\"],\n", " \"fuzziness\": \"AUTO\",\n", " }\n", " }\n", " ]\n", " else:\n", " query_clause = []\n", "\n", " bm25_query = {\n", " \"query\": {\"bool\": {\"filter\": filter, \"must\": query_clause}},\n", " }\n", "\n", " print(\"query\", bm25_query)\n", "\n", " return bm25_query\n", "\n", "\n", "vectorstore = ElasticsearchStore(\n", " index_name=\"movies_self_query\",\n", " es_connection=client,\n", " strategy=BM25RetrievalStrategy(),\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## BM25 Only Retriever \n", "One option is to customise the query to use BM25 only retrieval method. We can do this by overriding the `custom_query` function, specifying the query to use only `multi_match`.\n", "\n", "In the example below, the self-query retriever is using the LLM to transform the question into a keyword and filter query (query: dreams, filter: year range). The custom query is then used to perform a BM25 based query on the keyword query and filter query.\n", "\n", "This means that you dont have to vectorise all the documents if you want to perform a question / answerinf use-case on an existing Elasticsearch index. " ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "query {'query': {'bool': {'filter': [{'bool': {'must': [{'match': {'metadata.genre': {'query': 'science fiction'}}}, {'range': {'metadata.year': {'gt': 1992}}}, {'range': {'metadata.year': {'lt': 2007}}}]}}], 'must': [{'multi_match': {'query': 'dinosaur', 'fields': ['text'], 'fuzziness': 'AUTO'}}]}}}\n", "docs: [Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'year': 1993, 'rating': 7.7, 'genre': 'science fiction', 'director': 'Steven Spielberg', 'title': 'Jurassic Park'})]\n" ] }, { "data": { "text/plain": [ "'Steven Spielberg directed Jurassic Park in 1993.'" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from langchain.schema.runnable import RunnableParallel, RunnablePassthrough\n", "from langchain.prompts import ChatPromptTemplate, PromptTemplate\n", "from langchain.schema import format_document\n", "\n", "retriever = SelfQueryRetriever.from_llm(\n", " llm, vectorstore, document_content_description, metadata_field_info, verbose=True\n", ")\n", "\n", "LLM_CONTEXT_PROMPT = ChatPromptTemplate.from_template(\n", " \"\"\"\n", "Use the following context movies that matched the user question. Use the movies below only to answer the user's question.\n", "\n", "If you don't know the answer, just say that you don't know, don't try to make up an answer.\n", "\n", "----\n", "{context}\n", "----\n", "Question: {question}\n", "Answer:\n", "\"\"\"\n", ")\n", "\n", "DOCUMENT_PROMPT = PromptTemplate.from_template(\n", " \"\"\"\n", "---\n", "title: {title} \n", "year: {year} \n", "director: {director} \n", "---\n", "\"\"\"\n", ")\n", "\n", "\n", "def _combine_documents(\n", " docs, document_prompt=DOCUMENT_PROMPT, document_separator=\"\\n\\n\"\n", "):\n", " print(\"docs:\", docs)\n", " doc_strings = [format_document(doc, document_prompt) for doc in docs]\n", " return document_separator.join(doc_strings)\n", "\n", "\n", "_context = RunnableParallel(\n", " context=retriever | _combine_documents,\n", " question=RunnablePassthrough(),\n", ")\n", "\n", "chain = _context | LLM_CONTEXT_PROMPT | llm\n", "\n", "chain.invoke(\n", " \"Which director directed movies about dinosaurs that was released after the year 1992 but before 2007?\"\n", ")" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ObjectApiResponse({'acknowledged': True})" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "client.indices.delete(index=\"movies_self_query\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.11.4 64-bit", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.3" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "b0fa6594d8f4cbf19f97940f81e996739fb7646882a419484c72d19e05852a7e" } } }, "nbformat": 4, "nbformat_minor": 2 }