notebooks/ingestion-and-chunking/json-chunking-ingest.ipynb

{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# JSON load, Extraction and Ingest with ELSER Example\n", "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/ingestion-and-chunking/json-chunking-ingest.ipynb)\n", "\n", "This workbook demonstrates how to load a JSON file, create passages and ingest into Elasticsearch. \n", "\n", "In this example we will:\n", "- load the JSON using jq\n", "- chunk the text with LangChain document splitter\n", "- ingest into Elasticsearch with LangChain Elasticsearch Vectorstore. \n", "\n", "We will also setup your Elasticsearch cluster with ELSER model, so we can use it to embed the passages." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "zQlYpYkI46Ff", "outputId": "83677846-8a6a-4b49-fde0-16d473778814" }, "outputs": [], "source": [ "!pip install -qU langchain_community langchain \"elasticsearch<9\" tiktoken langchain-elasticsearch jq" ] }, { "cell_type": "markdown", "metadata": { "id": "GCZR7-zK810e" }, "source": [ "## Connecting to Elasticsearch" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "id": "DofNZ2w25nIr" }, "outputs": [], "source": [ "from elasticsearch import Elasticsearch\n", "from getpass import getpass\n", "\n", "# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id\n", "ELASTIC_CLOUD_ID = getpass(\"Elastic Cloud ID: \")\n", "\n", "# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key\n", "ELASTIC_API_KEY = getpass(\"Elastic Api Key: \")\n", "\n", "client = Elasticsearch(\n", " # For local development\n", " # \"http://localhost:9200\",\n", " # basic_auth=(\"elastic\", \"changeme\")\n", " cloud_id=ELASTIC_CLOUD_ID,\n", " api_key=ELASTIC_API_KEY,\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "zv6hKYWr8-Mg" }, "source": [ "## Deploying ELSER" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "1U4ffD2K9BkJ" }, "outputs": [], "source": [ "import time\n", "\n", "model = \".elser_model_2\"\n", "\n", "try:\n", " client.ml.put_trained_model(model_id=model, input={\"field_names\": [\"text_field\"]})\n", "except:\n", " pass\n", "\n", "while True:\n", " status = client.ml.get_trained_models(model_id=model, include=\"definition_status\")\n", "\n", " if status[\"trained_model_configs\"][0][\"fully_defined\"]:\n", " print(model + \" is downloaded and ready to be deployed.\")\n", " break\n", " else:\n", " print(model + \" is downloading or not ready to be deployed.\")\n", " time.sleep(5)\n", "\n", "client.ml.start_trained_model_deployment(\n", " model_id=model, number_of_allocations=1, wait_for=\"starting\"\n", ")\n", "\n", "while True:\n", " status = client.ml.get_trained_models_stats(\n", " model_id=model,\n", " )\n", " if status[\"trained_model_stats\"][0][\"deployment_stats\"][\"state\"] == \"started\":\n", " print(model + \" has been successfully deployed.\")\n", " break\n", " else:\n", " print(model + \" is currently being deployed.\")\n", " time.sleep(5)" ] }, { "cell_type": "markdown", "metadata": { "id": "wqYXqJxn9JsA" }, "source": [ "## Loading a JSON file, creating chunks into docs\n", "This will load the webpage from the url provided, and then chunk the html text into passage docs." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": { "id": "7bN32vunqIk2" }, "outputs": [], "source": [ "from langchain_community.document_loaders import JSONLoader\n", "from langchain.text_splitter import RecursiveCharacterTextSplitter\n", "from urllib.request import urlopen\n", "import json\n", "\n", "# Change the URL to the desired dataset\n", "url = \"https://raw.githubusercontent.com/elastic/elasticsearch-labs/main/datasets/workplace-documents.json\"\n", "\n", "response = urlopen(url)\n", "data = json.load(response)\n", "\n", "with open(\"temp.json\", \"w\") as json_file:\n", " json.dump(data, json_file)\n", "\n", "\n", "# Metadata function to extract metadata from the record\n", "def metadata_func(record: dict, metadata: dict) -> dict:\n", " metadata[\"name\"] = record.get(\"name\")\n", " metadata[\"summary\"] = record.get(\"summary\")\n", " metadata[\"url\"] = record.get(\"url\")\n", " metadata[\"category\"] = record.get(\"category\")\n", " metadata[\"updated_at\"] = record.get(\"updated_at\")\n", "\n", " return metadata\n", "\n", "\n", "# For more loaders https://python.langchain.com/docs/modules/data_connection/document_loaders/\n", "# And 3rd party loaders https://python.langchain.com/docs/modules/data_connection/document_loaders/#third-party-loaders\n", "loader = JSONLoader(\n", " file_path=\"temp.json\",\n", " jq_schema=\".[]\",\n", " content_key=\"content\",\n", " metadata_func=metadata_func,\n", ")\n", "\n", "text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(\n", " chunk_size=512, chunk_overlap=256\n", ")\n", "docs = loader.load_and_split(text_splitter=text_splitter)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Ingesting the passages into Elasticsearch\n", "This will ingest the passage docs into the Elasticsearch index, under the specified INDEX_NAME." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "0xtdeIJI9N9-" }, "outputs": [], "source": [ "from langchain_elasticsearch import ElasticsearchStore\n", "\n", "INDEX_NAME = \"json_chunked_index\"\n", "\n", "ElasticsearchStore.from_documents(\n", " docs,\n", " es_connection=client,\n", " index_name=INDEX_NAME,\n", " strategy=ElasticsearchStore.SparseVectorRetrievalStrategy(model_id=model),\n", " bulk_kwargs={\n", " \"request_timeout\": 180,\n", " },\n", ")" ] } ], "metadata": { "colab": { "include_colab_link": true, "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.3" } }, "nbformat": 4, "nbformat_minor": 0 }