# JSON Load, Extraction, and Ingest with ELSER Example
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/ingestion-and-chunking/json-chunking-ingest.ipynb)

This notebook demonstrates how to load a JSON file, split its text into passages, and ingest those passages into Elasticsearch.

In this example we will:
- load the JSON using jq
- chunk the text with a LangChain document splitter
- ingest the chunks into Elasticsearch with the LangChain Elasticsearch vector store

We will also set up your Elasticsearch cluster with the ELSER model, so we can use it to embed the passages.

In [None]:
!pip install -qU langchain_community langchain "elasticsearch<9" tiktoken langchain-elasticsearch jq

## Connecting to Elasticsearch

In [2]:
from elasticsearch import Elasticsearch
from getpass import getpass

# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id
ELASTIC_CLOUD_ID = getpass("Elastic Cloud ID: ")

# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key
ELASTIC_API_KEY = getpass("Elastic API Key: ")

client = Elasticsearch(
    # For local development
    # "http://localhost:9200",
    # basic_auth=("elastic", "changeme")
    cloud_id=ELASTIC_CLOUD_ID,
    api_key=ELASTIC_API_KEY,
)

## Deploying ELSER

In [None]:
import time

model = ".elser_model_2"

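try:
    # Register the ELSER model so the cluster starts downloading it.
    client.ml.put_trained_model(model_id=model, input={"field_names": ["text_field"]})
except Exception:
    # Most likely the model was already created by an earlier run; the
    # polling loop below checks its actual state either way.
    pass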

while True:
    status = client.ml.get_trained_models(model_id=model, include="definition_status")

    if status["trained_model_configs"][0]["fully_defined"]:
        print(model + " is downloaded and ready to be deployed.")
        break
    else:
        print(model + " is downloading or not ready to be deployed.")
    time.sleep(5)

client.ml.start_trained_model_deployment(
    model_id=model, number_of_allocations=1, wait_for="starting"
)

while True:
    status = client.ml.get_trained_models_stats(
        model_id=model,
    )
    if status["trained_model_stats"][0]["deployment_stats"]["state"] == "started":
        print(model + " has been successfully deployed.")
        break
    else:
        print(model + " is currently being deployed.")
    time.sleep(5)
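
Both polling loops above run until the model reports the expected state, so they never stop if the download or deployment gets stuck. A minimal sketch of the same polling pattern with a time limit follows. `wait_until` and `fake_check` are hypothetical names introduced here for illustration; they are not part of the Elasticsearch client, and the timeout values are arbitrary.

```python
import time


def wait_until(check, timeout_s=600.0, interval_s=5.0):
    """Call check() repeatedly until it returns True or timeout_s elapses.

    Returns True if check() succeeded, False if the time limit was reached.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Stand-in for a real readiness check: it reports "ready" on the third call.
calls = {"n": 0}


def fake_check():
    calls["n"] += 1
    return calls["n"] >= 3


ready = wait_until(fake_check, timeout_s=1.0, interval_s=0.01)
print("ready:", ready, "after", calls["n"], "checks")
```

In this notebook, the check could wrap the same condition the deployment loop tests, for example a function that returns whether `client.ml.get_trained_models_stats(model_id=model)["trained_model_stats"][0]["deployment_stats"]["state"] == "started"`.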

## Loading a JSON file and chunking it into docs
This downloads the JSON dataset from the URL provided, saves it to a temporary file, and then splits the `content` text of each record into passage docs.
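
As a rough mental model for the loader configured in the cell below, `jq_schema=".[]"` selects every element of the top-level JSON array, `content_key="content"` picks the text of each element, and the metadata function copies the remaining fields. The sketch below imitates that mapping in plain Python. It is not LangChain's implementation, the two sample records are made up, and `to_documents` is a hypothetical helper.

```python
import json

# Two made-up records shaped like the fields the metadata function reads.
sample = json.loads(
    """
    [
      {"name": "Doc A", "content": "First body.", "category": "hr"},
      {"name": "Doc B", "content": "Second body.", "url": "https://example.com/b"}
    ]
    """
)

METADATA_KEYS = ("name", "summary", "url", "category", "updated_at")


def to_documents(records):
    """Turn each array element into a (text, metadata) pair."""
    documents = []
    for record in records:  # rough equivalent of jq_schema=".[]"
        text = record["content"]  # rough equivalent of content_key="content"
        # Missing fields become None, as with record.get(...) in metadata_func.
        metadata = {key: record.get(key) for key in METADATA_KEYS}
        documents.append((text, metadata))
    return documents


for text, metadata in to_documents(sample):
    print(text, "|", metadata["name"], "|", metadata["url"])
```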

In [7]:
from langchain_community.document_loaders import JSONLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from urllib.request import urlopen
import json

# Change the URL to the desired dataset
url = "https://raw.githubusercontent.com/elastic/elasticsearch-labs/main/datasets/workplace-documents.json"

response = urlopen(url)
data = json.load(response)

with open("temp.json", "w") as json_file:
    json.dump(data, json_file)


# Metadata function to extract metadata from the record
def metadata_func(record: dict, metadata: dict) -> dict:
    metadata["name"] = record.get("name")
    metadata["summary"] = record.get("summary")
    metadata["url"] = record.get("url")
    metadata["category"] = record.get("category")
    metadata["updated_at"] = record.get("updated_at")

    return metadata


# For more loaders https://python.langchain.com/docs/modules/data_connection/document_loaders/
# And 3rd party loaders https://python.langchain.com/docs/modules/data_connection/document_loaders/#third-party-loaders
loader = JSONLoader(
    file_path="temp.json",
    jq_schema=".[]",
    content_key="content",
    metadata_func=metadata_func,
)

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=512, chunk_overlap=256
)
docs = loader.load_and_split(text_splitter=text_splitter)
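
With `chunk_size=512` and `chunk_overlap=256`, consecutive chunks can share up to half of their tokens, which increases the number of passages to index in exchange for keeping context across chunk boundaries. The sketch below shows the overlap idea with fixed-size windows over a token list. It is a simplification: the real splitter cuts on separators and measures length with tiktoken, so its chunks will differ. `sliding_windows` is a hypothetical helper, and the small numbers are chosen only to keep the output readable.

```python
def sliding_windows(tokens, chunk_size, chunk_overlap):
    """Return fixed-size windows; neighbours share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if not tokens:
        return []
    step = chunk_size - chunk_overlap  # how far each window advances
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
        start += step
    return windows


# Ten stand-in tokens, windows of four tokens, two tokens shared per pair.
windows = sliding_windows(list(range(10)), chunk_size=4, chunk_overlap=2)
for window in windows:
    print(window)
```

Each window starts `chunk_size - chunk_overlap` tokens after the previous one, so a larger overlap means more windows for the same input.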

## Ingesting the passages into Elasticsearch
This will ingest the passage docs into the Elasticsearch index, under the specified INDEX_NAME.

In [None]:
from langchain_elasticsearch import ElasticsearchStore

INDEX_NAME = "json_chunked_index"

ElasticsearchStore.from_documents(
    docs,
    es_connection=client,
    index_name=INDEX_NAME,
    strategy=ElasticsearchStore.SparseVectorRetrievalStrategy(model_id=model),
    bulk_kwargs={
        "request_timeout": 180,
    },
)