# PDF Extraction and Ingest with ELSER Example
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/ingestion-and-chunking/pdf-chunking-ingest.ipynb)

This notebook demonstrates how to extract the contents of a single PDF, split the text into passages, and ingest them into Elasticsearch.

In this example we will:
- load the PDF using pypdf
- chunk the text with a LangChain text splitter
- ingest the passages into Elasticsearch with the LangChain Elasticsearch vector store

We will also set up your Elasticsearch cluster with the ELSER model, so we can use it to embed the passages.

In [None]:
!pip install -qU  pypdf langchain_community langchain "elasticsearch<9" tiktoken langchain-elasticsearch

## Connecting to Elasticsearch

In [16]:
from elasticsearch import Elasticsearch
from getpass import getpass

# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id
ELASTIC_CLOUD_ID = getpass("Elastic Cloud ID: ")

# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key
ELASTIC_API_KEY = getpass("Elastic API Key: ")

client = Elasticsearch(
    # For local development
    # "http://localhost:9200",
    # basic_auth=("elastic", "changeme")
    cloud_id=ELASTIC_CLOUD_ID,
    api_key=ELASTIC_API_KEY,
)
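
Before deploying the model, it can help to confirm that the client actually reaches the cluster. The following is a minimal sketch, assuming the `client` object created above. `cluster_version` is a hypothetical helper name introduced here for illustration; `info()` is the client call that returns basic cluster metadata.

```python
def cluster_version(es_client):
    """Return the Elasticsearch version string reported by the cluster."""
    # info() sends a request to the cluster root and raises an exception
    # if the cluster is unreachable or the credentials are rejected.
    info = es_client.info()
    return info["version"]["number"]


# Usage with the client created above (requires a reachable cluster):
# print(cluster_version(client))
```

If this call fails, check the Cloud ID and API key before moving on to the model deployment.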

## Deploying ELSER

In [None]:
import time

model = ".elser_model_2"

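try:
    client.ml.put_trained_model(model_id=model, input={"field_names": ["text_field"]})
except Exception as e:
    # Most likely the model has already been created; report and continue.
    print(f"Skipping model creation: {e}")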

while True:
    status = client.ml.get_trained_models(model_id=model, include="definition_status")

    if status["trained_model_configs"][0]["fully_defined"]:
        print(model + " is downloaded and ready to be deployed.")
        break
    else:
        print(model + " is downloading or not ready to be deployed.")
    time.sleep(5)

client.ml.start_trained_model_deployment(
    model_id=model, number_of_allocations=1, wait_for="starting"
)

while True:
    status = client.ml.get_trained_models_stats(
        model_id=model,
    )
    if status["trained_model_stats"][0]["deployment_stats"]["state"] == "started":
        print(model + " has been successfully deployed.")
        break
    else:
        print(model + " is currently being deployed.")
    time.sleep(5)
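
The two `while True` loops above poll until the model is ready and will wait forever if the download or deployment never completes. One possible alternative is a small polling helper with a timeout. This is a sketch only: `wait_until` and its parameters are hypothetical names, not part of the Elasticsearch client.

```python
import time


def wait_until(check, timeout=600.0, interval=5.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call check() repeatedly until it returns True or the timeout expires.

    Returns True if check() succeeded, False if the deadline passed first.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Usage with the deployment check from the cell above (requires a cluster):
# def deployment_started():
#     stats = client.ml.get_trained_models_stats(model_id=model)
#     return stats["trained_model_stats"][0]["deployment_stats"]["state"] == "started"
#
# if not wait_until(deployment_started, timeout=900):
#     raise TimeoutError(model + " did not start in time.")
```

The `sleep` and `clock` arguments exist only so the helper can be exercised without real waiting; in the notebook the defaults are sufficient.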

## Importing PDF chunks into Index
This will load the PDF from the URL provided, and then chunk the text into passage docs.

In [6]:
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Change to any PDF of your choice
loader = PyPDFLoader("https://arxiv.org/pdf/2103.15348.pdf")

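# Load the PDF; each page becomes one LangChain document
data = loader.load()

# Split by token count (tiktoken): 512-token chunks with a 256-token overlap
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=512, chunk_overlap=256
)
# Split the pages already loaded instead of downloading the PDF a second time
docs = text_splitter.split_documents(data)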
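
To build intuition for what `chunk_size=512` and `chunk_overlap=256` mean, here is a simplified sliding-window sketch. It is an illustration only: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraph and line boundaries and measures length in tiktoken tokens, whereas the hypothetical `sliding_chunks` function below works on a plain list of items.

```python
def sliding_chunks(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap items."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


words = "a b c d e f g h i j".split()
for chunk in sliding_chunks(words, chunk_size=4, chunk_overlap=2):
    print(chunk)
# ['a', 'b', 'c', 'd']
# ['c', 'd', 'e', 'f']
# ['e', 'f', 'g', 'h']
# ['g', 'h', 'i', 'j']
```

With an overlap of half the chunk size, as in this notebook, most of the text ends up in two passages. That roughly doubles the number of passages to embed and index, in exchange for a lower chance that a relevant sentence is cut at a chunk boundary.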

## Ingesting the passages into Elasticsearch
This will ingest the passage docs into the Elasticsearch index named by `INDEX_NAME`, using the deployed ELSER model to embed each passage.

In [None]:
from langchain_elasticsearch import ElasticsearchStore

INDEX_NAME = "pdf_chunked_index"

ElasticsearchStore.from_documents(
    docs,
    es_connection=client,
    index_name=INDEX_NAME,
    strategy=ElasticsearchStore.SparseVectorRetrievalStrategy(model_id=model),
    bulk_kwargs={
        "request_timeout": 60,
    },
)
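
Once the passages are ingested, one way to sanity-check the index is to run a query through a store that points at it. The sketch below assumes a store object exposing LangChain's `similarity_search` method (for example the object returned by `from_documents`, or a new `ElasticsearchStore` built with the same index name and strategy). `top_passages` is a hypothetical helper, and the `page` metadata key is assumed to be what the PDF loader attaches, so it is read as optional.

```python
def top_passages(store, question, k=3, preview_chars=200):
    """Return (page, text preview) pairs for the k best-matching passages."""
    hits = store.similarity_search(question, k=k)
    return [
        # metadata.get() tolerates documents that carry no page number
        (doc.metadata.get("page"), doc.page_content[:preview_chars])
        for doc in hits
    ]


# Usage (requires the cluster and index created above):
# store = ElasticsearchStore(
#     es_connection=client,
#     index_name=INDEX_NAME,
#     strategy=ElasticsearchStore.SparseVectorRetrievalStrategy(model_id=model),
# )
# for page, preview in top_passages(store, "What problem does the paper address?"):
#     print(page, preview)
```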