{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# PDF Extraction and Ingest with ELSER Example\n",
"[](https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/ingestion-and-chunking/pdf-chunking-ingest.ipynb)\n",
"\n",
"This workbook demonstrates how to extract the contents of a single PDF, create passages and ingest into Elasticsearch. \n",
"\n",
"In this example we will:\n",
"- load the PDF using pypdf\n",
"- chunk the text with LangChain document splitter\n",
"- ingest into Elasticsearch with LangChain Elasticsearch Vectorstore. \n",
"\n",
"We will also setup your Elasticsearch cluster with ELSER model, so we can use it to embed the passages."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "zQlYpYkI46Ff",
"outputId": "83677846-8a6a-4b49-fde0-16d473778814"
},
"outputs": [],
"source": [
"!pip install -qU pypdf langchain_community langchain \"elasticsearch<9\" tiktoken langchain-elasticsearch"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "GCZR7-zK810e"
},
"source": [
"## Connecting to Elasticsearch"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"id": "DofNZ2w25nIr"
},
"outputs": [],
"source": [
"from elasticsearch import Elasticsearch\n",
"from getpass import getpass\n",
"\n",
"# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id\n",
"ELASTIC_CLOUD_ID = getpass(\"Elastic Cloud ID: \")\n",
"\n",
"# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key\n",
"ELASTIC_API_KEY = getpass(\"Elastic Api Key: \")\n",
"\n",
"client = Elasticsearch(\n",
" # For local development\n",
" # \"http://localhost:9200\",\n",
" # basic_auth=(\"elastic\", \"changeme\")\n",
" cloud_id=ELASTIC_CLOUD_ID,\n",
" api_key=ELASTIC_API_KEY,\n",
")"
]
},
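{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before moving on, it is worth confirming the client can reach the cluster. A minimal sanity check (not part of the original flow): `client.info()` returns basic cluster details and raises if the endpoint or API key is wrong."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Quick sanity check: raises if the connection or credentials are wrong\n",
"print(client.info())"
]
},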
{
"cell_type": "markdown",
"metadata": {
"id": "zv6hKYWr8-Mg"
},
"source": [
"## Deploying ELSER"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "1U4ffD2K9BkJ"
},
"outputs": [],
"source": [
"import time\n",
"\n",
"model = \".elser_model_2\"\n",
"\n",
"try:\n",
" client.ml.put_trained_model(model_id=model, input={\"field_names\": [\"text_field\"]})\n",
"except:\n",
" pass\n",
"\n",
"while True:\n",
" status = client.ml.get_trained_models(model_id=model, include=\"definition_status\")\n",
"\n",
" if status[\"trained_model_configs\"][0][\"fully_defined\"]:\n",
" print(model + \" is downloaded and ready to be deployed.\")\n",
" break\n",
" else:\n",
" print(model + \" is downloading or not ready to be deployed.\")\n",
" time.sleep(5)\n",
"\n",
"client.ml.start_trained_model_deployment(\n",
" model_id=model, number_of_allocations=1, wait_for=\"starting\"\n",
")\n",
"\n",
"while True:\n",
" status = client.ml.get_trained_models_stats(\n",
" model_id=model,\n",
" )\n",
" if status[\"trained_model_stats\"][0][\"deployment_stats\"][\"state\"] == \"started\":\n",
" print(model + \" has been successfully deployed.\")\n",
" break\n",
" else:\n",
" print(model + \" is currently being deployed.\")\n",
" time.sleep(5)"
]
},
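{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what ELSER produces, here is a small optional sketch that runs one document through the deployed model with the `infer_trained_model` API; the sample sentence is an arbitrary placeholder. ELSER returns a sparse map of tokens to weights, which is what gets stored alongside each passage."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: inspect ELSER's output for a single document.\n",
"# The sample sentence below is an arbitrary placeholder.\n",
"response = client.ml.infer_trained_model(\n",
" model_id=model,\n",
" docs=[{\"text_field\": \"The quick brown fox jumps over the lazy dog\"}],\n",
")\n",
"# The result is a sparse map of expansion tokens to weights\n",
"print(response[\"inference_results\"][0][\"predicted_value\"])"
]
},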
{
"cell_type": "markdown",
"metadata": {
"id": "wqYXqJxn9JsA"
},
"source": [
"## Importing PDF chunks into Index\n",
"This will load the PDF from the url provided, and then chunk the text into passage docs."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"id": "7bN32vunqIk2"
},
"outputs": [],
"source": [
"from langchain_community.document_loaders import PyPDFLoader\n",
"from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
"\n",
"# Change to any PDF of your choice\n",
"loader = PyPDFLoader(\"https://arxiv.org/pdf/2103.15348.pdf\")\n",
"\n",
"data = loader.load()\n",
"\n",
"text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(\n",
" chunk_size=512, chunk_overlap=256\n",
")\n",
"docs = loader.load_and_split(text_splitter=text_splitter)"
]
},
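{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before ingesting, it can help to inspect what the splitter produced. A minimal sketch that prints the passage count plus the metadata and opening text of the first chunk:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Inspect the chunking result: passage count and a preview of the first chunk\n",
"print(f\"Split the PDF into {len(docs)} passages\")\n",
"print(docs[0].metadata)\n",
"print(docs[0].page_content[:300])"
]
},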
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Ingesting the passages into Elasticsearch\n",
"This will ingest the passage docs into the Elasticsearch index, under the specified INDEX_NAME."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "0xtdeIJI9N9-"
},
"outputs": [],
"source": [
"from langchain_elasticsearch import ElasticsearchStore\n",
"\n",
"INDEX_NAME = \"pdf_chunked_index\"\n",
"\n",
"ElasticsearchStore.from_documents(\n",
" docs,\n",
" es_connection=client,\n",
" index_name=INDEX_NAME,\n",
" strategy=ElasticsearchStore.SparseVectorRetrievalStrategy(model_id=model),\n",
" bulk_kwargs={\n",
" \"request_timeout\": 60,\n",
" },\n",
")"
]
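},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Verifying the ingest with a semantic search\n",
"To confirm the passages are searchable, a minimal sketch that points an `ElasticsearchStore` at the same index and runs a semantic query through ELSER. The question text is just an example; swap in anything relevant to your PDF."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Search the new index; the query is expanded by ELSER at search time.\n",
"# The query string below is just an example.\n",
"store = ElasticsearchStore(\n",
" es_connection=client,\n",
" index_name=INDEX_NAME,\n",
" strategy=ElasticsearchStore.SparseVectorRetrievalStrategy(model_id=model),\n",
")\n",
"\n",
"results = store.similarity_search(\"What is the paper about?\", k=3)\n",
"for result in results:\n",
" print(result.page_content[:200])\n",
" print(\"---\")"
]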
}
],
"metadata": {
"colab": {
"include_colab_link": true,
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}