{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# PDF Extraction and Ingest with ELSER Example\n",
"[](https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/ingestion-and-chunking/pdf-chunking-ingest.ipynb)\n",
"\n",
"This workbook demonstrates how to extract the contents of a single PDF, create passages and ingest into Elasticsearch. \n",
"\n",
"In this example we will:\n",
"- load the PDF using pypdf\n",
"- chunk the text with LangChain document splitter\n",
"- ingest into Elasticsearch with LangChain Elasticsearch Vectorstore. \n",
"\n",
"We will also setup your Elasticsearch cluster with ELSER model, so we can use it to embed the passages."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "zQlYpYkI46Ff",
"outputId": "83677846-8a6a-4b49-fde0-16d473778814"
},
"outputs": [],
"source": [
"!pip install -qU pypdf langchain_community langchain \"elasticsearch<9\" tiktoken langchain-elasticsearch"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "GCZR7-zK810e"
},
"source": [
"## Connecting to Elasticsearch"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"id": "DofNZ2w25nIr"
},
"outputs": [],
"source": [
"from elasticsearch import Elasticsearch\n",
"from getpass import getpass\n",
"\n",
"# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id\n",
"ELASTIC_CLOUD_ID = getpass(\"Elastic Cloud ID: \")\n",
"\n",
"# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key\n",
"ELASTIC_API_KEY = getpass(\"Elastic Api Key: \")\n",
"\n",
"client = Elasticsearch(\n",
" # For local development\n",
" # \"http://localhost:9200\",\n",
" # basic_auth=(\"elastic\", \"changeme\")\n",
" cloud_id=ELASTIC_CLOUD_ID,\n",
" api_key=ELASTIC_API_KEY,\n",
")"
]
},
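{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before moving on, it is worth confirming the client can reach the cluster. A minimal sanity check (not part of the original flow): `client.info()` returns basic cluster details and raises if the endpoint or API key is wrong."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Quick sanity check: raises if the connection or credentials are wrong\n",
"print(client.info())"
]
},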
{
"cell_type": "markdown",
"metadata": {
"id": "zv6hKYWr8-Mg"
},
"source": [
"## Deploying ELSER"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "1U4ffD2K9BkJ"
},
"outputs": [],
"source": [
"import time\n",
"\n",
"model = \".elser_model_2\"\n",
"\n",
"try:\n",
" client.ml.put_trained_model(model_id=model, input={\"field_names\": [\"text_field\"]})\n",
"except:\n",
" pass\n",
"\n",
"while True:\n",
" status = client.ml.get_trained_models(model_id=model, include=\"definition_status\")\n",
"\n",
" if status[\"trained_model_configs\"][0][\"fully_defined\"]:\n",
" print(model + \" is downloaded and ready to be deployed.\")\n",
" break\n",
" else:\n",
" print(model + \" is downloading or not ready to be deployed.\")\n",
" time.sleep(5)\n",
"\n",
"client.ml.start_trained_model_deployment(\n",
" model_id=model, number_of_allocations=1, wait_for=\"starting\"\n",
")\n",
"\n",
"while True:\n",
" status = client.ml.get_trained_models_stats(\n",
" model_id=model,\n",
" )\n",
" if status[\"trained_model_stats\"][0][\"deployment_stats\"][\"state\"] == \"started\":\n",
" print(model + \" has been successfully deployed.\")\n",
" break\n",
" else:\n",
" print(model + \" is currently being deployed.\")\n",
" time.sleep(5)"
]
},
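{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what ELSER produces, here is a small optional sketch that runs one document through the deployed model with the `infer_trained_model` API; the sample sentence is an arbitrary placeholder. ELSER returns a sparse map of tokens to weights, which is what gets stored alongside each passage."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: inspect ELSER's output for a single document.\n",
"# The sample sentence below is an arbitrary placeholder.\n",
"response = client.ml.infer_trained_model(\n",
" model_id=model,\n",
" docs=[{\"text_field\": \"The quick brown fox jumps over the lazy dog\"}],\n",
")\n",
"# The result is a sparse map of expansion tokens to weights\n",
"print(response[\"inference_results\"][0][\"predicted_value\"])"
]
},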
{
"cell_type": "markdown",
"metadata": {
"id": "wqYXqJxn9JsA"
},
"source": [
"## Importing PDF chunks into Index\n",
"This will load the PDF from the url provided, and then chunk the text into passage docs."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"id": "7bN32vunqIk2"
},
"outputs": [],
"source": [
"from langchain_community.document_loaders import PyPDFLoader\n",
"from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
"\n",
"# Change to any PDF of your choice\n",
"loader = PyPDFLoader(\"https://arxiv.org/pdf/2103.15348.pdf\")\n",
"\n",
"data = loader.load()\n",
"\n",
"text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(\n",
" chunk_size=512, chunk_overlap=256\n",
")\n",
"docs = loader.load_and_split(text_splitter=text_splitter)"
]
},
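{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before ingesting, it can help to inspect what the splitter produced. A minimal sketch that prints the passage count plus the metadata and opening text of the first chunk:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Inspect the chunking result: passage count and a preview of the first chunk\n",
"print(f\"Split the PDF into {len(docs)} passages\")\n",
"print(docs[0].metadata)\n",
"print(docs[0].page_content[:300])"
]
},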
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Ingesting the passages into Elasticsearch\n",
"This will ingest the passage docs into the Elasticsearch index, under the specified INDEX_NAME."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "0xtdeIJI9N9-"
},
"outputs": [],
"source": [
"from langchain_elasticsearch import ElasticsearchStore\n",
"\n",
"INDEX_NAME = \"pdf_chunked_index\"\n",
"\n",
"ElasticsearchStore.from_documents(\n",
" docs,\n",
" es_connection=client,\n",
" index_name=INDEX_NAME,\n",
" strategy=ElasticsearchStore.SparseVectorRetrievalStrategy(model_id=model),\n",
" bulk_kwargs={\n",
" \"request_timeout\": 60,\n",
" },\n",
")"
]
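},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Verifying the ingest with a semantic search\n",
"To confirm the passages are searchable, a minimal sketch that points an `ElasticsearchStore` at the same index and runs a semantic query through ELSER. The question text is just an example; swap in anything relevant to your PDF."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Search the new index; the query is expanded by ELSER at search time.\n",
"# The query string below is just an example.\n",
"store = ElasticsearchStore(\n",
" es_connection=client,\n",
" index_name=INDEX_NAME,\n",
" strategy=ElasticsearchStore.SparseVectorRetrievalStrategy(model_id=model),\n",
")\n",
"\n",
"results = store.similarity_search(\"What is the paper about?\", k=3)\n",
"for result in results:\n",
" print(result.page_content[:200])\n",
" print(\"---\")"
]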
}
],
"metadata": {
"colab": {
"include_colab_link": true,
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}