demo-python/code/data-chunking/textsplit-data-chunking-example.ipynb

{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Azure AI Search text splitter data chunking example\n", "\n", "This notebook uses the Text Split skill in Azure AI Search to chunk text. This approach takes a dependency on indexers and skillsets.\n", "\n", "The notebook complements the [Chunking large documents for vector search solutions](https://learn.microsoft.com/azure/search/vector-search-how-to-chunk-documents) article in the Azure AI Search documentation.\n", "\n", "## Prerequisites\n", "\n", "* An Azure subscription.\n", "* An Azure Storage account with blob storage.\n", "* Azure AI Search, Standard (S1) or higher to accommodate the large PDF.\n", "* Azure OpenAI with a text embedding model, either text-embedding-ada-002 or a text-embedding-3 model.\n", "\n", "You can use keys for authenticated connections, or you can set up managed identities and role assignments for service-to-service communication. For role-based access, an Azure AI Search service needs **Cognitive Services OpenAI User** on Azure OpenAI and **Storage Blob Data Reader** on Azure Storage.\n", "\n", "### Set up a Python virtual environment in Visual Studio Code\n", "\n", "1. Open the Command Palette (Ctrl+Shift+P).\n", "1. Search for **Python: Create Environment**.\n", "1. Select **Venv**.\n", "1. Select a Python interpreter. Choose 3.10 or later.\n", "\n", "It can take a minute to set up. If you run into problems, see [Python environments in VS Code](https://code.visualstudio.com/docs/python/environments)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Install packages" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "%pip install --quiet -r requirements.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load .env file (Copy .env-sample to .env and update accordingly)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "from dotenv import load_dotenv\n", "from azure.identity import DefaultAzureCredential\n", "from azure.core.credentials import AzureKeyCredential\n", "import os\n", "\n", "load_dotenv() # take environment variables from .env.\n", "\n", "search_endpoint = os.environ[\"AZURE_SEARCH_SERVICE_ENDPOINT\"]\n", "search_index = os.environ[\"AZURE_SEARCH_INDEX\"]\n", "search_datasource = os.environ[\"AZURE_SEARCH_DATASOURCE\"]\n", "search_skillset = os.environ[\"AZURE_SEARCH_SKILLSET\"]\n", "search_indexer = os.environ[\"AZURE_SEARCH_INDEXER\"]\n", "azure_openai_endpoint = os.environ[\"AZURE_OPENAI_ENDPOINT\"]\n", "azure_openai_embedding_deployment_id = os.environ[\"AZURE_OPENAI_EMBEDDING_DEPLOYMENT_ID\"]\n", "blob_container = os.environ[\"AZURE_BLOB_CONTAINER\"]\n", "blob_connection_string = os.environ[\"AZURE_BLOB_CONNECTION_STRING\"]\n", "blob_account_url = os.environ[\"AZURE_BLOB_ACCOUNT_URL\"]\n", "\n", "search_credential = AzureKeyCredential(os.environ[\"AZURE_SEARCH_ADMIN_KEY\"]) if len(os.environ[\"AZURE_SEARCH_ADMIN_KEY\"]) > 0 else DefaultAzureCredential()\n", "azure_openai_key = os.environ[\"AZURE_OPENAI_KEY\"] if len(os.environ[\"AZURE_OPENAI_KEY\"]) > 0 else None" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Upload sample PDF for chunking\n", "\n", "This step creates a blob container on Azure Storage and loads a single large PDF into the container." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "from azure.storage.blob import BlobServiceClient\n", "\n", "def open_blob_client():\n", " # Set max_block_size and max_single_put_size due to large PDF transfers\n", " # See https://learn.microsoft.com/azure/storage/blobs/storage-blobs-tune-upload-download-python\n", " if not blob_connection_string.startswith(\"ResourceId\"):\n", " return BlobServiceClient.from_connection_string(\n", " blob_connection_string,\n", " max_block_size=1024*1024*8, # 8 MiB\n", " max_single_put_size=1024*1024*8 # 8 MiB\n", " )\n", " return BlobServiceClient(\n", " account_url=blob_account_url,\n", " credential=DefaultAzureCredential(),\n", " max_block_size=1024*1024*8, # 8 MiB\n", " max_single_put_size=1024*1024*8 # 8 MiB\n", " )\n", "\n", "blob_client = open_blob_client()\n", "container_client = blob_client.get_container_client(blob_container)\n", "if not container_client.exists():\n", " container_client.create_container()\n", "\n", "file_path = os.path.join(\"..\", \"..\", \"..\", \"data\", \"nasa-ebooks\", \"earth_at_night_508.pdf\")\n", "blob_name = os.path.basename(file_path)\n", "blob_client = container_client.get_blob_client(blob_name)\n", "if not blob_client.exists():\n", " with open(file_path, \"rb\") as f:\n", " blob_client.upload_blob(data=f, overwrite=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create the index and run the indexer\n", "\n", "This step creates an index, data source, skillset, and indexer on Azure AI Search. The skillset includes a text split skill that chunks the data and calls the embedding model on Azure OpenAI. Vector and non-vector chunks are indexed in Azure AI Search." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Running indexer\n" ] } ], "source": [ "from azure.search.documents.indexes import SearchIndexClient, SearchIndexerClient\n", "from lib.common import (\n", " create_search_index,\n", " create_search_datasource,\n", " create_search_skillset,\n", " create_search_indexer\n", ")\n", "\n", "search_index_client = SearchIndexClient(endpoint=search_endpoint, credential=search_credential)\n", "index = create_search_index(\n", " search_index,\n", " azure_openai_endpoint,\n", " azure_openai_embedding_deployment_id,\n", " azure_openai_key\n", ")\n", "search_index_client.create_or_update_index(index)\n", "\n", "search_indexer_client = SearchIndexerClient(endpoint=search_endpoint, credential=search_credential)\n", "\n", "data_source = create_search_datasource(\n", " search_datasource,\n", " blob_connection_string,\n", " blob_container\n", ")\n", "search_indexer_client.create_or_update_data_source_connection(data_source)\n", "\n", "skillset = create_search_skillset(\n", " search_skillset,\n", " search_index,\n", " azure_openai_endpoint,\n", " azure_openai_embedding_deployment_id,\n", " azure_openai_key,\n", " text_split_mode='pages',\n", " maximum_page_length=2000,\n", " page_overlap_length=500\n", ")\n", "search_indexer_client.create_or_update_skillset(skillset)\n", "\n", "indexer = create_search_indexer(\n", " indexer_name=search_indexer,\n", " index_name=search_index,\n", " datasource_name=search_datasource,\n", " skillset_name=search_skillset\n", ")\n", "search_indexer_client.create_or_update_indexer(indexer)\n", "search_indexer_client.run_indexer(search_indexer)\n", "\n", "print(\"Running indexer\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Show chunk character length and token length histogram\n", "\n", "If you get an error, check the Azure portal to make sure the index exists and the indexer ran successfully. Also, if you use role-based access, make sure you have individual permissions to read an index. For more information, see [Quickstart: Connect without keys](https://learn.microsoft.com/azure/search/search-get-started-rbac)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "from lib.common import (\n", " get_chunks,\n", " get_token_length,\n", " plot_chunk_histogram\n", ")\n", "\n", "search_client = search_index_client.get_search_client(search_index)\n", "chunks = get_chunks(search_client)\n", "\n", "plot_chunk_histogram(chunks, length_fn=len, title=\"Distribution of Chunk Character Length\", xlabel=\"Chunk Character Length\")\n", "plot_chunk_histogram(chunks, length_fn=get_token_length, title=\"Distribution of Chunk Token Length\", xlabel=\"Chunk Token Length\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.9" } }, "nbformat": 4, "nbformat_minor": 2 }

demo-python/code/data-chunking/textsplit-data-chunking-example.ipynb (261 lines of code) (raw):