demo-python/code/indexers/document-intelligence-custom-skill/document-intelligence-custom-skill.ipynb (298 lines of code) (raw):
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Use semantic chunking in Document Intelligence with Azure AI Search\n",
"\n",
"This sample demonstrates how to use the [semantic chunking capabilities of Document Intelligence](https://learn.microsoft.com/azure/ai-services/document-intelligence/concept-retrieval-augmented-generation) with Azure AI Search.\n",
"\n",
"Semantic chunking divides the text into chunks based on semantic understanding. Division boundaries are focused on sentence subject, maintaining semantic consistency within each chunk. It's useful for text summarization, sentiment analysis, and document classification tasks. It can significantly improve the experience of conversational chat and generative AI because the upstream inputs available to the model are of higher quality.\n",
"\n",
"The sample uses a custom skill to add semantic chunking to an indexing pipeline in Azure AI Search. The sample content might be familiar to you if you've run other Azure AI Search samples in the past. It's a collection of PDFs about employee healthcare, benefits, and employment guidance. In constrast with other samples that use fixed-size data chunking, this sample uses semantic chunking during indexing.\n",
"\n",
"This notebook takes you directly to queries that execute on the search engine so that you can evaluate the quality of the response based on the value added by semantic chunking. There's no generative AI or chat step to obscure the effect of semantic chunking. \n",
"\n",
"If you want to learn more about how the solution is built, review the source code in this folder. To try out semantic chunking with chat models and generative AI, you can adapt the [RAG tutorial](https://github.com/Azure-Samples/azure-search-python-samples/tree/main/Tutorial-RAG), replacing the fixed-length data chunking with the semantic chunking component of this sample.\n",
"\n",
"## Prerequisites\n",
"\n",
"+ Follow the instructions in the [readme](./readme.md) to deploy all Azure resources, including sample data and the search index.\n",
"\n",
"+ Check your search service to make sure the *document-intelligence-index* exists. If you don't see an index, revisit the readme and run the `setup_search_service` script.\n",
"\n",
"+ Don't add an `.env` file to this folder. Environment variables are read from the `azd` deployment. You can type `azd env get-values` to review the variables.\n",
"\n",
"## Grant permissions\n",
"\n",
"The deployment script creates an Azure AI Search resource that's configured for Azure role-based access. By default, you inherit the **Owner** role assignment of your Azure subscription. This section explains which additional permissions are required to run the queries.\n",
"\n",
"### Assign roles on Azure AI Search\n",
"\n",
"If you are an **Owner**, you have permission to create and manage objects on your search service, but you don't have read-access to indexes.\n",
"\n",
"You can use the Azure portal and the **Access Control (IAM)** page to give yourself **Search Index Data Reader** permissions. Or, [use the Azure CLI to assign roles](https://learn.microsoft.com/azure/role-based-access-control/role-assignments-cli) for a command line approach. Currently, you can't use `azd` to assign roles.\n",
"\n",
"1. Modify the following command to use valid values:\n",
"\n",
" ```azurecli\n",
" az role assignment create --assignee \"55555555-5555-5555-5555-555555555555\" --role \"Search Index Data Reader\" --scope \"/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-YOUR-ENVIRONMENT-NAME/providers/Microsoft.Search/searchServices/srch-RANDOM-STRING\"\n",
" ```\n",
"\n",
" - The assignee is your personal identity token or your account name such as such as *patlong@contoso.com*.\n",
" - The subscription ID can be obtained from `az account show`.\n",
" - If you used the script to deploy resources, check the Azure portal or the status message of the deployment script for the resource group and search service name.\n",
"\n",
"1. Paste the modified into the command line and press **Enter**.\n",
"\n",
"If successful, the command returns information about the role assignment.\n",
"\n",
"Later, if you run queries and get an \"HttpResponseError: Operation returned an invalid status 'Forbidden'\" error, revisit this step to make sure you completed this step.\n",
"\n",
"### Assign roles on Azure OpenAI\n",
"\n",
"The queries in this notebook are provided in plain text (\"What's a performance review?\"). At query time, the search engine calls a vectorizer to generate an embedding of the query string, which is then passed back to the search engine for vector search over the vector fields in the index.\n",
"\n",
"To give the search service access to the embedding model, assign to it the **Cognitive Services OpenAI User** permissions on Azure OpenAI.\n",
"\n",
"1. Modify the following command to use valid values:\n",
"\n",
" ```azurecli\n",
" az role assignment create --assignee \"55555555-5555-5555-5555-555555555555\" --role \"Cognitive Services OpenAI User\" --scope \"/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-YOUR-ENVIRONMENT-NAME/providers/Microsoft.CognitiveServices/accounts/cog-RANDOM-STRING\"\n",
" ```\n",
"\n",
" - The assignee is the system-assigned managed identity of your Azure AI Search service.\n",
" - The subscription ID can be obtained from `az account show`.\n",
" - If you used the script to deploy resources, check the Azure portal or the status message of the deployment script for the resource group and Azure OpenAI account name.\n",
"\n",
"1. Paste the modified into the command line and press **Enter**.\n",
"\n",
"Later, if you run queries and get \"HttpResponseError: () Could not complete vectorization action. The vectorization endpoint returned status code '401' (Unauthorized)\", followed by \"Message: Could not complete vectorization action. The vectorization endpoint returned status code '401' (Unauthorized).\", then revisit this step to give yourself permissions.\n",
"\n",
"## Get started\n",
"\n",
"Install the packages and load the libraries that are necessary for running the queries in this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"! pip install azure-search-documents==11.6.0b4 --quiet\n",
"! pip install python-dotenv azure-identity --quiet"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Load all environment variables from the azd deployment\n",
"import subprocess\n",
"from io import StringIO\n",
"from dotenv import load_dotenv\n",
"result = subprocess.run([\"azd\", \"env\", \"get-values\"], stdout=subprocess.PIPE)\n",
"load_dotenv(stream=StringIO(result.stdout.decode(\"utf-8\")))"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"search_url = f\"https://{os.environ['AZURE_SEARCH_SERVICE']}.search.windows.net\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Perform a vector similarity search\n",
"\n",
"This example shows a pure vector search using the vectorizable text query, all you need to do is pass in text and your vectorizer will handle the query vectorization."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"from azure.search.documents import SearchClient\n",
"from azure.search.documents.models import VectorizableTextQuery\n",
"from azure.identity import DefaultAzureCredential\n",
"# Pure Vector Search\n",
"query = \"What's a performance review?\" \n",
" \n",
"search_client = SearchClient(search_url, \"document-intelligence-index\", credential=DefaultAzureCredential())\n",
"vector_query = VectorizableTextQuery(text=query, k_nearest_neighbors=50, fields=\"vector\", exhaustive=True)\n",
"# Use the below query to pass in the raw vector query instead of the query vectorization\n",
"# vector_query = RawVectorQuery(vector=generate_embeddings(query), k_nearest_neighbors=3, fields=\"vector\")\n",
" \n",
"results = search_client.search( \n",
" search_text=None, \n",
" vector_queries= [vector_query],\n",
" select=[\"parent_id\", \"chunk_id\", \"chunk_headers\", \"chunk\"],\n",
" top=1\n",
") \n",
" \n",
"for result in results:\n",
" print(f\"parent_id: {result['parent_id']}\") \n",
" print(f\"Score: {result['@search.score']}\") \n",
" print(f\"Chunk Headers: {result['chunk_headers']}\")\n",
" print(f\"Content: {result['chunk']}\") \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Perform a hybrid search\n",
"\n",
"Search using text and vectors combined for more relevant results"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"# Hybrid Search\n",
"query = \"What's the difference between the health plans?\" \n",
" \n",
"vector_query = VectorizableTextQuery(text=query, k_nearest_neighbors=50, fields=\"vector\", exhaustive=True)\n",
" \n",
"results = search_client.search( \n",
" search_text=query, \n",
" vector_queries= [vector_query],\n",
" select=[\"parent_id\", \"chunk_id\", \"chunk_headers\", \"chunk\"],\n",
" top=1\n",
") \n",
" \n",
"for result in results: \n",
" print(f\"parent_id: {result['parent_id']}\") \n",
" print(f\"chunk_id: {result['chunk_id']}\") \n",
" print(f\"Score: {result['@search.score']}\") \n",
" print(f\"Chunk Headers: {result['chunk_headers']}\")\n",
" print(f\"Content: {result['chunk']}\") \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Use chunk headers to improve search results\n",
"\n",
"Semantic chunking retrieves section headers if they are available. Use them to improve your search results\n",
"\n",
"Note that semantic chunking from document intelligence automatically converts tables to Markdown form"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"# Hybrid Search\n",
"query = \"How does the cost between health plans compare?\" \n",
" \n",
"vector_query = VectorizableTextQuery(text=query, k_nearest_neighbors=50, fields=\"vector\", exhaustive=True)\n",
" \n",
"results = search_client.search( \n",
" search_text=query, \n",
" vector_queries= [vector_query],\n",
" select=[\"parent_id\", \"chunk_id\", \"chunk_headers\", \"chunk\"],\n",
" search_fields=[\"chunk\", \"chunk_headers\"],\n",
" top=1\n",
") \n",
" \n",
"for result in results: \n",
" print(f\"parent_id: {result['parent_id']}\") \n",
" print(f\"chunk_id: {result['chunk_id']}\") \n",
" print(f\"Score: {result['@search.score']}\") \n",
" print(f\"Chunk Headers: {result['chunk_headers']}\")\n",
" print(f\"Content: {result['chunk']}\") \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Perform a hybrid search + Semantic reranking"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azure.search.documents.models import QueryType, QueryCaptionType, QueryAnswerType\n",
"\n",
"# Semantic Hybrid Search\n",
"query = \"What's a performance review?\"\n",
"\n",
"vector_query = VectorizableTextQuery(text=query, k_nearest_neighbors=50, fields=\"vector\", exhaustive=True)\n",
"\n",
"results = search_client.search( \n",
" search_text=query,\n",
" vector_queries=[vector_query],\n",
" select=[\"parent_id\", \"chunk_id\", \"chunk\"],\n",
" query_type=QueryType.SEMANTIC, semantic_configuration_name='my-semantic-config', query_caption=QueryCaptionType.EXTRACTIVE, query_answer=QueryAnswerType.EXTRACTIVE,\n",
" top=2\n",
")\n",
"\n",
"semantic_answers = results.get_answers()\n",
"for answer in semantic_answers:\n",
" if answer.highlights:\n",
" print(f\"Semantic Answer: {answer.highlights}\")\n",
" else:\n",
" print(f\"Semantic Answer: {answer.text}\")\n",
" print(f\"Semantic Answer Score: {answer.score}\\n\")\n",
"\n",
"for result in results:\n",
" print(f\"parent_id: {result['parent_id']}\") \n",
" print(f\"chunk_id: {result['chunk_id']}\") \n",
" print(f\"Score: {result['@search.score']}\") \n",
" print(f\"Content: {result['chunk']}\") \n",
"\n",
" captions = result[\"@search.captions\"]\n",
" if captions:\n",
" caption = captions[0]\n",
" if caption.highlights:\n",
" print(f\"Caption: {caption.highlights}\\n\")\n",
" else:\n",
" print(f\"Caption: {caption.text}\\n\")\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}