demo-python/code/integrated-vectorization/azure-search-integrated-vectorization-sample.ipynb (957 lines of code) (raw):

{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Azure AI Search integrated vectorization sample\n", "\n", "This Python notebook demonstrates the [integrated vectorization](https://learn.microsoft.com/azure/search/vector-search-integrated-vectorization) features of Azure AI Search that are currently in public preview. \n", "\n", "Integrated vectorization takes a dependency on indexers and skillsets, using the Text Split skill for data chunking, and the AzureOpenAIEmbedding skill and your Azure OpenAI resource for embedding.\n", "\n", "This example uses PDFs from the `data/documents` folder for chunking, embedding, indexing, and queries.\n", "\n", "### Prerequisites\n", "\n", "+ An Azure subscription, with [access to Azure OpenAI](https://aka.ms/oai/access).\n", " \n", "+ Azure AI Search, any tier, but we recommend Basic or higher for this workload. [Enable semantic ranker](https://learn.microsoft.com/azure/search/semantic-how-to-enable-disable) if you want to run a hybrid query with semantic ranking.\n", "\n", "+ A deployment of an Azure OpenAI embedding model, such as `text-embedding-3-large` (this notebook's default) or `text-embedding-ada-002`.\n", "\n", "+ Azure Blob Storage. This notebook connects to your storage account and loads a container with the sample PDFs.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set up a Python virtual environment in Visual Studio Code\n", "\n", "1. Open the Command Palette (Ctrl+Shift+P).\n", "1. Search for **Python: Create Environment**.\n", "1. Select **Venv**.\n", "1. Select a Python interpreter. Choose 3.10 or later.\n", "\n", "It can take a minute to set up. If you run into problems, see [Python environments in VS Code](https://code.visualstudio.com/docs/python/environments)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Install packages" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "! 
pip install -r azure-search-integrated-vectorization-sample-requirements.txt --quiet" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load .env file (Copy .env-sample to .env and update accordingly)\n", "\n", "Optionally, you can test the following features of integrated vectorization using this notebook by setting the appropriate environment variables below:\n", "\n", "1. [OCR](https://learn.microsoft.com/en-us/azure/search/cognitive-search-skill-ocr) every page using the built-in OCR functionality. This allows you to add page numbers for every chunk that is extracted. It requires an [AI Services account](https://learn.microsoft.com/en-us/azure/search/cognitive-search-attach-cognitive-services)\n", " 1. Set `USE_OCR` to true and specify `AZURE_AI_SERVICES_KEY` if using key-based authentication, and specify `AZURE_AI_SERVICES_ENDPOINT`.\n", "1. Use the [Document Layout Skill](https://learn.microsoft.com/en-us/azure/search/cognitive-search-skill-document-intelligence-layout) to convert PDFs and other compatible documents to markdown. It requires an [AI Services account](https://learn.microsoft.com/en-us/azure/search/cognitive-search-attach-cognitive-services) and a search service in a [supported region](https://learn.microsoft.com/en-us/azure/search/cognitive-search-attach-cognitive-services)\n", " 1. 
Set `USE_LAYOUT` to true and specify `AZURE_AI_SERVICES_KEY` if using key-based authentication, and specify `AZURE_AI_SERVICES_ENDPOINT`.\n" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "from dotenv import load_dotenv\n", "from azure.identity import DefaultAzureCredential\n", "from azure.core.credentials import AzureKeyCredential\n", "import os\n", "\n", "load_dotenv(override=True) # take environment variables from .env.\n", "\n", "# Variables not used here do not need to be updated in your .env file\n", "endpoint = os.environ[\"AZURE_SEARCH_SERVICE_ENDPOINT\"]\n", "credential = AzureKeyCredential(os.getenv(\"AZURE_SEARCH_ADMIN_KEY\")) if os.getenv(\"AZURE_SEARCH_ADMIN_KEY\") else DefaultAzureCredential()\n", "index_name = os.getenv(\"AZURE_SEARCH_INDEX\", \"int-vec\")\n", "blob_connection_string = os.environ[\"BLOB_CONNECTION_STRING\"]\n", "# search blob datasource connection string is optional - defaults to blob connection string\n", "# This field is only necessary if you are using MI to connect to the data source\n", "# https://learn.microsoft.com/azure/search/search-howto-indexing-azure-blob-storage#supported-credentials-and-connection-strings\n", "search_blob_connection_string = os.getenv(\"SEARCH_BLOB_DATASOURCE_CONNECTION_STRING\", blob_connection_string)\n", "blob_container_name = os.getenv(\"BLOB_CONTAINER_NAME\", \"int-vec\")\n", "azure_openai_endpoint = os.environ[\"AZURE_OPENAI_ENDPOINT\"]\n", "azure_openai_key = os.getenv(\"AZURE_OPENAI_KEY\")\n", "azure_openai_embedding_deployment = os.getenv(\"AZURE_OPENAI_EMBEDDING_DEPLOYMENT\", \"text-embedding-3-large\")\n", "azure_openai_model_name = os.getenv(\"AZURE_OPENAI_EMBEDDING_MODEL_NAME\", \"text-embedding-3-large\")\n", "azure_openai_model_dimensions = int(os.getenv(\"AZURE_OPENAI_EMBEDDING_DIMENSIONS\", 1024))\n", "# This field is only necessary if you want to use OCR to scan PDFs in the datasource or use the Document Layout skill without a key\n", 
"azure_ai_services_endpoint = os.getenv(\"AZURE_AI_SERVICES_ENDPOINT\", \"\")\n", "# This field is only necessary if you want to use OCR to scan PDFs in the data source or use the Document Layout skill and you want to authenticate using a key to Azure AI Services\n", "azure_ai_services_key = os.getenv(\"AZURE_AI_SERVICES_KEY\", \"\")\n", "\n", "# set USE_OCR to enable OCR to add page numbers. It cannot be combined with the document layout skill\n", "use_ocr = os.getenv(\"USE_OCR\", \"false\") == \"true\"\n", "# set USE_LAYOUT to enable Document Intelligence Layout skill for chunking by markdown. It cannot be combined with the built-in OCR\n", "use_document_layout = os.getenv(\"USE_LAYOUT\", \"false\") == \"true\"\n", "# Deepest nesting level in markdown that should be considered. See https://learn.microsoft.com/azure/search/cognitive-search-skill-document-intelligence-layout to learn more\n", "document_layout_depth = os.getenv(\"LAYOUT_MARKDOWN_HEADER_DEPTH\", \"h3\")\n", "# OCR must be used to add page numbers\n", "add_page_numbers = use_ocr\n", "\n", "if use_ocr and use_document_layout:\n", " raise Exception(\"You can only specify one of USE_OCR or USE_LAYOUT\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Connect to Blob Storage and load documents\n", "\n", "Retrieve documents from Blob Storage. You can use the sample documents in the data/documents folder. 
" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Setup sample data in demo-container\n" ] } ], "source": [ "from azure.storage.blob import BlobServiceClient \n", "import glob\n", "\n", "def upload_sample_documents(\n", " blob_connection_string: str,\n", " blob_container_name: str,\n", " documents_directory: str,\n", " # Set to false if you want to use credentials included in the blob connection string\n", " # Otherwise your identity will be used as credentials\n", " use_user_identity: bool = True,\n", " ):\n", " # Connect to Blob Storage\n", " blob_service_client = BlobServiceClient.from_connection_string(logging_enable=True, conn_str=blob_connection_string, credential=DefaultAzureCredential() if use_user_identity else None)\n", " container_client = blob_service_client.get_container_client(blob_container_name)\n", " if not container_client.exists():\n", " container_client.create_container()\n", "\n", " pdf_files = glob.glob(os.path.join(documents_directory, '*.pdf'))\n", " for file in pdf_files:\n", " with open(file, \"rb\") as data:\n", " name = os.path.basename(file)\n", " if not container_client.get_blob_client(name).exists():\n", " container_client.upload_blob(name=name, data=data)\n", "\n", "def upload_documents():\n", " upload_sample_documents(\n", " blob_connection_string=blob_connection_string,\n", " blob_container_name=blob_container_name,\n", " documents_directory=os.path.join(\"..\", \"..\", \"..\", \"data\", \"benefitdocs\")\n", " )\n", "\n", "def upload_documents_with_ocr():\n", " upload_sample_documents(\n", " blob_connection_string=blob_connection_string,\n", " blob_container_name=blob_container_name,\n", " documents_directory = os.path.join(\"..\", \"..\", \"..\", \"data\", \"ocrdocuments\")\n", " )\n", "\n", "def upload_documents_with_layout():\n", " upload_sample_documents(\n", " blob_connection_string=blob_connection_string,\n", " 
blob_container_name=blob_container_name,\n", " documents_directory = os.path.join(\"..\", \"..\", \"..\", \"data\", \"layoutdocuments\")\n", " )\n", "\n", "if use_ocr:\n", " upload_documents_with_ocr()\n", "elif use_document_layout:\n", " upload_documents_with_layout()\n", "else:\n", " upload_documents()\n", "\n", "print(f\"Setup sample data in {blob_container_name}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create a blob data source connector on Azure AI Search" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Data source 'my-demo-index-blob' created or updated\n" ] } ], "source": [ "from azure.search.documents.indexes import SearchIndexerClient\n", "from azure.search.documents.indexes.models import (\n", " SearchIndexerDataContainer,\n", " SearchIndexerDataSourceConnection\n", ")\n", "from azure.search.documents.indexes.models import NativeBlobSoftDeleteDeletionDetectionPolicy\n", "\n", "# Create a data source \n", "indexer_client = SearchIndexerClient(endpoint, credential)\n", "container = SearchIndexerDataContainer(name=blob_container_name)\n", "data_source_connection = SearchIndexerDataSourceConnection(\n", " name=f\"{index_name}-blob\",\n", " type=\"azureblob\",\n", " connection_string=search_blob_connection_string,\n", " container=container,\n", " data_deletion_detection_policy=NativeBlobSoftDeleteDeletionDetectionPolicy()\n", ")\n", "data_source = indexer_client.create_or_update_data_source_connection(data_source_connection)\n", "\n", "print(f\"Data source '{data_source.name}' created or updated\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create a search index\n", "\n", "Vector and nonvector content is stored in a search index." 
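, "\n", "The index uses a parent/child chunk layout: each chunk becomes its own search document, `chunk_id` is the key, `parent_id` links back to the source document, and the `vector` field stores the embedding. As a sketch (assuming this notebook's default of 1024 dimensions), the vector field is what ties the embedding size to an HNSW profile:\n", "\n", "```python\n", "SearchField(\n", "    name=\"vector\",\n", "    type=SearchFieldDataType.Collection(SearchFieldDataType.Single),\n", "    vector_search_dimensions=1024,  # must match the embedding skill's dimensions setting\n", "    vector_search_profile_name=\"myHnswProfile\",  # must name a profile in the VectorSearch config\n", ")\n", "```"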
] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "my-demo-index created\n" ] } ], "source": [ "from azure.search.documents.indexes import SearchIndexClient\n", "from azure.search.documents.indexes.models import (\n", " SearchField,\n", " SearchFieldDataType,\n", " VectorSearch,\n", " HnswAlgorithmConfiguration,\n", " VectorSearchProfile,\n", " AzureOpenAIVectorizer,\n", " AzureOpenAIVectorizerParameters,\n", " SemanticConfiguration,\n", " SemanticSearch,\n", " SemanticPrioritizedFields,\n", " SemanticField,\n", " SearchIndex\n", ")\n", "\n", "# Create a search index \n", "index_client = SearchIndexClient(endpoint=endpoint, credential=credential) \n", "fields = [ \n", " SearchField(name=\"parent_id\", type=SearchFieldDataType.String, sortable=True, filterable=True, facetable=True), \n", " SearchField(name=\"title\", type=SearchFieldDataType.String), \n", " SearchField(name=\"chunk_id\", type=SearchFieldDataType.String, key=True, sortable=True, filterable=True, facetable=True, analyzer_name=\"keyword\"), \n", " SearchField(name=\"chunk\", type=SearchFieldDataType.String, sortable=False, filterable=False, facetable=False), \n", " SearchField(name=\"vector\", type=SearchFieldDataType.Collection(SearchFieldDataType.Single), vector_search_dimensions=azure_openai_model_dimensions, vector_search_profile_name=\"myHnswProfile\"), \n", "]\n", "\n", "if add_page_numbers:\n", " fields.append(\n", " SearchField(name=\"page_number\", type=SearchFieldDataType.String, sortable=True, filterable=True, facetable=False)\n", " )\n", "\n", "if use_document_layout:\n", " fields.extend([\n", " SearchField(name=\"header_1\", type=SearchFieldDataType.String, sortable=False, filterable=False, facetable=False),\n", " SearchField(name=\"header_2\", type=SearchFieldDataType.String, sortable=False, filterable=False, facetable=False),\n", " SearchField(name=\"header_3\", type=SearchFieldDataType.String, 
sortable=False, filterable=False, facetable=False)\n", " ])\n", " \n", "# Configure the vector search configuration \n", "vector_search = VectorSearch( \n", " algorithms=[ \n", " HnswAlgorithmConfiguration(name=\"myHnsw\"),\n", " ], \n", " profiles=[ \n", " VectorSearchProfile( \n", " name=\"myHnswProfile\", \n", " algorithm_configuration_name=\"myHnsw\", \n", " vectorizer_name=\"myOpenAI\", \n", " )\n", " ], \n", " vectorizers=[ \n", " AzureOpenAIVectorizer( \n", " vectorizer_name=\"myOpenAI\", \n", " kind=\"azureOpenAI\", \n", " parameters=AzureOpenAIVectorizerParameters( \n", " resource_url=azure_openai_endpoint, \n", " deployment_name=azure_openai_embedding_deployment,\n", " model_name=azure_openai_model_name,\n", " api_key=azure_openai_key,\n", " ),\n", " ), \n", " ], \n", ") \n", " \n", "semantic_config = SemanticConfiguration( \n", " name=\"my-semantic-config\", \n", " prioritized_fields=SemanticPrioritizedFields( \n", " content_fields=[SemanticField(field_name=\"chunk\")],\n", " title_field=SemanticField(field_name=\"title\")\n", " ), \n", ")\n", " \n", "# Create the semantic search with the configuration \n", "semantic_search = SemanticSearch(configurations=[semantic_config]) \n", " \n", "# Create the search index\n", "index = SearchIndex(name=index_name, fields=fields, vector_search=vector_search, semantic_search=semantic_search) \n", "result = index_client.create_or_update_index(index) \n", "print(f\"{result.name} created\") \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create a skillset\n", "\n", "Skills drive integrated vectorization. [Text Split](https://learn.microsoft.com/azure/search/cognitive-search-skill-textsplit) provides data chunking. [AzureOpenAIEmbedding](https://learn.microsoft.com/azure/search/cognitive-search-skill-azure-openai-embedding) handles calls to Azure OpenAI, using the connection information you provide in the environment variables. 
An [indexer projection](https://learn.microsoft.com/azure/search/index-projections-concept-intro) specifies secondary indexes used for chunked data." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "my-demo-index-skillset created\n" ] } ], "source": [ "from azure.search.documents.indexes.models import (\n", " SplitSkill,\n", " InputFieldMappingEntry,\n", " OutputFieldMappingEntry,\n", " AzureOpenAIEmbeddingSkill,\n", " OcrSkill,\n", " SearchIndexerIndexProjection,\n", " SearchIndexerIndexProjectionSelector,\n", " SearchIndexerIndexProjectionsParameters,\n", " IndexProjectionMode,\n", " SearchIndexerSkillset,\n", " AIServicesAccountKey,\n", " AIServicesAccountIdentity,\n", " DocumentIntelligenceLayoutSkill\n", ")\n", "\n", "# Create a skillset name \n", "skillset_name = f\"{index_name}-skillset\"\n", "\n", "def create_ocr_skillset():\n", " ocr_skill = OcrSkill(\n", " description=\"OCR skill to scan PDFs and other images with text\",\n", " context=\"/document/normalized_images/*\",\n", " line_ending=\"Space\",\n", " default_language_code=\"en\",\n", " should_detect_orientation=True,\n", " inputs=[\n", " InputFieldMappingEntry(name=\"image\", source=\"/document/normalized_images/*\")\n", " ],\n", " outputs=[\n", " OutputFieldMappingEntry(name=\"text\", target_name=\"text\"),\n", " OutputFieldMappingEntry(name=\"layoutText\", target_name=\"layoutText\")\n", " ]\n", " )\n", "\n", " split_skill = SplitSkill( \n", " description=\"Split skill to chunk documents\", \n", " text_split_mode=\"pages\", \n", " context=\"/document/normalized_images/*\", \n", " maximum_page_length=2000, \n", " page_overlap_length=500, \n", " inputs=[ \n", " InputFieldMappingEntry(name=\"text\", source=\"/document/normalized_images/*/text\"), \n", " ], \n", " outputs=[ \n", " OutputFieldMappingEntry(name=\"textItems\", target_name=\"pages\") \n", " ]\n", " )\n", "\n", " embedding_skill = 
AzureOpenAIEmbeddingSkill( \n", " description=\"Skill to generate embeddings via Azure OpenAI\", \n", " context=\"/document/normalized_images/*/pages/*\", \n", " resource_url=azure_openai_endpoint, \n", " deployment_name=azure_openai_embedding_deployment, \n", " model_name=azure_openai_model_name,\n", " dimensions=azure_openai_model_dimensions,\n", " api_key=azure_openai_key, \n", " inputs=[ \n", " InputFieldMappingEntry(name=\"text\", source=\"/document/normalized_images/*/pages/*\"), \n", " ], \n", " outputs=[\n", " OutputFieldMappingEntry(name=\"embedding\", target_name=\"vector\") \n", " ]\n", " )\n", "\n", " index_projections = SearchIndexerIndexProjection( \n", " selectors=[ \n", " SearchIndexerIndexProjectionSelector( \n", " target_index_name=index_name, \n", " parent_key_field_name=\"parent_id\", \n", " source_context=\"/document/normalized_images/*/pages/*\", \n", " mappings=[\n", " InputFieldMappingEntry(name=\"chunk\", source=\"/document/normalized_images/*/pages/*\"), \n", " InputFieldMappingEntry(name=\"vector\", source=\"/document/normalized_images/*/pages/*/vector\"),\n", " InputFieldMappingEntry(name=\"title\", source=\"/document/metadata_storage_name\"),\n", " InputFieldMappingEntry(name=\"page_number\", source=\"/document/normalized_images/*/pageNumber\")\n", " ]\n", " )\n", " ], \n", " parameters=SearchIndexerIndexProjectionsParameters( \n", " projection_mode=IndexProjectionMode.SKIP_INDEXING_PARENT_DOCUMENTS \n", " ) \n", " )\n", "\n", " skills = [ocr_skill, split_skill, embedding_skill]\n", "\n", " return SearchIndexerSkillset( \n", " name=skillset_name, \n", " description=\"Skillset to chunk documents and generate embeddings\", \n", " skills=skills, \n", " index_projection=index_projections,\n", " cognitive_services_account=AIServicesAccountKey(key=azure_ai_services_key, subdomain_url=azure_ai_services_endpoint) if azure_ai_services_key else AIServicesAccountIdentity(identity=None, subdomain_url=azure_ai_services_endpoint)\n", " )\n", "\n", 
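"# Each skillset variant below follows the same pipeline: the Split skill chunks text under its\n", "# context path, the embedding skill vectorizes each chunk, and the index projection maps every\n", "# chunk to its own search document (SKIP_INDEXING_PARENT_DOCUMENTS keeps parent documents\n", "# out of the index).\n", "\n", 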
"def create_layout_skillset():\n", " layout_skill = DocumentIntelligenceLayoutSkill(\n", " description=\"Layout skill to read documents\",\n", " context=\"/document\",\n", " output_mode=\"oneToMany\",\n", " markdown_header_depth=document_layout_depth,\n", " inputs=[\n", " InputFieldMappingEntry(name=\"file_data\", source=\"/document/file_data\")\n", " ],\n", " outputs=[\n", " OutputFieldMappingEntry(name=\"markdown_document\", target_name=\"markdownDocument\")\n", " ]\n", " )\n", "\n", " split_skill = SplitSkill( \n", " description=\"Split skill to chunk documents\", \n", " text_split_mode=\"pages\", \n", " context=\"/document/markdownDocument/*\", \n", " maximum_page_length=2000, \n", " page_overlap_length=500, \n", " inputs=[ \n", " InputFieldMappingEntry(name=\"text\", source=\"/document/markdownDocument/*/content\"), \n", " ], \n", " outputs=[ \n", " OutputFieldMappingEntry(name=\"textItems\", target_name=\"pages\") \n", " ]\n", " )\n", "\n", " embedding_skill = AzureOpenAIEmbeddingSkill( \n", " description=\"Skill to generate embeddings via Azure OpenAI\", \n", " context=\"/document/markdownDocument/*/pages/*\", \n", " resource_url=azure_openai_endpoint, \n", " deployment_name=azure_openai_embedding_deployment, \n", " model_name=azure_openai_model_name,\n", " dimensions=azure_openai_model_dimensions,\n", " api_key=azure_openai_key, \n", " inputs=[ \n", " InputFieldMappingEntry(name=\"text\", source=\"/document/markdownDocument/*/pages/*\"), \n", " ], \n", " outputs=[\n", " OutputFieldMappingEntry(name=\"embedding\", target_name=\"vector\") \n", " ]\n", " )\n", "\n", " index_projections = SearchIndexerIndexProjection( \n", " selectors=[ \n", " SearchIndexerIndexProjectionSelector( \n", " target_index_name=index_name, \n", " parent_key_field_name=\"parent_id\", \n", " source_context=\"/document/markdownDocument/*/pages/*\", \n", " mappings=[\n", " InputFieldMappingEntry(name=\"chunk\", source=\"/document/markdownDocument/*/pages/*\"), \n", " 
InputFieldMappingEntry(name=\"vector\", source=\"/document/markdownDocument/*/pages/*/vector\"),\n", " InputFieldMappingEntry(name=\"title\", source=\"/document/metadata_storage_name\"),\n", " InputFieldMappingEntry(name=\"header_1\", source=\"/document/markdownDocument/*/sections/h1\"),\n", " InputFieldMappingEntry(name=\"header_2\", source=\"/document/markdownDocument/*/sections/h2\"),\n", " InputFieldMappingEntry(name=\"header_3\", source=\"/document/markdownDocument/*/sections/h3\"),\n", " ]\n", " )\n", " ], \n", " parameters=SearchIndexerIndexProjectionsParameters( \n", " projection_mode=IndexProjectionMode.SKIP_INDEXING_PARENT_DOCUMENTS \n", " ) \n", " )\n", "\n", " skills = [layout_skill, split_skill, embedding_skill]\n", "\n", " return SearchIndexerSkillset( \n", " name=skillset_name, \n", " description=\"Skillset to chunk documents and generate embeddings\", \n", " skills=skills, \n", " index_projection=index_projections,\n", " cognitive_services_account=AIServicesAccountKey(key=azure_ai_services_key, subdomain_url=azure_ai_services_endpoint) if azure_ai_services_key else AIServicesAccountIdentity(identity=None, subdomain_url=azure_ai_services_endpoint)\n", " )\n", "\n", "def create_skillset():\n", " split_skill = SplitSkill( \n", " description=\"Split skill to chunk documents\", \n", " text_split_mode=\"pages\", \n", " context=\"/document\", \n", " maximum_page_length=2000, \n", " page_overlap_length=500, \n", " inputs=[ \n", " InputFieldMappingEntry(name=\"text\", source=\"/document/content\"), \n", " ], \n", " outputs=[ \n", " OutputFieldMappingEntry(name=\"textItems\", target_name=\"pages\") \n", " ]\n", " )\n", "\n", " embedding_skill = AzureOpenAIEmbeddingSkill( \n", " description=\"Skill to generate embeddings via Azure OpenAI\", \n", " context=\"/document/pages/*\", \n", " resource_url=azure_openai_endpoint, \n", " deployment_name=azure_openai_embedding_deployment, \n", " model_name=azure_openai_model_name,\n", " 
dimensions=azure_openai_model_dimensions,\n", " api_key=azure_openai_key, \n", " inputs=[ \n", " InputFieldMappingEntry(name=\"text\", source=\"/document/pages/*\"), \n", " ], \n", " outputs=[\n", " OutputFieldMappingEntry(name=\"embedding\", target_name=\"vector\") \n", " ]\n", " )\n", "\n", " index_projections = SearchIndexerIndexProjection( \n", " selectors=[ \n", " SearchIndexerIndexProjectionSelector( \n", " target_index_name=index_name, \n", " parent_key_field_name=\"parent_id\", \n", " source_context=\"/document/pages/*\", \n", " mappings=[\n", " InputFieldMappingEntry(name=\"chunk\", source=\"/document/pages/*\"), \n", " InputFieldMappingEntry(name=\"vector\", source=\"/document/pages/*/vector\"),\n", " InputFieldMappingEntry(name=\"title\", source=\"/document/metadata_storage_name\")\n", " ]\n", " )\n", " ], \n", " parameters=SearchIndexerIndexProjectionsParameters( \n", " projection_mode=IndexProjectionMode.SKIP_INDEXING_PARENT_DOCUMENTS \n", " ) \n", " )\n", "\n", " skills = [split_skill, embedding_skill]\n", "\n", " return SearchIndexerSkillset( \n", " name=skillset_name, \n", " description=\"Skillset to chunk documents and generate embeddings\", \n", " skills=skills, \n", " index_projection=index_projections\n", " )\n", "\n", "skillset = create_ocr_skillset() if use_ocr else create_layout_skillset() if use_document_layout else create_skillset()\n", " \n", "client = SearchIndexerClient(endpoint, credential) \n", "client.create_or_update_skillset(skillset) \n", "print(f\"{skillset.name} created\") \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create an indexer" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " my-demo-index-indexer is created and running. 
If queries return no results, please wait a bit and try again.\n" ] } ], "source": [ "from azure.search.documents.indexes.models import (\n", " SearchIndexer,\n", " IndexingParameters,\n", " IndexingParametersConfiguration,\n", " BlobIndexerImageAction\n", ")\n", "\n", "# Create an indexer \n", "indexer_name = f\"{index_name}-indexer\" \n", "\n", "indexer_parameters = None\n", "if use_ocr:\n", " indexer_parameters = IndexingParameters(\n", " configuration=IndexingParametersConfiguration(\n", " image_action=BlobIndexerImageAction.GENERATE_NORMALIZED_IMAGE_PER_PAGE,\n", " query_timeout=None))\n", "elif use_document_layout:\n", " indexer_parameters = IndexingParameters(\n", " configuration=IndexingParametersConfiguration(\n", " allow_skillset_to_read_file_data=True,\n", " query_timeout=None))\n", "\n", "indexer = SearchIndexer( \n", " name=indexer_name, \n", " description=\"Indexer to index documents and generate embeddings\", \n", " skillset_name=skillset_name, \n", " target_index_name=index_name, \n", " data_source_name=data_source.name,\n", " parameters=indexer_parameters\n", ") \n", "\n", "indexer_client = SearchIndexerClient(endpoint, credential) \n", "indexer_result = indexer_client.create_or_update_indexer(indexer) \n", " \n", "# Run the indexer \n", "indexer_client.run_indexer(indexer_name) \n", "print(f' {indexer_name} is created and running. If queries return no results, please wait a bit and try again.') \n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Perform a vector similarity search" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This example shows a pure vector search using a vectorizable text query. All you need to do is pass in text, and the vectorizer defined on the index handles the query vectorization.\n", "\n", "If you indexed the health plan PDF file, send queries that ask plan-related questions."
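, "\n", "If you already have an embedding, you can pass the raw vector instead of the query text. A minimal sketch, assuming a `generate_embeddings` helper that you supply yourself and that returns a list of floats:\n", "\n", "```python\n", "from azure.search.documents.models import VectorizedQuery\n", "\n", "vector_query = VectorizedQuery(\n", "    vector=generate_embeddings(query),  # hypothetical helper you provide\n", "    k_nearest_neighbors=1,\n", "    fields=\"vector\",\n", ")\n", "```"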
] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "parent_id: aHR0cHM6Ly9oZWlkaXN0c3RvcmFnZWRlbW9lYXN0dXMuYmxvYi5jb3JlLndpbmRvd3MubmV0L2RlbW8tY29udGFpbmVyL0JlbmVmaXRfT3B0aW9ucy5wZGY1\n", "chunk_id: def295033b1d_aHR0cHM6Ly9oZWlkaXN0c3RvcmFnZWRlbW9lYXN0dXMuYmxvYi5jb3JlLndpbmRvd3MubmV0L2RlbW8tY29udGFpbmVyL0JlbmVmaXRfT3B0aW9ucy5wZGY1_pages_1\n", "Score: 0.80918294\n", "Content: a variety of in-network providers, including primary care \n", "physicians, specialists, hospitals, and pharmacies. This plan does not offer coverage for emergency \n", "services, mental health and substance abuse coverage, or out-of-network services.\n", "\n", "Comparison of Plans \n", "Both plans offer coverage for routine physicals, well-child visits, immunizations, and other preventive \n", "care services. The plans also cover preventive care services such as mammograms, colonoscopies, and \n", "other cancer screenings. \n", "\n", "Northwind Health Plus offers more comprehensive coverage than Northwind Standard. This plan offers \n", "coverage for emergency services, both in-network and out-of-network, as well as mental health and \n", "substance abuse coverage. Northwind Standard does not offer coverage for emergency services, mental \n", "health and substance abuse coverage, or out-of-network services. \n", "\n", "Both plans offer coverage for prescription drugs. Northwind Health Plus offers a wider range of \n", "prescription drug coverage than Northwind Standard. Northwind Health Plus covers generic, brand-\n", "name, and specialty drugs, while Northwind Standard only covers generic and brand-name drugs. \n", "\n", "Both plans offer coverage for vision and dental services. Northwind Health Plus offers coverage for vision \n", "exams, glasses, and contact lenses, as well as dental exams, cleanings, and fillings. Northwind Standard \n", "only offers coverage for vision exams and glasses. 
\n", "\n", "Both plans offer coverage for medical services, \n", "\n", "Both plans offer coverage for medical services. Northwind Health Plus offers coverage for hospital stays, \n", "doctor visits, lab tests, and X-rays. Northwind Standard only offers coverage for doctor visits and lab \n", "tests. \n", "\n", "Northwind Health Plus is a comprehensive plan that offers more coverage than Northwind Standard. \n", "Northwind Health Plus offers coverage for emergency services, mental health and substance abuse \n", "coverage, and out-of-network services, while Northwind Standard does not. Northwind Health Plus also\n" ] } ], "source": [ "from azure.search.documents import SearchClient\n", "from azure.search.documents.models import VectorizableTextQuery\n", "\n", "# Pure Vector Search\n", "query = \"Which is more comprehensive, Northwind Health Plus vs Northwind Standard?\"\n", "if use_ocr:\n", " query = \"Who is the national director?\"\n", "if use_document_layout:\n", " query = \"What is contoso?\"\n", " \n", "search_client = SearchClient(endpoint, index_name, credential=credential)\n", "vector_query = VectorizableTextQuery(text=query, k_nearest_neighbors=1, fields=\"vector\", exhaustive=True)\n", "# To pass in a precomputed embedding instead of having the service vectorize the text,\n", "# use a query like the one below (generate_embeddings is a helper you supply yourself)\n", "# vector_query = VectorizedQuery(vector=generate_embeddings(query), k_nearest_neighbors=3, fields=\"vector\")\n", " \n", "results = search_client.search( \n", " search_text=None, \n", " vector_queries= [vector_query],\n", " top=1\n", ") \n", " \n", "for result in results: \n", " print(f\"parent_id: {result['parent_id']}\") \n", " print(f\"chunk_id: {result['chunk_id']}\") \n", " if add_page_numbers:\n", " print(f\"page_number: {result['page_number']}\")\n", " print(f\"Score: {result['@search.score']}\") \n", " print(f\"Content: {result['chunk']}\") \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Perform a hybrid search" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", 
"output_type": "stream", "text": [ "parent_id: aHR0cHM6Ly9oZWlkaXN0c3RvcmFnZWRlbW9lYXN0dXMuYmxvYi5jb3JlLndpbmRvd3MubmV0L2RlbW8tY29udGFpbmVyL0JlbmVmaXRfT3B0aW9ucy5wZGY1\n", "chunk_id: def295033b1d_aHR0cHM6Ly9oZWlkaXN0c3RvcmFnZWRlbW9lYXN0dXMuYmxvYi5jb3JlLndpbmRvd3MubmV0L2RlbW8tY29udGFpbmVyL0JlbmVmaXRfT3B0aW9ucy5wZGY1_pages_1\n", "Score: 0.03333333507180214\n", "Content: a variety of in-network providers, including primary care \n", "physicians, specialists, hospitals, and pharmacies. This plan does not offer coverage for emergency \n", "services, mental health and substance abuse coverage, or out-of-network services.\n", "\n", "Comparison of Plans \n", "Both plans offer coverage for routine physicals, well-child visits, immunizations, and other preventive \n", "care services. The plans also cover preventive care services such as mammograms, colonoscopies, and \n", "other cancer screenings. \n", "\n", "Northwind Health Plus offers more comprehensive coverage than Northwind Standard. This plan offers \n", "coverage for emergency services, both in-network and out-of-network, as well as mental health and \n", "substance abuse coverage. Northwind Standard does not offer coverage for emergency services, mental \n", "health and substance abuse coverage, or out-of-network services. \n", "\n", "Both plans offer coverage for prescription drugs. Northwind Health Plus offers a wider range of \n", "prescription drug coverage than Northwind Standard. Northwind Health Plus covers generic, brand-\n", "name, and specialty drugs, while Northwind Standard only covers generic and brand-name drugs. \n", "\n", "Both plans offer coverage for vision and dental services. Northwind Health Plus offers coverage for vision \n", "exams, glasses, and contact lenses, as well as dental exams, cleanings, and fillings. Northwind Standard \n", "only offers coverage for vision exams and glasses. \n", "\n", "Both plans offer coverage for medical services. 
Northwind Health Plus offers coverage for hospital stays, \n", "doctor visits, lab tests, and X-rays. Northwind Standard only offers coverage for doctor visits and lab \n", "tests. \n", "\n", "Northwind Health Plus is a comprehensive plan that offers more coverage than Northwind Standard. \n", "Northwind Health Plus offers coverage for emergency services, mental health and substance abuse \n", "coverage, and out-of-network services, while Northwind Standard does not. Northwind Health Plus also\n" ] } ], "source": [ "# Hybrid Search\n", "query = \"Which is more comprehensive, Northwind Health Plus vs Northwind Standard?\" \n", "if use_ocr:\n", " query = \"Who is the national director?\"\n", "if use_document_layout:\n", " query = \"What is contoso?\"\n", "\n", "search_client = SearchClient(endpoint, index_name, credential=credential)\n", "vector_query = VectorizableTextQuery(text=query, k_nearest_neighbors=1, fields=\"vector\", exhaustive=True)\n", " \n", "results = search_client.search( \n", " search_text=query, \n", " vector_queries= [vector_query],\n", " select=[\"parent_id\", \"chunk_id\", \"chunk\"],\n", " top=1\n", ") \n", " \n", "for result in results: \n", " print(f\"parent_id: {result['parent_id']}\") \n", " print(f\"chunk_id: {result['chunk_id']}\") \n", " print(f\"Score: {result['@search.score']}\") \n", " print(f\"Content: {result['chunk']}\") \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Perform a hybrid search + semantic reranking" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Semantic Answer: <em>Northwind Health Plus </em>is a<em> comprehensive </em>plan that<em> offers more coverage than Northwind Standard.</em> The table below shows a cost comparison between the different health plans offered by Contoso Electronics: Next Steps We hope that this information has been helpful in understanding the differences between Northwind Health Plus and 
Northwind Stan...\n", "Semantic Answer Score: 0.9670000076293945\n", "\n", "parent_id: aHR0cHM6Ly9oZWlkaXN0c3RvcmFnZWRlbW9lYXN0dXMuYmxvYi5jb3JlLndpbmRvd3MubmV0L2RlbW8tY29udGFpbmVyL0JlbmVmaXRfT3B0aW9ucy5wZGY1\n", "chunk_id: def295033b1d_aHR0cHM6Ly9oZWlkaXN0c3RvcmFnZWRlbW9lYXN0dXMuYmxvYi5jb3JlLndpbmRvd3MubmV0L2RlbW8tY29udGFpbmVyL0JlbmVmaXRfT3B0aW9ucy5wZGY1_pages_1\n", "Reranker Score: 3.378175973892212\n", "Content: a variety of in-network providers, including primary care \n", "physicians, specialists, hospitals, and pharmacies. This plan does not offer coverage for emergency \n", "services, mental health and substance abuse coverage, or out-of-network services.\n", "\n", "Comparison of Plans \n", "Both plans offer coverage for routine physicals, well-child visits, immunizations, and other preventive \n", "care services. The plans also cover preventive care services such as mammograms, colonoscopies, and \n", "other cancer screenings. \n", "\n", "Northwind Health Plus offers more comprehensive coverage than Northwind Standard. This plan offers \n", "coverage for emergency services, both in-network and out-of-network, as well as mental health and \n", "substance abuse coverage. Northwind Standard does not offer coverage for emergency services, mental \n", "health and substance abuse coverage, or out-of-network services. \n", "\n", "Both plans offer coverage for prescription drugs. Northwind Health Plus offers a wider range of \n", "prescription drug coverage than Northwind Standard. Northwind Health Plus covers generic, brand-\n", "name, and specialty drugs, while Northwind Standard only covers generic and brand-name drugs. \n", "\n", "Both plans offer coverage for vision and dental services. Northwind Health Plus offers coverage for vision \n", "exams, glasses, and contact lenses, as well as dental exams, cleanings, and fillings. Northwind Standard \n", "only offers coverage for vision exams and glasses. \n", "\n", "Both plans offer coverage for medical services. 
Northwind Health Plus offers coverage for hospital stays, \n", "doctor visits, lab tests, and X-rays. Northwind Standard only offers coverage for doctor visits and lab \n", "tests. \n", "\n", "Northwind Health Plus is a comprehensive plan that offers more coverage than Northwind Standard. \n", "Northwind Health Plus offers coverage for emergency services, mental health and substance abuse \n", "coverage, and out-of-network services, while Northwind Standard does not. Northwind Health Plus also\n", "Caption: The plans also cover preventive care services such as mammograms, colonoscopies, and other cancer screenings. <em>Northwind Health Plus </em>offers more<em> comprehensive </em>coverage than<em> Northwind Standard.</em> This plan offers coverage for emergency services, both in-network and out-of-network, as well as mental health and substance abuse coverage. Northwind.\n", "\n" ] } ], "source": [ "from azure.search.documents.models import (\n", " QueryType,\n", " QueryCaptionType,\n", " QueryAnswerType\n", ")\n", "# Semantic Hybrid Search\n", "query = \"Which is more comprehensive, Northwind Health Plus vs Northwind Standard?\"\n", "if use_ocr:\n", " query = \"Who is the national director?\"\n", "if use_document_layout:\n", " query = \"What is contoso?\"\n", "\n", "search_client = SearchClient(endpoint, index_name, credential)\n", "vector_query = VectorizableTextQuery(text=query, k_nearest_neighbors=1, fields=\"vector\", exhaustive=True)\n", "\n", "results = search_client.search( \n", " search_text=query,\n", " vector_queries=[vector_query],\n", " select=[\"parent_id\", \"chunk_id\", \"chunk\"],\n", " query_type=QueryType.SEMANTIC,\n", " semantic_configuration_name='my-semantic-config',\n", " query_caption=QueryCaptionType.EXTRACTIVE,\n", " query_answer=QueryAnswerType.EXTRACTIVE,\n", " top=1\n", ")\n", "\n", "semantic_answers = results.get_answers()\n", "if semantic_answers:\n", " for answer in semantic_answers:\n", " if answer.highlights:\n", " 
print(f\"Semantic Answer: {answer.highlights}\")\n", " else:\n", " print(f\"Semantic Answer: {answer.text}\")\n", " print(f\"Semantic Answer Score: {answer.score}\\n\")\n", "\n", "for result in results:\n", " print(f\"parent_id: {result['parent_id']}\") \n", " print(f\"chunk_id: {result['chunk_id']}\") \n", " print(f\"Reranker Score: {result['@search.reranker_score']}\")\n", " print(f\"Content: {result['chunk']}\") \n", "\n", " captions = result[\"@search.captions\"]\n", " if captions:\n", " caption = captions[0]\n", " if caption.highlights:\n", " print(f\"Caption: {caption.highlights}\\n\")\n", " else:\n", " print(f\"Caption: {caption.text}\\n\")\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.9" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }