demo-python/code/utilities/index-backup-restore/azure-search-backup-and-restore.ipynb (248 lines of code) (raw):

{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Azure AI Search backup and restore sample\n", "\n", "**This unofficial code sample is offered \"as-is\" and might not work for all customers and scenarios. If you run into difficulties, you should manually recreate and reload your search index on a new search service.**\n", "\n", "This notebook demonstrates how to back up and restore a search index and migrate it to another instance of Azure AI Search. The target instance can be a different tier and configuration, but make sure it has available storage and quota, and that the [region has the features you require](https://azure.microsoft.com/explore/global-infrastructure/products-by-region/?products=search).\n", "\n", "> **Note**: Azure AI Search now supports [service upgrades](https://learn.microsoft.com/azure/search/search-how-to-upgrade) and [pricing tier changes](https://learn.microsoft.com/azure/search/search-capacity-planning#change-your-pricing-tier). If you're backing up and restoring your index for migration to a higher capacity service, you now have other options.\n", "\n", "### Prerequisites\n", "\n", "+ The search index has 100,000 documents or less. For larger indexes, use [Resumable backup and restore](../resumable-index-backup-restore/backup-and-restore.ipynb). \n", "\n", "+ The search index you're backing up must have a `key` field that is `filterable` and `sortable`. If your document key doesn't meet this criteria, you can create and populate a new key field, and remove the `key=true` flag from the previous key field. \n", "\n", "+ Only fields marked as `retrievable` can be successfully backed up and restored. You can toggle `retrievable` between true and false on any field, but as of this writing, the Azure portal doesn't allow you to modify `retrievable` on vector fields. As a workaround, use an Azure SDK or Postman with an Update Index REST call.\n", "\n", " Setting `retrievable` to true doesn't increase index size. A `retrievable` action pulls from content that already exists in your index." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set up a Python virtual environment in Visual Studio Code\n", "\n", "1. Open the Command Palette (Ctrl+Shift+P).\n", "1. Search for **Python: Create Environment**.\n", "1. Select **Venv**.\n", "1. Select a Python interpreter. Choose 3.10 or later.\n", "\n", "It can take a minute to set up. If you run into problems, see [Python environments in VS Code](https://code.visualstudio.com/docs/python/environments)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Install packages" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! pip install -r azure-search-backup-and-restore-requirements.txt --quiet" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load .env file (Copy .env-sample to .env and update accordingly)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from dotenv import load_dotenv\n", "from azure.identity import DefaultAzureCredential\n", "from azure.core.credentials import AzureKeyCredential\n", "import os\n", "\n", "load_dotenv(override=True) # take environment variables from .env.\n", "\n", "# Variables not used here do not need to be updated in your .env file\n", "source_endpoint = os.environ[\"AZURE_SEARCH_SERVICE_ENDPOINT\"]\n", "source_credential = AzureKeyCredential(os.environ[\"AZURE_SEARCH_ADMIN_KEY\"]) if len(os.environ[\"AZURE_SEARCH_ADMIN_KEY\"]) > 0 else DefaultAzureCredential()\n", "source_index_name = os.environ[\"AZURE_SEARCH_INDEX\"]\n", "# Default to same service for copying index\n", "target_endpoint = os.environ[\"AZURE_TARGET_SEARCH_SERVICE_ENDPOINT\"] if len(os.environ[\"AZURE_TARGET_SEARCH_SERVICE_ENDPOINT\"]) > 0 else source_endpoint\n", "target_credential = AzureKeyCredential(os.environ[\"AZURE_TARGET_SEARCH_ADMIN_KEY\"]) if len(os.environ[\"AZURE_TARGET_SEARCH_ADMIN_KEY\"]) > 0 else DefaultAzureCredential()\n", "target_index_name = os.environ[\"AZURE_TARGET_SEARCH_INDEX\"] " ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "This script demonstrates backing up and restoring an Azure AI Search index between two services. The `backup_and_restore_index` function retrieves the source index definition, creates a new target index, backs up all documents, and restores them to the target index." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azure.search.documents import SearchClient \n", "from azure.search.documents.indexes import SearchIndexClient\n", "import tqdm \n", " \n", "def create_clients(endpoint, credential, index_name): \n", " search_client = SearchClient(endpoint=endpoint, index_name=index_name, credential=credential) \n", " index_client = SearchIndexClient(endpoint=endpoint, credential=credential) \n", " return search_client, index_client\n", "\n", "def total_count(search_client):\n", " response = search_client.search(include_total_count=True, search_text=\"*\", top=0)\n", " return response.get_count()\n", " \n", "def search_results_with_filter(search_client, key_field_name):\n", " last_item = None\n", " response = search_client.search(search_text=\"*\", top=100000, order_by=key_field_name).by_page()\n", " while True:\n", " for page in response:\n", " page = list(page)\n", " if len(page) > 0:\n", " last_item = page[-1]\n", " yield page\n", " else:\n", " last_item = None\n", " \n", " if last_item:\n", " response = search_client.search(search_text=\"*\", top=100000, order_by=key_field_name, filter=f\"{key_field_name} gt '{last_item[key_field_name]}'\").by_page()\n", " else:\n", " break\n", "\n", "def search_results_without_filter(search_client):\n", " response = search_client.search(search_text=\"*\", top=100000).by_page()\n", " for page in response:\n", " page = list(page)\n", " yield page\n", "\n", "def backup_and_restore_index(source_endpoint, source_key, source_index_name, target_endpoint, target_key, target_index_name): \n", " # Create search and index clients \n", " source_search_client, source_index_client = create_clients(source_endpoint, source_key, source_index_name) \n", " target_search_client, target_index_client = create_clients(target_endpoint, target_key, target_index_name) \n", " \n", " # Get the source index definition \n", " source_index = source_index_client.get_index(name=source_index_name)\n", " non_retrievable_fields = []\n", " for field in source_index.fields:\n", " if field.hidden == True:\n", " non_retrievable_fields.append(field)\n", " if field.key == True:\n", " key_field = field\n", "\n", " if not key_field:\n", " raise Exception(\"Key Field Not Found\")\n", " \n", " if len(non_retrievable_fields) > 0:\n", " print(f\"WARNING: The following fields are not marked as retrievable and cannot be backed up and restored: {', '.join(f.name for f in non_retrievable_fields)}\")\n", " \n", " # Create target index with the same definition \n", " source_index.name = target_index_name\n", " target_index_client.create_or_update_index(source_index)\n", " \n", " document_count = total_count(source_search_client)\n", " can_use_filter = key_field.sortable and key_field.filterable\n", " if not can_use_filter:\n", " print(\"WARNING: The key field is not filterable or not sortable. A maximum of 100,000 records can be backed up and restored.\")\n", " # Backup and restore documents \n", " all_documents = search_results_with_filter(source_search_client, key_field.name) if can_use_filter else search_results_without_filter(source_search_client)\n", "\n", " print(\"Backing up and restoring documents:\") \n", " failed_documents = 0 \n", " failed_keys = [] \n", " with tqdm.tqdm(total=document_count) as progress_bar: \n", " for page in all_documents:\n", " result = target_search_client.upload_documents(documents=page) \n", " progress_bar.update(len(result)) \n", " \n", " for item in result: \n", " if item.succeeded is not True: \n", " failed_documents += 1\n", " failed_keys.append(page[result.index_of(item)].id) \n", " print(f\"Document upload error: {item.error.message}\") \n", " \n", " if failed_documents > 0: \n", " print(f\"Failed documents: {failed_documents}\") \n", " print(f\"Failed document keys: {failed_keys}\") \n", " else: \n", " print(\"All documents uploaded successfully.\") \n", " \n", " print(f\"Successfully backed up '{source_index_name}' and restored to '{target_index_name}'\") \n", " return source_search_client, target_search_client, all_documents \n", "\n", "source_search_client, target_search_client, all_documents = backup_and_restore_index(source_endpoint, source_credential, source_index_name, target_endpoint, target_credential, target_index_name) \n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Use document counts to verify a successful restore. The verify_counts function compares document counts between source and target indexes after backup and restore. It prints a message indicating if the document counts match or not.\n", "\n", "Storage usage won't be exactly the same as the original index. It's expected to see small variations in consumed storage." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def verify_counts(source_search_client, target_search_client): \n", " source_document_count = source_search_client.get_document_count() \n", " target_document_count = target_search_client.get_document_count() \n", " \n", " print(f\"Source document count: {source_document_count}\") \n", " print(f\"Target document count: {target_document_count}\") \n", " \n", " if source_document_count == target_document_count: \n", " print(\"Document counts match.\") \n", " else: \n", " print(\"Document counts do not match.\") \n", " \n", "# Call the verify_counts function with the search_clients returned by the backup_and_restore_index function \n", "verify_counts(source_search_client, target_search_client) \n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.3" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }