demo-python/code/community-integration/hugging-face/azure-search-vector-python-huggingface-model-sample.ipynb (563 lines of code) (raw):

{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Azure AI Search with Hugging Face embedding models\n", "\n", "This notebook demonstrates how to use Azure AI Search with a Hugging Face embedding model, [E5-small-v2](https://huggingface.co/intfloat/e5-small-v2), and the Azure AI Search Documents Python SDK.\n", "\n", "Azure AI Search on any tier supports vector workloads, but we recommend Basic or higher for this demo. [Enable semantic ranker](https://learn.microsoft.com/azure/search/semantic-how-to-enable-disable) if you want to run the semantic hybrid search example at the end of this notebook." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set up a Python virtual environment in Visual Studio Code\n", "\n", "1. Open the Command Palette (Ctrl+Shift+P).\n", "1. Search for **Python: Create Environment**.\n", "1. Select **Venv**.\n", "1. Select a Python interpreter. Choose 3.10 or later.\n", "\n", "It can take a minute to set up. If you run into problems, see [Python environments in VS Code](https://code.visualstudio.com/docs/python/environments)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Install packages" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "! pip install -r azure-search-vector-python-huggingface-model-sample-requirements.txt --quiet" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load .env file\n", "\n", "Copy `/code/.env-sample` to a `.env` file in the sample folder, and update the values for your service. The search service must exist, but the search index is created and loaded during code execution. Provide a unique name for the index. You can find the endpoint and API key in the Azure portal."
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from dotenv import load_dotenv\n", "from azure.identity import DefaultAzureCredential\n", "from azure.core.credentials import AzureKeyCredential\n", "import os\n", "\n", "load_dotenv(override=True) # take environment variables from .env.\n", "\n", "# Variables not used here do not need to be updated in your .env file\n", "endpoint = os.environ[\"AZURE_SEARCH_SERVICE_ENDPOINT\"]\n", "# Use key-based auth if an admin key is set; otherwise fall back to DefaultAzureCredential.\n", "# os.getenv avoids a KeyError when the variable is missing from the .env file.\n", "admin_key = os.getenv(\"AZURE_SEARCH_ADMIN_KEY\")\n", "credential = AzureKeyCredential(admin_key) if admin_key else DefaultAzureCredential()\n", "index_name = os.environ[\"AZURE_SEARCH_INDEX\"]" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Create embeddings\n", "\n", "This step reads the local `text-sample.json` file, generates embeddings with the pre-trained E5-small-v2 embedding model, and writes the vectorized output to a local file that is consumed during indexing." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sentence_transformers import SentenceTransformer \n", "import os\n", "import json\n", "\n", "model = SentenceTransformer('intfloat/e5-small-v2') \n", "sample_path = os.path.join(\"..\", \"..\", \"..\", \"..\", \"data\", \"text-sample.json\")\n", "with open(sample_path, 'r', encoding='utf-8') as file: \n", " data = json.load(file) \n", " \n", "# Embed the title and content of each document and store the vectors as plain lists\n", "for item in data: \n", " title = item['title'] \n", " content = item['content'] \n", " title_embeddings = model.encode(title, normalize_embeddings=True) \n", " content_embeddings = model.encode(content, normalize_embeddings=True) \n", " item['titleVector'] = title_embeddings.tolist() \n", " item['contentVector'] = content_embeddings.tolist() \n", "\n", "os.makedirs(\"data\", exist_ok=True) # make sure the output folder exists\n", "output_path = os.path.join(\"data\", \"docVectors-e5.json\")\n", "with open(output_path, \"w\", encoding='utf-8') as f: \n", " json.dump(data, f) \n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": 
{}, "source": [ "### Create a search index\n", "\n", "Create a search index schema with vector search and semantic configurations." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "huggingface created\n" ] } ], "source": [ "from azure.search.documents.indexes import SearchIndexClient\n", "from azure.search.documents.indexes.models import (\n", " SimpleField,\n", " SearchFieldDataType,\n", " SearchableField,\n", " SearchField,\n", " VectorSearch,\n", " HnswAlgorithmConfiguration,\n", " VectorSearchProfile,\n", " SemanticConfiguration,\n", " SemanticSearch,\n", " SemanticField,\n", " SemanticPrioritizedFields,\n", " SearchIndex\n", ")\n", "\n", "# Create the index client and define the fields. \n", "# Vector fields use 384 dimensions to match the E5-small-v2 embedding size. \n", "index_client = SearchIndexClient(endpoint=endpoint, credential=credential) \n", "fields = [ \n", " SimpleField(name=\"id\", type=SearchFieldDataType.String, key=True, sortable=True, filterable=True, facetable=True), \n", " SearchableField(name=\"title\", type=SearchFieldDataType.String), \n", " SearchableField(name=\"content\", type=SearchFieldDataType.String), \n", " SearchableField(name=\"category\", type=SearchFieldDataType.String, filterable=True), \n", " SearchField(name=\"titleVector\", type=SearchFieldDataType.Collection(SearchFieldDataType.Single), \n", " searchable=True, vector_search_dimensions=384, vector_search_profile_name=\"myHnswProfile\"), \n", " SearchField(name=\"contentVector\", type=SearchFieldDataType.Collection(SearchFieldDataType.Single), \n", " searchable=True, vector_search_dimensions=384, vector_search_profile_name=\"myHnswProfile\"), \n", "] \n", " \n", "# Configure vector search: one HNSW algorithm and a profile that references it \n", "vector_search = VectorSearch( \n", " algorithms=[ \n", " HnswAlgorithmConfiguration( \n", " name=\"myHnsw\"\n", " ), \n", " ], \n", " profiles=[ \n", " VectorSearchProfile( \n", " name=\"myHnswProfile\", \n", " algorithm_configuration_name=\"myHnsw\", \n", " ), \n", " ], \n", ") \n", " \n", 
"semantic_config = SemanticConfiguration( \n", " name=\"my-semantic-config\", \n", " prioritized_fields=SemanticPrioritizedFields( \n", " title_field=SemanticField(field_name=\"title\"), \n", " keywords_fields=[SemanticField(field_name=\"category\")], \n", " content_fields=[SemanticField(field_name=\"content\")] \n", " ) \n", ") \n", " \n", "# Create the semantic settings with the configuration \n", "semantic_search = SemanticSearch(configurations=[semantic_config]) \n", " \n", "# Create the search index with the vector and semantic settings \n", "index = SearchIndex(name=index_name, fields=fields, \n", " vector_search=vector_search, semantic_search=semantic_search) \n", "result = index_client.create_or_update_index(index) \n", "print(f'{result.name} created') \n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Upload and store embeddings\n", "\n", "This step loads the JSON file containing your embeddings and uploads the documents to the search index through a search client." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Uploaded 108 documents\n" ] } ], "source": [ "from azure.search.documents import SearchClient\n", "import json\n", "\n", "# Load the vectorized documents and upload them to the index\n", "output_path = os.path.join(\"data\", \"docVectors-e5.json\")\n", "with open(output_path, 'r') as file: \n", " documents = json.load(file) \n", "search_client = SearchClient(endpoint=endpoint, index_name=index_name, credential=credential)\n", "result = search_client.upload_documents(documents) \n", "print(f\"Uploaded {len(documents)} documents\") " ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Perform a vector similarity search" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Title: Azure DevOps\n", "Score: 0.8424989\n", "Content: Azure DevOps is a suite of services 
that help you plan, build, and deploy applications. It includes Azure Boards for work item tracking, Azure Repos for source code management, Azure Pipelines for continuous integration and continuous deployment, Azure Test Plans for manual and automated testing, and Azure Artifacts for package management. DevOps supports a wide range of programming languages, frameworks, and platforms, making it easy to integrate with your existing development tools and processes. It also integrates with other Azure services, such as Azure App Service and Azure Functions.\n", "Category: Developer Tools\n", "\n" ] } ], "source": [ "from azure.search.documents.models import VectorizedQuery\n", "from sentence_transformers import SentenceTransformer \n", "\n", "\n", "model = SentenceTransformer('intfloat/e5-small-v2') \n", "# Pure Vector Search \n", "query = \"tools for software development\" \n", " \n", "search_client = SearchClient(endpoint, index_name, credential=credential) \n", "query_embeddings = model.encode(query, normalize_embeddings=True) \n", "vector_query = VectorizedQuery(vector=query_embeddings.tolist(), k_nearest_neighbors=1, fields=\"contentVector\")\n", " \n", "results = search_client.search( \n", " search_text=None,\n", " vector_queries= [vector_query], \n", " select=[\"title\", \"content\", \"category\"], \n", " top=1\n", ") \n", " \n", "for result in results: \n", " print(f\"Title: {result['title']}\") \n", " print(f\"Score: {result['@search.score']}\") \n", " print(f\"Content: {result['content']}\") \n", " print(f\"Category: {result['category']}\\n\") \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Perform a cross-field vector search" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Title: Azure DevTest Labs\n", "Score: 0.01666666753590107\n", "Content: Azure DevTest Labs is a fully managed service that enables you to create, manage, and share development and 
test environments in Azure. It provides features like custom templates, cost management, and integration with Azure DevOps. DevTest Labs supports various platforms, such as Windows, Linux, and Kubernetes. You can use Azure DevTest Labs to improve your application development lifecycle, reduce your costs, and ensure the consistency of your environments. It also integrates with other Azure services, such as Azure Virtual Machines and Azure App Service.\n", "Category: Developer Tools\n", "\n" ] } ], "source": [ "# Cross-Field Vector Search: one query vector searched against both vector fields \n", "query = \"tools for software development\" \n", " \n", "query_embeddings = model.encode(query, normalize_embeddings=True) \n", "vector_query = VectorizedQuery(vector=query_embeddings.tolist(), k_nearest_neighbors=1, fields=\"contentVector, titleVector\")\n", " \n", "results = search_client.search( \n", " search_text=None,\n", " vector_queries= [vector_query], \n", " select=[\"title\", \"content\", \"category\"],\n", " top=1\n", ") \n", " \n", "for result in results: \n", " print(f\"Title: {result['title']}\") \n", " print(f\"Score: {result['@search.score']}\") \n", " print(f\"Content: {result['content']}\") \n", " print(f\"Category: {result['category']}\\n\") \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Perform a multi-vector search" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Title: Azure DevTest Labs\n", "Score: 0.01666666753590107\n", "Content: Azure DevTest Labs is a fully managed service that enables you to create, manage, and share development and test environments in Azure. It provides features like custom templates, cost management, and integration with Azure DevOps. DevTest Labs supports various platforms, such as Windows, Linux, and Kubernetes. You can use Azure DevTest Labs to improve your application development lifecycle, reduce your costs, and ensure the consistency of your environments. 
It also integrates with other Azure services, such as Azure Virtual Machines and Azure App Service.\n", "Category: Developer Tools\n", "\n" ] } ], "source": [ "# Multi-Vector Search \n", "query = \"tools for software development\" \n", " \n", "query_embeddings = model.encode(query, normalize_embeddings=True) \n", "vector_query_1 = VectorizedQuery(vector=query_embeddings.tolist(), k_nearest_neighbors=1, fields=\"titleVector\")\n", "vector_query_2 = VectorizedQuery(vector=query_embeddings.tolist(), k_nearest_neighbors=1, fields=\"contentVector\")\n", " \n", "results = search_client.search( \n", " search_text=None, \n", " vector_queries=[vector_query_1, vector_query_2], \n", " select=[\"title\", \"content\", \"category\"],\n", " top=1\n", ") \n", " \n", "for result in results: \n", " print(f\"Title: {result['title']}\") \n", " print(f\"Score: {result['@search.score']}\") \n", " print(f\"Content: {result['content']}\") \n", " print(f\"Category: {result['category']}\\n\") \n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Perform a pure vector search with a filter" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Title: Azure DevOps\n", "Score: 0.8424989\n", "Content: Azure DevOps is a suite of services that help you plan, build, and deploy applications. It includes Azure Boards for work item tracking, Azure Repos for source code management, Azure Pipelines for continuous integration and continuous deployment, Azure Test Plans for manual and automated testing, and Azure Artifacts for package management. DevOps supports a wide range of programming languages, frameworks, and platforms, making it easy to integrate with your existing development tools and processes. 
It also integrates with other Azure services, such as Azure App Service and Azure Functions.\n", "Category: Developer Tools\n", "\n" ] } ], "source": [ "# Pure Vector Search \n", "query = \"tools for software development\" \n", " \n", "query_embeddings = model.encode(query, normalize_embeddings=True) \n", "vector_query = VectorizedQuery(vector=query_embeddings.tolist(), k_nearest_neighbors=1, fields=\"contentVector\")\n", " \n", "results = search_client.search( \n", " search_text=None,\n", " vector_queries= [vector_query], \n", " filter=\"category eq 'Developer Tools'\",\n", " select=[\"title\", \"content\", \"category\"], \n", " top=1\n", ") \n", " \n", "for result in results: \n", " print(f\"Title: {result['title']}\") \n", " print(f\"Score: {result['@search.score']}\") \n", " print(f\"Content: {result['content']}\") \n", " print(f\"Category: {result['category']}\\n\") \n", "\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Perform a hybrid search" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Title: Azure Storage\n", "Score: 0.03333333507180214\n", "Content: Azure Storage is a scalable, durable, and highly available cloud storage service that supports a variety of data types, including blobs, files, queues, and tables. It provides a massively scalable object store for unstructured data. Storage supports data redundancy and geo-replication, ensuring high durability and availability. It offers a variety of data access and management options, including REST APIs, SDKs, and Azure Portal. 
You can secure your data using encryption at rest and in transit.\n", "Category: Storage\n", "\n" ] } ], "source": [ "# Hybrid Search: combines the keyword query (search_text) with the vector query \n", "query = \"scalable storage solution\" \n", " \n", "query_embeddings = model.encode(query, normalize_embeddings=True) \n", "vector_query = VectorizedQuery(vector=query_embeddings.tolist(), k_nearest_neighbors=1, fields=\"contentVector\")\n", " \n", "results = search_client.search( \n", " search_text=query,\n", " vector_queries= [vector_query], \n", " select=[\"title\", \"content\", \"category\"], \n", " top=1\n", ") \n", " \n", "for result in results: \n", " print(f\"Title: {result['title']}\") \n", " print(f\"Score: {result['@search.score']}\") \n", " print(f\"Content: {result['content']}\") \n", " print(f\"Category: {result['category']}\\n\") \n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Perform a semantic hybrid search" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Semantic Answer: Azure Cognitive Search is<em> a fully managed search-as-a-service that enables you to build rich search experiences for your applications.</em> It provides features like full-text search, faceted navigation, and filters. Azure Cognitive Search supports various data sources, such as Azure SQL Database, Azure Blob Storage, and Azure Cosmos DB.\n", "Semantic Answer Score: 0.9814453125\n", "\n", "Title: Azure Cognitive Search\n", "Reranker Score: 3.0556066036224365\n", "Content: Azure Cognitive Search is a fully managed search-as-a-service that enables you to build rich search experiences for your applications. It provides features like full-text search, faceted navigation, and filters. Azure Cognitive Search supports various data sources, such as Azure SQL Database, Azure Blob Storage, and Azure Cosmos DB. 
You can use Azure Cognitive Search to index your data, create custom scoring profiles, and integrate with other Azure services. It also integrates with other Azure services, such as Azure Cognitive Services and Azure Machine Learning.\n", "Category: AI + Machine Learning\n", "Caption: Azure Cognitive Search is a fully managed search-as-a-service that enables you to build rich search experiences for your applications. It provides features like full-text search, faceted navigation, and filters.<em> Azure</em> Cognitive<em> Search</em> supports various data sources, such as Azure SQL Database, Azure Blob Storage, and Azure Cosmos DB.\n", "\n" ] } ], "source": [ "from azure.search.documents.models import QueryType, QueryCaptionType, QueryAnswerType\n", "\n", "# Semantic Hybrid Search\n", "query = \"what is azure search?\"\n", "\n", "query_embeddings = model.encode(query, normalize_embeddings=True) \n", "vector_query = VectorizedQuery(vector=query_embeddings.tolist(), k_nearest_neighbors=1, fields=\"contentVector\")\n", "\n", "results = search_client.search( \n", " search_text=query, \n", " vector_queries=[vector_query],\n", " select=[\"title\", \"content\", \"category\"],\n", " query_type=QueryType.SEMANTIC, semantic_configuration_name='my-semantic-config', query_caption=QueryCaptionType.EXTRACTIVE, query_answer=QueryAnswerType.EXTRACTIVE,\n", " top=1\n", ")\n", "\n", "semantic_answers = results.get_answers()\n", "for answer in semantic_answers:\n", " if answer.highlights:\n", " print(f\"Semantic Answer: {answer.highlights}\")\n", " else:\n", " print(f\"Semantic Answer: {answer.text}\")\n", " print(f\"Semantic Answer Score: {answer.score}\\n\")\n", "\n", "for result in results:\n", " print(f\"Title: {result['title']}\")\n", " print(f\"Reranker Score: {result['@search.reranker_score']}\")\n", " print(f\"Content: {result['content']}\")\n", " print(f\"Category: {result['category']}\")\n", "\n", " captions = result[\"@search.captions\"]\n", " if captions:\n", " caption = 
captions[0]\n", " if caption.highlights:\n", " print(f\"Caption: {caption.highlights}\\n\")\n", " else:\n", " print(f\"Caption: {caption.text}\\n\")\n" ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.9" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }