# Azure AI Search CSV integrated vectorization sample

This Python notebook demonstrates the [integrated vectorization](https://learn.microsoft.com/azure/search/vector-search-integrated-vectorization) and [CSV indexing](https://learn.microsoft.com/en-us/azure/search/search-howto-index-csv-blobs) features of Azure AI Search that are currently in public preview. 

Integrated vectorization depends on indexers, skillsets, the AzureOpenAIEmbedding skill, and your Azure OpenAI resource for embedding.

This example uses a CSV from the `csv_data` folder for chunking, embedding, indexing, and queries.

### Prerequisites

+ An Azure subscription, with [access to Azure OpenAI](https://aka.ms/oai/access).
 
+ Azure AI Search, any tier, but we recommend Basic or higher for this workload. [Enable semantic ranker](https://learn.microsoft.com/azure/search/semantic-how-to-enable-disable) if you want to run a hybrid query with semantic ranking.

+ A deployment of the `text-embedding-3-large` model on Azure OpenAI.

+ A deployment of the `gpt-4o` or `gpt-4o-mini` model on Azure OpenAI. 

+ Azure Blob Storage. This notebook connects to your storage account and loads a container with the sample CSV.


### Set up a Python virtual environment in Visual Studio Code

1. Open the Command Palette (Ctrl+Shift+P).
1. Search for **Python: Create Environment**.
1. Select **Venv**.
1. Select a Python interpreter. Choose 3.10 or later.

It can take a minute to set up. If you run into problems, see [Python environments in VS Code](https://code.visualstudio.com/docs/python/environments).
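
The interpreter requirement in the steps above can also be checked from inside the notebook. This is a minimal sketch; the helper name `meets_minimum_version` is hypothetical and not part of the sample.

```python
import sys

# Minimum interpreter version named in the setup steps above.
MINIMUM_VERSION = (3, 10)

def meets_minimum_version(version_info=sys.version_info, minimum=MINIMUM_VERSION) -> bool:
    """Return True when the (major, minor) pair is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)

if not meets_minimum_version():
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} found; 3.10 or later is expected.")
```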

### Install packages

In [1]:
! pip install -r csv-indexer-requirements.txt --quiet

### Load .env file (Copy .env-sample to .env and update accordingly)

In [21]:
from dotenv import load_dotenv
from azure.identity import DefaultAzureCredential
from azure.core.credentials import AzureKeyCredential
import os

load_dotenv(override=True) # take environment variables from .env.

# Variables not used here do not need to be updated in your .env file
endpoint = os.environ["AZURE_SEARCH_SERVICE_ENDPOINT"]
# You do not need a key if you are using keyless authentication
# To learn more, please visit https://learn.microsoft.com/azure/search/search-security-rbac
credential = AzureKeyCredential(os.getenv("AZURE_SEARCH_ADMIN_KEY")) if os.getenv("AZURE_SEARCH_ADMIN_KEY") else DefaultAzureCredential()
index_name = os.getenv("AZURE_SEARCH_INDEX", "csv-vec")
blob_connection_string = os.environ["BLOB_CONNECTION_STRING"]
# search blob datasource connection string is optional - defaults to blob connection string
# This field is only necessary if you are using MI to connect to the data source
# https://learn.microsoft.com/azure/search/search-howto-indexing-azure-blob-storage#supported-credentials-and-connection-strings
search_blob_connection_string = os.getenv("SEARCH_BLOB_DATASOURCE_CONNECTION_STRING", blob_connection_string)
blob_container_name = os.getenv("BLOB_CONTAINER_NAME", "csv-vec")
azure_openai_endpoint = os.environ["AZURE_OPENAI_ENDPOINT"]
# You do not need a key if you are using keyless authentication
# To learn more, please visit https://learn.microsoft.com/azure/search/search-howto-managed-identities-data-sources and https://learn.microsoft.com/azure/developer/ai/keyless-connections
azure_openai_key = os.getenv("AZURE_OPENAI_KEY")
azure_openai_embedding_deployment = os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT", "text-embedding-3-large")
azure_openai_model_name = os.getenv("AZURE_OPENAI_EMBEDDING_MODEL_NAME", "text-embedding-3-large")
azure_openai_model_dimensions = int(os.getenv("AZURE_OPENAI_EMBEDDING_DIMENSIONS", 1024))
# NOTE: The chat deployment should support JSON Schema
# To learn more, please see
# https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/structured-outputs#supported-models
azure_openai_chat_deployment = os.getenv("AZURE_OPENAI_CHATGPT_DEPLOYMENT", "gpt-4o")
azure_openai_api_version = os.getenv("AZURE_OPENAI_API_VERSION", "2024-10-21")
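
The cell above reads three settings with `os.environ[...]`, so a missing entry raises `KeyError` as soon as that line runs. The sketch below reports every missing setting in one pass. The helper name `find_missing_settings` is hypothetical, and the list of names is copied from the cell above.

```python
import os
from typing import Mapping, Sequence

# The three variables the cell above reads with os.environ[...] (no default value).
REQUIRED_SETTINGS = (
    "AZURE_SEARCH_SERVICE_ENDPOINT",
    "BLOB_CONNECTION_STRING",
    "AZURE_OPENAI_ENDPOINT",
)

def find_missing_settings(env: Mapping[str, str], required: Sequence[str] = REQUIRED_SETTINGS) -> list:
    """Return the required names that are absent or empty in `env`."""
    return [name for name in required if not env.get(name)]

missing = find_missing_settings(os.environ)
if missing:
    print("Missing settings in .env:", ", ".join(missing))
```

Running this right after `load_dotenv` gives a single message listing everything that still needs to be filled in.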


## Connect to Blob Storage and load documents

Upload the sample data to Blob Storage. This notebook uses the CSV files in the `csv_data` folder.

In [7]:
from azure.storage.blob import BlobServiceClient  
import glob

def upload_sample_documents(
        blob_connection_string: str,
        blob_container_name: str,
        use_user_identity: bool = True
    ):
    # Connect to Blob Storage
    blob_service_client = BlobServiceClient.from_connection_string(conn_str=blob_connection_string, credential=DefaultAzureCredential() if use_user_identity else None)
    container_client = blob_service_client.get_container_client(blob_container_name)
    if not container_client.exists():
        container_client.create_container()

    documents_directory = "csv_data"
    csv_files = glob.glob(os.path.join(documents_directory, '*.csv'))
    for file in csv_files:
        with open(file, "rb") as data:
            name = os.path.basename(file)
            if not container_client.get_blob_client(name).exists():
                container_client.upload_blob(name=name, data=data)

upload_sample_documents(
    blob_connection_string=blob_connection_string,
    blob_container_name=blob_container_name,
    # Set to false if you want to use credentials included in the blob connection string
    # Otherwise your identity will be used as credentials
    use_user_identity=True
)
print(f"Setup sample data in {blob_container_name}")

Setup sample data in demo-container


## Create a blob data source connector on Azure AI Search

In [8]:
from azure.search.documents.indexes import SearchIndexerClient
from azure.search.documents.indexes.models import (
    SearchIndexerDataContainer,
    SearchIndexerDataSourceConnection,
    SoftDeleteColumnDeletionDetectionPolicy
)

# Create a data source
# NOTE: To remove a record from the search index, add an "IsDeleted" column to the CSV and set it to "True" on that row. The next indexer run removes the record.
# To learn more please visit https://learn.microsoft.com/en-us/azure/search/search-howto-index-one-to-many-blobs
indexer_client = SearchIndexerClient(endpoint, credential)
container = SearchIndexerDataContainer(name=blob_container_name)
data_source_connection = SearchIndexerDataSourceConnection(
    name=f"{index_name}-blob",
    type="azureblob",
    connection_string=search_blob_connection_string,
    container=container,
    data_deletion_detection_policy=SoftDeleteColumnDeletionDetectionPolicy(soft_delete_column_name="IsDeleted", soft_delete_marker_value="True")
)
data_source = indexer_client.create_or_update_data_source_connection(data_source_connection)

print(f"Data source '{data_source.name}' created or updated")

Data source 'my-demo-index-blob' created or updated
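
The soft-delete policy above watches an `IsDeleted` column for the marker value `True`. As an illustration, the sketch below flags rows in CSV text using only the standard library. The helper name `mark_rows_deleted`, the `ID` key column default, and the sample rows are assumptions made for this example and are not taken from the sample data.

```python
import csv
import io

def mark_rows_deleted(csv_text: str, ids_to_delete: set, id_column: str = "ID",
                      flag_column: str = "IsDeleted") -> str:
    """Return CSV text with `flag_column` set to "True" for the selected IDs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = list(reader.fieldnames or [])
    if flag_column not in fieldnames:
        fieldnames.append(flag_column)
    output = io.StringIO()
    writer = csv.DictWriter(output, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if row[id_column] in ids_to_delete:
            # "True" matches soft_delete_marker_value in the data source above.
            row[flag_column] = "True"
        else:
            # Keep an existing flag value, otherwise fill in "False".
            row[flag_column] = row.get(flag_column) or "False"
        writer.writerow(row)
    return output.getvalue()

# Made-up rows, for illustration only.
sample = "ID,Name\n1,Example Person A\n2,Example Person B\n"
print(mark_rows_deleted(sample, {"2"}))
```

The upload helper earlier in this notebook skips blobs that already exist, so a modified CSV has to replace the existing blob before the next indexer run can see the flag.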


## Create a search index

Vector and nonvector content is stored in a search index.

In [9]:
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchField,
    SearchFieldDataType,
    VectorSearch,
    HnswAlgorithmConfiguration,
    VectorSearchProfile,
    AzureOpenAIVectorizer,
    AzureOpenAIVectorizerParameters,
    SemanticConfiguration,
    SemanticSearch,
    SemanticPrioritizedFields,
    SemanticField,
    SearchIndex
)

# Create a search index
# NOTE: You must adjust these fields based on your CSV Schema.
# There is no chunking of the description or title fields in this sample.
# There is a separate AzureSearch_DocumentKey for the key automatically generated by the indexer
# Learn more at https://learn.microsoft.com/en-us/azure/search/search-howto-index-csv-blobs
index_client = SearchIndexClient(endpoint=endpoint, credential=credential)  
fields = [  
    SearchField(name="AzureSearch_DocumentKey",  key=True, type=SearchFieldDataType.String),
    SearchField(name="ID", type=SearchFieldDataType.String, sortable=True, filterable=True, facetable=False),  
    SearchField(name="Name", type=SearchFieldDataType.String, filterable=True),  
    SearchField(name="Age", type=SearchFieldDataType.Int32, sortable=True, filterable=True, facetable=False),  
    SearchField(name="Title", type=SearchFieldDataType.String, sortable=False, filterable=False, facetable=False),
    SearchField(name="Description", type=SearchFieldDataType.String, sortable=False, filterable=False, facetable=False),
    SearchField(name="TitleVector", type=SearchFieldDataType.Collection(SearchFieldDataType.Single), vector_search_dimensions=azure_openai_model_dimensions, vector_search_profile_name="myHnswProfile"),
    SearchField(name="DescriptionVector", type=SearchFieldDataType.Collection(SearchFieldDataType.Single), vector_search_dimensions=azure_openai_model_dimensions, vector_search_profile_name="myHnswProfile"),
]  
  
# Configure the vector search configuration  
vector_search = VectorSearch(  
    algorithms=[  
        HnswAlgorithmConfiguration(name="myHnsw"),
    ],  
    profiles=[  
        VectorSearchProfile(  
            name="myHnswProfile",  
            algorithm_configuration_name="myHnsw",
            vectorizer_name="myOpenAI",
        )
    ],  
    vectorizers=[  
        AzureOpenAIVectorizer(  
            vectorizer_name="myOpenAI",  
            parameters=AzureOpenAIVectorizerParameters(  
                resource_url=azure_openai_endpoint,  
                deployment_name=azure_openai_embedding_deployment,
                model_name=azure_openai_model_name,
                api_key=azure_openai_key,
            ),
        ),  
    ],  
)  
  
semantic_config = SemanticConfiguration(  
    name="my-semantic-config",  
    prioritized_fields=SemanticPrioritizedFields(
        title_field=SemanticField(field_name="Title"),
        content_fields=[SemanticField(field_name="Description")]  
    ),  
)

# Create the semantic search with the configuration  
semantic_search = SemanticSearch(configurations=[semantic_config])  
  
# Create the search index
index = SearchIndex(name=index_name, fields=fields, vector_search=vector_search, semantic_search=semantic_search)  
result = index_client.create_or_update_index(index)  
print(f"{result.name} created")  


my-demo-index created
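
Because the index fields have to match the CSV schema, it can help to compare the CSV header against the field names before running the indexer. The following is a minimal sketch; `compare_header` is a hypothetical helper, and `INDEX_COLUMNS` lists the non-key, non-vector fields defined above.

```python
import csv
import io

# Non-key, non-vector field names from the index definition above.
INDEX_COLUMNS = ["ID", "Name", "Age", "Title", "Description"]

def compare_header(csv_text: str, index_columns=INDEX_COLUMNS):
    """Return (index columns missing from the CSV, CSV columns with no index field)."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    header = [name.strip() for name in header]
    missing = [name for name in index_columns if name not in header]
    extra = [name for name in header if name not in index_columns]
    return missing, extra

missing, extra = compare_header("ID,Name,Age,Title,Description\n")
print("Missing from CSV:", missing)
print("Not in index:", extra)
```

A column such as `IsDeleted` shows up in the second list. That is expected, because the data source uses it for deletion detection and the index does not store it.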


## Create a skillset

Skills drive integrated vectorization. [AzureOpenAIEmbedding](https://learn.microsoft.com/azure/search/cognitive-search-skill-azure-openai-embedding) handles calls to Azure OpenAI, using the connection information you provide in the environment variables.

In [10]:
from azure.search.documents.indexes.models import (
    InputFieldMappingEntry,
    OutputFieldMappingEntry,
    AzureOpenAIEmbeddingSkill,
    SearchIndexerSkillset
)

# Create a skillset  
skillset_name = f"{index_name}-skillset"
  
title_embedding_skill = AzureOpenAIEmbeddingSkill(  
    description="Skill to generate title embeddings via Azure OpenAI",  
    context="/document",  
    resource_url=azure_openai_endpoint,  
    deployment_name=azure_openai_embedding_deployment,  
    model_name=azure_openai_model_name,
    dimensions=azure_openai_model_dimensions,
    api_key=azure_openai_key,  
    inputs=[  
        InputFieldMappingEntry(name="text", source="/document/Title"),  
    ],  
    outputs=[  
        OutputFieldMappingEntry(name="embedding", target_name="TitleVector")  
    ],  
)

description_embedding_skill = AzureOpenAIEmbeddingSkill(  
    description="Skill to generate description embeddings via Azure OpenAI",  
    context="/document",  
    resource_url=azure_openai_endpoint,  
    deployment_name=azure_openai_embedding_deployment,  
    model_name=azure_openai_model_name,
    dimensions=azure_openai_model_dimensions,
    api_key=azure_openai_key,  
    inputs=[  
        InputFieldMappingEntry(name="text", source="/document/Description"),  
    ],  
    outputs=[  
        OutputFieldMappingEntry(name="embedding", target_name="DescriptionVector")  
    ],  
)  

skills = [title_embedding_skill, description_embedding_skill]

skillset = SearchIndexerSkillset(  
    name=skillset_name,  
    description="Skillset to generate embeddings",  
    skills=skills
)
  
client = SearchIndexerClient(endpoint, credential)  
client.create_or_update_skillset(skillset)  
print(f"{skillset.name} created")  


my-demo-index-skillset created
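
Each skill writes its output into the enrichment tree under its `context`, using the `target_name` of the output. An indexer's output field mappings then read from that path. The sketch below only illustrates how the path is composed; `enrichment_path` is a hypothetical helper, not an SDK function.

```python
def enrichment_path(context: str, target_name: str) -> str:
    """Join a skill's context and output target name into an enrichment tree path."""
    return f"{context.rstrip('/')}/{target_name}"

# The two skills above use context "/document" with these target names.
title_path = enrichment_path("/document", "TitleVector")
description_path = enrichment_path("/document", "DescriptionVector")
print(title_path)
print(description_path)
```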


## Create an indexer

In [11]:
from azure.search.documents.indexes.models import (
    SearchIndexer,
    FieldMapping,
    FieldMappingFunction,
    IndexingParameters,
    IndexingParametersConfiguration,
    BlobIndexerParsingMode
)

# Create an indexer  
indexer_name = f"{index_name}-indexer"  
indexer_parameters = IndexingParameters(
        configuration=IndexingParametersConfiguration(
            parsing_mode=BlobIndexerParsingMode.DELIMITED_TEXT,
            query_timeout=None,
            first_line_contains_headers=True))

indexer = SearchIndexer(  
    name=indexer_name,  
    description="Indexer to index documents and generate embeddings",  
    skillset_name=skillset_name,  
    target_index_name=index_name,  
    data_source_name=data_source.name,
    parameters=indexer_parameters,
    field_mappings=[FieldMapping(source_field_name="AzureSearch_DocumentKey", target_field_name="AzureSearch_DocumentKey", mapping_function=FieldMappingFunction(name="base64Encode"))],
    output_field_mappings=[
        FieldMapping(source_field_name="/document/TitleVector", target_field_name="TitleVector"),
        FieldMapping(source_field_name="/document/DescriptionVector", target_field_name="DescriptionVector")
    ]
)  

indexer_client = SearchIndexerClient(endpoint, credential)  
indexer_result = indexer_client.create_or_update_indexer(indexer)  
  
# Run the indexer  
indexer_client.run_indexer(indexer_name)  
print(f'{indexer_name} is created and running. If queries return no results, please wait a bit and try again.')  


my-demo-index-indexer is created and running. If queries return no results, please wait a bit and try again.
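
The indexer runs asynchronously, which is why queries can come back empty right after this cell. One option is to poll until the run finishes instead of waiting a fixed time. The sketch below is a generic polling loop, not part of the sample: `wait_for_status` and its parameters are hypothetical, and the status strings in the usage note are assumptions to verify against your installed SDK version.

```python
import time
from typing import Callable

def wait_for_status(get_status: Callable[[], str], done: set, timeout_s: float = 300.0,
                    interval_s: float = 5.0, sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll `get_status` until it returns a value in `done` or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    status = get_status()
    while status not in done and time.monotonic() < deadline:
        sleep(interval_s)
        status = get_status()
    # The caller decides how to treat a status that is still not in `done`.
    return status
```

With the clients created above, `get_status` could be a small function that calls `indexer_client.get_indexer_status(indexer_name)` and returns the status of its last result, with `done` set to values such as `"success"` and `"transientFailure"`. The last result may be empty until the first run has been recorded, so that case needs its own handling.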


## Perform a hybrid search

This example runs a hybrid search (keyword plus vector) using a vectorizable text query. You pass in text, and the vectorizer defined on the index handles query vectorization.
Ask a zoo employment question that can be answered using only the title and description fields.

In [22]:
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizableTextQuery

# Hybrid search: the same text is used for the keyword query and the vector query
query = "Cleans fish tanks"
  
search_client = SearchClient(endpoint, index_name, credential=credential)
vector_query = VectorizableTextQuery(text=query, k_nearest_neighbors=50, fields="TitleVector,DescriptionVector")
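# To pass a precomputed embedding instead of relying on the index vectorizer, import VectorizedQuery
# from azure.search.documents.models and supply your own embedding function (generate_embeddings is not defined in this notebook):
# vector_query = VectorizedQuery(vector=generate_embeddings(query), k_nearest_neighbors=50, fields="TitleVector,DescriptionVector")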
  
results = search_client.search(  
    search_text=query,  
    vector_queries= [vector_query],
    select=["ID", "Name", "Title", "Description"],
    top=3
)  
  
for result in results:
    print(f"Score: {result['@search.score']}")  
    print(f"ID: {result['ID']}")  
    print(f"Name: {result['Name']}")  
    print(f"Title: {result['Title']}")
    print(f"Description: {result['Description']}")   


Score: 0.03333333507180214
ID: 7
Name: Mary Wilson
Title: Aquarist
Description: Maintains aquatic exhibits
Score: 0.032522473484277725
ID: 10
Name: James Anderson
Title: Groundskeeper
Description: Maintains zoo grounds
Score: 0.03201844170689583
ID: 16
Name: Mason Thompson
Title: Maintenance Worker
Description: Handles maintenance and repairs
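
The scores above are small fractions rather than similarity values because Azure AI Search merges the keyword and vector rankings of a hybrid query with Reciprocal Rank Fusion (RRF). The sketch below shows the general technique only. The constant `k = 60`, the 1-based ranks, and the example rankings are assumptions for illustration, so it is not expected to reproduce the exact scores returned by the service.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k: int = 60):
    """Fuse ranked ID lists: each appearance adds 1 / (k + rank), with rank starting at 1."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

keyword_ranking = ["7", "10", "16"]  # hypothetical keyword ranking of document IDs
vector_ranking = ["7", "16", "3"]    # hypothetical vector ranking of document IDs
fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
for doc_id, score in fused:
    print(doc_id, round(score, 6))
```

A document that ranks well in both lists accumulates two contributions, which is why it rises to the top of the fused list.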


## Answer questions that require data analysis

Some questions require a deeper understanding of the data schema. For example, the question "Which employees are older than 40?" requires [filtering](https://learn.microsoft.com/en-us/azure/search/search-filters), and "Who is the youngest employee?" requires [sorting](https://learn.microsoft.com/en-us/azure/search/search-pagination-page-layout). Use your [chat deployment](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/completions) to generate the Azure AI Search query options needed to answer the question.

In [None]:
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizableTextQuery
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from pydantic import BaseModel, Field
import pandas as pd
import json
from typing import Optional

openai_credential = DefaultAzureCredential()
token_provider = get_bearer_token_provider(openai_credential, "https://cognitiveservices.azure.com/.default")

client = AzureOpenAI(
    api_version=azure_openai_api_version,
    azure_endpoint=azure_openai_endpoint,
    api_key=azure_openai_key,
    azure_ad_token_provider=token_provider if not azure_openai_key else None
)

# See https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/structured-outputs for more information
# NOTE: Updating the tool definition with specific examples related to your data will help improve the accuracy.
class QueryOptions(BaseModel):
    """
    Given a question, get any additional Azure Search query parameters required to answer the question. If no additional query parameters are required to answer the question, don't return any.
    """
    orderBy: Optional[str] = Field(description="Specify a custom sort order for search results. Format is a comma-separated list of up to 32 order-by clauses. If a direction is not specified, the default is ascending. Example: ID, Age desc, Title asc")
    filter: Optional[str] = Field(description="Specify inclusion or exclusion criteria for search results. Format is an Azure Search OData boolean expression. Example: Age le 4 or not (Age gt 8). Do not use filters for text expressions, only numeric ones")
    search: Optional[str] = Field(description="Specify a query string used to search text and vectors in an Azure Search index in order to answer the provided question. If no query string is required to answer the question, return * or no query string at all")

# Specifically instruct the model to only use filterable fields when creating query options
filterable_fields = ", ".join([field.name for field in fields if field.filterable])
query_options_system_prompt = f"Create options for Azure Search queries. If you are creating filters, you may only use the following fields: {filterable_fields}. Only generate filters when you are trying to answer a question involving numbers."
def get_query_options(query: str) -> QueryOptions:
    response = client.beta.chat.completions.parse(
        model=azure_openai_chat_deployment,
        messages=[
            {"role": "system", "content": query_options_system_prompt},
            {"role": "user", "content": query}
        ],
        response_format=QueryOptions
    )
    return response.choices[0].message.parsed


answer_query_results_system_prompt = "The following question requires search results to provide an answer. Use the provided search results to answer the question. If you can't answer the question using the search results, say I don't know."
search_client = SearchClient(endpoint, index_name, credential=credential)
def answer_query(query: str) -> tuple[str, QueryOptions, str]:
    # Parse the query options returned by the model
    query_options = get_query_options(query)
    query_option_search = query_options.search
    vector_queries = None
    if query_option_search and query_option_search != "*":
        vector_queries = [VectorizableTextQuery(text=query_option_search, k_nearest_neighbors=50, fields="TitleVector,DescriptionVector")]

    query_option_order_by = query_options.orderBy
    order_by = None
    query_type = None
    semantic_configuration_name = None
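    if query_option_order_by:
        # Split the comma-separated clauses and trim the whitespace around each one
        order_by = [clause.strip() for clause in query_option_order_by.split(",") if clause.strip()] or None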

    # This sample only uses specific fields to answer questions. Update these fields for your own data
    columns = ["ID", "Age", "Name", "Title", "Description"]
    search_results = search_client.search(
        search_text=query_option_search,
        vector_queries=vector_queries,
        top=5,
        order_by=order_by,
        query_type=query_type,
        semantic_configuration_name=semantic_configuration_name,
        filter=query_options.filter,
        select=columns
    )

    # Convert the search results to markdown for use by the model
    results = [ { column: result[column] for column in columns } for result in search_results ]
    results_markdown_table = pd.DataFrame(results).to_markdown(index=False)

    response = client.chat.completions.create(
        model=azure_openai_chat_deployment,
        messages=[
            {"role": "system", "content": answer_query_results_system_prompt},
            {"role": "user", "content": results_markdown_table },
            {"role": "user", "content": query}
        ]
    )
    # Return the generated answer, query options, and results table for analysis
    return response.choices[0].message.content, query_options, results_markdown_table

def print_answer(answer, query_options, results):
    print("Generated Answer:", answer)
    print("Generated Query Options:", query_options)
    print("Search Results")
    print(results)
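
The model-generated `orderBy` string is passed to the search call as-is, so a clause on a field that is not sortable can make the request fail. The sketch below validates the clauses first. `parse_order_by` is a hypothetical helper, and `SORTABLE_FIELDS` lists only the fields explicitly marked `sortable=True` in this sample's index, which is an assumption to adjust for your own schema.

```python
from typing import Optional

# Fields explicitly declared sortable in the index definition of this sample.
SORTABLE_FIELDS = {"ID", "Age"}

def parse_order_by(order_by: Optional[str], sortable_fields=SORTABLE_FIELDS) -> Optional[list]:
    """Split an orderBy string into trimmed clauses, keeping only sortable fields."""
    if not order_by:
        return None
    clauses = []
    for raw in order_by.split(","):
        parts = raw.split()
        if not parts:
            continue  # skip empty segments such as a trailing comma
        field = parts[0]
        direction = parts[1].lower() if len(parts) > 1 else "asc"
        if field in sortable_fields and direction in ("asc", "desc"):
            clauses.append(f"{field} {direction}")
    return clauses or None

print(parse_order_by("ID, Age desc, Title asc"))
```

Clauses that fail validation are dropped silently here. Raising an error or logging them instead is an equally reasonable choice, depending on how strict you want the pipeline to be.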



## Answer sample questions

These questions may require filtering and sorting in addition to regular search.

In [None]:
answer, query_options, results = answer_query("Who is the youngest employee?")
print_answer(answer, query_options, results)

In [10]:
answer, query_options, results = answer_query("Who provides marketing updates on the zoo?")
print_answer(answer, query_options, results)

Generated Answer: The person who provides marketing updates on the zoo is William Harris, the Marketing Coordinator, who promotes zoo events and activities.
Generated Query Options: orderBy=None filter=None search='marketing updates zoo'
Search Results
|   ID |   Age | Name           | Title                 | Description                          |
|-----:|------:|:---------------|:----------------------|:-------------------------------------|
|   14 |    32 | William Harris | Marketing Coordinator | Promotes zoo events and activities   |
|   11 |    26 | Olivia Thomas  | Facilities Manager    | Manages zoo facilities               |
|    4 |    23 | Robert Brown   | Tour Guide            | Guides visitors through the zoo      |
|   10 |    43 | James Anderson | Groundskeeper         | Maintains zoo grounds                |
|    1 |    64 | John Doe       | Zookeeper             | Cares for animals and their habitats |


In [11]:
answer, query_options, results = answer_query("Of the employees who are older than 40, who is the youngest?")
print_answer(answer, query_options, results)

Generated Answer: David Moore is the youngest employee older than 40, with an age of 41.
Generated Query Options: orderBy='Age asc' filter='Age gt 40' search='*'
Search Results
|   ID |   Age | Name               | Title                   | Description                              |
|-----:|------:|:-------------------|:------------------------|:-----------------------------------------|
|    8 |    41 | David Moore        | Curator                 | Oversees animal exhibits and collections |
|   18 |    43 | Ethan Martinez     | Fundraising Coordinator | Organizes fundraising events             |
|   10 |    43 | James Anderson     | Groundskeeper           | Maintains zoo grounds                    |
|   12 |    44 | Daniel Jackson     | Guest Services          | Assists visitors and handles inquiries   |
|   19 |    44 | Charlotte Robinson | Event Planner           | Plans and coordinates events             |


In [12]:
answer, query_options, results = answer_query("Who are the employees whose first name is Alice?")
print_answer(answer, query_options, results)

Generated Answer: The employee with the first name Alice is Alice Johnson.
Generated Query Options: orderBy=None filter=None search='Alice'
Search Results
|   ID |   Age | Name               | Title                    | Description                          |
|-----:|------:|:-------------------|:-------------------------|:-------------------------------------|
|    3 |    23 | Alice Johnson      | Animal Trainer           | Trains animals for performances      |
|    4 |    23 | Robert Brown       | Tour Guide               | Guides visitors through the zoo      |
|   19 |    44 | Charlotte Robinson | Event Planner            | Plans and coordinates events         |
|    1 |    64 | John Doe           | Zookeeper                | Cares for animals and their habitats |
|   17 |    59 | Mia Garcia         | Administrative Assistant | Supports administrative tasks        |


In [13]:
answer, query_options, results = answer_query("Is there an employee named Scarlett?")
print_answer(answer, query_options, results)

Generated Answer: I don't know. The provided data does not include an employee named Scarlett.
Generated Query Options: orderBy=None filter=None search='Scarlett'
Search Results
|   ID |   Age | Name               | Title                 | Description                          |
|-----:|------:|:-------------------|:----------------------|:-------------------------------------|
|   14 |    32 | William Harris     | Marketing Coordinator | Promotes zoo events and activities   |
|    3 |    23 | Alice Johnson      | Animal Trainer        | Trains animals for performances      |
|    1 |    64 | John Doe           | Zookeeper             | Cares for animals and their habitats |
|   19 |    44 | Charlotte Robinson | Event Planner         | Plans and coordinates events         |
|   15 |    21 | Isabella Martin    | Researcher            | Conducts research on wildlife        |
