sdk/python/generative-ai/rag/notebooks/pinecone/pinecone_mlindex_with_langchain.ipynb

{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install azure-ai-ml\n", "\n", "# To use the latest version of langchain, please uncomment the line right below\n", "# and remove the `langchain` extra. Otherwise, the `langchain` extra will be used instead.\n", "# %pip install langchain\n", "%pip install -U 'azureml-rag[pinecone,langchain]>=0.2.11'\n", "\n", "# If using hugging_face embeddings add `hugging_face` extra, e.g. `azureml-rag[pinecone,hugging_face]`" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Create a Pinecone-based Vector Index for Document Retrieval with AzureML\n", "\n", "We'll walk through setting up an AzureML Pipeline that uploads the `retrieval-augmented-generation` directory from this repository, processes the data into chunks, embeds the chunks, and creates a LangChain-compatible Pinecone MLIndex. Furthermore, we will demonstrate how to source data from a public git repository instead and setup a Pipeline Schedule for continuous indexing.\n", "\n", "Note: Support for [namespaces](https://docs.pinecone.io/docs/namespaces) is currently not available. We will be actively working on it and are excited to bring this feature to you!" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Get client for AzureML Workspace\n", "\n", "The workspace is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. In this section we will connect to the workspace in which the job will be run.\n", "\n", "If you don't have a Workspace and want to create and Index locally see [here to create one](https://learn.microsoft.com/azure/machine-learning/quickstart-create-resources?view=azureml-api-2).\n", "\n", "Enter your Workspace details below, running this still will write a `workspace.json` file to the current folder." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%writefile workspace.json\n", "{\n", " \"subscription_id\": \"<subscription_id>\",\n", " \"resource_group\": \"<resource_group_name>\",\n", " \"workspace_name\": \"<workspace_name>\"\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`MLClient` is how you interact with AzureML" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential\n", "from azure.ai.ml import MLClient\n", "from azureml.core import Workspace\n", "\n", "try:\n", " credential = DefaultAzureCredential()\n", " # Check if given credential can get token successfully.\n", " credential.get_token(\"https://management.azure.com/.default\")\n", "except Exception as ex:\n", " # Fall back to InteractiveBrowserCredential in case DefaultAzureCredential not work\n", " credential = InteractiveBrowserCredential()\n", "\n", "try:\n", " ml_client = MLClient.from_config(credential=credential, path=\"workspace.json\")\n", "except Exception as ex:\n", " raise Exception(\n", " \"Failed to create MLClient from config file. 
Please modify and then run the above cell with your AzureML Workspace details.\"\n", " ) from ex\n", " # ml_client = MLClient(\n", " # credential=credential,\n", " # subscription_id=\"\",\n", " # resource_group_name=\"\",\n", " # workspace_name=\"\"\n", " # )\n", "\n", "ws = Workspace(\n", " subscription_id=ml_client.subscription_id,\n", " resource_group=ml_client.resource_group_name,\n", " workspace_name=ml_client.workspace_name,\n", ")\n", "\n", "print(ml_client)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Which Embeddings Model to use?\n", "\n", "There are currently two supported Embedding options: OpenAI's `text-embedding-ada-002` embedding model or HuggingFace embedding models. Here are some factors that might influence your decision:\n", "\n", "### OpenAI\n", "\n", "OpenAI has [great documentation](https://platform.openai.com/docs/guides/embeddings) on their Embeddings model `text-embedding-ada-002`, it can handle up to 8191 tokens and can be accessed using [Azure OpenAI](https://learn.microsoft.com/azure/cognitive-services/openai/concepts/models#embeddings-models) or OpenAI directly.\n", "If you have an existing Azure OpenAI Instance you can connect it to AzureML, if you don't AzureML provisions a default one for you called `Default_AzureOpenAI`.\n", "The main limitation when using `text-embedding-ada-002` is cost/quota available for the model. Otherwise it provides high quality embeddings across a wide array of text domains while being simple to use.\n", "\n", "### HuggingFace\n", "\n", "HuggingFace hosts many different models capable of embedding text into single-dimensional vectors. The [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) ranks the performance of embeddings models on a few axis, not all models ranked can be run locally (e.g. `text-embedding-ada-002` is on the list), though many can and there is a range of larger and smaller models. When embedding with HuggingFace the model is loaded locally for inference, this will potentially impact your choice of compute resources.\n", "\n", "**NOTE:** The default PromptFlow Runtime does not come with HuggingFace model dependencies installed, Indexes created using HuggingFace embeddings will not work in PromptFlow by default. **Pick OpenAI if you want to use PromptFlow**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Run the cells under _either_ heading (OpenAI or HuggingFace) to use the respective embedding model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### OpenAI\n", "\n", "We can use the automatically created `Default_AzureOpenAI` connection.\n", "\n", "If you would rather use an existing Azure OpenAI connection then change `aoai_connection_name` below.\n", "If you would rather use an existing Azure OpenAI resource, but don't have a connection created, modify `aoai_connection_name` and the details under the `# Create New Connection` code comment, or navigate to the PromptFlow section in your AzureML Workspace and use the Connections create UI flow." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "aoai_connection_name = \"Default_AzureOpenAI\"\n", "aoai_connection_id = None" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we get the connection from the Workspace and save its `id` so we can reference it later." 
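] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For orientation, the connection `id` is an ARM-style resource id; a rough sketch of the shape to expect (placeholder values only, not a real id):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative only: the connection id retrieved below is an ARM-style resource id\n", "# of roughly this shape (all segments are placeholders).\n", "example_connection_id = (\n", "    \"/subscriptions/<subscription_id>/resourceGroups/<resource_group>\"\n", "    \"/providers/Microsoft.MachineLearningServices/workspaces/<workspace_name>\"\n", "    \"/connections/Default_AzureOpenAI\"\n", ")"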
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azureml.rag.utils.connections import (\n", "    get_connection_by_name_v2,\n", "    create_connection_v2,\n", ")\n", "\n", "try:\n", "    aoai_connection = get_connection_by_name_v2(ws, aoai_connection_name)\n", "except Exception as ex:\n", "    # Create New Connection\n", "    # Modify the details below to match the `Endpoint` and API key of your AOAI resource; these details can be found in Azure Portal.\n", "    raise RuntimeError(\n", "        \"Have you entered your AOAI resource details below? If so, delete me!\"\n", "    )\n", "    aoai_connection = create_connection_v2(\n", "        workspace=ws,\n", "        name=aoai_connection_name,\n", "        category=\"AzureOpenAI\",\n", "        # 'Endpoint' from Azure OpenAI resource overview\n", "        target=\"https://<endpoint_name>.openai.azure.com/\",\n", "        auth_type=\"ApiKey\",\n", "        credentials={\n", "            # Either `KEY 1` or `KEY 2` from the `Keys and Endpoint` tab of your Azure OpenAI resource; it will be stored in your Workspace's associated Azure Key Vault.\n", "            \"key\": \"<api-key>\"\n", "        },\n", "        metadata={\"ApiType\": \"azure\", \"ApiVersion\": \"2023-05-15\"},\n", "    )\n", "\n", "aoai_connection_id = aoai_connection[\"id\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that your Workspace has a connection to Azure OpenAI we will make sure the `text-embedding-ada-002` model has been deployed and is ready for inference. This cell will fail if there is no deployment for the embeddings model; [follow these instructions](https://learn.microsoft.com/azure/cognitive-services/openai/how-to/create-resource?pivots=web-portal#deploy-a-model) to deploy a model with Azure OpenAI." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azureml.rag.utils.deployment import infer_deployment\n", "\n", "aoai_embedding_model_name = \"text-embedding-ada-002\"\n", "try:\n", "    aoai_embedding_deployment_name = infer_deployment(\n", "        aoai_connection, aoai_embedding_model_name\n", "    )\n", "    print(\n", "        f\"Deployment name in AOAI workspace for model '{aoai_embedding_model_name}' is '{aoai_embedding_deployment_name}'\"\n", "    )\n", "except Exception as e:\n", "    print(\n", "        f\"Deployment name in AOAI workspace for model '{aoai_embedding_model_name}' is not found.\"\n", "    )\n", "    print(\n", "        f\"Please create a deployment for this model by following the deploy instructions on the resource page for '{aoai_connection['properties']['target']}' in Azure Portal.\"\n", "    )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally we will combine the deployment and model information into a URI form which the AzureML embeddings components expect as input." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "embeddings_model_uri = f\"azure_open_ai://deployment/{aoai_embedding_deployment_name}/model/{aoai_embedding_model_name}\"" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### HuggingFace\n", "\n", "AzureML's default model from HuggingFace is `all-mpnet-base-v2`, which can be run on most laptops. Any `sentence-transformers` model should be supported; you can learn more about `sentence-transformers` [here](https://huggingface.co/sentence-transformers)."
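] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As an optional local sanity check before running the pipeline, the sketch below (assuming `sentence-transformers` is installed in your environment) embeds a sample sentence with the default model:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional sanity check: embed one sentence locally with the default model.\n", "# Assumes `sentence-transformers` is installed (e.g. %pip install sentence-transformers).\n", "from sentence_transformers import SentenceTransformer\n", "\n", "st_model = SentenceTransformer(\"sentence-transformers/all-mpnet-base-v2\")\n", "vector = st_model.encode(\"AzureML can build RAG indexes over your documents.\")\n", "# all-mpnet-base-v2 produces 768-dimensional embeddings.\n", "print(f\"Embedding dimension: {len(vector)}\")"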
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "embeddings_model_uri = \"hugging_face://model/sentence-transformers/all-mpnet-base-v2\"" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Prepare Pinecone Index\n", "\n", "If you would rather use an existing Pinecone index from a Pinecone project connection, change `pinecone_connection_name` below.\n", "If you would rather use an existing Pinecone index, but don't have a connection to the Pinecone project created, modify `pinecone_connection_name` and the details under the `# Create New Connection` code comment.\n", "\n", "Note: Only Custom Connections are supported right now for connecting to a Pinecone project (where your index lives).\n", "\n", "If creating the connection through the Create Connections UI flow, please specify the Pinecone project\n", "- API key (must use `api_key` as the key for the key-value pair)\n", "- ID (must use `project_id` as the key for the key-value pair)\n", "- environment (must use `environment` as the key for the key-value pair)\n", "\n", "as key-value pairs in the Custom Connection, where the API key pair must be marked as a secret. For example:\n", "\n", "![Create Pinecone Connection UI](../../media/pinecone_connection_ui.png)\n", "\n", "\n", "You can also modify the details under the `# Create New Connection` code comment to create a connection via the SDK!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pinecone_connection_name = \"my_pinecone_project_connection\"\n", "pinecone_connection_id = None" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azureml.rag.utils.connections import (\n", "    get_connection_by_name_v2,\n", "    create_connection_v2,\n", ")\n", "\n", "try:\n", "    pinecone_connection = get_connection_by_name_v2(ws, pinecone_connection_name)\n", "except Exception as ex:\n", "    # Create New Connection\n", "    # Modify the details below to match the details of the Pinecone project where your index lives, more details here: https://docs.pinecone.io/docs/projects\n", "    raise RuntimeError(\n", "        \"Have you entered your Pinecone project details below? If so, delete me!\"\n", "    )\n", "    pinecone_connection = create_connection_v2(\n", "        workspace=ws,\n", "        name=pinecone_connection_name,\n", "        category=\"CustomKeys\",\n", "        target=\"_\",\n", "        auth_type=\"CustomKeys\",\n", "        credentials={\n", "            \"keys\": {\n", "                # https://docs.pinecone.io/docs/projects#api-keys\n", "                \"api_key\": \"<api-key>\"\n", "            }\n", "        },\n", "        metadata={\n", "            # https://docs.pinecone.io/docs/projects#project-environment\n", "            \"environment\": \"<environment>\",\n", "            # https://docs.pinecone.io/docs/projects#project-id\n", "            \"project_id\": \"<project_id>\",\n", "        },\n", "    )\n", "\n", "pinecone_connection_id = pinecone_connection[\"id\"]" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Setup Pipeline to process data into Index\n", "\n", "AzureML [Pipelines](https://learn.microsoft.com/azure/machine-learning/concept-ml-pipelines?view=azureml-api-2) connect together multiple [Components](https://learn.microsoft.com/azure/machine-learning/concept-component?view=azureml-api-2). Each Component defines inputs, the code that consumes those inputs, and the outputs that code produces. Pipelines themselves can have inputs and outputs, produced by connecting together individual sub-Components.\n", "To process your data for embedding and indexing we will chain together multiple components, each performing its own step of the workflow.\n", "\n", "The Components are published to a [Registry](https://learn.microsoft.com/azure/machine-learning/how-to-manage-registries?view=azureml-api-2&tabs=cli), `azureml`, which you should have access to by default; it can be accessed from any Workspace.\n", "In the below cell we get the Component Definitions from the `azureml` registry." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ml_registry = MLClient(credential=credential, registry_name=\"azureml\")\n", "\n", "# Clones a git repository to the output folder of the pipeline, by default this will be on the default Workspace Datastore `workspaceblobstore`.\n", "git_clone_component = ml_registry.components.get(\"llm_rag_git_clone\", label=\"latest\")\n", "# Walks the input folder according to the provided glob pattern (all files by default: '**/*') and attempts to open them, extract text chunks, and further chunk if necessary to fit within the provided `chunk_size`.\n", "crack_and_chunk_component = ml_registry.components.get(\n", "    \"llm_rag_crack_and_chunk\", label=\"latest\"\n", ")\n", "# Reads the input folder of files containing chunks and their metadata as batches, in parallel, and generates embeddings for each chunk. The output format is produced and loaded by `azureml.rag.embeddings.EmbeddingsContainer`.\n", "generate_embeddings_component = ml_registry.components.get(\n", "    \"llm_rag_generate_embeddings\", label=\"latest\"\n", ")\n", "# Reads an input folder produced by `azureml.rag.embeddings.EmbeddingsContainer.save()` and pushes all embeddings (including metadata) into a Pinecone index. Writes an MLIndex yaml detailing the index and embeddings model information.\n", "update_pinecone_index_component = ml_registry.components.get(\n", "    \"llm_rag_update_pinecone_index\", label=\"latest\"\n", ")\n", "# Takes a uri to a storage location where an MLIndex yaml is stored and registers it as an MLIndex Data asset in the AzureML Workspace.\n", "register_mlindex_component = ml_registry.components.get(\n", "    \"llm_rag_register_mlindex_asset\", label=\"latest\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each Component has documentation which provides an overall description of the Component's purpose and each of its inputs/outputs.\n", "For example, we can understand what `crack_and_chunk` does by inspecting the Component definition." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(crack_and_chunk_component)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below, a Pipeline is built by defining a Python function which chains together the above Components' inputs and outputs. Arguments to the function are inputs to the Pipeline itself and the return value is a dictionary defining the outputs of the Pipeline."
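] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before reading the full definition, it can also help to enumerate a Component's declared interface programmatically. A hedged sketch (assuming the `azure-ai-ml` Component entity exposes `inputs` and `outputs` mappings, which recent versions do):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Hedged sketch: enumerate the declared inputs/outputs of one Component entity.\n", "# Assumes `inputs` and `outputs` attributes on the azure-ai-ml Component entity.\n", "for name, spec in crack_and_chunk_component.inputs.items():\n", "    print(f\"input:  {name} -> {spec}\")\n", "for name, spec in crack_and_chunk_component.outputs.items():\n", "    print(f\"output: {name} -> {spec}\")"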
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azure.ai.ml import Input, Output\n", "from azure.ai.ml.dsl import pipeline\n", "from azure.ai.ml.entities._job.pipeline._io import PipelineInput\n", "from typing import Optional\n", "\n", "\n", "def use_automatic_compute(component, instance_count=1, instance_type=\"Standard_E8s_v3\"):\n", "    \"\"\"Configure input `component` to use automatic compute with `instance_count` and `instance_type`.\n", "\n", "    This avoids the need to provision a compute cluster to run the component.\n", "    \"\"\"\n", "    component.set_resources(\n", "        instance_count=instance_count,\n", "        instance_type=instance_type,\n", "        properties={\"compute_specification\": {\"automatic\": True}},\n", "    )\n", "    return component\n", "\n", "\n", "def optional_pipeline_input_provided(input: Optional[PipelineInput]):\n", "    \"\"\"Checks if an optional pipeline input is provided.\"\"\"\n", "    return input is not None and input._data is not None\n", "\n", "\n", "# If you have an existing compute cluster you want to use instead of automatic compute, uncomment the `@pipeline(compute=...)` line below and replace `dedicated_cpu_compute` with the name of your cluster.\n", "# Also comment out the `component.set_resources` line in `use_automatic_compute` above and the `default_compute='serverless'` line below.\n", "\n", "\n", "# @pipeline(compute=dedicated_cpu_compute)\n", "@pipeline(default_compute=\"serverless\")\n", "def uri_into_pinecone(\n", "    input_data: Input,\n", "    embeddings_model: str,\n", "    pinecone_config: str,\n", "    pinecone_connection_id: str,\n", "    asset_name: str,\n", "    chunk_size: int = 1024,\n", "    data_source_glob: str = None,\n", "    data_source_url: str = None,\n", "    document_path_replacement_regex: str = None,\n", "    aoai_connection_id: str = None,\n", "    embeddings_container: Input = None,\n", "):\n", "    \"\"\"Pipeline to generate embeddings for an `input_data` source and push them into a Pinecone index.\"\"\"\n", "\n", "    crack_and_chunk = crack_and_chunk_component(\n", "        input_data=input_data,\n", "        input_glob=data_source_glob,\n", "        chunk_size=chunk_size,\n", "        data_source_url=data_source_url,\n", "        document_path_replacement_regex=document_path_replacement_regex,\n", "    )\n", "    use_automatic_compute(crack_and_chunk)\n", "\n", "    generate_embeddings = generate_embeddings_component(\n", "        chunks_source=crack_and_chunk.outputs.output_chunks,\n", "        embeddings_container=embeddings_container,\n", "        embeddings_model=embeddings_model,\n", "    )\n", "    use_automatic_compute(generate_embeddings)\n", "    if optional_pipeline_input_provided(aoai_connection_id):\n", "        generate_embeddings.environment_variables[\n", "            \"AZUREML_WORKSPACE_CONNECTION_ID_AOAI\"\n", "        ] = aoai_connection_id\n", "    if optional_pipeline_input_provided(embeddings_container):\n", "        # If provided, `embeddings_container` is expected to be a URI to a folder; the folder can be empty.\n", "        # Each sub-folder is generated by a `generate_embeddings_component` run and can be reused for subsequent embeddings runs.\n", "        generate_embeddings.outputs.embeddings = Output(\n", "            type=\"uri_folder\", path=f\"{embeddings_container.path}/{{name}}\"\n", "        )\n", "\n", "    # `update_pinecone_index` takes the Embedded data produced by `generate_embeddings` and pushes it into a Pinecone index.\n", "    update_pinecone_index = update_pinecone_index_component(\n", "        embeddings=generate_embeddings.outputs.embeddings,\n", "        pinecone_config=pinecone_config,\n", "    )\n", "    use_automatic_compute(update_pinecone_index)\n", "    if optional_pipeline_input_provided(pinecone_connection_id):\n", "        update_pinecone_index.environment_variables[\n", "            \"AZUREML_WORKSPACE_CONNECTION_ID_PINECONE\"\n", "        ] = pinecone_connection_id\n", "\n", "    register_mlindex = register_mlindex_component(\n", "        storage_uri=update_pinecone_index.outputs.index,\n", "        asset_name=asset_name,\n", "    )\n", "    use_automatic_compute(register_mlindex)\n", "    return {\n", "        \"mlindex_asset_uri\": update_pinecone_index.outputs.index,\n", "        \"mlindex_asset_id\": register_mlindex.outputs.asset_id,\n", "    }" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can learn about the URIs AzureML will accept as data inputs [here](https://learn.microsoft.com/azure/machine-learning/how-to-read-write-data-v2?view=azureml-api-2&tabs=python#paths). Referencing a path on AzureML supported storages (Blob, ADLSgen2, ADLSgen1, Fileshare) works best using [Datastores](https://learn.microsoft.com/azure/machine-learning/how-to-datastore?view=azureml-api-2&tabs=cli-identity-based-access%2Ccli-adls-identity-based-access%2Ccli-azfiles-account-key%2Ccli-adlsgen1-identity-based-access) as they help manage credentials for access.\n", "\n", "You can also reference a local path on your machine and AzureML will upload it to your Workspace default Datastore, usually `workspaceblobstore`, which is what the below cell does. If you have existing Data in a location supported by AzureML (as detailed in the above linked documentation) you can replace the local path below with your own URI. You will also want to update the `data_source_url`; if you're not sure what to put there it can be left blank to start with, and the relative path of each file in the data source will be set as the source url in its metadata." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# This will upload the `azureml-examples/sdk/python/generative-ai` folder.\n", "input_uri = \"../../../\"\n", "# This url is used as the base_url to combine with paths to processed files; the combined url will be added to each embedded document's metadata.\n", "data_source_url = (\n", "    \"https://github.com/Azure/azureml-examples/blob/main/sdk/python/generative-ai/\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `update_pinecone_index` component takes a `pinecone_config` argument which specifies the name of the Index to push chunked and embedded data to. If this index does not exist it will be created; if it does exist it will be reused.\n", "\n", "**Note: if you are using Pinecone's Starter Plan, you are allowed to have only 1 index. In that case, make sure you have deleted any existing index you are not looking to reuse.** More details on the Starter Plan here: https://docs.pinecone.io/docs/starter-environment." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pinecone_config = {\"index_name\": \"pinecone-notebook-local-files-index\"}\n", "\n", "print(update_pinecone_index_component.description)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can create the Pipeline Job by calling the `@pipeline` annotated function and providing input arguments.\n", "`asset_name` will be used when registering the MLIndex Data Asset produced by the `register_mlindex` component in the pipeline. This is how you can refer to the MLIndex within AzureML."
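] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before creating the job, you can optionally confirm which indexes already exist in your Pinecone project (see the Starter Plan note above). A hedged sketch using the v2-style `pinecone-client` package (the API surface differs across client versions; adjust for yours):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional: list the indexes already present in your Pinecone project.\n", "# Hedged sketch assuming the v2-style `pinecone-client` API (`init`/`list_indexes`);\n", "# newer client versions use a `Pinecone` class instead.\n", "import pinecone\n", "\n", "pinecone.init(api_key=\"<api-key>\", environment=\"<environment>\")\n", "print(pinecone.list_indexes())"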
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "asset_name = \"azureml_rag_aoai_pinecone_mlindex\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azure.ai.ml import Input\n", "import json\n", "\n", "pipeline_job = uri_into_pinecone(\n", "    input_data=Input(type=\"uri_folder\", path=input_uri),\n", "    data_source_url=data_source_url,\n", "    pinecone_config=json.dumps(pinecone_config),\n", "    pinecone_connection_id=pinecone_connection_id,\n", "    # Each run will save the latest Embeddings to a subfolder under this path; runs will load the latest embeddings from the container and reuse any unchanged chunk embeddings.\n", "    embeddings_container=Input(\n", "        type=\"uri_folder\",\n", "        path=f\"azureml://datastores/workspaceblobstore/paths/embeddings/{asset_name}\",\n", "    ),\n", "    embeddings_model=embeddings_model_uri,\n", "    # This should be None if using a HuggingFace embeddings model.\n", "    aoai_connection_id=aoai_connection_id,\n", "    # Name of asset to register MLIndex under\n", "    asset_name=asset_name,\n", ")\n", "\n", "# By default AzureML Pipelines will reuse the output of previous component Runs when inputs have not changed.\n", "# If you want to rerun the Pipeline each time, so that any changes to upstream data sources are processed, uncomment the below line.\n", "\n", "# pipeline_job.settings.force_rerun = True" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally we add some properties to `pipeline_job` which ensure the Index generation progress and final Artifact appear in the PromptFlow Vector Index UI." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pipeline_job.properties[\"azureml.mlIndexAssetName\"] = asset_name\n", "pipeline_job.properties[\"azureml.mlIndexAssetKind\"] = \"pinecone\"\n", "pipeline_job.properties[\"azureml.mlIndexAssetSource\"] = \"Local Data\"" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Submit Pipeline\n", "\n", "**In case of any errors see [TROUBLESHOOT.md](../../TROUBLESHOOT.md).**\n", "\n", "The output of each step in the pipeline can be inspected via the Workspace UI; click the link under 'Details Page' after running the below cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "running_pipeline_job = ml_client.jobs.create_or_update(\n", "    pipeline_job, experiment_name=\"uri_to_pinecone\"\n", ")\n", "running_pipeline_job" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ml_client.jobs.stream(running_pipeline_job.name)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Use Index with langchain\n", "\n", "The Data Asset produced by the AzureML Pipeline above contains a yaml file named 'MLIndex' which contains all the information needed to use the Pinecone index.\n", "For instance, if an AOAI deployment was used to embed the documents, the details of that deployment and a reference to the secret are there.\n", "This allows easy loading of the MLIndex into a langchain retriever."
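] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next cells load the MLIndex as a langchain retriever and run a QA chain over it. When you need control over the number of results, you can instead load it as a langchain `VectorStore`; a hedged sketch (the `k` parameter follows the standard langchain `VectorStore` interface):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Hedged sketch: load the MLIndex as a langchain VectorStore to control `k`.\n", "from azureml.rag.mlindex import MLIndex\n", "\n", "vectorstore = MLIndex(\n", "    ml_client.data.get(asset_name, label=\"latest\")\n", ").as_langchain_vectorstore()\n", "docs = vectorstore.similarity_search(\"What is RAG?\", k=3)\n", "print(len(docs))"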
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azureml.rag.mlindex import MLIndex\n", "\n", "question = \"What is RAG?\"\n", "\n", "retriever = MLIndex(\n", "    ml_client.data.get(asset_name, label=\"latest\")\n", ").as_langchain_retriever()\n", "retriever.get_relevant_documents(question)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you have not deployed `gpt-35-turbo` on your Azure OpenAI resource the below cell will fail indicating the `API deployment for this resource does not exist`. Follow the previous instructions for deploying `text-embedding-ada-002` to deploy `gpt-35-turbo`, and note the chosen deployment name: either use the same one below or update it if you chose a different name." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from langchain.chains import RetrievalQA\n", "from azureml.rag.models import init_llm, parse_model_uri\n", "\n", "model_config = parse_model_uri(\n", "    \"azure_open_ai://deployment/gpt-35-turbo/model/gpt-35-turbo\"\n", ")\n", "model_config[\"api_base\"] = aoai_connection[\"properties\"][\"target\"]\n", "model_config[\"key\"] = aoai_connection[\"properties\"][\"credentials\"][\"key\"]\n", "model_config[\"temperature\"] = 0.3\n", "model_config[\"max_retries\"] = 3\n", "\n", "qa = RetrievalQA.from_chain_type(\n", "    llm=init_llm(model_config), chain_type=\"stuff\", retriever=retriever\n", ")\n", "\n", "qa.run(question)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Note: To control the number of documents returned when searching, try getting the MLIndex `as_langchain_vectorstore()` instead (as sketched above); it implements the `VectorStore` interface, which has more parameters." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Use MLIndex with PromptFlow\n", "\n", "To use the MLIndex in PromptFlow the `asset_id` can be used with the `Vector Index Lookup` Tool. Replace `versions/2` with `versions/latest` to use the latest version." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "asset_id = f\"azureml:/{ml_client.data.get(asset_name, label='latest').id}\"\n", "asset_id.replace(\"resourceGroups\", \"resourcegroups\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Source Data from a Git Repo\n", "\n", "If you want to use a git repo as a data source you can wrap the `uri_into_pinecone` pipeline in a new one which first performs a git clone."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "@pipeline(default_compute=\"serverless\")\n", "def git_to_pinecone(\n", "    git_url,\n", "    embeddings_model,\n", "    pinecone_config,\n", "    pinecone_connection_id,\n", "    asset_name,\n", "    chunk_size=1024,\n", "    data_source_glob=None,\n", "    data_source_url=None,\n", "    document_path_replacement_regex=None,\n", "    branch_name=None,\n", "    git_connection_id=None,\n", "    aoai_connection_id=None,\n", "    embeddings_container=None,\n", "):\n", "    git_clone = git_clone_component(git_repository=git_url, branch_name=branch_name)\n", "    use_automatic_compute(git_clone)\n", "    if optional_pipeline_input_provided(git_connection_id):\n", "        git_clone.environment_variables[\n", "            \"AZUREML_WORKSPACE_CONNECTION_ID_GIT\"\n", "        ] = git_connection_id\n", "\n", "    return uri_into_pinecone(\n", "        git_clone.outputs.output_data,\n", "        embeddings_model,\n", "        pinecone_config,\n", "        pinecone_connection_id,\n", "        asset_name,\n", "        chunk_size,\n", "        data_source_glob,\n", "        data_source_url,\n", "        document_path_replacement_regex,\n", "        aoai_connection_id,\n", "        embeddings_container,\n", "    )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The settings below show how the different git and data_source parameters can be set to process only the AzureML documentation from the larger azure-docs git repo, and to ensure the source url for each document links to the publicly hosted page instead of the git url." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "git_url = \"https://github.com/MicrosoftDocs/azure-docs/\"\n", "data_source_glob = \"articles/machine-learning/**/*\"\n", "data_source_url = \"https://learn.microsoft.com/en-us/azure\"\n", "# This regex is used to remove the 'articles' folder from the source url and remove the file extension.\n", "document_path_replacement_regex = r'{\"match_pattern\": \"(.*)/articles/(.*)(\\\\.[^.]+)$\", \"replacement_pattern\": \"\\\\1/\\\\2\"}'\n", "asset_name = \"azure_docs_ml_aoai_pinecone_mlindex\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since the data being indexed is changing, the target index in Pinecone should also be updated.\n", "\n", "**Note: if you are using Pinecone's Starter Plan, you are allowed to have only 1 index. In that case, make sure you have deleted any existing index before moving on.** More details on the Starter Plan here: https://docs.pinecone.io/docs/starter-environment."
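] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To stay within the Starter Plan's single-index limit, the hedged sketch below (again assuming the v2-style `pinecone-client` API) deletes the index created earlier in this notebook if it is still present:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional: delete the earlier index to stay within the Starter Plan's one-index limit.\n", "# Hedged sketch assuming the v2-style `pinecone-client` API (`init`, `list_indexes`, `delete_index`).\n", "import pinecone\n", "\n", "pinecone.init(api_key=\"<api-key>\", environment=\"<environment>\")\n", "old_index_name = \"pinecone-notebook-local-files-index\"\n", "if old_index_name in pinecone.list_indexes():\n", "    pinecone.delete_index(old_index_name)"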
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pinecone_config = {\n", "    \"index_name\": \"azure-docs-machine-learning-aoai-embedding\",\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azure.ai.ml import Input\n", "import json\n", "\n", "pipeline_job = git_to_pinecone(\n", "    git_url=git_url,\n", "    data_source_glob=data_source_glob,\n", "    data_source_url=data_source_url,\n", "    document_path_replacement_regex=document_path_replacement_regex,\n", "    pinecone_config=json.dumps(pinecone_config),\n", "    pinecone_connection_id=pinecone_connection_id,\n", "    # Each run will save the latest Embeddings to a subfolder under this path; runs will load the latest embeddings from the container and reuse any unchanged chunk embeddings.\n", "    embeddings_container=Input(\n", "        type=\"uri_folder\",\n", "        path=f\"azureml://datastores/workspaceblobstore/paths/embeddings/{asset_name}\",\n", "    ),\n", "    embeddings_model=embeddings_model_uri,\n", "    aoai_connection_id=aoai_connection_id,\n", "    # Name of asset to register MLIndex under\n", "    asset_name=asset_name,\n", ")\n", "\n", "# Rerun each time so that git_clone isn't cached, if the intent is to ingest the latest data.\n", "# pipeline_job.settings.force_rerun = True" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# These are added so that in-progress index generations can be listed in the UI; this tagging is done automatically by the UI.\n", "pipeline_job.properties[\"azureml.mlIndexAssetName\"] = asset_name\n", "pipeline_job.properties[\"azureml.mlIndexAssetKind\"] = \"pinecone\"\n", "pipeline_job.properties[\"azureml.mlIndexAssetSource\"] = \"Git Repository\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "running_pipeline_job = ml_client.jobs.create_or_update(\n", "    pipeline_job, experiment_name=\"git_to_pinecone\"\n", ")\n", "running_pipeline_job" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ml_client.jobs.stream(running_pipeline_job.name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup Pipeline to Run on Schedule\n", "\n", "It's possible to set up Pipelines to run on a regular schedule. Below we configure the AzureDocs pipeline to run once every day."
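] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If a recurrence pattern isn't flexible enough, a cron expression trigger is also supported alongside the recurrence trigger used below; a brief sketch (using `CronTrigger` from `azure.ai.ml.entities`, here firing every day at 02:15 UTC):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Alternative to the RecurrenceTrigger used below: a cron expression trigger.\n", "# This example fires every day at 02:15 UTC.\n", "from azure.ai.ml.constants import TimeZone\n", "from azure.ai.ml.entities import CronTrigger\n", "\n", "cron_trigger = CronTrigger(expression=\"15 2 * * *\", time_zone=TimeZone.UTC)"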
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azure.ai.ml.constants import TimeZone\n", "from azure.ai.ml.entities import JobSchedule, RecurrenceTrigger, RecurrencePattern\n", "from datetime import datetime, timedelta\n", "\n", "\n", "schedule_name = \"azure_docs_ml_aoai_pinecone_mlindex_daily\"\n", "\n", "# Make sure the pipeline runs git_clone on every trigger.\n", "pipeline_job.settings.force_rerun = True\n", "\n", "schedule_start_time = datetime.utcnow() + timedelta(minutes=1)\n", "recurrence_trigger = RecurrenceTrigger(\n", "    frequency=\"day\",\n", "    interval=1,\n", "    # schedule=RecurrencePattern(hours=16, minutes=[15]),\n", "    start_time=schedule_start_time,\n", "    time_zone=TimeZone.UTC,\n", ")\n", "\n", "job_schedule = JobSchedule(\n", "    name=schedule_name,\n", "    trigger=recurrence_trigger,\n", "    create_job=pipeline_job,\n", "    properties={\n", "        \"azureml.mlIndexAssetName\": asset_name,\n", "        \"azureml.mlIndexAssetKind\": \"pinecone\",\n", "        \"azureml.mlIndexAssetSource\": \"Git Repository\",\n", "    },\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once created, the schedule must be enabled.\n", "\n", "**Note:** To see Scheduled Pipelines in the AzureML Workspace UI you must navigate to the 'Jobs' page (beaker icon) AND have a flight enabled in your URL. Take your URL and modify it like so:\n", "- before: https://ml.azure.com/experiments?wsid=/subscriptions/.../resourceGroups/.../providers/Microsoft.MachineLearningServices/workspaces/my_awesome_workspace\n", "- after: https://ml.azure.com/experiments?wsid=/subscriptions/.../resourceGroups/.../providers/Microsoft.MachineLearningServices/workspaces/my_awesome_workspace&flight=schedules" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "job_schedule_res = ml_client.schedules.begin_create_or_update(\n", "    schedule=job_schedule\n", ").result()\n", "job_schedule_res" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The schedule can be disabled via the schedules UI or via the below code." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "job_schedule_res = ml_client.schedules.begin_disable(name=schedule_name).result()\n", "job_schedule_res.is_enabled" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### [Optional] Provision Cluster\n", "\n", "You don't have to! The settings on the Pipeline use AzureML Serverless Compute, so you can use any SKU you have quota for, on demand. If you want to use a cluster, that's also supported."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azure.ai.ml.entities import AmlCompute\n", "\n", "cpu_compute_target = \"rag-cpu\"\n", "\n", "try:\n", " dedicated_cpu_compute = ml_client.compute.get(cpu_compute_target)\n", "except Exception:\n", " # Let's create the Azure Machine Learning compute object with the intended parameters\n", " dedicated_cpu_compute = AmlCompute(\n", " name=cpu_compute_target,\n", " type=\"amlcompute\",\n", " size=\"Standard_E8s_v3\",\n", " min_instances=0,\n", " max_instances=2,\n", " idle_time_before_scale_down=600,\n", " tier=\"Dedicated\",\n", " )\n", "\n", " dedicated_cpu_compute = ml_client.compute.begin_create_or_update(\n", " dedicated_cpu_compute\n", " ).result(timeout=600)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.13" } }, "nbformat": 4, "nbformat_minor": 2 }