phi3/3_optimization_olive.ipynb
{
"cells": [
{
"cell_type": "markdown",
"id": "b996236e-2b7b-4b4a-b6d1-abbb83f79d5d",
"metadata": {},
"source": [
"# Optimize SLM using Microsoft Olive\n",
"\n",
"This hands-on considers on-device or hybrid deployment scenarios.\n",
"\n",
"### Overview\n",
"\n",
"Microsoft Olive is a hardware-aware AI model optimization toolchain developed by Microsoft to streamline the deployment of AI models. Olive simplifies the process of preparing AI models for deployment by making them faster and more efficient, particularly for use on edge devices, cloud, and various hardware configurations. It works by automatically applying optimizations to the AI models, such as reducing model size, lowering latency, and improving performance, without requiring manual intervention from developers.\n",
"\n",
"Key features of Microsoft Olive include:\n",
"\n",
"- **Automated optimization**: Olive analyzes and applies optimizations specific to the model’s hardware environment.\n",
"- **Cross-platform compatibility**: It supports various platforms such as Windows, Linux, and different hardware architectures, including CPUs, GPUs, and specialized AI accelerators.\n",
"- **Integration with Microsoft tools**: Olive is designed to work seamlessly with Microsoft AI services like Azure, making it easier to deploy optimized models in cloud-based solutions.\n",
"\n",
"[Note] Please use `Python 3.10 - SDK v2 (azureml_py310_sdkv2)` conda environment.\n"
]
},
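{
"cell_type": "markdown",
"id": "olive-workflow-sketch",
"metadata": {},
"source": [
"The snippet below is a minimal, illustrative sketch of how an Olive workflow is typically expressed: a config describing the input model, the target system, and the optimization passes is handed to Olive, either via the CLI (`olive run --config <file>`) or programmatically via `olive.workflows.run`. It is not executed in this lab; `my_model_dir` and `olive_out` are placeholders, and the real config for this lab is built in section 3.\n",
"\n",
"```python\n",
"# Minimal sketch of an Olive workflow, assuming the olive-ai package is installed.\n",
"# \"my_model_dir\" and \"olive_out\" are placeholders, not values used in this lab.\n",
"from olive.workflows import run as olive_run\n",
"\n",
"workflow = {\n",
"    \"input_model\": {\"type\": \"HfModel\", \"model_path\": \"my_model_dir\"},  # model to optimize\n",
"    \"systems\": {\n",
"        \"local_system\": {\n",
"            \"type\": \"LocalSystem\",\n",
"            \"accelerators\": [{\"device\": \"CPU\", \"execution_providers\": [\"CPUExecutionProvider\"]}],\n",
"        }\n",
"    },\n",
"    \"passes\": {\"builder\": {\"type\": \"ModelBuilder\", \"precision\": \"int4\"}},  # optimizations to apply\n",
"    \"output_dir\": \"olive_out\",\n",
"    \"host\": \"local_system\",\n",
"    \"target\": \"local_system\",\n",
"}\n",
"\n",
"olive_run(workflow)  # Olive applies the passes and writes the optimized model to output_dir\n",
"```"
]
},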
{
"cell_type": "code",
"execution_count": null,
"id": "7efa4821",
"metadata": {},
"outputs": [],
"source": [
"%load_ext autoreload\n",
"%autoreload 2\n",
"\n",
"import os, sys\n",
"lab_prep_dir = os.getcwd().split(\"slm-innovator-lab\")[0] + \"slm-innovator-lab/0_lab_preparation\"\n",
"sys.path.append(os.path.abspath(lab_prep_dir))\n",
"\n",
"from common import check_kernel\n",
"check_kernel()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c784f3b2",
"metadata": {},
"outputs": [],
"source": [
"%pip install onnxruntime-genai==0.4.0"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cb8cda9e-351a-455b-b17d-5e0f2b716db1",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"%store -r job_name\n",
"try:\n",
" job_name\n",
" print(job_name)\n",
"except NameError:\n",
" print(\"++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++\")\n",
" print(\"[ERROR] Please run the previous notebook (model training) again.\")\n",
" print(\"++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++\")"
]
},
{
"cell_type": "markdown",
"id": "4729ad82-7ab6-4eb0-9a06-d9627ed3e868",
"metadata": {},
"source": [
"## 1. Load config file\n",
"\n",
"---\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "afa38600-ea3e-41dc-b05e-8ddc5c976327",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import os\n",
"import yaml\n",
"from logger import logger\n",
"from datetime import datetime\n",
"snapshot_date = datetime.now().strftime(\"%Y-%m-%d\")\n",
"\n",
"with open('config.yml') as f:\n",
" d = yaml.load(f, Loader=yaml.FullLoader)\n",
" \n",
"AZURE_SUBSCRIPTION_ID = d['config']['AZURE_SUBSCRIPTION_ID']\n",
"AZURE_RESOURCE_GROUP = d['config']['AZURE_RESOURCE_GROUP']\n",
"AZURE_WORKSPACE = d['config']['AZURE_WORKSPACE']\n",
"AZURE_DATA_NAME = d['config']['AZURE_DATA_NAME'] \n",
"DATA_DIR = d['config']['DATA_DIR']\n",
"CLOUD_DIR = d['config']['CLOUD_DIR']\n",
"HF_MODEL_NAME_OR_PATH = d['config']['HF_MODEL_NAME_OR_PATH']\n",
"IS_DEBUG = d['config']['IS_DEBUG']\n",
"\n",
"azure_env_name = d['serve']['azure_env_name']\n",
"azure_model_name = d['serve']['azure_model_name']\n",
"\n",
"logger.info(\"===== 0. Azure ML Deployment Info =====\")\n",
"logger.info(f\"AZURE_SUBSCRIPTION_ID={AZURE_SUBSCRIPTION_ID}\")\n",
"logger.info(f\"AZURE_RESOURCE_GROUP={AZURE_RESOURCE_GROUP}\")\n",
"logger.info(f\"AZURE_WORKSPACE={AZURE_WORKSPACE}\")\n",
"logger.info(f\"AZURE_DATA_NAME={AZURE_DATA_NAME}\")\n",
"logger.info(f\"DATA_DIR={DATA_DIR}\")\n",
"logger.info(f\"CLOUD_DIR={CLOUD_DIR}\")\n",
"logger.info(f\"HF_MODEL_NAME_OR_PATH={HF_MODEL_NAME_OR_PATH}\")\n",
"logger.info(f\"IS_DEBUG={IS_DEBUG}\")\n",
"\n",
"logger.info(f\"azure_env_name={azure_env_name}\")\n",
"logger.info(f\"azure_model_name={azure_model_name}\")"
]
},
{
"cell_type": "markdown",
"id": "1efb59c7-06f4-4a18-a177-6c3ec0725423",
"metadata": {},
"source": [
"<br>\n",
"\n",
"## 2. Model preparation\n",
"\n",
"---\n",
"\n",
"### 2.1. Configure workspace details\n",
"\n",
"To connect to a workspace, we need identifying parameters - a subscription, a resource group, and a workspace name. We will use these details in the MLClient from azure.ai.ml to get a handle on the Azure Machine Learning workspace we need. We will use the default Azure authentication for this hands-on.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5b8099c0-d403-47dd-b2fc-ac1426319661",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# import required libraries\n",
"import time\n",
"from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential\n",
"from azure.ai.ml import MLClient, Input\n",
"from azure.ai.ml import command\n",
"from azure.ai.ml.entities import Model\n",
"from azure.ai.ml.constants import AssetTypes\n",
"from azure.core.exceptions import ResourceNotFoundError, ResourceExistsError\n",
"\n",
"logger.info(f\"===== 2. Serving preparation =====\")\n",
"logger.info(f\"Calling DefaultAzureCredential.\")\n",
"credential = DefaultAzureCredential()\n",
"ml_client = None\n",
"try:\n",
" ml_client = MLClient.from_config(credential)\n",
"except Exception as ex:\n",
" print(ex)\n",
" ml_client = MLClient(credential, AZURE_SUBSCRIPTION_ID, AZURE_RESOURCE_GROUP, AZURE_WORKSPACE)"
]
},
{
"cell_type": "markdown",
"id": "a5fd308c-c64f-4fe6-958c-217e1d13e022",
"metadata": {},
"source": [
"### 2.2. Create model asset\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9e3006a4-0aa1-4059-b358-c6122b471497",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"def get_or_create_model_asset(ml_client, model_name, job_name, model_dir=\"outputs\", model_type=\"custom_model\", update=False):\n",
" \n",
" try:\n",
" latest_model_version = max([int(m.version) for m in ml_client.models.list(name=model_name)])\n",
" if update:\n",
" raise ResourceExistsError('Found Model asset, but will update the Model.')\n",
" else:\n",
" model_asset = ml_client.models.get(name=model_name, version=latest_model_version)\n",
" logger.info(f\"Found Model asset: {model_name}. Will not create again\")\n",
" except (ResourceNotFoundError, ResourceExistsError) as e:\n",
" logger.info(f\"Exception: {e}\") \n",
" model_path = f\"azureml://jobs/{job_name}/outputs/artifacts/paths/{model_dir}/\" \n",
" run_model = Model(\n",
" name=model_name, \n",
" path=model_path,\n",
" description=\"Model created from run.\",\n",
" type=model_type # mlflow_model, custom_model, triton_model\n",
" )\n",
" model_asset = ml_client.models.create_or_update(run_model)\n",
" logger.info(f\"Created Model asset: {model_name}\")\n",
"\n",
" return model_asset"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "efe25c0b-5145-46f5-a468-09b684e1fdfd",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"model_dir = d['train']['model_dir']\n",
"model = get_or_create_model_asset(ml_client, azure_model_name, job_name, model_dir, model_type=\"custom_model\", update=False)\n",
"\n",
"logger.info(\"===== 3. (Optional) Create model asset and get fine-tuned LLM to local folder =====\")\n",
"logger.info(f\"azure_model_name={azure_model_name}\")\n",
"logger.info(f\"model_dir={model_dir}\")\n",
"#logger.info(f\"model={model}\")"
]
},
{
"cell_type": "markdown",
"id": "20c9eb91-816e-4115-8865-1227d5ce4cda",
"metadata": {},
"source": [
"### 2.3. Get fine-tuned LLM adapter to local folder\n",
"\n",
"You can copy it to your local directory to perform inference or serve the model in Azure environment. (e.g., real-time endpoint)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e06d992f-d700-4e37-8886-171af1e2e7ef",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Download the model \n",
"local_model_dir = \"./artifact_downloads\"\n",
"os.makedirs(local_model_dir, exist_ok=True)\n",
"\n",
"ml_client.models.download(name=azure_model_name, download_path=local_model_dir, version=model.version)"
]
},
{
"cell_type": "markdown",
"id": "e739f255-90e6-4417-824b-628488f314a4",
"metadata": {},
"source": [
"### 2.4. Merge adapter and save\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "32a3a26d-1c05-4b7c-b68c-bc9f79e83f12",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import torch\n",
"from transformers import AutoTokenizer\n",
"from peft import AutoPeftModelForCausalLM\n",
"model_tmp_dir = os.path.join(local_model_dir, azure_model_name, model_dir)\n",
"model = AutoPeftModelForCausalLM.from_pretrained(model_tmp_dir, torch_dtype=torch.bfloat16)\n",
"merged_model = model.merge_and_unload()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2674efa1-927f-4354-96f4-80502a903dc6",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"merged_model_dir = os.path.join(local_model_dir, \"merged\")\n",
"merged_model.save_pretrained(merged_model_dir, safe_serialization=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0443ce5d-d731-4a12-90cd-3a7e68beee5d",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_NAME_OR_PATH)\n",
"tokenizer.save_pretrained(merged_model_dir)"
]
},
{
"cell_type": "markdown",
"id": "2ca963df-6bb6-475c-b0c9-0dee6eaa5f78",
"metadata": {
"tags": []
},
"source": [
"## 3. Optimization using Olive\n",
"\n",
"---\n",
"\n",
"Before running this notebook, please make sure you have installed the [Olive](https://github.com/microsoft/Olive) package.\n",
"\n",
"#### Input model\n",
"\n",
"You can also select Azure ML curated model. The input model will be automatically downloaded from the Azure Model catalog:\n",
"\n",
"#### Systems\n",
"\n",
"We use `LocalSystem` as the device in this notebook. We enable `CPUExecutionProvider` in the `accelerators` field. but you can use Azure ML and Azure Arc.\n",
"\n",
"#### Passes\n",
"\n",
"We can add several passes to the config file. For example, you can pass LoRA, Evaluation, MergeAdapter, and Quantization. But for this hands-on, we'll simply do a 4-bit quantization followed by an ONNX conversion.\n"
]
},
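{
"cell_type": "markdown",
"id": "olive-multi-pass-sketch",
"metadata": {},
"source": [
"As a reference, a richer workflow can chain multiple passes by listing them under `passes` and ordering them in `pass_flows`. The fragment below is illustrative only: the pass type names `LoRA` and `MergeAdapterWeights` are assumptions based on the Olive documentation, and exact names and options vary across Olive releases, so check the docs for your installed version. The config we actually use (next cell) contains a single `ModelBuilder` pass.\n",
"\n",
"```python\n",
"# Illustrative fragment of a multi-pass Olive config, expressed as a Python dict.\n",
"# The pass type names below are assumptions from the Olive docs; verify them\n",
"# against the Olive version you have installed before using this for real.\n",
"multi_pass_fragment = {\n",
"    \"passes\": {\n",
"        \"lora\": {\"type\": \"LoRA\"},                    # fine-tune a LoRA adapter\n",
"        \"merge\": {\"type\": \"MergeAdapterWeights\"},    # merge the adapter into the base model\n",
"        \"builder\": {\"type\": \"ModelBuilder\", \"precision\": \"int4\"},  # export an int4 ONNX model\n",
"    },\n",
"    \"pass_flows\": [[\"lora\", \"merge\", \"builder\"]],  # the order in which the passes run\n",
"}\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "olive-config-heading",
"metadata": {},
"source": [
"### 3.1. Create the Olive configuration\n"
]
},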
{
"cell_type": "code",
"execution_count": null,
"id": "2a2901e7-41ab-4459-be74-9232977fdbab",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"%%writefile olive/olive_onnx_config.json\n",
"{\n",
" \"input_model\": {\n",
" \"type\": \"HfModel\",\n",
" \"model_path\": \"{{merged_model_dir}}\",\n",
" \"load_kwargs\": {\n",
" \"trust_remote_code\": true\n",
" }\n",
" },\n",
" \"systems\": {\n",
" \"local_system\": {\n",
" \"type\": \"LocalSystem\",\n",
" \"accelerators\": [\n",
" {\n",
" \"device\": \"CPU\",\n",
" \"execution_providers\": [\n",
" \"CPUExecutionProvider\"\n",
" ]\n",
" }\n",
" ]\n",
" }\n",
" },\n",
" \"passes\": {\n",
" \"builder\": {\n",
" \"type\": \"ModelBuilder\",\n",
" \"precision\": \"int4\",\n",
" \"int4_accuracy_level\": 4\n",
" }\n",
" },\n",
" \"pass_flows\": [\n",
" [\n",
" \"builder\"\n",
" ]\n",
" ],\n",
" \"cache_dir\": \"{{olive_cache_dir}}\",\n",
" \"output_dir\": \"{{olive_output_dir}}\",\n",
" \"host\": \"local_system\",\n",
" \"target\": \"local_system\"\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "81c9e909-12a9-4d10-bcfe-31dd9efae84f",
"metadata": {},
"outputs": [],
"source": [
"import jinja2\n",
"from pathlib import Path\n",
"jinja_env = jinja2.Environment() \n",
"\n",
"olive_cache_dir = \"olive_cache\"\n",
"olive_output_dir = \"olive_models\"\n",
"\n",
"template = jinja_env.from_string(Path(\"olive/olive_onnx_config.json\").open().read())\n",
"Path(\"olive/olive_onnx_config.json\").open(\"w\").write(\n",
" template.render(merged_model_dir=merged_model_dir, olive_cache_dir=olive_cache_dir, olive_output_dir=olive_output_dir)\n",
")\n",
"!pygmentize olive/olive_onnx_config.json | cat -n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4615ed0d-f69a-4afc-93d6-e8b952cfc8d5",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"HF_TOKEN = \"\" # Your Hugging Face Token\n",
"!huggingface-cli login --token {HF_TOKEN} --add-to-git-credential"
]
},
{
"cell_type": "markdown",
"id": "bc571920-e460-47d5-8299-38ee6c0afd43",
"metadata": {},
"source": [
"### 3.2. Start Optimization\n",
"\n",
"It takes a few minutes to complete the code cell below.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1bdae095-eef8-49dc-a15d-d76deabbd9bb",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import sys\n",
"!{sys.executable} -m olive run --config olive/olive_onnx_config.json"
]
},
{
"cell_type": "markdown",
"id": "86c2d129-63da-4ae6-99e4-08dd7a7254df",
"metadata": {},
"source": [
"### 3.3. Prediction\n",
"\n",
"You don't need a GPU device - you can load and infer Phi models on your on-device.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "24971715-f5e7-4ec1-a445-91fdeb00bfd1",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import time\n",
"import onnxruntime_genai as og\n",
"\n",
"onnx_path = f\"./{olive_output_dir}/output_model/model\"\n",
"!cp tokenizer.json {onnx_path}\n",
"model = og.Model(onnx_path)\n",
"tokenizer = og.Tokenizer(model)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "306afca5-efc4-41b7-88d2-b6c212565a31",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"tokenizer_stream = tokenizer.create_stream()\n",
" \n",
"search_options = {}\n",
"search_options['min_length'] = 128\n",
"search_options['max_length'] = 256\n",
"search_options['do_sample'] = True\n",
"search_options['temperature'] = 0.1\n",
"search_options['top_p'] = 0.95\n",
"\n",
"timings = True\n",
"chat_template = '<|user|>\\n{input} <|end|>\\n<|assistant|>'"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3821894a-2d8b-4bb3-b925-7b601aba5773",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Keep asking for input prompts in a loop\n",
"while True:\n",
" text = input(\"Input: \")\n",
" if not text:\n",
" print(\"Error, input cannot be empty\")\n",
" continue\n",
"\n",
" if timings: started_timestamp = time.time()\n",
"\n",
" # If there is a chat template, use it\n",
" prompt = f'{chat_template.format(input=text)}'\n",
"\n",
" input_tokens = tokenizer.encode(prompt)\n",
"\n",
" params = og.GeneratorParams(model)\n",
" params.set_search_options(**search_options)\n",
" params.input_ids = input_tokens\n",
" generator = og.Generator(model, params)\n",
" if timings:\n",
" first = True\n",
" new_tokens = []\n",
"\n",
" print()\n",
" print(\"Output: \", end='', flush=True)\n",
"\n",
" try:\n",
" while not generator.is_done():\n",
" generator.compute_logits()\n",
" generator.generate_next_token()\n",
" if timings:\n",
" if first:\n",
" first_token_timestamp = time.time()\n",
" first = False\n",
"\n",
" new_token = generator.get_next_tokens()[0]\n",
" print(tokenizer_stream.decode(new_token), end='', flush=True)\n",
" if timings: new_tokens.append(new_token)\n",
" except KeyboardInterrupt:\n",
" print(\" --control+c pressed, aborting generation--\")\n",
" print()\n",
" print()\n",
"\n",
" # Delete the generator to free the captured graph for the next generator, if graph capture is enabled\n",
" del generator\n",
"\n",
" if timings:\n",
" prompt_time = first_token_timestamp - started_timestamp\n",
" run_time = time.time() - first_token_timestamp\n",
" print(f\"Prompt length: {len(input_tokens)}, New tokens: {len(new_tokens)}, Time to first: {(prompt_time):.2f}s, Prompt tokens per second: {len(input_tokens)/prompt_time:.2f} tps, New tokens per second: {len(new_tokens)/run_time:.2f} tps\")\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "py310",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.15"
}
},
"nbformat": 4,
"nbformat_minor": 5
}