# Optimize SLM using Microsoft Olive

This hands-on lab covers on-device and hybrid deployment scenarios.

### Overview

Microsoft Olive is a hardware-aware AI model optimization toolchain developed by Microsoft to streamline the deployment of AI models. Olive simplifies the process of preparing AI models for deployment by making them faster and more efficient, particularly for use on edge devices, cloud, and various hardware configurations. It works by automatically applying optimizations to the AI models, such as reducing model size, lowering latency, and improving performance, without requiring manual intervention from developers.

Key features of Microsoft Olive include:

-   **Automated optimization**: Olive analyzes and applies optimizations specific to the model's hardware environment.
-   **Cross-platform compatibility**: It supports various platforms such as Windows, Linux, and different hardware architectures, including CPUs, GPUs, and specialized AI accelerators.
-   **Integration with Microsoft tools**: Olive is designed to work seamlessly with Microsoft AI services like Azure, making it easier to deploy optimized models in cloud-based solutions.

[Note] Please use the `Python 3.10 - SDK v2 (azureml_py310_sdkv2)` conda environment.


In [None]:
%load_ext autoreload
%autoreload 2

import os, sys
lab_prep_dir = os.getcwd().split("slm-innovator-lab")[0] + "slm-innovator-lab/0_lab_preparation"
sys.path.append(os.path.abspath(lab_prep_dir))

from common import check_kernel
check_kernel()

In [None]:
%pip install onnxruntime-genai==0.4.0

In [None]:
%store -r job_name
try:
    job_name
    print(job_name)
except NameError:
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] Please run the previous notebook (model training) again.")
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")

## 1. Load config file

---


In [None]:
import os
import yaml
from logger import logger
from datetime import datetime
snapshot_date = datetime.now().strftime("%Y-%m-%d")

with open('config.yml') as f:
    d = yaml.load(f, Loader=yaml.FullLoader)
    
AZURE_SUBSCRIPTION_ID = d['config']['AZURE_SUBSCRIPTION_ID']
AZURE_RESOURCE_GROUP = d['config']['AZURE_RESOURCE_GROUP']
AZURE_WORKSPACE = d['config']['AZURE_WORKSPACE']
AZURE_DATA_NAME = d['config']['AZURE_DATA_NAME']    
DATA_DIR = d['config']['DATA_DIR']
CLOUD_DIR = d['config']['CLOUD_DIR']
HF_MODEL_NAME_OR_PATH = d['config']['HF_MODEL_NAME_OR_PATH']
IS_DEBUG = d['config']['IS_DEBUG']

azure_env_name = d['serve']['azure_env_name']
azure_model_name = d['serve']['azure_model_name']

logger.info("===== 0. Azure ML Deployment Info =====")
logger.info(f"AZURE_SUBSCRIPTION_ID={AZURE_SUBSCRIPTION_ID}")
logger.info(f"AZURE_RESOURCE_GROUP={AZURE_RESOURCE_GROUP}")
logger.info(f"AZURE_WORKSPACE={AZURE_WORKSPACE}")
logger.info(f"AZURE_DATA_NAME={AZURE_DATA_NAME}")
logger.info(f"DATA_DIR={DATA_DIR}")
logger.info(f"CLOUD_DIR={CLOUD_DIR}")
logger.info(f"HF_MODEL_NAME_OR_PATH={HF_MODEL_NAME_OR_PATH}")
logger.info(f"IS_DEBUG={IS_DEBUG}")

logger.info(f"azure_env_name={azure_env_name}")
logger.info(f"azure_model_name={azure_model_name}")
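
The cell above assumes that `config.yml` contains a `config` and a `serve` section, and a later cell also reads `model_dir` from a `train` section. A minimal sketch of a check that reports missing keys up front, so a gap in the file does not surface as a `KeyError` halfway through the notebook. The helper name `find_missing_keys` and the sample values are hypothetical.

```python
# Keys this notebook reads from config.yml, grouped by top-level section.
REQUIRED_KEYS = {
    "config": [
        "AZURE_SUBSCRIPTION_ID", "AZURE_RESOURCE_GROUP", "AZURE_WORKSPACE",
        "AZURE_DATA_NAME", "DATA_DIR", "CLOUD_DIR",
        "HF_MODEL_NAME_OR_PATH", "IS_DEBUG",
    ],
    "serve": ["azure_env_name", "azure_model_name"],
    "train": ["model_dir"],
}

def find_missing_keys(cfg, required=REQUIRED_KEYS):
    """Return a sorted list of 'section.key' strings absent from cfg."""
    missing = []
    for section, keys in required.items():
        section_values = cfg.get(section) or {}
        for key in keys:
            if key not in section_values:
                missing.append(f"{section}.{key}")
    return sorted(missing)

# Hypothetical, deliberately incomplete config for illustration.
sample_cfg = {
    "config": {"AZURE_SUBSCRIPTION_ID": "<subscription-id>"},
    "serve": {"azure_env_name": "<env>", "azure_model_name": "<model>"},
}
print(find_missing_keys(sample_cfg))
```

In the notebook, you could call `find_missing_keys(d)` right after loading the YAML file and stop if the returned list is not empty.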

<br>

## 2. Model preparation

---

### 2.1. Configure workspace details

To connect to a workspace, we need three identifying parameters: a subscription ID, a resource group, and a workspace name. We pass these to `MLClient` from `azure.ai.ml` to get a handle to the Azure Machine Learning workspace. This hands-on uses the default Azure authentication.


In [None]:
# import required libraries
import time
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential
from azure.ai.ml import MLClient, Input
from azure.ai.ml import command
from azure.ai.ml.entities import Model
from azure.ai.ml.constants import AssetTypes
from azure.core.exceptions import ResourceNotFoundError, ResourceExistsError

logger.info(f"===== 2. Serving preparation =====")
logger.info(f"Calling DefaultAzureCredential.")
credential = DefaultAzureCredential()
ml_client = None
try:
    ml_client = MLClient.from_config(credential)
except Exception as ex:
    print(ex)
    ml_client = MLClient(credential, AZURE_SUBSCRIPTION_ID, AZURE_RESOURCE_GROUP, AZURE_WORKSPACE)

### 2.2. Create model asset


In [None]:
def get_or_create_model_asset(ml_client, model_name, job_name, model_dir="outputs", model_type="custom_model", update=False):
    
    try:
        latest_model_version = max([int(m.version) for m in ml_client.models.list(name=model_name)])
        if update:
            raise ResourceExistsError('Found Model asset, but will update the Model.')
        else:
            model_asset = ml_client.models.get(name=model_name, version=latest_model_version)
            logger.info(f"Found Model asset: {model_name}. Will not create again")
    except (ResourceNotFoundError, ResourceExistsError, ValueError) as e:
        # ValueError covers max() on an empty list when no version is registered yet
        logger.info(f"Exception: {e}")        
        model_path = f"azureml://jobs/{job_name}/outputs/artifacts/paths/{model_dir}/"    
        run_model = Model(
            name=model_name,        
            path=model_path,
            description="Model created from run.",
            type=model_type # mlflow_model, custom_model, triton_model
        )
        model_asset = ml_client.models.create_or_update(run_model)
        logger.info(f"Created Model asset: {model_name}")

    return model_asset
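
The function converts each version to `int` before taking the maximum. A small sketch of why that matters, assuming version values arrive as strings (which the `int(m.version)` call suggests): comparing strings ranks `"9"` above `"10"`. The helper name `latest_version` and the version list are made up for illustration.

```python
def latest_version(versions):
    """Return the highest version as an int, or None when the list is empty."""
    return max((int(v) for v in versions), default=None)

versions = ["1", "2", "9", "10"]   # made-up version strings
print(max(versions))               # lexicographic comparison of strings
print(latest_version(versions))    # numeric comparison
print(latest_version([]))          # no versions registered yet
```

Using `default=None` is one way to handle the "no versions yet" case without relying on an exception.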

In [None]:
model_dir = d['train']['model_dir']
model = get_or_create_model_asset(ml_client, azure_model_name, job_name, model_dir, model_type="custom_model", update=False)

logger.info("===== 3. (Optional) Create model asset and get fine-tuned LLM to local folder =====")
logger.info(f"azure_model_name={azure_model_name}")
logger.info(f"model_dir={model_dir}")
#logger.info(f"model={model}")

### 2.3. Get fine-tuned LLM adapter to local folder

You can copy the model to a local directory to run inference locally or to serve it in an Azure environment (e.g., a real-time endpoint).


In [None]:
# Download the model 
local_model_dir = "./artifact_downloads"
os.makedirs(local_model_dir, exist_ok=True)

ml_client.models.download(name=azure_model_name, download_path=local_model_dir, version=model.version)

### 2.4. Merge adapter and save


In [None]:
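import torch
from transformers import AutoTokenizer
from peft import AutoPeftModelForCausalLM

# Load the base model together with the downloaded adapter, then fold the adapter weights into the base model.
# A separate name (peft_model) avoids overwriting the `model` asset handle created earlier.
model_tmp_dir = os.path.join(local_model_dir, azure_model_name, model_dir)
peft_model = AutoPeftModelForCausalLM.from_pretrained(model_tmp_dir, torch_dtype=torch.bfloat16)
merged_model = peft_model.merge_and_unload()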
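
Conceptually, and assuming the adapter is a LoRA adapter, merging folds the low-rank update into the base weight so that no adapter is needed at inference time: `W_merged = W + (alpha / r) * B @ A`. A minimal pure-Python sketch of this arithmetic on a toy 2x2 weight follows. The function names and values are illustrative, and this is not how `peft` implements `merge_and_unload()` internally.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return weight + (alpha / rank) * (lora_b @ lora_a)."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

# Toy example: 2x2 base weight with a rank-1 adapter.
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]      # shape (2, rank)
lora_a = [[0.5, -0.5]]       # shape (rank, 2)
merged = merge_lora(weight, lora_a, lora_b, alpha=2, rank=1)
print(merged)
```

After merging, the result is a single dense matrix of the same shape as the base weight, which is why the merged model can be saved and converted like any regular Hugging Face model.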

In [None]:
merged_model_dir = os.path.join(local_model_dir, "merged")
merged_model.save_pretrained(merged_model_dir, safe_serialization=True)

In [None]:
tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_NAME_OR_PATH)
tokenizer.save_pretrained(merged_model_dir)

## 3. Optimization using Olive

---

Before running this notebook, please make sure you have installed the [Olive](https://github.com/microsoft/Olive) package.

#### Input model

You can also select an Azure ML curated model as the input, in which case it is downloaded automatically from the Azure model catalog. In this hands-on, the input is the merged model saved locally in the previous step.

#### Systems

We use `LocalSystem` as the device in this notebook and enable `CPUExecutionProvider` in the `accelerators` field. You can also use Azure ML and Azure Arc as systems.

#### Passes

We can add several passes to the config file, for example LoRA fine-tuning, evaluation, adapter merging, and quantization. For this hands-on, we keep it simple: a single `ModelBuilder` pass converts the model to ONNX with 4-bit (int4) precision.
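
To make "4-bit quantization" concrete, here is a minimal sketch of symmetric block-wise int4 quantization in plain Python. It illustrates the general idea only: each block of weights shares one scale, and every weight is stored as an integer in the range [-8, 7]. It is not the algorithm that `ModelBuilder` or ONNX Runtime actually applies, and the block size, function names, and weights are illustrative.

```python
def quantize_int4(values, block_size=4):
    """Quantize floats to signed 4-bit integers with one scale per block."""
    blocks = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        max_abs = max(abs(v) for v in block)
        # Map the largest magnitude to 7; fall back to 1.0 for all-zero blocks.
        scale = max_abs / 7 if max_abs > 0 else 1.0
        q = [max(-8, min(7, round(v / scale))) for v in block]
        blocks.append((scale, q))
    return blocks

def dequantize_int4(blocks):
    """Reconstruct approximate floats from (scale, ints) blocks."""
    return [scale * q for scale, ints in blocks for q in ints]

weights = [0.7, -0.35, 0.1, 0.0, 1.4, -1.4, 0.2, 0.7]  # made-up weights
blocks = quantize_int4(weights)
restored = dequantize_int4(blocks)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(blocks)
print(f"max abs error: {max_error:.4f}")
```

The rounding error per weight is bounded by half the block's scale, which is why smaller blocks (fewer weights per scale) generally trade a little extra storage for better accuracy.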


In [None]:
%%writefile olive/olive_onnx_config.json
{
    "input_model": {
        "type": "HfModel",
        "model_path": "{{merged_model_dir}}",
        "load_kwargs": {
            "trust_remote_code": true
        }
    },
    "systems": {
        "local_system": {
            "type": "LocalSystem",
            "accelerators": [
                {
                    "device": "CPU",
                    "execution_providers": [
                        "CPUExecutionProvider"
                    ]
                }
            ]
        }
    },
    "passes": {
        "builder": {
            "type": "ModelBuilder",
            "precision": "int4",
            "int4_accuracy_level": 4
        }
    },
    "pass_flows": [
        [
            "builder"
        ]
    ],
    "cache_dir": "{{olive_cache_dir}}",
    "output_dir": "{{olive_output_dir}}",
    "host": "local_system",
    "target": "local_system"
}

In [None]:
import jinja2
from pathlib import Path
jinja_env = jinja2.Environment()  

olive_cache_dir = "olive_cache"
olive_output_dir = "olive_models"

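config_path = Path("olive/olive_onnx_config.json")

# Render the placeholders and overwrite the template in place.
# read_text/write_text close the file, so the rendered content is flushed before it is displayed.
template = jinja_env.from_string(config_path.read_text())
config_path.write_text(
    template.render(merged_model_dir=merged_model_dir, olive_cache_dir=olive_cache_dir, olive_output_dir=olive_output_dir)
)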
!pygmentize olive/olive_onnx_config.json | cat -n
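
Because the rendered file overwrites the template, a quick sanity check that no `{{ ... }}` placeholder survived and that the result is still valid JSON can catch mistakes before starting Olive. A minimal standard-library sketch follows. The helper name `check_rendered_config` is hypothetical, and the demo uses inline strings rather than the real file.

```python
import json
import re

PLACEHOLDER = re.compile(r"\{\{.*?\}\}")

def check_rendered_config(text):
    """Parse rendered config text, rejecting leftover {{ ... }} placeholders."""
    leftovers = PLACEHOLDER.findall(text)
    if leftovers:
        raise ValueError(f"Unrendered placeholders: {leftovers}")
    return json.loads(text)

# A fully rendered fragment parses cleanly.
rendered = '{"cache_dir": "olive_cache", "output_dir": "olive_models"}'
config = check_rendered_config(rendered)
print(config["output_dir"])

# A fragment with a leftover placeholder is rejected.
try:
    check_rendered_config('{"cache_dir": "{{olive_cache_dir}}"}')
except ValueError as err:
    print(err)
```

With the real file, you would pass `Path("olive/olive_onnx_config.json").read_text()` to the helper.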

In [None]:
HF_TOKEN = "" # Your Hugging Face Token
!huggingface-cli login --token {HF_TOKEN} --add-to-git-credential

### 3.2. Start Optimization

It takes a few minutes to complete the code cell below.


In [None]:
import sys
!{sys.executable} -m olive run --config olive/olive_onnx_config.json

### 3.3. Prediction

You don't need a GPU device: you can load Phi models and run inference directly on your own device.


In [None]:
import time
import onnxruntime_genai as og

onnx_path = f"./{olive_output_dir}/output_model/model"
!cp tokenizer.json {onnx_path}
model = og.Model(onnx_path)
tokenizer = og.Tokenizer(model)

In [None]:
tokenizer_stream = tokenizer.create_stream()
 
search_options = {}
search_options['min_length'] = 128
search_options['max_length'] = 256
search_options['do_sample'] = True
search_options['temperature'] = 0.1
search_options['top_p'] = 0.95

timings = True
chat_template = '<|user|>\n{input} <|end|>\n<|assistant|>'
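
The `temperature` and `top_p` options control how the next token is sampled when `do_sample` is enabled. Below is a minimal sketch of the general technique (temperature-scaled softmax followed by nucleus filtering) in plain Python. It illustrates the concept only and is not the ONNX Runtime GenAI implementation; the logits and function names are made up.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for idx in ranked:
        kept.append(idx)
        cumulative += probs[idx]
        if cumulative >= top_p:
            break
    norm = sum(probs[i] for i in kept)
    # Renormalize so the surviving candidates sum to 1.
    return {i: probs[i] / norm for i in kept}

logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate tokens
for temperature in (1.0, 0.1):
    probs = softmax_with_temperature(logits, temperature)
    candidates = top_p_filter(probs, top_p=0.95)
    print(temperature, {i: round(p, 3) for i, p in candidates.items()})
```

In this toy example, a temperature of 0.1 concentrates almost all probability on the top-scoring token, so sampling behaves close to greedy decoding even though `do_sample` is on.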

In [None]:
# Keep asking for input prompts in a loop
while True:
    text = input("Input: ")
    if not text:
        print("Error, input cannot be empty")
        continue

    if timings: started_timestamp = time.time()

    # If there is a chat template, use it
    prompt = chat_template.format(input=text)

    input_tokens = tokenizer.encode(prompt)

    params = og.GeneratorParams(model)
    params.set_search_options(**search_options)
    params.input_ids = input_tokens
    generator = og.Generator(model, params)
    if timings:
        first = True
        new_tokens = []

    print()
    print("Output: ", end='', flush=True)

    try:
        while not generator.is_done():
            generator.compute_logits()
            generator.generate_next_token()
            if timings:
                if first:
                    first_token_timestamp = time.time()
                    first = False

            new_token = generator.get_next_tokens()[0]
            print(tokenizer_stream.decode(new_token), end='', flush=True)
            if timings: new_tokens.append(new_token)
    except KeyboardInterrupt:
        print("  --control+c pressed, aborting generation--")
    print()
    print()

    # Delete the generator to free the captured graph for the next generator, if graph capture is enabled
    del generator

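    # Report timings only if at least one token was generated (e.g., generation was not interrupted before the first token)
    if timings and new_tokens: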
        prompt_time = first_token_timestamp - started_timestamp
        run_time = time.time() - first_token_timestamp
        print(f"Prompt length: {len(input_tokens)}, New tokens: {len(new_tokens)}, Time to first: {(prompt_time):.2f}s, Prompt tokens per second: {len(input_tokens)/prompt_time:.2f} tps, New tokens per second: {len(new_tokens)/run_time:.2f} tps")
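
The timing arithmetic at the end of the loop can also be factored into a small helper that guards against division by zero. A sketch follows; the function name `summarize_timings` and the sample numbers are hypothetical.

```python
def summarize_timings(started, first_token, finished, prompt_len, new_token_count):
    """Compute time-to-first-token and throughput from three timestamps."""
    prompt_time = first_token - started
    run_time = finished - first_token
    return {
        "time_to_first_s": prompt_time,
        "prompt_tps": prompt_len / prompt_time if prompt_time > 0 else float("inf"),
        "new_token_tps": new_token_count / run_time if run_time > 0 else float("inf"),
    }

# Made-up timestamps (in seconds) and token counts.
stats = summarize_timings(started=0.0, first_token=0.5, finished=4.5,
                          prompt_len=20, new_token_count=100)
print(stats)
```

Time to first token mostly reflects prompt processing, while new tokens per second reflects the decoding speed, so reporting them separately makes CPU-only runs easier to compare.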
