example_notebooks/sagemaker-core-llama-3-8B-speculative-decoding.ipynb

{ "cells": [ { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "# Amazon SageMaker Using AWS Large Model Inference (LMI) Deep Learning Container with Speculative Decoding \n", "\n", "**Recommended Kernel(s):** You can run this notebook with any Amazon SageMaker Studio kernel. We recommend using the Data Science 3.0 kernel.\n", "\n", "This notebook demonstrates how to deploy [`meta-llama/Meta-Llama-3-8B`](https://huggingface.co/meta-llama/Meta-Llama-3-8B) HuggingFace model to a SageMaker Endpoint for text generation with speculative decoding enabled. In this example, the SageMaker-managed [LMI (Large Model Inference)](https://docs.djl.ai/docs/serving/serving/docs/large_model_inference.html) Docker image will serve as the inference image. LMI images feature a [DJL serving](https://github.com/deepjavalibrary/djl-serving) stack powered by the [Deep Java Library](https://djl.ai/). \n", "\n", "**What is speculative decoding?** In the context of large language model inference, speculative decoding, as introduced by [Y. Leviathan et al. (ICML 2023)](https://arxiv.org/abs/2211.17192), is a technique used to accelerate the decoding process of large and therefore slow LLMs for latency-critical applications. The key idea is to use a smaller, less powerful but faster model called the *draft model* to generate candidate tokens that get validated by the larger, more powerful but slower model called the *target model*. At each iteration, the draft model generates $K>1$ candidate tokens. Then, using a single forward pass of the larger *target model*, none, part, or all candidate tokens get accepted. The more aligned the selected draft model is with the target model, the better guesses it makes, resulting in a higher candidate token acceptance rate and potential for speed up.\n", "\n", "\n", "## 1. Dependency Installation\n", "### 1.1. Python Dependencies & Imports\n", "This notebook requires the following Python dependencies:\n", "* AWS [`sagemaker_core`]()\n", "\n", "Let's install or upgrade these dependencies using the following command:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "%pip install pip --upgrade --quiet\n", "%pip install sagemaker-core huggingface_hub --quiet" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "import json\n", "import os\n", "\n", "from sagemaker_core.helper.session_helper import get_execution_role, Session\n", "import pathlib \n", "import huggingface_hub" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "PROJECT_ROOT_DIR = pathlib.Path.cwd()\n", "\n", "SM_SESSION = Session()\n", "\n", "REGION = SM_SESSION._region_name\n", "\n", "SM_DEFAULT_EXECUTION_ROLE_ARN = get_execution_role()\n", "\n", "INSTANCE_TYPE = \"ml.g5.12xlarge\" \n", "\n", "#INSTANCE_TYPE = \"ml.p4d.24xlarge\" \n", "\n", "# See https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers for lastest DLC. \n", "container_image_uri = f\"763104351884.dkr.ecr.{REGION}.amazonaws.com/djl-inference:0.29.0-tensorrtllm0.11.0-cu124\"\n", "\n", "default_bucket = SM_SESSION.default_bucket()\n", "\n", "TARGET_MODEL = \"meta-llama/Meta-Llama-3-8B\"\n", "DRAFT_MODEL = \"sagemaker\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Download custom draft model" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## 2. Deploy Speculative Decoding-Enabled Endpoint\n", "### 2.1. Endpoint Deployment" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will configure what models to use and server startup parameters and via environment variables. \n", "\n", "* `OPTION_MODEL_ID` points to the HF Model ID or base S3 prefix of the target model artifacts\n", "* `OPTION_SPECULATIVE_DRAFT_MODEL` points to the HF Model ID, base S3 prefix of the draft model artifacts or in our case `sagemaker` which represents the SageMaker draft model. \n", "\n", "You can read about the other config paramters in LMI [documentation](https://github.com/deepjavalibrary/djl-serving/tree/master/serving/docs/lmi).\n", "\n", "Note you can replace the target and draft model for you own." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "environment = {\n", " \"HF_MODEL_ID\":TARGET_MODEL,\n", " \"OPTION_SPECULATIVE_DRAFT_MODEL\":DRAFT_MODEL,\n", " \"OPTION_GPU_MEMORY_UTILIZATION\":\"0.85\",\n", " \"HF_TOKEN\":\"<YourHFToken>\"\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "from sagemaker_core.shapes import ContainerDefinition, ProductionVariant\n", "from sagemaker_core.resources import Model, EndpointConfig, Endpoint\n", "from time import gmtime, strftime" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "container_defintion = ContainerDefinition(\n", " image=container_image_uri,\n", " environment=environment\n", " \n", " )\n", "container_defintion.environment = environment" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "model_name =f'speculative-decoding-hugging-face-{strftime(\"%H-%M-%S\", gmtime())}'\n", "\n", "model = Model.create( \n", " model_name=model_name,\n", " primary_container=container_defintion,\n", " execution_role_arn=SM_DEFAULT_EXECUTION_ROLE_ARN,\n", "\n", ")" ] }, { "cell_type": "markdown", "metadata": { "tags": [], "vscode": { "languageId": "plaintext" } }, "source": [ "Start-up of LLM inference containers can last longer than smaller models, mainly due to longer model downloading and loading times. Timeout values need to be increased accordingly from their default values. Each endpoint deployment takes a few minutes. We also set `routing_strategy` which would benefit us if we were to back our endpoint with multiple instances." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "from sagemaker_core.shapes import ProductionVariantRoutingConfig\n", "\n", "routing_config = ProductionVariantRoutingConfig(\n", " routing_strategy=\"LEAST_OUTSTANDING_REQUESTS\"\n", ")\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "endpoint_config = EndpointConfig.create(\n", " endpoint_config_name=model_name,\n", " production_variants=[\n", " ProductionVariant(\n", " variant_name=model_name,\n", " initial_instance_count=1,\n", " instance_type=INSTANCE_TYPE,\n", " model_name=model,\n", " container_startup_health_check_timeout_in_seconds=3600,\n", " model_data_download_timeout_in_seconds=3600 ,\n", " routing_config=routing_config\n", " )\n", " ]\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "endpoint = Endpoint.create(\n", " endpoint_name=model_name,\n", " endpoint_config_name=endpoint_config # Pass `EndpointConfig` object created above\n", ")" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "This cells will block until the endpoint is deployed, which is necessary for the following steps." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "endpoint.wait_for_status(\"InService\")" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "### Endpoint invocation" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "Let's invoke our endpoint and get a sample response." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "response = endpoint.invoke(\n", " body =json.dumps({\n", " \"inputs\": [\"What is the capital of France?\"],\n", " \"max_new_tokens\": 512,\n", " \"temperature\": 0.0,\n", " }),\n", " content_type =\"application/json\"\n", ")\n", "response['Body'].read()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. Clean Up Endpoint" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Delete any sagemaker core resource objects created in this notebook\n", "def delete_all_sagemaker_resources():\n", " all_objects = list(locals().values()) + list(globals().values())\n", " deletable_objects = [obj for obj in all_objects if hasattr(obj, 'delete') and obj.__class__.__module__ == 'sagemaker_core.main.resources']\n", " \n", " for obj in deletable_objects:\n", " obj.delete()\n", " \n", "delete_all_sagemaker_resources()" ] } ], "metadata": { "availableInstances": [ { "_defaultOrder": 0, "_isFastLaunch": true, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 4, "name": "ml.t3.medium", "vcpuNum": 2 }, { "_defaultOrder": 1, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.t3.large", "vcpuNum": 2 }, { "_defaultOrder": 2, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.t3.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 3, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.t3.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 4, "_isFastLaunch": true, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.m5.large", "vcpuNum": 2 }, { "_defaultOrder": 5, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.m5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 6, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.m5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 7, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.m5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 8, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.m5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 9, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.m5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 10, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.m5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 11, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.m5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 12, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.m5d.large", "vcpuNum": 2 }, { "_defaultOrder": 13, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.m5d.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 14, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.m5d.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 15, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.m5d.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 16, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.m5d.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 17, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.m5d.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 18, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.m5d.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 19, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.m5d.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 20, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": true, "memoryGiB": 0, "name": "ml.geospatial.interactive", "supportedImageNames": [ "sagemaker-geospatial-v1-0" ], "vcpuNum": 0 }, { "_defaultOrder": 21, "_isFastLaunch": true, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 4, "name": "ml.c5.large", "vcpuNum": 2 }, { "_defaultOrder": 22, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.c5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 23, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.c5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 24, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.c5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 25, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 72, "name": "ml.c5.9xlarge", "vcpuNum": 36 }, { "_defaultOrder": 26, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 96, "name": "ml.c5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 27, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 144, "name": "ml.c5.18xlarge", "vcpuNum": 72 }, { "_defaultOrder": 28, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.c5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 29, "_isFastLaunch": true, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.g4dn.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 30, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.g4dn.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 31, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.g4dn.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 32, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.g4dn.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 33, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.g4dn.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 34, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.g4dn.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 35, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 61, "name": "ml.p3.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 36, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 244, "name": "ml.p3.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 37, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 488, "name": "ml.p3.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 38, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 768, "name": "ml.p3dn.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 39, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.r5.large", "vcpuNum": 2 }, { "_defaultOrder": 40, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.r5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 41, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.r5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 42, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.r5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 43, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.r5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 44, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.r5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 45, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 512, "name": "ml.r5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 46, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 768, "name": "ml.r5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 47, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.g5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 48, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.g5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 49, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.g5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 50, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.g5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 51, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.g5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 52, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.g5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 53, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.g5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 54, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 768, "name": "ml.g5.48xlarge", "vcpuNum": 192 }, { "_defaultOrder": 55, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 1152, "name": "ml.p4d.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 56, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 1152, "name": "ml.p4de.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 57, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.trn1.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 58, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 512, "name": "ml.trn1.32xlarge", "vcpuNum": 128 }, { "_defaultOrder": 59, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 512, "name": "ml.trn1n.32xlarge", "vcpuNum": 128 } ], "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.14" } }, "nbformat": 4, "nbformat_minor": 4 }

example_notebooks/sagemaker-core-llama-3-8B-speculative-decoding.ipynb (953 lines of code) (raw):