notebooks/text-generation/CodeLlama-7B-Compilation.ipynb

{ "cells": [ { "cell_type": "markdown", "id": "59913016-f89e-4a0e-9afe-b3a06e9112d5", "metadata": {}, "source": [ "# Run `codellama/CodeLlama-7b-hf` from Hugging Face on Inf2 & Trn1" ] }, { "cell_type": "markdown", "id": "f8454655-ec27-45e3-8da7-f82b744321ee", "metadata": {}, "source": [ "In this example we compile and deploy the [codellama/CodeLlama-7b-hf](https://huggingface.co/codellama/CodeLlama-7b-hf) model from Hugging Face on Neuron core devices using the `transformers-neuronx` package.\n", "\n", "The example has the following main sections:\n", "1. Deploy your instance\n", "1. Set up the Jupyter Notebook\n", "1. Inference from a pre-compiled model\n", "1. Compiling a model\n", "1. Save to disk\n", "\n", "\n", "This Jupyter Notebook can be run on a Inf2 instance (`inf2.xlarge`) or larger or also a Trn1 instance (`trn1.32xlarge`)." ] }, { "cell_type": "markdown", "id": "a649c112", "metadata": {}, "source": [ "## Deploy your instance\n", "\n", "1. Raise your limits\n", " The default limit for Inferentia and Trainium instances are 0, so request an increase before you start! EC2 instances are limited by vCPUs, but SageMaker limits are based on number of instances and instance size. Start with a vCPU limit request of 32 so you can deploy an inf2.xlarge (4 vCPUs) or an inf2.8xlarge (32 vCPUs). Go to [Service Quotas](https://us-west-2.console.aws.amazon.com/servicequotas/home/services/ec2/quotas) and search for `inf`.\n", "\n", " Request increases for both On-Demand and Spot (yes, there are usually Spot instances available!). IAD (us-east-1) and PDX (us-west-2) are the best regions to start in for the US. Limits are region specific!\n", "1. Go to EC2 to launch an instance. Make sure you choose the region to match your limit increases.\n", "1. Launch an inf2.xlarge instance with the [Hugging Face DLAMI(Deep learning AMI)](https://aws.amazon.com/marketplace/pp/prodview-gr3e6yiscria2). This image comes with all the software preinstalled. Alternatively, you can follow the instructions at the [Neuron Setup Guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/torch-neuronx.html#setup-torch-neuronx)\n", "1. 100GB disk size is fine for this example, but larger models such as 70B make take over 500GB" ] }, { "cell_type": "markdown", "id": "af2b7693-2950-41fc-a038-17cba44bf003", "metadata": {}, "source": [ "## Set up the Jupyter Notebook" ] }, { "cell_type": "markdown", "id": "c47ef383-0dea-4423-8c38-29c73927fd78", "metadata": {}, "source": [ "(You could also just copy and paste most of this code into an SSH prompt)\n", "\n", "The following steps set up Jupyter Notebook and launch this tutorial:\n", "1. Clone the [Hugging Face Optimum Neuron](https://github.com/huggingface/optimum-neuron) repo to your instance using\n", "```\n", "git clone https://github.com/huggingface/optimum-neuron.git\n", "```\n", "2. Navigate to the `text-generation` notebooks folder\n", "```\n", "cd optimum-neuron/notebooks/text-generation\n", "```\n", "3. Follow the instructions in [Jupyter Notebook QuickStart](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/notebook/setup-jupyter-notebook-steps-troubleshooting.html) to run Jupyter Notebook on your instance.\n", "4. Locate this tutorial in your Jupyter Notebook session (`CodeLlama-7B-Compilation.ipynb`) and launch it. Follow the rest of the instructions in this tutorial. 
" ] }, { "cell_type": "markdown", "id": "3f11f6f3", "metadata": {}, "source": [ "## Inference from a pre-compiled model\n", "\n", "The model needs to be exported(compiled) to a Neuron specific format. Some configuration changes (such as batch size and number of Neuron cores used) will require a re-compile. We have pre-compiled a model for you to use as a test using the Hugging Face Optimum Neuron library. \n", "\n", "In this case, the model is pre-compiled for 2 Neuron cores, so it will run on an inf2.xlarge. You can read more about the model at [aws-neuron/CodeLlama-7b-hf-neuron-8xlarge](https://huggingface.co/aws-neuron/CodeLlama-7b-hf-neuron-8xlarge)\n", "\n", "Instructions on how to compile a model are in the next section. \n", "\n", "**Keep in mind that this is a model for code generation, so the example prompt is code, and the response will probably be code**\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "8446884c", "metadata": {}, "outputs": [], "source": [ "from optimum.neuron import pipeline\n", "\n", "\n", "p = pipeline('text-generation', 'aws-neuron/CodeLlama-7b-hf-neuron-8xlarge')\n", "p(\"import socket\\n\\ndef ping_exponential_backoff(host: str):\",\n", " do_sample=True,\n", " top_k=10,\n", " temperature=0.1,\n", " top_p=0.95,\n", " num_return_sequences=1,\n", " max_length=200,\n", ")" ] }, { "cell_type": "markdown", "id": "7ad1fc6a", "metadata": {}, "source": [ "If you are going to compile in the same session, you need to release the Neuron cores by deleting the pipeline object. Otherwise, you may get load errors." ] }, { "cell_type": "code", "execution_count": null, "id": "5da55ed5", "metadata": {}, "outputs": [], "source": [ "del p" ] }, { "cell_type": "markdown", "id": "d1c85162", "metadata": {}, "source": [ "Example output:\n", "\n", "[{'generated_text': 'import socket\\n\\ndef ping_exponential_backoff(host: str):\\n \"\"\"\\n Ping a host with exponential backoff.\\n\\n :param host: Host to ping\\n :return: True if host is reachable, False otherwise\\n \"\"\"\\n for i in range(1, 10):\\n try:\\n socket.create_connection((host, 80), 1).close()\\n return True\\n except OSError:\\n time.sleep(2 ** i)\\n return False\\n\\n\\ndef ping_exponential_backoff_with_timeout(host: str, timeout: int):\\n \"\"\"\\n Ping a host with exponential backoff and timeout.\\n\\n :param host: Host to ping\\n :param timeout: Timeout in seconds\\n :return: True if host is reachable, False otherwise\\n \"\"\"\\n for'}]\n", "\n" ] }, { "cell_type": "markdown", "id": "14400e26-2058-44b0-b680-b1cee57203aa", "metadata": {}, "source": [ "## Troubleshooting\n", "If you see the error\n", "\n", "```\n", "FileNotFoundError: Could not find a matching NEFF for your HLO in this directory. Ensure that the model you are trying to load is the same type and has the same parameters as the one you saved or call \"save\" on this model to reserialize it.\n", "```\n", "then you may be using a different version of the Neuron SDK than we used in this example. \n", "\n", "You can fix that by compiling your own following the instructions below!" ] }, { "cell_type": "markdown", "id": "48f5e207", "metadata": {}, "source": [ "## Compiling a model" ] }, { "cell_type": "markdown", "id": "5e233a69-5658-4180-8f6c-91f377a01001", "metadata": {}, "source": [ "Before you run the code below consider what machine you want to run the final result on. 
{ "cell_type": "code", "execution_count": null, "id": "e1ba7f35", "metadata": {}, "outputs": [], "source": [ "from optimum.neuron import NeuronModelForCausalLM\n", "\n", "\n", "# num_cores should match the target instance: an inf2.24xlarge has 6 Neuron devices\n", "# (2 cores each), so 12 cores total\n", "compiler_args = {\"num_cores\": 2, \"auto_cast_type\": 'fp16'}\n", "input_shapes = {\"batch_size\": 1, \"sequence_length\": 2048}\n", "model = NeuronModelForCausalLM.from_pretrained(\"codellama/CodeLlama-7b-hf\", export=True, **compiler_args, **input_shapes)" ] },
{ "cell_type": "markdown", "id": "45263053", "metadata": {}, "source": [ "## Save to disk\n", "\n", "If you want to save your compiled model out to a local directory:" ] },
{ "cell_type": "code", "execution_count": null, "id": "39fa23a1", "metadata": {}, "outputs": [], "source": [ "model.save_pretrained(\"CodeLlama-7b-hf-neuron-8xlarge\")\n" ] },
{ "cell_type": "markdown", "id": "cd7e1775", "metadata": {}, "source": [ "Once you have the model saved locally, you can give the local directory name in the pipeline code above instead of the path on Hugging Face (a quick check is sketched below). You should delete the model object before loading the pipeline, so the Neuron cores are released." ] },
{ "cell_type": "code", "execution_count": null, "id": "9af39f07", "metadata": {}, "outputs": [], "source": [ "del model" ] },
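{ "cell_type": "markdown", "id": "local-reload-md", "metadata": {}, "source": [ "To sanity-check the artifacts you just saved, you can point the same pipeline API at the local directory instead of a Hugging Face Hub id. Run this only after the `del model` above, so the Neuron cores are free; the prompt here is just an illustrative example." ] },
{ "cell_type": "code", "execution_count": null, "id": "local-reload-code", "metadata": {}, "outputs": [], "source": [ "# Sketch: reload the compiled model from the local directory saved above.\n", "from optimum.neuron import pipeline\n", "\n", "p = pipeline('text-generation', 'CodeLlama-7b-hf-neuron-8xlarge')  # local directory\n", "print(p(\"def fibonacci(n: int):\", max_length=64))\n", "del p  # release the Neuron cores again before doing anything else" ] },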
{ "cell_type": "markdown", "id": "643a84cf", "metadata": {}, "source": [ "If you want to save your model back up to Hugging Face:\n", "1. Create a WRITE token to use for command-line access at https://huggingface.co/settings/tokens\n", "1. While you are there, create a \"New Model\" repository for your model to reside in. Our example uses `jburtoft/CodeLlama-7b-hf-neuron-8xlarge`, matching the upload code below.\n" ] },
{ "cell_type": "code", "execution_count": null, "id": "ec81c819", "metadata": {}, "outputs": [], "source": [ "from huggingface_hub.hf_api import HfFolder\n", "\n", "\n", "# Save your WRITE token locally so later Hub calls can pick it up.\n", "HfFolder.save_token('MY_HUGGINGFACE_TOKEN_HERE')" ] },
{ "cell_type": "code", "execution_count": null, "id": "bdbc2537", "metadata": {}, "outputs": [], "source": [ "from huggingface_hub import HfApi, login\n", "\n", "\n", "api = HfApi()\n", "# login() will prompt for a token if one is not already saved.\n", "login()\n", "\n", "api.upload_folder(\n", "    folder_path=\"CodeLlama-7b-hf-neuron-8xlarge\",\n", "    repo_id=\"jburtoft/CodeLlama-7b-hf-neuron-8xlarge\",\n", "    repo_type=\"model\",\n", "    multi_commits=True,\n", "    multi_commits_verbose=True,\n", ")" ] } ], "metadata": { "kernelspec": { "display_name": "Python (torch-neuronx)", "language": "python", "name": "aws_neuron_venv_pytorch" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 5 }