{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Hyperparameter Tuning with Your Own Container in Amazon SageMaker\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n",
"\n",
"\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"_**Using Amazon SageMaker's Hyperparameter Tuning with a customer Docker container and R algorithm**_\n",
"\n",
"---\n",
"\n",
"---\n",
"\n",
"## Contents\n",
"\n",
"1. [Background](#Background)\n",
"1. [Setup](#Setup)\n",
" 1. [Permissions](#Permissions)\n",
"1. [Code](#Code)\n",
" 1. [Publish](#Publish)\n",
"1. [Data](#Data)\n",
"1. [Tune](#Tune)\n",
"1. [Wrap-up](#Wrap-up)\n",
"\n",
"---\n",
"## Background\n",
"\n",
"R is a popular open source statistical programming language, with a lengthy history in Data Science and Machine Learning. The breadth of algorithms available as R packages is impressive and fuels a diverse community of users. In this example, we'll combine one of those algorithms ([Multivariate Adaptive Regression Splines](https://en.wikipedia.org/wiki/Multivariate_adaptive_regression_splines)) with SageMaker's hyperparameter tuning capabilities to build a simple model on the well-known [Iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set). This notebook will focus mainly on the integration of hyperparameter tuning and a custom algorithm container, rather than the process of building your own container. For more details on that process, please see this [notebook](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/r_bring_your_own/r_bring_your_own.ipynb).\n",
"\n",
"---\n",
"## Setup\n",
"\n",
"_This notebook was created and tested on an ml.m4.xlarge notebook instance._\n",
"\n",
"Let's start by specifying:\n",
"\n",
"- The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the notebook instance, training, and hosting.\n",
"- The IAM role arn used to give training and hosting access to your data. See the [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/using-identity-based-policies.html) for more details on creating these. Note, if a role not associated with the current notebook instance, or more than one role is required for training and/or hosting, please replace `sagemaker.get_execution_role()` with a the appropriate full IAM role arn string(s)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"isConfigCell": true
},
"outputs": [],
"source": [
"import sagemaker\n",
"\n",
"bucket = sagemaker.Session().default_bucket()\n",
"prefix = \"sagemaker/DEMO-hpo-r-byo\"\n",
"\n",
"role = sagemaker.get_execution_role()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we'll import the libraries we'll need for the remainder of the notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import boto3\n",
"import sagemaker\n",
"from sagemaker.tuner import (\n",
" IntegerParameter,\n",
" CategoricalParameter,\n",
" ContinuousParameter,\n",
" HyperparameterTuner,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Permissions\n",
"\n",
"Running this notebook requires permissions in addition to the normal `SageMakerFullAccess` permissions. This is because we'll be creating a new repository in Amazon ECR. The easiest way to add these permissions is simply to add the managed policy `AmazonEC2ContainerRegistryFullAccess` to the role associated with your notebook instance. There's no need to restart your notebook instance when you do this, the new permissions will be available immediately.\n",
"\n",
"---\n",
"## Code\n",
"\n",
"For this example, we'll need 3 supporting code files. We'll provide just a brief overview of what each one does. See the full R bring your own notebook for more details.\n",
"\n",
"- **Fit**: `mars.R` creates functions to train and serve our model.\n",
"- **Serve**: `plumber.R` uses the [plumber](https://www.rplumber.io/) package to create a lightweight HTTP server for processing requests in hosting. Note the specific syntax, and see the plumber help docs for additional detail on more specialized use cases.\n",
"- **Dockerfile**: This specifies the configuration for our docker container. Smaller containers are preferred for Amazon SageMaker as they lead to faster spin up times in training and endpoint creation, so this container is kept minimal. It simply starts with Ubuntu, installs R, mda, and plumber libraries, then adds `mars.R` and `plumber.R`, and finally sets `mars.R` to run as the entrypoint when launched."
]
},
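{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make that description concrete, here is a minimal sketch of what such a Dockerfile might look like. The base image, package-installation commands, and file paths are illustrative assumptions based on the description above, not the exact contents of the repository's Dockerfile.\n",
"\n",
"```dockerfile\n",
"FROM ubuntu:18.04\n",
"\n",
"# Install R, then the R packages the algorithm needs:\n",
"# mda (for MARS) and plumber (for serving)\n",
"RUN apt-get update && apt-get install -y --no-install-recommends r-base r-base-dev ca-certificates && \\\n",
"    R -e \"install.packages(c('mda', 'plumber'), repos = 'https://cloud.r-project.org')\"\n",
"\n",
"# Add the training and serving scripts\n",
"COPY mars.R /opt/ml/mars.R\n",
"COPY plumber.R /opt/ml/plumber.R\n",
"\n",
"# mars.R runs on container launch; SageMaker appends 'train' or 'serve' as an argument\n",
"ENTRYPOINT [\"/usr/bin/Rscript\", \"/opt/ml/mars.R\", \"--no-save\"]\n",
"```"
]
},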
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Publish\n",
"Now, to publish this container to ECR, we'll run the comands below.\n",
"\n",
"This command will take several minutes to run the first time."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%sh\n",
"\n",
"# The name of our algorithm\n",
"algorithm_name=rmars\n",
"\n",
"#set -e # stop if anything fails\n",
"account=$(aws sts get-caller-identity --query Account --output text)\n",
"\n",
"# Get the region defined in the current configuration (default to us-west-2 if none defined)\n",
"region=$(aws configure get region)\n",
"region=${region:-us-west-2}\n",
"\n",
"if [ \"$region\" = \"cn-north-1\" ] || [ \"$region\" = \"cn-northwest-1\" ]; then domain=\"amazonaws.com.cn\"; \n",
"else domain=\"amazonaws.com\"; fi\n",
"\n",
"fullname=\"${account}.dkr.ecr.${region}.${domain}/${algorithm_name}:latest\"\n",
"\n",
"# If the repository doesn't exist in ECR, create it.\n",
"aws ecr describe-repositories --repository-names \"${algorithm_name}\" > /dev/null 2>&1\n",
"\n",
"if [ $? -ne 0 ]\n",
"then\n",
" aws ecr create-repository --repository-name \"${algorithm_name}\" > /dev/null\n",
"fi\n",
"\n",
"# Get the login command from ECR and execute it directly\n",
"$(aws ecr get-login --region ${region} --no-include-email)\n",
"\n",
"# Build the docker image locally with the image name and then push it to ECR\n",
"# with the full name.\n",
"docker build -t ${algorithm_name} .\n",
"docker tag ${algorithm_name} ${fullname}\n",
"\n",
"docker push ${fullname}"
]
},
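{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check, we can confirm the image actually landed in ECR before moving on. This assumes the push above succeeded and that the repository is named `rmars` as set in the script."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%sh\n",
"# List the image tags in the rmars repository; expect to see 'latest'\n",
"aws ecr describe-images --repository-name rmars --query 'imageDetails[].imageTags' --output text"
]
},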
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"## Data\n",
"For this illustrative example, we'll simply use `iris`. This a classic, but small, dataset used to test supervised learning algorithms. Typically the goal is to predict one of three flower species based on various measurements of the flowers' attributes. Further detail can be found [here](https://en.wikipedia.org/wiki/Iris_flower_data_set).\n",
"\n",
"Let's copy the data to S3 so that SageMaker training can access it."
]
},
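{
"cell_type": "markdown",
"metadata": {},
"source": [
"The upload step below assumes an `iris.csv` file is present in the notebook's working directory. If you don't already have one, here is a small sketch that writes it, assuming scikit-learn is available. The R-style column names (`Sepal.Length`, etc.) are an assumption about what `mars.R` expects, chosen to match the `Sepal.Length` target used later in this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn.datasets import load_iris\n",
"\n",
"# Write iris.csv with R-style column names (assumption: mars.R expects these)\n",
"iris_data = load_iris()\n",
"df = pd.DataFrame(\n",
"    iris_data.data,\n",
"    columns=[\"Sepal.Length\", \"Sepal.Width\", \"Petal.Length\", \"Petal.Width\"],\n",
")\n",
"df[\"Species\"] = pd.Categorical.from_codes(iris_data.target, iris_data.target_names)\n",
"df.to_csv(\"iris.csv\", index=False)"
]
},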
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_file = \"iris.csv\"\n",
"boto3.Session().resource(\"s3\").Bucket(bucket).Object(\n",
" os.path.join(prefix, \"train\", train_file)\n",
").upload_file(train_file)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"_Note: Although we could do preliminary data transformations in the notebook, we'll avoid doing so, instead choosing to do those transformations inside the container. This is not typically the best practice for model efficiency, but provides some benefits in terms of flexibility._"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"## Tune\n",
"\n",
"Now, let's setup the information needed to train a Multivariate Adaptive Regression Splines model on `iris` data. In this case, we'll predict `Sepal.Length` rather than the more typical classification of `Species` in order to show how factors might be included in a model and to limit the use case to regression.\n",
"\n",
"First, we'll get our region and account information so that we can point to the ECR container we just created."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"region = boto3.Session().region_name\n",
"account = boto3.client(\"sts\").get_caller_identity().get(\"Account\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we'll create an estimator using the [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk). This allows us to specify:\n",
"- The training container image in ECR\n",
"- The IAM role that controls permissions for accessing the S3 data and executing SageMaker functions\n",
"- Number and type of training instances\n",
"- S3 path for model artifacts to be output to\n",
"- Any hyperparameters that we want to have the same value across all training jobs during tuning"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"domain = (\n",
" \"amazonaws.com.cn\"\n",
" if (region == \"cn-north-1\" or region == \"cn-northwest-1\")\n",
" else \"amazonaws.com\"\n",
")\n",
"\n",
"estimator = sagemaker.estimator.Estimator(\n",
" image_uri=\"{}.dkr.ecr.{}.{}/rmars:latest\".format(account, region, domain),\n",
" role=role,\n",
" train_instance_count=1,\n",
" train_instance_type=\"ml.m4.xlarge\",\n",
" output_path=\"s3://{}/{}/output\".format(bucket, prefix),\n",
" sagemaker_session=sagemaker.Session(),\n",
" hyperparameters={\"target\": \"Sepal.Length\"},\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once we've defined our estimator we can specify the hyperparameters that we'd like to tune and their possible values. We have three different types of hyperparameters.\n",
"- Categorical parameters need to take one value from a discrete set. We define this by passing the list of possible values to `CategoricalParameter(list)`\n",
"- Continuous parameters can take any real number value between the minimum and maximum value, defined by `ContinuousParameter(min, max)`\n",
"- Integer parameters can take any integer value between the minimum and maximum value, defined by `IntegerParameter(min, max)`\n",
"\n",
"*Note, if possible, it's almost always best to specify a value as the least restrictive type. For example, tuning `thresh` as a continuous value between 0.01 and 0.2 is likely to yield a better result than tuning as a categorical parameter with possible values of 0.01, 0.1, 0.15, or 0.2.*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"hyperparameter_ranges = {\n",
" \"degree\": IntegerParameter(1, 3),\n",
" \"thresh\": ContinuousParameter(0.001, 0.01),\n",
" \"prune\": CategoricalParameter([\"TRUE\", \"FALSE\"]),\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next we'll specify the objective metric that we'd like to tune and its definition. This metric is output by a `print` statement in our `mars.R` file. Its critical that the format aligns with the regular expression (Regex) we then specify to extract that metric from the CloudWatch logs of our training job."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"objective_metric_name = \"mse\"\n",
"metric_definitions = [{\"Name\": \"mse\", \"Regex\": \"mse: ([0-9\\\\.]+)\"}]"
]
},
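{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see how this extraction works, here is a quick sketch that applies the same regular expression to a hypothetical log line of the kind `mars.R` would print. The sample line itself is made up for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"\n",
"# A hypothetical line from the training job's CloudWatch logs\n",
"sample_log_line = \"mse: 0.1234\"\n",
"\n",
"# SageMaker applies the Regex from metric_definitions; group 1 is the metric value\n",
"match = re.search(metric_definitions[0][\"Regex\"], sample_log_line)\n",
"print(float(match.group(1)))"
]
},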
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we'll create a `HyperparameterTuner` object, which we pass:\n",
"- The MXNet estimator we created above\n",
"- Our hyperparameter ranges\n",
"- Objective metric name and definition\n",
"- Whether we should maximize or minimize our objective metric (defaults to 'Maximize')\n",
"- Number of training jobs to run in total and how many training jobs should be run simultaneously. More parallel jobs will finish tuning sooner, but may sacrifice accuracy. We recommend you set the parallel jobs value to less than 10% of the total number of training jobs (we'll set it higher just for this example to keep it short)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tuner = HyperparameterTuner(\n",
" estimator,\n",
" objective_metric_name,\n",
" hyperparameter_ranges,\n",
" metric_definitions,\n",
" objective_type=\"Minimize\",\n",
" max_jobs=9,\n",
" max_parallel_jobs=3,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And finally, we can start our hyperparameter tuning job by calling `.fit()` and passing in the S3 paths to our train and test datasets.\n",
"\n",
"*Note, typically for hyperparameter tuning, we'd want to specify both a training and validation (or test) dataset and optimize the objective metric from the validation dataset. However, because `iris` is a very small dataset we'll skip the step of splitting into training and validation. In practice, doing this could lead to a model that overfits to our training data and does not generalize well.*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tuner.fit({\"train\": \"s3://{}/{}/train\".format(bucket, prefix)})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's just run a quick check of the hyperparameter tuning jobs status to make sure it started successfully and is `InProgress`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"boto3.client(\"sagemaker\").describe_hyper_parameter_tuning_job(\n",
" HyperParameterTuningJobName=tuner.latest_tuning_job.job_name\n",
")[\"HyperParameterTuningJobStatus\"]"
]
},
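{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you'd rather block this notebook until tuning completes instead of polling, the SDK's `tuner.wait()` does exactly that. It's left commented out here since a 9-job run can take a while."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: block until the tuning job finishes\n",
"# tuner.wait()"
]
},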
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## Wrap-up\n",
"\n",
"Now that we've started our hyperparameter tuning job, it will run in the background and we can close this notebook. Once finished, we can use the [HPO Analysis notebook](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/hyperparameter_tuning/analyze_results/HPO_Analyze_TuningJob_Results.ipynb) to determine which set of hyperparameters worked best.\n",
"\n",
"For more detail on Amazon SageMaker's Hyperparameter Tuning, please refer to the AWS documentation. "
]
},
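{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a lighter-weight alternative to the full analysis notebook, once the tuning job has finished you can pull a per-training-job summary directly through the SDK. A quick sketch (this will only return complete results after the job is done):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Summarize all training jobs in this tuning run, best (lowest mse) first\n",
"tuner.analytics().dataframe().sort_values(\"FinalObjectiveValue\").head()"
]
},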
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Notebook CI Test Results\n",
"\n",
"This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "conda_python3",
"language": "python",
"name": "conda_python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
},
"notice": "Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License."
},
"nbformat": 4,
"nbformat_minor": 2
}