{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"papermill": {
"duration": 0.008677,
"end_time": "2021-06-08T00:19:14.607940",
"exception": false,
"start_time": "2021-06-08T00:19:14.599263",
"status": "completed"
},
"tags": []
},
"source": [
"# Random search and hyperparameter scaling with SageMaker XGBoost and Automatic Model Tuning\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n",
"\n",
"\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {
"papermill": {
"duration": 0.008677,
"end_time": "2021-06-08T00:19:14.607940",
"exception": false,
"start_time": "2021-06-08T00:19:14.599263",
"status": "completed"
},
"tags": []
},
"source": [
"\n",
"---\n",
"\n",
"## Contents\n",
"\n",
"1. [Introduction](#Introduction)\n",
"1. [Preparation](#Preparation)\n",
"1. [Download and prepare the data](#Download-and-prepare-the-data)\n",
"1. [Setup hyperparameter tuning](#Setup-hyperparameter-tuning)\n",
"1. [Logarithmic scaling](#Logarithmic-scaling)\n",
"1. [Random search](#Random-search)\n",
"1. [Linear scaling](#Linear-scaling)\n",
"\n",
"\n",
"---\n",
"\n",
"## Introduction\n",
"\n",
"This notebook showcases the use of two hyperparameter tuning features: **random search** and **hyperparameter scaling**.\n",
"\n",
"\n",
"We will use SageMaker Python SDK, a high level SDK, to simplify the way we interact with SageMaker Hyperparameter Tuning.\n",
"\n",
"---\n",
"\n",
"## Preparation\n",
"\n",
"Let's start by specifying:\n",
"\n",
"- The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as SageMaker training.\n",
"- The IAM role used to give training access to your data. See SageMaker documentation for how to create these."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {
"iopub.execute_input": "2021-06-08T00:19:14.630095Z",
"iopub.status.busy": "2021-06-08T00:19:14.629379Z",
"iopub.status.idle": "2021-06-08T00:19:15.771648Z",
"shell.execute_reply": "2021-06-08T00:19:15.772039Z"
},
"isConfigCell": true,
"papermill": {
"duration": 1.155625,
"end_time": "2021-06-08T00:19:15.772182",
"exception": false,
"start_time": "2021-06-08T00:19:14.616557",
"status": "completed"
},
"tags": []
},
"outputs": [],
"source": [
"import sagemaker\n",
"import boto3\n",
"from sagemaker.tuner import (\n",
" IntegerParameter,\n",
" CategoricalParameter,\n",
" ContinuousParameter,\n",
" HyperparameterTuner,\n",
")\n",
"\n",
"import numpy as np # For matrix operations and numerical processing\n",
"import pandas as pd # For munging tabular data\n",
"import os\n",
"from time import gmtime, strftime\n",
"\n",
"region = boto3.Session().region_name\n",
"smclient = boto3.Session().client(\"sagemaker\")\n",
"\n",
"role = sagemaker.get_execution_role()\n",
"\n",
"bucket = sagemaker.Session().default_bucket()\n",
"prefix = \"sagemaker/DEMO-hpo-xgboost-dm\""
]
},
{
"cell_type": "markdown",
"metadata": {
"papermill": {
"duration": 0.008818,
"end_time": "2021-06-08T00:19:15.789912",
"exception": false,
"start_time": "2021-06-08T00:19:15.781094",
"status": "completed"
},
"tags": []
},
"source": [
"---\n",
"\n",
"## Download and prepare the data\n",
"Here we download the [direct marketing dataset](https://archive.ics.uci.edu/ml/datasets/bank+marketing) from UCI's ML Repository."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {
"iopub.execute_input": "2021-06-08T00:19:15.811976Z",
"iopub.status.busy": "2021-06-08T00:19:15.811347Z",
"iopub.status.idle": "2021-06-08T00:19:16.485053Z",
"shell.execute_reply": "2021-06-08T00:19:16.485448Z"
},
"papermill": {
"duration": 0.68692,
"end_time": "2021-06-08T00:19:16.485591",
"exception": false,
"start_time": "2021-06-08T00:19:15.798671",
"status": "completed"
},
"tags": []
},
"outputs": [],
"source": [
"!wget -N https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip\n",
"!unzip -o bank-additional.zip"
]
},
{
"cell_type": "markdown",
"metadata": {
"papermill": {
"duration": 0.01023,
"end_time": "2021-06-08T00:19:16.506406",
"exception": false,
"start_time": "2021-06-08T00:19:16.496176",
"status": "completed"
},
"tags": []
},
"source": [
"Now let us load the data, apply some preprocessing, and upload the processed data to s3"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {
"iopub.execute_input": "2021-06-08T00:19:16.538372Z",
"iopub.status.busy": "2021-06-08T00:19:16.537591Z",
"iopub.status.idle": "2021-06-08T00:19:17.558844Z",
"shell.execute_reply": "2021-06-08T00:19:17.558236Z"
},
"papermill": {
"duration": 1.04247,
"end_time": "2021-06-08T00:19:17.559042",
"exception": true,
"start_time": "2021-06-08T00:19:16.516572",
"status": "failed"
},
"scrolled": true,
"tags": []
},
"outputs": [],
"source": [
"# Load data\n",
"data = pd.read_csv(\"./bank-additional/bank-additional-full.csv\", sep=\";\")\n",
"pd.set_option(\"display.max_columns\", 500) # Make sure we can see all of the columns\n",
"pd.set_option(\"display.max_rows\", 50) # Keep the output on one page\n",
"\n",
"# Apply some feature processing\n",
"data[\"no_previous_contact\"] = np.where(\n",
" data[\"pdays\"] == 999, 1, 0\n",
") # Indicator variable to capture when pdays takes a value of 999\n",
"data[\"not_working\"] = np.where(\n",
" np.in1d(data[\"job\"], [\"student\", \"retired\", \"unemployed\"]), 1, 0\n",
") # Indicator for individuals not actively employed\n",
"model_data = pd.get_dummies(data) # Convert categorical variables to sets of indicators\n",
"\n",
"# columns that should not be included in the input\n",
"model_data = model_data.drop(\n",
" [\"duration\", \"emp.var.rate\", \"cons.price.idx\", \"cons.conf.idx\", \"euribor3m\", \"nr.employed\"],\n",
" axis=1,\n",
")\n",
"\n",
"# split data\n",
"train_data, validation_data, test_data = np.split(\n",
" model_data.sample(frac=1, random_state=1729),\n",
" [int(0.7 * len(model_data)), int(0.9 * len(model_data))],\n",
")\n",
"\n",
"# save preprocessed file to s3\n",
"pd.concat([train_data[\"y_yes\"], train_data.drop([\"y_no\", \"y_yes\"], axis=1)], axis=1).to_csv(\n",
" \"train.csv\", index=False, header=False\n",
")\n",
"pd.concat(\n",
" [validation_data[\"y_yes\"], validation_data.drop([\"y_no\", \"y_yes\"], axis=1)], axis=1\n",
").to_csv(\"validation.csv\", index=False, header=False)\n",
"pd.concat([test_data[\"y_yes\"], test_data.drop([\"y_no\", \"y_yes\"], axis=1)], axis=1).to_csv(\n",
" \"test.csv\", index=False, header=False\n",
")\n",
"boto3.Session().resource(\"s3\").Bucket(bucket).Object(\n",
" os.path.join(prefix, \"train/train.csv\")\n",
").upload_file(\"train.csv\")\n",
"boto3.Session().resource(\"s3\").Bucket(bucket).Object(\n",
" os.path.join(prefix, \"validation/validation.csv\")\n",
").upload_file(\"validation.csv\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# input for SageMaker\n",
"\n",
"from sagemaker.inputs import TrainingInput\n",
"\n",
"s3_input_train = TrainingInput(\n",
" s3_data=\"s3://{}/{}/train\".format(bucket, prefix), content_type=\"csv\"\n",
")\n",
"\n",
"s3_input_validation = TrainingInput(\n",
" s3_data=\"s3://{}/{}/validation\".format(bucket, prefix), content_type=\"csv\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"papermill": {
"duration": null,
"end_time": null,
"exception": null,
"start_time": null,
"status": "pending"
},
"tags": []
},
"source": [
"---\n",
"\n",
"## Setup hyperparameter tuning\n",
"In this example, we are using SageMaker Python SDK to set up and manage the hyperparameter tuning job. We first configure the training jobs the hyperparameter tuning job will launch by initiating an estimator, and define the static hyperparameter and objective"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"papermill": {
"duration": null,
"end_time": null,
"exception": null,
"start_time": null,
"status": "pending"
},
"tags": []
},
"outputs": [],
"source": [
"from sagemaker.amazon.amazon_estimator import get_image_uri\n",
"from sagemaker.image_uris import retrieve\n",
"\n",
"sess = sagemaker.Session()\n",
"\n",
"container = retrieve(\"xgboost\", region, \"latest\")\n",
"\n",
"xgb = sagemaker.estimator.Estimator(\n",
" container,\n",
" role,\n",
" base_job_name=\"xgboost-random-search\",\n",
" instance_count=1,\n",
" instance_type=\"ml.m4.xlarge\",\n",
" output_path=\"s3://{}/{}/output\".format(bucket, prefix),\n",
" sagemaker_session=sess,\n",
")\n",
"\n",
"xgb.set_hyperparameters(\n",
" eval_metric=\"auc\",\n",
" objective=\"binary:logistic\",\n",
" num_round=10,\n",
" rate_drop=0.3,\n",
" tweedie_variance_power=1.4,\n",
")\n",
"objective_metric_name = \"validation:auc\""
]
},
{
"cell_type": "markdown",
"metadata": {
"papermill": {
"duration": null,
"end_time": null,
"exception": null,
"start_time": null,
"status": "pending"
},
"tags": []
},
"source": [
"## Logarithmic scaling\n",
"\n",
"In both cases we use logarithmic scaling, which is the scaling type that should be used whenever the order of magnitude is more important that the absolute value. It should be used if a change, say, from 1 to 2 is expected to have a much bigger impact than a change from 100 to 101, due to the fact that the hyperparameter doubles in the first case but not in the latter."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"papermill": {
"duration": null,
"end_time": null,
"exception": null,
"start_time": null,
"status": "pending"
},
"tags": []
},
"outputs": [],
"source": [
"hyperparameter_ranges = {\n",
" \"alpha\": ContinuousParameter(0.01, 10, scaling_type=\"Logarithmic\"),\n",
" \"lambda\": ContinuousParameter(0.01, 10, scaling_type=\"Logarithmic\"),\n",
"}"
]
},
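{
"cell_type": "markdown",
"metadata": {},
"source": [
"To build intuition for what `scaling_type=\"Logarithmic\"` means, the sketch below draws values uniformly in log space and compares them with linear-uniform draws over the same `[0.01, 10]` range. This is only an illustrative approximation of the two scaling types, not SageMaker's actual sampler."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch: log-uniform vs. linear-uniform sampling over [0.01, 10].\n",
"# (An approximation of the two scaling types; not SageMaker's internal sampler.)\n",
"rng = np.random.default_rng(0)\n",
"low, high, n = 0.01, 10, 10000\n",
"\n",
"linear_samples = rng.uniform(low, high, n)\n",
"log_samples = np.exp(rng.uniform(np.log(low), np.log(high), n))\n",
"\n",
"# With linear scaling almost no draws fall below 1; with log scaling each\n",
"# order of magnitude (0.01-0.1, 0.1-1, 1-10) gets roughly equal coverage.\n",
"print(\"linear: fraction of draws < 1 =\", (linear_samples < 1).mean())\n",
"print(\"log:    fraction of draws < 1 =\", (log_samples < 1).mean())"
]
},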
{
"cell_type": "markdown",
"metadata": {
"papermill": {
"duration": null,
"end_time": null,
"exception": null,
"start_time": null,
"status": "pending"
},
"tags": []
},
"source": [
"## Random search\n",
"\n",
"We now start a tuning job using random search. The main advantage of using random search is that this allows us to train jobs with a high level of parallelism"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"papermill": {
"duration": null,
"end_time": null,
"exception": null,
"start_time": null,
"status": "pending"
},
"tags": []
},
"outputs": [],
"source": [
"tuner_log = HyperparameterTuner(\n",
" xgb,\n",
" objective_metric_name,\n",
" hyperparameter_ranges,\n",
" max_jobs=5,\n",
" max_parallel_jobs=5,\n",
" strategy=\"Random\",\n",
")\n",
"\n",
"tuner_log.fit(\n",
" {\"train\": s3_input_train, \"validation\": s3_input_validation},\n",
" include_cls_metadata=False,\n",
" job_name=\"xgb-randsearch-\" + strftime(\"%Y%m%d-%H-%M-%S\", gmtime()),\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"papermill": {
"duration": null,
"end_time": null,
"exception": null,
"start_time": null,
"status": "pending"
},
"tags": []
},
"source": [
"Let's just run a quick check of the hyperparameter tuning jobs status to make sure it started successfully."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"papermill": {
"duration": null,
"end_time": null,
"exception": null,
"start_time": null,
"status": "pending"
},
"tags": []
},
"outputs": [],
"source": [
"boto3.client(\"sagemaker\").describe_hyper_parameter_tuning_job(\n",
" HyperParameterTuningJobName=tuner_log.latest_tuning_job.job_name\n",
")[\"HyperParameterTuningJobStatus\"]"
]
},
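{
"cell_type": "markdown",
"metadata": {},
"source": [
"While the tuning job runs, you can also list the individual training jobs it has launched. Here is a minimal sketch using the `ListTrainingJobsForHyperParameterTuningJob` API through the `smclient` created earlier."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# List the training jobs launched by the tuner so far (a minimal sketch;\n",
"# fields follow the boto3 ListTrainingJobsForHyperParameterTuningJob response).\n",
"summaries = smclient.list_training_jobs_for_hyper_parameter_tuning_job(\n",
"    HyperParameterTuningJobName=tuner_log.latest_tuning_job.job_name\n",
")[\"TrainingJobSummaries\"]\n",
"for summary in summaries:\n",
"    print(summary[\"TrainingJobName\"], \"-\", summary[\"TrainingJobStatus\"])"
]
},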
{
"cell_type": "markdown",
"metadata": {
"papermill": {
"duration": null,
"end_time": null,
"exception": null,
"start_time": null,
"status": "pending"
},
"tags": []
},
"source": [
"## Linear scaling\n",
"\n",
"Let us compare the results with executing a job using linear scaling."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"papermill": {
"duration": null,
"end_time": null,
"exception": null,
"start_time": null,
"status": "pending"
},
"tags": []
},
"outputs": [],
"source": [
"hyperparameter_ranges_linear = {\n",
" \"alpha\": ContinuousParameter(0.01, 10, scaling_type=\"Linear\"),\n",
" \"lambda\": ContinuousParameter(0.01, 10, scaling_type=\"Linear\"),\n",
"}\n",
"tuner_linear = HyperparameterTuner(\n",
" xgb,\n",
" objective_metric_name,\n",
" hyperparameter_ranges_linear,\n",
" max_jobs=5,\n",
" max_parallel_jobs=5,\n",
" strategy=\"Random\",\n",
")\n",
"\n",
"tuner_linear.fit(\n",
" {\"train\": s3_input_train, \"validation\": s3_input_validation},\n",
" include_cls_metadata=False,\n",
" job_name=\"xgb-linsearch-\" + strftime(\"%Y%m%d-%H-%M-%S\", gmtime()),\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"papermill": {
"duration": null,
"end_time": null,
"exception": null,
"start_time": null,
"status": "pending"
},
"tags": []
},
"source": [
"Check of the hyperparameter tuning jobs status"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"papermill": {
"duration": null,
"end_time": null,
"exception": null,
"start_time": null,
"status": "pending"
},
"tags": []
},
"outputs": [],
"source": [
"boto3.client(\"sagemaker\").describe_hyper_parameter_tuning_job(\n",
" HyperParameterTuningJobName=tuner_linear.latest_tuning_job.job_name\n",
")[\"HyperParameterTuningJobStatus\"]"
]
},
{
"cell_type": "markdown",
"metadata": {
"papermill": {
"duration": null,
"end_time": null,
"exception": null,
"start_time": null,
"status": "pending"
},
"tags": []
},
"source": [
"## Analyze tuning job results - after tuning job is completed\n",
"\n",
"**Once the tuning jobs have completed**, we can compare the distribution of the hyperparameter configurations chosen in the two cases.\n",
"\n",
"Please refer to \"HPO_Analyze_TuningJob_Results.ipynb\" to see more example code to analyze the tuning job results.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"papermill": {
"duration": null,
"end_time": null,
"exception": null,
"start_time": null,
"status": "pending"
},
"tags": []
},
"outputs": [],
"source": [
"import seaborn as sns\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# check jobs have finished\n",
"status_log = boto3.client(\"sagemaker\").describe_hyper_parameter_tuning_job(\n",
" HyperParameterTuningJobName=tuner_log.latest_tuning_job.job_name\n",
")[\"HyperParameterTuningJobStatus\"]\n",
"status_linear = boto3.client(\"sagemaker\").describe_hyper_parameter_tuning_job(\n",
" HyperParameterTuningJobName=tuner_linear.latest_tuning_job.job_name\n",
")[\"HyperParameterTuningJobStatus\"]\n",
"\n",
"assert status_log == \"Completed\", \"First must be completed, was {}\".format(status_log)\n",
"assert status_linear == \"Completed\", \"Second must be completed, was {}\".format(status_linear)\n",
"\n",
"df_log = sagemaker.HyperparameterTuningJobAnalytics(\n",
" tuner_log.latest_tuning_job.job_name\n",
").dataframe()\n",
"df_linear = sagemaker.HyperparameterTuningJobAnalytics(\n",
" tuner_linear.latest_tuning_job.job_name\n",
").dataframe()\n",
"df_log[\"scaling\"] = \"log\"\n",
"df_linear[\"scaling\"] = \"linear\"\n",
"df = pd.concat([df_log, df_linear], ignore_index=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"papermill": {
"duration": null,
"end_time": null,
"exception": null,
"start_time": null,
"status": "pending"
},
"scrolled": true,
"tags": []
},
"outputs": [],
"source": [
"g = sns.FacetGrid(df, col=\"scaling\", palette=\"viridis\")\n",
"g = g.map(plt.scatter, \"alpha\", \"lambda\", alpha=0.6)"
]
},
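{
"cell_type": "markdown",
"metadata": {},
"source": [
"Beyond the scatter plots, you can rank the configurations from both jobs by the objective metric. A short sketch, assuming the analytics dataframe exposes the standard `FinalObjectiveValue` column:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Rank all configurations from both tuning jobs by validation AUC\n",
"# (FinalObjectiveValue holds the objective metric in the analytics dataframe).\n",
"df.sort_values(\"FinalObjectiveValue\", ascending=False)[\n",
"    [\"alpha\", \"lambda\", \"scaling\", \"FinalObjectiveValue\"]\n",
"].head()"
]
},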
{
"cell_type": "markdown",
"metadata": {
"papermill": {
"duration": null,
"end_time": null,
"exception": null,
"start_time": null,
"status": "pending"
},
"tags": []
},
"source": [
"## Deploy the best model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"predictor = tuner_linear.deploy(initial_instance_count=1, instance_type=\"ml.m4.xlarge\")"
]
},
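{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before cleaning up, you can sanity-check the endpoint with a few rows of the held-out test set. A minimal sketch, assuming the SageMaker Python SDK v2 `CSVSerializer`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Send a few held-out rows to the endpoint (a sketch; assumes SDK v2 serializers).\n",
"from sagemaker.serializers import CSVSerializer\n",
"\n",
"predictor.serializer = CSVSerializer()\n",
"\n",
"# Features only, in the same column order used for training; cast to float so\n",
"# the indicator columns serialize as numbers.\n",
"sample = test_data.drop([\"y_no\", \"y_yes\"], axis=1).astype(float).values[:5]\n",
"print(predictor.predict(sample))  # raw CSV bytes of predicted probabilities"
]
},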
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Delete the end point"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sess.delete_endpoint(endpoint_name=predictor.endpoint_name)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Notebook CI Test Results\n",
"\n",
"This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n"
]
}
],
"metadata": {
"instance_type": "ml.t3.medium",
"kernelspec": {
"display_name": "Python 3 (Data Science)",
"language": "python",
"name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-2:429704687514:image/datascience-1.0"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.10"
},
"notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.",
"papermill": {
"default_parameters": {},
"duration": 4.297604,
"end_time": "2021-06-08T00:19:18.065191",
"environment_variables": {},
"exception": true,
"input_path": "hpo_xgboost_random_log.ipynb",
"output_path": "/opt/ml/processing/output/hpo_xgboost_random_log-2021-06-08-00-15-02.ipynb",
"parameters": {
"kms_key": "arn:aws:kms:us-west-2:521695447989:key/6e9984db-50cf-4c7e-926c-877ec47a8b25"
},
"start_time": "2021-06-08T00:19:13.767587",
"version": "2.3.3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}