training/built-in-algorithms/Amazon_Tabular_Classification_XGBoost_LinearLearner.ipynb

{ "cells": [ { "cell_type": "markdown", "id": "1e07a9bf", "metadata": {}, "source": [ "# Tabular classification with Amazon SageMaker XGBoost and Scikit-learn Linear Learner algorithm" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", "\n", "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/introduction_to_amazon_algorithms|xgboost_linear_learner_tabular|Amazon_Tabular_Classification_XGBoost_LinearLearner.ipynb)\n", "\n", "---" ] }, { "cell_type": "markdown", "id": "153d79ea", "metadata": {}, "source": [ "---\n", "This notebook demonstrates the use of Amazon SageMaker\u2019s implementation of the [XGBoost](https://xgboost.readthedocs.io/en/latest/) and [scikit-learn Linear Learner](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression) algorithm to train and host a tabular multiclass classification model. Tabular classification is the task of assigning a class to an example of structured or relational data. The Amazon SageMaker API for tabular classification can be used for classification of an example in two classes (binary classification) or more than two classes (multi-class classification).\n", "\n", "\n", "In this notebook, we demonstrate two use cases of tabular classification models:\n", "\n", "* How to train a tabular model on an example dataset to do multi-class classification.\n", "* How to use the trained tabular model to perform inference, i.e., classifying new samples.\n", "\n", "Note: This notebook was tested in Amazon SageMaker Studio on ml.t3.medium instance with Python 3 (Data Science) kernel.\n", "\n", "---" ] }, { "cell_type": "markdown", "id": "ca1b3090", "metadata": {}, "source": [ "1. [Set Up](#1.-Set-Up)\n", "2. [Train A Tabular Model on MNIST Dataset](#2.-Train-a-Tabular-Model-on-MNIST-Dataset)\n", " * [Retrieve Training Artifacts](#2.1.-Retrieve-Training-Artifacts)\n", " * [Set Training Parameters](#2.2.-Set-Training-Parameters)\n", " * [Train with Automatic Model Tuning](#2.3.-Train-with-Automatic-Model-Tuning) \n", " * [Start Training](#2.4.-Start-Training)\n", "3. [Deploy and Run Inference on the Trained Tabular Model](#3.-Deploy-and-Run-Inference-on-the-Trained-Tabular-Model)\n", "4. [Evaluate the Prediction Results Returned from the Endpoint](#4.-Evaluate-the-Prediction-Results-Returned-from-the-Endpoint)" ] }, { "cell_type": "markdown", "id": "5b5e0661", "metadata": {}, "source": [ "## 1. Set Up" ] }, { "cell_type": "markdown", "id": "f54e1d97", "metadata": {}, "source": [ "---\n", "Before executing the notebook, there are some initial steps required for setup. This notebook requires latest version of sagemaker and ipywidgets.\n", "\n", "---" ] }, { "cell_type": "code", "execution_count": null, "id": "730733e1", "metadata": { "scrolled": true }, "outputs": [], "source": [ "!pip install sagemaker ipywidgets --upgrade --quiet" ] }, { "cell_type": "markdown", "id": "dc31b963", "metadata": {}, "source": [ "\n", "---\n", "To train and host on Amazon SageMaker, we need to setup and authenticate the use of AWS services. Here, we use the execution role associated with the current notebook instance as the AWS account role with SageMaker access. 
, { "cell_type": "markdown", "id": "6820016b", "metadata": {}, "source": [ "## 2. Train a Tabular Model on MNIST Dataset\n", "\n", "---\n", "In this demonstration, we will train a tabular algorithm on the\n", "[MNIST](http://yann.lecun.com/exdb/mnist/) dataset. \n", "Each example consists of the individual pixel values of a 28 x 28 grayscale image, and the task is to predict the digit label from 10 classes {0, 1, 2, 3, ..., 9}. The MNIST dataset is downloaded from [THE MNIST DATABASE](http://yann.lecun.com/exdb/mnist/). \n", "\n", "Below is a table of the first 5 examples in the MNIST dataset.\n", "\n", "| Target | Feature_1 | Feature_2 | Feature_3 | ... | Feature_292 | Feature_293 | Feature_294 | ... | Feature_783 | Feature_784 |\n", "|:------:|:---------:|:---------:|:---------:|:---:|:-----------:|:-----------:|:-----------:|:---:|:-----------:|:-------------:|\n", "| 7 | 0.0 | 0.0 | 0.0 | ... | 0.00000 | 0.25781 | 0.05469 | ... | 0.0 | 0.0 |\n", "| 2 | 0.0 | 0.0 | 0.0 | ... | 0.00000 | 0.29687 | 0.96484 | ... | 0.0 | 0.0 |\n", "| 1 | 0.0 | 0.0 | 0.0 | ... | 0.00000 | 0.00000 | 0.00000 | ... | 0.0 | 0.0 |\n", "| 0 | 0.0 | 0.0 | 0.0 | ... | 0.98828 | 0.98046 | 0.98046 | ... | 0.0 | 0.0 |\n", "| 4 | 0.0 | 0.0 | 0.0 | ... | 0.20703 | 0.00000 | 0.00000 | ... | 0.0 | 0.0 |\n", "\n", "If you want to bring your own dataset, below are instructions on how the training data should be formatted as input to the model (a minimal preparation sketch follows the citation below).\n", "\n", "An S3 path should contain two sub-directories, 'train/' and 'validation/' (optional). Each sub-directory contains a 'data.csv' file (the MNIST dataset used in this example has been prepared and saved in `training_dataset_s3_path`, shown below).\n", "* The 'data.csv' files under the sub-directories 'train/' and 'validation/' are used for training and validation, respectively. The validation data is used to compute a validation score at the end of each boosting iteration. Early stopping is applied when the validation score stops improving. If validation data is not provided, 20% of the training data is randomly sampled to serve as the validation data. \n", "\n", ">Note. For [Linear Learner](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression), the validation data is used to compute a validation score at the end of the entire training process. There is no early stopping in the training process for the scikit-learn linear model. If validation data is not provided, the corresponding validation scores are not computed. \n", "\n", "* The first column of 'data.csv' should contain the target variable. The remaining columns should contain the predictor variables (features).\n", "\n", "* For the training and validation data, all categorical features must be encoded as numerical values through encoding techniques such as label encoding and one-hot encoding. The target must be encoded as numerical values by label encoding. \n", "\n", "Citations:\n", "\n", "- [LeCun et al., 1998a] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. 'Gradient-based learning applied to document recognition.' Proceedings of the IEEE, 86(11):2278-2324, November 1998" ] }
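, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "The following cell is a minimal sketch of preparing and uploading your own dataset in the format described above. The DataFrame `df`, the prefix `my-tabular-data`, and the resulting `my_dataset_s3_path` are hypothetical placeholders rather than part of the original example; replace them with your own data and S3 locations.\n", "\n", "---" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Minimal sketch (hypothetical data): format a dataset as described above\n", "# and upload it to S3. Replace `df` and the S3 locations with your own.\n", "import os\n", "\n", "import pandas as pd\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.preprocessing import LabelEncoder\n", "\n", "# Stand-in data with a categorical target and two numerical features\n", "df = pd.DataFrame(\n", "    {\n", "        \"label\": [\"cat\", \"dog\", \"cat\", \"dog\", \"cat\", \"dog\"],\n", "        \"f1\": [0.1, 0.9, 0.2, 0.8, 0.3, 0.7],\n", "        \"f2\": [1.0, 0.0, 1.0, 0.0, 1.0, 0.0],\n", "    }\n", ")\n", "\n", "# The target must be label-encoded and placed in the first column\n", "df[\"label\"] = LabelEncoder().fit_transform(df[\"label\"])\n", "df = df[[\"label\", \"f1\", \"f2\"]]\n", "\n", "train_df, val_df = train_test_split(df, test_size=0.2, random_state=0)\n", "\n", "# Write 'data.csv' files without a header row or index column\n", "os.makedirs(\"train\", exist_ok=True)\n", "os.makedirs(\"validation\", exist_ok=True)\n", "train_df.to_csv(\"train/data.csv\", header=False, index=False)\n", "val_df.to_csv(\"validation/data.csv\", header=False, index=False)\n", "\n", "# Upload under 'train/' and 'validation/' sub-directories of one S3 prefix\n", "my_bucket = sess.default_bucket()\n", "my_prefix = \"my-tabular-data\"  # hypothetical prefix\n", "sess.upload_data(\"train\", bucket=my_bucket, key_prefix=f\"{my_prefix}/train\")\n", "sess.upload_data(\"validation\", bucket=my_bucket, key_prefix=f\"{my_prefix}/validation\")\n", "my_dataset_s3_path = f\"s3://{my_bucket}/{my_prefix}/\"" ] }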
, { "cell_type": "markdown", "id": "d96c999a", "metadata": {}, "source": [ "### 2.1. Retrieve Training Artifacts\n", "---\n", "Here, we retrieve the training docker container, the training algorithm source, and the tabular model artifacts. Note that `model_version=\"*\"` fetches the latest model.\n", "\n", "For the training algorithm, we have two choices in this demonstration.\n", "* [XGBoost](https://xgboost.readthedocs.io/en/latest/): To use this algorithm, specify `train_model_id` as `xgboost-classification-model` in the cell below.\n", "* [Linear Learner](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression): To use this algorithm, specify `train_model_id` as `sklearn-classification-linear` in the cell below.\n", "\n", "Note. [LightGBM](https://lightgbm.readthedocs.io/en/latest/) (`train_model_id: lightgbm-classification-model`), [CatBoost](https://catboost.ai/en/docs/) (`train_model_id: catboost-classification-model`), [TabTransformer](https://arxiv.org/abs/2012.06678) (`train_model_id: pytorch-tabtransformerclassification-model`), and [AutoGluon Tabular](https://auto.gluon.ai/stable/tutorials/tabular_prediction/index.html) (`train_model_id: autogluon-classification-ensemble`) are the other choices in the tabular classification category. Since they have different input-format requirements, please check the separate notebooks `lightgbm_catboost_tabular/Amazon_Tabular_Classification_LightGBM_CatBoost.ipynb`, `tabtransformer_tabular/Amazon_Tabular_Classification_TabTransformer.ipynb`, and `autogluon_tabular/Amazon_Tabular_Classification_AutoGluon.ipynb` for details.\n", "\n", "---" ] }, { "cell_type": "code", "execution_count": null, "id": "e41e77c2", "metadata": { "scrolled": true }, "outputs": [], "source": [ "from sagemaker import image_uris, model_uris, script_uris\n", "\n", "train_model_id, train_model_version, train_scope = \"xgboost-classification-model\", \"*\", \"training\"\n", "\n", "training_instance_type = \"ml.m5.xlarge\"\n", "\n", "# Retrieve the docker image\n", "train_image_uri = image_uris.retrieve(\n", " region=None,\n", " framework=None,\n", " model_id=train_model_id,\n", " model_version=train_model_version,\n", " image_scope=train_scope,\n", " instance_type=training_instance_type,\n", ")\n", "# Retrieve the training script\n", "train_source_uri = script_uris.retrieve(\n", " model_id=train_model_id, model_version=train_model_version, script_scope=train_scope\n", ")\n", "# Retrieve the pre-trained model tarball to further fine-tune\n", "train_model_uri = model_uris.retrieve(\n", " model_id=train_model_id, model_version=train_model_version, model_scope=train_scope\n", ")" ] }, { "cell_type": "markdown", "id": "c7425e5a", "metadata": {}, "source": [ "### 2.2. Set Training Parameters\n", "\n", "---\n", "Now that we are done with all the setup that is needed, we are ready to train our tabular algorithm. To begin, let us create a [`sagemaker.estimator.Estimator`](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html) object. This estimator will launch the training job. \n", "\n", "There are two kinds of parameters that need to be set for training. The first kind is the parameters for the training job. These include: (i) Training data path: the S3 folder in which the input data is stored; (ii) Output path: the S3 folder in which the training output is stored; (iii) Training instance type: the type of machine on which to run the training.\n", "\n", "The second kind is the algorithm-specific training hyperparameters. \n", "\n", "---" ] }
, { "cell_type": "code", "execution_count": null, "id": "31929d40", "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Sample training data is available in this bucket\n", "training_data_bucket = f\"jumpstart-cache-prod-{aws_region}\"\n", "training_data_prefix = \"training-datasets/tabular_multiclass/\"\n", "\n", "training_dataset_s3_path = f\"s3://{training_data_bucket}/{training_data_prefix}\"\n", "\n", "output_bucket = sess.default_bucket()\n", "output_prefix = \"jumpstart-example-tabular-training\"\n", "\n", "s3_output_location = f\"s3://{output_bucket}/{output_prefix}/output\"" ] }, { "cell_type": "markdown", "id": "b2bd5795", "metadata": {}, "source": [ "---\n", "For the algorithm-specific hyperparameters, we start by fetching a Python dictionary of the training hyperparameters that the algorithm accepts, with their default values. These can then be overridden with custom values.\n", "\n", "---" ] }, { "cell_type": "code", "execution_count": null, "id": "04616c01", "metadata": { "scrolled": true }, "outputs": [], "source": [ "from sagemaker import hyperparameters\n", "\n", "# Retrieve the default hyperparameters for fine-tuning the model\n", "hyperparameters = hyperparameters.retrieve_default(\n", " model_id=train_model_id, model_version=train_model_version\n", ")\n", "\n", "# [Optional] Override default hyperparameters with custom values\n", "hyperparameters[\"num_boost_round\"] = \"500\" # this hyperparameter is specific to XGBoost\n", "hyperparameters[\"early_stopping_rounds\"] = \"5\" # this hyperparameter is specific to XGBoost\n", "print(hyperparameters)" ] }, { "cell_type": "markdown", "id": "173a5a38", "metadata": {}, "source": [ "### 2.3. Train with Automatic Model Tuning\n", "\n", "Amazon SageMaker automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by running many training jobs on your dataset using the algorithm and the ranges of hyperparameters that you specify. It then chooses the hyperparameter values that result in the model that performs the best, as measured by a metric that you choose. We will use a `HyperparameterTuner` object to interact with the Amazon SageMaker hyperparameter tuning APIs." ] }
] }, { "cell_type": "code", "execution_count": null, "id": "912b1acd", "metadata": { "scrolled": true }, "outputs": [], "source": [ "from sagemaker.tuner import ContinuousParameter, IntegerParameter, HyperparameterTuner\n", "\n", "use_amt = True\n", "\n", "if train_model_id == \"xgboost-classification-model\":\n", "\n", " hyperparameter_ranges = {\n", " \"num_round\": IntegerParameter(1000, 10000),\n", " \"lambda\": IntegerParameter(1, 50),\n", " \"colsample_bytree\": ContinuousParameter(0, 1),\n", " \"alpha\": IntegerParameter(1, 50),\n", " \"subsample\": ContinuousParameter(0, 1),\n", " \"min_child_weight\": IntegerParameter(1, 50),\n", " \"max_delta_step\": IntegerParameter(1, 50),\n", " \"gamma\": IntegerParameter(1, 50),\n", " \"eta\": ContinuousParameter(0.01, 1, scaling_type=\"Logarithmic\"),\n", " \"max_depth\": IntegerParameter(1, 10),\n", " }\n", "\n", "\n", "elif train_model_id == \"sklearn-classification-linear\":\n", "\n", " hyperparameter_ranges = {\n", " \"alpha\": ContinuousParameter(1e-6, 1, scaling_type=\"Logarithmic\"),\n", " \"tol\": ContinuousParameter(1e-6, 1e-1, scaling_type=\"Logarithmic\"),\n", " \"l1_ratio\": ContinuousParameter(0, 1),\n", " }" ] }, { "cell_type": "markdown", "id": "908f33f2", "metadata": {}, "source": [ "### 2.4. Start Training" ] }, { "cell_type": "markdown", "id": "cfb06f51", "metadata": {}, "source": [ "---\n", "We start by creating the estimator object with all the required assets and then launch the training job.\n", "\n", "---" ] }, { "cell_type": "code", "execution_count": null, "id": "5c021fb5", "metadata": { "scrolled": true }, "outputs": [], "source": [ "from sagemaker.estimator import Estimator\n", "from sagemaker.utils import name_from_base\n", "\n", "training_job_name = name_from_base(f\"jumpstart-{train_model_id}-training\")\n", "\n", "# Create SageMaker Estimator instance\n", "tabular_estimator = Estimator(\n", " role=aws_role,\n", " image_uri=train_image_uri,\n", " source_dir=train_source_uri,\n", " model_uri=train_model_uri,\n", " entry_point=\"transfer_learning.py\",\n", " instance_count=1,\n", " instance_type=training_instance_type,\n", " max_run=360000,\n", " hyperparameters=hyperparameters,\n", " output_path=s3_output_location,\n", ")\n", "\n", "if use_amt:\n", " if train_model_id == \"xgboost-classification-model\":\n", "\n", " tuner = HyperparameterTuner(\n", " tabular_estimator,\n", " \"validation:mlogloss\",\n", " hyperparameter_ranges,\n", " max_jobs=2,\n", " max_parallel_jobs=2,\n", " objective_type=\"Minimize\",\n", " base_tuning_job_name=training_job_name,\n", " )\n", "\n", " elif train_model_id == \"sklearn-classification-linear\":\n", "\n", " tuner = HyperparameterTuner(\n", " tabular_estimator,\n", " \"accuracy_score\",\n", " hyperparameter_ranges,\n", " [{\"Name\": \"accuracy_score\", \"Regex\": \"accuracy_score: ([0-9\\\\.]+)\"}],\n", " max_jobs=10,\n", " max_parallel_jobs=2,\n", " objective_type=\"Maximize\",\n", " base_tuning_job_name=training_job_name,\n", " )\n", "\n", " tuner.fit({\"training\": training_dataset_s3_path}, logs=True)\n", "else:\n", " # Launch a SageMaker Training job by passing s3 path of the training data\n", " tabular_estimator.fit(\n", " {\"training\": training_dataset_s3_path}, logs=True, job_name=training_job_name\n", " )" ] }, { "cell_type": "markdown", "id": "3d4e7d98", "metadata": {}, "source": [ "## 3. Deploy and Run Inference on the Trained Tabular Model\n", "\n", "---\n", "In this section, you learn how to query an existing endpoint and make predictions of the examples you input. 
, { "cell_type": "markdown", "id": "3d4e7d98", "metadata": {}, "source": [ "## 3. Deploy and Run Inference on the Trained Tabular Model\n", "\n", "---\n", "In this section, you learn how to query an existing endpoint and make predictions for the examples you input. For each example, the model outputs a probability for each class. \n", "The predicted class label is then obtained by taking the class label with the maximum probability. Throughout the notebook, the examples are taken from the [MNIST](http://yann.lecun.com/exdb/mnist/) test set. \n", "Each example consists of the individual pixel values of a 28 x 28 grayscale image, and the task is to predict the digit label from 10 classes {0, 1, 2, 3, ..., 9}.\n", "\n", "We start by retrieving the inference artifacts and deploying the `tabular_estimator` that we trained.\n", "\n", "---" ] }, { "cell_type": "code", "execution_count": null, "id": "68a94632", "metadata": { "scrolled": true }, "outputs": [], "source": [ "inference_instance_type = \"ml.m5.large\"\n", "\n", "# Retrieve the inference docker container uri\n", "deploy_image_uri = image_uris.retrieve(\n", " region=None,\n", " framework=None,\n", " image_scope=\"inference\",\n", " model_id=train_model_id,\n", " model_version=train_model_version,\n", " instance_type=inference_instance_type,\n", ")\n", "# Retrieve the inference script uri\n", "deploy_source_uri = script_uris.retrieve(\n", " model_id=train_model_id, model_version=train_model_version, script_scope=\"inference\"\n", ")\n", "\n", "endpoint_name = name_from_base(f\"jumpstart-example-{train_model_id}-\")\n", "\n", "# Use the estimator from the previous step to deploy to a SageMaker endpoint\n", "predictor = (tuner if use_amt else tabular_estimator).deploy(\n", " initial_instance_count=1,\n", " instance_type=inference_instance_type,\n", " entry_point=\"inference.py\",\n", " image_uri=deploy_image_uri,\n", " source_dir=deploy_source_uri,\n", " endpoint_name=endpoint_name,\n", " enable_network_isolation=True,\n", ")" ] }, { "cell_type": "markdown", "id": "e9b2776b", "metadata": {}, "source": [ "---\n", "Next, we download hold-out MNIST test data from the S3 bucket for inference.\n", "\n", "---" ] }, { "cell_type": "code", "execution_count": null, "id": "8b60a249", "metadata": { "scrolled": true }, "outputs": [], "source": [ "jumpstart_assets_bucket = f\"jumpstart-cache-prod-{aws_region}\"\n", "test_data_prefix = \"training-datasets/tabular_multiclass/test\"\n", "test_data_file_name = \"data.csv\"\n", "\n", "boto3.client(\"s3\").download_file(\n", " jumpstart_assets_bucket, f\"{test_data_prefix}/{test_data_file_name}\", test_data_file_name\n", ")" ] }, { "cell_type": "markdown", "id": "74404049", "metadata": {}, "source": [ "---\n", "Next, we read the MNIST test data into a pandas data frame and prepare the ground-truth target and the predictor features to send to the endpoint. \n", "\n", "Below, the first 5 examples in the MNIST test set are shown. All of the test examples, with features \n", "from ```Feature_1``` to ```Feature_784```, are sent to the deployed model to get predictions \n", "for the ground-truth ```Target``` column. For each test example, the model will output \n", "a vector of ```num_classes``` elements, where each element is the probability of that class. \n", "```num_classes``` is 10 in this case. The predicted class label is then obtained by taking the class label \n", "with the maximum probability.\n", "\n", "---" ] }
\n", "\n", "---" ] }, { "cell_type": "code", "execution_count": null, "id": "e69ccdbc", "metadata": { "scrolled": true }, "outputs": [], "source": [ "newline, bold, unbold = \"\\n\", \"\\033[1m\", \"\\033[0m\"\n", "\n", "import numpy as np\n", "import pandas as pd\n", "from sklearn.metrics import accuracy_score\n", "from sklearn.metrics import f1_score\n", "from sklearn.metrics import confusion_matrix\n", "import matplotlib.pyplot as plt\n", "\n", "# read the data\n", "test_data = pd.read_csv(test_data_file_name, header=None)\n", "test_data.columns = [\"Target\"] + [f\"Feature_{i}\" for i in range(1, test_data.shape[1])]\n", "\n", "num_examples, num_columns = test_data.shape\n", "print(\n", " f\"{bold}The test dataset contains {num_examples} examples and {num_columns} columns.{unbold}\\n\"\n", ")\n", "\n", "# prepare the ground truth target and predicting features to send into the endpoint.\n", "ground_truth_label, features = test_data.iloc[:, :1], test_data.iloc[:, 1:]\n", "\n", "print(f\"{bold}The first 5 observations of the data: {unbold} \\n\")\n", "test_data.head(5)" ] }, { "cell_type": "markdown", "id": "19c5d995", "metadata": {}, "source": [ "---\n", "The following code queries the endpoint you have created to get the prediction for each test example. \n", "The `query_endpoint()` function returns a array-like of shape (num_examples, num_classes), where each row indicates \n", "the probability of the example for each class in the model. The num_classes is 10 in above test data. \n", "Next, the predicted class label is obtained by taking the class label with the maximum probability over others for each example. \n", "\n", "---" ] }, { "cell_type": "code", "execution_count": null, "id": "c2e9c2fc", "metadata": { "scrolled": true }, "outputs": [], "source": [ "content_type = \"text/csv\"\n", "\n", "\n", "def query_endpoint(encoded_tabular_data):\n", " client = boto3.client(\"runtime.sagemaker\")\n", " response = client.invoke_endpoint(\n", " EndpointName=endpoint_name, ContentType=content_type, Body=encoded_tabular_data\n", " )\n", " return response\n", "\n", "\n", "def parse_response(query_response):\n", " model_predictions = json.loads(query_response[\"Body\"].read())\n", " predicted_probabilities = model_predictions[\"probabilities\"]\n", " return np.array(predicted_probabilities)\n", "\n", "\n", "# split the test data into smaller size of batches to query the endpoint due to the large size of test data.\n", "batch_size = 1500\n", "predict_prob = []\n", "for i in np.arange(0, num_examples, step=batch_size):\n", " query_response_batch = query_endpoint(\n", " features.iloc[i : (i + batch_size), :].to_csv(header=False, index=False).encode(\"utf-8\")\n", " )\n", " predict_prob_batch = parse_response(query_response_batch) # prediction probability per batch\n", " predict_prob.append(predict_prob_batch)\n", "\n", "\n", "predict_prob = np.concatenate(predict_prob, axis=0)\n", "predict_label = np.argmax(\n", " predict_prob, axis=1\n", ") # Note. For binary classification, the model returns a array-like of shape (num_examples, 1),\n", "# where each row is the probability of the positive label 1, assuming there are positive label (encoded as 1) and negative label (encoded as 0) in the target.\n", "# To get the probability for both label 0 and 1, execute following code:\n", "# predict_prob = np.vstack((1.0 - predict_prob, predict_prob)).transpose()" ] }, { "cell_type": "markdown", "id": "0691e70e", "metadata": {}, "source": [ "## 4. 
, { "cell_type": "markdown", "id": "0691e70e", "metadata": {}, "source": [ "## 4. Evaluate the Prediction Results Returned from the Endpoint\n", "\n", "---\n", "We evaluate the prediction results returned from the endpoint in the following two ways.\n", "\n", "* Visualize the prediction results by plotting the confusion matrix.\n", "\n", "* Measure the prediction results quantitatively.\n", "\n", "---" ] }, { "cell_type": "code", "execution_count": null, "id": "094eb500", "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Visualize the prediction results by plotting the confusion matrix\n", "conf_matrix = confusion_matrix(y_true=ground_truth_label.values, y_pred=predict_label)\n", "fig, ax = plt.subplots(figsize=(7.5, 7.5))\n", "ax.matshow(conf_matrix, cmap=plt.cm.Blues, alpha=0.3)\n", "for i in range(conf_matrix.shape[0]):\n", " for j in range(conf_matrix.shape[1]):\n", " ax.text(x=j, y=i, s=conf_matrix[i, j], va=\"center\", ha=\"center\", size=\"xx-large\")\n", "\n", "plt.xlabel(\"Predictions\", fontsize=18)\n", "plt.ylabel(\"Actuals\", fontsize=18)\n", "plt.title(\"Confusion Matrix\", fontsize=18)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "a95032ac", "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Measure the prediction results quantitatively\n", "eval_accuracy = accuracy_score(ground_truth_label.values, predict_label)\n", "eval_f1_macro = f1_score(ground_truth_label.values, predict_label, average=\"macro\")\n", "eval_f1_micro = f1_score(ground_truth_label.values, predict_label, average=\"micro\")\n", "\n", "print(\n", " f\"{bold}Evaluation result on test data{unbold}:{newline}\"\n", " f\"{bold}{accuracy_score.__name__}{unbold}: {eval_accuracy}{newline}\"\n", " f\"{bold}F1 Macro{unbold}: {eval_f1_macro}{newline}\"\n", " f\"{bold}F1 Micro{unbold}: {eval_f1_micro}{newline}\"\n", ")" ] }, { "cell_type": "markdown", "id": "3e347345", "metadata": {}, "source": [ "---\n", "Next, we delete the endpoint corresponding to the trained model.\n", "\n", "---" ] }, { "cell_type": "code", "execution_count": null, "id": "9b3f1bd9", "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Delete the SageMaker endpoint and the attached resources\n", "predictor.delete_model()\n", "predictor.delete_endpoint()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Notebook CI Test Results\n", "\n", "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2, which is shown at the top of the notebook.\n", "\n", "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/introduction_to_amazon_algorithms|xgboost_linear_learner_tabular|Amazon_Tabular_Classification_XGBoost_LinearLearner.ipynb)\n", "\n", "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/introduction_to_amazon_algorithms|xgboost_linear_learner_tabular|Amazon_Tabular_Classification_XGBoost_LinearLearner.ipynb)\n", "\n", "![This us-west-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/introduction_to_amazon_algorithms|xgboost_linear_learner_tabular|Amazon_Tabular_Classification_XGBoost_LinearLearner.ipynb)\n", "\n", "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/introduction_to_amazon_algorithms|xgboost_linear_learner_tabular|Amazon_Tabular_Classification_XGBoost_LinearLearner.ipynb)\n", "\n", "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/introduction_to_amazon_algorithms|xgboost_linear_learner_tabular|Amazon_Tabular_Classification_XGBoost_LinearLearner.ipynb)\n", "\n", "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/introduction_to_amazon_algorithms|xgboost_linear_learner_tabular|Amazon_Tabular_Classification_XGBoost_LinearLearner.ipynb)\n", "\n", "![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/introduction_to_amazon_algorithms|xgboost_linear_learner_tabular|Amazon_Tabular_Classification_XGBoost_LinearLearner.ipynb)\n", "\n", "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/introduction_to_amazon_algorithms|xgboost_linear_learner_tabular|Amazon_Tabular_Classification_XGBoost_LinearLearner.ipynb)\n", "\n", "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/introduction_to_amazon_algorithms|xgboost_linear_learner_tabular|Amazon_Tabular_Classification_XGBoost_LinearLearner.ipynb)\n", "\n", "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/introduction_to_amazon_algorithms|xgboost_linear_learner_tabular|Amazon_Tabular_Classification_XGBoost_LinearLearner.ipynb)\n", "\n", "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/introduction_to_amazon_algorithms|xgboost_linear_learner_tabular|Amazon_Tabular_Classification_XGBoost_LinearLearner.ipynb)\n", "\n", "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/introduction_to_amazon_algorithms|xgboost_linear_learner_tabular|Amazon_Tabular_Classification_XGBoost_LinearLearner.ipynb)\n", "\n", "![This ap-northeast-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/introduction_to_amazon_algorithms|xgboost_linear_learner_tabular|Amazon_Tabular_Classification_XGBoost_LinearLearner.ipynb)\n", "\n", "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/introduction_to_amazon_algorithms|xgboost_linear_learner_tabular|Amazon_Tabular_Classification_XGBoost_LinearLearner.ipynb)\n", "\n", "![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/introduction_to_amazon_algorithms|xgboost_linear_learner_tabular|Amazon_Tabular_Classification_XGBoost_LinearLearner.ipynb)\n" ] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 5 }