community-artifacts/Supervised-learning/xgboost-v1.ipynb

{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# XGBoost\n", "XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as GBDT, GBM) that solves many data science problems in a fast and accurate way. XGBoost was first added in MADlib 1.20.0." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load_ext sql" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Greenplum Database 6.X\n", "%sql postgresql://okislal@localhost:6600/madlib" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%sql select madlib.version();\n", "#%sql select version();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 1. Load data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The sample data for XGBoost can be downloaded from the examples section of the MADlib documentation. Direct link: https://madlib.apache.org/docs/latest/example/madlib_xgboost_example.sql" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%sql \n", "SELECT * FROM abalone LIMIT 10;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2. Run a single XGBoost training\n", "Note that the function collates the data into a single segment and runs the xgboost Python process on that machine." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%sql\n", "DROP TABLE IF EXISTS xgb_single_out, xgb_single_out_summary;\n", "SELECT madlib.xgboost(\n", " 'abalone', -- Training table\n", " 'xgb_single_out', -- Grid search results table.\n", " 'id', -- Id column\n", " 'sex', -- Class label column\n", " '*', -- Independent variables\n", " NULL, -- Columns to exclude from features\n", " $$\n", " {\n", " 'learning_rate': [0.01], # Step size shrinkage (eta). For smaller values, increase n_estimators\n", " 'max_depth': [9], # Larger values could lead to overfitting\n", " 'subsample': [0.85], # Fraction of rows sampled per tree, adds randomness to prevent overfitting\n", " 'colsample_bytree': [0.85], # Fraction of features sampled per tree, adds randomness to prevent overfitting\n", " 'min_child_weight': [10], # Larger values make the model more conservative and help prevent overfitting\n", " 'n_estimators': [100] # Number of boosting rounds (trees). More rounds fit the training data more closely\n", " }\n", " $$, -- XGBoost grid search parameters\n", " '', -- Sample weights\n", " 0.8, -- Training set size ratio\n", " NULL -- Variable used to do the test/train split.\n", ");\n", "\n", "SELECT features, importance, precision, recall, fscore, support FROM xgb_single_out_summary;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 3. Run XGBoost Prediction" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%sql\n", "DROP TABLE IF EXISTS xgb_single_score_out, xgb_single_score_out_metrics, xgb_single_score_out_roc_curve;\n", "\n", "SELECT madlib.xgboost_predict(\n", " 'abalone', -- test_table\n", " 'xgb_single_out', -- model_table\n", " 'xgb_single_score_out', -- predict_output_table\n", " 'id', -- id_column\n", " 'sex' -- class_label\n", ");\n", "\n", "SELECT * FROM xgb_single_score_out LIMIT 10;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 4. 
Run XGBoost with grid search\n", "The parameter options are combined to form a grid, which is explored in parallel by running distinct xgboost processes on different segments. The following example generates 4 configurations to test by combining 'learning_rate': [0.01,0.1] and 'max_depth': [9,12]." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%sql\n", "DROP TABLE IF EXISTS xgb_grid_out, xgb_grid_out_summary;\n", "\n", "SELECT madlib.xgboost(\n", " 'abalone', -- Training table\n", " 'xgb_grid_out', -- Grid search results table.\n", " 'id', -- Id column\n", " 'sex', -- Class label column\n", " '*', -- Independent variables\n", " NULL, -- Columns to exclude from features\n", " $$\n", " {\n", " 'learning_rate': [0.01,0.1], # Step size shrinkage (eta). For smaller values, increase n_estimators\n", " 'max_depth': [9,12], # Larger values could lead to overfitting\n", " 'subsample': [0.85], # Fraction of rows sampled per tree, adds randomness to prevent overfitting\n", " 'colsample_bytree': [0.85], # Fraction of features sampled per tree, adds randomness to prevent overfitting\n", " 'min_child_weight': [10], # Larger values make the model more conservative and help prevent overfitting\n", " 'n_estimators': [100] # Number of boosting rounds (trees). More rounds fit the training data more closely\n", " }\n", " $$, -- XGBoost grid search parameters\n", " '', -- Sample weights\n", " 0.8, -- Training set size ratio\n", " NULL -- Variable used to do the test/train split.\n", ");\n", "\n", "SELECT features, params, importance, precision, recall, fscore, support, params_index FROM xgb_grid_out_summary ORDER BY params_index;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 5. Run XGBoost Prediction on Grid Output Table\n", "Suppose we are interested in model 2 and want to run a prediction using it." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%sql\n", "\n", "DROP TABLE IF EXISTS xgb_grid_score_out, xgb_grid_score_out_metrics, xgb_grid_score_out_roc_curve;\n", "\n", "SELECT madlib.xgboost_predict(\n", " 'abalone', -- test_table\n", " 'xgb_grid_out', -- model_table\n", " 'xgb_grid_score_out', -- predict_output_table\n", " 'id', -- id_column\n", " 'sex', -- class_label\n", " 2 -- model_filters\n", ");\n", "SELECT * FROM xgb_grid_score_out LIMIT 10;" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.17" } }, "nbformat": 4, "nbformat_minor": 1 }
