{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# XGBoost\n",
"XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems quickly and accurately. XGBoost was first added in MADlib 1.20.0."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%load_ext sql"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Greenplum Database 6.X\n",
"%sql postgresql://okislal@localhost:6600/madlib"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%sql select madlib.version();\n",
"#%sql select version();"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. Load data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The sample data for XGBoost can be downloaded from the examples section of the MADlib documentation. Direct link: https://madlib.apache.org/docs/latest/example/madlib_xgboost_example.sql"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%sql \n",
"SELECT * FROM abalone LIMIT 10;"
]
},
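{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since 'sex' is used as the class label in the training calls below, a quick look at the class balance is a useful sanity check (this query only assumes the abalone table loaded above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%sql\n",
"-- Class distribution of the label column used in the examples below\n",
"SELECT sex, count(*) FROM abalone GROUP BY sex ORDER BY sex;"
]
},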
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 2. Run a single XGBoost training\n",
"Note that the function collates the data into a single segment and runs the XGBoost Python process on that segment's host."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%sql\n",
"DROP TABLE IF EXISTS xgb_single_out, xgb_single_out_summary;\n",
"SELECT madlib.xgboost(\n",
" 'abalone', -- Training table\n",
" 'xgb_single_out', -- Grid search results table.\n",
" 'id', -- Id column\n",
" 'sex', -- Class label column\n",
" '*', -- Independent variables \n",
" NULL, -- Columns to exclude from features \n",
" $$ \n",
" {\n",
" 'learning_rate': [0.01], # Regularization on weights (eta). For smaller values, increase n_estimators\n",
" 'max_depth': [9], # Larger values could lead to overfitting\n",
" 'subsample': [0.85], # Introduce randomness in samples picked to prevent overfitting\n",
" 'colsample_bytree': [0.85], # Introduce randomness in features picked to prevent overfitting\n",
" 'min_child_weight': [10], # Larger values will prevent overfitting\n",
" 'n_estimators': [100] # More estimators, lower variance (better fit on test set)\n",
" } \n",
" $$, -- XGBoost grid search parameters\n",
" '', -- Sample weights\n",
" 0.8, -- Training set size ratio\n",
" NULL -- Variable used to do the test/train split.\n",
");\n",
"\n",
"SELECT features, importance, precision, recall, fscore, support FROM xgb_single_out_summary;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 3. Run XGBoost Prediction"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%sql\n",
"DROP TABLE IF EXISTS xgb_single_score_out, xgb_single_score_out_metrics, xgb_single_score_out_roc_curve;\n",
"\n",
"SELECT madlib.xgboost_predict(\n",
" 'abalone', -- test_table\n",
" 'xgb_single_out', -- model_table\n",
" 'xgb_single_score_out', -- predict_output_table\n",
" 'id', -- id_column\n",
" 'sex' -- class_label\n",
");\n",
"\n",
"SELECT * FROM xgb_single_score_out LIMIT 10;"
]
},
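{
"cell_type": "markdown",
"metadata": {},
"source": [
"The prediction call also populates the xgb_single_score_out_metrics table (one of the tables dropped above), which holds evaluation metrics for the scored data. A quick way to inspect it, assuming the default metrics output of madlib.xgboost_predict:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%sql\n",
"-- Evaluation metrics produced alongside the predictions\n",
"SELECT * FROM xgb_single_score_out_metrics;"
]
},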
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 4. Run XGBoost with grid search\n",
"The parameter options are combined to form a grid and explored by running distinct XGBoost processes on different segments in parallel. The following example generates 4 configurations to test by combining 'learning_rate': [0.01, 0.1] and 'max_depth': [9, 12]."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%sql\n",
"DROP TABLE IF EXISTS xgb_grid_out, xgb_grid_out_summary;\n",
"\n",
"SELECT madlib.xgboost(\n",
" 'abalone', -- Training table\n",
" 'xgb_grid_out', -- Grid search results table.\n",
" 'id', -- Id column\n",
" 'sex', -- Class label column\n",
" '*', -- Independent variables\n",
" NULL, -- Columns to exclude from features\n",
" $$\n",
" {\n",
" 'learning_rate': [0.01, 0.1], # Regularization on weights (eta). For smaller values, increase n_estimators\n",
" 'max_depth': [9, 12], # Larger values could lead to overfitting\n",
" 'subsample': [0.85], # Introduce randomness in samples picked to prevent overfitting\n",
" 'colsample_bytree': [0.85], # Introduce randomness in features picked to prevent overfitting\n",
" 'min_child_weight': [10], # Larger values will prevent overfitting\n",
" 'n_estimators': [100] # More estimators, lower variance (better fit on test set)\n",
" }\n",
" $$, -- XGBoost grid search parameters\n",
" '', -- Sample weights\n",
" 0.8, -- Training set size ratio\n",
" NULL -- Variable used to do the test/train split.\n",
");\n",
"\n",
"SELECT features, params, importance, precision, recall, fscore, support, params_index FROM xgb_grid_out_summary ORDER BY params_index;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 5. Run XGBoost Prediction on Grid Output Table\n",
"Suppose we are interested in model 2 (params_index = 2 in the summary table above) and want to run a prediction using it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%sql\n",
"\n",
"DROP TABLE IF EXISTS xgb_grid_score_out, xgb_grid_score_out_metrics, xgb_grid_score_out_roc_curve;\n",
"\n",
"SELECT madlib.xgboost_predict(\n",
" 'abalone', -- test_table\n",
" 'xgb_grid_out', -- model_table\n",
" 'xgb_grid_score_out', -- predict_output_table\n",
" 'id', -- id_column\n",
" 'sex', -- class_label\n",
" 2 -- model_filters\n",
");\n",
"SELECT * FROM xgb_grid_score_out LIMIT 10;"
]
},
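{
"cell_type": "markdown",
"metadata": {},
"source": [
"As with the single-model case, the prediction call also populates xgb_grid_score_out_metrics (dropped above), which we can query to evaluate model 2; this assumes the default metrics output of madlib.xgboost_predict:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%sql\n",
"-- Evaluation metrics for the selected grid-search model\n",
"SELECT * FROM xgb_grid_score_out_metrics;"
]
},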
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.17"
}
},
"nbformat": 4,
"nbformat_minor": 1
}