{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Train locally"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Import training data\n",
"\n",
"For illustration purposes we will use the MNIST dataset. The following code downloads the dataset and puts it in `./mnist_data`.\n",
"\n",
"The first 60000 images and targets are the original training set, while the last 10000 are the testing set. The training set is ordered by the labels so we shuffle them since we will use a very small portion of the data to shorten training time."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import fetch_mldata\n",
"from sklearn.utils import shuffle\n",
"\n",
"mnist = fetch_mldata('MNIST original', data_home='./mnist_data')\n",
"X, y = shuffle(mnist.data[:60000], mnist.target[:60000])\n",
"\n",
"X_small = X[:100]\n",
"y_small = y[:100]\n",
"\n",
"# Note: using only 10% of the training data\n",
"X_large = X[:6000]\n",
"y_large = y[:6000]"
]
},
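{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note: `fetch_mldata` relies on the now-defunct mldata.org service and has been removed from recent scikit-learn releases. If you are running a newer scikit-learn (0.20+), a roughly equivalent sketch using `fetch_openml` would be (the exact return types vary slightly between versions, and the targets come back as strings):\n",
"\n",
"```python\n",
"from sklearn.datasets import fetch_openml\n",
"from sklearn.utils import shuffle\n",
"\n",
"# fetch_openml returns all 70000 MNIST images; the first 60000 are the training set.\n",
"mnist = fetch_openml('mnist_784', version=1, data_home='./mnist_data')\n",
"X, y = shuffle(mnist.data[:60000], mnist.target[:60000].astype(int))\n",
"```"
]
},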
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Instantiate the estimator and the SearchCV objects\n",
"\n",
"For illustration purposes we will use the `RandomForestClassifier` with `RandomizedSearchCV`:\n",
"\n",
"http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html\n",
"\n",
"http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.model_selection import RandomizedSearchCV\n",
"from scipy.stats import uniform\n",
"\n",
"rfc = RandomForestClassifier(n_jobs=-1)\n",
"param_distributions = {\n",
" 'max_features': uniform(0.6, 0.4),\n",
" 'n_estimators': range(20, 401, 20),\n",
" 'max_depth': range(5, 41, 5) + [None],\n",
" 'min_samples_split': [0.01, 0.05, 0.1]\n",
"}\n",
"n_iter = 100\n",
"search = RandomizedSearchCV(estimator=rfc, param_distributions=param_distributions, n_iter=n_iter, n_jobs=-1, verbose=3)"
]
},
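{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that `uniform(0.6, 0.4)` is a continuous distribution over the interval [0.6, 1.0] (`loc=0.6`, `scale=0.4`), so `max_features` is sampled as a fraction of the total number of features. As a purely illustrative sketch (not part of the training flow), you can preview the kind of candidates `RandomizedSearchCV` will draw from the `param_distributions` defined above using scikit-learn's `ParameterSampler`:\n",
"\n",
"```python\n",
"from sklearn.model_selection import ParameterSampler\n",
"\n",
"# Draw a few example parameter settings; each candidate is a dict such as\n",
"# {'max_features': 0.73, 'n_estimators': 140, 'max_depth': None, 'min_samples_split': 0.05}.\n",
"for candidate in ParameterSampler(param_distributions, n_iter=3):\n",
"    print(candidate)\n",
"```"
]
},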
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Fit the GridSearchCV object locally\n",
"\n",
"After fitting we can examine the best score (accuracy) and the best parameters that achieve that score."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%time search.fit(X_small, y_small)\n",
"\n",
"print(search.best_score_, search.best_params_)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Everything up to this point is what you would do when training locally. With larger amount of data it would take much longer."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Train on Google Container Engine"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Set up for training on Google Container Engine\n",
"\n",
"Before we can start training on the Container Engine we need to:\n",
"\n",
"- Build the Docker image which will be handling the workloads.\n",
"- Create a cluster.\n",
"\n",
"For these we will first set up some configuration variables."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Your Google Cloud Platform project id."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"project_id = 'YOUR-PROJECT-ID'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A Google Cloud Storage bucket belonging to your project created through either:\n",
"- gsutil mb gs://YOUR-BUCKET-NAME; or\n",
"- https://console.cloud.google.com/storage\n",
"\n",
"This bucket will be used for storing temporary data during Docker image building, for storing training data, and for storing trained models.\n",
"\n",
"This can be an existing bucket, but we recommend you create a new one."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"bucket_name = 'YOUR-BUCKET-NAME'"
]
},
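{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you prefer to create the bucket programmatically, a minimal sketch using the `google-cloud-storage` client library (assuming it is installed and you are authenticated) would be:\n",
"\n",
"```python\n",
"from google.cloud import storage\n",
"\n",
"# Create the bucket under your project; this fails if the name is already taken globally.\n",
"client = storage.Client(project=project_id)\n",
"bucket = client.create_bucket(bucket_name)\n",
"```"
]
},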
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pick a cluster id for the cluster on Google Container Engine we will create. Preferably not an existing cluster to avoid affecting its workload."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cluster_id = 'YOUR-CLUSTER-ID'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Choose a name for the image that will be running on the container."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"image_name = 'YOUR-IMAGE-NAME'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Choose a zone to host the cluster.\n",
"\n",
"List of zones: https://cloud.google.com/compute/docs/regions-zones/"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"zone = 'us-central1-b'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Change this only if you have customized the source."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"source_dir = 'source'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Build a Docker image\n",
"\n",
"This step builds a Docker image using the content in the `source/` folder. The image will be tagged with the provided `image_name` so the workers can pull it. The main script `source/worker.py` would retrieve a pickled `GridSearchCV` object from Cloud Storage and fit it to data on GCS.\n",
"\n",
"Note: This step only needs to be run once the first time you follow these steps,\n",
"and each time you modify the codes in `source/`. If you have not modified `source/` then\n",
"you can just re-use the same image.\n",
"\n",
"Note: This could take a couple minutes.\n",
"To monitor the build process: https://console.cloud.google.com/gcr/builds"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from helpers.cloudbuild_helper import build\n",
"\n",
"build(project_id, source_dir, bucket_name, image_name)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create a cluster\n",
"\n",
"This step creates a cluster on the Container Engine.\n",
"\n",
"You can alternatively create the cluster with the gcloud command line tool or through the console, but\n",
"you must add the additional scope of write access to Google Clous Storage: `'https://www.googleapis.com/auth/devstorage.read_write'`\n",
"\n",
"Note: This could take several minutes.\n",
"To monitor the cluster creation process: https://console.cloud.google.com/kubernetes/list\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from helpers.gke_helper import create_cluster\n",
"\n",
"create_cluster(project_id, zone, cluster_id, n_nodes=1, machine_type='n1-standard-64')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For GCE instance pricing: https://cloud.google.com/compute/pricing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Instantiate the GKEParallel object\n",
"\n",
"The `GKEParallel` class is a helper wrapper around a `GridSearchCV` object that manages deploying fitting jobs to the Container Engine cluster created above.\n",
"\n",
"We pass in the `GridSearchCV` object, which will be pickled and stored on Cloud Storage with\n",
"uri of the form: \n",
"\n",
"```gs://YOUR-BUCKET-NAME/YOUR-CLUSTER-ID.YOUR-IMAGE-NAME.UNIX-TIME/search.pkl```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.model_selection import RandomizedSearchCV\n",
"from scipy.stats import uniform\n",
"\n",
"rfc = RandomForestClassifier(n_jobs=-1)\n",
"param_distributions = {\n",
" 'max_features': uniform(0.6, 0.4),\n",
" 'n_estimators': range(20, 401, 20),\n",
" 'max_depth': range(5, 41, 5) + [None],\n",
" 'min_samples_split': [0.01, 0.05, 0.1]\n",
"}\n",
"n_iter = 100\n",
"search = RandomizedSearchCV(estimator=rfc, param_distributions=param_distributions, n_iter=n_iter, n_jobs=-1, verbose=3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from gke_parallel import GKEParallel\n",
"\n",
"gke_search = GKEParallel(search, project_id, zone, cluster_id, bucket_name, image_name)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Refresh access token to the cluster\n",
"\n",
"To make it easy to gain access to the cluster through the [Kubernetes client library](https://github.com/kubernetes-incubator/client-python), included in this sample is a script that retrieves credentials for the cluster with gcloud\n",
"and refreshes access token with kubectl."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"! bash get_cluster_credentials.sh $cluster_id $zone"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Deploy the fitting task\n",
"\n",
"`GKEParallel` instances implement a similar (but different!) interface as `GridSearchCV`.\n",
"\n",
"Calling `fit(X, y)` first uploads the training data to Cloud Storage as:\n",
"\n",
"```\n",
"gs://YOUR-BUCKET-NAME/YOUR-CLUSTER-ID.YOUR-IMAGE-NAME.UNIX-TIME/X.pkl\n",
"gs://YOUR-BUCKET-NAME/YOUR-CLUSTER-ID.YOUR-IMAGE-NAME.UNIX-TIME/y.pkl\n",
"```\n",
"\n",
"This allows reusing the same uploaded datasets for future training tasks.\n",
"\n",
"For instance, if you already have pickled data on Cloud Storage:\n",
"\n",
"```\n",
"gs://DATA-BUCKET/X.pkl\n",
"gs://DATA-BUCKET/y.pkl\n",
"```\n",
"\n",
"then you can deploy the fitting task with:\n",
"\n",
"```\n",
"gke_search.fit(X='gs://DATA-BUCKET/X.pkl', y='gs://DATA-BUCKET/y.pkl')\n",
"```\n",
"\n",
"Calling `fit(X, y)` also pickles the wrapped `search` and `gke_search`, stores them on Cloud Storage as:\n",
"\n",
"```\n",
"gs://YOUR-BUCKET-NAME/YOUR-CLUSTER-ID.YOUR-IMAGE-NAME.UNIX-TIME/search.pkl\n",
"gs://YOUR-BUCKET-NAME/YOUR-CLUSTER-ID.YOUR-IMAGE-NAME.UNIX-TIME/gke_search.pkl\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"gke_search.fit(X_large, y_large)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Inspect the GKEParallel object\n",
"\n",
"In the background, the `GKEParallel` instance splits the `param_grid` into smaller `param_grids`\n",
"\n",
"Each smaller `param_grid` is pickled and stored on GCS within each worker's workspace:\n",
"\n",
"```gs://YOUR-BUCKET-NAME/YOUR-CLUSTER-ID.YOUR-IMAGE-NAME.UNIX-TIME/WORKER-ID/param_grid.pkl```\n",
"\n",
"The `param_distributions` can be accessed as follows."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"gke_search.param_distributions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You could optionally specify a `task_name` when creating a `GKEParallel` instance.\n",
"\n",
"If you did not specify a `task_name`, when you call `fit(X, y)` the task_name will be set to:\n",
"\n",
"```YOUR-CLUSTER-ID.YOUR-IMAGE-NAME.UNIX-TIME``` "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"gke_search.task_name"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Similarly, each job is given a `job_name`. The dictionary of `job_names` can be accessed as follows. Each worker pod handles one job processing one of the smaller `param_grids`. \n",
"\n",
"To monitor the jobs: https://console.cloud.google.com/kubernetes/workload"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"gke_search.job_names"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Cancel the task\n",
"\n",
"To cancel the task, run `cancel()`. This will delete all the deployed worker pods and jobs,\n",
"but will NOT delete the cluster, nor delete any data already persisted to Cloud Storage."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#gke_search.cancel()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Monitor the progress\n",
"\n",
"\n",
"`GKEParallel` instances implement a similar (but different!) interface as Future instances.\n",
"Calling `done()` checks whether each worker has completed the job and persisted its outcome\n",
"on GCS with uri:\n",
"\n",
"```gs://YOUR-BUCKET-NAME/YOUR-CLUSTER-ID.YOUR-IMAGE-NAME.UNIX-TIME/WORKER-ID/fitted_search.pkl```\n",
"\n",
"To monitor the jobs: https://console.cloud.google.com/kubernetes/workload\n",
"\n",
"To access the persisted data directly: https://console.cloud.google.com/storage/browser/YOUR-BUCKET-NAME/"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"gke_search.done(), gke_search.dones"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When all the jobs are finished, the pods will stop running (but the cluster will remain), and we can retrieve the fitted model.\n",
"\n",
"Calling `result()` will populate the `gke_search.results` attribute which is returned.\n",
"This attribute records all the fitted `GridSearchCV` from the jobs. The fitted model is downloaded only if the `download` argument is set to `True`.\n",
"\n",
"Calling `result()` also updates the pickled `gke_search` object on Cloud Storage:\n",
"\n",
"`gs://YOUR-BUCKET-NAME/YOUR-CLUSTER-ID.YOUR-IMAGE-NAME.UNIX-TIME/gke_search.pkl`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"result = gke_search.result(download=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also get the logs from the pods:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from helpers.kubernetes_helper import get_pod_logs\n",
"\n",
"for pod_name, log in get_pod_logs().items():\n",
" print('=' * 20)\n",
" print('\\t{}\\n'.format(pod_name))\n",
" print(log)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once the jobs are finished, the cluster can be deleted. All the fitted models are stored on GCS.\n",
"\n",
"The cluster can also be deleted from the console: https://console.cloud.google.com/kubernetes/list"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from helpers.gke_helper import delete_cluster\n",
"\n",
"#delete_cluster(project_id, zone, cluster_id)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next cell continues to poll the jobs until they are all finished, downloads the results, and deletes the cluster."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import time\n",
"from helpers.gke_helper import delete_cluster\n",
"\n",
"while not gke_search.done():\n",
" n_done = len([d for d in gke_search.dones.values() if d])\n",
" print('{}/{} finished'.format(n_done, len(gke_search.job_names)))\n",
" time.sleep(60)\n",
"\n",
"delete_cluster(project_id, zone, cluster_id)\n",
"result = gke_search.result(download=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Restore the GKEParallel object\n",
"\n",
"To restore the fitted `gke_search object` (for example from a different notebook), you can use the helper function included in this sample."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from helpers.gcs_helper import download_uri_and_unpickle\n",
"gcs_uri = 'gs://YOUR-BUCKET-NAME/YOUR-CLUSTER-ID.YOUR-IMAGE-NAME.UNIX-TIME/gke_search.pkl'\n",
"gke_search_restored = download_uri_and_unpickle(gcs_uri)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Inspect the result\n",
"\n",
"`GKEParallel` also implements part of the interface of `GridSearchCV` to allow easy access to `best_score+`, `best_param_`, and `beat_estimator_`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"gke_search.best_score_, gke_search.best_params_, gke_search.best_estimator_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also call `predict()`, which deligates the call to the `best_estimator_`.\n",
"\n",
"Below we calculate the accuracy on the 10000 test images."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"predicted = gke_search.predict(mnist.data[60000:])\n",
"\n",
"print(len([p for i, p in enumerate(predicted) if p == mnist.target[60000:][i]]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Clean up\n",
"\n",
"To clean up, delete the cluster so your project will no longer be charged for VM instance usage. The simplest way to delete the cluster is through the console:\n",
"https://console.cloud.google.com/kubernetes/list\n",
"\n",
"This will not delete any data persisted on Cloud Storage."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 2
}