machine_learning/cloud_ai_platform/bigquery_ml.ipynb (493 lines of code) (raw):

{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "bigquery-ml", "version": "0.3.2", "provenance": [], "include_colab_link": true }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "view-in-github", "colab_type": "text" }, "source": [ "<a href=\"https://colab.research.google.com/github/GoogleCloudPlatform/gcp-getting-started-lab-jp/blob/master/machine_learning/cloud_ai_platform/bigquery_ml.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" ] }, { "cell_type": "markdown", "metadata": { "id": "ufnRHzX6NKqi", "colab_type": "text" }, "source": [ "---\n", "```\n", "Copyright 2019 Google LLC\n", "\n", "Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at\n", "\n", "https://www.apache.org/licenses/LICENSE-2.0\n", "\n", "Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.\n", "```\n", "---" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "aklJxkHBD5aR" }, "source": [ "# BigQuery ML で出生児の体重を予測\n", "\n", "\n", "BigQueryの [Natality データセット](https://bigquery.cloud.google.com/table/bigquery-public-data:samples.natality)を使用して出生児の体重を予測します。\n", "\n", "このデータセットには1969年から2008年までの米国の[出生に関する詳細](https://bigquery.cloud.google.com/table/publicdata:samples.natality?tab=details)が含まれています。\n", "\n", "BigQueryの詳細は [BigQuery ドキュメント](https://cloud.google.com/bigquery/docs) および [ライブラリリファレンスドキュメント](https://googleapis.github.io/google-cloud-python/latest/bigquery/usage/index.html)を参照してください。\n", "\n", "## 事前準備\n", "\n", "\n", "1. まだ作成していない場合は [Google Cloud Platform プロジェクト](https://console.cloud.google.com/cloud-resource-manager)を作成します。 \n", "2. [課金設定](https://support.google.com/cloud/answer/6293499#enable-billing) を有効にします。\n", "3. [BigQuery API](https://console.cloud.google.com/flows/enableapi?apiid=bigquery) を有効にします。" ] }, { "cell_type": "markdown", "metadata": { "id": "vomLfcVGsqGF", "colab_type": "text" }, "source": [ "### Google アカウントの認証を実行します。\n", "下記のコードを実行すると、認証コードを取得するための画面へのリンクが表示されるので、そのリンク先へアクセスし、BigQuery への権限を持つアカウントで認証します。その後の画面で表示される認証コードをコピーして Colaboratory のテキストエリアへ入力します。" ] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "BhUiclqCD5aT", "colab": {} }, "source": [ "from google.colab import auth\n", "auth.authenticate_user()\n", "print('認証されました。')" ], "execution_count": 0, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "ymwkPXGmveHh", "colab_type": "text" }, "source": [ "### プロジェクト ID を設定します" ] }, { "cell_type": "code", "metadata": { "id": "WWsnOalbNgQT", "colab_type": "code", "cellView": "form", "colab": {} }, "source": [ "project_id = 'your-project-id' #@param {type:\"string\"}" ], "execution_count": 0, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "n3oRMPjTdSFv", "colab_type": "text" }, "source": [ "## データセットの確認\n", "\n", "モデルを作成する前に、 Natality データセットの内容を確認していきましょう。" ] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "YrqwSv6vD5aZ", "colab": {} }, "source": [ "%%bigquery --project {project_id} data\n", "\n", "SELECT *\n", "FROM\n", " publicdata.samples.natality\n", "WHERE\n", " year > 2000\n", " AND gestation_weeks > 0\n", " AND mother_age > 0\n", " AND plurality > 0\n", " AND weight_pounds > 0\n", "LIMIT 500" ], "execution_count": 0, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "U4QI03iLik_V", "colab_type": "code", "colab": {} }, "source": [ "import seaborn\n", "from matplotlib import pyplot as plt\n", "\n", "fg = seaborn.FacetGrid(data=data, hue='plurality', size=6,aspect=1.67)\n", "fg = fg.map(plt.scatter, 'mother_age' ,'weight_pounds').add_legend()\n", "fg = fg.set_axis_labels(x_var=\"Mother's age\", y_var=\"Weight pounds\")" ], "execution_count": 0, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "ANAJi24Zi6cP", "colab_type": "code", "colab": {} }, "source": [ "_ = data.hist(column='weight_pounds',by='is_male', layout=(1,2),\n", " sharey=True, sharex=True)" ], "execution_count": 0, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "LaRWrpEnjY3m", "colab_type": "code", "colab": {} }, "source": [ "import numpy as np\n", "\n", "x = data.gestation_weeks\n", "y = data.weight_pounds\n", "data.plot(kind=\"scatter\",x=\"gestation_weeks\",y=\"weight_pounds\",\n", " figsize=[10,6], ylim=0, xlim=20)\n", "z = np.polyfit(x, y, 1)\n", "p = np.poly1d(z)\n", "plt.plot(x,p(x),\"r\")\n", "plt.title(\"Weight pounds by Gestation Weeks.\")\n", "plt.show()" ], "execution_count": 0, "outputs": [] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "ep02J20pD5ai" }, "source": [ "## 説明変数を決定する\n", "\n", "データセットを見てみると、出生時体重を適切に予測するために活用できそうな、いくつかの列があります。BQMLでは、すべての文字列はカテゴリ値と見なされ、すべての数値型は連続値と見なされます。" ] }, { "cell_type": "code", "metadata": { "id": "RO3JRNcfObqf", "colab_type": "code", "colab": {} }, "source": [ "%%bigquery --project {project_id} _df\n", "\n", "SELECT\n", " weight_pounds, -- 出生児の体重 (目的変数)\n", " is_male, -- 出生児の性別\n", " mother_age, -- 母親の年齢\n", " plurality, -- 同時に出生した数\n", " gestation_weeks -- 妊娠週\n", "FROM\n", " publicdata.samples.natality\n", "WHERE\n", " year > 2000\n", " AND gestation_weeks > 0\n", " AND mother_age > 0\n", " AND plurality > 0\n", " AND weight_pounds > 0\n", "LIMIT 10" ], "execution_count": 0, "outputs": [] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "H-d7C8KcD5am" }, "source": [ "## モデルのトレーニング\n", "\n", "予測に使用する列を選択することで、BigQueryでモデルを作成(トレーニング)することが可能になります。まず、モデルを保存するためのデータセットが必要になります。 (エラーが発生した場合は、BigQueryコンソールからデータセットを作成してください)。" ] }, { "cell_type": "code", "metadata": { "id": "2cXYosci1lc_", "colab_type": "code", "colab": {} }, "source": [ "%%bash -s \"$project_id\"\n", "\n", "gcloud config set project $1\n", "bq --location=US mk -d demo" ], "execution_count": 0, "outputs": [] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "T9JZIQ9nD5ap" }, "source": [ "デモデータセットの準備が整ったら、線形回帰モデルを作成してモデルを訓練することができます。\n", "実行には約 **3分** かかります。" ] }, { "cell_type": "code", "metadata": { "id": "mSIUh0-vPKfE", "colab_type": "code", "colab": {} }, "source": [ "%%bigquery --project {project_id}\n", "\n", "-- 線形モデルを作成する\n", "CREATE or REPLACE MODEL demo.babyweight_model\n", "OPTIONS\n", " (\n", " model_type='linear_reg', -- モデルの種類を指定する\n", " input_label_cols=['weight_pounds'], -- 目的変数を指定する \n", " data_split_method='AUTO_SPLIT' -- データ分割方法を指定する\n", " ) AS \n", "SELECT\n", " weight_pounds, -- 出生児の体重 (目的変数)\n", " is_male, -- 出生児の性別\n", " mother_age, -- 母親の年齢\n", " plurality, -- 同時に出生した数\n", " gestation_weeks -- 妊娠週\n", "FROM\n", " publicdata.samples.natality -- natality データ・セットを指定する\n", "WHERE\n", " year > 2000\n", " AND gestation_weeks > 0\n", " AND mother_age > 0\n", " AND plurality > 0\n", " AND weight_pounds > 0;" ], "execution_count": 0, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "mZEn7PM7XUap", "colab_type": "text" }, "source": [ "### 学習済みモデルの学習結果を確認する" ] }, { "cell_type": "code", "metadata": { "id": "KLuDY7oyRVKO", "colab_type": "code", "colab": {} }, "source": [ "%%bigquery --project {project_id}\n", "\n", "SELECT * FROM ML.TRAINING_INFO(MODEL demo.babyweight_model)" ], "execution_count": 0, "outputs": [] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "mJYtGxWyD5at" }, "source": [ "### 学習済みモデルの性能を確認する" ] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "ZbV6aDcRD5au", "colab": {} }, "source": [ "%%bigquery --project {project_id}\n", "\n", "select * from ML.EVALUATE(MODEL demo.babyweight_model);" ], "execution_count": 0, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "ApiYdQJaXY8p", "colab_type": "text" }, "source": [ "### 学習済みモデルのパラメータを確認する" ] }, { "cell_type": "code", "metadata": { "id": "Hb3k9GqKXCLQ", "colab_type": "code", "colab": {} }, "source": [ "%%bigquery --project {project_id}\n", "\n", "SELECT * FROM ML.WEIGHTS(MODEL demo.babyweight_model)" ], "execution_count": 0, "outputs": [] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "9QPwMA_ZD5a3" }, "source": [ "## BQML モデルで予測を実行" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "9QBZxMb6D5a5" }, "source": [ "訓練されたモデルで値を予測することが可能になりました。\n", "\n", "`ml.predict`関数を利用すると、モデルの出力予測列名は` predicted_ <label_column_name> `になります。" ] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "x-hedqSeD5a6", "colab": {} }, "source": [ "%%bigquery --project {project_id}\n", "\n", "WITH temp_data AS (\n", " SELECT\n", " weight_pounds,\n", " is_male,\n", " mother_age,\n", " plurality AS plurality,\n", " gestation_weeks\n", " FROM\n", " publicdata.samples.natality\n", " WHERE\n", " year > 2000\n", " AND gestation_weeks > 0\n", " AND mother_age > 0\n", " AND plurality > 0\n", " AND weight_pounds > 0\n", " LIMIT 10\n", ")\n", "\n", "SELECT * FROM ML.PREDICT(MODEL demo.babyweight_model,\n", " (SELECT * FROM temp_data))" ], "execution_count": 0, "outputs": [] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "EGo8fj9qD5bU" }, "source": [ "28際の母親から38週で生まれた出生児の体重を以下のように予測してみます。" ] }, { "cell_type": "code", "metadata": { "id": "f5BT2nsYWcpJ", "colab_type": "code", "colab": {} }, "source": [ "%%bigquery --project {project_id}\n", "SELECT\n", " *\n", "FROM\n", " ml.PREDICT(MODEL demo.babyweight_model,\n", " (SELECT\n", " TRUE AS is_male,\n", " 28 AS mother_age,\n", " 1 AS plurality,\n", " 38 AS gestation_weeks))" ], "execution_count": 0, "outputs": [] } ] }