tutorials/R/01_EDA-with-R-and-BigQuery.ipynb
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exploratory Data Analysis with R and BigQuery\n",
"## Overview\n",
"\n",
"This notebook illustrates how to perform Exploratory Data Analysis (EDA) using R, on data extracted from BigQuery. After you analyze and process the data, the transformed data is stored in Cloud Storage for further ML tasks.\n",
"\n",
"R is one of the most widely used programming languages for statistical modeling. It has a large and active community of data scientists and machine learning (ML) professionals. With over 15,000 packages in the open-source repository of CRAN, R has tools for all statistical data analysis applications, ML, and visualization.\n",
"\n",
"## Dataset\n",
"The dataset used in this tutorial is the BigQuery natality dataset. This public dataset includes information about more than 137 million births registered in the United States from 1969 to 2008. The dataset is available [here](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=samples&t=natality&page=table&_ga=2.99329886.-1705629017.1551465326&_gac=1.109796023.1561476396.CI2rz-z4hOMCFc6RhQods4oEXA).\n",
"\n",
"In this notebook, we focus on exploratory data analysis and visualization using R and BigQuery, with an eye toward a potential machine learning goal of predicting a baby's weight given a number of factors about the pregnancy and about the baby's mother.\n",
"\n",
"## Objective\n",
"The goal of this tutorial is to:\n",
"1. Create an AI Platform Notebooks instance that has R support.\n",
"2. Query and analyze data from BigQuery using the bigrquery R library.\n",
"3. Prepare and store data for ML in Cloud Storage.\n",
"\n",
"## Costs\n",
"This tutorial uses the following billable components of Google Cloud:\n",
"1. BigQuery\n",
"2. AI Platform Notebooks instance\n",
"3. Cloud Storage\n",
"\n",
"Use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"version"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 0. Setup "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Import the required libraries, all of which have been pre-installed in an R notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import R libraries\n",
"library(bigrquery) # used for querying BigQuery\n",
"library(ggplot2) # used for visualization\n",
"library(dplyr) # used for data wrangling"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use BigQuery out-of-band authentication"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"bq_auth(use_oob = TRUE)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Set a variable to the name of the project that you want to use for this tutorial."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Set the project ID\n",
"PROJECT_ID <- \"gcp-data-science-demo\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Set a variable to the name of the Cloud Storage bucket that you want to use."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Set your GCS bucket\n",
"BUCKET_NAME <- \"r-on-gcp\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Querying Data from BigQuery "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.1. Prepare the BigQuery query"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sql_query <- \"\n",
" SELECT\n",
" ROUND(weight_pounds, 2) AS weight_pounds ,\n",
" is_male,\n",
" mother_age,\n",
" plurality,\n",
" gestation_weeks,\n",
" cigarette_use,\n",
" alcohol_use,\n",
" CAST(ABS(FARM_FINGERPRINT(CONCAT(\n",
" CAST(YEAR AS STRING), CAST(month AS STRING), \n",
" CAST(weight_pounds AS STRING)))\n",
" ) AS STRING) AS key\n",
" FROM\n",
" publicdata.samples.natality\n",
" WHERE \n",
" year > 2000\n",
" AND weight_pounds > 0\n",
" AND mother_age > 0\n",
" AND plurality > 0\n",
" AND gestation_weeks > 0\n",
" AND month > 0\n",
" LIMIT %s\n",
"\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.2. Execute the query \n",
"The data will be retreived from BigQuery, and the results will be stored in an in-memory dataframe."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sample_size <- 10000\n",
"sql_query <- sprintf(sql_query, sample_size)\n",
"\n",
"natality_data <- bq_table_download(\n",
" bq_project_query(\n",
" PROJECT_ID, \n",
" query = sql_query\n",
" )\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.3. View the query results"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# View the query result\n",
"head(natality_data)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Show # of rows and data types of each column\n",
"str(natality_data)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# View the results summary\n",
"summary(natality_data)"
]
},
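{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional data-quality check (a small addition beyond the core flow, using base R only), count the missing values in each column. Fields such as `cigarette_use` and `alcohol_use` may be `NA` for many records."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Count the number of missing (NA) values in each column\n",
"sapply(natality_data, function(col) sum(is.na(col)))"
]
},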
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 2. Visualizing retrieved data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Display the distribution of baby weights using a histogram\n",
"ggplot(\n",
" data = natality_data, \n",
" aes(x = weight_pounds)\n",
") + geom_histogram(bins = 200)"
]
},
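{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because the eventual ML goal is to predict `weight_pounds`, it can also help to see how the weight distribution varies with a candidate predictor. As one optional sketch (an addition to the original flow), overlay weight densities split by the `is_male` flag."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Compare the distribution of baby weights by sex\n",
"ggplot(\n",
"    data = natality_data,\n",
"    aes(x = weight_pounds, fill = is_male)\n",
") + geom_density(alpha = 0.3)"
]
},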
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Display the relationship between gestation weeks and baby weights \n",
"ggplot(\n",
" data = natality_data, \n",
" aes(x = gestation_weeks, y = weight_pounds)\n",
") + geom_point() + geom_smooth(method = \"lm\")"
]
},
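{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `dplyr` library imported earlier can perform the same kind of aggregation on the in-memory sample. As a brief optional sketch, compute the number of records and the average weight per gestation week; the next section pushes the equivalent computation down to BigQuery, so that it scales to the full table."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Aggregate the in-memory sample with dplyr: record count and\n",
"# average weight for each value of gestation_weeks\n",
"natality_data %>%\n",
"    group_by(gestation_weeks) %>%\n",
"    summarise(num_babies = n(), avg_wt = mean(weight_pounds))"
]
},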
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Performing the processing in BigQuery\n",
"Create function that finds the number of records and the average weight for each value of the chosen column"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"get_distinct_values <- function(column_name) {\n",
" query <- paste0(\n",
" 'SELECT ', column_name, ', \n",
" COUNT(1) AS num_babies,\n",
" AVG(weight_pounds) AS avg_wt\n",
" FROM publicdata.samples.natality\n",
" WHERE year > 2000\n",
" GROUP BY ', column_name)\n",
" \n",
" bq_table_download(\n",
" bq_project_query(\n",
" PROJECT_ID, \n",
" query = query\n",
" )\n",
" )\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df <- get_distinct_values('is_male')\n",
"\n",
"ggplot(data = df, aes(x = is_male, y = num_babies)\n",
") + geom_col()\n",
"\n",
"ggplot(data = df, aes(x = is_male, y = avg_wt)\n",
") + geom_col()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df <- get_distinct_values('mother_age')\n",
"\n",
"ggplot(data = df, aes(x = mother_age, y = num_babies)\n",
") + geom_line()\n",
"\n",
"ggplot(data = df, aes(x = mother_age, y = avg_wt)\n",
") + geom_line()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df <- get_distinct_values('plurality')\n",
"\n",
"ggplot(data = df, aes(x = plurality, y = num_babies)\n",
") + geom_col() + scale_y_log10()\n",
"\n",
"ggplot(data = df, aes(x = plurality, y = avg_wt)\n",
") + geom_col()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df <- get_distinct_values('gestation_weeks')\n",
"\n",
"ggplot(data = df, aes(x = gestation_weeks, y = num_babies)\n",
") + geom_col() + scale_y_log10()\n",
"\n",
"ggplot(data = df, aes(x = gestation_weeks, y = avg_wt)\n",
") + geom_col()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 3. Saving the data as CSV to Google Cloud Storage"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Prepare training and evaluation data from BigQuery\n",
"sample_size <- 10000\n",
"sql_query <- sprintf(sql_query, sample_size)\n",
"\n",
"train_query <- paste('SELECT * FROM (', sql_query, \n",
" ') WHERE MOD(CAST(key AS INT64), 100) <= 75')\n",
"eval_query <- paste('SELECT * FROM (', sql_query,\n",
" ') WHERE MOD(CAST(key AS INT64), 100) > 75')\n",
"\n",
"# Load training data to data frame\n",
"train_data <- bq_table_download(\n",
" bq_project_query(\n",
" PROJECT_ID, \n",
" query = train_query\n",
" )\n",
")\n",
"\n",
"# Load evaluation data to data frame\n",
"eval_data <- bq_table_download(\n",
" bq_project_query(\n",
" PROJECT_ID, \n",
" query = eval_query\n",
" )\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(paste0(\"Training instances count: \", nrow(train_data)))\n",
"print(paste0(\"Evaluation instances count: \", nrow(eval_data)))"
]
},
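{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check (an addition beyond the core flow), confirm that the hash-based split produced roughly the intended 75/25 proportions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Fraction of records that landed in the training set\n",
"nrow(train_data) / (nrow(train_data) + nrow(eval_data))"
]
},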
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Write data frames to local CSV files, without headers or row names\n",
"dir.create(file.path('data'), showWarnings = FALSE)\n",
"write.table(train_data, \"data/train_data.csv\", \n",
" row.names = FALSE, col.names = FALSE, sep = \",\")\n",
"write.table(eval_data, \"data/eval_data.csv\", \n",
" row.names = FALSE, col.names = FALSE, sep = \",\")"
]
},
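{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, confirm that the CSV files were written to the local `data` directory before uploading them."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# List the local CSV files that were just written\n",
"list.files(\"data\", full.names = TRUE)"
]
},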
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Upload CSV data to Google Cloud Storage by passing gsutil commands to system\n",
"gcs_url <- paste0(\"gs://\", BUCKET_NAME, \"/\")\n",
"command <- paste(\"gsutil mb\", gcs_url)\n",
"system(command)\n",
"gcs_data_dir <- paste0(\"gs://\", BUCKET_NAME, \"/data\")\n",
"command <- paste(\"gsutil cp data/*_data.csv\", gcs_data_dir)\n",
"system(command)\n",
"command <- paste(\"gsutil ls -l\", gcs_data_dir)\n",
"system(command, intern = TRUE)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## License"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Authors: Daniel Sparing, Khalid Salama, Alok Pattani, Polong Lin\n",
"\n",
"---\n",
"**Disclaimer**: This is not an official Google product. The sample\n",
"code is provided for an educational purpose.\n",
"\n",
"---\n",
"\n",
"Copyright 2020 Google LLC\n",
"\n",
"Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"you may not use this file except in compliance with the License.\n",
"You may obtain a copy of the License at \n",
"http://www.apache.org/licenses/LICENSE-2.0.\n",
"\n",
"Unless required by applicable law or agreed to in writing, software\n",
"distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"See the License for the specific language governing permissions and\n",
"limitations under the License."
]
}
],
"metadata": {
"environment": {
"name": "r-cpu.3-6.m46",
"type": "gcloud",
"uri": "gcr.io/deeplearning-platform-release/r-cpu.3-6:m46"
},
"kernelspec": {
"display_name": "R",
"language": "R",
"name": "ir"
},
"language_info": {
"codemirror_mode": "r",
"file_extension": ".r",
"mimetype": "text/x-r-source",
"name": "R",
"pygments_lexer": "r",
"version": "3.6.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}