{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Explore and Prepare Data for SageMaker DataWrangler"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n",
"\n",
"\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"----\n",
"## Background\n",
"In this notebook, we download and explore the data that is used to build the SageMaker DataWrangler flow file for data processing. After running this notebook, you can follow the [README.md](README.md) for the step by step instructions how to write the SageMaker DataWrangler .flow file"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# update pandas to avoid data type issues in older 1.0 version\n",
"!pip install pandas --upgrade --quiet\n",
"import pandas as pd\n",
"\n",
"print(pd.__version__)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# create data folder\n",
"!mkdir data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='get-the-data'></a>\n",
"\n",
"## Prerequisites: Get Data \n",
"\n",
"----\n",
"\n",
"Here, we download the music data from a public S3 bucket. We then upload it to your default S3 bucket, which was created for you when you initially created a SageMaker Studio workspace. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, we import the necessary python libraries and set up the environment"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"\n",
"import json\n",
"import sagemaker\n",
"import boto3\n",
"import os\n",
"from awscli.customizations.s3.utils import split_s3_bucket_key\n",
"\n",
"# Sagemaker session\n",
"sess = sagemaker.Session()\n",
"# get session bucket name\n",
"bucket = sess.default_bucket()\n",
"# bucket prefix or the subfolder for everything we produce\n",
"prefix = \"music-recommendation-demo\"\n",
"# s3 client\n",
"s3_client = boto3.client(\"s3\")\n",
"\n",
"print(f\"this is your default SageMaker Studio bucket name: {bucket}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# define the functions that will be used to download data\n",
"def get_data(public_s3_data, to_bucket, sample_data=1):\n",
" new_paths = []\n",
" for f in public_s3_data:\n",
" bucket_name, key_name = split_s3_bucket_key(f)\n",
" filename = f.split(\"/\")[-1]\n",
" new_path = \"s3://{}/{}/input/{}\".format(to_bucket, prefix, filename)\n",
" new_paths.append(new_path)\n",
"\n",
" # only download if not already downloaded\n",
" if not os.path.exists(\"./data/{}\".format(filename)):\n",
" # download s3 data\n",
" print(\"Downloading file from {}\".format(f))\n",
" s3_client.download_file(bucket_name, key_name, \"./data/{}\".format(filename))\n",
"\n",
" # subsample the data to create a smaller datatset for this demo\n",
" new_df = pd.read_csv(\"./data/{}\".format(filename))\n",
" new_df = new_df.sample(frac=sample_data)\n",
" new_df.to_csv(\"./data/{}\".format(filename), index=False)\n",
"\n",
" # upload s3 data to our default s3 bucket for SageMaker Studio\n",
" print(\"Uploading {} to {}\\n\".format(filename, new_path))\n",
" s3_client.upload_file(\n",
" \"./data/{}\".format(filename), to_bucket, os.path.join(prefix, \"input\", filename)\n",
" )\n",
"\n",
" return new_paths"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# public S3 bucket that contains our music data\n",
"s3_bucket_music_data = \"s3://sagemaker-sample-files/datasets/tabular/synthetic-music\""
]
},
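{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally (an addition, not part of the original walkthrough), we can list what is available under the public prefix before downloading. The sketch below reuses the `split_s3_bucket_key` utility imported above together with boto3's `list_objects_v2`; the `pub_bucket` and `pub_prefix` names are our own."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# optional sanity check: list the files available under the public prefix\n",
"pub_bucket, pub_prefix = split_s3_bucket_key(s3_bucket_music_data)\n",
"response = s3_client.list_objects_v2(Bucket=pub_bucket, Prefix=pub_prefix)\n",
"for obj in response.get(\"Contents\", []):\n",
"    print(obj[\"Key\"], obj[\"Size\"])"
]
},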
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"new_data_paths = get_data(\n",
" [f\"{s3_bucket_music_data}/tracks.csv\", f\"{s3_bucket_music_data}/ratings.csv\"],\n",
" bucket,\n",
" sample_data=0.70,\n",
")\n",
"print(new_data_paths)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# these are the new file paths located on your SageMaker Studio default s3 storage bucket\n",
"tracks_data_source = f\"s3://{bucket}/{prefix}/input/tracks.csv\"\n",
"ratings_data_source = f\"s3://{bucket}/{prefix}/input/ratings.csv\""
]
},
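{
"cell_type": "markdown",
"metadata": {},
"source": [
"As another optional check (an addition to the original flow), the sketch below confirms that both files now exist in your default bucket by calling `head_object` on each expected key."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# optional: confirm the uploads exist in the default bucket\n",
"for key in [f\"{prefix}/input/tracks.csv\", f\"{prefix}/input/ratings.csv\"]:\n",
"    meta = s3_client.head_object(Bucket=bucket, Key=key)\n",
"    print(key, meta[\"ContentLength\"], \"bytes\")"
]
},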
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='explore-data'></a>\n",
"\n",
"## Explore the Data\n",
"\n",
"\n",
"##### [back to top](#00-nb)\n",
"\n",
"\n",
"----\n",
"\n",
"In this section, we perform preliminary data exploration to understand the data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tracks = pd.read_csv(\"./data/tracks.csv\")\n",
"ratings = pd.read_csv(\"./data/ratings.csv\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We use the [pandas DataFrame head function](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) to view the first five rows in each of the dataframes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tracks.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ratings.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# print the numbers of unique songs/tracks, users and user rating events\n",
"print(\"{:,} different songs/tracks\".format(tracks[\"trackId\"].nunique()))\n",
"print(\"{:,} users\".format(ratings[\"userId\"].nunique()))\n",
"print(\"{:,} user rating events\".format(ratings[\"ratingEventId\"].nunique()))"
]
},
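{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before building the DataWrangler flow, it can help to verify column types and completeness. The cell below is an optional sketch of one way to do that with plain pandas."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# optional: inspect shape and missing values for each dataframe\n",
"for name, df in [(\"tracks\", tracks), (\"ratings\", ratings)]:\n",
"    print(f\"{name}: {df.shape[0]:,} rows, {df.shape[1]} columns\")\n",
"    missing = df.isnull().sum()\n",
"    print(missing[missing > 0] if missing.any() else \"no missing values\", \"\\n\")"
]
},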
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# plot a bar chart to display the number of tracks per genre to see the distribution\n",
"tracks.groupby(\"genre\")[\"genre\"].count().plot.bar(title=\"Tracks by Genre\");"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# plot the histogram to view the distribution of the number of ratings by user id\n",
"ratings[[\"ratingEventId\", \"userId\"]].plot.hist(\n",
" by=\"userId\", bins=50, title=\"Distribution of # of Ratings by User\"\n",
");"
]
},
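{
"cell_type": "markdown",
"metadata": {},
"source": [
"The histogram shows the shape of the distribution; summary statistics pin down typical per-user rating activity. The optional cell below computes them with a pandas groupby (`ratings_per_user` is our own helper name)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# optional: summary statistics for the number of ratings per user\n",
"ratings_per_user = ratings.groupby(\"userId\")[\"ratingEventId\"].count()\n",
"print(ratings_per_user.describe())"
]
},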
{
"cell_type": "markdown",
"metadata": {},
"source": [
"----\n",
"\n",
"After you completed running this notebook, you can follow the steps in the README to start building the DataWrangler flow file."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Notebook CI Test Results\n",
"\n",
"This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n"
]
}
],
"metadata": {
"instance_type": "ml.t3.medium",
"kernelspec": {
"display_name": "Python 3 (Data Science)",
"language": "python",
"name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.10"
}
},
"nbformat": 4,
"nbformat_minor": 4
}