{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Explore and Prepare Data for SageMaker DataWrangler"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n",
"\n",
"\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"----\n",
"## Background\n",
"In this notebook, we download and explore the data that is used to build the SageMaker DataWrangler flow file for data processing. After running this notebook, you can follow the [README.md](README.md) for the step by step instructions how to write the SageMaker DataWrangler .flow file"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# update pandas to avoid data type issues in older 1.0 version\n",
"!pip install pandas --upgrade --quiet\n",
"import pandas as pd\n",
"\n",
"print(pd.__version__)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# create data folder\n",
"!mkdir data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='get-the-data'></a>\n",
"\n",
"## Prerequisites: Get Data \n",
"\n",
"----\n",
"\n",
"Here, we download the music data from a public S3 bucket. We then upload it to your default S3 bucket, which was created for you when you initially created a SageMaker Studio workspace. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, we import the necessary python libraries and set up the environment"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"\n",
"import json\n",
"import sagemaker\n",
"import boto3\n",
"import os\n",
"from awscli.customizations.s3.utils import split_s3_bucket_key\n",
"\n",
"# Sagemaker session\n",
"sess = sagemaker.Session()\n",
"# get session bucket name\n",
"bucket = sess.default_bucket()\n",
"# bucket prefix or the subfolder for everything we produce\n",
"prefix = \"music-recommendation-demo\"\n",
"# s3 client\n",
"s3_client = boto3.client(\"s3\")\n",
"\n",
"print(f\"this is your default SageMaker Studio bucket name: {bucket}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# define the functions that will be used to download data\n",
"def get_data(public_s3_data, to_bucket, sample_data=1):\n",
" new_paths = []\n",
" for f in public_s3_data:\n",
" bucket_name, key_name = split_s3_bucket_key(f)\n",
" filename = f.split(\"/\")[-1]\n",
" new_path = \"s3://{}/{}/input/{}\".format(to_bucket, prefix, filename)\n",
" new_paths.append(new_path)\n",
"\n",
" # only download if not already downloaded\n",
" if not os.path.exists(\"./data/{}\".format(filename)):\n",
" # download s3 data\n",
" print(\"Downloading file from {}\".format(f))\n",
" s3_client.download_file(bucket_name, key_name, \"./data/{}\".format(filename))\n",
"\n",
" # subsample the data to create a smaller datatset for this demo\n",
" new_df = pd.read_csv(\"./data/{}\".format(filename))\n",
" new_df = new_df.sample(frac=sample_data)\n",
" new_df.to_csv(\"./data/{}\".format(filename), index=False)\n",
"\n",
" # upload s3 data to our default s3 bucket for SageMaker Studio\n",
" print(\"Uploading {} to {}\\n\".format(filename, new_path))\n",
" s3_client.upload_file(\n",
" \"./data/{}\".format(filename), to_bucket, os.path.join(prefix, \"input\", filename)\n",
" )\n",
"\n",
" return new_paths"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# public S3 bucket that contains our music data\n",
"s3_bucket_music_data = \"s3://sagemaker-sample-files/datasets/tabular/synthetic-music\""
]
},
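{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally (an addition, not part of the original walkthrough), we can list what is available under the public prefix before downloading. The sketch below reuses the `split_s3_bucket_key` utility imported above together with boto3's `list_objects_v2`; the `pub_bucket` and `pub_prefix` names are our own."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# optional sanity check: list the files available under the public prefix\n",
"pub_bucket, pub_prefix = split_s3_bucket_key(s3_bucket_music_data)\n",
"response = s3_client.list_objects_v2(Bucket=pub_bucket, Prefix=pub_prefix)\n",
"for obj in response.get(\"Contents\", []):\n",
"    print(obj[\"Key\"], obj[\"Size\"])"
]
},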
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"new_data_paths = get_data(\n",
" [f\"{s3_bucket_music_data}/tracks.csv\", f\"{s3_bucket_music_data}/ratings.csv\"],\n",
" bucket,\n",
" sample_data=0.70,\n",
")\n",
"print(new_data_paths)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# these are the new file paths located on your SageMaker Studio default s3 storage bucket\n",
"tracks_data_source = f\"s3://{bucket}/{prefix}/input/tracks.csv\"\n",
"ratings_data_source = f\"s3://{bucket}/{prefix}/input/ratings.csv\""
]
},
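{
"cell_type": "markdown",
"metadata": {},
"source": [
"As another optional check (an addition to the original flow), the sketch below confirms that both files now exist in your default bucket by calling `head_object` on each expected key."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# optional: confirm the uploads exist in the default bucket\n",
"for key in [f\"{prefix}/input/tracks.csv\", f\"{prefix}/input/ratings.csv\"]:\n",
"    meta = s3_client.head_object(Bucket=bucket, Key=key)\n",
"    print(key, meta[\"ContentLength\"], \"bytes\")"
]
},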
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='explore-data'></a>\n",
"\n",
"## Explore the Data\n",
"\n",
"\n",
"##### [back to top](#00-nb)\n",
"\n",
"\n",
"----\n",
"\n",
"In this section, we perform preliminary data exploration to understand the data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tracks = pd.read_csv(\"./data/tracks.csv\")\n",
"ratings = pd.read_csv(\"./data/ratings.csv\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We use the [pandas DataFrame head function](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) to view the first five rows in each of the dataframes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tracks.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ratings.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# print the numbers of unique songs/tracks, users and user rating events\n",
"print(\"{:,} different songs/tracks\".format(tracks[\"trackId\"].nunique()))\n",
"print(\"{:,} users\".format(ratings[\"userId\"].nunique()))\n",
"print(\"{:,} user rating events\".format(ratings[\"ratingEventId\"].nunique()))"
]
},
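{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before building the DataWrangler flow, it can help to verify column types and completeness. The cell below is an optional sketch of one way to do that with plain pandas."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# optional: inspect shape and missing values for each dataframe\n",
"for name, df in [(\"tracks\", tracks), (\"ratings\", ratings)]:\n",
"    print(f\"{name}: {df.shape[0]:,} rows, {df.shape[1]} columns\")\n",
"    missing = df.isnull().sum()\n",
"    print(missing[missing > 0] if missing.any() else \"no missing values\", \"\\n\")"
]
},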
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# plot a bar chart to display the number of tracks per genre to see the distribution\n",
"tracks.groupby(\"genre\")[\"genre\"].count().plot.bar(title=\"Tracks by Genre\");"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# plot the histogram to view the distribution of the number of ratings by user id\n",
"ratings[[\"ratingEventId\", \"userId\"]].plot.hist(\n",
" by=\"userId\", bins=50, title=\"Distribution of # of Ratings by User\"\n",
");"
]
},
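{
"cell_type": "markdown",
"metadata": {},
"source": [
"The histogram shows the shape of the distribution; summary statistics pin down typical per-user rating activity. The optional cell below computes them with a pandas groupby (`ratings_per_user` is our own helper name)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# optional: summary statistics for the number of ratings per user\n",
"ratings_per_user = ratings.groupby(\"userId\")[\"ratingEventId\"].count()\n",
"print(ratings_per_user.describe())"
]
},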
{
"cell_type": "markdown",
"metadata": {},
"source": [
"----\n",
"\n",
"After you completed running this notebook, you can follow the steps in the README to start building the DataWrangler flow file."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Notebook CI Test Results\n",
"\n",
"This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n"
]
}
],
"metadata": {
"instance_type": "ml.t3.medium",
"kernelspec": {
"display_name": "Python 3 (Data Science)",
"language": "python",
"name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.10"
}
},
"nbformat": 4,
"nbformat_minor": 4
}