# Explore and Prepare Data for SageMaker DataWrangler

---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. 

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/sagemaker-datawrangler|joined-dataflow|explore_data.ipynb)

---

----
## Background
In this notebook, we download and explore the data that is used to build the SageMaker DataWrangler flow file for data processing. After running this notebook, you can follow the [README.md](README.md) for step-by-step instructions on how to write the SageMaker DataWrangler .flow file.

In [None]:
# update pandas to avoid data type issues in older 1.0 version
!pip install pandas --upgrade --quiet
import pandas as pd

print(pd.__version__)

In [None]:
# create the data folder (-p avoids an error if it already exists)
!mkdir -p data

<a id='get-the-data'></a>

## Prerequisites: Get Data 

----

Here, we download the music data from a public S3 bucket. We then upload it to your default S3 bucket, which was created for you when you initially created a SageMaker Studio workspace. 

First, we import the necessary Python libraries and set up the environment.

In [None]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt

import json
import sagemaker
import boto3
import os
from awscli.customizations.s3.utils import split_s3_bucket_key

# SageMaker session
sess = sagemaker.Session()
# get session bucket name
bucket = sess.default_bucket()
# bucket prefix or the subfolder for everything we produce
prefix = "music-recommendation-demo"
# s3 client
s3_client = boto3.client("s3")

print(f"this is your default SageMaker Studio bucket name: {bucket}")

In [None]:
# define the function that downloads, subsamples, and uploads the data
def get_data(public_s3_data, to_bucket, sample_data=1):
    new_paths = []
    for f in public_s3_data:
        bucket_name, key_name = split_s3_bucket_key(f)
        filename = f.split("/")[-1]
        new_path = "s3://{}/{}/input/{}".format(to_bucket, prefix, filename)
        new_paths.append(new_path)

        # only download if not already downloaded
        if not os.path.exists("./data/{}".format(filename)):
            # download s3 data
            print("Downloading file from {}".format(f))
            s3_client.download_file(bucket_name, key_name, "./data/{}".format(filename))

        # subsample the data to create a smaller dataset for this demo
        new_df = pd.read_csv("./data/{}".format(filename))
        new_df = new_df.sample(frac=sample_data)
        new_df.to_csv("./data/{}".format(filename), index=False)

        # upload s3 data to our default s3 bucket for SageMaker Studio
        print("Uploading {} to {}\n".format(filename, new_path))
        s3_client.upload_file(
            "./data/{}".format(filename), to_bucket, "{}/input/{}".format(prefix, filename)
        )

    return new_paths
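
The function above relies on `split_s3_bucket_key`, a helper imported from the AWS CLI package, to separate the bucket name from the object key. As a rough illustration of what that step does, the sketch below performs a similar split using only the Python standard library. The name `split_s3_uri` is hypothetical and is not part of the notebook or of any AWS library.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3://bucket/key URI into (bucket, key). Hypothetical helper."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri}")
    # netloc holds the bucket name; path holds the key with a leading slash
    return parsed.netloc, parsed.path.lstrip("/")


bucket_name, key_name = split_s3_uri(
    "s3://sagemaker-sample-files/datasets/tabular/synthetic-music/tracks.csv"
)
print(bucket_name)
print(key_name)
```

A standard-library version like this may be preferable where the `awscli` package is not installed, though it does not attempt to cover every case the AWS CLI helper handles.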

In [None]:
# public S3 bucket that contains our music data
s3_bucket_music_data = "s3://sagemaker-sample-files/datasets/tabular/synthetic-music"

In [None]:
new_data_paths = get_data(
    [f"{s3_bucket_music_data}/tracks.csv", f"{s3_bucket_music_data}/ratings.csv"],
    bucket,
    sample_data=0.70,
)
print(new_data_paths)
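
With `sample_data=0.70`, `get_data` keeps about 70% of the rows of each file. Because the sampling is not seeded, a different subset is drawn on every run. If repeatable results matter, one option is to pass `random_state` to `DataFrame.sample`. The sketch below shows the idea on a small toy DataFrame; the toy values and the seed `42` are assumptions made for illustration only.

```python
import pandas as pd

# Toy stand-in for a downloaded CSV; the values here are made up.
toy_df = pd.DataFrame({"trackId": range(10), "genre": ["Rock", "Pop"] * 5})

# random_state fixes the random draw, so repeated runs select the same rows.
first_sample = toy_df.sample(frac=0.70, random_state=42)
second_sample = toy_df.sample(frac=0.70, random_state=42)

print(len(first_sample))
print(first_sample.equals(second_sample))
```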

In [None]:
# these are the new file paths located on your SageMaker Studio default s3 storage bucket
tracks_data_source = f"s3://{bucket}/{prefix}/input/tracks.csv"
ratings_data_source = f"s3://{bucket}/{prefix}/input/ratings.csv"

<a id='explore-data'></a>

## Explore the Data




----

In this section, we perform preliminary data exploration to understand the data.

In [None]:
tracks = pd.read_csv("./data/tracks.csv")
ratings = pd.read_csv("./data/ratings.csv")

We use the [pandas DataFrame head function](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) to view the first five rows in each of the dataframes.

In [None]:
tracks.head()

In [None]:
ratings.head()

In [None]:
# print the numbers of unique songs/tracks, users and user rating events
print("{:,} different songs/tracks".format(tracks["trackId"].nunique()))
print("{:,} users".format(ratings["userId"].nunique()))
print("{:,} user rating events".format(ratings["ratingEventId"].nunique()))
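
Alongside unique counts, a quick look at missing values and duplicate rows can help decide which cleaning steps a data flow might need. The following is a minimal sketch on a toy DataFrame. The `summarize` helper and the toy values are hypothetical; only the `trackId` and `genre` column names come from this notebook.

```python
import pandas as pd


def summarize(df):
    """Return basic size, missing-value, and duplicate facts about a DataFrame."""
    return {
        "rows": len(df),
        "columns": list(df.columns),
        "missing_per_column": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Toy data with one missing genre and one fully duplicated row.
toy_tracks = pd.DataFrame(
    {
        "trackId": ["t1", "t2", "t3", "t3"],
        "genre": ["Rock", None, "Jazz", "Jazz"],
    }
)

summary = summarize(toy_tracks)
print(summary)
```

The same helper could be called on `tracks` and `ratings` once they are loaded.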

In [None]:
# plot a bar chart to display the number of tracks per genre to see the distribution
tracks.groupby("genre")["genre"].count().plot.bar(title="Tracks by Genre");

In [None]:
# plot a histogram of the number of ratings per user
ratings.groupby("userId")["ratingEventId"].count().plot.hist(
    bins=50, title="Distribution of # of Ratings by User"
);
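
To make the quantity behind a ratings-per-user histogram concrete, the sketch below counts rating events per user on a toy DataFrame. The toy values are made up for illustration; the `ratingEventId` and `userId` column names match those used in this notebook.

```python
import pandas as pd

# Toy rating events: u1 has three ratings, u2 has two, u3 has one.
toy_ratings = pd.DataFrame(
    {
        "ratingEventId": ["e1", "e2", "e3", "e4", "e5", "e6"],
        "userId": ["u1", "u1", "u1", "u2", "u2", "u3"],
    }
)

# One row per user, holding that user's number of rating events.
ratings_per_user = toy_ratings.groupby("userId")["ratingEventId"].count()

print(ratings_per_user.to_dict())
print(ratings_per_user.describe())
```

A histogram of `ratings_per_user` then shows how many users fall into each ratings-count bin.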

----

After you have finished running this notebook, you can follow the steps in the [README.md](README.md) to start building the DataWrangler flow file.

## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.

![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/sagemaker-datawrangler|joined-dataflow|explore_data.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/sagemaker-datawrangler|joined-dataflow|explore_data.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/sagemaker-datawrangler|joined-dataflow|explore_data.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/sagemaker-datawrangler|joined-dataflow|explore_data.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/sagemaker-datawrangler|joined-dataflow|explore_data.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/sagemaker-datawrangler|joined-dataflow|explore_data.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/sagemaker-datawrangler|joined-dataflow|explore_data.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/sagemaker-datawrangler|joined-dataflow|explore_data.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/sagemaker-datawrangler|joined-dataflow|explore_data.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/sagemaker-datawrangler|joined-dataflow|explore_data.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/sagemaker-datawrangler|joined-dataflow|explore_data.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/sagemaker-datawrangler|joined-dataflow|explore_data.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/sagemaker-datawrangler|joined-dataflow|explore_data.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/sagemaker-datawrangler|joined-dataflow|explore_data.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/sagemaker-datawrangler|joined-dataflow|explore_data.ipynb)
