## Building your own Spotify Wrapped

In this notebook we will generate a custom version of Spotify Wrapped: the top artists, songs, and trends over the year, based on our downloadable Spotify personal history.

You can request your data from Spotify [via this link](https://www.spotify.com/uk/account/privacy/). Make sure to request your extended streaming history. This process can take up to a month, so expect to wait a few weeks before your JSON files are generated and sent to you. You can then add these files to the `data` folder to run the indexing process and build your own dashboard.

Alternatively, you can test the notebook with the mini sample data provided.

![](/img/spotify.png)

### Exploring Spotify Streaming Data

Once the data has been exported, we can take a look at it. Spotify provides some helpful metadata describing the format:

![](img/spotify%20schema.png)

Let's do a quick test to view our data, selecting only certain columns to keep some personal data private:

In [60]:
import pandas as pd
import json

cols = [
    "ts",
    "ms_played",
    "master_metadata_track_name",
    "master_metadata_album_artist_name",
    "master_metadata_album_album_name",
]
file_name = "data/sample_data.json"

with open(file_name, "r") as file:
    data = json.load(file)

df = pd.DataFrame(data=data, columns=cols)

In [54]:
df[0:5]

| | ts | ms_played | master_metadata_track_name | master_metadata_album_artist_name | master_metadata_album_album_name |
|---|---|---|---|---|---|
| 0 | 2023-07-04T12:42:02Z | 247000 | Little Lion Man | Mumford & Sons | Sigh No More (Benelux Edition) |
| 1 | 2023-07-04T12:44:09Z | 40375 | Girlfriend | Avril Lavigne | The Best Damn Thing (Expanded Edition) |
| 2 | 2023-07-04T12:47:22Z | 193226 | Better Love - From "The Legend Of Tarzan" Orig... | Hozier | Better Love |
| 3 | 2023-07-04T12:50:49Z | 206546 | Talk | Hozier | Wasteland, Baby! |
| 4 | 2023-07-04T12:59:55Z | 331274 | No Plan | Hozier | Wasteland, Baby! |
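
Before indexing anything, a quick local sanity check can be useful. The following is a minimal sketch, using only the standard library, that totals listening time per artist from records shaped like the export. The `records` list simply repeats the five sample rows shown above, and `total_ms_by_artist` is a hypothetical helper name. Skipping rows with no artist is an assumption about how non-music entries might appear.

```python
from collections import Counter

# The five sample rows shown above, reduced to the two fields we need.
records = [
    {"master_metadata_album_artist_name": "Mumford & Sons", "ms_played": 247000},
    {"master_metadata_album_artist_name": "Avril Lavigne", "ms_played": 40375},
    {"master_metadata_album_artist_name": "Hozier", "ms_played": 193226},
    {"master_metadata_album_artist_name": "Hozier", "ms_played": 206546},
    {"master_metadata_album_artist_name": "Hozier", "ms_played": 331274},
]


def total_ms_by_artist(rows):
    """Sum ms_played per artist, skipping rows that carry no artist name."""
    totals = Counter()
    for row in rows:
        artist = row.get("master_metadata_album_artist_name")
        if artist is not None:  # assumption: non-music rows have no artist
            totals[artist] += row.get("ms_played") or 0
    return totals


totals = total_ms_by_artist(records)
for artist, ms in totals.most_common():
    # 60,000 ms per minute
    print(f"{artist}: {ms / 60000:.1f} minutes")
```

The same helper could be pointed at the full list returned by `json.load` to cross-check the numbers that the Elasticsearch aggregations produce later on.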


## Connecting to your Elastic cluster

In [51]:
from elasticsearch import Elasticsearch, helpers
from getpass import getpass

# Connect to the Elastic Cloud deployment
ELASTIC_CLOUD_ID = getpass("Elastic Cloud ID: ")
ELASTIC_API_KEY = getpass("Elastic API Key: ")

# Create an Elasticsearch client using the provided credentials
client = Elasticsearch(
    cloud_id=ELASTIC_CLOUD_ID,  # cloud id can be found under deployment management
    api_key=ELASTIC_API_KEY,  # API key for connecting to Elastic, found under Deployments - Security
)

## Adding the documents into an Elasticsearch index

Once your data is available, you can add the files to a local folder. In my example, I put the JSON files covering my 5 years of listening history into the `data` folder.
For the purpose of this demo notebook, I have also added a simplified sample of my streaming data, with some fields hidden for data privacy, which can be used to run the following cells.

![](/img/data.png)

In [12]:
import json

index_name = "spotify-history"

# Create the Elasticsearch index with the specified name (delete if already existing)
if client.indices.exists(index=index_name):
    client.indices.delete(index=index_name)
client.indices.create(index=index_name)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'spotify-history-eli'})

We can now open these files with a JSON reader and generate documents for our Elasticsearch index directly from them.

In [3]:
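def generate_docs(dataset_path):
    """Load one JSON export file and bulk-index its records."""
    with open(dataset_path, "r") as f:
        # Each file holds a JSON array with one object per stream
        documents = json.load(f)
    # Relies on the `client` and `index_name` defined in the cells above
    helpers.bulk(client, documents, index=index_name)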

In [13]:
# Collect the path of every JSON file in the data folder
import os

directory = "data"
file_list = []
for filename in os.listdir(directory):
    f = os.path.join(directory, filename)
    if os.path.isfile(f) and f.endswith(".json"):
        file_list.append(f)

In [14]:
for DATASET_PATH in file_list:
    generate_docs(DATASET_PATH)
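
As an alternative sketch, the per-file loop can be expressed as a generator of bulk actions, so that the list of actions for all files is never held in memory at once (each individual file is still loaded whole by `json.load`). The name `iter_actions` is hypothetical. The self-check below writes a tiny temporary file rather than touching a real export or a live cluster.

```python
import json
import os
import tempfile


def iter_actions(paths, index_name):
    """Yield one bulk action per streaming record, file by file."""
    for path in paths:
        with open(path, "r", encoding="utf-8") as f:
            for doc in json.load(f):
                # "_index" and "_source" are the action keys the bulk helper reads
                yield {"_index": index_name, "_source": doc}


# Small self-check using a temporary file instead of a real export.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.json")
    with open(sample_path, "w", encoding="utf-8") as f:
        json.dump([{"ms_played": 247000}, {"ms_played": 40375}], f)
    actions = list(iter_actions([sample_path], "spotify-history"))

print(actions[0])

# With a live client, the generator could be handed to the bulk helper:
# helpers.bulk(client, iter_actions(file_list, index_name))
```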

Once the data is added to Elastic, a mapping is generated automatically. The data from Spotify is already clean enough that this default mapping is accurate and we don't need to pre-define it manually. The main detail to pay attention to is that text fields such as the artist name also get a `keyword` sub-field, which lets us run aggregations on them in the following steps.

Here's what it will look like on the Elastic side:

In [None]:
mappings = {
    "properties": {
        "conn_country": {
            "type": "text",
            "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
        },
        "episode_name": {
            "type": "text",
            "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
        },
        "episode_show_name": {
            "type": "text",
            "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
        },
        "incognito_mode": {"type": "boolean"},
        "ip_addr": {
            "type": "text",
            "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
        },
        "master_metadata_album_album_name": {
            "type": "text",
            "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
        },
        "master_metadata_album_artist_name": {
            "type": "text",
            "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
        },
        "master_metadata_track_name": {
            "type": "text",
            "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
        },
        "ms_played": {"type": "long"},
        "offline": {"type": "boolean"},
        "offline_timestamp": {"type": "long"},
        "platform": {
            "type": "text",
            "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
        },
        "reason_end": {
            "type": "text",
            "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
        },
        "reason_start": {
            "type": "text",
            "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
        },
        "shuffle": {"type": "boolean"},
        "skipped": {"type": "boolean"},
        "spotify_episode_uri": {
            "type": "text",
            "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
        },
        "spotify_track_uri": {
            "type": "text",
            "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
        },
        "ts": {"type": "date"},
    }
}
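
To check which fields can be used for `terms` aggregations, one option is to walk the mapping and list every field that has a `keyword` sub-field. This is a minimal sketch: `keyword_fields` is a hypothetical helper, and `sample_mappings` is a trimmed copy of the mapping above so that the snippet runs on its own.

```python
def keyword_fields(mappings):
    """Return the `<field>.keyword` names for fields with a keyword sub-field."""
    names = []
    for field, spec in mappings.get("properties", {}).items():
        sub_fields = spec.get("fields", {})
        if sub_fields.get("keyword", {}).get("type") == "keyword":
            names.append(f"{field}.keyword")
    return sorted(names)


# Trimmed copy of the mapping shown above.
sample_mappings = {
    "properties": {
        "master_metadata_album_artist_name": {
            "type": "text",
            "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
        },
        "ms_played": {"type": "long"},
        "ts": {"type": "date"},
    }
}

print(keyword_fields(sample_mappings))
```

Against a live index, the mapping itself could be fetched with `client.indices.get_mapping(index=index_name)` and passed to the same helper.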

## We can now run queries on our data

In [55]:
index_name = "spotify-history"
query = {"match": {"master_metadata_album_artist_name": "Hozier"}}

# Run a simple match query, here looking for tracks by a given artist
response = client.search(index=index_name, query=query, size=3)

print(
    "We get back {total} results, here are the first ones:".format(
        total=response["hits"]["total"]["value"]
    )
)
for hit in response["hits"]["hits"]:
    print(hit["_source"]["master_metadata_track_name"])

We get back 5653 results, here are the first ones:
My Love Will Never Die
Angel Of Small Death & The Codeine Scene
Someone New - Live
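
A `match` query on its own searches the whole history. As a sketch of how it could be narrowed to a single year, the body below wraps it in a `bool` query with a `range` filter on `ts`. The helper name `artist_in_range_query` and the date bounds are illustrative assumptions, not part of the original notebook.

```python
def artist_in_range_query(artist, start, end):
    """Build a bool query: match the artist, keep only plays in [start, end)."""
    return {
        "bool": {
            "must": [{"match": {"master_metadata_album_artist_name": artist}}],
            # Filters do not affect scoring; they only restrict the result set
            "filter": [{"range": {"ts": {"gte": start, "lt": end}}}],
        }
    }


query = artist_in_range_query("Hozier", "2024-01-01", "2025-01-01")
print(query)

# With a live client:
# response = client.search(index=index_name, query=query, size=3)
```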


### My top artists of all time

In [56]:
aggs = {"mydata_agg": {"terms": {"field": "master_metadata_album_artist_name.keyword"}}}

response = client.search(index=index_name, aggregations=aggs)
for hit in response["aggregations"]["mydata_agg"]["buckets"]:
    print(hit)

{'key': 'Hozier', 'doc_count': 5653}
{'key': 'Ariana Grande', 'doc_count': 1543}
{'key': 'Billie Eilish', 'doc_count': 1226}
{'key': 'Halsey', 'doc_count': 1076}
{'key': 'Taylor Swift', 'doc_count': 650}
{'key': 'Cardi B', 'doc_count': 547}
{'key': 'Beyoncé', 'doc_count': 525}
{'key': 'Avril Lavigne', 'doc_count': 469}
{'key': 'BLACKPINK', 'doc_count': 413}
{'key': 'Paramore', 'doc_count': 397}
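
The `terms` aggregation returns ten buckets by default, which is why exactly ten artists appear above. A minimal sketch of asking for a different number and turning the buckets into plain tuples could look like this. Both helper names are hypothetical, and `sample_response` only repeats the first three buckets printed above so the snippet runs without a cluster.

```python
def top_artists_aggs(size=10):
    """Build a terms aggregation on the artist keyword field."""
    return {
        "mydata_agg": {
            "terms": {
                "field": "master_metadata_album_artist_name.keyword",
                "size": size,  # number of buckets to return
            }
        }
    }


def buckets_to_pairs(response, agg_name="mydata_agg"):
    """Flatten aggregation buckets into (artist, play_count) tuples."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return [(b["key"], b["doc_count"]) for b in buckets]


# First three buckets from the output above, in the response shape.
sample_response = {
    "aggregations": {
        "mydata_agg": {
            "buckets": [
                {"key": "Hozier", "doc_count": 5653},
                {"key": "Ariana Grande", "doc_count": 1543},
                {"key": "Billie Eilish", "doc_count": 1226},
            ]
        }
    }
}

print(top_artists_aggs(size=25))
print(buckets_to_pairs(sample_response))

# With a live client:
# response = client.search(index=index_name, aggregations=top_artists_aggs(25), size=0)
```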


### Top artists of 2024 by number of plays

In [57]:
body = {
    # Explicit bounds: everything from 1 Jan 2024 up to, but excluding, 1 Jan 2025
    "query": {"range": {"ts": {"gte": "2024-01-01", "lt": "2025-01-01"}}},
    "aggs": {
        "top_artists": {"terms": {"field": "master_metadata_album_artist_name.keyword"}}
    },
}

response = client.search(index=index_name, body=body)
for hit in response["aggregations"]["top_artists"]["buckets"]:
    print(
        "{artist} played {times} times.".format(
            artist=hit["key"], times=hit["doc_count"]
        )
    )

Linkin Park played 271 times.
Hozier played 268 times.
Dua Lipa played 112 times.
Taylor Swift played 106 times.
Måneskin played 61 times.
Avril Lavigne played 55 times.
Evanescence played 40 times.
Paramore played 35 times.
The Pretty Reckless played 34 times.
Green Day played 33 times.


### Top artists of 2024 by amount of time played

In [58]:
body = {
    "size": 0,
    # Explicit bounds: everything from 1 Jan 2024 up to, but excluding, 1 Jan 2025
    "query": {"range": {"ts": {"gte": "2024-01-01", "lt": "2025-01-01"}}},
    "aggs": {
        "top_artists": {
            "terms": {
                "field": "master_metadata_album_artist_name.keyword",
                "order": {"minutes_played": "desc"},
            },
            "aggs": {"minutes_played": {"sum": {"field": "ms_played"}}},
        }
    },
}

response = client.search(index=index_name, body=body)
for hit in response["aggregations"]["top_artists"]["buckets"]:
    print(
        "{artist} played {times} times; for a total of {hours} hours".format(
            artist=hit["key"],
            times=hit["doc_count"],
            hours=round(hit["minutes_played"]["value"] / 3600000, 2),
        )
    )

Hozier played 268 times; for a total of 13.69 hours
Linkin Park played 271 times; for a total of 12.21 hours
Dua Lipa played 112 times; for a total of 4.43 hours
Taylor Swift played 106 times; for a total of 4.39 hours
Måneskin played 61 times; for a total of 2.48 hours
Avril Lavigne played 55 times; for a total of 2.05 hours
Evanescence played 40 times; for a total of 1.96 hours
Adele played 27 times; for a total of 1.6 hours
Billie Eilish played 32 times; for a total of 1.57 hours
Green Day played 33 times; for a total of 1.5 hours
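
Note that the sub-aggregation named `minutes_played` actually holds a sum of milliseconds, which is why the loop divides by 3,600,000 to get hours. The sketch below isolates that conversion in a small helper. `bucket_to_row` is a hypothetical name, and the millisecond value in `sample_bucket` is not taken from the real response: it was chosen so that it converts to the 13.69 hours printed above.

```python
MS_PER_HOUR = 60 * 60 * 1000  # 3,600,000 milliseconds in an hour


def bucket_to_row(bucket, metric="minutes_played"):
    """Convert one terms bucket into a row with plays and hours listened."""
    total_ms = bucket[metric]["value"]  # sum of ms_played for this artist
    return {
        "artist": bucket["key"],
        "plays": bucket["doc_count"],
        "hours": round(total_ms / MS_PER_HOUR, 2),
    }


# Illustrative bucket: the ms value is back-computed from 13.69 hours.
sample_bucket = {
    "key": "Hozier",
    "doc_count": 268,
    "minutes_played": {"value": 49284000.0},
}

row = bucket_to_row(sample_bucket)
print(row)
```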


## From here, you can read the [blog](/Spotify%20Wrapped%20Iulia's%20Version.md) on how to build the visualizations