courses/machine_learning/deepdive/10_recommend/content_based_preproc.ipynb

{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Create datasets for the Content-based Filter\n", "\n", "This notebook builds the data we will use for creating our content-based model. We'll collect the data via a series of SQL queries against the publicly available Kurier.at dataset in BigQuery.\n", "Kurier.at is an Austrian news site. The goal of these labs is to recommend an article for a visitor to the site. In this lab we collect the data for training; in the subsequent notebook we train the recommender model. \n", "\n", "This notebook illustrates\n", "* how to pull data from a BigQuery table and write it to local files\n", "* how to make reproducible train and test splits " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import tensorflow as tf\n", "import numpy as np\n", "from google.cloud import bigquery \n", "\n", "PROJECT = 'cloud-training-demos' # REPLACE WITH YOUR PROJECT ID\n", "BUCKET = 'cloud-training-demos-ml' # REPLACE WITH YOUR BUCKET NAME\n", "REGION = 'us-central1' # REPLACE WITH YOUR BUCKET REGION e.g. us-central1\n", "\n", "# do not change these\n", "os.environ['PROJECT'] = PROJECT\n", "os.environ['BUCKET'] = BUCKET\n", "os.environ['REGION'] = REGION\n", "os.environ['TFVERSION'] = '2.1'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "gcloud config set project $PROJECT\n", "gcloud config set compute/region $REGION" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will use this helper function to write lists containing the article ids, categories, and authors for each article in our database to a local file. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def write_list_to_disk(my_list, filename):\n", "    # Write one item per line so the file can be re-read with a simple split\n", "    with open(filename, 'w') as f:\n", "        for item in my_list:\n", "            line = \"%s\\n\" % item\n", "            f.write(line)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Pull data from BigQuery\n", "\n", "The cell below creates a local text file containing all the article ids (i.e. 'content ids') in the dataset. \n", "\n", "Have a look at the original dataset in [BigQuery](https://console.cloud.google.com/bigquery?p=cloud-training-demos&d=GA360_test&t=ga_sessions_sample). Then read through the query below and make sure you understand what it is doing. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sql=\"\"\"\n", "#standardSQL\n", "\n", "SELECT \n", "  (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(hits.customDimensions)) AS content_id \n", "FROM `cloud-training-demos.GA360_test.ga_sessions_sample`, \n", "  UNNEST(hits) AS hits\n", "WHERE \n", "  # only include hits on pages\n", "  hits.type = \"PAGE\"\n", "  AND (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(hits.customDimensions)) IS NOT NULL\n", "GROUP BY\n", "  content_id\n", "\"\"\"\n", "\n", "content_ids_list = bigquery.Client().query(sql).to_dataframe()['content_id'].tolist()\n", "write_list_to_disk(content_ids_list, \"content_ids.txt\")\n", "print(\"Some sample content IDs {}\".format(content_ids_list[:3]))\n", "print(\"The total number of articles is {}\".format(len(content_ids_list)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There should be 15,634 articles in the database." ] },
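{ "cell_type": "markdown", "metadata": {}, "source": [ "As a quick sanity check, we can read `content_ids.txt` back and confirm it matches the list we pulled from BigQuery. This is a minimal sketch; it assumes the ids are plain strings with no surrounding whitespace, which holds for the numeric content ids here. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Read the ids back from disk and compare with the in-memory list\n", "with open('content_ids.txt') as f:\n", "    ids_from_disk = [line.strip() for line in f]\n", "assert ids_from_disk == content_ids_list\n", "print(\"Read back {} content IDs from disk\".format(len(ids_from_disk)))" ] },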
{ "cell_type": "markdown", "metadata": {}, "source": [ "Next, we'll create local files containing the list of article categories and the list of article authors.\n", "\n", "Note the change in the custom dimension index when pulling the article category (index=7) or author (index=2) information. Refer back to the original dataset and use the `hits.customDimensions.index` field to verify the correct index. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sql=\"\"\"\n", "#standardSQL\n", "SELECT \n", "  (SELECT MAX(IF(index=7, value, NULL)) FROM UNNEST(hits.customDimensions)) AS category \n", "FROM `cloud-training-demos.GA360_test.ga_sessions_sample`, \n", "  UNNEST(hits) AS hits\n", "WHERE \n", "  # only include hits on pages\n", "  hits.type = \"PAGE\"\n", "  AND (SELECT MAX(IF(index=7, value, NULL)) FROM UNNEST(hits.customDimensions)) IS NOT NULL\n", "GROUP BY \n", "  category\n", "\"\"\"\n", "categories_list = bigquery.Client().query(sql).to_dataframe()['category'].tolist()\n", "write_list_to_disk(categories_list, \"categories.txt\")\n", "print(categories_list)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The categories are 'News', 'Stars & Kultur', and 'Lifestyle'. \n", "When creating the author list, we'll use only the first author listed for each article. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sql=\"\"\"\n", "#standardSQL\n", "SELECT\n", "  REGEXP_EXTRACT((SELECT MAX(IF(index=2, value, NULL)) FROM UNNEST(hits.customDimensions)), r\"^[^,]+\") AS first_author \n", "FROM `cloud-training-demos.GA360_test.ga_sessions_sample`, \n", "  UNNEST(hits) AS hits\n", "WHERE \n", "  # only include hits on pages\n", "  hits.type = \"PAGE\"\n", "  AND (SELECT MAX(IF(index=2, value, NULL)) FROM UNNEST(hits.customDimensions)) IS NOT NULL\n", "GROUP BY \n", "  first_author\n", "\"\"\"\n", "authors_list = bigquery.Client().query(sql).to_dataframe()['first_author'].tolist()\n", "write_list_to_disk(authors_list, \"authors.txt\")\n", "print(\"Some sample authors {}\".format(authors_list[:10]))\n", "print(\"The total number of authors is {}\".format(len(authors_list)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There should be 385 authors in the database. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create train and test sets\n", "\n", "In this section, we will create the train/test split of our data for training our model. We hash the concatenation of visitor id and content id with a farm fingerprint, taking approximately 90% of the data for the training set and 10% for the test set." ] },
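{ "cell_type": "markdown", "metadata": {}, "source": [ "The split is reproducible because the farm fingerprint of `CONCAT(visitor_id, content_id)` is deterministic: the same row always hashes to the same bucket, so re-running the queries reproduces the same split. The cell below is a minimal Python sketch of the same idea, using `hashlib` as a stand-in for BigQuery's `FARM_FINGERPRINT` (a different hash function); the ids shown are hypothetical. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import hashlib\n", "\n", "def assign_split(visitor_id, content_id, num_buckets=10, train_buckets=9):\n", "    # Hash the concatenated keys into one of num_buckets buckets.\n", "    # The assignment is deterministic, so the split is reproducible.\n", "    key = '{}{}'.format(visitor_id, content_id).encode('utf-8')\n", "    bucket = int(hashlib.md5(key).hexdigest(), 16) % num_buckets\n", "    return 'train' if bucket < train_buckets else 'test'\n", "\n", "print(assign_split('1000148716229112932', '299410466'))  # hypothetical ids" ] },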
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sql=\"\"\"\n", "WITH site_history AS (\n", "  SELECT\n", "    fullVisitorId AS visitor_id,\n", "    (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(hits.customDimensions)) AS content_id,\n", "    (SELECT MAX(IF(index=7, value, NULL)) FROM UNNEST(hits.customDimensions)) AS category, \n", "    (SELECT MAX(IF(index=6, value, NULL)) FROM UNNEST(hits.customDimensions)) AS title,\n", "    (SELECT MAX(IF(index=2, value, NULL)) FROM UNNEST(hits.customDimensions)) AS author_list,\n", "    SPLIT(RPAD((SELECT MAX(IF(index=4, value, NULL)) FROM UNNEST(hits.customDimensions)), 7), '.') AS year_month_array,\n", "    LEAD(hits.customDimensions, 1) OVER (PARTITION BY fullVisitorId ORDER BY hits.time ASC) AS nextCustomDimensions\n", "  FROM \n", "    `cloud-training-demos.GA360_test.ga_sessions_sample`, \n", "    UNNEST(hits) AS hits\n", "  WHERE \n", "    # only include hits on pages\n", "    hits.type = \"PAGE\"\n", "    AND fullVisitorId IS NOT NULL\n", "    AND hits.time != 0\n", "    AND hits.time IS NOT NULL\n", "    AND (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(hits.customDimensions)) IS NOT NULL\n", ")\n", "SELECT\n", "  visitor_id,\n", "  content_id,\n", "  category,\n", "  REGEXP_REPLACE(title, r\",\", \"\") AS title,\n", "  REGEXP_EXTRACT(author_list, r\"^[^,]+\") AS author,\n", "  DATE_DIFF(DATE(CAST(year_month_array[OFFSET(0)] AS INT64), CAST(year_month_array[OFFSET(1)] AS INT64), 1), DATE(1970,1,1), MONTH) AS months_since_epoch,\n", "  (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(nextCustomDimensions)) AS next_content_id\n", "FROM\n", "  site_history\n", "WHERE (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(nextCustomDimensions)) IS NOT NULL\n", "  AND ABS(MOD(FARM_FINGERPRINT(CONCAT(visitor_id, content_id)), 10)) < 9\n", "\"\"\"\n", "training_set_df = bigquery.Client().query(sql).to_dataframe()\n", "training_set_df.to_csv('training_set.csv', header=False, index=False, encoding='utf-8')\n", "training_set_df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sql=\"\"\"\n", "WITH site_history AS (\n", "  SELECT\n", "    fullVisitorId AS visitor_id,\n", "    (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(hits.customDimensions)) AS content_id,\n", "    (SELECT MAX(IF(index=7, value, NULL)) FROM UNNEST(hits.customDimensions)) AS category, \n", "    (SELECT MAX(IF(index=6, value, NULL)) FROM UNNEST(hits.customDimensions)) AS title,\n", "    (SELECT MAX(IF(index=2, value, NULL)) FROM UNNEST(hits.customDimensions)) AS author_list,\n", "    SPLIT(RPAD((SELECT MAX(IF(index=4, value, NULL)) FROM UNNEST(hits.customDimensions)), 7), '.') AS year_month_array,\n", "    LEAD(hits.customDimensions, 1) OVER (PARTITION BY fullVisitorId ORDER BY hits.time ASC) AS nextCustomDimensions\n", "  FROM \n", "    `cloud-training-demos.GA360_test.ga_sessions_sample`, \n", "    UNNEST(hits) AS hits\n", "  WHERE \n", "    # only include hits on pages\n", "    hits.type = \"PAGE\"\n", "    AND fullVisitorId IS NOT NULL\n", "    AND hits.time != 0\n", "    AND hits.time IS NOT NULL\n", "    AND (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(hits.customDimensions)) IS NOT NULL\n", ")\n", "SELECT\n", "  visitor_id,\n", "  content_id,\n", "  category,\n", "  REGEXP_REPLACE(title, r\",\", \"\") AS title,\n", "  REGEXP_EXTRACT(author_list, r\"^[^,]+\") AS author,\n", "  DATE_DIFF(DATE(CAST(year_month_array[OFFSET(0)] AS INT64), CAST(year_month_array[OFFSET(1)] AS INT64), 1), DATE(1970,1,1), MONTH) AS months_since_epoch,\n", "  (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(nextCustomDimensions)) AS next_content_id\n", "FROM\n", "  site_history\n", "WHERE (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(nextCustomDimensions)) IS NOT NULL\n", "  AND ABS(MOD(FARM_FINGERPRINT(CONCAT(visitor_id, content_id)), 10)) >= 9\n", "\"\"\"\n", "test_set_df = bigquery.Client().query(sql).to_dataframe()\n", "test_set_df.to_csv('test_set.csv', header=False, index=False, encoding='utf-8')\n", "test_set_df.head()" ] },
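{ "cell_type": "markdown", "metadata": {}, "source": [ "Before inspecting the files, we can check the split directly. Because each (visitor id, content id) pair hashes to exactly one bucket, no pair should appear in both sets. This is a small sketch, assuming both dataframes above loaded successfully. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Confirm the train and test sets are disjoint on (visitor_id, content_id)\n", "train_keys = set(zip(training_set_df['visitor_id'], training_set_df['content_id']))\n", "test_keys = set(zip(test_set_df['visitor_id'], test_set_df['content_id']))\n", "print(\"Overlapping (visitor, content) pairs: {}\".format(len(train_keys & test_keys)))" ] },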
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let's have a look at the two csv files we just created containing the training and test sets. We'll also do a line count of both files to confirm that we have achieved an approximate 90/10 train/test split. \n", "In the next notebook, **Content Based Filtering**, we will build a model to recommend an article given information about the current article being read, such as the category, title, author, and publish date. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "wc -l *_set.csv" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!head *_set.csv" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }