courses/machine_learning/deepdive2/recommendation_systems/solutions/als

{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Collaborative filtering on the MovieLense Dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Learning objectives**\n", "1. Explore the data using BigQuery.\n", "2. Use the model to make recommendations for a user.\n", "3. Use the model to recommend an item to a group of users.\n", "\n", "\n", "## Introduction\n", "\n", "###### This notebook is based on part of Chapter 9 of [BigQuery: The Definitive Guide](https://www.oreilly.com/library/view/google-bigquery-the/9781492044451/ \"http://shop.oreilly.com/product/0636920207399.do\") by Lakshmanan and Tigani.\n", "### MovieLens dataset\n", "To illustrate recommender systems in action, let’s use the MovieLens dataset. This is a dataset of movie reviews released by GroupLens, a research lab in the Department of Computer Science and Engineering at the University of Minnesota, through funding by the US National Science Foundation.\n", "\n", "Download the data and load it as a BigQuery table using:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import tensorflow as tf\n", "PROJECT = \"your-project-here\" # REPLACE WITH YOUR PROJECT ID\n", "\n", "# Do not change these\n", "os.environ[\"PROJECT\"] = PROJECT\n", "os.environ[\"TFVERSION\"] = '2.6'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Archive: ml-20m.zip\n", "creating: ml-20m/\n", "inflating: ml-20m/genome-scores.csv\n", "inflating: ml-20m/genome-tags.csv\n", "inflating: ml-20m/links.csv\n", "inflating: ml-20m/movies.csv\n", "inflating: ml-20m/ratings.csv\n", "inflating: ml-20m/README.txt\n", "inflating: ml-20m/tags.csv\n", "BigQuery error in rm operation: Not found: Dataset qwiklabs-gcp-00-20dab82189fb:movielens\n", "Dataset 'qwiklabs-gcp-00-20dab82189fb:movielens' successfully created.\n", "rm: cannot remove 'bqml_data': No such file or directory\n", "% Total % Received % Xferd Average Speed Time Time Time Current\n", "Dload Upload Total Spent Left Speed\n", "100 189M 100 189M 0 0 51.1M 0 0:00:03 0:00:03 --:--:-- 51.1M\n", "Waiting on bqjob_r52bc2229fb29a472_0000017521131104_1 ... (48s) Current status: DONE\n", "Waiting on bqjob_r5622b356e8512afa_000001752113d934_1 ... (1s) Current status: DONE" ] } ], "source": [ "%%bash\n", "mkdir bqml_data\n", "cd bqml_data\n", "curl -O 'http://files.grouplens.org/datasets/movielens/ml-20m.zip'\n", "unzip ml-20m.zip\n", "yes | bq rm -r $PROJECT:movielens\n", "bq --location=US mk --dataset \\\n", " --description 'Movie Recommendations' \\\n", " $PROJECT:movielens\n", "bq --location=US load --source_format=CSV \\\n", " --autodetect movielens.ratings gs://cloud-training/recommender-systems/movielens/ratings.csv\n", "bq --location=US load --source_format=CSV \\\n", " --autodetect movielens.movies_raw gs://cloud-training/recommender-systems/movielens/movies.csv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exploring the data\n", "Two tables should now be available in <a href=\"https://console.cloud.google.com/bigquery\">BigQuery</a>.\n", "\n", "Collaborative filtering provides a way to generate product recommendations for users, or user targeting for products. The starting point is a table, <b>movielens.ratings</b>, with three columns: a user id, an item id, and the rating that the user gave the product. This table can be sparse -- users don’t have to rate all products. Then, based on just the ratings, the technique finds similar users and similar products and determines the rating that a user would give an unseen product. Then, we can recommend the products with the highest predicted ratings to users, or target products at users with the highest predicted ratings." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bigquery --project $PROJECT\n", "SELECT *\n", "FROM movielens.ratings\n", "LIMIT 10" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A quick exploratory query yields that the dataset consists of over 138 thousand users, nearly 27 thousand movies, and a little more than 20 million ratings, confirming that the data has been loaded successfully." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bigquery --project $PROJECT\n", "SELECT \n", " COUNT(DISTINCT userId) numUsers,\n", " COUNT(DISTINCT movieId) numMovies,\n", " COUNT(*) totalRatings\n", "FROM movielens.ratings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On examining the first few movies using the query following query, we can see that the genres column is a formatted string:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bigquery --project $PROJECT\n", "SELECT *\n", "FROM movielens.movies_raw\n", "WHERE movieId < 5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can parse the genres into an array and rewrite the table as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bigquery --project $PROJECT\n", "CREATE OR REPLACE TABLE movielens.movies AS\n", " SELECT * REPLACE(SPLIT(genres, \"|\") AS genres)\n", " FROM movielens.movies_raw" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bigquery --project $PROJECT\n", "SELECT *\n", "FROM movielens.movies\n", "WHERE movieId < 5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Matrix factorization\n", "Matrix factorization is a collaborative filtering technique that relies on factorizing the ratings matrix into two vectors called the user factors and the item factors. The user factors is a low-dimensional representation of a user_id and the item factors similarly represents an item_id.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Table 'cloud-training-demos:movielens.recommender_16' successfully copied to 'qwiklabs-gcp-00-20dab82189fb:movielens.recommender'\n", "Waiting on bqjob_r6d5fb674bd2552f_000001752117e289_1 ... (0s) Current status: DONE" ] } ], "source": [ "%%bash\n", "bq --location=US cp \\\n", "cloud-training-demos:movielens.recommender_16 \\\n", "movielens.recommender" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bigquery --project $PROJECT\n", "SELECT *\n", "-- Note: remove cloud-training-demos if you are using your own model: \n", "FROM ML.TRAINING_INFO(MODEL `cloud-training-demos.movielens.recommender`)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bigquery --project $PROJECT\n", "SELECT *\n", "-- Note: remove cloud-training-demos if you are using your own model:\n", "FROM ML.TRAINING_INFO(MODEL `cloud-training-demos.movielens.recommender_16`)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When we did that, we discovered that the evaluation loss was lower (0.97) with num_factors=16 than with num_factors=36 (1.67) or num_factors=24 (1.45). We could continue experimenting, but we are likely to see diminishing returns with further experimentation.\n", "\n", "## Making recommendations\n", "\n", "With the trained model, we can now provide recommendations. For example, let’s find the best comedy movies to recommend to the user whose userId is 903. In the query below, we are calling ML.PREDICT passing in the trained recommendation model and providing a set of movieId and userId to carry out the predictions on. In this case, it’s just one userId (903), but all movies whose genre includes Comedy." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bigquery --project $PROJECT\n", "SELECT * FROM\n", "ML.PREDICT(MODEL `cloud-training-demos.movielens.recommender_16`, (\n", " SELECT \n", " movieId, title, 903 AS userId\n", " FROM movielens.movies, UNNEST(genres) g\n", " WHERE g = 'Comedy'\n", "))\n", "ORDER BY predicted_rating DESC\n", "LIMIT 5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Filtering out already rated movies\n", "Of course, this includes movies the user has already seen and rated in the past. Let’s remove them.\n", "\n", "**TODO 1**: Make a prediction for user 903 that does not include already seen movies." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bigquery --project $PROJECT\n", "SELECT * FROM\n", "ML.PREDICT(MODEL `cloud-training-demos.movielens.recommender_16`, (\n", " WITH seen AS (\n", " SELECT ARRAY_AGG(movieId) AS movies \n", " FROM movielens.ratings\n", " WHERE userId = 903\n", " )\n", " SELECT \n", " movieId, title, 903 AS userId\n", " FROM movielens.movies, UNNEST(genres) g, seen\n", " WHERE g = 'Comedy' AND movieId NOT IN UNNEST(seen.movies)\n", "))\n", "ORDER BY predicted_rating DESC\n", "LIMIT 5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this user, this happens to yield the same set of movies -- the top predicted ratings didn’t include any of the movies the user has already seen.\n", "\n", "## Customer targeting\n", "\n", "In the previous section, we looked at how to identify the top-rated movies for a specific user. Sometimes, we have a product and have to find the customers who are likely to appreciate it. Suppose, for example, we wish to get more reviews for movieId=96481 which has only one rating and we wish to send coupons to the 5 users who are likely to rate it the highest. \n", "\n", "**TODO 2**: Find the top five users who will likely enjoy *American Mullet (2001)*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bigquery --project $PROJECT\n", "SELECT * FROM\n", "ML.PREDICT(MODEL `cloud-training-demos.movielens.recommender_16`, (\n", " WITH allUsers AS (\n", " SELECT DISTINCT userId\n", " FROM movielens.ratings\n", " )\n", " SELECT \n", " 96481 AS movieId, \n", " (SELECT title FROM movielens.movies WHERE movieId=96481) title,\n", " userId\n", " FROM\n", " allUsers\n", "))\n", "ORDER BY predicted_rating DESC\n", "LIMIT 5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Batch predictions for all users and movies\n", "What if we wish to carry out predictions for every user and movie combination? Instead of having to pull distinct users and movies as in the previous query, a convenience function is provided to carry out batch predictions for all movieId and userId encountered during training. A limit is applied here, otherwise, all user-movie predictions will be returned and will crash the notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bigquery --project $PROJECT\n", "SELECT *\n", "FROM ML.RECOMMEND(MODEL `cloud-training-demos.movielens.recommender_16`)\n", "LIMIT 10" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As seen in a section above, it is possible to filter out movies the user has already seen and rated in the past. The reason already seen movies aren’t filtered out by default is that there are situations (think of restaurant recommendations, for example) where it is perfectly expected that we would need to recommend restaurants the user has liked in the past." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Copyright 2022 Google Inc. Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.3" } }, "nbformat": 4, "nbformat_minor": 4 }

courses/machine_learning/deepdive2/recommendation_systems/solutions/als_bqml.ipynb (391 lines of code) (raw):