courses/DSL/challenge-clickstream/click-stream-generator.ipynb (463 lines of code) (raw):

{ "cells": [ { "cell_type": "markdown", "id": "ebd219ee-136a-44ca-afce-1a4ebe33b016", "metadata": {}, "source": [ "# Simulated Visit Generator\n", "\n", "The function below generates a simulated visit to a website. A visit is a collection of events of three types: Page View, Add Item to Cart, and Purchase. \n", "\n", "All visits have page views. Some visits also have Add Item to Cart events, and some of those visits end with a purchase. \n", "\n", "The OpenAPI schema for a Visit is shown below. \n", "\n", "```\n", "openapi: 3.0.0\n", "info:\n", " title: Visit Schema API\n", " version: 1.0.0\n", " description: Schema for representing a visit to a website, including page views, adding items to a cart, and purchases.\n", "paths: {}\n", "components:\n", " schemas:\n", " Visit:\n", " type: object\n", " properties:\n", " session_id:\n", " type: string\n", " example: \"SID-1234\"\n", " description: \"A unique identifier for the user's session.\"\n", " user_id:\n", " type: string\n", " example: \"UID-5678\"\n", " description: \"A unique identifier for the user visiting the website.\"\n", " device_type:\n", " type: string\n", " enum: [desktop, mobile, tablet]\n", " example: \"desktop\"\n", " description: \"The type of device used by the user.\"\n", " geolocation:\n", " type: string\n", " example: \"37.7749,-122.4194\"\n", " description: \"The geolocation of the user in latitude,longitude format.\"\n", " user_agent:\n", " type: string\n", " example: \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36\"\n", " description: \"The user agent string of the browser/device used by the user.\"\n", " events:\n", " type: array\n", " items:\n", " $ref: '#/components/schemas/Event'\n", " description: \"List of events during the user's visit.\"\n", "\n", " Event:\n", " type: object\n", " properties:\n", " event_type:\n", " type: string\n", " enum: [page_view, add_item_to_cart, purchase]\n", " example: 
\"page_view\"\n", " description: \"The type of event that occurred.\"\n", " timestamp:\n", " type: string\n", " format: date-time\n", " example: \"2023-08-10T12:34:56Z\"\n", " description: \"The exact time when the event occurred.\"\n", " details:\n", " type: object\n", " oneOf:\n", " - $ref: '#/components/schemas/PageViewDetails'\n", " - $ref: '#/components/schemas/AddItemToCartDetails'\n", " - $ref: '#/components/schemas/PurchaseDetails'\n", " description: \"Specific details of the event based on its type.\"\n", "\n", " PageViewDetails:\n", " type: object\n", " properties:\n", " page_url:\n", " type: string\n", " example: \"https://example.com/products\"\n", " description: \"The URL of the webpage that was viewed.\"\n", " referrer_url:\n", " type: string\n", " nullable: true\n", " example: \"https://google.com\"\n", " description: \"The URL of the referrer page that led to this page view, or null if none.\"\n", "\n", " AddItemToCartDetails:\n", " type: object\n", " properties:\n", " product_id:\n", " type: string\n", " example: \"HDW-001\"\n", " description: \"The unique identifier of the product added to the cart.\"\n", " product_name:\n", " type: string\n", " example: \"Laptop X200\"\n", " description: \"The name of the product added to the cart.\"\n", " category:\n", " type: string\n", " enum: [hardware, software, peripherals]\n", " example: \"hardware\"\n", " description: \"The category of the product added to the cart.\"\n", " price:\n", " type: number\n", " format: float\n", " example: 999.99\n", " description: \"The price of the product added to the cart.\"\n", " quantity:\n", " type: integer\n", " example: 2\n", " description: \"The quantity of the product added to the cart.\"\n", "\n", " PurchaseDetails:\n", " type: object\n", " properties:\n", " order_id:\n", " type: string\n", " example: \"ORD-4321\"\n", " description: \"A unique identifier for the order.\"\n", " amount:\n", " type: number\n", " format: float\n", " example: 1999.98\n", " description: 
\"The total amount of the purchase.\"\n", " currency:\n", " type: string\n", " example: \"USD\"\n", " description: \"The currency used for the purchase.\"\n", " items:\n", " type: array\n", " items:\n", " $ref: '#/components/schemas/PurchaseItem'\n", " description: \"A list of items purchased in this order.\"\n", "\n", " PurchaseItem:\n", " type: object\n", " properties:\n", " product_id:\n", " type: string\n", " example: \"HDW-001\"\n", " description: \"The unique identifier of the product purchased.\"\n", " product_name:\n", " type: string\n", " example: \"Laptop X200\"\n", " description: \"The name of the product purchased.\"\n", " category:\n", " type: string\n", " enum: [hardware, software, peripherals]\n", " example: \"hardware\"\n", " description: \"The category of the product purchased.\"\n", " price:\n", " type: number\n", " format: float\n", " example: 999.99\n", " description: \"The price of the product purchased.\"\n", " quantity:\n", " type: integer\n", " example: 2\n", " description: \"The quantity of the product purchased.\"\n", "\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "2bb75de7-ef7c-4a6f-ad32-d887dce2df21", "metadata": { "tags": [] }, "outputs": [], "source": [ "import random\n", "from datetime import datetime\n", "import json\n", "\n", "def generate_visit(custom_timestamp=None):\n", " # Sample products categorized by type with hard-coded product IDs and popularity scores\n", " products = {\n", " \"hardware\": [\n", " {\"product_id\": \"HDW-001\", \"name\": \"Laptop X200\", \"price\": 999.99, \"popularity\": 0.3},\n", " {\"product_id\": \"HDW-002\", \"name\": \"Desktop Z500\", \"price\": 1299.99, \"popularity\": 0.2},\n", " {\"product_id\": \"HDW-003\", \"name\": \"Gaming PC Y900\", \"price\": 1899.99, \"popularity\": 0.1},\n", " {\"product_id\": \"HDW-004\", \"name\": \"Ultrabook A400\", \"price\": 1199.99, \"popularity\": 0.15},\n", " {\"product_id\": \"HDW-005\", \"name\": \"Workstation Pro 9000\", \"price\": 
2599.99, \"popularity\": 0.05},\n", " {\"product_id\": \"HDW-006\", \"name\": \"Mini PC Cube\", \"price\": 699.99, \"popularity\": 0.2}\n", " ],\n", " \"software\": [\n", " {\"product_id\": \"SFT-001\", \"name\": \"Office Suite Pro\", \"price\": 199.99, \"popularity\": 0.25},\n", " {\"product_id\": \"SFT-002\", \"name\": \"Antivirus Shield\", \"price\": 49.99, \"popularity\": 0.3},\n", " {\"product_id\": \"SFT-003\", \"name\": \"Photo Editor Pro\", \"price\": 79.99, \"popularity\": 0.15},\n", " {\"product_id\": \"SFT-004\", \"name\": \"Project Manager Plus\", \"price\": 299.99, \"popularity\": 0.1},\n", " {\"product_id\": \"SFT-005\", \"name\": \"Video Editor Pro\", \"price\": 149.99, \"popularity\": 0.1},\n", " {\"product_id\": \"SFT-006\", \"name\": \"Music Studio 2024\", \"price\": 89.99, \"popularity\": 0.1}\n", " ],\n", " \"peripherals\": [\n", " {\"product_id\": \"PER-001\", \"name\": \"Wireless Mouse\", \"price\": 29.99, \"popularity\": 0.4},\n", " {\"product_id\": \"PER-002\", \"name\": \"Mechanical Keyboard\", \"price\": 89.99, \"popularity\": 0.3},\n", " {\"product_id\": \"PER-003\", \"name\": \"27\\\" 4K Monitor\", \"price\": 399.99, \"popularity\": 0.1},\n", " {\"product_id\": \"PER-004\", \"name\": \"USB-C Docking Station\", \"price\": 129.99, \"popularity\": 0.05},\n", " {\"product_id\": \"PER-005\", \"name\": \"Noise Cancelling Headphones\", \"price\": 199.99, \"popularity\": 0.1},\n", " {\"product_id\": \"PER-006\", \"name\": \"Webcam HD 1080p\", \"price\": 49.99, \"popularity\": 0.05}\n", " ]\n", " }\n", "\n", " # Helper function to generate a timestamp\n", " def generate_timestamp():\n", " return custom_timestamp if custom_timestamp else datetime.now().isoformat()\n", "\n", " # Helper function to select a random product from a category based on popularity\n", " def select_random_product():\n", " category = random.choice(list(products.keys()))\n", " category_products = products[category]\n", " # Use weighted random choice based on popularity\n", " 
product = random.choices(category_products, weights=[p[\"popularity\"] for p in category_products])[0]\n", " return product, category\n", "\n", " # Generating the base session details\n", " session = {\n", " \"session_id\": f\"SID-{random.randint(1000, 9999)}\",\n", " \"user_id\": f\"UID-{random.randint(1000, 9999)}\",\n", " \"device_type\": random.choices(\n", " [\"mobile\", \"desktop\", \"tablet\"], weights=[0.6, 0.3, 0.1]\n", " )[0],\n", " \"geolocation\": f\"{random.uniform(-90, 90):.6f},{random.uniform(-180, 180):.6f}\",\n", " \"user_agent\": random.choice([\n", " \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36\",\n", " \"Mozilla/5.0 (iPhone; CPU iPhone OS 14_6 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1\",\n", " \"Mozilla/5.0 (Linux; Android 10; SM-G973F) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Mobile Safari/537.36\"\n", " ]),\n", " \"events\": []\n", " }\n", "\n", " # The first page view is always the home page\n", " session[\"events\"].append({\n", " \"event\": {\n", " \"event_type\": \"page_view\",\n", " \"timestamp\": generate_timestamp(),\n", " \"details\": {\n", " \"page_url\": \"https://example.com/home\",\n", " \"referrer_url\": None # No referrer for the first page view\n", " }\n", " }\n", " })\n", "\n", " # The second page view is always the products page\n", " session[\"events\"].append({\n", " \"event\": {\n", " \"event_type\": \"page_view\",\n", " \"timestamp\": generate_timestamp(),\n", " \"details\": {\n", " \"page_url\": \"https://example.com/products\",\n", " \"referrer_url\": \"https://example.com/home\"\n", " }\n", " }\n", " })\n", "\n", " # Adding between 0 and 4 additional page_view events with low probability for about and contact pages\n", " num_additional_page_views = random.randint(0, 4)\n", " for _ in range(num_additional_page_views):\n", " page_url = random.choices(\n", " [\n", " 
\"https://example.com/cart\",\n", " \"https://example.com/about\",\n", " \"https://example.com/contact\"\n", " ],\n", " [0.9, 0.05, 0.05] # 90% chance of cart, 5% each for about and contact\n", " )[0]\n", "\n", " page_view_event = {\n", " \"event\": {\n", " \"event_type\": \"page_view\",\n", " \"timestamp\": generate_timestamp(),\n", " \"details\": {\n", " \"page_url\": page_url,\n", " \"referrer_url\": random.choice([\n", " \"https://google.com\",\n", " \"https://example.com/home\",\n", " \"https://example.com/products\"\n", " ])\n", " }\n", " }\n", " }\n", " session[\"events\"].append(page_view_event)\n", "\n", " # Determine whether to add add_item_to_cart events\n", " added_items = []\n", " if random.random() < 0.5: # 50% chance to add items to the cart\n", " num_items_to_add = random.randint(1, 3)\n", " for _ in range(num_items_to_add):\n", " product, category = select_random_product()\n", " add_item_to_cart_event = {\n", " \"event\": {\n", " \"event_type\": \"add_item_to_cart\",\n", " \"timestamp\": generate_timestamp(),\n", " \"details\": {\n", " \"product_id\": product[\"product_id\"],\n", " \"product_name\": product[\"name\"],\n", " \"category\": category,\n", " \"price\": product[\"price\"],\n", " \"quantity\": random.randint(1, 5)\n", " }\n", " }\n", " }\n", " session[\"events\"].append(add_item_to_cart_event)\n", " added_items.append(add_item_to_cart_event)\n", "\n", " # Determine whether to add a purchase event\n", " if added_items and random.random() < 0.5: # Only add purchase if items were added to cart\n", " total_amount = sum(\n", " item[\"event\"][\"details\"][\"price\"] * item[\"event\"][\"details\"][\"quantity\"]\n", " for item in added_items\n", " )\n", " purchase_event = {\n", " \"event\": {\n", " \"event_type\": \"purchase\",\n", " \"timestamp\": generate_timestamp(),\n", " \"details\": {\n", " \"order_id\": f\"ORD-{random.randint(1000, 9999)}\",\n", " \"amount\": total_amount,\n", " \"currency\": \"USD\",\n", " \"items\": [\n", " {\n", " 
\"product_id\": item[\"event\"][\"details\"][\"product_id\"],\n", " \"product_name\": item[\"event\"][\"details\"][\"product_name\"],\n", " \"category\": item[\"event\"][\"details\"][\"category\"],\n", " \"price\": item[\"event\"][\"details\"][\"price\"],\n", " \"quantity\": item[\"event\"][\"details\"][\"quantity\"]\n", " }\n", " for item in added_items\n", " ]\n", " }\n", " }\n", " }\n", " session[\"events\"].append(purchase_event)\n", "\n", " return session\n", "\n", "# Example usage\n", "visit = generate_visit(\"2024-08-12T14:30:00\")\n", "\n", "visit_json = json.dumps(visit, indent=4)\n", "print(visit_json)" ] }, { "cell_type": "markdown", "id": "0456baa7-e262-4fa5-a47b-b93ac4d8bf3e", "metadata": {}, "source": [ "# Generate files with Sample Visits\n", "\n", "The function below generates files with visits. It writes one JSON Lines file per day, starting at the specified start date and continuing for the specified number of days. The number of visits per day is chosen at random within the specified range. " ] }, { "cell_type": "code", "execution_count": null, "id": "a754c538-5cae-42dd-8a12-d135dc47dfdf", "metadata": { "tags": [] }, "outputs": [], "source": [ "import random\n", "from datetime import datetime, timedelta\n", "import json\n", "\n", "def generate_visits_for_days(start_date, num_days, visits_per_day_range, seed=None, time_increment_minutes=10):\n", " if seed is not None:\n", " random.seed(seed)\n", " \n", " current_date = start_date\n", "\n", " for day in range(num_days):\n", " # Start time for the day's visits (e.g., 9:00 AM)\n", " current_time = datetime.combine(current_date, datetime.min.time()) + timedelta(hours=9)\n", "\n", " # Randomly determine the number of visits for this day within the specified range\n", " num_visits_per_day = random.randint(visits_per_day_range[0], visits_per_day_range[1])\n", "\n", " # Generate a file name based on the current date\n", " file_name = f'visits-{current_date.strftime(\"%Y-%m-%d\")}.jsonl'\n", " \n", " # Generate visits for the current 
day\n", " with open(file_name, 'w') as f:\n", " for _ in range(num_visits_per_day):\n", " custom_timestamp = current_time.isoformat()\n", " visit = generate_visit(custom_timestamp)\n", " visit_json = json.dumps(visit)\n", " f.write(visit_json + '\\n')\n", "\n", " # Increment the time for the next visit\n", " current_time += timedelta(minutes=time_increment_minutes)\n", "\n", " # Print a message indicating the file creation\n", " print(f'Generated file: {file_name} with {num_visits_per_day} visits')\n", " \n", " # Move to the next day\n", " current_date += timedelta(days=1)\n", "\n", "# Example usage:\n", "\n", "START_DATE = datetime(2024, 7, 1) # Starting date\n", "NUM_DAYS = 60 # Number of days\n", "VISITS_PER_DAY_RANGE = (100, 199) # Range for the number of visits per day (min, max)\n", "SEED = 42 # Set a seed for reproducibility, can be set to None\n", "TIME_INCREMENT_MINUTES = 10 # Minutes between each visit\n", "\n", "generate_visits_for_days(START_DATE, NUM_DAYS, VISITS_PER_DAY_RANGE, seed=SEED, time_increment_minutes=TIME_INCREMENT_MINUTES)\n", "\n", "print(\"Done\")" ] }, { "cell_type": "markdown", "id": "ce2a99de-a240-423c-9865-60fdd0f44136", "metadata": {}, "source": [ "# List the Generated Files" ] }, { "cell_type": "code", "execution_count": null, "id": "79cc2ec1-b9da-4c3d-9355-368bbf6f373c", "metadata": {}, "outputs": [], "source": [ "! ls" ] } ], "metadata": { "environment": { "kernel": "apache-beam-2.58.0", "name": ".m116", "type": "gcloud", "uri": "gcr.io/deeplearning-platform-release/:m116" }, "kernelspec": { "display_name": "Apache Beam 2.58.0 (Local)", "language": "python", "name": "apache-beam-2.58.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.14" } }, "nbformat": 4, "nbformat_minor": 5 }