courses/understanding_spanner/colab-notebooks/Spanner_Generating_PK_Values.ipynb (189 lines of code) (raw):

{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "OLcHQVFAlLbZ" }, "source": [ "# Generating Primary Keys for Spanner\n", "\n", "In Spanner databases, using monotonically increasing or decreasing primary keys is a potential performance problem. This is especially true for large databases with a lot of inserts. \n", " \n", "The best primary key is a natural key. This is where you choose two or more fields in the database that guarantee uniqueness. Just make sure the natural key doesn't always get larger (or smaller).\n", " \n", "Sometimes a surrogate key is required. Traditionally, developers have used counters as surrogate primary keys. This is inefficient in Spanner. Spanner also has no built-in key generators, so it's up to the client, not the server, to generate the key.\n", " \n", "There are several ways to generate surrogate keys. This notebook explore them. \n" ] }, { "cell_type": "markdown", "metadata": { "id": "vXK8M0w6nIY4" }, "source": [ "## Universally Unique Identifiers (UUIDs)\n", "\n", "UUIDs are 36-bit strings guaranteed to be unique. There are different versions of UUIDs, so you must be careful. Version 1 UUIDs are generated with a timestamp. These always get larger, thus, should not be used for Spanner keys.\n", " \n", "Version 4 UUIDs are generated using random numbers. These do not increase monotonically. The code below uses Python to generate 10 UUID strings. Notice the values do not continually get bigger. They are randomly distributed across the table space in a Spanner database, which is what you want. \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "SFUWYCWmT3MW" }, "outputs": [], "source": [ "import uuid\n", "\n", "for i in range(10):\n", " print(str(uuid.uuid4()))" ] }, { "cell_type": "markdown", "metadata": { "id": "w0I6-pldpaR2" }, "source": [ "## Reversing the bits of integer values\n", "\n", "Sometimes you are migrating a database that uses counters as keys. If so, you would need to preserve the existing primary-foreign key relations in the tables. One way to do that would be to reverse the bits in all the existing key values. \n", "\n", "The code below shows an algorithm for doing that. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "-T8YOoGzYvEd" }, "outputs": [], "source": [ "def reverseBits(num,bitSize):\n", " # Convert number into binary representation\n", " # output will be like bin(10) = '0b10101'\n", " binary = bin(num)\n", " \n", " # Skip first two characters of binary\n", " # representation string and reverse\n", " # remaining string and then append zeros\n", " # after it. binary[-1:1:-1] --> start\n", " # from last character and reverse it until\n", " # second last character from left\n", " reverse = binary[-1:1:-1]\n", " reverse = reverse + (bitSize - len(reverse))*'0'\n", "\n", " # converts reversed binary string into integer\n", " return (int(reverse,2)) \n", "\n", "\n", "values = [1, 2, 3, 4, 5, 50, 75, 100, 999, 10000]\n", "\n", "print(\"Output:\")\n", "for value in values: \n", " print(reverseBits(value,32))" ] }, { "cell_type": "markdown", "metadata": { "id": "5qXgPEOZqBtv" }, "source": [ "## Convert the Binary Value to ASCII\n", "\n", "Similar to the code above (and a bit easier to understand), you can use Python's built in binascii module to encode the counters in an existing database. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "vcKzrZYTUnlc" }, "outputs": [], "source": [ "import binascii\n", "\n", "values = [1, 2, 3, 4, 5, 50, 75, 100, 999, 10000]\n", "\n", "print(\"Output:\")\n", "for value in values: \n", " encoded = binascii.crc32(str(value).encode('utf8'))\n", " print(encoded)" ] }, { "cell_type": "markdown", "metadata": { "id": "-aXpCKXArEnw" }, "source": [ "## Hashing\n", "\n", "Hashing takes any data and returns a hashed value. Because there are so many potential hashed values, it is extraordinarily unlikely that two different inputs would result in the same hashed value. Hashing works with any data type. It is a good way to use timestamps in your primary keys. Be aware that you can't reverse the operation. So, store the timestamps in another field if you want to be able to retrieve them later. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "45p9vdmm1L0k" }, "outputs": [], "source": [ "from datetime import datetime\n", "\n", "for i in range(10):\n", " dt = datetime.now() # Get the Date and time\n", " ts = datetime.timestamp(dt) # Get the Timestamp\n", " hashed = hash(str(ts)) # Hash the Timestamp\n", " print(\"Hashed Timestamp is:\", hashed)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "colab": { "collapsed_sections": [], "include_colab_link": true, "provenance": [] }, "kernelspec": { "display_name": "Python 3.8.2 64-bit", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.2" }, "vscode": { "interpreter": { "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6" } } }, "nbformat": 4, "nbformat_minor": 4 }