python/jupyter/HLLSketch.ipynb

{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## HLL Sketch Examples" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Basic Sketch Usage" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from datasketches import hll_sketch, hll_union, tgt_hll_type" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll create a sketch with log2(k) = 12" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "sk = hll_sketch(12)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Insert ~2 million points. Values are hashed, so using sequential integers is fine for demonstration purposes." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "### HLL SKETCH SUMMARY: \n", " Log Config K : 12\n", " Hll Target : HLL_4\n", " Current Mode : HLL\n", " LB : 2.06958e+06\n", " Estimate : 2.09635e+06\n", " UB : 2.12379e+06\n", " OutOfOrder flag: 0\n", " CurMin : 7\n", " NumAtCurMin : 72\n", " HipAccum : 2.09635e+06\n", " KxQ0 : 5.80703\n", " KxQ1 : 0\n", "\n" ] } ], "source": [ "n = 1 << 21\n", "for i in range(0, n):\n", " sk.update(i)\n", "print(sk)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since we know the exact value of n we can look at the estimate and upper/lower bounds as a % of the true value. We'll look at the bounds at 1 standard deviation. In this case, the true value does lie within the bounds, but since these are probabilistic bounds the true value will sometimes be outside them (especially at 1 standard deviation)." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Upper bound (1 std. dev) as % of true value: 101.2703\n" ] } ], "source": [ "print(\"Upper bound (1 std. dev) as % of true value: \", round(100*sk.get_upper_bound(1) / n, 4))" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Estimate as % of true value: 99.9618\n" ] } ], "source": [ "print(\"Estimate as % of true value: \", round(100*sk.get_estimate() / n, 4))" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Lower bound (1 std. dev) as % of true value: 98.6852\n" ] } ], "source": [ "print(\"Lower bound (1 std. dev) as % of true value: \", round(100*sk.get_lower_bound(1) / n, 4))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we can serialize and deserialize the sketch, which will give us back the same structure." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2096" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sk_bytes = sk.serialize_compact()\n", "len(sk_bytes)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "### HLL SKETCH SUMMARY: \n", " Log Config K : 12\n", " Hll Target : HLL_4\n", " Current Mode : HLL\n", " LB : 2.06958e+06\n", " Estimate : 2.09635e+06\n", " UB : 2.12379e+06\n", " OutOfOrder flag: 0\n", " CurMin : 7\n", " NumAtCurMin : 72\n", " HipAccum : 2.09635e+06\n", " KxQ0 : 5.80703\n", " KxQ1 : 0\n", "\n" ] } ], "source": [ "sk2 = hll_sketch.deserialize(sk_bytes)\n", "print(sk2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sketch Union Usage" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, we'll create two sketches with partial overlap in values. For good measure, we'll let k be larger in one sketch. For most applications we'd generally create all new data using the same size sketch, allowing differences to creep in when combining new and historica data." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "k = 12\n", "n = 1 << 20\n", "offset = int(3 * n / 4)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "sk1 = hll_sketch(k)\n", "sk2 = hll_sketch(k + 1)\n", "for i in range(0, n):\n", " sk1.update(i)\n", " sk2.update(i + offset)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a union object and add the sketches to that. To demonstrate smoothly handling multiple sketch sizes, we'll use a size of k+1 here." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "union = hll_union(k+1)\n", "union.update(sk1)\n", "union.update(sk2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note how log config k has automatically adopted the value of the smaller input sketch." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "### HLL SKETCH SUMMARY: \n", " Log Config K : 12\n", " Hll Target : HLL_4\n", " Current Mode : HLL\n", " LB : 1.80197e+06\n", " Estimate : 1.83108e+06\n", " UB : 1.86121e+06\n", " OutOfOrder flag: 1\n", " CurMin : 6\n", " NumAtCurMin : 2\n", " HipAccum : 1.76932e+06\n", " KxQ0 : 6.60752\n", " KxQ1 : 0\n", "\n" ] } ], "source": [ "result = union.get_result()\n", "print(result)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can again compare against the exact result, in this case 1.75*n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Estimate as % of true value: 99.7859\n" ] } ], "source": [ "print(\"Estimate as % of true value: \", round(100*result.get_estimate() / (7*n/4), 4))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.0" } }, "nbformat": 4, "nbformat_minor": 2 }

python/jupyter/HLLSketch.ipynb (346 lines of code) (raw):