python/jupyter/HLLSketch.ipynb (346 lines of code) (raw):
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## HLL Sketch Examples"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Basic Sketch Usage"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from datasketches import hll_sketch, hll_union, tgt_hll_type"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll create a sketch with log2(k) = 12"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "sk = hll_sketch(12)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Insert ~2 million points. Values are hashed, so using sequential integers is fine for demonstration purposes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "### HLL SKETCH SUMMARY: \n",
      "  Log Config K   : 12\n",
      "  Hll Target     : HLL_4\n",
      "  Current Mode   : HLL\n",
      "  LB             : 2.06958e+06\n",
      "  Estimate       : 2.09635e+06\n",
      "  UB             : 2.12379e+06\n",
      "  OutOfOrder flag: 0\n",
      "  CurMin       : 7\n",
      "  NumAtCurMin  : 72\n",
      "  HipAccum     : 2.09635e+06\n",
      "  KxQ0         : 5.80703\n",
      "  KxQ1         : 0\n",
      "\n"
     ]
    }
   ],
   "source": [
    "n = 1 << 21\n",
    "for i in range(0, n):\n",
    "    sk.update(i)\n",
    "print(sk)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since we know the exact value of n we can look at the estimate and upper/lower bounds as a % of the true value. We'll look at the bounds at 1 standard deviation. In this case, the true value does lie within the bounds, but since these are probabilistic bounds the true value will sometimes be outside them (especially at 1 standard deviation)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Upper bound (1 std. dev) as % of true value:  101.2703\n"
     ]
    }
   ],
   "source": [
    "print(\"Upper bound (1 std. dev) as % of true value: \", round(100*sk.get_upper_bound(1) / n, 4))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Estimate as % of true value:  99.9618\n"
     ]
    }
   ],
   "source": [
    "print(\"Estimate as % of true value: \", round(100*sk.get_estimate() / n, 4))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Lower bound (1 std. dev) as % of true value:  98.6852\n"
     ]
    }
   ],
   "source": [
    "print(\"Lower bound (1 std. dev) as % of true value: \", round(100*sk.get_lower_bound(1) / n, 4))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we can serialize and deserialize the sketch, which will give us back the same structure."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2096"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sk_bytes = sk.serialize_compact()\n",
    "len(sk_bytes)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "### HLL SKETCH SUMMARY: \n",
      "  Log Config K   : 12\n",
      "  Hll Target     : HLL_4\n",
      "  Current Mode   : HLL\n",
      "  LB             : 2.06958e+06\n",
      "  Estimate       : 2.09635e+06\n",
      "  UB             : 2.12379e+06\n",
      "  OutOfOrder flag: 0\n",
      "  CurMin       : 7\n",
      "  NumAtCurMin  : 72\n",
      "  HipAccum     : 2.09635e+06\n",
      "  KxQ0         : 5.80703\n",
      "  KxQ1         : 0\n",
      "\n"
     ]
    }
   ],
   "source": [
    "sk2 = hll_sketch.deserialize(sk_bytes)\n",
    "print(sk2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Sketch Union Usage"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, we'll create two sketches with partial overlap in values. For good measure, we'll let k be larger in one sketch. For most applications we'd generally create all new data using the same size sketch, allowing differences to creep in when combining new and historica data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "k = 12\n",
    "n = 1 << 20\n",
    "offset = int(3 * n / 4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "sk1 = hll_sketch(k)\n",
    "sk2 = hll_sketch(k + 1)\n",
    "for i in range(0, n):\n",
    "    sk1.update(i)\n",
    "    sk2.update(i + offset)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create a union object and add the sketches to that. To demonstrate smoothly handling multiple sketch sizes, we'll use a size of k+1 here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "union = hll_union(k+1)\n",
    "union.update(sk1)\n",
    "union.update(sk2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note how log config k has automatically adopted the value of the smaller input sketch."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "### HLL SKETCH SUMMARY: \n",
      "  Log Config K   : 12\n",
      "  Hll Target     : HLL_4\n",
      "  Current Mode   : HLL\n",
      "  LB             : 1.80197e+06\n",
      "  Estimate       : 1.83108e+06\n",
      "  UB             : 1.86121e+06\n",
      "  OutOfOrder flag: 1\n",
      "  CurMin       : 6\n",
      "  NumAtCurMin  : 2\n",
      "  HipAccum     : 1.76932e+06\n",
      "  KxQ0         : 6.60752\n",
      "  KxQ1         : 0\n",
      "\n"
     ]
    }
   ],
   "source": [
    "result = union.get_result()\n",
    "print(result)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can again compare against the exact result, in this case 1.75*n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Estimate as % of true value:  99.7859\n"
     ]
    }
   ],
   "source": [
    "print(\"Estimate as % of true value: \", round(100*result.get_estimate() / (7*n/4), 4))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}