python/jupyter/CPCSketch.ipynb (345 lines of code) (raw):
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## CPC Sketch Examples"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Basic Sketch Usage"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from datasketches import cpc_sketch, cpc_union"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll create a sketch with log2(k) = 12, i.e. k = 4096."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "sk = cpc_sketch(12)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Insert 2^21 (about 2.1 million) distinct values. Inputs are hashed before use, so sequential integers are fine for demonstration purposes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "### CPC sketch summary:\n",
      "   lgK            : 12\n",
      "   seed hash      : 93cc\n",
      "   C              : 38212\n",
      "   flavor         : 4\n",
      "   merged         : false\n",
      "   compressed     : false\n",
      "   intresting col : 5\n",
      "   HIP estimate   : 2.09721e+06\n",
      "   kxp            : 11.4725\n",
      "   offset         : 6\n",
      "   table          : allocated\n",
      "   num SV         : 135\n",
      "   window         : allocated\n",
      "### End sketch summary\n",
      "\n"
     ]
    }
   ],
   "source": [
    "n = 1 << 21\n",
    "for i in range(0, n):\n",
    "    sk.update(i)\n",
    "print(sk)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since we know the exact value of n, we can express the estimate and its upper and lower bounds as a percentage of the true value. We'll use the bounds at 1 standard deviation. Here the true value does lie within the bounds, but because the bounds are probabilistic, the true value will sometimes fall outside them (especially at only 1 standard deviation)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Upper bound (1 std. dev) as % of true value:  100.9281\n"
     ]
    }
   ],
   "source": [
    "print(\"Upper bound (1 std. dev) as % of true value: \", round(100*sk.get_upper_bound(1) / n, 4))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Estimate as % of true value:  100.0026\n"
     ]
    }
   ],
   "source": [
    "print(\"Estimate as % of true value: \", round(100*sk.get_estimate() / n, 4))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Lower bound (1 std. dev) as % of true value:  99.0935\n"
     ]
    }
   ],
   "source": [
    "print(\"Lower bound (1 std. dev) as % of true value: \", round(100*sk.get_lower_bound(1) / n, 4))"
   ]
  },
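  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough sketch of how the bounds behave, the cell below prints them at 1, 2 and 3 standard deviations, which are assumed here to be the values that `get_lower_bound` and `get_upper_bound` accept. The interval should widen as the number of standard deviations grows; whether it contains the true value in any single run remains a matter of chance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for kappa in (1, 2, 3):\n",
    "    lb = sk.get_lower_bound(kappa)\n",
    "    ub = sk.get_upper_bound(kappa)\n",
    "    # Report each bound as a percentage of the known true count n\n",
    "    print(f'{kappa} std. dev: {round(100 * lb / n, 4)} to {round(100 * ub / n, 4)}, contains n: {lb <= n <= ub}')"
   ]
  },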
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we can serialize and deserialize the sketch, which will give us back the same structure."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2484"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sk_bytes = sk.serialize()\n",
    "len(sk_bytes)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "### CPC sketch summary:\n",
      "   lgK            : 12\n",
      "   seed hash      : 93cc\n",
      "   C              : 38212\n",
      "   flavor         : 4\n",
      "   merged         : false\n",
      "   compressed     : false\n",
      "   intresting col : 5\n",
      "   HIP estimate   : 2.09721e+06\n",
      "   kxp            : 11.4725\n",
      "   offset         : 6\n",
      "   table          : allocated\n",
      "   num SV         : 135\n",
      "   window         : allocated\n",
      "### End sketch summary\n",
      "\n"
     ]
    }
   ],
   "source": [
    "sk2 = cpc_sketch.deserialize(sk_bytes)\n",
    "print(sk2)"
   ]
  },
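  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One quick way to check the round trip, sketched here on the assumption that deserialization restores the full sketch state, is to compare the estimates and serialized sizes of the original sketch and its copy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compare the original sketch with its deserialized copy\n",
    "print('Estimates match: ', sk.get_estimate() == sk2.get_estimate())\n",
    "print('Serialized sizes match: ', len(sk2.serialize()) == len(sk_bytes))"
   ]
  },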
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Sketch Union Usage"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, we'll create two sketches with partial overlap in values. For good measure, we'll let k be larger in one sketch. In most applications we'd create all new data using the same sketch size, with differences creeping in only when combining new and historical data."
   ]
  },
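  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "k = 12  # log2 of the sketch size (lgK), not k itself\n",
    "n = 1 << 20\n",
    "offset = 3 * n // 4  # sk2's values start here, so the two sketches share n/4 values"
   ]
  },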
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "sk1 = cpc_sketch(k)\n",
    "sk2 = cpc_sketch(k + 1)\n",
    "for i in range(0, n):\n",
    "    sk1.update(i)\n",
    "    sk2.update(i + offset)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create a union object and add both sketches to it. To demonstrate that the union handles mixed sketch sizes smoothly, we'll configure it with k+1 here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "union = cpc_union(k+1)\n",
    "union.update(sk1)\n",
    "union.update(sk2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note how lgK in the result has automatically dropped to that of the smaller input sketch (12), even though the union itself was configured with 13."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "### CPC sketch summary:\n",
      "   lgK            : 12\n",
      "   seed hash      : 93cc\n",
      "   C              : 37418\n",
      "   flavor         : 4\n",
      "   merged         : true\n",
      "   compressed     : false\n",
      "   intresting col : 5\n",
      "   HIP estimate   : 0\n",
      "   kxp            : 4096\n",
      "   offset         : 6\n",
      "   table          : allocated\n",
      "   num SV         : 123\n",
      "   window         : allocated\n",
      "### End sketch summary\n",
      "\n"
     ]
    }
   ],
   "source": [
    "result = union.get_result()\n",
    "print(result)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can again compare against the exact result. The two input ranges overlap on n/4 values, so the union contains 1.75*n distinct values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Estimate as % of true value:  99.6646\n"
     ]
    }
   ],
   "source": [
    "print(\"Estimate as % of true value: \", round(100*result.get_estimate() / (7*n/4), 4))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}