jupyter/comparison-to-datasketch/api-differences.ipynb

{ "cells": [ { "cell_type": "markdown", "id": "46ea1d79-8904-4ca0-9f40-2b8b11baa9f7", "metadata": {}, "source": [ "# API Differences between the libraries\n", "\n", "We outline the key API differences between the libraries that users should be aware of when using sketches. The Apache DataSketches are designed to have, as far as possible, a consistent API. Therefore, although only the following examples are provided here, they roughly map on to all other sketches provided." ] }, { "cell_type": "code", "execution_count": 1, "id": "9fba04a9-c359-40ad-937b-ff25103ddf56", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import datasketches as asf\n", "import datasketch as ds\n", "import mmh3" ] }, { "cell_type": "markdown", "id": "5168ce81-9a7f-4889-878c-c6285e1bf580", "metadata": {}, "source": [ "1. The `update()` method for `asf.hll_sketch` accepts inputs as integers, strings, bytes, and floats. On the other hand, `datasketch.HyperLogLogPlusPlus` only accepts byte and string type inputs." ] }, { "cell_type": "code", "execution_count": 2, "id": "4e410c91-b156-4c91-9d85-fe1695c6492e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4.000000029802323" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Datasketches HLL can accept multiple inputs\n", "# These are treated as different items in a single sketch.\n", "asf_hll_types = asf.hll_sketch(14, asf.HLL_8)\n", "asf_hll_types.update(1)\n", "asf_hll_types.update(1.0)\n", "asf_hll_types.update(str(1))\n", "\n", "xx = 1\n", "xx_bytes = xx.to_bytes(64, \"little\")\n", "asf_hll_types.update(xx_bytes)\n", "\n", "asf_hll_types.get_estimate()" ] }, { "cell_type": "code", "execution_count": 3, "id": "ac159800-f984-40b5-9a1d-f26213f7ce49", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Exception on integer input\n", "Exception on float input\n", "Accepts bytes input\n", "Accepts string input\n", "2.000122080247517\n" ] } ], "source": [ "# datasketch HLL needs bytes\n", "dhll_type = ds.HyperLogLogPlusPlus(14, hashfunc=lambda x: mmh3.hash64(x, signed=False)[0])\n", "try:\n", " dhll_type.update(1)\n", "except:\n", " print(\"Exception on integer input\")\n", " \n", "try:\n", " dhll_type.update(1.0)\n", "except:\n", " print(\"Exception on float input\")\n", " \n", "try:\n", " dhll_type.update(xx_bytes)\n", " print(\"Accepts bytes input\")\n", "except:\n", " print(\"Exception on string input\")\n", " \n", "try:\n", " dhll_type.update(str(1))\n", " print(\"Accepts string input\")\n", "except:\n", " print(\"Exception on string input\")\n", " \n", "print(dhll_type.count()) # only two distinct items inserted into the sketch." ] }, { "cell_type": "markdown", "id": "5cfe6a15-a932-4351-aa34-ddf99ebc2e39", "metadata": {}, "source": [ "2. The ASF HLL implementation comes with `get_upper_bound()` and `get_lower_bound()` functions. These enable the user to understand with what confidence. On the other hand, the `datasketch` implementation returns only the estimated count.\n", "\n" ] }, { "cell_type": "code", "execution_count": 20, "id": "7e6e81d4-4aa4-4692-82ae-e1b9209a9896", "metadata": {}, "outputs": [], "source": [ "a_hll = asf.hll_sketch(14, asf.HLL_8)\n", "d_hll = ds.HyperLogLogPlusPlus(14, hashfunc=lambda x: mmh3.hash64(x, signed=False)[0])\n", "\n", "n = 1<<15\n", "for x in range(n):\n", " a_hll.update(x)\n", " d_hll.update(str(x))" ] }, { "cell_type": "code", "execution_count": 21, "id": "6a01bf67-a885-4bfc-a9ed-69a728af55f1", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Lower bound (1 std. dev) as % of true value: 99.5952\n", "ASF HyperLogLog estimate as % of true value: 100.2430\n", "Upper bound (1 std. dev) as % of true value: 100.8992\n" ] } ], "source": [ "#asf_hll_sketch = all_asf_hll[0] \n", "print(f\"Lower bound (1 std. dev) as % of true value: {(100*a_hll.get_lower_bound(1) / n):.4f}\")\n", "print(f\"ASF HyperLogLog estimate as % of true value: {(100*a_hll.get_estimate() / n):.4f}\")\n", "print(f\"Upper bound (1 std. dev) as % of true value: {(100*a_hll.get_upper_bound(1) / n):.4f}\")\n" ] }, { "cell_type": "code", "execution_count": 22, "id": "bcd01928-30ca-4d82-8fb4-b704ba0498da", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "datasketch HyperLogLog estimate as % of true value: 100.6836\n" ] } ], "source": [ "print(f\"datasketch HyperLogLog estimate as % of true value: {(100*d_hll.count() / n):.4f}\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" } }, "nbformat": 4, "nbformat_minor": 5 }

jupyter/comparison-to-datasketch/api-differences.ipynb (200 lines of code) (raw):