# Parquet Dedupe Estimator
This tool estimates the amount of chunk-level dedupe available in a
collection of parquet files. All files are chunked together, so if there are
multiple versions of a file, all versions should be provided together to see
how much can be saved using a CDC-based content-addressable storage system.
It is primarily designed to evaluate the deduplication effectiveness of a
content-defined chunking feature for Apache Parquet.
Content Defined Chunking (CDC) is a technique used to divide data into
variable-sized chunks based on the content of the data itself, rather
than fixed-size boundaries. This method is particularly effective for
deduplication because it ensures that identical data segments are
consistently chunked in the same way, even if their positions within
the dataset change. By using CDC, systems can more accurately identify
and eliminate duplicate data, leading to significant storage savings.
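To make the idea concrete, here is a minimal, illustrative Python sketch of
gear-style content-defined chunking. It is not the chunker used by this tool
or by the Arrow branch described below; the gear table, mask, and size limits
are arbitrary example values.
```python
# Illustrative gear-style CDC sketch; mask and size limits are example values.
import hashlib
import random

random.seed(0)
GEAR = [random.getrandbits(64) for _ in range(256)]  # random per-byte constants
MASK = (1 << 16) - 1                                 # ~64 KiB average chunk size
MIN_SIZE, MAX_SIZE = 8 * 1024, 256 * 1024

def cdc_chunks(data: bytes):
    """Yield (offset, length) pairs for content-defined chunks of `data`."""
    start, h = 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF  # rolling gear hash
        size = i - start + 1
        # Cut where the hash matches the mask (content-defined) or at MAX_SIZE.
        if (size >= MIN_SIZE and (h & MASK) == 0) or size >= MAX_SIZE:
            yield start, size
            start, h = i + 1, 0
    if start < len(data):
        yield start, len(data) - start

# Identical byte runs produce mostly identical chunks even after their offsets shift:
data = random.randbytes(1 << 20)
shifted = b"x" * 100 + data
chunks_a = {hashlib.sha256(data[o:o + n]).digest() for o, n in cdc_chunks(data)}
chunks_b = {hashlib.sha256(shifted[o:o + n]).digest() for o, n in cdc_chunks(shifted)}
print(f"shared chunks: {len(chunks_a & chunks_b)} of {len(chunks_a)}")
```
With fixed-size chunking the 100-byte prefix would shift every later boundary
and almost no chunks would match; with content-defined boundaries nearly all
of them do.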
[Apache Parquet](https://parquet.apache.org/) is a columnar storage format
that is widely used in the data engineering community. It is a binary format
that stores data column by column, which allows for efficient reads and
writes. In the context of Apache Parquet, CDC can enhance the efficiency of
data storage and retrieval by creating chunks that are robust to data
modifications such as updates, inserts, and deletes.
# Arrow C++ / PyArrow Implementation
The implementation has not yet been merged upstream and is currently
available on the following branch: [content-defined-chunking](https://github.com/kszucs/arrow/tree/content-defined-chunking).
This implementation introduces a gearhash-based chunker for each leaf column
within the `ColumnWriter` objects, which persist throughout the lifecycle of
the Parquet file writer. As new Arrow Arrays are appended to the column
writers, the record-shredded arrays `(def_levels, rep_levels, leaf_array)`
are processed by the chunker to identify chunk boundaries.
Unlike the traditional approach where pages are split once a page's size
reaches the default limit (typically 1MB), this implementation splits pages
based on the chunk boundaries identified by the chunker. Consequently, the
resulting chunks will have variable sizes but will be more resilient to data
modifications such as updates, inserts, and deletes. This method enhances the
robustness and efficiency of data storage and retrieval in Apache Parquet by
ensuring that identical data segments are consistently chunked in the same
manner, regardless of their position within the dataset.
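For reference, a hypothetical PyArrow usage sketch with a build from that
branch might look like the following. The `use_content_defined_chunking`
writer option is an assumption about how the branch exposes the feature, not
a released PyArrow API.
```python
# Hypothetical sketch: writing a CDC-enabled Parquet file with a PyArrow build
# from the content-defined-chunking branch; the writer option name is assumed.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"a": pa.array(range(1 << 20), type=pa.int64())})

# Default writer: pages are split once they reach the size limit (~1MB), so a
# small insert near the start of a new version shifts every later page boundary.
pq.write_table(table, "vanilla.parquet", compression="snappy")

# CDC writer: pages are split at chunker-chosen boundaries instead, so unchanged
# data segments are chunked the same way across file versions.
pq.write_table(
    table,
    "cdc.parquet",
    compression="snappy",
    use_content_defined_chunking=True,  # assumed option name
)
```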
# Quick Start
Compile:
```bash
maturin develop -r
```
This will install a command-line tool called `de`.
## Available Commands
Run the deduplication estimator on two or more files:
```bash
de dedup a.parquet b.parquet
```
Generate synthetic data and visualize deduplication:
```bash
de synthetic -s 1 -e 1 '{"a": "int", "b": "str", "c": ["float"]}'
```
Check out all revisions of a file within a git repository:
https://huggingface.co/datasets/cfahlgren1/hub-stats/blob/main/datasets.parquet
```bash
❯ de revisions -d /tmp/datasets ~/Datasets/hub-stats/datasets.parquet
datasets.parquet has 194 revisions
Checking out 50a6ff0
Checking out 1ae78e6
Checking out 03613ea
...
```
Generate deduplication statistics for a directory of parquet files:
```bash
❯ de stats --with-json /tmp/datasets
100%|██████████████████████████████████████████| 194/194 [01:18<00:00, 2.46it/s]
Writing CDC Parquet files
100%|██████████████████████████████████████████| 194/194 [00:12<00:00, 16.01it/s]
100%|██████████████████████████████████████████| 194/194 [00:09<00:00, 19.48it/s]
Estimating deduplication for JSONLines
Estimating deduplication for Parquet
Estimating deduplication for CDC Snappy
Estimating deduplication for CDC ZSTD
┏━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ ┃ ┃ ┃ ┃ ┃ Compressed ┃
┃ ┃ ┃ ┃ Compressed ┃ Deduplicat… ┃ Deduplica… ┃
┃ Title ┃ Total Size ┃ Chunk Size ┃ Chunk Size ┃ Ratio ┃ Ratio ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ JSONLines │ 93.0 GiB │ 64.9 GiB │ 12.4 GiB │ 70% │ 13% │
│ Parquet │ 16.2 GiB │ 15.0 GiB │ 13.4 GiB │ 93% │ 83% │
│ CDC Snappy │ 16.2 GiB │ 8.3 GiB │ 7.9 GiB │ 51% │ 48% │
│ CDC ZSTD │ 8.9 GiB │ 5.5 GiB │ 5.5 GiB │ 61% │ 61% │
└────────────┴────────────┴────────────┴─────────────┴─────────────┴────────────┘
```
It also generates a plot comparing the results:

# Results on Synthetic Data
The experiment can be reproduced by running the following command:
```bash
❯ de synthetic -s $S -e $E '{"a": "int"}'
```
This will generate `$S` Mi (2^20) records with schema `<a: int>`, along with
variants containing `$E` edits (blocks of 10 records inserted, deleted, or
updated) evenly distributed across the table, and another variant with 5% of
the records appended to the end. The resulting parquet files have `$S` row
groups.
The tool also generates .jsonlines files for comparison, since content-defined
chunking is well suited for row-major formats.
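For intuition, here is a rough, illustrative sketch of how one such "updated"
variant can be built with PyArrow; it is not the tool's actual generator, and
the file names and edit position are arbitrary.
```python
# Illustrative only: build a 1 Mi row base table and an "updated" variant with
# a single edited block of 10 records, roughly mirroring the description above.
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

rows = 1 << 20  # 1 Mi records
base = pa.table({"a": np.arange(rows, dtype=np.int64)})

# "updated" variant: overwrite a block of 10 records in the middle of the table
values = base.column("a").to_numpy().copy()
values[rows // 2 : rows // 2 + 10] = -1
updated = pa.table({"a": values})

# one row group per Mi records, matching the description above
pq.write_table(base, "original.parquet", row_group_size=1 << 20)
pq.write_table(updated, "updated.parquet", row_group_size=1 << 20)
```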
{% for schema in [{"a": "int"}, {"a": "int", "b": "str", "c": ["float"]}] %}
{% set cols = schema | count %}
{% for rows in [1, 4] %}
## Results on Synthetic Data with {{ schema }} and {{ rows }}Mi rows
The experiment can be reproduced by running the following command:
```bash
❯ de synthetic -s {{ rows }} -e 1 '{{ schema | tojson() }}'
❯ de synthetic -s {{ rows }} -e 2 '{{ schema | tojson() }}'
```
{%- set operations = ["appended", "updated"] -%}
{%- if cols > 1 -%}
{%- for key in schema.keys() -%}
{%- set _ = operations.append("updated_" ~ key) -%}
{%- endfor -%}
{%- endif -%}
{%- set operations = operations + ["inserted", "deleted"] -%}
{% for operation in operations -%}
{% if operation == "appended" -%}
{% set edits = [1] %}
{% else %}
{% set edits = [1, 2] %}
{% endif %}
{%- for edit in edits -%}
### {{ operation|capitalize }} - {{ rows }}Mi Rows / {{ cols }} Columns / {{ edit }} Edits:
| Compression | Vanilla Parquet | CDC Parquet | JSONLines | SQLite |
| :---------: | ----------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------- |
| None |  |  |  |  |
| Snappy |  |  | |  |
| ZSTD |  |  |  |  |
{% endfor %}
{% endfor %}
{% endfor %}
{% endfor %}
## Datasets Results: cfahlgren1/hub-stats/spaces.parquet
https://huggingface.co/datasets/cfahlgren1/hub-stats/blob/main/spaces.parquet
Check out all revisions of spaces.parquet:
```bash
❯ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/cfahlgren1/hub-stats.git ~/Datasets/hub-stats
❯ de revisions -d /tmp/spaces ~/Datasets/hub-stats/spaces.parquet
```
This will take some time depending on the network speed. Then run the
evaluation, which automatically:
1. Converts the parquet files to JSONLines
2. Converts the parquet files to CDC Parquet with Snappy and ZSTD compressions
3. Estimates the deduplication ratio for each format
4. Prints the results and generates a plot for comparison
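As a rough sketch of steps 1 and 2 for a single revision file (the paths are
examples, `use_content_defined_chunking` is the assumed writer option
discussed earlier, and step 3 works like the ratio sketch shown above):
```python
# Illustrative sketch of steps 1 and 2 for one revision file (example paths;
# the CDC writer option name is an assumption, as noted earlier).
import json
import pyarrow.parquet as pq

table = pq.read_table("/tmp/spaces/example_revision.parquet")

# 1. JSONLines copy, one JSON object per row
with open("/tmp/spaces/example_revision.jsonl", "w") as f:
    for row in table.to_pylist():
        f.write(json.dumps(row, default=str) + "\n")

# 2. CDC Parquet copies with Snappy and ZSTD compression
pq.write_table(table, "/tmp/spaces/example_revision.cdc.snappy.parquet",
               compression="snappy", use_content_defined_chunking=True)
pq.write_table(table, "/tmp/spaces/example_revision.cdc.zstd.parquet",
               compression="zstd", use_content_defined_chunking=True)
```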
```bash
❯ de stats --with-json /tmp/spaces
Writing JSONLines files
100%|██████████████████████████████████████████| 193/193 [07:14<00:00, 2.25s/it]
Writing CDC Parquet files
100%|██████████████████████████████████████████| 193/193 [00:38<00:00, 4.97it/s]
100%|██████████████████████████████████████████| 193/193 [00:31<00:00, 6.12it/s]
Estimating deduplication for JSONLines
Estimating deduplication for Parquet
Estimating deduplication for CDC Snappy
Estimating deduplication for CDC ZSTD
┏━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ ┃ ┃ ┃ ┃ ┃ Compressed ┃
┃ ┃ ┃ ┃ Compressed ┃ Deduplicat… ┃ Deduplica… ┃
┃ Title ┃ Total Size ┃ Chunk Size ┃ Chunk Size ┃ Ratio ┃ Ratio ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ JSONLines │ 157.8 GiB │ 12.8 GiB │ 2.7 GiB │ 8% │ 2% │
│ Parquet │ 24.3 GiB │ 23.5 GiB │ 22.6 GiB │ 97% │ 93% │
│ CDC Snappy │ 24.3 GiB │ 8.8 GiB │ 8.5 GiB │ 36% │ 35% │
│ CDC ZSTD │ 15.3 GiB │ 6.2 GiB │ 6.2 GiB │ 41% │ 41% │
└────────────┴────────────┴────────────┴─────────────┴─────────────┴────────────┘
```

## Datasets Results: openfoodfacts/product-database/food.parquet
```bash
❯ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/openfoodfacts/product-database.git ~/Datasets/product-database
❯ de revisions -d /tmp/food ~/Datasets/product-database/food.parquet
food.parquet has 32 revisions
Checking out 2e19b51
Checking out 1f84d31
Checking out d31d108
Checking out 9cd809c
Checking out 41e5f38
Checking out 9a30ddd
...
```
Generate the stats, but skip generating the JSONLines files: the compressed
parquet files are around 6 GB each, so the uncompressed JSONLines files would
require too much disk space.
```bash
❯ de stats /tmp/food --max-processes 8
Writing CDC Parquet files
100%|████████████████████████████████████████████| 32/32 [07:17<00:00, 13.67s/it]
100%|████████████████████████████████████████████| 32/32 [06:48<00:00, 12.77s/it]
Estimating deduplication for Parquet
Estimating deduplication for CDC Snappy
Estimating deduplication for CDC ZSTD
┏━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ ┃ ┃ ┃ ┃ ┃ Compressed ┃
┃ ┃ ┃ ┃ Compressed ┃ Deduplicat… ┃ Deduplica… ┃
┃ Title ┃ Total Size ┃ Chunk Size ┃ Chunk Size ┃ Ratio ┃ Ratio ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Parquet │ 182.6 GiB │ 148.0 GiB │ 140.5 GiB │ 81% │ 77% │
│ CDC Snappy │ 178.3 GiB │ 75.6 GiB │ 73.1 GiB │ 42% │ 41% │
│ CDC ZSTD │ 109.6 GiB │ 55.9 GiB │ 55.5 GiB │ 51% │ 51% │
└────────────┴────────────┴────────────┴─────────────┴─────────────┴────────────┘
```
