| Path | Lines of Code |
| --- | --- |
| README.md | 428 |
| examples/exact_substrings.py | 38 |
| examples/filter_hf_dataset.py | 37 |
| examples/fineweb.py | 144 |
| examples/minhash_deduplication.py | 92 |
| examples/process_common_crawl_dump.py | 45 |
| examples/sentence_deduplication.py | 39 |
| examples/summary_stats.py | 70 |
| examples/tokenize_c4.py | 42 |
| examples/tokenize_from_hf_to_s3.py | 90 |
| examples/url_deduplication.py | 56 |
| src/datatrove/assets/banned_subwords.txt | 135 |
| src/datatrove/assets/banned_words.txt | 216 |
| src/datatrove/assets/soft_banned_words.txt | 89 |
| src/datatrove/tools/fast_mh3/README.md | 37 |