huggingface / datatrove
File Size

The distribution of file sizes, measured in lines of code.

File Size Overall
Size bucket (lines) | 1001+ | 501-1000 | 201-500 | 101-200 | 1-100
Share | 32% | 0% | 29% | 16% | 21%
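The bucket shares above can be reproduced with a short script. The following is a minimal sketch, not the report tool's actual implementation. It assumes the percentages are weighted by lines of code rather than by file count, which the data suggests: the only file above 1000 lines is the single 4106-line file in the longest-files list, yet the 1001+ bucket holds 32%. The names `BUCKETS`, `bucket_for`, and `size_distribution` are hypothetical.

```python
# Size buckets as listed in the report: (label, lower bound, upper bound or None).
BUCKETS = [
    ("1001+", 1001, None),
    ("501-1000", 501, 1000),
    ("201-500", 201, 500),
    ("101-200", 101, 200),
    ("1-100", 1, 100),
]


def bucket_for(line_count):
    """Return the bucket label for a file with the given number of lines."""
    for label, low, high in BUCKETS:
        if line_count >= low and (high is None or line_count <= high):
            return label
    return None  # empty files (0 lines) fall outside every bucket


def size_distribution(line_counts):
    """Return {bucket label: percentage of all lines held by files in that bucket}.

    Assumption: shares are weighted by lines of code, not by number of files.
    """
    totals = {label: 0 for label, _, _ in BUCKETS}
    for count in line_counts:
        label = bucket_for(count)
        if label is not None:
            totals[label] += count
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {label: 0.0 for label in totals}
    return {label: 100.0 * lines / grand_total for label, lines in totals.items()}


if __name__ == "__main__":
    # The two Rust files from the report (477 and 272 lines) both land in 201-500.
    for label, share in size_distribution([477, 272]).items():
        print(f"{label}: {share:.0f}%")
```

Fed the two `.rs` line counts from the report, the sketch places all lines in the 201-500 bucket, matching the `rs` row of the per-extension breakdown.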


File Size per Extension
Extension | 1001+ | 501-1000 | 201-500 | 101-200 | 1-100
py | 34% | 0% | 25% | 17% | 22%
rs | 0% | 0% | 100% | 0% | 0%
toml | 0% | 0% | 0% | 85% | 14%
File Size per Logical Decomposition
Component (primary) | 1001+ | 501-1000 | 201-500 | 101-200 | 1-100
src | 32% | 0% | 30% | 15% | 21%
ROOT | 0% | 0% | 0% | 100% | 0%
Longest Files (Top 50)
File | # lines | # units
typeshelper.py
in src/datatrove/utils
4106 -
minhash.py
in src/datatrove/pipeline/dedup
499 18
s3_union_find.rs
in src/datatrove/tools/fast_mh3/src
477 4
sentence_dedup.py
in src/datatrove/pipeline/dedup
370 14
word_tokenizers.py
in src/datatrove/utils
361 60
url_dedup.py
in src/datatrove/pipeline/dedup
301 15
tokenizer.py
in src/datatrove/pipeline/tokens
298 16
stats.py
in src/datatrove/utils
297 41
local_union_find.rs
in src/datatrove/tools/fast_mh3/src
272 2
text.py
in src/datatrove/utils
254 6
slurm.py
in src/datatrove/executor
235 11
exact_substrings.py
in src/datatrove/pipeline/dedup
229 17
japanese_tokenizer.py
in src/datatrove/utils
216 16
c4_filters.py
in src/datatrove/pipeline/filters
197 8
dataset.py
in src/datatrove/utils
194 12
io.py
in src/datatrove
188 27
ray.py
in src/datatrove/executor
160 5
merger.py
in src/datatrove/pipeline/tokens
153 5
152 -
n_grams.py
in src/datatrove/pipeline/decont
150 6
base.py
in src/datatrove/pipeline/readers
143 9
bloom_filter.py
in src/datatrove/pipeline/dedup
143 12
perplexity.py
in src/datatrove/utils
138 12
inspect_data.py
in src/datatrove/tools
129 4
base.py
in src/datatrove/pipeline/extractors
117 11
megatron_tokenizer.py
in src/datatrove/pipeline/tokens
107 9
huggingface.py
in src/datatrove/pipeline/writers
104 4
local.py
in src/datatrove/executor
101 4
base.py
in src/datatrove/executor
98 10
huggingface.py
in src/datatrove/pipeline/readers
95 4
warc.py
in src/datatrove/pipeline/readers
94 3
url_filter.py
in src/datatrove/pipeline/filters
91 6
base.py
in src/datatrove/pipeline/stats
91 4
gopher_repetition_filter.py
in src/datatrove/pipeline/filters
90 6
disk_base.py
in src/datatrove/pipeline/writers
83 11
fasttext_filter.py
in src/datatrove/pipeline/filters
78 3
gopher_quality_filter.py
in src/datatrove/pipeline/filters
76 2
jobs_status.py
in src/datatrove/tools
71 1
pii.py
in src/datatrove/pipeline/formatters
68 5
parquet.py
in src/datatrove/pipeline/writers
67 5
ipc.py
in src/datatrove/pipeline/readers
64 4
tokenization.py
in src/datatrove/utils
58 6
check_dataset.py
in src/datatrove/tools
58 4
line_stats.py
in src/datatrove/pipeline/stats
57 5
parquet.py
in src/datatrove/pipeline/readers
57 2
jsonl.py
in src/datatrove/pipeline/readers
57 2
word_stats.py
in src/datatrove/pipeline/stats
56 4
failed_logs.py
in src/datatrove/tools
55 1
binaryio.py
in src/datatrove/utils
54 3
_import_utils.py
in src/datatrove/utils
54 12
Files With Most Units (Top 50)
File | # lines | # units
word_tokenizers.py
in src/datatrove/utils
361 60
stats.py
in src/datatrove/utils
297 41
io.py
in src/datatrove
188 27
minhash.py
in src/datatrove/pipeline/dedup
499 18
exact_substrings.py
in src/datatrove/pipeline/dedup
229 17
tokenizer.py
in src/datatrove/pipeline/tokens
298 16
japanese_tokenizer.py
in src/datatrove/utils
216 16
url_dedup.py
in src/datatrove/pipeline/dedup
301 15
sentence_dedup.py
in src/datatrove/pipeline/dedup
370 14
bloom_filter.py
in src/datatrove/pipeline/dedup
143 12
_import_utils.py
in src/datatrove/utils
54 12
dataset.py
in src/datatrove/utils
194 12
perplexity.py
in src/datatrove/utils
138 12
disk_base.py
in src/datatrove/pipeline/writers
83 11
base.py
in src/datatrove/pipeline/extractors
117 11
slurm.py
in src/datatrove/executor
235 11
base.py
in src/datatrove/executor
98 10
base.py
in src/datatrove/pipeline/readers
143 9
megatron_tokenizer.py
in src/datatrove/pipeline/tokens
107 9
c4_filters.py
in src/datatrove/pipeline/filters
197 8
base.py
in src/datatrove/pipeline
35 8
hashing.py
in src/datatrove/utils
37 8
logging.py
in src/datatrove/utils
37 7
gopher_repetition_filter.py
in src/datatrove/pipeline/filters
90 6
url_filter.py
in src/datatrove/pipeline/filters
91 6
n_grams.py
in src/datatrove/pipeline/decont
150 6
text.py
in src/datatrove/utils
254 6
tokenization.py
in src/datatrove/utils
58 6
lid.py
in src/datatrove/utils
45 6
pii.py
in src/datatrove/pipeline/formatters
68 5
base_filter.py
in src/datatrove/pipeline/filters
49 5
line_stats.py
in src/datatrove/pipeline/stats
57 5
parquet.py
in src/datatrove/pipeline/writers
67 5
merger.py
in src/datatrove/pipeline/tokens
153 5
ray.py
in src/datatrove/executor
160 5
unigram_log_probs.py
in src/datatrove/pipeline/filters
46 4
word_stats.py
in src/datatrove/pipeline/stats
56 4
paragraph_stats.py
in src/datatrove/pipeline/stats
51 4
sentence_stats.py
in src/datatrove/pipeline/stats
47 4
base.py
in src/datatrove/pipeline/stats
91 4
huggingface.py
in src/datatrove/pipeline/readers
95 4
ipc.py
in src/datatrove/pipeline/readers
64 4
huggingface.py
in src/datatrove/pipeline/writers
104 4
local.py
in src/datatrove/executor
101 4
check_dataset.py
in src/datatrove/tools
58 4
inspect_data.py
in src/datatrove/tools
129 4
s3_union_find.rs
in src/datatrove/tools/fast_mh3/src
477 4
base.py
in src/datatrove/pipeline/formatters
17 3
fasttext_filter.py
in src/datatrove/pipeline/filters
78 3
merger.py
in src/datatrove/pipeline/stats
51 3
Files With Long Lines (Top 9)

There are 9 files with lines longer than 120 characters. In total, there are 11 long lines.

File | # lines | # units | # long lines
c4_filters.py
in src/datatrove/pipeline/filters
197 8 2
jobs_status.py
in src/datatrove/tools
71 1 2
152 - 1
huggingface.py
in src/datatrove/pipeline/readers
95 4 1
bloom_filter.py
in src/datatrove/pipeline/dedup
143 12 1
dataset.py
in src/datatrove/utils
194 12 1
stats.py
in src/datatrove/utils
297 41 1
local.py
in src/datatrove/executor
101 4 1
check_dataset.py
in src/datatrove/tools
58 4 1
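
A check like the one behind this table can be approximated in a few lines. This is a rough sketch under stated assumptions: the 120-character threshold comes from the report, but how the report's tool measures line length (for example, how it treats tabs or trailing whitespace) is not stated, so counts may differ. The function names and the `*.py` default pattern are hypothetical.

```python
from pathlib import Path

MAX_LINE_LENGTH = 120  # threshold stated in the report


def count_long_lines(text, limit=MAX_LINE_LENGTH):
    """Count the lines in `text` that are strictly longer than `limit` characters."""
    return sum(1 for line in text.splitlines() if len(line) > limit)


def files_with_long_lines(root, pattern="*.py", limit=MAX_LINE_LENGTH):
    """Map each file under `root` matching `pattern` to its number of long lines.

    Files without any long line are left out of the result.
    """
    results = {}
    for path in sorted(Path(root).rglob(pattern)):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        long_lines = count_long_lines(text, limit)
        if long_lines:
            results[str(path)] = long_lines
    return results


if __name__ == "__main__":
    # Scan the current directory; point this at a checkout of the repository instead.
    for name, long_lines in files_with_long_lines(".").items():
        print(f"{name}: {long_lines}")
```

Summing the values of the returned mapping gives the total number of long lines, and its length gives the number of affected files, the two figures quoted at the top of this section.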