huggingface / datablations
File Size

The distribution of file sizes, measured in lines of code. Each percentage is the share of all lines of code that sits in files of the given size range.

File Size Overall
1001+ | 501-1000 | 201-500 | 101-200 | 1-100
66% | 13% | 4% | 4% | 9%
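The overall percentages can be reproduced from the line counts in the Longest Files table. The following is a minimal sketch, assuming the 22 files listed there are the complete set analyzed and that shares are truncated rather than rounded (which is why they sum to 96%). The names `LINE_COUNTS`, `BUCKETS`, and `bucket_shares` are illustrative and not part of the report tool.

```python
# Line counts copied from the "Longest Files (Top 22)" table in this report.
LINE_COUNTS = [
    2462, 1376, 800, 278, 153, 123, 79, 71, 71, 46, 46,
    43, 36, 35, 26, 23, 22, 22, 15, 12, 10, 9,
]

# (label, inclusive lower bound, inclusive upper bound or None for open-ended)
BUCKETS = [
    ("1001+", 1001, None),
    ("501-1000", 501, 1000),
    ("201-500", 201, 500),
    ("101-200", 101, 200),
    ("1-100", 1, 100),
]


def bucket_shares(line_counts):
    """Return the truncated percentage of total lines that falls in each bucket."""
    total = sum(line_counts)
    shares = {}
    for label, low, high in BUCKETS:
        lines_in_bucket = sum(
            n for n in line_counts if n >= low and (high is None or n <= high)
        )
        # Integer division truncates, which is an assumption about the report.
        shares[label] = lines_in_bucket * 100 // total if total else 0
    return shares


if __name__ == "__main__":
    for label, share in bucket_shares(LINE_COUNTS).items():
        print(f"{label}: {share}%")
```

Under these assumptions the shares are weighted by lines, not by file count: three notebooks account for most of the code even though 16 of the 22 files have 100 lines or fewer.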


File Size per Extension
Extension | 1001+ | 501-1000 | 201-500 | 101-200 | 1-100
ipynb | 82% | 17% | 0% | 0% | 0%
py | 0% | 0% | 24% | 24% | 50%
File Size per Logical Decomposition
primary
Component | 1001+ | 501-1000 | 201-500 | 101-200 | 1-100
filtering_notebooks | 100% | 0% | 0% | 0% | 0%
plotstables | 63% | 36% | 0% | 0% | 0%
training | 0% | 0% | 100% | 0% | 0%
utils | 0% | 0% | 0% | 34% | 65%
filtering | 0% | 0% | 0% | 30% | 69%
Longest Files (Top 22)
File | Folder | # lines | # units
blindspots.ipynb | filtering_notebooks | 2462 | -
repetition.ipynb | plotstables | 1376 | -
filtering.ipynb | plotstables | 800 | -
mup.py | training | 278 | 4
hub_sync.py | utils | 153 | 9
add_dedup_info.py | filtering/deduplication | 123 | 5
- | - | 79 | 3
- | - | 71 | 4
hf_dataset_to_file.py | filtering/deduplication | 71 | 2
- | - | 46 | -
save_roots_sample.py | filtering/deduplication | 46 | 1
save_dataset_sample.py | filtering/deduplication | 43 | 1
- | - | 36 | 4
- | - | 35 | -
dedup_oscar.py | filtering/deduplication | 26 | 2
filter_oscar_jsonl.py | filtering/deduplication | 23 | 1
save_rust_format.py | filtering/deduplication | 22 | 1
suffix_dedup.py | filtering/deduplication | 22 | 2
download_oscar.py | filtering/deduplication | 15 | -
hub_auth.py | utils | 12 | -
cleandirs.py | utils | 10 | -
save_dataset.py | filtering/deduplication | 9 | -
Files With Most Units (Top 13)
File | Folder | # lines | # units
hub_sync.py | utils | 153 | 9
add_dedup_info.py | filtering/deduplication | 123 | 5
mup.py | training | 278 | 4
- | - | 71 | 4
- | - | 36 | 4
- | - | 79 | 3
hf_dataset_to_file.py | filtering/deduplication | 71 | 2
suffix_dedup.py | filtering/deduplication | 22 | 2
dedup_oscar.py | filtering/deduplication | 26 | 2
save_rust_format.py | filtering/deduplication | 22 | 1
save_roots_sample.py | filtering/deduplication | 46 | 1
filter_oscar_jsonl.py | filtering/deduplication | 23 | 1
save_dataset_sample.py | filtering/deduplication | 43 | 1
Files With Long Lines (Top 7)

There are 7 files with lines longer than 120 characters. In total, there are 107 long lines.

File | Folder | # lines | # units | # long lines
repetition.ipynb | plotstables | 1376 | - | 50
filtering.ipynb | plotstables | 800 | - | 48
- | - | 36 | 4 | 4
filter_oscar_jsonl.py | filtering/deduplication | 23 | 1 | 2
hub_sync.py | utils | 153 | 9 | 1
- | - | 71 | 4 | 1
add_dedup_info.py | filtering/deduplication | 123 | 5 | 1
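
A long-line count like the one reported here could be computed along the following lines. This is a minimal sketch, not the report tool's implementation: it assumes "long" means strictly more than 120 characters with the line terminator excluded, and the names `LONG_LINE_LIMIT` and `count_long_lines` are hypothetical.

```python
LONG_LINE_LIMIT = 120  # threshold stated in the report


def count_long_lines(text, limit=LONG_LINE_LIMIT):
    """Count lines whose length (without the line terminator) exceeds `limit`."""
    return sum(1 for line in text.splitlines() if len(line) > limit)


# Per-file long-line counts copied from the table above.
REPORTED_LONG_LINES = [50, 48, 4, 2, 1, 1, 1]

if __name__ == "__main__":
    sample = "short line\n" + "x" * 121 + "\n" + "y" * 120 + "\n"
    print(count_long_lines(sample))      # only the 121-character line counts
    print(sum(REPORTED_LONG_LINES))      # matches the reported total of 107
```

The per-file counts in the table sum to the stated total of 107 long lines, with the two notebooks in plotstables contributing 98 of them.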