huggingface / fineweb-2
File Size

The distribution of file sizes, measured in lines of code.

File Size Overall
Lines per file | 1001+ | 501-1000 | 201-500 | 101-200 | 1-100
Share of code | 0% | 0% | 22% | 38% | 38%
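
The overall shares can be reproduced from the per-file line counts in the "Longest Files" list. The following is a minimal sketch, not the report generator's actual code: the names `bucket_for` and `size_distribution` are hypothetical, and it assumes each bucket's share is its fraction of the total lines of code, truncated to a whole percent. That assumption matches the figures shown in this report.

```python
# Bucket upper bounds in lines of code, smallest bucket first.
BUCKETS = [("1-100", 100), ("101-200", 200), ("201-500", 500), ("501-1000", 1000)]
LARGEST = "1001+"


def bucket_for(lines: int) -> str:
    """Return the size bucket label for a file with the given line count."""
    for label, upper in BUCKETS:
        if lines <= upper:
            return label
    return LARGEST


def size_distribution(line_counts: list[int]) -> dict[str, int]:
    """Share of total lines of code per bucket, truncated to whole percent (assumed)."""
    totals = {label: 0 for label, _ in BUCKETS}
    totals[LARGEST] = 0
    for n in line_counts:
        totals[bucket_for(n)] += n
    grand_total = sum(line_counts)
    if grand_total == 0:
        return totals  # no code at all: every share is 0
    return {label: 100 * count // grand_total for label, count in totals.items()}


# Line counts of the 28 files in the "Longest Files" list of this report.
LINE_COUNTS = [227, 226, 162, 140, 139, 112, 106, 105, 75, 74, 71, 63, 56, 50,
               44, 41, 40, 39, 29, 24, 21, 20, 19, 19, 19, 18, 17, 17]

distribution = size_distribution(LINE_COUNTS)
print(distribution)
# {'1-100': 38, '101-200': 38, '201-500': 22, '501-1000': 0, '1001+': 0}
```

Because the shares are weighted by lines rather than by file count, the two files in the 201-500 bucket account for 22% of the code even though they are only 2 of 28 files.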

File Size per Extension
Extension | 1001+ | 501-1000 | 201-500 | 101-200 | 1-100
py | 0% | 0% | 22% | 38% | 38%
File Size per Logical Decomposition
primary
Component | 1001+ | 501-1000 | 201-500 | 101-200 | 1-100
ROOT | 0% | 0% | 100% | 0% | 0%
ablations | 0% | 0% | 47% | 29% | 23%
misc | 0% | 0% | 0% | 49% | 50%
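
The per-component rows can be derived by grouping files on the first segment of their folder path. The sketch below is illustrative only: `distribution_by_component` and `bucket_index` are hypothetical names, the folder and line-count data come from the named entries in the "Longest Files" list, and shares are assumed to be truncated to whole percent. The one entry listed without a file name or folder is left out, so ROOT does not appear in the result.

```python
from collections import defaultdict

# Lines of code per file, keyed by folder, for the named files in this report.
FILES_BY_FOLDER = {
    "ablations/training": [226],
    "ablations/evaluation": [139, 50],
    "ablations/tokenization": [63],
    "misc/reference_datasets/multilingual": [162, 112, 105, 41, 29, 24, 20],
    "misc/reference_datasets/monolingual/zh": [140, 19, 19],
    "misc/reference_datasets/monolingual/fr": [75],
    "misc/reference_datasets/monolingual/ar": [74, 18],
    "misc/reference_datasets/monolingual/te": [39],
    "misc/reference_datasets/monolingual/hi": [21, 17],
    "misc/reference_datasets/monolingual/tr": [19],
    "misc/reference_datasets/monolingual/th": [17],
    "misc/precision_filtering": [106, 71, 56, 44, 40],
}

# Column order used by the report's tables.
LABELS = ["1001+", "501-1000", "201-500", "101-200", "1-100"]


def bucket_index(lines: int) -> int:
    """Index into LABELS for a file with the given line count."""
    if lines > 1000:
        return 0
    if lines > 500:
        return 1
    if lines > 200:
        return 2
    if lines > 100:
        return 3
    return 4


def distribution_by_component(files_by_folder: dict[str, list[int]]) -> dict[str, list[int]]:
    """Per top-level folder: percent of its lines in each bucket (truncated, assumed)."""
    totals = defaultdict(lambda: [0] * len(LABELS))
    for folder, line_counts in files_by_folder.items():
        component = folder.split("/", 1)[0]  # first path segment, e.g. "misc"
        for n in line_counts:
            totals[component][bucket_index(n)] += n
    return {c: [100 * n // sum(t) for n in t] for c, t in totals.items()}


by_component = distribution_by_component(FILES_BY_FOLDER)
print(by_component)
# {'ablations': [0, 0, 47, 29, 23], 'misc': [0, 0, 0, 49, 50]}
```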
Longest Files (Top 28)
File | # lines | # units
(file name missing) | 227 | 1
launch_exp.py (in ablations/training) | 226 | 2
download_cc-100.py (in misc/reference_datasets/multilingual) | 162 | 1
download_mapcc.py (in misc/reference_datasets/monolingual/zh) | 140 | 9
launch_evals.py (in ablations/evaluation) | 139 | 4
part jsons.py (in misc/reference_datasets/multilingual) | 112 | 8
run_precision_filtering.py (in misc/precision_filtering) | 106 | 5
copy_raw_data.py (in misc/reference_datasets/multilingual) | 105 | 4
download_croissant.py (in misc/reference_datasets/monolingual/fr) | 75 | 2
download_arabicweb24.py (in misc/reference_datasets/monolingual/ar) | 74 | 2
slurm_count_word.py (in misc/precision_filtering) | 71 | 3
launch_tokenization.py (in ablations/tokenization) | 63 | -
wordlist_gen.py (in misc/precision_filtering) | 56 | 4
launch_random_evals.py (in ablations/evaluation) | 50 | 1
count_common.py (in misc/precision_filtering) | 44 | 2
download_hplt.py (in misc/reference_datasets/multilingual) | 41 | 1
wordlist_score.py (in misc/precision_filtering) | 40 | 2
download_sangraha.py (in misc/reference_datasets/monolingual/te) | 39 | -
download_mc4.py (in misc/reference_datasets/multilingual) | 29 | 1
download_culturax.py (in misc/reference_datasets/multilingual) | 24 | 1
download_sangraha.py (in misc/reference_datasets/monolingual/hi) | 21 | -
download_hplt_split.py (in misc/reference_datasets/multilingual) | 20 | -
download_mnbvc.py (in misc/reference_datasets/monolingual/zh) | 19 | -
download_tigerbot.py (in misc/reference_datasets/monolingual/zh) | 19 | -
download_vngrs.py (in misc/reference_datasets/monolingual/tr) | 19 | -
download_101b_arabicwords.py (in misc/reference_datasets/monolingual/ar) | 18 | -
download_odaigen.py (in misc/reference_datasets/monolingual/hi) | 17 | -
download_sea_commoncrawl.py (in misc/reference_datasets/monolingual/th) | 17 | -
Files With Most Units (Top 18)
File | # lines | # units
download_mapcc.py (in misc/reference_datasets/monolingual/zh) | 140 | 9
part jsons.py (in misc/reference_datasets/multilingual) | 112 | 8
run_precision_filtering.py (in misc/precision_filtering) | 106 | 5
launch_evals.py (in ablations/evaluation) | 139 | 4
copy_raw_data.py (in misc/reference_datasets/multilingual) | 105 | 4
wordlist_gen.py (in misc/precision_filtering) | 56 | 4
slurm_count_word.py (in misc/precision_filtering) | 71 | 3
launch_exp.py (in ablations/training) | 226 | 2
download_arabicweb24.py (in misc/reference_datasets/monolingual/ar) | 74 | 2
download_croissant.py (in misc/reference_datasets/monolingual/fr) | 75 | 2
wordlist_score.py (in misc/precision_filtering) | 40 | 2
count_common.py (in misc/precision_filtering) | 44 | 2
launch_random_evals.py (in ablations/evaluation) | 50 | 1
(file name missing) | 227 | 1
download_culturax.py (in misc/reference_datasets/multilingual) | 24 | 1
download_mc4.py (in misc/reference_datasets/multilingual) | 29 | 1
download_hplt.py (in misc/reference_datasets/multilingual) | 41 | 1
download_cc-100.py (in misc/reference_datasets/multilingual) | 162 | 1
Files With Long Lines (Top 17)

There are 17 files with lines longer than 120 characters. In total, there are 30 long lines.

File | # lines | # units | # long lines
download_sangraha.py (in misc/reference_datasets/monolingual/te) | 39 | - | 6
launch_exp.py (in ablations/training) | 226 | 2 | 3
download_sangraha.py (in misc/reference_datasets/monolingual/hi) | 21 | - | 3
download_odaigen.py (in misc/reference_datasets/monolingual/hi) | 17 | - | 2
download_101b_arabicwords.py (in misc/reference_datasets/monolingual/ar) | 18 | - | 2
download_arabicweb24.py (in misc/reference_datasets/monolingual/ar) | 74 | 2 | 2
download_sea_commoncrawl.py (in misc/reference_datasets/monolingual/th) | 17 | - | 2
launch_evals.py (in ablations/evaluation) | 139 | 4 | 1
launch_random_evals.py (in ablations/evaluation) | 50 | 1 | 1
download_mapcc.py (in misc/reference_datasets/monolingual/zh) | 140 | 9 | 1
download_tigerbot.py (in misc/reference_datasets/monolingual/zh) | 19 | - | 1
download_croissant.py (in misc/reference_datasets/monolingual/fr) | 75 | 2 | 1
download_vngrs.py (in misc/reference_datasets/monolingual/tr) | 19 | - | 1
slurm_count_word.py (in misc/precision_filtering) | 71 | 3 | 1
run_precision_filtering.py (in misc/precision_filtering) | 106 | 5 | 1
wordlist_score.py (in misc/precision_filtering) | 40 | 2 | 1
wordlist_gen.py (in misc/precision_filtering) | 56 | 4 | 1
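
The long-line figures above can be approximated locally with a short script. This is a hedged sketch rather than the report's own method: `count_long_lines` and `long_lines_per_file` are hypothetical helpers, and it assumes a line counts as long when it has more than 120 characters excluding the line terminator. How the report treats tabs or trailing whitespace is not stated, so results may differ slightly.

```python
from pathlib import Path

MAX_LINE_LENGTH = 120  # threshold stated in this report


def count_long_lines(text: str, limit: int = MAX_LINE_LENGTH) -> int:
    """Count lines whose length, excluding the line terminator, exceeds the limit."""
    return sum(1 for line in text.splitlines() if len(line) > limit)


def long_lines_per_file(root: str, pattern: str = "*.py") -> dict[str, int]:
    """Map each matching file under root to its long-line count, skipping files with none."""
    result = {}
    for path in sorted(Path(root).rglob(pattern)):
        count = count_long_lines(path.read_text(encoding="utf-8", errors="replace"))
        if count:
            result[str(path)] = count
    return result


# One line of 121 characters is long; a line of exactly 120 characters is not.
sample = "short line\n" + "x" * 121 + "\n" + "y" * 120 + "\n"
print(count_long_lines(sample))  # 1
```

Calling `long_lines_per_file(".")` from a checkout of the repository would list each Python file with at least one long line, which can then be compared against the table.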