huggingface / fineweb-2
File Age & Freshness

File age measurements show the distribution of file age (days since a file's first commit) and file freshness (days since its latest commit).
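
As a minimal sketch of the two definitions above, both measures reduce to date subtraction. The function name `file_age_and_freshness` and the reference date `as_of` are assumptions made for illustration; the report does not state the date on which it was generated.

```python
from datetime import date


def file_age_and_freshness(first_commit: date, last_commit: date, as_of: date) -> tuple:
    """Return (age_days, freshness_days) relative to the reference date as_of."""
    age_days = (as_of - first_commit).days        # days since the first commit
    freshness_days = (as_of - last_commit).days   # days since the latest commit
    return age_days, freshness_days


# Commit dates for wordlist_gen.py as listed in this report.
# as_of is an ASSUMED report date, not a value taken from the report.
age, freshness = file_age_and_freshness(
    first_commit=date(2025, 6, 4),
    last_commit=date(2025, 6, 25),
    as_of=date(2025, 6, 30),
)
print(age, freshness)
```

With the assumed reference date of 2025-06-30, wordlist_gen.py would be 26 days old and last changed 5 days ago, which places it in the 1-30 bucket on both measures.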

Summary
File Change History Overall
File Age Distribution Overall
Days since first update
  • There are 28 files with 1,973 lines of code in total.
    • 0 files that are 366+ days old (0 lines of code)
    • 23 files that are 181-365 days old (1,656 lines of code)
    • 0 files that are 91-180 days old (0 lines of code)
    • 0 files that are 31-90 days old (0 lines of code)
    • 5 files that are 1-30 days old (317 lines of code)
Distribution (366+ | 181-365 | 91-180 | 31-90 | 1-30 days): 0% | 83% | 0% | 0% | 16%
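
The percentages in the distribution above appear consistent with each bucket's share of lines of code, truncated to a whole percent (which would explain why they sum to 99%). The sketch below reproduces them under that assumption. `BUCKETS`, `bucket_label`, and `loc_share` are hypothetical names, and how the report treats a file changed 0 days ago is not stated.

```python
# Age buckets as (label, lowest day, highest day); None means no upper bound.
BUCKETS = [
    ("366+", 366, None),
    ("181-365", 181, 365),
    ("91-180", 91, 180),
    ("31-90", 31, 90),
    ("1-30", 1, 30),
]


def bucket_label(days):
    """Map a day count to its bucket label, or None if it falls outside all buckets."""
    for label, low, high in BUCKETS:
        if days >= low and (high is None or days <= high):
            return label
    return None  # e.g. 0 days; the report does not say how this case is handled


def loc_share(loc_by_bucket):
    """Return each bucket's share of total lines of code as a truncated whole percent."""
    total = sum(loc_by_bucket.values())
    return {label: 100 * loc // total for label, loc in loc_by_bucket.items()}


# Lines of code per bucket, taken from the list above.
loc = {"366+": 0, "181-365": 1656, "91-180": 0, "31-90": 0, "1-30": 317}
print(loc_share(loc))
```

Integer division is used here only because it matches the reported figures; rounding to the nearest percent would give 84% for the 181-365 bucket.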

File Freshness Distribution Overall
Days since last update
  • There are 28 files with 1,973 lines of code in total.
    • 0 files have been last changed 366+ days ago (0 lines of code)
    • 23 files have been last changed 181-365 days ago (1,656 lines of code)
    • 0 files have been last changed 91-180 days ago (0 lines of code)
    • 0 files have been last changed 31-90 days ago (0 lines of code)
    • 5 files have been last changed 1-30 days ago (317 lines of code)
Distribution (366+ | 181-365 | 91-180 | 31-90 | 1-30 days): 0% | 83% | 0% | 0% | 16%

File Change History per File Extension
txt, yaml, json, py, md, sh
File Age Distribution per Extension
Days since first update
Buckets: 366+ | 181-365 | 91-180 | 31-90 | 1-30 days
py: 0% | 83% | 0% | 0% | 16%
File Freshness Distribution per Extension
Days since last update
Buckets: 366+ | 181-365 | 91-180 | 31-90 | 1-30 days
py: 0% | 83% | 0% | 0% | 16%
File Change History per Logical Decomposition
primary
primary (file age distribution)
Days since first update
Buckets: 366+ | 181-365 | 91-180 | 31-90 | 1-30 days
misc: 0% | 75% | 0% | 0% | 25%
ablations: 0% | 100% | 0% | 0% | 0%
ROOT: 0% | 100% | 0% | 0% | 0%
primary (file freshness distribution)
Days since last update
Buckets: 366+ | 181-365 | 91-180 | 31-90 | 1-30 days
misc: 0% | 75% | 0% | 0% | 25%
ablations: 0% | 100% | 0% | 0% | 0%
ROOT: 0% | 100% | 0% | 0% | 0%
Oldest Files (Top 28)
Columns: file, # lines, # units, created, last modified, # changes (days), # contributors, first contributor, latest contributor
download_cc-100.py
in misc/reference_datasets/multilingual
162 1 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_mapcc.py
in misc/reference_datasets/monolingual/zh
140 9 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
part jsons.py
in misc/reference_datasets/multilingual
112 8 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
copy_raw_data.py
in misc/reference_datasets/multilingual
105 4 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_croissant.py
in misc/reference_datasets/monolingual/fr
75 2 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_arabicweb24.py
in misc/reference_datasets/monolingual/ar
74 2 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_hplt.py
in misc/reference_datasets/multilingual
41 1 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_sangraha.py
in misc/reference_datasets/monolingual/te
39 - 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_mc4.py
in misc/reference_datasets/multilingual
29 1 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_culturax.py
in misc/reference_datasets/multilingual
24 1 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_sangraha.py
in misc/reference_datasets/monolingual/hi
21 - 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_hplt_split.py
in misc/reference_datasets/multilingual
20 - 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_mnbvc.py
in misc/reference_datasets/monolingual/zh
19 - 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_tigerbot.py
in misc/reference_datasets/monolingual/zh
19 - 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_vngrs.py
in misc/reference_datasets/monolingual/tr
19 - 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_101b_arabicwords.py
in misc/reference_datasets/monolingual/ar
18 - 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_odaigen.py
in misc/reference_datasets/monolingual/hi
17 - 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_sea_commoncrawl.py
in misc/reference_datasets/monolingual/th
17 - 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
227 1 2024-12-06 2024-12-06 1 1 nostrumg@gmail.com nostrumg@gmail.com
launch_exp.py
in ablations/training
226 2 2024-12-06 2024-12-06 1 2 nostrumg@gmail.com kydlicek.hynek@gmail.com
launch_evals.py
in ablations/evaluation
139 4 2024-12-06 2024-12-06 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
launch_tokenization.py
in ablations/tokenization
63 - 2024-12-06 2024-12-06 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
launch_random_evals.py
in ablations/evaluation
50 1 2024-12-06 2024-12-06 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
slurm_count_word.py
in misc/precision_filtering
71 3 2025-06-04 2025-06-16 2 2 kargaranamir@gmail.com nostrumg@gmail.com
wordlist_gen.py
in misc/precision_filtering
56 4 2025-06-04 2025-06-25 3 3 kargaranamir@gmail.com guilherme@huggingface.co
count_common.py
in misc/precision_filtering
44 2 2025-06-04 2025-06-16 2 2 kargaranamir@gmail.com nostrumg@gmail.com
wordlist_score.py
in misc/precision_filtering
40 2 2025-06-04 2025-06-16 2 2 kargaranamir@gmail.com nostrumg@gmail.com
run_precision_filtering.py
in misc/precision_filtering
106 5 2025-06-18 2025-06-25 3 2 guilherme@huggingface.co guilherme@huggingface.co
Files Not Recently Changed (Top 28)
Columns: file, # lines, # units, created, last modified, # changes (days), # contributors, first contributor, latest contributor
download_sea_commoncrawl.py
in misc/reference_datasets/monolingual/th
17 - 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_odaigen.py
in misc/reference_datasets/monolingual/hi
17 - 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_101b_arabicwords.py
in misc/reference_datasets/monolingual/ar
18 - 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_vngrs.py
in misc/reference_datasets/monolingual/tr
19 - 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_tigerbot.py
in misc/reference_datasets/monolingual/zh
19 - 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_mnbvc.py
in misc/reference_datasets/monolingual/zh
19 - 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_hplt_split.py
in misc/reference_datasets/multilingual
20 - 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_sangraha.py
in misc/reference_datasets/monolingual/hi
21 - 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_culturax.py
in misc/reference_datasets/multilingual
24 1 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_mc4.py
in misc/reference_datasets/multilingual
29 1 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_sangraha.py
in misc/reference_datasets/monolingual/te
39 - 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_hplt.py
in misc/reference_datasets/multilingual
41 1 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_arabicweb24.py
in misc/reference_datasets/monolingual/ar
74 2 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_croissant.py
in misc/reference_datasets/monolingual/fr
75 2 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
copy_raw_data.py
in misc/reference_datasets/multilingual
105 4 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
part jsons.py
in misc/reference_datasets/multilingual
112 8 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_mapcc.py
in misc/reference_datasets/monolingual/zh
140 9 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_cc-100.py
in misc/reference_datasets/multilingual
162 1 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
launch_random_evals.py
in ablations/evaluation
50 1 2024-12-06 2024-12-06 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
launch_tokenization.py
in ablations/tokenization
63 - 2024-12-06 2024-12-06 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
launch_evals.py
in ablations/evaluation
139 4 2024-12-06 2024-12-06 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
launch_exp.py
in ablations/training
226 2 2024-12-06 2024-12-06 1 2 nostrumg@gmail.com kydlicek.hynek@gmail.com
227 1 2024-12-06 2024-12-06 1 1 nostrumg@gmail.com nostrumg@gmail.com
wordlist_score.py
in misc/precision_filtering
40 2 2025-06-04 2025-06-16 2 2 kargaranamir@gmail.com nostrumg@gmail.com
count_common.py
in misc/precision_filtering
44 2 2025-06-04 2025-06-16 2 2 kargaranamir@gmail.com nostrumg@gmail.com
slurm_count_word.py
in misc/precision_filtering
71 3 2025-06-04 2025-06-16 2 2 kargaranamir@gmail.com nostrumg@gmail.com
wordlist_gen.py
in misc/precision_filtering
56 4 2025-06-04 2025-06-25 3 3 kargaranamir@gmail.com guilherme@huggingface.co
run_precision_filtering.py
in misc/precision_filtering
106 5 2025-06-18 2025-06-25 3 2 guilherme@huggingface.co guilherme@huggingface.co
Most Recently Created Files (Top 28)
Columns: file, # lines, # units, created, last modified, # changes (days), # contributors, first contributor, latest contributor
run_precision_filtering.py
in misc/precision_filtering
106 5 2025-06-18 2025-06-25 3 2 guilherme@huggingface.co guilherme@huggingface.co
slurm_count_word.py
in misc/precision_filtering
71 3 2025-06-04 2025-06-16 2 2 kargaranamir@gmail.com nostrumg@gmail.com
wordlist_gen.py
in misc/precision_filtering
56 4 2025-06-04 2025-06-25 3 3 kargaranamir@gmail.com guilherme@huggingface.co
count_common.py
in misc/precision_filtering
44 2 2025-06-04 2025-06-16 2 2 kargaranamir@gmail.com nostrumg@gmail.com
wordlist_score.py
in misc/precision_filtering
40 2 2025-06-04 2025-06-16 2 2 kargaranamir@gmail.com nostrumg@gmail.com
227 1 2024-12-06 2024-12-06 1 1 nostrumg@gmail.com nostrumg@gmail.com
launch_exp.py
in ablations/training
226 2 2024-12-06 2024-12-06 1 2 nostrumg@gmail.com kydlicek.hynek@gmail.com
launch_evals.py
in ablations/evaluation
139 4 2024-12-06 2024-12-06 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
launch_tokenization.py
in ablations/tokenization
63 - 2024-12-06 2024-12-06 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
launch_random_evals.py
in ablations/evaluation
50 1 2024-12-06 2024-12-06 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_cc-100.py
in misc/reference_datasets/multilingual
162 1 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_mapcc.py
in misc/reference_datasets/monolingual/zh
140 9 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
part jsons.py
in misc/reference_datasets/multilingual
112 8 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
copy_raw_data.py
in misc/reference_datasets/multilingual
105 4 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_croissant.py
in misc/reference_datasets/monolingual/fr
75 2 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_arabicweb24.py
in misc/reference_datasets/monolingual/ar
74 2 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_hplt.py
in misc/reference_datasets/multilingual
41 1 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_sangraha.py
in misc/reference_datasets/monolingual/te
39 - 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_mc4.py
in misc/reference_datasets/multilingual
29 1 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_culturax.py
in misc/reference_datasets/multilingual
24 1 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_sangraha.py
in misc/reference_datasets/monolingual/hi
21 - 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_hplt_split.py
in misc/reference_datasets/multilingual
20 - 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_mnbvc.py
in misc/reference_datasets/monolingual/zh
19 - 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_tigerbot.py
in misc/reference_datasets/monolingual/zh
19 - 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_vngrs.py
in misc/reference_datasets/monolingual/tr
19 - 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_101b_arabicwords.py
in misc/reference_datasets/monolingual/ar
18 - 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_odaigen.py
in misc/reference_datasets/monolingual/hi
17 - 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_sea_commoncrawl.py
in misc/reference_datasets/monolingual/th
17 - 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
Most Recently Changed Files (Top 28)
Columns: file, # lines, # units, created, last modified, # changes (days), # contributors, first contributor, latest contributor
run_precision_filtering.py
in misc/precision_filtering
106 5 2025-06-18 2025-06-25 3 2 guilherme@huggingface.co guilherme@huggingface.co
wordlist_gen.py
in misc/precision_filtering
56 4 2025-06-04 2025-06-25 3 3 kargaranamir@gmail.com guilherme@huggingface.co
slurm_count_word.py
in misc/precision_filtering
71 3 2025-06-04 2025-06-16 2 2 kargaranamir@gmail.com nostrumg@gmail.com
count_common.py
in misc/precision_filtering
44 2 2025-06-04 2025-06-16 2 2 kargaranamir@gmail.com nostrumg@gmail.com
wordlist_score.py
in misc/precision_filtering
40 2 2025-06-04 2025-06-16 2 2 kargaranamir@gmail.com nostrumg@gmail.com
227 1 2024-12-06 2024-12-06 1 1 nostrumg@gmail.com nostrumg@gmail.com
launch_exp.py
in ablations/training
226 2 2024-12-06 2024-12-06 1 2 nostrumg@gmail.com kydlicek.hynek@gmail.com
launch_evals.py
in ablations/evaluation
139 4 2024-12-06 2024-12-06 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
launch_tokenization.py
in ablations/tokenization
63 - 2024-12-06 2024-12-06 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
launch_random_evals.py
in ablations/evaluation
50 1 2024-12-06 2024-12-06 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_cc-100.py
in misc/reference_datasets/multilingual
162 1 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_mapcc.py
in misc/reference_datasets/monolingual/zh
140 9 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
part jsons.py
in misc/reference_datasets/multilingual
112 8 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
copy_raw_data.py
in misc/reference_datasets/multilingual
105 4 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_croissant.py
in misc/reference_datasets/monolingual/fr
75 2 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_arabicweb24.py
in misc/reference_datasets/monolingual/ar
74 2 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_hplt.py
in misc/reference_datasets/multilingual
41 1 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_sangraha.py
in misc/reference_datasets/monolingual/te
39 - 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_mc4.py
in misc/reference_datasets/multilingual
29 1 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_culturax.py
in misc/reference_datasets/multilingual
24 1 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_sangraha.py
in misc/reference_datasets/monolingual/hi
21 - 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_hplt_split.py
in misc/reference_datasets/multilingual
20 - 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_mnbvc.py
in misc/reference_datasets/monolingual/zh
19 - 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_tigerbot.py
in misc/reference_datasets/monolingual/zh
19 - 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_vngrs.py
in misc/reference_datasets/monolingual/tr
19 - 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_101b_arabicwords.py
in misc/reference_datasets/monolingual/ar
18 - 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_odaigen.py
in misc/reference_datasets/monolingual/hi
17 - 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
download_sea_commoncrawl.py
in misc/reference_datasets/monolingual/th
17 - 2024-12-05 2024-12-05 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com