huggingface / fineweb-2
File Change Frequency

File change frequency (churn) shows how often files are updated. A file's change count is the number of days on which it received at least one commit.
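The counting rule above can be sketched in Python. This is a minimal illustration, not the report generator's actual code: the function name `change_days`, the `(path, date)` input shape, and the sample commits are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import date


def change_days(commits):
    """Count, per file, the distinct days with at least one commit.

    `commits` is an iterable of (file_path, commit_date) pairs
    (a hypothetical input shape chosen for this sketch).
    """
    days_by_file = defaultdict(set)
    for path, day in commits:
        days_by_file[path].add(day)  # a set keeps each day only once
    return {path: len(days) for path, days in days_by_file.items()}


# Hypothetical sample data: two commits to a.py on the same day count once.
commits = [
    ("a.py", date(2025, 6, 4)),
    ("a.py", date(2025, 6, 4)),
    ("a.py", date(2025, 6, 25)),
    ("b.py", date(2024, 12, 5)),
]
print(change_days(commits))  # {'a.py': 2, 'b.py': 1}
```

Counting days rather than commits means a burst of commits on one day weighs the same as a single commit.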

Overview
File Change Frequency Overall
  • There are 28 files with 1,973 lines of code.
    • 0 files changed more than 100 times (0 lines of code)
    • 0 files changed 51-100 times (0 lines of code)
    • 0 files changed 21-50 times (0 lines of code)
    • 0 files changed 6-20 times (0 lines of code)
    • 28 files changed 1-5 times (1,973 lines of code)
Distribution by number of changes (101+ | 51-100 | 21-50 | 6-20 | 1-5): 0% | 0% | 0% | 0% | 100%

Contributors Count Frequency Overall
  • There are 28 files with 1,973 lines of code.
    • 0 files changed by more than 25 contributors (0 lines of code)
    • 0 files changed by 11-25 contributors (0 lines of code)
    • 0 files changed by 6-10 contributors (0 lines of code)
    • 6 files changed by 2-5 contributors (543 lines of code)
    • 22 files changed by 1 contributor (1,430 lines of code)
Distribution by number of contributors (26+ | 11-25 | 6-10 | 2-5 | 1): 0% | 0% | 0% | 27% | 72%
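The percentages above appear to be shares of lines of code rather than of files: 543 of 1,973 lines is about 27.5%, shown as 27%, which suggests truncation to a whole percent. That reading is inferred from the figures and is not stated by the report. A minimal Python sketch under that assumption follows; `bucket_for` and `loc_shares` are hypothetical helper names, and the 22 single-contributor files are lumped into one 1,430-line entry for brevity.

```python
# Lower bounds of the contributor-count buckets listed in the legend.
BUCKETS = [("26+", 26), ("11-25", 11), ("6-10", 6), ("2-5", 2), ("1", 1)]


def bucket_for(count):
    """Return the legend label for a contributor count."""
    for label, lower in BUCKETS:
        if count >= lower:
            return label
    raise ValueError("count must be at least 1")


def loc_shares(files):
    """Share of lines of code per bucket, truncated to a whole percent.

    `files` is an iterable of (contributor_count, lines_of_code) pairs.
    Truncation (rather than rounding) is an assumption of this sketch.
    """
    lines = {label: 0 for label, _ in BUCKETS}
    for count, loc in files:
        lines[bucket_for(count)] += loc
    total = sum(lines.values())
    return {label: loc * 100 // total for label, loc in lines.items()}


# The six multi-contributor files from the report, plus one lumped entry
# standing in for the 22 single-contributor files (1,430 lines in total).
files = [(3, 56), (2, 106), (2, 40), (2, 44), (2, 71), (2, 226), (1, 1430)]
print(loc_shares(files))
# {'26+': 0, '11-25': 0, '6-10': 0, '2-5': 27, '1': 72}
```

Because both non-zero shares are truncated, the displayed values sum to 99% rather than 100%.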

File Change Frequency per File Extension
txt, yaml, json, py, md, sh
File Change Frequency per Extension
The number of recorded file updates (101+ | 51-100 | 21-50 | 6-20 | 1-5):
py: 0% | 0% | 0% | 0% | 100%
File Change Frequency per Logical Decomposition
primary (file change frequency)
The number of recorded file updates (101+ | 51-100 | 21-50 | 6-20 | 1-5):
misc: 0% | 0% | 0% | 0% | 100%
ablations: 0% | 0% | 0% | 0% | 100%
ROOT: 0% | 0% | 0% | 0% | 100%
Most Frequently Changed Files (Top 28)

File | Folder | # lines | # units | Created | Last modified | # changes (days) | # contributors | First contributor | Latest contributor
wordlist_gen.py | misc/precision_filtering | 56 | 4 | 2025-06-04 | 2025-06-25 | 3 | 3 | kargaranamir@gmail.com | guilherme@huggingface.co
run_precision_filtering.py | misc/precision_filtering | 106 | 5 | 2025-06-18 | 2025-06-25 | 3 | 2 | guilherme@huggingface.co | guilherme@huggingface.co
wordlist_score.py | misc/precision_filtering | 40 | 2 | 2025-06-04 | 2025-06-16 | 2 | 2 | kargaranamir@gmail.com | nostrumg@gmail.com
count_common.py | misc/precision_filtering | 44 | 2 | 2025-06-04 | 2025-06-16 | 2 | 2 | kargaranamir@gmail.com | nostrumg@gmail.com
slurm_count_word.py | misc/precision_filtering | 71 | 3 | 2025-06-04 | 2025-06-16 | 2 | 2 | kargaranamir@gmail.com | nostrumg@gmail.com
download_sea_commoncrawl.py | misc/reference_datasets/monolingual/th | 17 | - | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
download_odaigen.py | misc/reference_datasets/monolingual/hi | 17 | - | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
download_101b_arabicwords.py | misc/reference_datasets/monolingual/ar | 18 | - | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
download_vngrs.py | misc/reference_datasets/monolingual/tr | 19 | - | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
download_tigerbot.py | misc/reference_datasets/monolingual/zh | 19 | - | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
download_mnbvc.py | misc/reference_datasets/monolingual/zh | 19 | - | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
download_hplt_split.py | misc/reference_datasets/multilingual | 20 | - | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
download_sangraha.py | misc/reference_datasets/monolingual/hi | 21 | - | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
download_culturax.py | misc/reference_datasets/multilingual | 24 | 1 | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
download_mc4.py | misc/reference_datasets/multilingual | 29 | 1 | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
download_sangraha.py | misc/reference_datasets/monolingual/te | 39 | - | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
download_hplt.py | misc/reference_datasets/multilingual | 41 | 1 | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
launch_random_evals.py | ablations/evaluation | 50 | 1 | 2024-12-06 | 2024-12-06 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
launch_tokenization.py | ablations/tokenization | 63 | - | 2024-12-06 | 2024-12-06 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
download_arabicweb24.py | misc/reference_datasets/monolingual/ar | 74 | 2 | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
download_croissant.py | misc/reference_datasets/monolingual/fr | 75 | 2 | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
copy_raw_data.py | misc/reference_datasets/multilingual | 105 | 4 | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
part jsons.py | misc/reference_datasets/multilingual | 112 | 8 | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
launch_evals.py | ablations/evaluation | 139 | 4 | 2024-12-06 | 2024-12-06 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
download_mapcc.py | misc/reference_datasets/monolingual/zh | 140 | 9 | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
download_cc-100.py | misc/reference_datasets/multilingual | 162 | 1 | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
launch_exp.py | ablations/training | 226 | 2 | 2024-12-06 | 2024-12-06 | 1 | 2 | nostrumg@gmail.com | kydlicek.hynek@gmail.com
(not shown) | (not shown) | 227 | 1 | 2024-12-06 | 2024-12-06 | 1 | 1 | nostrumg@gmail.com | nostrumg@gmail.com
Files With Most Contributors (Top 28)
Based on the number of unique email addresses found in commits.
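Counting contributors by unique email address can be sketched as follows. This is an illustration under assumptions, not the report's implementation: `contributors_per_file`, the `(path, email)` input shape, and the example.com addresses are all hypothetical.

```python
from collections import defaultdict


def contributors_per_file(commits):
    """Count distinct author email addresses per file.

    `commits` is an iterable of (file_path, author_email) pairs
    (a hypothetical input shape chosen for this sketch).
    """
    emails_by_file = defaultdict(set)
    for path, email in commits:
        emails_by_file[path].add(email)  # addresses are compared as-is
    return {path: len(emails) for path, emails in emails_by_file.items()}


# Hypothetical sample data with placeholder addresses.
commits = [
    ("x.py", "alice@example.com"),
    ("x.py", "bob@example.com"),
    ("x.py", "alice@example.com"),  # repeat author, counted once
    ("y.py", "bob@example.com"),
]
print(contributors_per_file(commits))  # {'x.py': 2, 'y.py': 1}
```

One consequence of keying on email addresses is that a person who commits under two different addresses is counted as two contributors.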

File | Folder | # lines | # units | Created | Last modified | # changes (days) | # contributors | First contributor | Latest contributor
wordlist_gen.py | misc/precision_filtering | 56 | 4 | 2025-06-04 | 2025-06-25 | 3 | 3 | kargaranamir@gmail.com | guilherme@huggingface.co
run_precision_filtering.py | misc/precision_filtering | 106 | 5 | 2025-06-18 | 2025-06-25 | 3 | 2 | guilherme@huggingface.co | guilherme@huggingface.co
slurm_count_word.py | misc/precision_filtering | 71 | 3 | 2025-06-04 | 2025-06-16 | 2 | 2 | kargaranamir@gmail.com | nostrumg@gmail.com
wordlist_score.py | misc/precision_filtering | 40 | 2 | 2025-06-04 | 2025-06-16 | 2 | 2 | kargaranamir@gmail.com | nostrumg@gmail.com
count_common.py | misc/precision_filtering | 44 | 2 | 2025-06-04 | 2025-06-16 | 2 | 2 | kargaranamir@gmail.com | nostrumg@gmail.com
launch_exp.py | ablations/training | 226 | 2 | 2024-12-06 | 2024-12-06 | 1 | 2 | nostrumg@gmail.com | kydlicek.hynek@gmail.com
launch_evals.py | ablations/evaluation | 139 | 4 | 2024-12-06 | 2024-12-06 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
launch_random_evals.py | ablations/evaluation | 50 | 1 | 2024-12-06 | 2024-12-06 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
launch_tokenization.py | ablations/tokenization | 63 | - | 2024-12-06 | 2024-12-06 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
(not shown) | (not shown) | 227 | 1 | 2024-12-06 | 2024-12-06 | 1 | 1 | nostrumg@gmail.com | nostrumg@gmail.com
download_mnbvc.py | misc/reference_datasets/monolingual/zh | 19 | - | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
download_mapcc.py | misc/reference_datasets/monolingual/zh | 140 | 9 | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
download_tigerbot.py | misc/reference_datasets/monolingual/zh | 19 | - | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
download_sangraha.py | misc/reference_datasets/monolingual/hi | 21 | - | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
download_odaigen.py | misc/reference_datasets/monolingual/hi | 17 | - | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
download_101b_arabicwords.py | misc/reference_datasets/monolingual/ar | 18 | - | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
download_arabicweb24.py | misc/reference_datasets/monolingual/ar | 74 | 2 | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
download_croissant.py | misc/reference_datasets/monolingual/fr | 75 | 2 | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
download_sea_commoncrawl.py | misc/reference_datasets/monolingual/th | 17 | - | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
download_vngrs.py | misc/reference_datasets/monolingual/tr | 19 | - | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
download_sangraha.py | misc/reference_datasets/monolingual/te | 39 | - | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
download_culturax.py | misc/reference_datasets/multilingual | 24 | 1 | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
download_mc4.py | misc/reference_datasets/multilingual | 29 | 1 | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
download_hplt_split.py | misc/reference_datasets/multilingual | 20 | - | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
part jsons.py | misc/reference_datasets/multilingual | 112 | 8 | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
download_hplt.py | misc/reference_datasets/multilingual | 41 | 1 | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
download_cc-100.py | misc/reference_datasets/multilingual | 162 | 1 | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
copy_raw_data.py | misc/reference_datasets/multilingual | 105 | 4 | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
Files With Fewest Contributors (Top 28)
Based on the number of unique email addresses found in commits.

File | Folder | # lines | # units | Created | Last modified | # changes (days) | # contributors | First contributor | Latest contributor
(not shown) | (not shown) | 227 | 1 | 2024-12-06 | 2024-12-06 | 1 | 1 | nostrumg@gmail.com | nostrumg@gmail.com
download_cc-100.py | misc/reference_datasets/multilingual | 162 | 1 | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
download_mapcc.py | misc/reference_datasets/monolingual/zh | 140 | 9 | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
launch_evals.py | ablations/evaluation | 139 | 4 | 2024-12-06 | 2024-12-06 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
part jsons.py | misc/reference_datasets/multilingual | 112 | 8 | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
copy_raw_data.py | misc/reference_datasets/multilingual | 105 | 4 | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
download_croissant.py | misc/reference_datasets/monolingual/fr | 75 | 2 | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
download_arabicweb24.py | misc/reference_datasets/monolingual/ar | 74 | 2 | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
launch_tokenization.py | ablations/tokenization | 63 | - | 2024-12-06 | 2024-12-06 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
launch_random_evals.py | ablations/evaluation | 50 | 1 | 2024-12-06 | 2024-12-06 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
download_hplt.py | misc/reference_datasets/multilingual | 41 | 1 | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
download_sangraha.py | misc/reference_datasets/monolingual/te | 39 | - | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
download_mc4.py | misc/reference_datasets/multilingual | 29 | 1 | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
download_culturax.py | misc/reference_datasets/multilingual | 24 | 1 | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
download_sangraha.py | misc/reference_datasets/monolingual/hi | 21 | - | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
download_hplt_split.py | misc/reference_datasets/multilingual | 20 | - | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
download_mnbvc.py | misc/reference_datasets/monolingual/zh | 19 | - | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
download_tigerbot.py | misc/reference_datasets/monolingual/zh | 19 | - | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
download_vngrs.py | misc/reference_datasets/monolingual/tr | 19 | - | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
download_101b_arabicwords.py | misc/reference_datasets/monolingual/ar | 18 | - | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
download_odaigen.py | misc/reference_datasets/monolingual/hi | 17 | - | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
download_sea_commoncrawl.py | misc/reference_datasets/monolingual/th | 17 | - | 2024-12-05 | 2024-12-05 | 1 | 1 | kydlicek.hynek@gmail.com | kydlicek.hynek@gmail.com
launch_exp.py | ablations/training | 226 | 2 | 2024-12-06 | 2024-12-06 | 1 | 2 | nostrumg@gmail.com | kydlicek.hynek@gmail.com
run_precision_filtering.py | misc/precision_filtering | 106 | 5 | 2025-06-18 | 2025-06-25 | 3 | 2 | guilherme@huggingface.co | guilherme@huggingface.co
slurm_count_word.py | misc/precision_filtering | 71 | 3 | 2025-06-04 | 2025-06-16 | 2 | 2 | kargaranamir@gmail.com | nostrumg@gmail.com
count_common.py | misc/precision_filtering | 44 | 2 | 2025-06-04 | 2025-06-16 | 2 | 2 | kargaranamir@gmail.com | nostrumg@gmail.com
wordlist_score.py | misc/precision_filtering | 40 | 2 | 2025-06-04 | 2025-06-16 | 2 | 2 | kargaranamir@gmail.com | nostrumg@gmail.com
wordlist_gen.py | misc/precision_filtering | 56 | 4 | 2025-06-04 | 2025-06-25 | 3 | 3 | kargaranamir@gmail.com | guilherme@huggingface.co