duplicated block id: 1 size: 54 cleaned lines of code in 2 files:
- misc/reference_datasets/monolingual/ar/download_arabicweb24.py (8:67)
- misc/reference_datasets/monolingual/fr/download_croissant.py (7:66)

duplicated block id: 2 size: 34 cleaned lines of code in 2 files:
- misc/reference_datasets/monolingual/zh/download_mapcc.py (46:89)
- misc/reference_datasets/multilingual/part jsons.py (56:98)

duplicated block id: 3 size: 31 cleaned lines of code in 2 files:
- misc/reference_datasets/monolingual/zh/download_mapcc.py (10:44)
- misc/reference_datasets/multilingual/part jsons.py (9:43)

duplicated block id: 4 size: 15 cleaned lines of code in 2 files:
- misc/reference_datasets/multilingual/copy_raw_data.py (96:114)
- misc/reference_datasets/multilingual/part jsons.py (115:132)

duplicated block id: 5 size: 14 cleaned lines of code in 2 files:
- misc/reference_datasets/monolingual/zh/download_mapcc.py (90:103)
- misc/reference_datasets/multilingual/part jsons.py (100:113)

duplicated block id: 6 size: 12 cleaned lines of code in 2 files:
- misc/reference_datasets/monolingual/ar/download_arabicweb24.py (33:44)
- misc/reference_datasets/multilingual/copy_raw_data.py (49:60)

duplicated block id: 7 size: 12 cleaned lines of code in 2 files:
- misc/reference_datasets/monolingual/fr/download_croissant.py (32:43)
- misc/reference_datasets/multilingual/copy_raw_data.py (49:60)

duplicated block id: 8 size: 11 cleaned lines of code in 2 files:
- misc/precision_filtering/count_common.py (7:27)
- misc/precision_filtering/wordlist_gen.py (66:86)

duplicated block id: 9 size: 10 cleaned lines of code in 2 files:
- misc/reference_datasets/monolingual/hi/download_sangraha.py (11:20)
- misc/reference_datasets/monolingual/te/download_sangraha.py (11:20)

duplicated block id: 10 size: 9 cleaned lines of code in 2 files:
- misc/precision_filtering/slurm_count_word.py (37:45)
- misc/precision_filtering/wordlist_score.py (41:49)

duplicated block id: 11 size: 9 cleaned lines of code in 2 files:
- ablations/evaluation/launch_evals.py (49:66)
- ablations/evaluation/launch_random_evals.py (11:28)

duplicated block id: 12 size: 8 cleaned lines of code in 2 files:
- misc/precision_filtering/run_precision_filtering.py (18:25)
- misc/precision_filtering/wordlist_score.py (42:49)

duplicated block id: 13 size: 8 cleaned lines of code in 2 files:
- misc/precision_filtering/run_precision_filtering.py (18:25)
- misc/precision_filtering/slurm_count_word.py (38:45)

duplicated block id: 14 size: 7 cleaned lines of code in 2 files:
- misc/reference_datasets/monolingual/ar/download_101b_arabicwords.py (13:19)
- misc/reference_datasets/monolingual/ar/download_arabicweb24.py (75:81)

duplicated block id: 15 size: 7 cleaned lines of code in 2 files:
- ablations/evaluation/launch_evals.py (189:269)
- ablations/evaluation/launch_random_evals.py (58:106)

duplicated block id: 16 size: 7 cleaned lines of code in 2 files:
- misc/precision_filtering/slurm_count_word.py (4:17)
- misc/precision_filtering/wordlist_gen.py (66:77)

duplicated block id: 17 size: 7 cleaned lines of code in 2 files:
- misc/precision_filtering/count_common.py (7:18)
- misc/precision_filtering/slurm_count_word.py (4:17)

duplicated block id: 18 size: 6 cleaned lines of code in 2 files:
- misc/reference_datasets/monolingual/te/download_sangraha.py (36:41)
- misc/reference_datasets/monolingual/th/download_sea_commoncrawl.py (14:19)