duplicated block id: 1 size: 111 cleaned lines of code in 2 files:
- build_obelics/05_filtering_web_docs.py (113:228)
- obelics/callers/filter_web_documents.py (93:208)
duplicated block id: 2 size: 24 cleaned lines of code in 2 files:
- build_obelics/02_bis_extract_html_get_image_urls_new_rules.py (14:41)
- build_obelics/02_extract_html_get_image_urls.py (13:40)
duplicated block id: 3 size: 21 cleaned lines of code in 2 files:
- build_obelics/02_bis_extract_html_get_image_urls_new_rules.py (96:117)
- obelics/callers/extract_web_documents.py (125:146)
duplicated block id: 4 size: 21 cleaned lines of code in 2 files:
- build_obelics/02_bis_extract_html_get_image_urls_new_rules.py (68:91)
- build_obelics/02_extract_html_get_image_urls.py (75:98)
duplicated block id: 5 size: 19 cleaned lines of code in 2 files:
- build_obelics/01_download_warc.py (66:86)
- obelics/callers/download_warc.py (48:68)
duplicated block id: 6 size: 18 cleaned lines of code in 2 files:
- build_obelics/10_final_cleaning.py (61:79)
- build_obelics/13_final_processing.py (190:208)
duplicated block id: 7 size: 14 cleaned lines of code in 2 files:
- obelics/visualization/web_document_and_filtering_visualization.py (684:697)
- obelics/visualization/web_document_visualization.py (35:48)
duplicated block id: 8 size: 13 cleaned lines of code in 2 files:
- build_obelics/02_bis_extract_html_get_image_urls_new_rules.py (125:138)
- build_obelics/02_extract_html_get_image_urls.py (125:138)
duplicated block id: 9 size: 12 cleaned lines of code in 2 files:
- build_obelics/02_bis_extract_html_get_image_urls_new_rules.py (45:58)
- build_obelics/02_extract_html_get_image_urls.py (50:63)
duplicated block id: 10 size: 12 cleaned lines of code in 2 files:
- build_obelics/06_01_create_set_image_urls_in_webdocs.py (23:34)
- build_obelics/06_03_remove_image_duplicates.py (26:37)
duplicated block id: 11 size: 12 cleaned lines of code in 2 files:
- build_obelics/08_02_urldedup.py (11:27)
- build_obelics/11_03_set_img_urls_dedup.py (13:29)
duplicated block id: 12 size: 12 cleaned lines of code in 2 files:
- build_obelics/02_extract_html_get_image_urls.py (50:63)
- obelics/callers/extract_html.py (31:44)
duplicated block id: 13 size: 12 cleaned lines of code in 2 files:
- build_obelics/02_extract_html_get_image_urls.py (101:112)
- obelics/visualization/global_visualization.py (51:62)
duplicated block id: 14 size: 12 cleaned lines of code in 2 files:
- build_obelics/02_bis_extract_html_get_image_urls_new_rules.py (45:58)
- obelics/callers/extract_html.py (31:44)
duplicated block id: 15 size: 11 cleaned lines of code in 2 files:
- build_obelics/07_01_nsfw_image_filtering.py (20:34)
- build_obelics/12_02_remove_opt_out_images.py (12:26)
duplicated block id: 16 size: 11 cleaned lines of code in 2 files:
- build_obelics/02_extract_html_get_image_urls.py (51:63)
- obelics/callers/download_warc.py (32:44)
duplicated block id: 17 size: 11 cleaned lines of code in 2 files:
- build_obelics/06_01_create_set_image_urls_in_webdocs.py (39:51)
- obelics/callers/download_warc.py (32:44)
duplicated block id: 18 size: 11 cleaned lines of code in 2 files:
- build_obelics/07_03_nsfw_image_removal.py (12:26)
- build_obelics/12_02_remove_opt_out_images.py (12:26)
duplicated block id: 19 size: 11 cleaned lines of code in 2 files:
- build_obelics/06_01_create_set_image_urls_in_webdocs.py (39:51)
- obelics/callers/extract_html.py (32:44)
duplicated block id: 20 size: 11 cleaned lines of code in 2 files:
- build_obelics/10_final_cleaning.py (124:135)
- build_obelics/13_final_processing.py (315:326)
duplicated block id: 21 size: 11 cleaned lines of code in 2 files:
- build_obelics/09_07_merge_web_docs_texts_only_and_rest.py (11:26)
- build_obelics/10_final_cleaning.py (14:29)
duplicated block id: 22 size: 11 cleaned lines of code in 2 files:
- build_obelics/02_extract_html_get_image_urls.py (51:63)
- build_obelics/06_01_create_set_image_urls_in_webdocs.py (39:51)
duplicated block id: 23 size: 11 cleaned lines of code in 2 files:
- build_obelics/02_bis_extract_html_get_image_urls_new_rules.py (46:58)
- build_obelics/06_01_create_set_image_urls_in_webdocs.py (39:51)
duplicated block id: 24 size: 11 cleaned lines of code in 2 files:
- obelics/processors/dom_tree_simplificator.py (15:25)
- obelics/visualization/global_visualization.py (51:61)
duplicated block id: 25 size: 11 cleaned lines of code in 2 files:
- build_obelics/02_bis_extract_html_get_image_urls_new_rules.py (46:58)
- obelics/callers/download_warc.py (32:44)
duplicated block id: 26 size: 11 cleaned lines of code in 2 files:
- build_obelics/02_extract_html_get_image_urls.py (101:111)
- obelics/processors/dom_tree_simplificator.py (15:25)
duplicated block id: 27 size: 11 cleaned lines of code in 2 files:
- build_obelics/07_01_nsfw_image_filtering.py (20:34)
- build_obelics/07_03_nsfw_image_removal.py (12:26)
duplicated block id: 28 size: 11 cleaned lines of code in 2 files:
- obelics/callers/download_warc.py (32:44)
- obelics/callers/extract_html.py (32:44)
duplicated block id: 29 size: 10 cleaned lines of code in 2 files:
- build_obelics/02_bis_extract_html_get_image_urls_new_rules.py (13:26)
- obelics/callers/extract_web_documents.py (11:24)
duplicated block id: 30 size: 10 cleaned lines of code in 2 files:
- build_obelics/06_01_create_set_image_urls_in_webdocs.py (59:69)
- build_obelics/06_03_remove_image_duplicates.py (102:112)
duplicated block id: 31 size: 10 cleaned lines of code in 2 files:
- build_obelics/09_01_create_web_docs_texts_only.py (9:21)
- build_obelics/10_final_cleaning.py (19:31)
duplicated block id: 32 size: 10 cleaned lines of code in 2 files:
- build_obelics/05_filtering_web_docs.py (23:36)
- build_obelics/06_03_remove_image_duplicates.py (11:24)
duplicated block id: 33 size: 10 cleaned lines of code in 2 files:
- build_obelics/11_01_create_set_img_urls.py (30:41)
- build_obelics/13_final_processing.py (266:277)
duplicated block id: 34 size: 10 cleaned lines of code in 2 files:
- build_obelics/01_download_warc.py (11:22)
- obelics/callers/download_warc.py (10:21)
duplicated block id: 35 size: 10 cleaned lines of code in 2 files:
- build_obelics/01_download_warc.py (43:55)
- build_obelics/02_extract_html_get_image_urls.py (56:68)
duplicated block id: 36 size: 10 cleaned lines of code in 2 files:
- build_obelics/05_filtering_web_docs.py (54:63)
- obelics/callers/filter_web_documents.py (47:56)
duplicated block id: 37 size: 9 cleaned lines of code in 2 files:
- build_obelics/04_merge_web_docs_with_images.py (59:67)
- obelics/callers/extract_html.py (32:40)
duplicated block id: 38 size: 9 cleaned lines of code in 2 files:
- build_obelics/09_07_merge_web_docs_texts_only_and_rest.py (11:22)
- build_obelics/12_02_remove_opt_out_images.py (12:22)
duplicated block id: 39 size: 9 cleaned lines of code in 2 files:
- build_obelics/06_03_remove_image_duplicates.py (11:21)
- build_obelics/09_07_merge_web_docs_texts_only_and_rest.py (11:22)
duplicated block id: 40 size: 9 cleaned lines of code in 2 files:
- build_obelics/03_dl_images_create_dataset.py (53:61)
- obelics/callers/extract_web_documents.py (66:74)
duplicated block id: 41 size: 9 cleaned lines of code in 2 files:
- build_obelics/09_07_merge_web_docs_texts_only_and_rest.py (11:22)
- build_obelics/11_03_set_img_urls_dedup.py (13:23)
duplicated block id: 42 size: 9 cleaned lines of code in 2 files:
- build_obelics/10_final_cleaning.py (14:25)
- build_obelics/11_03_set_img_urls_dedup.py (13:23)
duplicated block id: 43 size: 9 cleaned lines of code in 2 files:
- build_obelics/04_merge_web_docs_with_images.py (59:67)
- obelics/callers/download_warc.py (32:40)
duplicated block id: 44 size: 9 cleaned lines of code in 2 files:
- build_obelics/12_02_remove_opt_out_images.py (12:22)
- build_obelics/13_final_processing.py (18:29)
duplicated block id: 45 size: 9 cleaned lines of code in 2 files:
- build_obelics/05_filtering_web_docs.py (90:101)
- build_obelics/06_01_create_set_image_urls_in_webdocs.py (45:56)
duplicated block id: 46 size: 9 cleaned lines of code in 2 files:
- build_obelics/07_03_nsfw_image_removal.py (12:22)
- build_obelics/08_02_urldedup.py (11:21)
duplicated block id: 47 size: 9 cleaned lines of code in 2 files:
- build_obelics/06_03_remove_image_duplicates.py (11:21)
- build_obelics/08_02_urldedup.py (11:21)
duplicated block id: 48 size: 9 cleaned lines of code in 2 files:
- build_obelics/02_extract_html_get_image_urls.py (16:26)
- build_obelics/04_merge_web_docs_with_images.py (18:28)
duplicated block id: 49 size: 9 cleaned lines of code in 2 files:
- build_obelics/05_filtering_web_docs.py (23:33)
- build_obelics/13_final_processing.py (18:29)
duplicated block id: 50 size: 9 cleaned lines of code in 2 files:
- build_obelics/06_03_remove_image_duplicates.py (11:21)
- build_obelics/11_03_set_img_urls_dedup.py (13:23)
duplicated block id: 51 size: 9 cleaned lines of code in 2 files:
- build_obelics/08_02_urldedup.py (11:21)
- build_obelics/12_02_remove_opt_out_images.py (12:22)
duplicated block id: 52 size: 9 cleaned lines of code in 2 files:
- build_obelics/07_03_nsfw_image_removal.py (16:26)
- build_obelics/09_06_line_dedup.py (10:20)
duplicated block id: 53 size: 9 cleaned lines of code in 2 files:
- build_obelics/08_02_urldedup.py (11:21)
- build_obelics/13_final_processing.py (18:29)
duplicated block id: 54 size: 9 cleaned lines of code in 2 files:
- build_obelics/09_07_merge_web_docs_texts_only_and_rest.py (11:22)
- build_obelics/13_final_processing.py (18:29)
duplicated block id: 55 size: 9 cleaned lines of code in 2 files:
- build_obelics/05_filtering_web_docs.py (23:33)
- build_obelics/11_03_set_img_urls_dedup.py (13:23)
duplicated block id: 56 size: 9 cleaned lines of code in 2 files:
- build_obelics/05_filtering_web_docs.py (23:33)
- build_obelics/10_final_cleaning.py (14:25)
duplicated block id: 57 size: 9 cleaned lines of code in 2 files:
- obelics/processors/web_document_filtering.py (24:32)
- obelics/processors/web_document_filtering.py (520:528)
duplicated block id: 58 size: 9 cleaned lines of code in 2 files:
- build_obelics/02_bis_extract_html_get_image_urls_new_rules.py (17:27)
- build_obelics/04_merge_web_docs_with_images.py (18:28)
duplicated block id: 59 size: 9 cleaned lines of code in 2 files:
- build_obelics/07_03_nsfw_image_removal.py (12:22)
- build_obelics/10_final_cleaning.py (14:25)
duplicated block id: 60 size: 9 cleaned lines of code in 2 files:
- build_obelics/10_final_cleaning.py (126:135)
- build_obelics/11_03_set_img_urls_dedup.py (102:111)
duplicated block id: 61 size: 9 cleaned lines of code in 2 files:
- build_obelics/02_extract_html_get_image_urls.py (51:59)
- build_obelics/04_merge_web_docs_with_images.py (59:67)
duplicated block id: 62 size: 9 cleaned lines of code in 2 files:
- build_obelics/02_bis_extract_html_get_image_urls_new_rules.py (46:54)
- build_obelics/04_merge_web_docs_with_images.py (59:67)
duplicated block id: 63 size: 9 cleaned lines of code in 2 files:
- build_obelics/09_06_line_dedup.py (52:60)
- obelics/processors/web_document_line_deduplication.py (114:122)
duplicated block id: 64 size: 9 cleaned lines of code in 2 files:
- build_obelics/06_03_remove_image_duplicates.py (11:21)
- build_obelics/13_final_processing.py (18:29)
duplicated block id: 65 size: 9 cleaned lines of code in 2 files:
- build_obelics/07_01_nsfw_image_filtering.py (20:30)
- build_obelics/09_07_merge_web_docs_texts_only_and_rest.py (11:22)
duplicated block id: 66 size: 9 cleaned lines of code in 2 files:
- build_obelics/07_03_nsfw_image_removal.py (12:22)
- build_obelics/09_07_merge_web_docs_texts_only_and_rest.py (11:22)
duplicated block id: 67 size: 9 cleaned lines of code in 2 files:
- build_obelics/02_extract_html_get_image_urls.py (13:25)
- obelics/callers/extract_web_documents.py (12:24)
duplicated block id: 68 size: 9 cleaned lines of code in 2 files:
- build_obelics/10_final_cleaning.py (14:25)
- build_obelics/13_final_processing.py (18:29)
duplicated block id: 69 size: 9 cleaned lines of code in 2 files:
- build_obelics/07_01_nsfw_image_filtering.py (20:30)
- build_obelics/08_02_urldedup.py (11:21)
duplicated block id: 70 size: 9 cleaned lines of code in 2 files:
- obelics/processors/web_document_filtering.py (674:682)
- obelics/visualization/web_document_and_filtering_visualization.py (553:561)
duplicated block id: 71 size: 9 cleaned lines of code in 2 files:
- build_obelics/09_01_create_web_docs_texts_only.py (34:43)
- build_obelics/10_final_cleaning.py (96:105)
duplicated block id: 72 size: 9 cleaned lines of code in 2 files:
- build_obelics/07_01_nsfw_image_filtering.py (20:30)
- build_obelics/11_03_set_img_urls_dedup.py (13:23)
duplicated block id: 73 size: 9 cleaned lines of code in 2 files:
- build_obelics/08_02_urldedup.py (11:21)
- build_obelics/10_final_cleaning.py (14:25)
duplicated block id: 74 size: 9 cleaned lines of code in 2 files:
- build_obelics/06_03_remove_image_duplicates.py (11:21)
- build_obelics/07_03_nsfw_image_removal.py (12:22)
duplicated block id: 75 size: 9 cleaned lines of code in 2 files:
- build_obelics/11_03_set_img_urls_dedup.py (13:23)
- build_obelics/12_02_remove_opt_out_images.py (12:22)
duplicated block id: 76 size: 9 cleaned lines of code in 2 files:
- build_obelics/10_final_cleaning.py (14:25)
- build_obelics/12_02_remove_opt_out_images.py (12:22)
duplicated block id: 77 size: 9 cleaned lines of code in 2 files:
- build_obelics/05_filtering_web_docs.py (23:33)
- build_obelics/07_01_nsfw_image_filtering.py (20:30)
duplicated block id: 78 size: 9 cleaned lines of code in 2 files:
- build_obelics/06_03_remove_image_duplicates.py (11:21)
- build_obelics/07_01_nsfw_image_filtering.py (20:30)
duplicated block id: 79 size: 9 cleaned lines of code in 2 files:
- build_obelics/05_filtering_web_docs.py (12:23)
- obelics/callers/filter_web_documents.py (11:22)
duplicated block id: 80 size: 9 cleaned lines of code in 2 files:
- build_obelics/05_filtering_web_docs.py (23:33)
- build_obelics/07_03_nsfw_image_removal.py (12:22)
duplicated block id: 81 size: 9 cleaned lines of code in 2 files:
- build_obelics/05_filtering_web_docs.py (12:23)
- obelics/visualization/web_document_and_filtering_visualization.py (13:23)
duplicated block id: 82 size: 9 cleaned lines of code in 2 files:
- obelics/utils/simplification_utils.py (69:77)
- obelics/utils/simplification_utils.py (80:88)
duplicated block id: 83 size: 9 cleaned lines of code in 2 files:
- build_obelics/07_03_nsfw_image_removal.py (12:22)
- build_obelics/13_final_processing.py (18:29)
duplicated block id: 84 size: 9 cleaned lines of code in 2 files:
- build_obelics/06_03_remove_image_duplicates.py (11:21)
- build_obelics/10_final_cleaning.py (14:25)
duplicated block id: 85 size: 9 cleaned lines of code in 2 files:
- build_obelics/05_filtering_web_docs.py (23:33)
- build_obelics/12_02_remove_opt_out_images.py (12:22)
duplicated block id: 86 size: 9 cleaned lines of code in 2 files:
- build_obelics/06_03_remove_image_duplicates.py (11:21)
- build_obelics/12_02_remove_opt_out_images.py (12:22)
duplicated block id: 87 size: 9 cleaned lines of code in 2 files:
- build_obelics/07_01_nsfw_image_filtering.py (20:30)
- build_obelics/10_final_cleaning.py (14:25)
duplicated block id: 88 size: 9 cleaned lines of code in 2 files:
- build_obelics/07_01_nsfw_image_filtering.py (24:34)
- build_obelics/09_06_line_dedup.py (10:20)
duplicated block id: 89 size: 9 cleaned lines of code in 2 files:
- build_obelics/05_filtering_web_docs.py (23:33)
- build_obelics/09_07_merge_web_docs_texts_only_and_rest.py (11:22)
duplicated block id: 90 size: 9 cleaned lines of code in 2 files:
- build_obelics/09_01_create_web_docs_texts_only.py (9:19)
- build_obelics/09_07_merge_web_docs_texts_only_and_rest.py (16:26)
duplicated block id: 91 size: 9 cleaned lines of code in 2 files:
- build_obelics/09_06_line_dedup.py (10:20)
- build_obelics/12_02_remove_opt_out_images.py (16:26)
duplicated block id: 92 size: 9 cleaned lines of code in 2 files:
- build_obelics/11_03_set_img_urls_dedup.py (102:111)
- build_obelics/13_final_processing.py (317:326)
duplicated block id: 93 size: 9 cleaned lines of code in 2 files:
- build_obelics/11_03_set_img_urls_dedup.py (13:23)
- build_obelics/13_final_processing.py (18:29)
duplicated block id: 94 size: 9 cleaned lines of code in 2 files:
- build_obelics/07_01_nsfw_image_filtering.py (20:30)
- build_obelics/13_final_processing.py (18:29)
duplicated block id: 95 size: 9 cleaned lines of code in 2 files:
- build_obelics/04_merge_web_docs_with_images.py (59:67)
- build_obelics/06_01_create_set_image_urls_in_webdocs.py (39:47)
duplicated block id: 96 size: 9 cleaned lines of code in 2 files:
- build_obelics/05_filtering_web_docs.py (23:33)
- build_obelics/08_02_urldedup.py (11:21)
duplicated block id: 97 size: 9 cleaned lines of code in 2 files:
- build_obelics/07_03_nsfw_image_removal.py (12:22)
- build_obelics/11_03_set_img_urls_dedup.py (13:23)
duplicated block id: 98 size: 9 cleaned lines of code in 2 files:
- obelics/callers/filter_web_documents.py (11:22)
- obelics/visualization/web_document_and_filtering_visualization.py (13:23)
duplicated block id: 99 size: 9 cleaned lines of code in 2 files:
- build_obelics/08_02_urldedup.py (11:21)
- build_obelics/09_07_merge_web_docs_texts_only_and_rest.py (11:22)
duplicated block id: 100 size: 8 cleaned lines of code in 2 files:
- build_obelics/01_download_warc.py (11:20)
- build_obelics/02_extract_html_get_image_urls.py (16:25)
duplicated block id: 101 size: 8 cleaned lines of code in 2 files:
- build_obelics/04_merge_web_docs_with_images.py (18:27)
- build_obelics/05_filtering_web_docs.py (27:36)
duplicated block id: 102 size: 8 cleaned lines of code in 2 files:
- obelics/callers/extract_web_documents.py (15:24)
- obelics/callers/filter_web_documents.py (25:34)
duplicated block id: 103 size: 8 cleaned lines of code in 2 files:
- obelics/callers/filter_web_documents.py (11:18)
- obelics/visualization/choose_filtering_parameters_web_documents_node_level.py (16:23)
duplicated block id: 104 size: 8 cleaned lines of code in 2 files:
- build_obelics/11_02_get_docs_to_remove_by_set_img_urls_dedup.py (15:24)
- build_obelics/11_03_set_img_urls_dedup.py (17:26)
duplicated block id: 105 size: 8 cleaned lines of code in 2 files:
- build_obelics/02_extract_html_get_image_urls.py (16:25)
- obelics/callers/filter_web_documents.py (25:34)
duplicated block id: 106 size: 8 cleaned lines of code in 2 files:
- build_obelics/03_dl_images_create_dataset.py (9:18)
- build_obelics/06_01_create_set_image_urls_in_webdocs.py (12:21)
duplicated block id: 107 size: 8 cleaned lines of code in 2 files:
- build_obelics/02_bis_extract_html_get_image_urls_new_rules.py (17:26)
- obelics/callers/extract_html.py (10:19)
duplicated block id: 108 size: 8 cleaned lines of code in 2 files:
- obelics/callers/extract_html.py (10:19)
- obelics/callers/extract_web_documents.py (15:24)
duplicated block id: 109 size: 8 cleaned lines of code in 2 files:
- build_obelics/05_filtering_web_docs.py (12:19)
- obelics/visualization/choose_filtering_parameters_web_documents_node_level.py (16:23)
duplicated block id: 110 size: 8 cleaned lines of code in 2 files:
- build_obelics/04_merge_web_docs_with_images.py (18:27)
- obelics/callers/extract_web_documents.py (15:24)
duplicated block id: 111 size: 8 cleaned lines of code in 2 files:
- build_obelics/03_dl_images_create_dataset.py (9:18)
- obelics/callers/download_warc.py (10:19)
duplicated block id: 112 size: 8 cleaned lines of code in 2 files:
- build_obelics/01_download_warc.py (11:20)
- build_obelics/03_dl_images_create_dataset.py (9:18)
duplicated block id: 113 size: 8 cleaned lines of code in 2 files:
- build_obelics/01_download_warc.py (11:20)
- obelics/callers/extract_html.py (10:19)
duplicated block id: 114 size: 8 cleaned lines of code in 2 files:
- build_obelics/04_merge_web_docs_with_images.py (18:27)
- build_obelics/06_03_remove_image_duplicates.py (15:24)
duplicated block id: 115 size: 8 cleaned lines of code in 2 files:
- build_obelics/05_filtering_web_docs.py (27:36)
- build_obelics/06_01_create_set_image_urls_in_webdocs.py (12:21)
duplicated block id: 116 size: 8 cleaned lines of code in 2 files:
- build_obelics/04_merge_web_docs_with_images.py (18:27)
- obelics/callers/download_warc.py (10:19)
duplicated block id: 117 size: 8 cleaned lines of code in 2 files:
- build_obelics/03_dl_images_create_dataset.py (9:18)
- obelics/callers/extract_html.py (10:19)
duplicated block id: 118 size: 8 cleaned lines of code in 2 files:
- build_obelics/02_extract_html_get_image_urls.py (16:25)
- build_obelics/05_filtering_web_docs.py (27:36)
duplicated block id: 119 size: 8 cleaned lines of code in 2 files:
- build_obelics/02_bis_extract_html_get_image_urls_new_rules.py (17:26)
- obelics/callers/download_warc.py (10:19)
duplicated block id: 120 size: 8 cleaned lines of code in 2 files:
- build_obelics/06_01_create_set_image_urls_in_webdocs.py (12:21)
- obelics/callers/extract_web_documents.py (15:24)
duplicated block id: 121 size: 8 cleaned lines of code in 2 files:
- build_obelics/08_02_urldedup.py (15:24)
- build_obelics/11_01_create_set_img_urls.py (9:18)
duplicated block id: 122 size: 8 cleaned lines of code in 2 files:
- build_obelics/01_download_warc.py (11:20)
- build_obelics/05_filtering_web_docs.py (27:36)
duplicated block id: 123 size: 8 cleaned lines of code in 2 files:
- build_obelics/02_bis_extract_html_get_image_urls_new_rules.py (17:26)
- build_obelics/05_filtering_web_docs.py (27:36)
duplicated block id: 124 size: 8 cleaned lines of code in 2 files:
- build_obelics/05_filtering_web_docs.py (12:19)
- obelics/utils/__init__.py (2:9)
duplicated block id: 125 size: 8 cleaned lines of code in 2 files:
- build_obelics/02_bis_extract_html_get_image_urls_new_rules.py (17:26)
- build_obelics/03_dl_images_create_dataset.py (9:18)
duplicated block id: 126 size: 8 cleaned lines of code in 2 files:
- build_obelics/05_filtering_web_docs.py (27:36)
- obelics/callers/download_warc.py (10:19)
duplicated block id: 127 size: 8 cleaned lines of code in 2 files:
- build_obelics/06_03_remove_image_duplicates.py (15:24)
- obelics/callers/download_warc.py (10:19)
duplicated block id: 128 size: 8 cleaned lines of code in 2 files:
- obelics/utils/__init__.py (2:9)
- obelics/visualization/choose_filtering_parameters_web_documents_node_level.py (16:23)
duplicated block id: 129 size: 8 cleaned lines of code in 2 files:
- build_obelics/02_bis_extract_html_get_image_urls_new_rules.py (17:26)
- obelics/callers/filter_web_documents.py (25:34)
duplicated block id: 130 size: 8 cleaned lines of code in 2 files:
- build_obelics/02_bis_extract_html_get_image_urls_new_rules.py (17:26)
- build_obelics/06_01_create_set_image_urls_in_webdocs.py (12:21)
duplicated block id: 131 size: 8 cleaned lines of code in 2 files:
- build_obelics/01_download_warc.py (11:20)
- build_obelics/02_bis_extract_html_get_image_urls_new_rules.py (17:26)
duplicated block id: 132 size: 8 cleaned lines of code in 2 files:
- build_obelics/09_02_get_domain_to_positions.py (10:19)
- build_obelics/09_04_get_domain_to_duplicated_texts.py (10:19)
duplicated block id: 133 size: 8 cleaned lines of code in 2 files:
- build_obelics/03_dl_images_create_dataset.py (83:90)
- obelics/callers/extract_web_documents.py (78:85)
duplicated block id: 134 size: 8 cleaned lines of code in 2 files:
- build_obelics/03_dl_images_create_dataset.py (9:18)
- build_obelics/05_filtering_web_docs.py (27:36)
duplicated block id: 135 size: 8 cleaned lines of code in 2 files:
- build_obelics/06_01_create_set_image_urls_in_webdocs.py (12:21)
- obelics/callers/extract_html.py (10:19)
duplicated block id: 136 size: 8 cleaned lines of code in 2 files:
- build_obelics/06_01_create_set_image_urls_in_webdocs.py (12:21)
- obelics/callers/filter_web_documents.py (25:34)
duplicated block id: 137 size: 8 cleaned lines of code in 2 files:
- build_obelics/02_extract_html_get_image_urls.py (16:25)
- obelics/callers/download_warc.py (10:19)
duplicated block id: 138 size: 8 cleaned lines of code in 2 files:
- build_obelics/03_dl_images_create_dataset.py (72:79)
- obelics/callers/extract_web_documents.py (43:50)
duplicated block id: 139 size: 8 cleaned lines of code in 2 files:
- build_obelics/01_download_warc.py (11:20)
- build_obelics/06_03_remove_image_duplicates.py (15:24)
duplicated block id: 140 size: 8 cleaned lines of code in 2 files:
- build_obelics/04_merge_web_docs_with_images.py (18:27)
- build_obelics/06_01_create_set_image_urls_in_webdocs.py (12:21)
duplicated block id: 141 size: 8 cleaned lines of code in 2 files:
- build_obelics/02_extract_html_get_image_urls.py (16:25)
- build_obelics/06_01_create_set_image_urls_in_webdocs.py (12:21)
duplicated block id: 142 size: 8 cleaned lines of code in 2 files:
- obelics/utils/__init__.py (2:9)
- obelics/visualization/web_document_and_filtering_visualization.py (13:20)
duplicated block id: 143 size: 8 cleaned lines of code in 2 files:
- obelics/visualization/choose_filtering_parameters_web_documents_node_level.py (16:23)
- obelics/visualization/web_document_and_filtering_visualization.py (13:20)
duplicated block id: 144 size: 8 cleaned lines of code in 2 files:
- build_obelics/02_bis_extract_html_get_image_urls_new_rules.py (17:26)
- build_obelics/06_03_remove_image_duplicates.py (15:24)
duplicated block id: 145 size: 8 cleaned lines of code in 2 files:
- build_obelics/02_bis_extract_html_get_image_urls_new_rules.py (75:83)
- obelics/callers/extract_html.py (48:56)
duplicated block id: 146 size: 8 cleaned lines of code in 2 files:
- obelics/callers/filter_web_documents.py (11:18)
- obelics/utils/__init__.py (2:9)
duplicated block id: 147 size: 8 cleaned lines of code in 2 files:
- build_obelics/06_03_remove_image_duplicates.py (15:24)
- obelics/callers/extract_html.py (10:19)
duplicated block id: 148 size: 8 cleaned lines of code in 2 files:
- build_obelics/11_01_create_set_img_urls.py (9:18)
- build_obelics/11_02_get_docs_to_remove_by_set_img_urls_dedup.py (15:24)
duplicated block id: 149 size: 8 cleaned lines of code in 2 files:
- build_obelics/09_04_get_domain_to_duplicated_texts.py (10:19)
- build_obelics/09_05_merge_domain_to_duplicated_texts_sharded.py (14:23)
duplicated block id: 150 size: 8 cleaned lines of code in 2 files:
- build_obelics/04_merge_web_docs_with_images.py (18:27)
- obelics/callers/filter_web_documents.py (25:34)
duplicated block id: 151 size: 8 cleaned lines of code in 2 files:
- build_obelics/03_dl_images_create_dataset.py (9:18)
- obelics/callers/filter_web_documents.py (25:34)
duplicated block id: 152 size: 8 cleaned lines of code in 2 files:
- build_obelics/02_extract_html_get_image_urls.py (16:25)
- obelics/callers/extract_html.py (10:19)
duplicated block id: 153 size: 8 cleaned lines of code in 2 files:
- build_obelics/11_01_create_set_img_urls.py (9:18)
- build_obelics/11_03_set_img_urls_dedup.py (17:26)
duplicated block id: 154 size: 8 cleaned lines of code in 2 files:
- obelics/callers/extract_html.py (10:19)
- obelics/callers/filter_web_documents.py (25:34)
duplicated block id: 155 size: 8 cleaned lines of code in 2 files:
- build_obelics/05_filtering_web_docs.py (27:36)
- obelics/callers/extract_html.py (10:19)
duplicated block id: 156 size: 8 cleaned lines of code in 2 files:
- build_obelics/08_02_urldedup.py (15:24)
- build_obelics/11_02_get_docs_to_remove_by_set_img_urls_dedup.py (15:24)
duplicated block id: 157 size: 8 cleaned lines of code in 2 files:
- build_obelics/03_dl_images_create_dataset.py (9:18)
- build_obelics/06_03_remove_image_duplicates.py (15:24)
duplicated block id: 158 size: 8 cleaned lines of code in 2 files:
- build_obelics/06_01_create_set_image_urls_in_webdocs.py (12:21)
- build_obelics/06_03_remove_image_duplicates.py (15:24)
duplicated block id: 159 size: 8 cleaned lines of code in 2 files:
- build_obelics/02_extract_html_get_image_urls.py (16:25)
- build_obelics/06_03_remove_image_duplicates.py (15:24)
duplicated block id: 160 size: 8 cleaned lines of code in 2 files:
- build_obelics/05_filtering_web_docs.py (54:61)
- obelics/callers/extract_web_documents.py (43:50)
duplicated block id: 161 size: 8 cleaned lines of code in 2 files:
- build_obelics/01_download_warc.py (11:20)
- build_obelics/06_01_create_set_image_urls_in_webdocs.py (12:21)
duplicated block id: 162 size: 8 cleaned lines of code in 2 files:
- obelics/callers/download_warc.py (10:19)
- obelics/callers/filter_web_documents.py (25:34)
duplicated block id: 163 size: 8 cleaned lines of code in 2 files:
- build_obelics/01_download_warc.py (11:20)
- obelics/callers/filter_web_documents.py (25:34)
duplicated block id: 164 size: 8 cleaned lines of code in 2 files:
- build_obelics/03_dl_images_create_dataset.py (72:79)
- obelics/callers/filter_web_documents.py (47:54)
duplicated block id: 165 size: 8 cleaned lines of code in 2 files:
- build_obelics/04_merge_web_docs_with_images.py (18:27)
- obelics/callers/extract_html.py (10:19)
duplicated block id: 166 size: 8 cleaned lines of code in 2 files:
- build_obelics/02_extract_html_get_image_urls.py (16:25)
- build_obelics/03_dl_images_create_dataset.py (9:18)
duplicated block id: 167 size: 8 cleaned lines of code in 2 files:
- obelics/callers/download_warc.py (10:19)
- obelics/callers/extract_html.py (10:19)
duplicated block id: 168 size: 8 cleaned lines of code in 2 files:
- build_obelics/09_02_get_domain_to_positions.py (10:19)
- build_obelics/09_05_merge_domain_to_duplicated_texts_sharded.py (14:23)
duplicated block id: 169 size: 8 cleaned lines of code in 2 files:
- build_obelics/01_download_warc.py (11:20)
- build_obelics/04_merge_web_docs_with_images.py (18:27)
duplicated block id: 170 size: 8 cleaned lines of code in 2 files:
- build_obelics/03_dl_images_create_dataset.py (9:18)
- obelics/callers/extract_web_documents.py (15:24)
duplicated block id: 171 size: 8 cleaned lines of code in 2 files:
- build_obelics/03_dl_images_create_dataset.py (9:18)
- build_obelics/04_merge_web_docs_with_images.py (18:27)
duplicated block id: 172 size: 8 cleaned lines of code in 2 files:
- obelics/callers/extract_web_documents.py (43:50)
- obelics/callers/filter_web_documents.py (47:54)
duplicated block id: 173 size: 8 cleaned lines of code in 2 files:
- build_obelics/05_filtering_web_docs.py (27:36)
- obelics/callers/extract_web_documents.py (15:24)
duplicated block id: 174 size: 8 cleaned lines of code in 2 files:
- build_obelics/06_03_remove_image_duplicates.py (15:24)
- obelics/callers/filter_web_documents.py (25:34)
duplicated block id: 175 size: 8 cleaned lines of code in 2 files:
- build_obelics/06_03_remove_image_duplicates.py (15:24)
- obelics/callers/extract_web_documents.py (15:24)
duplicated block id: 176 size: 8 cleaned lines of code in 2 files:
- build_obelics/06_01_create_set_image_urls_in_webdocs.py (12:21)
- obelics/callers/download_warc.py (10:19)
duplicated block id: 177 size: 8 cleaned lines of code in 2 files:
- build_obelics/05_filtering_web_docs.py (27:36)
- obelics/callers/filter_web_documents.py (25:34)
duplicated block id: 178 size: 8 cleaned lines of code in 2 files:
- obelics/callers/download_warc.py (10:19)
- obelics/callers/extract_web_documents.py (15:24)
duplicated block id: 179 size: 8 cleaned lines of code in 2 files:
- build_obelics/02_extract_html_get_image_urls.py (82:90)
- obelics/callers/extract_html.py (48:56)
duplicated block id: 180 size: 8 cleaned lines of code in 2 files:
- build_obelics/03_dl_images_create_dataset.py (72:79)
- build_obelics/05_filtering_web_docs.py (54:61)
duplicated block id: 181 size: 8 cleaned lines of code in 2 files:
- build_obelics/01_download_warc.py (11:20)
- obelics/callers/extract_web_documents.py (15:24)
duplicated block id: 182 size: 7 cleaned lines of code in 2 files:
- build_obelics/02_extract_html_get_image_urls.py (16:22)
- obelics/processors/web_document_extractor.py (15:21)
duplicated block id: 183 size: 7 cleaned lines of code in 2 files:
- build_obelics/09_05_merge_domain_to_duplicated_texts_sharded.py (14:20)
- build_obelics/09_06_line_dedup.py (10:16)
duplicated block id: 184 size: 7 cleaned lines of code in 2 files:
- build_obelics/09_05_merge_domain_to_duplicated_texts_sharded.py (14:20)
- build_obelics/11_02_get_docs_to_remove_by_set_img_urls_dedup.py (15:21)
duplicated block id: 185 size: 7 cleaned lines of code in 2 files:
- build_obelics/09_06_line_dedup.py (10:16)
- build_obelics/11_01_create_set_img_urls.py (9:15)
duplicated block id: 186 size: 7 cleaned lines of code in 2 files:
- build_obelics/08_02_urldedup.py (15:21)
- obelics/callers/download_warc.py (10:16)
duplicated block id: 187 size: 7 cleaned lines of code in 2 files:
- build_obelics/06_01_create_set_image_urls_in_webdocs.py (12:18)
- build_obelics/09_02_get_domain_to_positions.py (10:16)
duplicated block id: 188 size: 7 cleaned lines of code in 2 files:
- build_obelics/04_merge_web_docs_with_images.py (18:24)
- build_obelics/10_final_cleaning.py (19:25)
duplicated block id: 189 size: 7 cleaned lines of code in 2 files:
- build_obelics/03_dl_images_create_dataset.py (9:15)
- build_obelics/09_02_get_domain_to_positions.py (10:16)
duplicated block id: 190 size: 7 cleaned lines of code in 2 files:
- build_obelics/04_merge_web_docs_with_images.py (18:24)
- build_obelics/11_02_get_docs_to_remove_by_set_img_urls_dedup.py (15:21)
duplicated block id: 191 size: 7 cleaned lines of code in 2 files:
- build_obelics/03_dl_images_create_dataset.py (9:15)
- build_obelics/09_04_get_domain_to_duplicated_texts.py (10:16)
duplicated block id: 192 size: 7 cleaned lines of code in 2 files:
- build_obelics/11_02_get_docs_to_remove_by_set_img_urls_dedup.py (15:21)
- obelics/processors/web_document_extractor.py (15:21)
duplicated block id: 193 size: 7 cleaned lines of code in 2 files:
- build_obelics/09_04_get_domain_to_duplicated_texts.py (10:16)
- obelics/callers/filter_web_documents.py (25:31)
duplicated block id: 194 size: 7 cleaned lines of code in 2 files:
- build_obelics/03_dl_images_create_dataset.py (72:78)
- build_obelics/06_01_create_set_image_urls_in_webdocs.py (39:45)
duplicated block id: 195 size: 7 cleaned lines of code in 2 files:
- build_obelics/06_01_create_set_image_urls_in_webdocs.py (12:18)
- build_obelics/10_final_cleaning.py (19:25)
duplicated block id: 196 size: 7 cleaned lines of code in 2 files:
- build_obelics/12_02_remove_opt_out_images.py (16:22)
- obelics/processors/web_document_line_deduplication.py (11:17)
duplicated block id: 197 size: 7 cleaned lines of code in 2 files:
- build_obelics/09_02_get_domain_to_positions.py (10:16)
- build_obelics/11_03_set_img_urls_dedup.py (17:23)
duplicated block id: 198 size: 7 cleaned lines of code in 2 files:
- build_obelics/04_merge_web_docs_with_images.py (18:24)
- build_obelics/13_final_processing.py (23:29)
duplicated block id: 199 size: 7 cleaned lines of code in 2 files:
- build_obelics/09_04_get_domain_to_duplicated_texts.py (10:16)
- build_obelics/12_02_remove_opt_out_images.py (16:22)
duplicated block id: 200 size: 7 cleaned lines of code in 2 files:
- build_obelics/01_download_warc.py (11:17)
- build_obelics/09_07_merge_web_docs_texts_only_and_rest.py (16:22)
duplicated block id: 201 size: 7 cleaned lines of code in 2 files:
- build_obelics/08_02_urldedup.py (15:21)
- build_obelics/09_04_get_domain_to_duplicated_texts.py (10:16)
duplicated block id: 202 size: 7 cleaned lines of code in 2 files:
- build_obelics/09_06_line_dedup.py (10:16)
- build_obelics/13_final_processing.py (23:29)
duplicated block id: 203 size: 7 cleaned lines of code in 2 files:
- build_obelics/11_01_create_set_img_urls.py (9:15)
- obelics/callers/download_warc.py (10:16)
duplicated block id: 204 size: 7 cleaned lines of code in 2 files:
- build_obelics/07_01_nsfw_image_filtering.py (24:30)
- obelics/callers/download_warc.py (10:16)
duplicated block id: 205 size: 7 cleaned lines of code in 2 files:
- build_obelics/02_extract_html_get_image_urls.py (16:22)
- build_obelics/07_03_nsfw_image_removal.py (16:22)
duplicated block id: 206 size: 7 cleaned lines of code in 2 files:
- obelics/processors/web_document_extractor.py (15:21)
- obelics/processors/web_document_line_deduplication.py (11:17)
duplicated block id: 207 size: 7 cleaned lines of code in 2 files:
- build_obelics/09_02_get_domain_to_positions.py (10:16)
- build_obelics/11_01_create_set_img_urls.py (9:15)
duplicated block id: 208 size: 7 cleaned lines of code in 2 files:
- build_obelics/08_02_urldedup.py (15:21)
- build_obelics/09_01_create_web_docs_texts_only.py (9:15)
duplicated block id: 209 size: 7 cleaned lines of code in 2 files:
- build_obelics/09_06_line_dedup.py (10:16)
- build_obelics/10_final_cleaning.py (19:25)
duplicated block id: 210 size: 7 cleaned lines of code in 2 files:
- build_obelics/07_01_nsfw_image_filtering.py (24:30)
- obelics/callers/extract_html.py (10:16)
duplicated block id: 211 size: 7 cleaned lines of code in 2 files:
- build_obelics/09_06_line_dedup.py (10:16)
- obelics/callers/extract_web_documents.py (15:21)
duplicated block id: 212 size: 7 cleaned lines of code in 2 files:
- build_obelics/01_download_warc.py (11:17)
- build_obelics/11_01_create_set_img_urls.py (9:15)
duplicated block id: 213 size: 7 cleaned lines of code in 2 files:
- build_obelics/07_01_nsfw_image_filtering.py (24:30)
- obelics/callers/filter_web_documents.py (25:31)
duplicated block id: 214 size: 7 cleaned lines of code in 2 files:
- build_obelics/02_extract_html_get_image_urls.py (16:22)
- build_obelics/09_02_get_domain_to_positions.py (10:16)
duplicated block id: 215 size: 7 cleaned lines of code in 2 files:
- build_obelics/04_merge_web_docs_with_images.py (59:65)
- obelics/callers/filter_web_documents.py (47:53)
duplicated block id: 216 size: 7 cleaned lines of code in 2 files: -
build_obelics/09_04_get_domain_to_duplicated_texts.py (10:16) - obelics/callers/download_warc.py (10:16) duplicated block id: 217 size: 7 cleaned lines of code in 2 files: - build_obelics/02_bis_extract_html_get_image_urls_new_rules.py (17:23) - build_obelics/11_01_create_set_img_urls.py (9:15) duplicated block id: 218 size: 7 cleaned lines of code in 2 files: - build_obelics/05_filtering_web_docs.py (54:60) - obelics/callers/extract_html.py (32:38) duplicated block id: 219 size: 7 cleaned lines of code in 2 files: - build_obelics/02_bis_extract_html_get_image_urls_new_rules.py (17:23) - obelics/processors/web_document_line_deduplication.py (11:17) duplicated block id: 220 size: 7 cleaned lines of code in 2 files: - build_obelics/01_download_warc.py (11:17) - build_obelics/09_05_merge_domain_to_duplicated_texts_sharded.py (14:20) duplicated block id: 221 size: 7 cleaned lines of code in 2 files: - build_obelics/05_filtering_web_docs.py (27:33) - build_obelics/09_05_merge_domain_to_duplicated_texts_sharded.py (14:20) duplicated block id: 222 size: 7 cleaned lines of code in 2 files: - build_obelics/11_03_set_img_urls_dedup.py (17:23) - obelics/processors/web_document_extractor.py (15:21) duplicated block id: 223 size: 7 cleaned lines of code in 2 files: - build_obelics/10_final_cleaning.py (19:25) - obelics/processors/web_document_extractor.py (15:21) duplicated block id: 224 size: 7 cleaned lines of code in 2 files: - build_obelics/09_01_create_web_docs_texts_only.py (9:15) - build_obelics/09_06_line_dedup.py (10:16) duplicated block id: 225 size: 7 cleaned lines of code in 2 files: - build_obelics/05_filtering_web_docs.py (27:33) - build_obelics/11_01_create_set_img_urls.py (9:15) duplicated block id: 226 size: 7 cleaned lines of code in 2 files: - build_obelics/10_final_cleaning.py (19:25) - obelics/processors/web_document_line_deduplication.py (11:17) duplicated block id: 227 size: 7 cleaned lines of code in 2 files: - 
build_obelics/09_05_merge_domain_to_duplicated_texts_sharded.py (14:20) - obelics/callers/filter_web_documents.py (25:31) duplicated block id: 228 size: 7 cleaned lines of code in 2 files: - build_obelics/02_extract_html_get_image_urls.py (51:57) - build_obelics/03_dl_images_create_dataset.py (72:78) duplicated block id: 229 size: 7 cleaned lines of code in 2 files: - build_obelics/02_bis_extract_html_get_image_urls_new_rules.py (46:52) - build_obelics/05_filtering_web_docs.py (54:60) duplicated block id: 230 size: 7 cleaned lines of code in 2 files: - build_obelics/11_02_get_docs_to_remove_by_set_img_urls_dedup.py (15:21) - obelics/callers/download_warc.py (10:16) duplicated block id: 231 size: 7 cleaned lines of code in 2 files: - build_obelics/04_merge_web_docs_with_images.py (59:65) - build_obelics/05_filtering_web_docs.py (54:60) duplicated block id: 232 size: 7 cleaned lines of code in 2 files: - build_obelics/11_03_set_img_urls_dedup.py (17:23) - obelics/callers/filter_web_documents.py (25:31) duplicated block id: 233 size: 7 cleaned lines of code in 2 files: - build_obelics/09_07_merge_web_docs_texts_only_and_rest.py (16:22) - obelics/callers/download_warc.py (10:16) duplicated block id: 234 size: 7 cleaned lines of code in 2 files: - build_obelics/03_dl_images_create_dataset.py (9:15) - build_obelics/12_02_remove_opt_out_images.py (16:22) duplicated block id: 235 size: 7 cleaned lines of code in 2 files: - build_obelics/01_download_warc.py (11:17) - build_obelics/12_02_remove_opt_out_images.py (16:22) duplicated block id: 236 size: 7 cleaned lines of code in 2 files: - build_obelics/13_final_processing.py (23:29) - obelics/callers/filter_web_documents.py (25:31) duplicated block id: 237 size: 7 cleaned lines of code in 2 files: - build_obelics/05_filtering_web_docs.py (27:33) - obelics/processors/web_document_line_deduplication.py (11:17) duplicated block id: 238 size: 7 cleaned lines of code in 2 files: - 
build_obelics/06_01_create_set_image_urls_in_webdocs.py (12:18) - build_obelics/08_02_urldedup.py (15:21) duplicated block id: 239 size: 7 cleaned lines of code in 2 files: - build_obelics/12_02_remove_opt_out_images.py (16:22) - obelics/callers/download_warc.py (10:16) duplicated block id: 240 size: 7 cleaned lines of code in 2 files: - build_obelics/01_download_warc.py (11:17) - obelics/processors/web_document_line_deduplication.py (11:17) duplicated block id: 241 size: 7 cleaned lines of code in 2 files: - build_obelics/02_extract_html_get_image_urls.py (16:22) - build_obelics/09_04_get_domain_to_duplicated_texts.py (10:16) duplicated block id: 242 size: 7 cleaned lines of code in 2 files: - build_obelics/07_03_nsfw_image_removal.py (16:22) - build_obelics/09_02_get_domain_to_positions.py (10:16) duplicated block id: 243 size: 7 cleaned lines of code in 2 files: - build_obelics/11_03_set_img_urls_dedup.py (17:23) - obelics/callers/extract_web_documents.py (15:21) duplicated block id: 244 size: 7 cleaned lines of code in 2 files: - build_obelics/08_02_urldedup.py (15:21) - build_obelics/09_05_merge_domain_to_duplicated_texts_sharded.py (14:20) duplicated block id: 245 size: 7 cleaned lines of code in 2 files: - build_obelics/09_05_merge_domain_to_duplicated_texts_sharded.py (14:20) - build_obelics/10_final_cleaning.py (19:25) duplicated block id: 246 size: 7 cleaned lines of code in 2 files: - build_obelics/04_merge_web_docs_with_images.py (18:24) - build_obelics/11_03_set_img_urls_dedup.py (17:23) duplicated block id: 247 size: 7 cleaned lines of code in 2 files: - build_obelics/08_02_urldedup.py (15:21) - obelics/processors/web_document_extractor.py (15:21) duplicated block id: 248 size: 7 cleaned lines of code in 2 files: - build_obelics/04_merge_web_docs_with_images.py (18:24) - build_obelics/09_07_merge_web_docs_texts_only_and_rest.py (16:22) duplicated block id: 249 size: 7 cleaned lines of code in 2 files: - 
build_obelics/09_05_merge_domain_to_duplicated_texts_sharded.py (14:20) - build_obelics/11_03_set_img_urls_dedup.py (17:23) duplicated block id: 250 size: 7 cleaned lines of code in 2 files: - build_obelics/11_01_create_set_img_urls.py (9:15) - build_obelics/12_02_remove_opt_out_images.py (16:22) duplicated block id: 251 size: 7 cleaned lines of code in 2 files: - build_obelics/09_04_get_domain_to_duplicated_texts.py (10:16) - build_obelics/10_final_cleaning.py (19:25) duplicated block id: 252 size: 7 cleaned lines of code in 2 files: - obelics/callers/extract_html.py (10:16) - obelics/processors/web_document_line_deduplication.py (11:17) duplicated block id: 253 size: 7 cleaned lines of code in 2 files: - build_obelics/11_02_get_docs_to_remove_by_set_img_urls_dedup.py (15:21) - build_obelics/12_02_remove_opt_out_images.py (16:22) duplicated block id: 254 size: 7 cleaned lines of code in 2 files: - build_obelics/09_02_get_domain_to_positions.py (10:16) - obelics/callers/extract_html.py (10:16) duplicated block id: 255 size: 7 cleaned lines of code in 2 files: - build_obelics/03_dl_images_create_dataset.py (9:15) - build_obelics/11_02_get_docs_to_remove_by_set_img_urls_dedup.py (15:21) duplicated block id: 256 size: 7 cleaned lines of code in 2 files: - build_obelics/06_01_create_set_image_urls_in_webdocs.py (12:18) - build_obelics/11_03_set_img_urls_dedup.py (17:23) duplicated block id: 257 size: 7 cleaned lines of code in 2 files: - build_obelics/13_final_processing.py (23:29) - obelics/callers/extract_web_documents.py (15:21) duplicated block id: 258 size: 7 cleaned lines of code in 2 files: - build_obelics/07_01_nsfw_image_filtering.py (24:30) - build_obelics/09_01_create_web_docs_texts_only.py (9:15) duplicated block id: 259 size: 7 cleaned lines of code in 2 files: - build_obelics/11_01_create_set_img_urls.py (9:15) - build_obelics/13_final_processing.py (23:29) duplicated block id: 260 size: 7 cleaned lines of code in 2 files: - 
build_obelics/07_01_nsfw_image_filtering.py (24:30) - build_obelics/11_02_get_docs_to_remove_by_set_img_urls_dedup.py (15:21) duplicated block id: 261 size: 7 cleaned lines of code in 2 files: - build_obelics/09_01_create_web_docs_texts_only.py (9:15) - build_obelics/11_01_create_set_img_urls.py (9:15) duplicated block id: 262 size: 7 cleaned lines of code in 2 files: - obelics/callers/download_warc.py (10:16) - obelics/processors/web_document_line_deduplication.py (11:17) duplicated block id: 263 size: 7 cleaned lines of code in 2 files: - build_obelics/07_01_nsfw_image_filtering.py (24:30) - build_obelics/09_05_merge_domain_to_duplicated_texts_sharded.py (14:20) duplicated block id: 264 size: 7 cleaned lines of code in 2 files: - build_obelics/04_merge_web_docs_with_images.py (18:24) - build_obelics/12_02_remove_opt_out_images.py (16:22) duplicated block id: 265 size: 7 cleaned lines of code in 2 files: - build_obelics/01_download_warc.py (11:17) - build_obelics/07_01_nsfw_image_filtering.py (24:30) duplicated block id: 266 size: 7 cleaned lines of code in 2 files: - build_obelics/06_01_create_set_image_urls_in_webdocs.py (12:18) - build_obelics/11_02_get_docs_to_remove_by_set_img_urls_dedup.py (15:21) duplicated block id: 267 size: 7 cleaned lines of code in 2 files: - build_obelics/05_filtering_web_docs.py (27:33) - build_obelics/09_04_get_domain_to_duplicated_texts.py (10:16) duplicated block id: 268 size: 7 cleaned lines of code in 2 files: - build_obelics/06_03_remove_image_duplicates.py (15:21) - build_obelics/09_06_line_dedup.py (10:16) duplicated block id: 269 size: 7 cleaned lines of code in 2 files: - build_obelics/01_download_warc.py (11:17) - build_obelics/11_03_set_img_urls_dedup.py (17:23) duplicated block id: 270 size: 7 cleaned lines of code in 2 files: - build_obelics/04_merge_web_docs_with_images.py (18:24) - build_obelics/07_01_nsfw_image_filtering.py (24:30) duplicated block id: 271 size: 7 cleaned lines of code in 2 files: - 
build_obelics/05_filtering_web_docs.py (27:33) - build_obelics/09_06_line_dedup.py (10:16) duplicated block id: 272 size: 7 cleaned lines of code in 2 files: - build_obelics/06_01_create_set_image_urls_in_webdocs.py (12:18) - build_obelics/11_01_create_set_img_urls.py (9:15) duplicated block id: 273 size: 7 cleaned lines of code in 2 files: - build_obelics/02_extract_html_get_image_urls.py (16:22) - build_obelics/13_final_processing.py (23:29) duplicated block id: 274 size: 7 cleaned lines of code in 2 files: - build_obelics/12_02_remove_opt_out_images.py (16:22) - obelics/processors/web_document_extractor.py (15:21) duplicated block id: 275 size: 7 cleaned lines of code in 2 files: - build_obelics/06_01_create_set_image_urls_in_webdocs.py (12:18) - build_obelics/07_01_nsfw_image_filtering.py (24:30) duplicated block id: 276 size: 7 cleaned lines of code in 2 files: - build_obelics/01_download_warc.py (11:17) - build_obelics/13_final_processing.py (23:29) duplicated block id: 277 size: 7 cleaned lines of code in 2 files: - build_obelics/09_05_merge_domain_to_duplicated_texts_sharded.py (14:20) - build_obelics/11_01_create_set_img_urls.py (9:15) duplicated block id: 278 size: 7 cleaned lines of code in 2 files: - build_obelics/02_extract_html_get_image_urls.py (16:22) - build_obelics/11_03_set_img_urls_dedup.py (17:23) duplicated block id: 279 size: 7 cleaned lines of code in 2 files: - build_obelics/09_06_line_dedup.py (10:16) - obelics/callers/filter_web_documents.py (25:31) duplicated block id: 280 size: 7 cleaned lines of code in 2 files: - build_obelics/02_extract_html_get_image_urls.py (16:22) - build_obelics/09_05_merge_domain_to_duplicated_texts_sharded.py (14:20) duplicated block id: 281 size: 7 cleaned lines of code in 2 files: - build_obelics/07_03_nsfw_image_removal.py (16:22) - build_obelics/09_04_get_domain_to_duplicated_texts.py (10:16) duplicated block id: 282 size: 7 cleaned lines of code in 2 files: - 
build_obelics/09_04_get_domain_to_duplicated_texts.py (10:16) - build_obelics/11_02_get_docs_to_remove_by_set_img_urls_dedup.py (15:21) duplicated block id: 283 size: 7 cleaned lines of code in 2 files: - build_obelics/03_dl_images_create_dataset.py (9:15) - build_obelics/13_final_processing.py (23:29) duplicated block id: 284 size: 7 cleaned lines of code in 2 files: - build_obelics/02_bis_extract_html_get_image_urls_new_rules.py (17:23) - build_obelics/09_01_create_web_docs_texts_only.py (9:15) duplicated block id: 285 size: 7 cleaned lines of code in 2 files: - build_obelics/02_bis_extract_html_get_image_urls_new_rules.py (46:52) - build_obelics/03_dl_images_create_dataset.py (72:78) duplicated block id: 286 size: 7 cleaned lines of code in 2 files: - build_obelics/13_final_processing.py (23:29) - obelics/processors/web_document_line_deduplication.py (11:17) duplicated block id: 287 size: 7 cleaned lines of code in 2 files: - build_obelics/02_extract_html_get_image_urls.py (51:57) - obelics/callers/filter_web_documents.py (47:53) duplicated block id: 288 size: 7 cleaned lines of code in 2 files: - build_obelics/02_bis_extract_html_get_image_urls_new_rules.py (17:23) - build_obelics/09_04_get_domain_to_duplicated_texts.py (10:16) duplicated block id: 289 size: 7 cleaned lines of code in 2 files: - build_obelics/06_01_create_set_image_urls_in_webdocs.py (12:18) - build_obelics/09_04_get_domain_to_duplicated_texts.py (10:16) duplicated block id: 290 size: 7 cleaned lines of code in 2 files: - build_obelics/06_01_create_set_image_urls_in_webdocs.py (39:45) - obelics/callers/extract_web_documents.py (43:49) duplicated block id: 291 size: 7 cleaned lines of code in 2 files: - build_obelics/13_final_processing.py (23:29) - obelics/processors/web_document_extractor.py (15:21) duplicated block id: 292 size: 7 cleaned lines of code in 2 files: - build_obelics/10_final_cleaning.py (19:25) - build_obelics/11_01_create_set_img_urls.py (9:15) duplicated block id: 293 size: 7 
cleaned lines of code in 2 files: - build_obelics/09_04_get_domain_to_duplicated_texts.py (10:16) - obelics/callers/extract_web_documents.py (15:21) duplicated block id: 294 size: 7 cleaned lines of code in 2 files: - build_obelics/09_06_line_dedup.py (10:16) - build_obelics/11_02_get_docs_to_remove_by_set_img_urls_dedup.py (15:21) duplicated block id: 295 size: 7 cleaned lines of code in 2 files: - build_obelics/11_03_set_img_urls_dedup.py (17:23) - obelics/processors/web_document_line_deduplication.py (11:17) duplicated block id: 296 size: 7 cleaned lines of code in 2 files: - build_obelics/09_04_get_domain_to_duplicated_texts.py (10:16) - obelics/processors/web_document_line_deduplication.py (11:17) duplicated block id: 297 size: 7 cleaned lines of code in 2 files: - build_obelics/09_01_create_web_docs_texts_only.py (9:15) - build_obelics/11_02_get_docs_to_remove_by_set_img_urls_dedup.py (15:21) duplicated block id: 298 size: 7 cleaned lines of code in 2 files: - build_obelics/01_download_warc.py (11:17) - obelics/processors/web_document_extractor.py (15:21) duplicated block id: 299 size: 7 cleaned lines of code in 2 files: - build_obelics/11_02_get_docs_to_remove_by_set_img_urls_dedup.py (15:21) - obelics/processors/web_document_line_deduplication.py (11:17) duplicated block id: 300 size: 7 cleaned lines of code in 2 files: - build_obelics/09_07_merge_web_docs_texts_only_and_rest.py (16:22) - obelics/processors/web_document_extractor.py (15:21) duplicated block id: 301 size: 7 cleaned lines of code in 2 files: - build_obelics/08_02_urldedup.py (77:84) - build_obelics/11_03_set_img_urls_dedup.py (76:83) duplicated block id: 302 size: 7 cleaned lines of code in 2 files: - build_obelics/04_merge_web_docs_with_images.py (18:24) - build_obelics/09_04_get_domain_to_duplicated_texts.py (10:16) duplicated block id: 303 size: 7 cleaned lines of code in 2 files: - build_obelics/07_03_nsfw_image_removal.py (16:22) - obelics/processors/web_document_line_deduplication.py 
(11:17) duplicated block id: 304 size: 7 cleaned lines of code in 2 files: - build_obelics/02_bis_extract_html_get_image_urls_new_rules.py (17:23) - build_obelics/13_final_processing.py (23:29) duplicated block id: 305 size: 7 cleaned lines of code in 2 files: - build_obelics/07_03_nsfw_image_removal.py (16:22) - obelics/callers/extract_web_documents.py (15:21) duplicated block id: 306 size: 7 cleaned lines of code in 2 files: - build_obelics/06_01_create_set_image_urls_in_webdocs.py (12:18) - build_obelics/07_03_nsfw_image_removal.py (16:22) duplicated block id: 307 size: 7 cleaned lines of code in 2 files: - build_obelics/02_bis_extract_html_get_image_urls_new_rules.py (17:23) - build_obelics/07_03_nsfw_image_removal.py (16:22) duplicated block id: 308 size: 7 cleaned lines of code in 2 files: - build_obelics/09_07_merge_web_docs_texts_only_and_rest.py (16:22) - obelics/processors/web_document_line_deduplication.py (11:17) duplicated block id: 309 size: 7 cleaned lines of code in 2 files: - build_obelics/11_02_get_docs_to_remove_by_set_img_urls_dedup.py (15:21) - build_obelics/13_final_processing.py (23:29) duplicated block id: 310 size: 7 cleaned lines of code in 2 files: - build_obelics/12_02_remove_opt_out_images.py (16:22) - obelics/callers/extract_html.py (10:16) duplicated block id: 311 size: 7 cleaned lines of code in 2 files: - build_obelics/12_02_remove_opt_out_images.py (16:22) - obelics/callers/extract_web_documents.py (15:21) duplicated block id: 312 size: 7 cleaned lines of code in 2 files: - build_obelics/07_01_nsfw_image_filtering.py (24:30) - obelics/processors/web_document_line_deduplication.py (11:17) duplicated block id: 313 size: 7 cleaned lines of code in 2 files: - obelics/callers/extract_html.py (32:38) - obelics/callers/filter_web_documents.py (47:53) duplicated block id: 314 size: 7 cleaned lines of code in 2 files: - build_obelics/07_01_nsfw_image_filtering.py (24:30) - build_obelics/09_04_get_domain_to_duplicated_texts.py (10:16) 
duplicated block id: 315 size: 7 cleaned lines of code in 2 files: - build_obelics/11_02_get_docs_to_remove_by_set_img_urls_dedup.py (15:21) - obelics/callers/extract_html.py (10:16) duplicated block id: 316 size: 7 cleaned lines of code in 2 files: - build_obelics/03_dl_images_create_dataset.py (72:78) - obelics/callers/download_warc.py (32:38) duplicated block id: 317 size: 7 cleaned lines of code in 2 files: - obelics/callers/extract_html.py (10:16) - obelics/processors/web_document_extractor.py (15:21) duplicated block id: 318 size: 7 cleaned lines of code in 2 files: - obelics/callers/filter_web_documents.py (25:31) - obelics/processors/web_document_line_deduplication.py (11:17) duplicated block id: 319 size: 7 cleaned lines of code in 2 files: - build_obelics/09_02_get_domain_to_positions.py (10:16) - obelics/processors/web_document_extractor.py (15:21) duplicated block id: 320 size: 7 cleaned lines of code in 2 files: - build_obelics/09_07_merge_web_docs_texts_only_and_rest.py (16:22) - obelics/callers/extract_web_documents.py (15:21) duplicated block id: 321 size: 7 cleaned lines of code in 2 files: - build_obelics/07_03_nsfw_image_removal.py (16:22) - build_obelics/11_01_create_set_img_urls.py (9:15) duplicated block id: 322 size: 7 cleaned lines of code in 2 files: - obelics/callers/download_warc.py (32:38) - obelics/callers/extract_web_documents.py (43:49) duplicated block id: 323 size: 7 cleaned lines of code in 2 files: - build_obelics/06_01_create_set_image_urls_in_webdocs.py (12:18) - build_obelics/13_final_processing.py (23:29) duplicated block id: 324 size: 7 cleaned lines of code in 2 files: - build_obelics/02_bis_extract_html_get_image_urls_new_rules.py (17:23) - build_obelics/07_01_nsfw_image_filtering.py (24:30) duplicated block id: 325 size: 7 cleaned lines of code in 2 files: - build_obelics/09_04_get_domain_to_duplicated_texts.py (10:16) - build_obelics/09_07_merge_web_docs_texts_only_and_rest.py (16:22) duplicated block id: 326 size: 7 
cleaned lines of code in 2 files: - obelics/callers/extract_web_documents.py (15:21) - obelics/processors/web_document_extractor.py (15:21) duplicated block id: 327 size: 7 cleaned lines of code in 2 files: - build_obelics/01_download_warc.py (11:17) - build_obelics/09_02_get_domain_to_positions.py (10:16) duplicated block id: 328 size: 7 cleaned lines of code in 2 files: - build_obelics/02_bis_extract_html_get_image_urls_new_rules.py (17:23) - build_obelics/12_02_remove_opt_out_images.py (16:22) duplicated block id: 329 size: 7 cleaned lines of code in 2 files: - build_obelics/09_05_merge_domain_to_duplicated_texts_sharded.py (14:20) - obelics/callers/download_warc.py (10:16) duplicated block id: 330 size: 7 cleaned lines of code in 2 files: - build_obelics/09_04_get_domain_to_duplicated_texts.py (10:16) - obelics/callers/extract_html.py (10:16) duplicated block id: 331 size: 7 cleaned lines of code in 2 files: - build_obelics/09_01_create_web_docs_texts_only.py (9:15) - build_obelics/13_final_processing.py (23:29) duplicated block id: 332 size: 7 cleaned lines of code in 2 files: - build_obelics/09_05_merge_domain_to_duplicated_texts_sharded.py (14:20) - build_obelics/09_07_merge_web_docs_texts_only_and_rest.py (16:22) duplicated block id: 333 size: 7 cleaned lines of code in 2 files: - build_obelics/06_01_create_set_image_urls_in_webdocs.py (12:18) - build_obelics/09_01_create_web_docs_texts_only.py (9:15) duplicated block id: 334 size: 7 cleaned lines of code in 2 files: - build_obelics/02_extract_html_get_image_urls.py (16:22) - build_obelics/08_02_urldedup.py (15:21) duplicated block id: 335 size: 7 cleaned lines of code in 2 files: - build_obelics/03_dl_images_create_dataset.py (9:15) - build_obelics/11_03_set_img_urls_dedup.py (17:23) duplicated block id: 336 size: 7 cleaned lines of code in 2 files: - build_obelics/04_merge_web_docs_with_images.py (18:24) - build_obelics/09_06_line_dedup.py (10:16) duplicated block id: 337 size: 7 cleaned lines of code in 
2 files: - build_obelics/09_06_line_dedup.py (10:16) - obelics/callers/download_warc.py (10:16) duplicated block id: 338 size: 7 cleaned lines of code in 2 files: - build_obelics/09_06_line_dedup.py (10:16) - build_obelics/09_07_merge_web_docs_texts_only_and_rest.py (16:22) duplicated block id: 339 size: 7 cleaned lines of code in 2 files: - build_obelics/09_01_create_web_docs_texts_only.py (9:15) - build_obelics/11_03_set_img_urls_dedup.py (17:23) duplicated block id: 340 size: 7 cleaned lines of code in 2 files: - build_obelics/09_06_line_dedup.py (10:16) - obelics/processors/web_document_extractor.py (15:21) duplicated block id: 341 size: 7 cleaned lines of code in 2 files: - build_obelics/02_bis_extract_html_get_image_urls_new_rules.py (17:23) - build_obelics/09_06_line_dedup.py (10:16) duplicated block id: 342 size: 7 cleaned lines of code in 2 files: - build_obelics/01_download_warc.py (11:17) - build_obelics/07_03_nsfw_image_removal.py (16:22) duplicated block id: 343 size: 7 cleaned lines of code in 2 files: - build_obelics/09_04_get_domain_to_duplicated_texts.py (10:16) - build_obelics/13_final_processing.py (23:29) duplicated block id: 344 size: 7 cleaned lines of code in 2 files: - obelics/callers/extract_html.py (32:38) - obelics/callers/extract_web_documents.py (43:49) duplicated block id: 345 size: 7 cleaned lines of code in 2 files: - build_obelics/03_dl_images_create_dataset.py (9:15) - build_obelics/10_final_cleaning.py (19:25) duplicated block id: 346 size: 7 cleaned lines of code in 2 files: - build_obelics/09_02_get_domain_to_positions.py (10:16) - build_obelics/12_02_remove_opt_out_images.py (16:22) duplicated block id: 347 size: 7 cleaned lines of code in 2 files: - obelics/processors/web_document_extractor.py (280:289) - obelics/processors/web_document_extractor.py (299:308) duplicated block id: 348 size: 7 cleaned lines of code in 2 files: - build_obelics/09_07_merge_web_docs_texts_only_and_rest.py (16:22) - 
obelics/callers/filter_web_documents.py (25:31) duplicated block id: 349 size: 7 cleaned lines of code in 2 files: - build_obelics/02_extract_html_get_image_urls.py (16:22) - build_obelics/11_01_create_set_img_urls.py (9:15) duplicated block id: 350 size: 7 cleaned lines of code in 2 files: - build_obelics/13_final_processing.py (23:29) - obelics/callers/download_warc.py (10:16) duplicated block id: 351 size: 7 cleaned lines of code in 2 files: - build_obelics/07_03_nsfw_image_removal.py (16:22) - obelics/processors/web_document_extractor.py (15:21) duplicated block id: 352 size: 7 cleaned lines of code in 2 files: - build_obelics/06_01_create_set_image_urls_in_webdocs.py (12:18) - build_obelics/12_02_remove_opt_out_images.py (16:22) duplicated block id: 353 size: 7 cleaned lines of code in 2 files: - build_obelics/03_dl_images_create_dataset.py (9:15) - build_obelics/08_02_urldedup.py (15:21) duplicated block id: 354 size: 7 cleaned lines of code in 2 files: - build_obelics/02_bis_extract_html_get_image_urls_new_rules.py (46:52) - obelics/callers/filter_web_documents.py (47:53) duplicated block id: 355 size: 7 cleaned lines of code in 2 files: - build_obelics/09_07_merge_web_docs_texts_only_and_rest.py (16:22) - obelics/callers/extract_html.py (10:16) duplicated block id: 356 size: 7 cleaned lines of code in 2 files: - build_obelics/01_download_warc.py (11:17) - build_obelics/09_04_get_domain_to_duplicated_texts.py (10:16) duplicated block id: 357 size: 7 cleaned lines of code in 2 files: - build_obelics/04_merge_web_docs_with_images.py (18:24) - obelics/processors/web_document_line_deduplication.py (11:17) duplicated block id: 358 size: 7 cleaned lines of code in 2 files: - build_obelics/04_merge_web_docs_with_images.py (18:24) - build_obelics/08_02_urldedup.py (15:21) duplicated block id: 359 size: 7 cleaned lines of code in 2 files: - build_obelics/09_06_line_dedup.py (10:16) - build_obelics/11_03_set_img_urls_dedup.py (17:23) duplicated block id: 360 size: 7 
cleaned lines of code in 2 files: - build_obelics/09_01_create_web_docs_texts_only.py (9:15) - build_obelics/09_02_get_domain_to_positions.py (10:16) duplicated block id: 361 size: 7 cleaned lines of code in 2 files: - build_obelics/09_02_get_domain_to_positions.py (10:16) - build_obelics/10_final_cleaning.py (19:25) duplicated block id: 362 size: 7 cleaned lines of code in 2 files: - build_obelics/09_01_create_web_docs_texts_only.py (9:15) - obelics/callers/download_warc.py (10:16) duplicated block id: 363 size: 7 cleaned lines of code in 2 files: - build_obelics/08_02_urldedup.py (15:21) - obelics/callers/filter_web_documents.py (25:31) duplicated block id: 364 size: 7 cleaned lines of code in 2 files: - build_obelics/09_01_create_web_docs_texts_only.py (9:15) - build_obelics/12_02_remove_opt_out_images.py (16:22) duplicated block id: 365 size: 7 cleaned lines of code in 2 files: - build_obelics/09_02_get_domain_to_positions.py (10:16) - obelics/processors/web_document_line_deduplication.py (11:17) duplicated block id: 366 size: 7 cleaned lines of code in 2 files: - build_obelics/03_dl_images_create_dataset.py (9:15) - obelics/processors/web_document_line_deduplication.py (11:17) duplicated block id: 367 size: 7 cleaned lines of code in 2 files: - build_obelics/09_02_get_domain_to_positions.py (10:16) - obelics/callers/extract_web_documents.py (15:21) duplicated block id: 368 size: 7 cleaned lines of code in 2 files: - build_obelics/10_final_cleaning.py (19:25) - obelics/callers/extract_html.py (10:16) duplicated block id: 369 size: 7 cleaned lines of code in 2 files: - build_obelics/07_03_nsfw_image_removal.py (16:22) - obelics/callers/filter_web_documents.py (25:31) duplicated block id: 370 size: 7 cleaned lines of code in 2 files: - build_obelics/06_01_create_set_image_urls_in_webdocs.py (12:18) - obelics/processors/web_document_line_deduplication.py (11:17) duplicated block id: 371 size: 7 cleaned lines of code in 2 files: - 
build_obelics/02_extract_html_get_image_urls.py (16:22) - build_obelics/11_02_get_docs_to_remove_by_set_img_urls_dedup.py (15:21) duplicated block id: 372 size: 7 cleaned lines of code in 2 files: - build_obelics/01_download_warc.py (11:17) - build_obelics/11_02_get_docs_to_remove_by_set_img_urls_dedup.py (15:21) duplicated block id: 373 size: 7 cleaned lines of code in 2 files: - build_obelics/12_02_remove_opt_out_images.py (16:22) - obelics/callers/filter_web_documents.py (25:31) duplicated block id: 374 size: 7 cleaned lines of code in 2 files: - build_obelics/09_07_merge_web_docs_texts_only_and_rest.py (16:22) - build_obelics/11_01_create_set_img_urls.py (9:15) duplicated block id: 375 size: 7 cleaned lines of code in 2 files: - build_obelics/09_01_create_web_docs_texts_only.py (9:15) - build_obelics/09_05_merge_domain_to_duplicated_texts_sharded.py (14:20) duplicated block id: 376 size: 7 cleaned lines of code in 2 files: - build_obelics/09_05_merge_domain_to_duplicated_texts_sharded.py (14:20) - obelics/processors/web_document_extractor.py (15:21) duplicated block id: 377 size: 7 cleaned lines of code in 2 files: - build_obelics/04_merge_web_docs_with_images.py (18:24) - build_obelics/09_02_get_domain_to_positions.py (10:16) duplicated block id: 378 size: 7 cleaned lines of code in 2 files: - build_obelics/03_dl_images_create_dataset.py (9:15) - build_obelics/07_01_nsfw_image_filtering.py (24:30) duplicated block id: 379 size: 7 cleaned lines of code in 2 files: - build_obelics/10_final_cleaning.py (19:25) - obelics/callers/download_warc.py (10:16) duplicated block id: 380 size: 7 cleaned lines of code in 2 files: - build_obelics/05_filtering_web_docs.py (27:33) - build_obelics/11_02_get_docs_to_remove_by_set_img_urls_dedup.py (15:21) duplicated block id: 381 size: 7 cleaned lines of code in 2 files: - build_obelics/02_bis_extract_html_get_image_urls_new_rules.py (46:52) - obelics/callers/extract_web_documents.py (43:49) duplicated block id: 382 size: 7 
cleaned lines of code in 2 files: - build_obelics/07_03_nsfw_image_removal.py (16:22) - obelics/callers/download_warc.py (10:16) duplicated block id: 383 size: 7 cleaned lines of code in 2 files: - build_obelics/08_02_urldedup.py (15:21) - obelics/processors/web_document_line_deduplication.py (11:17) duplicated block id: 384 size: 7 cleaned lines of code in 2 files: - build_obelics/09_02_get_domain_to_positions.py (10:16) - build_obelics/11_02_get_docs_to_remove_by_set_img_urls_dedup.py (15:21) duplicated block id: 385 size: 7 cleaned lines of code in 2 files: - build_obelics/06_03_remove_image_duplicates.py (15:21) - build_obelics/11_01_create_set_img_urls.py (9:15) duplicated block id: 386 size: 7 cleaned lines of code in 2 files: - build_obelics/11_02_get_docs_to_remove_by_set_img_urls_dedup.py (15:21) - obelics/callers/filter_web_documents.py (25:31) duplicated block id: 387 size: 7 cleaned lines of code in 2 files: - build_obelics/02_bis_extract_html_get_image_urls_new_rules.py (17:23) - obelics/processors/web_document_extractor.py (15:21) duplicated block id: 388 size: 7 cleaned lines of code in 2 files: - build_obelics/09_01_create_web_docs_texts_only.py (9:15) - obelics/processors/web_document_extractor.py (15:21) duplicated block id: 389 size: 7 cleaned lines of code in 2 files: - build_obelics/06_03_remove_image_duplicates.py (15:21) - build_obelics/09_02_get_domain_to_positions.py (10:16) duplicated block id: 390 size: 7 cleaned lines of code in 2 files: - build_obelics/11_01_create_set_img_urls.py (9:15) - obelics/callers/filter_web_documents.py (25:31) duplicated block id: 391 size: 7 cleaned lines of code in 2 files: - build_obelics/09_02_get_domain_to_positions.py (10:16) - obelics/callers/download_warc.py (10:16) duplicated block id: 392 size: 7 cleaned lines of code in 2 files: - build_obelics/08_02_urldedup.py (15:21) - build_obelics/09_06_line_dedup.py (10:16) duplicated block id: 393 size: 7 cleaned lines of code in 2 files: - 
build_obelics/06_01_create_set_image_urls_in_webdocs.py (12:18) - build_obelics/09_07_merge_web_docs_texts_only_and_rest.py (16:22) duplicated block id: 394 size: 7 cleaned lines of code in 2 files: - build_obelics/08_02_urldedup.py (15:21) - obelics/callers/extract_html.py (10:16) duplicated block id: 395 size: 7 cleaned lines of code in 2 files: - build_obelics/02_bis_extract_html_get_image_urls_new_rules.py (17:23) - build_obelics/11_03_set_img_urls_dedup.py (17:23) duplicated block id: 396 size: 7 cleaned lines of code in 2 files: - build_obelics/02_extract_html_get_image_urls.py (16:22) - obelics/processors/web_document_line_deduplication.py (11:17) duplicated block id: 397 size: 7 cleaned lines of code in 2 files: - build_obelics/05_filtering_web_docs.py (54:60) - build_obelics/06_01_create_set_image_urls_in_webdocs.py (39:45) duplicated block id: 398 size: 7 cleaned lines of code in 2 files: - build_obelics/06_01_create_set_image_urls_in_webdocs.py (12:18) - build_obelics/09_05_merge_domain_to_duplicated_texts_sharded.py (14:20) duplicated block id: 399 size: 7 cleaned lines of code in 2 files: - build_obelics/07_01_nsfw_image_filtering.py (24:30) - build_obelics/11_01_create_set_img_urls.py (9:15) duplicated block id: 400 size: 7 cleaned lines of code in 2 files: - build_obelics/03_dl_images_create_dataset.py (72:78) - build_obelics/04_merge_web_docs_with_images.py (59:65) duplicated block id: 401 size: 7 cleaned lines of code in 2 files: - build_obelics/02_bis_extract_html_get_image_urls_new_rules.py (29:35) - build_obelics/04_merge_web_docs_with_images.py (30:36) duplicated block id: 402 size: 7 cleaned lines of code in 2 files: - build_obelics/09_05_merge_domain_to_duplicated_texts_sharded.py (14:20) - obelics/callers/extract_web_documents.py (15:21) duplicated block id: 403 size: 7 cleaned lines of code in 2 files: - build_obelics/11_03_set_img_urls_dedup.py (17:23) - obelics/callers/download_warc.py (10:16) duplicated block id: 404 size: 7 cleaned 
lines of code in 2 files: - build_obelics/11_01_create_set_img_urls.py (9:15) - obelics/callers/extract_web_documents.py (15:21) duplicated block id: 405 size: 7 cleaned lines of code in 2 files: - build_obelics/02_extract_html_get_image_urls.py (28:34) - build_obelics/04_merge_web_docs_with_images.py (30:36) duplicated block id: 406 size: 7 cleaned lines of code in 2 files: - build_obelics/09_06_line_dedup.py (10:16) - obelics/callers/extract_html.py (10:16) duplicated block id: 407 size: 7 cleaned lines of code in 2 files: - build_obelics/03_dl_images_create_dataset.py (9:15) - build_obelics/07_03_nsfw_image_removal.py (16:22) duplicated block id: 408 size: 7 cleaned lines of code in 2 files: - build_obelics/05_filtering_web_docs.py (27:33) - obelics/processors/web_document_extractor.py (15:21) duplicated block id: 409 size: 7 cleaned lines of code in 2 files: - build_obelics/11_01_create_set_img_urls.py (9:15) - obelics/processors/web_document_line_deduplication.py (11:17) duplicated block id: 410 size: 7 cleaned lines of code in 2 files: - build_obelics/10_final_cleaning.py (19:25) - obelics/callers/extract_web_documents.py (15:21) duplicated block id: 411 size: 7 cleaned lines of code in 2 files: - build_obelics/03_dl_images_create_dataset.py (9:15) - build_obelics/09_05_merge_domain_to_duplicated_texts_sharded.py (14:20) duplicated block id: 412 size: 7 cleaned lines of code in 2 files: - build_obelics/09_05_merge_domain_to_duplicated_texts_sharded.py (14:20) - obelics/processors/web_document_line_deduplication.py (11:17) duplicated block id: 413 size: 7 cleaned lines of code in 2 files: - build_obelics/08_02_urldedup.py (15:21) - obelics/callers/extract_web_documents.py (15:21) duplicated block id: 414 size: 7 cleaned lines of code in 2 files: - build_obelics/09_02_get_domain_to_positions.py (10:16) - obelics/callers/filter_web_documents.py (25:31) duplicated block id: 415 size: 7 cleaned lines of code in 2 files: - build_obelics/05_filtering_web_docs.py 
(27:33) - build_obelics/09_01_create_web_docs_texts_only.py (9:15) duplicated block id: 416 size: 7 cleaned lines of code in 2 files: - build_obelics/02_bis_extract_html_get_image_urls_new_rules.py (17:23) - build_obelics/08_02_urldedup.py (15:21) duplicated block id: 417 size: 7 cleaned lines of code in 2 files: - build_obelics/09_04_get_domain_to_duplicated_texts.py (10:16) - build_obelics/09_06_line_dedup.py (10:16) duplicated block id: 418 size: 7 cleaned lines of code in 2 files: - build_obelics/07_01_nsfw_image_filtering.py (24:30) - obelics/callers/extract_web_documents.py (15:21) duplicated block id: 419 size: 7 cleaned lines of code in 2 files: - build_obelics/03_dl_images_create_dataset.py (9:15) - build_obelics/11_01_create_set_img_urls.py (9:15) duplicated block id: 420 size: 7 cleaned lines of code in 2 files: - build_obelics/09_04_get_domain_to_duplicated_texts.py (10:16) - obelics/processors/web_document_extractor.py (15:21) duplicated block id: 421 size: 7 cleaned lines of code in 2 files: - build_obelics/02_bis_extract_html_get_image_urls_new_rules.py (17:23) - build_obelics/09_05_merge_domain_to_duplicated_texts_sharded.py (14:20) duplicated block id: 422 size: 7 cleaned lines of code in 2 files: - build_obelics/07_01_nsfw_image_filtering.py (24:30) - build_obelics/09_02_get_domain_to_positions.py (10:16) duplicated block id: 423 size: 7 cleaned lines of code in 2 files: - obelics/callers/download_warc.py (10:16) - obelics/processors/web_document_extractor.py (15:21) duplicated block id: 424 size: 7 cleaned lines of code in 2 files: - build_obelics/06_03_remove_image_duplicates.py (15:21) - obelics/processors/web_document_line_deduplication.py (11:17) duplicated block id: 425 size: 7 cleaned lines of code in 2 files: - build_obelics/11_01_create_set_img_urls.py (9:15) - obelics/callers/extract_html.py (10:16) duplicated block id: 426 size: 7 cleaned lines of code in 2 files: - build_obelics/09_01_create_web_docs_texts_only.py (9:15) - 
obelics/callers/extract_html.py (10:16) duplicated block id: 427 size: 7 cleaned lines of code in 2 files: - build_obelics/11_03_set_img_urls_dedup.py (17:23) - obelics/callers/extract_html.py (10:16) duplicated block id: 428 size: 7 cleaned lines of code in 2 files: - build_obelics/03_dl_images_create_dataset.py (72:78) - obelics/callers/extract_html.py (32:38) duplicated block id: 429 size: 7 cleaned lines of code in 2 files: - build_obelics/02_bis_extract_html_get_image_urls_new_rules.py (17:23) - build_obelics/09_07_merge_web_docs_texts_only_and_rest.py (16:22) duplicated block id: 430 size: 7 cleaned lines of code in 2 files: - build_obelics/06_03_remove_image_duplicates.py (15:21) - build_obelics/11_02_get_docs_to_remove_by_set_img_urls_dedup.py (15:21) duplicated block id: 431 size: 7 cleaned lines of code in 2 files: - build_obelics/03_dl_images_create_dataset.py (9:15) - build_obelics/09_07_merge_web_docs_texts_only_and_rest.py (16:22) duplicated block id: 432 size: 7 cleaned lines of code in 2 files: - build_obelics/03_dl_images_create_dataset.py (9:15) - build_obelics/09_06_line_dedup.py (10:16) duplicated block id: 433 size: 7 cleaned lines of code in 2 files: - build_obelics/06_03_remove_image_duplicates.py (15:21) - build_obelics/09_04_get_domain_to_duplicated_texts.py (10:16) duplicated block id: 434 size: 7 cleaned lines of code in 2 files: - build_obelics/10_final_cleaning.py (19:25) - obelics/callers/filter_web_documents.py (25:31) duplicated block id: 435 size: 7 cleaned lines of code in 2 files: - build_obelics/09_05_merge_domain_to_duplicated_texts_sharded.py (14:20) - build_obelics/12_02_remove_opt_out_images.py (16:22) duplicated block id: 436 size: 7 cleaned lines of code in 2 files: - build_obelics/08_02_urldedup.py (15:21) - build_obelics/09_02_get_domain_to_positions.py (10:16) duplicated block id: 437 size: 7 cleaned lines of code in 2 files: - build_obelics/06_01_create_set_image_urls_in_webdocs.py (39:45) - 
obelics/callers/filter_web_documents.py (47:53) duplicated block id: 438 size: 7 cleaned lines of code in 2 files: - build_obelics/03_dl_images_create_dataset.py (9:15) - build_obelics/09_01_create_web_docs_texts_only.py (9:15) duplicated block id: 439 size: 7 cleaned lines of code in 2 files: - build_obelics/06_01_create_set_image_urls_in_webdocs.py (12:18) - build_obelics/09_06_line_dedup.py (10:16) duplicated block id: 440 size: 7 cleaned lines of code in 2 files: - build_obelics/05_filtering_web_docs.py (27:33) - build_obelics/09_02_get_domain_to_positions.py (10:16) duplicated block id: 441 size: 7 cleaned lines of code in 2 files: - obelics/callers/filter_web_documents.py (25:31) - obelics/processors/web_document_extractor.py (15:21) duplicated block id: 442 size: 7 cleaned lines of code in 2 files: - build_obelics/11_02_get_docs_to_remove_by_set_img_urls_dedup.py (15:21) - obelics/callers/extract_web_documents.py (15:21) duplicated block id: 443 size: 7 cleaned lines of code in 2 files: - build_obelics/02_extract_html_get_image_urls.py (16:22) - build_obelics/07_01_nsfw_image_filtering.py (24:30) duplicated block id: 444 size: 7 cleaned lines of code in 2 files: - build_obelics/04_merge_web_docs_with_images.py (18:24) - build_obelics/09_01_create_web_docs_texts_only.py (9:15) duplicated block id: 445 size: 7 cleaned lines of code in 2 files: - build_obelics/10_final_cleaning.py (19:25) - build_obelics/11_02_get_docs_to_remove_by_set_img_urls_dedup.py (15:21) duplicated block id: 446 size: 7 cleaned lines of code in 2 files: - build_obelics/02_bis_extract_html_get_image_urls_new_rules.py (17:23) - build_obelics/11_02_get_docs_to_remove_by_set_img_urls_dedup.py (15:21) duplicated block id: 447 size: 7 cleaned lines of code in 2 files: - build_obelics/02_extract_html_get_image_urls.py (16:22) - build_obelics/09_06_line_dedup.py (10:16) duplicated block id: 448 size: 7 cleaned lines of code in 2 files: - build_obelics/04_merge_web_docs_with_images.py (18:24) - 
build_obelics/11_01_create_set_img_urls.py (9:15) duplicated block id: 449 size: 7 cleaned lines of code in 2 files: - obelics/callers/extract_web_documents.py (15:21) - obelics/processors/web_document_line_deduplication.py (11:17) duplicated block id: 450 size: 7 cleaned lines of code in 2 files: - build_obelics/02_extract_html_get_image_urls.py (16:22) - build_obelics/09_01_create_web_docs_texts_only.py (9:15) duplicated block id: 451 size: 7 cleaned lines of code in 2 files: - build_obelics/07_01_nsfw_image_filtering.py (24:30) - obelics/processors/web_document_extractor.py (15:21) duplicated block id: 452 size: 7 cleaned lines of code in 2 files: - build_obelics/09_07_merge_web_docs_texts_only_and_rest.py (16:22) - build_obelics/11_02_get_docs_to_remove_by_set_img_urls_dedup.py (15:21) duplicated block id: 453 size: 7 cleaned lines of code in 2 files: - build_obelics/09_06_line_dedup.py (10:16) - obelics/processors/web_document_line_deduplication.py (11:17) duplicated block id: 454 size: 7 cleaned lines of code in 2 files: - build_obelics/09_04_get_domain_to_duplicated_texts.py (10:16) - build_obelics/11_01_create_set_img_urls.py (9:15) duplicated block id: 455 size: 7 cleaned lines of code in 2 files: - build_obelics/09_01_create_web_docs_texts_only.py (9:15) - obelics/callers/filter_web_documents.py (25:31) duplicated block id: 456 size: 7 cleaned lines of code in 2 files: - build_obelics/02_extract_html_get_image_urls.py (51:57) - obelics/callers/extract_web_documents.py (43:49) duplicated block id: 457 size: 7 cleaned lines of code in 2 files: - build_obelics/09_01_create_web_docs_texts_only.py (9:15) - obelics/processors/web_document_line_deduplication.py (11:17) duplicated block id: 458 size: 7 cleaned lines of code in 2 files: - build_obelics/02_extract_html_get_image_urls.py (16:22) - build_obelics/10_final_cleaning.py (19:25) duplicated block id: 459 size: 7 cleaned lines of code in 2 files: - 
build_obelics/02_bis_extract_html_get_image_urls_new_rules.py (17:23) - build_obelics/09_02_get_domain_to_positions.py (10:16) duplicated block id: 460 size: 7 cleaned lines of code in 2 files: - build_obelics/05_filtering_web_docs.py (54:60) - obelics/callers/download_warc.py (32:38) duplicated block id: 461 size: 7 cleaned lines of code in 2 files: - build_obelics/09_02_get_domain_to_positions.py (10:16) - build_obelics/09_07_merge_web_docs_texts_only_and_rest.py (16:22) duplicated block id: 462 size: 7 cleaned lines of code in 2 files: - build_obelics/09_02_get_domain_to_positions.py (10:16) - build_obelics/09_06_line_dedup.py (10:16) duplicated block id: 463 size: 7 cleaned lines of code in 2 files: - build_obelics/01_download_warc.py (11:17) - build_obelics/08_02_urldedup.py (15:21) duplicated block id: 464 size: 7 cleaned lines of code in 2 files: - build_obelics/01_download_warc.py (11:17) - build_obelics/09_01_create_web_docs_texts_only.py (9:15) duplicated block id: 465 size: 7 cleaned lines of code in 2 files: - build_obelics/06_03_remove_image_duplicates.py (15:21) - build_obelics/09_05_merge_domain_to_duplicated_texts_sharded.py (14:20) duplicated block id: 466 size: 7 cleaned lines of code in 2 files: - build_obelics/02_extract_html_get_image_urls.py (51:57) - build_obelics/05_filtering_web_docs.py (54:60) duplicated block id: 467 size: 7 cleaned lines of code in 2 files: - build_obelics/07_03_nsfw_image_removal.py (16:22) - obelics/callers/extract_html.py (10:16) duplicated block id: 468 size: 7 cleaned lines of code in 2 files: - build_obelics/09_05_merge_domain_to_duplicated_texts_sharded.py (14:20) - build_obelics/13_final_processing.py (23:29) duplicated block id: 469 size: 7 cleaned lines of code in 2 files: - build_obelics/06_03_remove_image_duplicates.py (15:21) - obelics/processors/web_document_extractor.py (15:21) duplicated block id: 470 size: 7 cleaned lines of code in 2 files: - build_obelics/02_extract_html_get_image_urls.py (16:22) - 
build_obelics/09_07_merge_web_docs_texts_only_and_rest.py (16:22) duplicated block id: 471 size: 7 cleaned lines of code in 2 files: - obelics/callers/download_warc.py (32:38) - obelics/callers/filter_web_documents.py (47:53) duplicated block id: 472 size: 7 cleaned lines of code in 2 files: - build_obelics/09_01_create_web_docs_texts_only.py (9:15) - obelics/callers/extract_web_documents.py (15:21) duplicated block id: 473 size: 7 cleaned lines of code in 2 files: - build_obelics/04_merge_web_docs_with_images.py (18:24) - obelics/processors/web_document_extractor.py (15:21) duplicated block id: 474 size: 7 cleaned lines of code in 2 files: - build_obelics/04_merge_web_docs_with_images.py (59:65) - obelics/callers/extract_web_documents.py (43:49) duplicated block id: 475 size: 7 cleaned lines of code in 2 files: - build_obelics/01_download_warc.py (11:17) - build_obelics/10_final_cleaning.py (19:25) duplicated block id: 476 size: 7 cleaned lines of code in 2 files: - build_obelics/09_05_merge_domain_to_duplicated_texts_sharded.py (14:20) - obelics/callers/extract_html.py (10:16) duplicated block id: 477 size: 7 cleaned lines of code in 2 files: - build_obelics/07_03_nsfw_image_removal.py (16:22) - build_obelics/11_02_get_docs_to_remove_by_set_img_urls_dedup.py (15:21) duplicated block id: 478 size: 7 cleaned lines of code in 2 files: - build_obelics/02_extract_html_get_image_urls.py (16:22) - build_obelics/12_02_remove_opt_out_images.py (16:22) duplicated block id: 479 size: 7 cleaned lines of code in 2 files: - build_obelics/04_merge_web_docs_with_images.py (18:24) - build_obelics/07_03_nsfw_image_removal.py (16:22) duplicated block id: 480 size: 7 cleaned lines of code in 2 files: - build_obelics/06_01_create_set_image_urls_in_webdocs.py (12:18) - obelics/processors/web_document_extractor.py (15:21) duplicated block id: 481 size: 7 cleaned lines of code in 2 files: - build_obelics/03_dl_images_create_dataset.py (9:15) - 
obelics/processors/web_document_extractor.py (15:21) duplicated block id: 482 size: 7 cleaned lines of code in 2 files: - build_obelics/13_final_processing.py (23:29) - obelics/callers/extract_html.py (10:16) duplicated block id: 483 size: 7 cleaned lines of code in 2 files: - build_obelics/11_01_create_set_img_urls.py (9:15) - obelics/processors/web_document_extractor.py (15:21) duplicated block id: 484 size: 7 cleaned lines of code in 2 files: - build_obelics/09_02_get_domain_to_positions.py (10:16) - build_obelics/13_final_processing.py (23:29) duplicated block id: 485 size: 7 cleaned lines of code in 2 files: - build_obelics/07_03_nsfw_image_removal.py (16:22) - build_obelics/09_05_merge_domain_to_duplicated_texts_sharded.py (14:20) duplicated block id: 486 size: 7 cleaned lines of code in 2 files: - build_obelics/06_03_remove_image_duplicates.py (15:21) - build_obelics/09_01_create_web_docs_texts_only.py (9:15) duplicated block id: 487 size: 7 cleaned lines of code in 2 files: - build_obelics/02_bis_extract_html_get_image_urls_new_rules.py (17:23) - build_obelics/10_final_cleaning.py (19:25) duplicated block id: 488 size: 7 cleaned lines of code in 2 files: - build_obelics/07_03_nsfw_image_removal.py (16:22) - build_obelics/09_01_create_web_docs_texts_only.py (9:15) duplicated block id: 489 size: 7 cleaned lines of code in 2 files: - build_obelics/09_01_create_web_docs_texts_only.py (9:15) - build_obelics/09_04_get_domain_to_duplicated_texts.py (10:16) duplicated block id: 490 size: 7 cleaned lines of code in 2 files: - build_obelics/09_04_get_domain_to_duplicated_texts.py (10:16) - build_obelics/11_03_set_img_urls_dedup.py (17:23) duplicated block id: 491 size: 7 cleaned lines of code in 2 files: - build_obelics/04_merge_web_docs_with_images.py (18:24) - build_obelics/09_05_merge_domain_to_duplicated_texts_sharded.py (14:20) duplicated block id: 492 size: 7 cleaned lines of code in 2 files: - build_obelics/01_download_warc.py (11:17) - 
build_obelics/09_06_line_dedup.py (10:16) duplicated block id: 493 size: 6 cleaned lines of code in 2 files: - build_obelics/04_merge_web_docs_with_images.py (31:36) - build_obelics/06_03_remove_image_duplicates.py (26:31) duplicated block id: 494 size: 6 cleaned lines of code in 2 files: - build_obelics/02_extract_html_get_image_urls.py (29:34) - build_obelics/06_01_create_set_image_urls_in_webdocs.py (23:28) duplicated block id: 495 size: 6 cleaned lines of code in 2 files: - obelics/processors/dom_tree_simplificator.py (125:130) - obelics/processors/dom_tree_simplificator.py (163:168) duplicated block id: 496 size: 6 cleaned lines of code in 2 files: - build_obelics/01_download_warc.py (22:27) - build_obelics/05_filtering_web_docs.py (38:43) duplicated block id: 497 size: 6 cleaned lines of code in 2 files: - build_obelics/06_01_create_set_image_urls_in_webdocs.py (50:56) - build_obelics/06_03_remove_image_duplicates.py (93:99) duplicated block id: 498 size: 6 cleaned lines of code in 2 files: - build_obelics/05_filtering_web_docs.py (95:101) - build_obelics/06_03_remove_image_duplicates.py (93:99) duplicated block id: 499 size: 6 cleaned lines of code in 2 files: - build_obelics/01_download_warc.py (22:27) - build_obelics/06_01_create_set_image_urls_in_webdocs.py (23:28) duplicated block id: 500 size: 6 cleaned lines of code in 2 files: - build_obelics/01_download_warc.py (43:50) - obelics/callers/download_warc.py (37:44) duplicated block id: 501 size: 6 cleaned lines of code in 2 files: - build_obelics/04_merge_web_docs_with_images.py (31:36) - build_obelics/06_01_create_set_image_urls_in_webdocs.py (23:28) duplicated block id: 502 size: 6 cleaned lines of code in 2 files: - build_obelics/02_bis_extract_html_get_image_urls_new_rules.py (30:35) - build_obelics/06_03_remove_image_duplicates.py (26:31) duplicated block id: 503 size: 6 cleaned lines of code in 2 files: - build_obelics/01_download_warc.py (22:27) - build_obelics/02_extract_html_get_image_urls.py 
(29:34) duplicated block id: 504 size: 6 cleaned lines of code in 2 files: - build_obelics/02_extract_html_get_image_urls.py (29:34) - build_obelics/06_03_remove_image_duplicates.py (26:31) duplicated block id: 505 size: 6 cleaned lines of code in 2 files: - build_obelics/01_download_warc.py (22:27) - build_obelics/03_dl_images_create_dataset.py (20:25) duplicated block id: 506 size: 6 cleaned lines of code in 2 files: - build_obelics/01_download_warc.py (43:50) - build_obelics/06_01_create_set_image_urls_in_webdocs.py (44:51) duplicated block id: 507 size: 6 cleaned lines of code in 2 files: - build_obelics/02_extract_html_get_image_urls.py (29:34) - build_obelics/03_dl_images_create_dataset.py (20:25) duplicated block id: 508 size: 6 cleaned lines of code in 2 files: - build_obelics/01_download_warc.py (22:27) - build_obelics/06_03_remove_image_duplicates.py (26:31) duplicated block id: 509 size: 6 cleaned lines of code in 2 files: - build_obelics/05_filtering_web_docs.py (89:96) - obelics/callers/filter_web_documents.py (82:89) duplicated block id: 510 size: 6 cleaned lines of code in 2 files: - build_obelics/05_filtering_web_docs.py (38:43) - build_obelics/06_01_create_set_image_urls_in_webdocs.py (23:28) duplicated block id: 511 size: 6 cleaned lines of code in 2 files: - build_obelics/03_dl_images_create_dataset.py (20:25) - build_obelics/06_03_remove_image_duplicates.py (26:31) duplicated block id: 512 size: 6 cleaned lines of code in 2 files: - build_obelics/13_final_processing.py (130:135) - build_obelics/13_final_processing.py (145:150) duplicated block id: 513 size: 6 cleaned lines of code in 2 files: - build_obelics/10_final_cleaning.py (53:59) - build_obelics/13_final_processing.py (181:187) duplicated block id: 514 size: 6 cleaned lines of code in 2 files: - build_obelics/02_extract_html_get_image_urls.py (29:34) - build_obelics/05_filtering_web_docs.py (38:43) duplicated block id: 515 size: 6 cleaned lines of code in 2 files: - 
build_obelics/04_merge_web_docs_with_images.py (31:36) - build_obelics/05_filtering_web_docs.py (38:43) duplicated block id: 516 size: 6 cleaned lines of code in 2 files: - build_obelics/01_download_warc.py (22:27) - build_obelics/02_bis_extract_html_get_image_urls_new_rules.py (30:35) duplicated block id: 517 size: 6 cleaned lines of code in 2 files: - build_obelics/03_dl_images_create_dataset.py (20:25) - build_obelics/05_filtering_web_docs.py (38:43) duplicated block id: 518 size: 6 cleaned lines of code in 2 files: - build_obelics/05_filtering_web_docs.py (38:43) - build_obelics/06_03_remove_image_duplicates.py (26:31) duplicated block id: 519 size: 6 cleaned lines of code in 2 files: - build_obelics/01_download_warc.py (43:50) - build_obelics/02_bis_extract_html_get_image_urls_new_rules.py (51:58) duplicated block id: 520 size: 6 cleaned lines of code in 2 files: - build_obelics/03_dl_images_create_dataset.py (20:25) - build_obelics/06_01_create_set_image_urls_in_webdocs.py (23:28) duplicated block id: 521 size: 6 cleaned lines of code in 2 files: - build_obelics/10_final_cleaning.py (88:93) - build_obelics/13_final_processing.py (243:248) duplicated block id: 522 size: 6 cleaned lines of code in 2 files: - build_obelics/02_bis_extract_html_get_image_urls_new_rules.py (30:35) - build_obelics/06_01_create_set_image_urls_in_webdocs.py (23:28) duplicated block id: 523 size: 6 cleaned lines of code in 2 files: - build_obelics/04_merge_web_docs_with_images.py (108:114) - build_obelics/05_filtering_web_docs.py (95:101) duplicated block id: 524 size: 6 cleaned lines of code in 2 files: - build_obelics/01_download_warc.py (43:50) - obelics/callers/extract_html.py (37:44) duplicated block id: 525 size: 6 cleaned lines of code in 2 files: - build_obelics/07_03_nsfw_image_removal.py (70:76) - build_obelics/09_06_line_dedup.py (77:83) duplicated block id: 526 size: 6 cleaned lines of code in 2 files: - build_obelics/04_merge_web_docs_with_images.py (108:114) - 
build_obelics/06_01_create_set_image_urls_in_webdocs.py (50:56) duplicated block id: 527 size: 6 cleaned lines of code in 2 files: - build_obelics/04_merge_web_docs_with_images.py (108:114) - build_obelics/06_03_remove_image_duplicates.py (93:99) duplicated block id: 528 size: 6 cleaned lines of code in 2 files: - build_obelics/01_download_warc.py (44:52) - build_obelics/03_dl_images_create_dataset.py (99:107) duplicated block id: 529 size: 6 cleaned lines of code in 2 files: - build_obelics/03_dl_images_create_dataset.py (20:25) - build_obelics/04_merge_web_docs_with_images.py (31:36) duplicated block id: 530 size: 6 cleaned lines of code in 2 files: - build_obelics/02_bis_extract_html_get_image_urls_new_rules.py (30:35) - build_obelics/05_filtering_web_docs.py (38:43) duplicated block id: 531 size: 6 cleaned lines of code in 2 files: - build_obelics/02_bis_extract_html_get_image_urls_new_rules.py (30:35) - build_obelics/03_dl_images_create_dataset.py (20:25) duplicated block id: 532 size: 6 cleaned lines of code in 2 files: - build_obelics/02_extract_html_get_image_urls.py (57:65) - build_obelics/03_dl_images_create_dataset.py (99:107) duplicated block id: 533 size: 6 cleaned lines of code in 2 files: - build_obelics/01_download_warc.py (22:27) - build_obelics/04_merge_web_docs_with_images.py (31:36)