duplicated block id: 1 size: 51 cleaned lines of code in 2 files:
- src/datatrove/tools/fast_mh3/src/local_union_find.rs (298:367)
- src/datatrove/tools/fast_mh3/src/s3_union_find.rs (490:559)

duplicated block id: 2 size: 33 cleaned lines of code in 2 files:
- src/datatrove/pipeline/readers/csv.py (34:66)
- src/datatrove/pipeline/readers/jsonl.py (35:67)

duplicated block id: 3 size: 25 cleaned lines of code in 2 files:
- src/datatrove/pipeline/readers/ipc.py (40:64)
- src/datatrove/pipeline/readers/jsonl.py (42:66)

duplicated block id: 4 size: 25 cleaned lines of code in 2 files:
- src/datatrove/pipeline/readers/jsonl.py (42:66)
- src/datatrove/pipeline/readers/parquet.py (43:67)

duplicated block id: 5 size: 25 cleaned lines of code in 2 files:
- src/datatrove/pipeline/readers/ipc.py (40:64)
- src/datatrove/pipeline/readers/parquet.py (43:67)

duplicated block id: 6 size: 25 cleaned lines of code in 2 files:
- src/datatrove/pipeline/readers/csv.py (41:65)
- src/datatrove/pipeline/readers/ipc.py (40:64)

duplicated block id: 7 size: 25 cleaned lines of code in 2 files:
- src/datatrove/pipeline/readers/csv.py (41:65)
- src/datatrove/pipeline/readers/parquet.py (43:67)

duplicated block id: 8 size: 23 cleaned lines of code in 2 files:
- src/datatrove/tools/fast_mh3/src/local_union_find.rs (167:192)
- src/datatrove/tools/fast_mh3/src/s3_union_find.rs (352:378)

duplicated block id: 9 size: 22 cleaned lines of code in 2 files:
- src/datatrove/executor/base.py (85:111)
- src/datatrove/executor/ray.py (228:254)

duplicated block id: 10 size: 20 cleaned lines of code in 2 files:
- src/datatrove/tools/fast_mh3/src/s3_union_find.rs (373:394)
- src/datatrove/tools/fast_mh3/src/s3_union_find.rs (456:477)

duplicated block id: 11 size: 20 cleaned lines of code in 2 files:
- src/datatrove/pipeline/dedup/sentence_dedup.py (246:267)
- src/datatrove/pipeline/dedup/url_dedup.py (248:269)

duplicated block id: 12 size: 18 cleaned lines of code in 2 files:
- src/datatrove/tools/fast_mh3/src/local_union_find.rs (111:131)
- src/datatrove/tools/fast_mh3/src/s3_union_find.rs (283:303)

duplicated block id: 13 size: 17 cleaned lines of code in 2 files:
- src/datatrove/tools/fast_mh3/src/local_union_find.rs (223:247)
- src/datatrove/tools/fast_mh3/src/s3_union_find.rs (413:437)

duplicated block id: 14 size: 17 cleaned lines of code in 2 files:
- src/datatrove/pipeline/readers/csv.py (34:50)
- src/datatrove/pipeline/readers/warc.py (38:54)

duplicated block id: 15 size: 17 cleaned lines of code in 2 files:
- src/datatrove/pipeline/readers/jsonl.py (35:51)
- src/datatrove/pipeline/readers/warc.py (38:54)

duplicated block id: 16 size: 15 cleaned lines of code in 2 files:
- src/datatrove/pipeline/readers/ipc.py (50:64)
- src/datatrove/pipeline/readers/warc.py (56:70)

duplicated block id: 17 size: 15 cleaned lines of code in 2 files:
- src/datatrove/pipeline/readers/jsonl.py (52:66)
- src/datatrove/pipeline/readers/warc.py (56:70)

duplicated block id: 18 size: 15 cleaned lines of code in 2 files:
- src/datatrove/pipeline/readers/csv.py (51:65)
- src/datatrove/pipeline/readers/warc.py (56:70)

duplicated block id: 19 size: 15 cleaned lines of code in 2 files:
- src/datatrove/pipeline/readers/parquet.py (53:67)
- src/datatrove/pipeline/readers/warc.py (56:70)

duplicated block id: 20 size: 14 cleaned lines of code in 2 files:
- src/datatrove/tools/fast_mh3/src/local_union_find.rs (42:59)
- src/datatrove/tools/fast_mh3/src/s3_union_find.rs (83:100)

duplicated block id: 21 size: 12 cleaned lines of code in 2 files:
- src/datatrove/pipeline/readers/base.py (139:150)
- src/datatrove/pipeline/readers/csv.py (39:50)

duplicated block id: 22 size: 12 cleaned lines of code in 2 files:
- src/datatrove/tools/fast_mh3/src/local_union_find.rs (305:320)
- src/datatrove/tools/fast_mh3/src/local_union_find.rs (324:339)

duplicated block id: 23 size: 12 cleaned lines of code in 2 files:
- src/datatrove/pipeline/readers/base.py (139:150)
- src/datatrove/pipeline/readers/jsonl.py (40:51)

duplicated block id: 24 size: 12 cleaned lines of code in 2 files:
- src/datatrove/pipeline/readers/base.py (139:150)
- src/datatrove/pipeline/readers/warc.py (43:54)

duplicated block id: 25 size: 12 cleaned lines of code in 2 files:
- src/datatrove/tools/fast_mh3/src/local_union_find.rs (305:320)
- src/datatrove/tools/fast_mh3/src/s3_union_find.rs (516:531)

duplicated block id: 26 size: 12 cleaned lines of code in 2 files:
- src/datatrove/tools/fast_mh3/src/local_union_find.rs (324:339)
- src/datatrove/tools/fast_mh3/src/s3_union_find.rs (497:512)

duplicated block id: 27 size: 12 cleaned lines of code in 2 files:
- src/datatrove/tools/fast_mh3/src/s3_union_find.rs (497:512)
- src/datatrove/tools/fast_mh3/src/s3_union_find.rs (516:531)

duplicated block id: 28 size: 11 cleaned lines of code in 2 files:
- src/datatrove/pipeline/dedup/sentence_dedup.py (218:230)
- src/datatrove/pipeline/dedup/url_dedup.py (214:226)

duplicated block id: 29 size: 11 cleaned lines of code in 2 files:
- src/datatrove/pipeline/dedup/sentence_dedup.py (146:170)
- src/datatrove/pipeline/dedup/url_dedup.py (145:158)

duplicated block id: 30 size: 10 cleaned lines of code in 2 files:
- src/datatrove/pipeline/readers/base.py (141:150)
- src/datatrove/pipeline/readers/parquet.py (43:52)

duplicated block id: 31 size: 10 cleaned lines of code in 2 files:
- src/datatrove/pipeline/dedup/sentence_dedup.py (235:244)
- src/datatrove/pipeline/dedup/url_dedup.py (237:246)

duplicated block id: 32 size: 10 cleaned lines of code in 2 files:
- src/datatrove/pipeline/readers/parquet.py (43:52)
- src/datatrove/pipeline/readers/warc.py (45:54)

duplicated block id: 33 size: 10 cleaned lines of code in 2 files:
- src/datatrove/pipeline/readers/base.py (141:150)
- src/datatrove/pipeline/readers/ipc.py (40:49)

duplicated block id: 34 size: 10 cleaned lines of code in 2 files:
- src/datatrove/pipeline/stats/paragraph_stats.py (38:47)
- src/datatrove/pipeline/stats/sentence_stats.py (38:47)

duplicated block id: 35 size: 10 cleaned lines of code in 2 files:
- src/datatrove/tools/fast_mh3/src/local_union_find.rs (88:102)
- src/datatrove/tools/fast_mh3/src/s3_union_find.rs (258:272)

duplicated block id: 36 size: 10 cleaned lines of code in 2 files:
- src/datatrove/pipeline/readers/ipc.py (40:49)
- src/datatrove/pipeline/readers/warc.py (45:54)

duplicated block id: 37 size: 10 cleaned lines of code in 2 files:
- src/datatrove/tools/fast_mh3/src/local_union_find.rs (22:38)
- src/datatrove/tools/fast_mh3/src/s3_union_find.rs (40:56)

duplicated block id: 38 size: 9 cleaned lines of code in 2 files:
- src/datatrove/pipeline/readers/ipc.py (88:96)
- src/datatrove/pipeline/readers/parquet.py (79:87)

duplicated block id: 39 size: 9 cleaned lines of code in 2 files:
- src/datatrove/pipeline/dedup/sentence_dedup.py (494:503)
- src/datatrove/pipeline/dedup/url_dedup.py (406:416)

duplicated block id: 40 size: 9 cleaned lines of code in 2 files:
- src/datatrove/pipeline/tokens/merger.py (169:177)
- src/datatrove/pipeline/tokens/merger.py (186:194)

duplicated block id: 41 size: 9 cleaned lines of code in 2 files:
- src/datatrove/tools/fast_mh3/src/local_union_find.rs (259:268)
- src/datatrove/tools/fast_mh3/src/s3_union_find.rs (452:461)

duplicated block id: 42 size: 8 cleaned lines of code in 2 files:
- src/datatrove/pipeline/stats/paragraph_stats.py (40:47)
- src/datatrove/pipeline/stats/word_stats.py (47:54)

duplicated block id: 43 size: 8 cleaned lines of code in 2 files:
- src/datatrove/pipeline/stats/sentence_stats.py (40:47)
- src/datatrove/pipeline/stats/word_stats.py (47:54)

duplicated block id: 44 size: 7 cleaned lines of code in 2 files:
- src/datatrove/pipeline/dedup/sentence_dedup.py (472:478)
- src/datatrove/pipeline/dedup/url_dedup.py (384:390)

duplicated block id: 45 size: 7 cleaned lines of code in 2 files:
- src/datatrove/pipeline/dedup/sentence_dedup.py (386:394)
- src/datatrove/pipeline/dedup/url_dedup.py (321:329)

duplicated block id: 46 size: 7 cleaned lines of code in 2 files:
- src/datatrove/pipeline/readers/ipc.py (31:38)
- src/datatrove/pipeline/readers/parquet.py (33:40)

duplicated block id: 47 size: 7 cleaned lines of code in 2 files:
- src/datatrove/pipeline/tokens/tokenizer.py (202:209)
- src/datatrove/pipeline/tokens/tokenizer.py (235:241)

duplicated block id: 48 size: 6 cleaned lines of code in 2 files:
- src/datatrove/pipeline/readers/base.py (135:140)
- src/datatrove/pipeline/readers/ipc.py (33:38)

duplicated block id: 49 size: 6 cleaned lines of code in 2 files:
- src/datatrove/tools/fast_mh3/src/local_union_find.rs (187:192)
- src/datatrove/tools/fast_mh3/src/local_union_find.rs (263:268)

duplicated block id: 50 size: 6 cleaned lines of code in 2 files:
- src/datatrove/tools/fast_mh3/src/local_union_find.rs (263:268)
- src/datatrove/tools/fast_mh3/src/s3_union_find.rs (373:378)

duplicated block id: 51 size: 6 cleaned lines of code in 2 files:
- src/datatrove/tools/fast_mh3/src/local_union_find.rs (187:192)
- src/datatrove/tools/fast_mh3/src/s3_union_find.rs (456:461)

duplicated block id: 52 size: 6 cleaned lines of code in 2 files:
- src/datatrove/pipeline/dedup/sentence_dedup.py (284:291)
- src/datatrove/pipeline/dedup/url_dedup.py (281:288)

duplicated block id: 53 size: 6 cleaned lines of code in 2 files:
- src/datatrove/pipeline/tokens/tokenizer.py (234:239)
- src/datatrove/pipeline/tokens/tokenizer.py (394:399)

duplicated block id: 54 size: 6 cleaned lines of code in 2 files:
- src/datatrove/pipeline/dedup/minhash.py (467:472)
- src/datatrove/pipeline/dedup/minhash.py (656:661)

duplicated block id: 55 size: 6 cleaned lines of code in 2 files:
- src/datatrove/pipeline/readers/base.py (135:140)
- src/datatrove/pipeline/readers/parquet.py (35:40)

duplicated block id: 56 size: 6 cleaned lines of code in 2 files:
- src/datatrove/tools/fast_mh3/src/local_union_find.rs (215:221)
- src/datatrove/tools/fast_mh3/src/s3_union_find.rs (404:410)

duplicated block id: 57 size: 6 cleaned lines of code in 2 files:
- src/datatrove/tools/fast_mh3/src/local_union_find.rs (104:109)
- src/datatrove/tools/fast_mh3/src/s3_union_find.rs (275:280)

duplicated block id: 58 size: 6 cleaned lines of code in 2 files:
- src/datatrove/pipeline/dedup/sentence_dedup.py (210:215)
- src/datatrove/pipeline/dedup/sentence_dedup.py (471:476)

duplicated block id: 59 size: 6 cleaned lines of code in 2 files:
- src/datatrove/pipeline/dedup/url_dedup.py (205:210)
- src/datatrove/pipeline/dedup/url_dedup.py (383:388)

duplicated block id: 60 size: 6 cleaned lines of code in 2 files:
- src/datatrove/pipeline/dedup/sentence_dedup.py (87:92)
- src/datatrove/pipeline/dedup/url_dedup.py (88:93)

duplicated block id: 61 size: 6 cleaned lines of code in 2 files:
- src/datatrove/pipeline/dedup/sentence_dedup.py (211:216)
- src/datatrove/pipeline/dedup/url_dedup.py (206:211)