huggingface / datatrove
File Age & Freshness

File age is the number of days since a file's first commit; file freshness is the number of days since its latest commit. This report shows the distribution of both measurements.
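As a minimal sketch of the two definitions above, the snippet below computes age and freshness for one file from its first and latest commit dates. The dates are the ones listed for src/datatrove/executor/local.py in the tables of this report. The reference date `REPORT_DATE` is an assumption chosen for illustration, since the report does not state when it was generated, and the helper name `age_and_freshness` is hypothetical.

```python
from datetime import date


def age_and_freshness(first_commit: date, latest_commit: date, today: date) -> tuple[int, int]:
    """Return (age, freshness) in whole days relative to `today`."""
    age = (today - first_commit).days          # days since the first commit
    freshness = (today - latest_commit).days   # days since the latest commit
    return age, freshness


# Assumption: the report does not state its generation date, so this is illustrative.
REPORT_DATE = date(2025, 7, 1)

# created / last modified dates listed for src/datatrove/executor/local.py
age, freshness = age_and_freshness(date(2023, 6, 14), date(2025, 4, 29), REPORT_DATE)
print(f"age={age} days, freshness={freshness} days")
```

With a different reference date both numbers shift by the same amount, so the gap between age and freshness for a file stays constant.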

Summary
File Change History Overall
File Age Distribution Overall
Days since first update
  • There are 97 files with 12,801 lines of code in files.
    • 90 files that are 366+ days old (11,536 lines of code)
    • 5 files that are 181-365 days old (998 lines of code)
    • 1 file that is 91-180 days old (107 lines of code)
    • 1 file that is 31-90 days old (160 lines of code)
    • 0 files that are 1-30 days old (0 lines of code)
366+: 90% | 181-365: 7% | 91-180: <1% | 31-90: 1% | 1-30: 0%
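The percentages in the bar above appear to be shares of lines of code per age bucket, not shares of files. The sketch below reproduces them from the line counts listed above. It assumes the percentages are truncated rather than rounded (998 of 12,801 lines is about 7.8%, shown as 7%) and that any non-zero share below 1% is printed as `<1%`; both rules are inferred from the numbers, not documented by the report. The names `BUCKETS`, `bucket_for`, and `share_label` are hypothetical.

```python
# Buckets in the order used by the report: (label, min_days, max_days).
BUCKETS = [
    ("366+", 366, None),
    ("181-365", 181, 365),
    ("91-180", 91, 180),
    ("31-90", 31, 90),
    ("1-30", 1, 30),
]


def bucket_for(days: int) -> str:
    """Map a number of days to the label of the bucket that contains it."""
    for label, low, high in BUCKETS:
        if days >= low and (high is None or days <= high):
            return label
    raise ValueError(f"no bucket for {days} days")


def share_label(part: int, total: int) -> str:
    """Format part/total as a truncated percentage (assumed display rule)."""
    if part == 0:
        return "0%"
    percent = part * 100 // total  # integer division truncates
    return "<1%" if percent == 0 else f"{percent}%"


# Lines of code per age bucket, as listed in the summary above.
loc_by_age = {"366+": 11_536, "181-365": 998, "91-180": 107, "31-90": 160, "1-30": 0}
total = sum(loc_by_age.values())
shares = [share_label(loc_by_age[label], total) for label, _, _ in BUCKETS]
print(" | ".join(shares))
```

The same helper applied to the freshness line counts (1,516, 6,398, 1,388, 3,499, 0) gives the freshness bar in the next section.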

File Freshness Distribution Overall
Days since last update
  • There are 97 files with 12,801 lines of code in files.
    • 40 files have been last changed 366+ days ago (1,516 lines of code)
    • 20 files have been last changed 181-365 days ago (6,398 lines of code)
    • 14 files have been last changed 91-180 days ago (1,388 lines of code)
    • 23 files have been last changed 31-90 days ago (3,499 lines of code)
    • 0 files have been last changed 1-30 days ago (0 lines of code)
366+: 11% | 181-365: 49% | 91-180: 10% | 31-90: 27% | 1-30: 0%

File Change History per File Extension
py, txt, yaml, rs, md, toml, gitignore
File Age Distribution per Extension
Days since first update
extension | 366+ | 181-365 | 91-180 | 31-90 | 1-30
py | 95% | 1% | <1% | 1% | 0%
toml | 85% | 14% | 0% | 0% | 0%
rs | 0% | 100% | 0% | 0% | 0%
File Freshness Distribution per Extension
Days since last update
extension | 366+ | 181-365 | 91-180 | 31-90 | 1-30
py | 12% | 47% | 11% | 28% | 0%
rs | 0% | 100% | 0% | 0% | 0%
toml | 0% | 14% | 0% | 85% | 0%
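A per-extension breakdown like the one above can be sketched by grouping files on their suffix and summing lines of code per age bucket. The rows below are four files taken from the tables in this report, so the result covers only that subset and will not match the report's percentages. The reference date and the helper names `age_bucket` and `loc_by_extension` are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date
from pathlib import PurePosixPath

REPORT_DATE = date(2025, 7, 1)  # assumption: the report does not state its date


def age_bucket(days: int) -> str:
    """Return the report's bucket label for an age in days."""
    if days >= 366:
        return "366+"
    if days >= 181:
        return "181-365"
    if days >= 91:
        return "91-180"
    if days >= 31:
        return "31-90"
    return "1-30"  # anything younger than 31 days lands here in this sketch


# (path, lines of code, created) for a few files listed in this report.
ROWS = [
    ("src/datatrove/tools/fast_mh3/src/s3_union_find.rs", 477, date(2024, 12, 6)),
    ("src/datatrove/tools/fast_mh3/src/local_union_find.rs", 272, date(2024, 12, 6)),
    ("src/datatrove/executor/ray.py", 160, date(2025, 4, 29)),
    ("src/datatrove/executor/local.py", 101, date(2023, 6, 14)),
]


def loc_by_extension(rows, today):
    """Sum lines of code per (extension, age bucket)."""
    totals = defaultdict(lambda: defaultdict(int))
    for path, loc, created in rows:
        ext = PurePosixPath(path).suffix.lstrip(".")
        totals[ext][age_bucket((today - created).days)] += loc
    return {ext: dict(buckets) for ext, buckets in totals.items()}


result = loc_by_extension(ROWS, REPORT_DATE)
print(result)
```

Both Rust files fall in the 181-365 bucket under the assumed date, which agrees with the 100% shown for `rs` in that column.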
File Change History per Logical Decomposition
primary
primary (file age distribution)
Days since first update
component | 366+ | 181-365 | 91-180 | 31-90 | 1-30
src | 89% | 7% | <1% | 1% | 0%
ROOT | 100% | 0% | 0% | 0% | 0%
primary (file freshness distribution)
Days since last update
component | 366+ | 181-365 | 91-180 | 31-90 | 1-30
src | 11% | 50% | 10% | 26% | 0%
ROOT | 0% | 0% | 0% | 100% | 0%
Oldest Files (Top 50)
Each entry lists the file name, its folder, and then these columns: # lines | # units | created | last modified | # changes (days) | # contributors | first contributor | latest contributor
local.py
in src/datatrove/executor
101 4 2023-06-14 2025-04-29 26 6 nostrumg@gmail.com nostrumg@gmail.com
base.py
in src/datatrove/executor
98 10 2023-06-14 2024-05-28 27 7 nostrumg@gmail.com jun.untitled@kakaobrain.com
__init__.py
in src/datatrove/pipeline/filters
11 - 2023-06-14 2024-04-24 13 4 nostrumg@gmail.com nostrumg@gmail.com
__init__.py
in src/datatrove/pipeline/readers
6 - 2023-06-14 2024-01-19 6 3 nostrumg@gmail.com nostrumg@gmail.com
__init__.py
in src/datatrove/pipeline/writers
3 - 2023-06-14 2024-02-23 4 1 nostrumg@gmail.com nostrumg@gmail.com
base.py
in src/datatrove/pipeline
35 8 2023-06-15 2024-05-21 16 6 alessandro@lighton.ai nostrumg@gmail.com
__init__.py
in src/datatrove/pipeline/dedup
12 - 2023-06-16 2024-04-12 10 4 alessandro@lighton.ai kydlicek.hynek@gmail.com
typeshelper.py
in src/datatrove/utils
4106 - 2023-06-19 2024-12-06 10 6 alessandro@lighton.ai nostrumg@gmail.com
base.py
in src/datatrove/pipeline/extractors
117 11 2023-06-20 2025-04-29 17 6 alessandro@lighton.ai kydlicek.hynek@gmail.com
trafilatura.py
in src/datatrove/pipeline/extractors
28 2 2023-06-20 2025-01-02 11 6 alessandro@lighton.ai nostrumg@gmail.com
modular.py
in src/datatrove/pipeline/extractors
23 2 2023-06-20 2024-02-29 8 5 alessandro@lighton.ai thomwolf@users.noreply.gith...
__init__.py
in src/datatrove/pipeline/extractors
2 - 2023-06-20 2023-07-03 3 3 alessandro@lighton.ai anton@huggingface.co
base.py
in src/datatrove/pipeline/readers
143 9 2023-06-21 2025-04-29 28 10 nostrumg@gmail.com vivien.cabannes@gmail.com
warc.py
in src/datatrove/pipeline/readers
94 3 2023-06-21 2024-12-11 24 9 nostrumg@gmail.com nostrumg@gmail.com
disk_base.py
in src/datatrove/pipeline/writers
83 11 2023-06-21 2024-08-28 23 8 nostrumg@gmail.com justhungryman@gmail.com
jsonl.py
in src/datatrove/pipeline/readers
57 2 2023-06-21 2024-06-19 26 10 nostrumg@gmail.com nostrumg@gmail.com
jsonl.py
in src/datatrove/pipeline/writers
28 2 2023-06-21 2024-08-28 18 8 nostrumg@gmail.com justhungryman@gmail.com
slurm.py
in src/datatrove/executor
235 11 2023-06-22 2025-04-29 39 9 nostrumg@gmail.com nostrumg@gmail.com
gopher_repetition_filter.py
in src/datatrove/pipeline/filters
90 6 2023-06-22 2024-12-06 16 5 47321860+alexchapeaux@users... nostrumg@gmail.com
gopher_quality_filter.py
in src/datatrove/pipeline/filters
76 2 2023-06-22 2025-01-30 16 6 47321860+alexchapeaux@users... nostrumg@gmail.com
base_filter.py
in src/datatrove/pipeline/filters
49 5 2023-06-22 2024-07-05 15 5 47321860+alexchapeaux@users... nostrumg@gmail.com
language_filter.py
in src/datatrove/pipeline/filters
43 2 2023-06-22 2025-01-09 18 9 47321860+alexchapeaux@users... sk8terbo2@gmail.com
regex_filter.py
in src/datatrove/pipeline/filters
11 2 2023-06-22 2024-02-06 11 6 47321860+alexchapeaux@users... nostrumg@gmail.com
lambda_filter.py
in src/datatrove/pipeline/filters
11 2 2023-06-22 2024-02-06 6 4 47321860+alexchapeaux@users... nostrumg@gmail.com
tokenizer.py
in src/datatrove/pipeline/tokens
298 16 2023-06-26 2025-04-30 28 6 nostrumg@gmail.com nostrumg@gmail.com
stats.py
in src/datatrove/utils
297 41 2023-06-28 2024-06-14 17 7 alessandro@lighton.ai kydlicek.hynek@gmail.com
merger.py
in src/datatrove/pipeline/tokens
153 5 2023-06-29 2025-04-29 20 4 nostrumg@gmail.com nostrumg@gmail.com
152 - 2023-07-03 2025-04-30 24 14 nostrumg@gmail.com nostrumg@gmail.com
data.py
in src/datatrove
19 - 2023-07-03 2024-03-01 6 4 nostrumg@gmail.com thomwolf@users.noreply.gith...
__init__.py
in src/datatrove/pipeline/tokens
5 - 2023-07-03 2025-01-09 5 5 anton@huggingface.co 74564958+tj-solergibert@use...
sentence_dedup.py
in src/datatrove/pipeline/dedup
370 14 2023-07-10 2025-01-24 26 7 alessandro@lighton.ai muzzynine@gmail.com
__init__.py
in src/datatrove/pipeline/stats
11 - 2023-07-10 2024-06-14 4 4 alessandro@lighton.ai kydlicek.hynek@gmail.com
parquet.py
in src/datatrove/pipeline/readers
57 2 2023-07-19 2024-06-19 12 5 nostrumg@gmail.com nostrumg@gmail.com
url_filter.py
in src/datatrove/pipeline/filters
91 6 2023-07-24 2024-12-06 14 6 47321860+alexchapeaux@users... nostrumg@gmail.com
unigram_log_probs.py
in src/datatrove/pipeline/filters
46 4 2023-07-24 2024-12-06 10 4 47321860+alexchapeaux@users... nostrumg@gmail.com
minhash.py
in src/datatrove/pipeline/dedup
499 18 2023-07-26 2025-04-29 33 7 nostrumg@gmail.com nostrumg@gmail.com
exact_substrings.py
in src/datatrove/pipeline/dedup
229 17 2023-07-26 2025-04-29 23 7 47321860+alexchapeaux@users... nostrumg@gmail.com
check_dataset.py
in src/datatrove/tools
58 4 2023-07-31 2025-04-29 10 4 nostrumg@gmail.com craffel@gmail.com
merge_stats.py
in src/datatrove/tools
33 1 2023-07-31 2024-05-17 9 3 nostrumg@gmail.com nostrumg@gmail.com
__init__.py
in src/datatrove/tools
1 - 2023-07-31 2023-07-31 1 1 nostrumg@gmail.com nostrumg@gmail.com
sampler_filter.py
in src/datatrove/pipeline/filters
17 2 2023-08-21 2024-02-06 4 2 nostrumg@gmail.com nostrumg@gmail.com
context_shuffler.py
in src/datatrove/pipeline/tokens
46 3 2023-09-01 2024-07-08 9 3 nostrumg@gmail.com 134823555+sippycoder@users....
bloom_filter.py
in src/datatrove/pipeline/dedup
143 12 2023-10-05 2024-07-17 11 5 47321860+alexchapeaux@users... nostrumg@gmail.com
csv.py
in src/datatrove/pipeline/readers
50 2 2023-10-05 2024-06-19 11 6 nostrumg@gmail.com nostrumg@gmail.com
logging.py
in src/datatrove/utils
37 7 2023-11-14 2024-05-17 5 1 nostrumg@gmail.com nostrumg@gmail.com
launch_pickled_pipeline.py
in src/datatrove/tools
13 1 2023-11-14 2024-02-21 3 1 nostrumg@gmail.com nostrumg@gmail.com
counter.py
in src/datatrove/pipeline/tokens
36 3 2023-11-20 2025-01-09 12 5 nostrumg@gmail.com nostrumg@gmail.com
failed_logs.py
in src/datatrove/tools
55 1 2023-11-22 2025-04-29 5 2 nostrumg@gmail.com nostrumg@gmail.com
inspect_data.py
in src/datatrove/tools
129 4 2023-11-30 2024-03-20 7 4 nostrumg@gmail.com nostrumg@gmail.com
ipc.py
in src/datatrove/pipeline/readers
64 4 2023-12-22 2024-06-19 8 5 mariosasko777@gmail.com nostrumg@gmail.com
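Each entry in the table above spans three lines: the file name, an "in <folder>" line, and a line of space-separated values. The sketch below parses one such entry into a small record. The `FileRecord` type and `parse_record` helper are hypothetical names, and the handling of `-` (no unit count) and of rows that carry only a line count is an assumption based on how the rows in this report look.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class FileRecord:
    name: str
    folder: str
    lines: int
    units: Optional[int]
    created: Optional[date] = None
    last_modified: Optional[date] = None
    change_days: Optional[int] = None
    contributors: Optional[int] = None
    first_contributor: Optional[str] = None
    latest_contributor: Optional[str] = None


def parse_record(name_line: str, folder_line: str, data_line: str) -> FileRecord:
    """Parse one three-line table entry into a FileRecord."""
    fields = data_line.split()
    record = FileRecord(
        name=name_line.strip(),
        folder=folder_line.strip().removeprefix("in ").strip(),
        lines=int(fields[0]),
        units=None if fields[1] == "-" else int(fields[1]),  # "-" means no units listed
    )
    if len(fields) >= 8:  # some rows carry only "# lines" and "# units"
        record.created = date.fromisoformat(fields[2])
        record.last_modified = date.fromisoformat(fields[3])
        record.change_days = int(fields[4])
        record.contributors = int(fields[5])
        record.first_contributor = fields[6]   # may be truncated with "..." in the report
        record.latest_contributor = fields[7]
    return record


record = parse_record(
    "local.py",
    "in src/datatrove/executor",
    "101 4 2023-06-14 2025-04-29 26 6 nostrumg@gmail.com nostrumg@gmail.com",
)
bare = parse_record("__init__.py", "in src/datatrove/pipeline", "1 -")
print(record)
print(bare)
```

Contributor fields are kept as plain strings because several addresses in the tables are truncated and cannot be validated as e-mail addresses.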
Files Not Recently Changed (Top 50)
Each entry lists the file name, its folder, and then these columns: # lines | # units | created | last modified | # changes (days) | # contributors | first contributor | latest contributor
__init__.py
in src/datatrove/pipeline/extractors
2 - 2023-06-20 2023-07-03 3 3 alessandro@lighton.ai anton@huggingface.co
__init__.py
in src/datatrove/tools
1 - 2023-07-31 2023-07-31 1 1 nostrumg@gmail.com nostrumg@gmail.com
__init__.py
in src/datatrove/pipeline/readers
6 - 2023-06-14 2024-01-19 6 3 nostrumg@gmail.com nostrumg@gmail.com
lambda_filter.py
in src/datatrove/pipeline/filters
11 2 2023-06-22 2024-02-06 6 4 47321860+alexchapeaux@users... nostrumg@gmail.com
regex_filter.py
in src/datatrove/pipeline/filters
11 2 2023-06-22 2024-02-06 11 6 47321860+alexchapeaux@users... nostrumg@gmail.com
sampler_filter.py
in src/datatrove/pipeline/filters
17 2 2023-08-21 2024-02-06 4 2 nostrumg@gmail.com nostrumg@gmail.com
launch_pickled_pipeline.py
in src/datatrove/tools
13 1 2023-11-14 2024-02-21 3 1 nostrumg@gmail.com nostrumg@gmail.com
__init__.py
in src/datatrove/pipeline/writers
3 - 2023-06-14 2024-02-23 4 1 nostrumg@gmail.com nostrumg@gmail.com
modular.py
in src/datatrove/pipeline/extractors
23 2 2023-06-20 2024-02-29 8 5 alessandro@lighton.ai thomwolf@users.noreply.gith...
data.py
in src/datatrove
19 - 2023-07-03 2024-03-01 6 4 nostrumg@gmail.com thomwolf@users.noreply.gith...
inspect_data.py
in src/datatrove/tools
129 4 2023-11-30 2024-03-20 7 4 nostrumg@gmail.com nostrumg@gmail.com
base.py
in src/datatrove/pipeline/formatters
17 3 2024-03-22 2024-03-22 1 1 nostrumg@gmail.com nostrumg@gmail.com
__init__.py
in src/datatrove/pipeline/dedup
12 - 2023-06-16 2024-04-12 10 4 alessandro@lighton.ai kydlicek.hynek@gmail.com
binaryio.py
in src/datatrove/utils
54 3 2024-03-22 2024-04-12 2 2 nostrumg@gmail.com kydlicek.hynek@gmail.com
__init__.py
in src/datatrove/pipeline/formatters
3 - 2024-03-22 2024-04-16 2 1 nostrumg@gmail.com nostrumg@gmail.com
__init__.py
in src/datatrove/pipeline/filters
11 - 2023-06-14 2024-04-24 13 4 nostrumg@gmail.com nostrumg@gmail.com
pii.py
in src/datatrove/pipeline/formatters
68 5 2024-04-16 2024-05-02 2 1 nostrumg@gmail.com nostrumg@gmail.com
__init__.py
in src/datatrove/pipeline/decont
1 - 2024-05-04 2024-05-04 1 1 nostrumg@gmail.com nostrumg@gmail.com
merge_stats.py
in src/datatrove/tools
33 1 2023-07-31 2024-05-17 9 3 nostrumg@gmail.com nostrumg@gmail.com
logging.py
in src/datatrove/utils
37 7 2023-11-14 2024-05-17 5 1 nostrumg@gmail.com nostrumg@gmail.com
xxhash.py
in src/datatrove/utils/hashes
5 2 2024-05-21 2024-05-21 1 1 nostrumg@gmail.com nostrumg@gmail.com
sha1.py
in src/datatrove/utils/hashes
6 2 2024-05-21 2024-05-21 1 1 nostrumg@gmail.com nostrumg@gmail.com
base.py
in src/datatrove/pipeline
35 8 2023-06-15 2024-05-21 16 6 alessandro@lighton.ai nostrumg@gmail.com
hashing.py
in src/datatrove/utils
37 8 2024-05-21 2024-05-21 1 1 nostrumg@gmail.com nostrumg@gmail.com
base.py
in src/datatrove/executor
98 10 2023-06-14 2024-05-28 27 7 nostrumg@gmail.com jun.untitled@kakaobrain.com
config.py
in src/datatrove/pipeline/stats
9 - 2024-06-14 2024-06-14 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
__init__.py
in src/datatrove/pipeline/stats
11 - 2023-07-10 2024-06-14 4 4 alessandro@lighton.ai kydlicek.hynek@gmail.com
perplexity_stats.py
in src/datatrove/pipeline/stats
25 2 2024-06-14 2024-06-14 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
doc_stats.py
in src/datatrove/pipeline/stats
30 2 2024-06-14 2024-06-14 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
contamination_stats.py
in src/datatrove/pipeline/stats
33 2 2024-06-14 2024-06-14 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
sentence_stats.py
in src/datatrove/pipeline/stats
47 4 2024-06-14 2024-06-14 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
merger.py
in src/datatrove/pipeline/stats
51 3 2024-06-14 2024-06-14 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
paragraph_stats.py
in src/datatrove/pipeline/stats
51 4 2024-06-14 2024-06-14 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
word_stats.py
in src/datatrove/pipeline/stats
56 4 2024-06-14 2024-06-14 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
stats.py
in src/datatrove/utils
297 41 2023-06-28 2024-06-14 17 7 alessandro@lighton.ai kydlicek.hynek@gmail.com
lang_stats.py
in src/datatrove/pipeline/stats
26 2 2024-06-14 2024-06-19 2 2 kydlicek.hynek@gmail.com nostrumg@gmail.com
csv.py
in src/datatrove/pipeline/readers
50 2 2023-10-05 2024-06-19 11 6 nostrumg@gmail.com nostrumg@gmail.com
jsonl.py
in src/datatrove/pipeline/readers
57 2 2023-06-21 2024-06-19 26 10 nostrumg@gmail.com nostrumg@gmail.com
parquet.py
in src/datatrove/pipeline/readers
57 2 2023-07-19 2024-06-19 12 5 nostrumg@gmail.com nostrumg@gmail.com
ipc.py
in src/datatrove/pipeline/readers
64 4 2023-12-22 2024-06-19 8 5 mariosasko777@gmail.com nostrumg@gmail.com
batching.py
in src/datatrove/utils
7 1 2024-07-05 2024-07-05 1 1 nostrumg@gmail.com nostrumg@gmail.com
base_filter.py
in src/datatrove/pipeline/filters
49 5 2023-06-22 2024-07-05 15 5 47321860+alexchapeaux@users... nostrumg@gmail.com
context_shuffler.py
in src/datatrove/pipeline/tokens
46 3 2023-09-01 2024-07-08 9 3 nostrumg@gmail.com 134823555+sippycoder@users....
base.py
in src/datatrove/pipeline/stats
91 4 2024-06-14 2024-07-15 2 2 kydlicek.hynek@gmail.com nostrumg@gmail.com
bloom_filter.py
in src/datatrove/pipeline/dedup
143 12 2023-10-05 2024-07-17 11 5 47321860+alexchapeaux@users... nostrumg@gmail.com
jsonl.py
in src/datatrove/pipeline/writers
28 2 2023-06-21 2024-08-28 18 8 nostrumg@gmail.com justhungryman@gmail.com
disk_base.py
in src/datatrove/pipeline/writers
83 11 2023-06-21 2024-08-28 23 8 nostrumg@gmail.com justhungryman@gmail.com
line_stats.py
in src/datatrove/pipeline/stats
57 5 2024-06-14 2024-09-02 2 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
symbol_lines_remover.py
in src/datatrove/pipeline/formatters
25 2 2024-03-22 2024-12-06 3 1 nostrumg@gmail.com nostrumg@gmail.com
Cargo.toml
in src/datatrove/tools/fast_mh3
26 - 2024-12-06 2024-12-06 1 1 nostrumg@gmail.com nostrumg@gmail.com
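The order of the entries above looks consistent with sorting by last-modified date, oldest first, with ties broken by ascending line count. That tie-break is an inference from the rows, not something the report states. A minimal sketch with four entries from the table (folder names added in parentheses to tell the two `__init__.py` files apart):

```python
from datetime import date

# (file, # lines, last modified) for four entries from the table above.
rows = [
    ("sampler_filter.py", 17, date(2024, 2, 6)),
    ("__init__.py (tools)", 1, date(2023, 7, 31)),
    ("regex_filter.py", 11, date(2024, 2, 6)),
    ("__init__.py (extractors)", 2, date(2023, 7, 3)),
]

# Oldest last-modified date first; assumed tie-break: fewer lines first.
ranked = sorted(rows, key=lambda row: (row[2], row[1]))

for name, lines, modified in ranked:
    print(modified.isoformat(), lines, name)
```

Reversing the date part of the key would give a ranking in the style of a "most recently changed" list instead.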
Most Recently Created Files (Top 50)
Each entry lists the file name, its folder, and then these columns: # lines | # units | created | last modified | # changes (days) | # contributors | first contributor | latest contributor
__init__.py
in src/datatrove/pipeline
1 -
__init__.py
in src/datatrove/utils
1 -
__init__.py
in src/datatrove
1 -
ray.py
in src/datatrove/executor
160 5 2025-04-29 2025-04-29 1 2 nostrumg@gmail.com nelson-liu@users.noreply.gi...
megatron_tokenizer.py
in src/datatrove/pipeline/tokens
107 9 2025-01-09 2025-01-09 1 2 nostrumg@gmail.com 74564958+tj-solergibert@use...
s3_union_find.rs
in src/datatrove/tools/fast_mh3/src
477 4 2024-12-06 2024-12-06 1 1 nostrumg@gmail.com nostrumg@gmail.com
local_union_find.rs
in src/datatrove/tools/fast_mh3/src
272 2 2024-12-06 2024-12-06 1 1 nostrumg@gmail.com nostrumg@gmail.com
japanese_tokenizer.py
in src/datatrove/utils
216 16 2024-12-06 2024-12-06 1 1 nostrumg@gmail.com nostrumg@gmail.com
Cargo.toml
in src/datatrove/tools/fast_mh3
26 - 2024-12-06 2024-12-06 1 1 nostrumg@gmail.com nostrumg@gmail.com
batching.py
in src/datatrove/utils
7 1 2024-07-05 2024-07-05 1 1 nostrumg@gmail.com nostrumg@gmail.com
perplexity.py
in src/datatrove/utils
138 12 2024-06-14 2025-04-29 2 2 kydlicek.hynek@gmail.com nostrumg@gmail.com
base.py
in src/datatrove/pipeline/stats
91 4 2024-06-14 2024-07-15 2 2 kydlicek.hynek@gmail.com nostrumg@gmail.com
line_stats.py
in src/datatrove/pipeline/stats
57 5 2024-06-14 2024-09-02 2 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
word_stats.py
in src/datatrove/pipeline/stats
56 4 2024-06-14 2024-06-14 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
paragraph_stats.py
in src/datatrove/pipeline/stats
51 4 2024-06-14 2024-06-14 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
merger.py
in src/datatrove/pipeline/stats
51 3 2024-06-14 2024-06-14 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
sentence_stats.py
in src/datatrove/pipeline/stats
47 4 2024-06-14 2024-06-14 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
lid.py
in src/datatrove/utils
45 6 2024-06-14 2025-01-09 5 3 kydlicek.hynek@gmail.com sk8terbo2@gmail.com
contamination_stats.py
in src/datatrove/pipeline/stats
33 2 2024-06-14 2024-06-14 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
doc_stats.py
in src/datatrove/pipeline/stats
30 2 2024-06-14 2024-06-14 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
lang_stats.py
in src/datatrove/pipeline/stats
26 2 2024-06-14 2024-06-19 2 2 kydlicek.hynek@gmail.com nostrumg@gmail.com
perplexity_stats.py
in src/datatrove/pipeline/stats
25 2 2024-06-14 2024-06-14 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
token_stats.py
in src/datatrove/pipeline/stats
25 2 2024-06-14 2025-01-09 2 2 kydlicek.hynek@gmail.com nostrumg@gmail.com
config.py
in src/datatrove/pipeline/stats
9 - 2024-06-14 2024-06-14 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
word_tokenizers.py
in src/datatrove/utils
361 60 2024-05-21 2025-01-30 5 3 beme248@users.noreply.githu... nostrumg@gmail.com
hashing.py
in src/datatrove/utils
37 8 2024-05-21 2024-05-21 1 1 nostrumg@gmail.com nostrumg@gmail.com
sha1.py
in src/datatrove/utils/hashes
6 2 2024-05-21 2024-05-21 1 1 nostrumg@gmail.com nostrumg@gmail.com
xxhash.py
in src/datatrove/utils/hashes
5 2 2024-05-21 2024-05-21 1 1 nostrumg@gmail.com nostrumg@gmail.com
n_grams.py
in src/datatrove/pipeline/decont
150 6 2024-05-04 2025-04-29 7 3 nostrumg@gmail.com nostrumg@gmail.com
__init__.py
in src/datatrove/pipeline/decont
1 - 2024-05-04 2024-05-04 1 1 nostrumg@gmail.com nostrumg@gmail.com
c4_filters.py
in src/datatrove/pipeline/filters
197 8 2024-04-24 2024-12-06 4 3 nostrumg@gmail.com nostrumg@gmail.com
dataset.py
in src/datatrove/utils
194 12 2024-04-24 2025-04-30 4 2 nostrumg@gmail.com nostrumg@gmail.com
fineweb_quality_filter.py
in src/datatrove/pipeline/filters
47 2 2024-04-20 2025-05-19 5 3 nostrumg@gmail.com cnhx27@gmail.com
pii.py
in src/datatrove/pipeline/formatters
68 5 2024-04-16 2024-05-02 2 1 nostrumg@gmail.com nostrumg@gmail.com
url_dedup.py
in src/datatrove/pipeline/dedup
301 15 2024-04-12 2025-04-29 6 3 kydlicek.hynek@gmail.com nostrumg@gmail.com
text.py
in src/datatrove/utils
254 6 2024-03-22 2024-12-06 7 2 nostrumg@gmail.com nostrumg@gmail.com
binaryio.py
in src/datatrove/utils
54 3 2024-03-22 2024-04-12 2 2 nostrumg@gmail.com kydlicek.hynek@gmail.com
ftfy.py
in src/datatrove/pipeline/formatters
43 2 2024-03-22 2025-01-08 2 1 nostrumg@gmail.com nostrumg@gmail.com
symbol_lines_remover.py
in src/datatrove/pipeline/formatters
25 2 2024-03-22 2024-12-06 3 1 nostrumg@gmail.com nostrumg@gmail.com
base.py
in src/datatrove/pipeline/formatters
17 3 2024-03-22 2024-03-22 1 1 nostrumg@gmail.com nostrumg@gmail.com
__init__.py
in src/datatrove/pipeline/formatters
3 - 2024-03-22 2024-04-16 2 1 nostrumg@gmail.com nostrumg@gmail.com
tokenization.py
in src/datatrove/utils
58 6 2024-03-19 2025-04-29 7 3 nostrumg@gmail.com nostrumg@gmail.com
jobs_status.py
in src/datatrove/tools
71 1 2024-03-05 2025-04-29 4 2 nostrumg@gmail.com nostrumg@gmail.com
huggingface.py
in src/datatrove/pipeline/writers
104 4 2024-02-23 2025-01-30 6 2 nostrumg@gmail.com 2779410+bramvanroy@users.no...
parquet.py
in src/datatrove/pipeline/writers
67 5 2024-02-22 2025-01-30 5 3 nostrumg@gmail.com 2779410+bramvanroy@users.no...
fasttext_filter.py
in src/datatrove/pipeline/filters
78 3 2024-02-21 2025-01-09 7 4 nostrumg@gmail.com sk8terbo2@gmail.com
huggingface.py
in src/datatrove/pipeline/readers
95 4 2024-01-19 2025-04-29 11 7 nostrumg@gmail.com huseyinabanozis@gmail.com
__init__.py
in src/datatrove/executor
3 - 2024-01-19 2025-04-29 2 2 nostrumg@gmail.com nelson-liu@users.noreply.gi...
io.py
in src/datatrove
188 27 2024-01-18 2025-04-30 22 5 nostrumg@gmail.com nostrumg@gmail.com
_import_utils.py
in src/datatrove/utils
54 12 2024-01-18 2025-04-30 6 3 mariosasko777@gmail.com nostrumg@gmail.com
Most Recently Changed Files (Top 50)
Each entry lists the file name, its folder, and then these columns: # lines | # units | created | last modified | # changes (days) | # contributors | first contributor | latest contributor
__init__.py
in src/datatrove/pipeline
1 -
__init__.py
in src/datatrove/utils
1 -
__init__.py
in src/datatrove
1 -
fineweb_quality_filter.py
in src/datatrove/pipeline/filters
47 2 2024-04-20 2025-05-19 5 3 nostrumg@gmail.com cnhx27@gmail.com
tokenizer.py
in src/datatrove/pipeline/tokens
298 16 2023-06-26 2025-04-30 28 6 nostrumg@gmail.com nostrumg@gmail.com
dataset.py
in src/datatrove/utils
194 12 2024-04-24 2025-04-30 4 2 nostrumg@gmail.com nostrumg@gmail.com
io.py
in src/datatrove
188 27 2024-01-18 2025-04-30 22 5 nostrumg@gmail.com nostrumg@gmail.com
152 - 2023-07-03 2025-04-30 24 14 nostrumg@gmail.com nostrumg@gmail.com
_import_utils.py
in src/datatrove/utils
54 12 2024-01-18 2025-04-30 6 3 mariosasko777@gmail.com nostrumg@gmail.com
minhash.py
in src/datatrove/pipeline/dedup
499 18 2023-07-26 2025-04-29 33 7 nostrumg@gmail.com nostrumg@gmail.com
url_dedup.py
in src/datatrove/pipeline/dedup
301 15 2024-04-12 2025-04-29 6 3 kydlicek.hynek@gmail.com nostrumg@gmail.com
slurm.py
in src/datatrove/executor
235 11 2023-06-22 2025-04-29 39 9 nostrumg@gmail.com nostrumg@gmail.com
exact_substrings.py
in src/datatrove/pipeline/dedup
229 17 2023-07-26 2025-04-29 23 7 47321860+alexchapeaux@users... nostrumg@gmail.com
ray.py
in src/datatrove/executor
160 5 2025-04-29 2025-04-29 1 2 nostrumg@gmail.com nelson-liu@users.noreply.gi...
merger.py
in src/datatrove/pipeline/tokens
153 5 2023-06-29 2025-04-29 20 4 nostrumg@gmail.com nostrumg@gmail.com
n_grams.py
in src/datatrove/pipeline/decont
150 6 2024-05-04 2025-04-29 7 3 nostrumg@gmail.com nostrumg@gmail.com
base.py
in src/datatrove/pipeline/readers
143 9 2023-06-21 2025-04-29 28 10 nostrumg@gmail.com vivien.cabannes@gmail.com
perplexity.py
in src/datatrove/utils
138 12 2024-06-14 2025-04-29 2 2 kydlicek.hynek@gmail.com nostrumg@gmail.com
base.py
in src/datatrove/pipeline/extractors
117 11 2023-06-20 2025-04-29 17 6 alessandro@lighton.ai kydlicek.hynek@gmail.com
local.py
in src/datatrove/executor
101 4 2023-06-14 2025-04-29 26 6 nostrumg@gmail.com nostrumg@gmail.com
huggingface.py
in src/datatrove/pipeline/readers
95 4 2024-01-19 2025-04-29 11 7 nostrumg@gmail.com huseyinabanozis@gmail.com
jobs_status.py
in src/datatrove/tools
71 1 2024-03-05 2025-04-29 4 2 nostrumg@gmail.com nostrumg@gmail.com
tokenization.py
in src/datatrove/utils
58 6 2024-03-19 2025-04-29 7 3 nostrumg@gmail.com nostrumg@gmail.com
check_dataset.py
in src/datatrove/tools
58 4 2023-07-31 2025-04-29 10 4 nostrumg@gmail.com craffel@gmail.com
failed_logs.py
in src/datatrove/tools
55 1 2023-11-22 2025-04-29 5 2 nostrumg@gmail.com nostrumg@gmail.com
__init__.py
in src/datatrove/executor
3 - 2024-01-19 2025-04-29 2 2 nostrumg@gmail.com nelson-liu@users.noreply.gi...
word_tokenizers.py
in src/datatrove/utils
361 60 2024-05-21 2025-01-30 5 3 beme248@users.noreply.githu... nostrumg@gmail.com
huggingface.py
in src/datatrove/pipeline/writers
104 4 2024-02-23 2025-01-30 6 2 nostrumg@gmail.com 2779410+bramvanroy@users.no...
gopher_quality_filter.py
in src/datatrove/pipeline/filters
76 2 2023-06-22 2025-01-30 16 6 47321860+alexchapeaux@users... nostrumg@gmail.com
parquet.py
in src/datatrove/pipeline/writers
67 5 2024-02-22 2025-01-30 5 3 nostrumg@gmail.com 2779410+bramvanroy@users.no...
sentence_dedup.py
in src/datatrove/pipeline/dedup
370 14 2023-07-10 2025-01-24 26 7 alessandro@lighton.ai muzzynine@gmail.com
megatron_tokenizer.py
in src/datatrove/pipeline/tokens
107 9 2025-01-09 2025-01-09 1 2 nostrumg@gmail.com 74564958+tj-solergibert@use...
fasttext_filter.py
in src/datatrove/pipeline/filters
78 3 2024-02-21 2025-01-09 7 4 nostrumg@gmail.com sk8terbo2@gmail.com
lid.py
in src/datatrove/utils
45 6 2024-06-14 2025-01-09 5 3 kydlicek.hynek@gmail.com sk8terbo2@gmail.com
language_filter.py
in src/datatrove/pipeline/filters
43 2 2023-06-22 2025-01-09 18 9 47321860+alexchapeaux@users... sk8terbo2@gmail.com
counter.py
in src/datatrove/pipeline/tokens
36 3 2023-11-20 2025-01-09 12 5 nostrumg@gmail.com nostrumg@gmail.com
token_stats.py
in src/datatrove/pipeline/stats
25 2 2024-06-14 2025-01-09 2 2 kydlicek.hynek@gmail.com nostrumg@gmail.com
__init__.py
in src/datatrove/pipeline/tokens
5 - 2023-07-03 2025-01-09 5 5 anton@huggingface.co 74564958+tj-solergibert@use...
ftfy.py
in src/datatrove/pipeline/formatters
43 2 2024-03-22 2025-01-08 2 1 nostrumg@gmail.com nostrumg@gmail.com
trafilatura.py
in src/datatrove/pipeline/extractors
28 2 2023-06-20 2025-01-02 11 6 alessandro@lighton.ai nostrumg@gmail.com
warc.py
in src/datatrove/pipeline/readers
94 3 2023-06-21 2024-12-11 24 9 nostrumg@gmail.com nostrumg@gmail.com
typeshelper.py
in src/datatrove/utils
4106 - 2023-06-19 2024-12-06 10 6 alessandro@lighton.ai nostrumg@gmail.com
s3_union_find.rs
in src/datatrove/tools/fast_mh3/src
477 4 2024-12-06 2024-12-06 1 1 nostrumg@gmail.com nostrumg@gmail.com
local_union_find.rs
in src/datatrove/tools/fast_mh3/src
272 2 2024-12-06 2024-12-06 1 1 nostrumg@gmail.com nostrumg@gmail.com
text.py
in src/datatrove/utils
254 6 2024-03-22 2024-12-06 7 2 nostrumg@gmail.com nostrumg@gmail.com
japanese_tokenizer.py
in src/datatrove/utils
216 16 2024-12-06 2024-12-06 1 1 nostrumg@gmail.com nostrumg@gmail.com
c4_filters.py
in src/datatrove/pipeline/filters
197 8 2024-04-24 2024-12-06 4 3 nostrumg@gmail.com nostrumg@gmail.com
url_filter.py
in src/datatrove/pipeline/filters
91 6 2023-07-24 2024-12-06 14 6 47321860+alexchapeaux@users... nostrumg@gmail.com
gopher_repetition_filter.py
in src/datatrove/pipeline/filters
90 6 2023-06-22 2024-12-06 16 5 47321860+alexchapeaux@users... nostrumg@gmail.com
unigram_log_probs.py
in src/datatrove/pipeline/filters
46 4 2023-07-24 2024-12-06 10 4 47321860+alexchapeaux@users... nostrumg@gmail.com