huggingface / datatrove
File Change Frequency

File change frequency (churn) shows how often files are updated. A file's change count is the number of days on which it received at least one commit.
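As a rough illustration of this definition, the sketch below counts, for each file, the number of distinct calendar days on which it was touched by at least one commit. The input shape (pairs of file path and ISO timestamp) and the function name `change_days` are assumptions made for the example; they are not taken from the tool that produced this report.

```python
from collections import defaultdict
from datetime import datetime


def change_days(commits):
    """Return {path: number of distinct days with at least one commit}.

    `commits` is an iterable of (path, iso_timestamp) pairs. This input
    shape is a hypothetical stand-in for parsed version-control history.
    """
    days_per_file = defaultdict(set)
    for path, timestamp in commits:
        # Several commits on the same day collapse into a single "change".
        days_per_file[path].add(datetime.fromisoformat(timestamp).date())
    return {path: len(days) for path, days in days_per_file.items()}


# Illustrative data only (not taken from the repository history).
example = [
    ("a.py", "2024-01-01T09:00:00"),
    ("a.py", "2024-01-01T17:30:00"),
    ("a.py", "2024-01-02T10:00:00"),
    ("b.py", "2024-01-02T10:00:00"),
]
print(change_days(example))
```

In this example `a.py` counts as 2 changes, because its two commits on 2024-01-01 fall on the same day.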

Overview
File Change Frequency Overall
  • There are 97 files with 12,801 lines of code.
    • 0 files changed more than 100 times (0 lines of code)
    • 0 files changed 51-100 times (0 lines of code)
    • 13 files changed 21-50 times (2,547 lines of code)
    • 37 files changed 6-20 times (6,962 lines of code)
    • 47 files changed 1-5 times (3,292 lines of code)
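The bucketing above can be sketched as follows. The bucket boundaries and the lines-of-code totals come from the list; the helper names `bucket_for` and `loc_shares` are hypothetical, and truncating (rather than rounding) the percentages is an assumption inferred from the reported bar of 0% | 0% | 19% | 54% | 25% (2,547 of 12,801 lines is about 19.9%, shown as 19%).

```python
# Bucket boundaries as listed in the report: (label, lowest, highest).
BUCKETS = [
    ("101+", 101, None),
    ("51-100", 51, 100),
    ("21-50", 21, 50),
    ("6-20", 6, 20),
    ("1-5", 1, 5),
]


def bucket_for(changes):
    """Return the bucket label for a file's number of changes (days)."""
    for label, low, high in BUCKETS:
        if changes >= low and (high is None or changes <= high):
            return label
    raise ValueError(f"no bucket for {changes} changes")


def loc_shares(loc_per_bucket):
    """Return each bucket's share of lines of code as a truncated percentage."""
    total = sum(loc_per_bucket.values())
    # Integer division truncates; this is an assumption of the sketch.
    return {label: (100 * loc) // total for label, loc in loc_per_bucket.items()}


# Lines of code per bucket, copied from the list above.
loc_per_bucket = {"101+": 0, "51-100": 0, "21-50": 2547, "6-20": 6962, "1-5": 3292}
shares = loc_shares(loc_per_bucket)
print(shares)
```

With these inputs the truncated shares are 0, 0, 19, 54 and 25 percent, which matches the reported distribution.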
Share of lines of code by number of changes: 101+: 0% | 51-100: 0% | 21-50: 19% | 6-20: 54% | 1-5: 25%

Contributors Count Frequency Overall
  • There are 97 files with 12,801 lines of code.
    • 0 files changed by more than 25 contributors (0 lines of code)
    • 1 file changed by 11-25 contributors (152 lines of code)
    • 23 files changed by 6-10 contributors (7,184 lines of code)
    • 46 files changed by 2-5 contributors (3,849 lines of code)
    • 27 files changed by 1 contributor (1,616 lines of code)
Share of lines of code by number of contributors: 26+: 0% | 11-25: 1% | 6-10: 56% | 2-5: 30% | 1: 12%

File Change Frequency per File Extension
py, txt, yaml, rs, md, toml, gitignore
The number of recorded file updates, as shares per bucket (101+ | 51-100 | 21-50 | 6-20 | 1-5):
py: 0% | 0% | 20% | 58% | 21%
toml: 0% | 0% | 85% | 0% | 14%
rs: 0% | 0% | 0% | 0% | 100%
File Change Frequency per Logical Decomposition
primary (file change frequency)
The number of recorded file updates, as shares per bucket (101+ | 51-100 | 21-50 | 6-20 | 1-5):
src: 0% | 0% | 18% | 55% | 26%
ROOT: 0% | 0% | 100% | 0% | 0%
Most Frequently Changed Files (Top 50)


File | # lines | # units | created | last modified | # changes (days) | # contributors | first contributor | latest contributor
slurm.py
in src/datatrove/executor
235 11 2023-06-22 2025-04-29 39 9 nostrumg@gmail.com nostrumg@gmail.com
minhash.py
in src/datatrove/pipeline/dedup
499 18 2023-07-26 2025-04-29 33 7 nostrumg@gmail.com nostrumg@gmail.com
base.py
in src/datatrove/pipeline/readers
143 9 2023-06-21 2025-04-29 28 10 nostrumg@gmail.com vivien.cabannes@gmail.com
tokenizer.py
in src/datatrove/pipeline/tokens
298 16 2023-06-26 2025-04-30 28 6 nostrumg@gmail.com nostrumg@gmail.com
base.py
in src/datatrove/executor
98 10 2023-06-14 2024-05-28 27 7 nostrumg@gmail.com jun.untitled@kakaobrain.com
jsonl.py
in src/datatrove/pipeline/readers
57 2 2023-06-21 2024-06-19 26 10 nostrumg@gmail.com nostrumg@gmail.com
local.py
in src/datatrove/executor
101 4 2023-06-14 2025-04-29 26 6 nostrumg@gmail.com nostrumg@gmail.com
sentence_dedup.py
in src/datatrove/pipeline/dedup
370 14 2023-07-10 2025-01-24 26 7 alessandro@lighton.ai muzzynine@gmail.com
warc.py
in src/datatrove/pipeline/readers
94 3 2023-06-21 2024-12-11 24 9 nostrumg@gmail.com nostrumg@gmail.com
152 - 2023-07-03 2025-04-30 24 14 nostrumg@gmail.com nostrumg@gmail.com
disk_base.py
in src/datatrove/pipeline/writers
83 11 2023-06-21 2024-08-28 23 8 nostrumg@gmail.com justhungryman@gmail.com
exact_substrings.py
in src/datatrove/pipeline/dedup
229 17 2023-07-26 2025-04-29 23 7 47321860+alexchapeaux@users... nostrumg@gmail.com
io.py
in src/datatrove
188 27 2024-01-18 2025-04-30 22 5 nostrumg@gmail.com nostrumg@gmail.com
merger.py
in src/datatrove/pipeline/tokens
153 5 2023-06-29 2025-04-29 20 4 nostrumg@gmail.com nostrumg@gmail.com
jsonl.py
in src/datatrove/pipeline/writers
28 2 2023-06-21 2024-08-28 18 8 nostrumg@gmail.com justhungryman@gmail.com
language_filter.py
in src/datatrove/pipeline/filters
43 2 2023-06-22 2025-01-09 18 9 47321860+alexchapeaux@users... sk8terbo2@gmail.com
base.py
in src/datatrove/pipeline/extractors
117 11 2023-06-20 2025-04-29 17 6 alessandro@lighton.ai kydlicek.hynek@gmail.com
stats.py
in src/datatrove/utils
297 41 2023-06-28 2024-06-14 17 7 alessandro@lighton.ai kydlicek.hynek@gmail.com
base.py
in src/datatrove/pipeline
35 8 2023-06-15 2024-05-21 16 6 alessandro@lighton.ai nostrumg@gmail.com
gopher_quality_filter.py
in src/datatrove/pipeline/filters
76 2 2023-06-22 2025-01-30 16 6 47321860+alexchapeaux@users... nostrumg@gmail.com
gopher_repetition_filter.py
in src/datatrove/pipeline/filters
90 6 2023-06-22 2024-12-06 16 5 47321860+alexchapeaux@users... nostrumg@gmail.com
base_filter.py
in src/datatrove/pipeline/filters
49 5 2023-06-22 2024-07-05 15 5 47321860+alexchapeaux@users... nostrumg@gmail.com
url_filter.py
in src/datatrove/pipeline/filters
91 6 2023-07-24 2024-12-06 14 6 47321860+alexchapeaux@users... nostrumg@gmail.com
__init__.py
in src/datatrove/pipeline/filters
11 - 2023-06-14 2024-04-24 13 4 nostrumg@gmail.com nostrumg@gmail.com
counter.py
in src/datatrove/pipeline/tokens
36 3 2023-11-20 2025-01-09 12 5 nostrumg@gmail.com nostrumg@gmail.com
parquet.py
in src/datatrove/pipeline/readers
57 2 2023-07-19 2024-06-19 12 5 nostrumg@gmail.com nostrumg@gmail.com
regex_filter.py
in src/datatrove/pipeline/filters
11 2 2023-06-22 2024-02-06 11 6 47321860+alexchapeaux@users... nostrumg@gmail.com
trafilatura.py
in src/datatrove/pipeline/extractors
28 2 2023-06-20 2025-01-02 11 6 alessandro@lighton.ai nostrumg@gmail.com
csv.py
in src/datatrove/pipeline/readers
50 2 2023-10-05 2024-06-19 11 6 nostrumg@gmail.com nostrumg@gmail.com
huggingface.py
in src/datatrove/pipeline/readers
95 4 2024-01-19 2025-04-29 11 7 nostrumg@gmail.com huseyinabanozis@gmail.com
bloom_filter.py
in src/datatrove/pipeline/dedup
143 12 2023-10-05 2024-07-17 11 5 47321860+alexchapeaux@users... nostrumg@gmail.com
__init__.py
in src/datatrove/pipeline/dedup
12 - 2023-06-16 2024-04-12 10 4 alessandro@lighton.ai kydlicek.hynek@gmail.com
unigram_log_probs.py
in src/datatrove/pipeline/filters
46 4 2023-07-24 2024-12-06 10 4 47321860+alexchapeaux@users... nostrumg@gmail.com
check_dataset.py
in src/datatrove/tools
58 4 2023-07-31 2025-04-29 10 4 nostrumg@gmail.com craffel@gmail.com
typeshelper.py
in src/datatrove/utils
4106 - 2023-06-19 2024-12-06 10 6 alessandro@lighton.ai nostrumg@gmail.com
merge_stats.py
in src/datatrove/tools
33 1 2023-07-31 2024-05-17 9 3 nostrumg@gmail.com nostrumg@gmail.com
context_shuffler.py
in src/datatrove/pipeline/tokens
46 3 2023-09-01 2024-07-08 9 3 nostrumg@gmail.com 134823555+sippycoder@users....
modular.py
in src/datatrove/pipeline/extractors
23 2 2023-06-20 2024-02-29 8 5 alessandro@lighton.ai thomwolf@users.noreply.gith...
ipc.py
in src/datatrove/pipeline/readers
64 4 2023-12-22 2024-06-19 8 5 mariosasko777@gmail.com nostrumg@gmail.com
tokenization.py
in src/datatrove/utils
58 6 2024-03-19 2025-04-29 7 3 nostrumg@gmail.com nostrumg@gmail.com
fasttext_filter.py
in src/datatrove/pipeline/filters
78 3 2024-02-21 2025-01-09 7 4 nostrumg@gmail.com sk8terbo2@gmail.com
inspect_data.py
in src/datatrove/tools
129 4 2023-11-30 2024-03-20 7 4 nostrumg@gmail.com nostrumg@gmail.com
n_grams.py
in src/datatrove/pipeline/decont
150 6 2024-05-04 2025-04-29 7 3 nostrumg@gmail.com nostrumg@gmail.com
text.py
in src/datatrove/utils
254 6 2024-03-22 2024-12-06 7 2 nostrumg@gmail.com nostrumg@gmail.com
__init__.py
in src/datatrove/pipeline/readers
6 - 2023-06-14 2024-01-19 6 3 nostrumg@gmail.com nostrumg@gmail.com
lambda_filter.py
in src/datatrove/pipeline/filters
11 2 2023-06-22 2024-02-06 6 4 47321860+alexchapeaux@users... nostrumg@gmail.com
data.py
in src/datatrove
19 - 2023-07-03 2024-03-01 6 4 nostrumg@gmail.com thomwolf@users.noreply.gith...
_import_utils.py
in src/datatrove/utils
54 12 2024-01-18 2025-04-30 6 3 mariosasko777@gmail.com nostrumg@gmail.com
huggingface.py
in src/datatrove/pipeline/writers
104 4 2024-02-23 2025-01-30 6 2 nostrumg@gmail.com 2779410+bramvanroy@users.no...
url_dedup.py
in src/datatrove/pipeline/dedup
301 15 2024-04-12 2025-04-29 6 3 kydlicek.hynek@gmail.com nostrumg@gmail.com
Files With Most Contributors (Top 50)
Based on the number of unique email addresses found in commits.
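A minimal sketch of this counting rule is shown below. The input shape (pairs of file path and author email) and the name `contributors_per_file` are assumptions for illustration, as is the choice to normalize case and surrounding whitespace before comparing addresses; the report does not say whether its tooling does so.

```python
from collections import defaultdict


def contributors_per_file(commits):
    """Return {path: number of unique author email addresses}.

    `commits` is an iterable of (path, author_email) pairs, a hypothetical
    stand-in for parsed commit history.
    """
    emails_per_file = defaultdict(set)
    for path, email in commits:
        # Normalizing case and whitespace is an assumption of this sketch.
        emails_per_file[path].add(email.strip().lower())
    return {path: len(emails) for path, emails in emails_per_file.items()}


# Illustrative data only (addresses are placeholders).
example_commits = [
    ("reader.py", "alice@example.com"),
    ("reader.py", "Alice@example.com"),
    ("reader.py", "bob@example.com"),
    ("writer.py", "bob@example.com"),
]
counts = contributors_per_file(example_commits)
print(counts)
```

Because the count is keyed on email addresses, one person who commits under two different addresses is counted as two contributors.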


File | # lines | # units | created | last modified | # changes (days) | # contributors | first contributor | latest contributor
152 - 2023-07-03 2025-04-30 24 14 nostrumg@gmail.com nostrumg@gmail.com
base.py
in src/datatrove/pipeline/readers
143 9 2023-06-21 2025-04-29 28 10 nostrumg@gmail.com vivien.cabannes@gmail.com
jsonl.py
in src/datatrove/pipeline/readers
57 2 2023-06-21 2024-06-19 26 10 nostrumg@gmail.com nostrumg@gmail.com
slurm.py
in src/datatrove/executor
235 11 2023-06-22 2025-04-29 39 9 nostrumg@gmail.com nostrumg@gmail.com
warc.py
in src/datatrove/pipeline/readers
94 3 2023-06-21 2024-12-11 24 9 nostrumg@gmail.com nostrumg@gmail.com
language_filter.py
in src/datatrove/pipeline/filters
43 2 2023-06-22 2025-01-09 18 9 47321860+alexchapeaux@users... sk8terbo2@gmail.com
disk_base.py
in src/datatrove/pipeline/writers
83 11 2023-06-21 2024-08-28 23 8 nostrumg@gmail.com justhungryman@gmail.com
jsonl.py
in src/datatrove/pipeline/writers
28 2 2023-06-21 2024-08-28 18 8 nostrumg@gmail.com justhungryman@gmail.com
minhash.py
in src/datatrove/pipeline/dedup
499 18 2023-07-26 2025-04-29 33 7 nostrumg@gmail.com nostrumg@gmail.com
base.py
in src/datatrove/executor
98 10 2023-06-14 2024-05-28 27 7 nostrumg@gmail.com jun.untitled@kakaobrain.com
sentence_dedup.py
in src/datatrove/pipeline/dedup
370 14 2023-07-10 2025-01-24 26 7 alessandro@lighton.ai muzzynine@gmail.com
exact_substrings.py
in src/datatrove/pipeline/dedup
229 17 2023-07-26 2025-04-29 23 7 47321860+alexchapeaux@users... nostrumg@gmail.com
stats.py
in src/datatrove/utils
297 41 2023-06-28 2024-06-14 17 7 alessandro@lighton.ai kydlicek.hynek@gmail.com
huggingface.py
in src/datatrove/pipeline/readers
95 4 2024-01-19 2025-04-29 11 7 nostrumg@gmail.com huseyinabanozis@gmail.com
tokenizer.py
in src/datatrove/pipeline/tokens
298 16 2023-06-26 2025-04-30 28 6 nostrumg@gmail.com nostrumg@gmail.com
local.py
in src/datatrove/executor
101 4 2023-06-14 2025-04-29 26 6 nostrumg@gmail.com nostrumg@gmail.com
base.py
in src/datatrove/pipeline/extractors
117 11 2023-06-20 2025-04-29 17 6 alessandro@lighton.ai kydlicek.hynek@gmail.com
gopher_quality_filter.py
in src/datatrove/pipeline/filters
76 2 2023-06-22 2025-01-30 16 6 47321860+alexchapeaux@users... nostrumg@gmail.com
base.py
in src/datatrove/pipeline
35 8 2023-06-15 2024-05-21 16 6 alessandro@lighton.ai nostrumg@gmail.com
url_filter.py
in src/datatrove/pipeline/filters
91 6 2023-07-24 2024-12-06 14 6 47321860+alexchapeaux@users... nostrumg@gmail.com
regex_filter.py
in src/datatrove/pipeline/filters
11 2 2023-06-22 2024-02-06 11 6 47321860+alexchapeaux@users... nostrumg@gmail.com
csv.py
in src/datatrove/pipeline/readers
50 2 2023-10-05 2024-06-19 11 6 nostrumg@gmail.com nostrumg@gmail.com
trafilatura.py
in src/datatrove/pipeline/extractors
28 2 2023-06-20 2025-01-02 11 6 alessandro@lighton.ai nostrumg@gmail.com
typeshelper.py
in src/datatrove/utils
4106 - 2023-06-19 2024-12-06 10 6 alessandro@lighton.ai nostrumg@gmail.com
io.py
in src/datatrove
188 27 2024-01-18 2025-04-30 22 5 nostrumg@gmail.com nostrumg@gmail.com
gopher_repetition_filter.py
in src/datatrove/pipeline/filters
90 6 2023-06-22 2024-12-06 16 5 47321860+alexchapeaux@users... nostrumg@gmail.com
base_filter.py
in src/datatrove/pipeline/filters
49 5 2023-06-22 2024-07-05 15 5 47321860+alexchapeaux@users... nostrumg@gmail.com
parquet.py
in src/datatrove/pipeline/readers
57 2 2023-07-19 2024-06-19 12 5 nostrumg@gmail.com nostrumg@gmail.com
counter.py
in src/datatrove/pipeline/tokens
36 3 2023-11-20 2025-01-09 12 5 nostrumg@gmail.com nostrumg@gmail.com
bloom_filter.py
in src/datatrove/pipeline/dedup
143 12 2023-10-05 2024-07-17 11 5 47321860+alexchapeaux@users... nostrumg@gmail.com
ipc.py
in src/datatrove/pipeline/readers
64 4 2023-12-22 2024-06-19 8 5 mariosasko777@gmail.com nostrumg@gmail.com
modular.py
in src/datatrove/pipeline/extractors
23 2 2023-06-20 2024-02-29 8 5 alessandro@lighton.ai thomwolf@users.noreply.gith...
__init__.py
in src/datatrove/pipeline/tokens
5 - 2023-07-03 2025-01-09 5 5 anton@huggingface.co 74564958+tj-solergibert@use...
merger.py
in src/datatrove/pipeline/tokens
153 5 2023-06-29 2025-04-29 20 4 nostrumg@gmail.com nostrumg@gmail.com
__init__.py
in src/datatrove/pipeline/filters
11 - 2023-06-14 2024-04-24 13 4 nostrumg@gmail.com nostrumg@gmail.com
unigram_log_probs.py
in src/datatrove/pipeline/filters
46 4 2023-07-24 2024-12-06 10 4 47321860+alexchapeaux@users... nostrumg@gmail.com
__init__.py
in src/datatrove/pipeline/dedup
12 - 2023-06-16 2024-04-12 10 4 alessandro@lighton.ai kydlicek.hynek@gmail.com
check_dataset.py
in src/datatrove/tools
58 4 2023-07-31 2025-04-29 10 4 nostrumg@gmail.com craffel@gmail.com
fasttext_filter.py
in src/datatrove/pipeline/filters
78 3 2024-02-21 2025-01-09 7 4 nostrumg@gmail.com sk8terbo2@gmail.com
inspect_data.py
in src/datatrove/tools
129 4 2023-11-30 2024-03-20 7 4 nostrumg@gmail.com nostrumg@gmail.com
lambda_filter.py
in src/datatrove/pipeline/filters
11 2 2023-06-22 2024-02-06 6 4 47321860+alexchapeaux@users... nostrumg@gmail.com
data.py
in src/datatrove
19 - 2023-07-03 2024-03-01 6 4 nostrumg@gmail.com thomwolf@users.noreply.gith...
__init__.py
in src/datatrove/pipeline/stats
11 - 2023-07-10 2024-06-14 4 4 alessandro@lighton.ai kydlicek.hynek@gmail.com
context_shuffler.py
in src/datatrove/pipeline/tokens
46 3 2023-09-01 2024-07-08 9 3 nostrumg@gmail.com 134823555+sippycoder@users....
merge_stats.py
in src/datatrove/tools
33 1 2023-07-31 2024-05-17 9 3 nostrumg@gmail.com nostrumg@gmail.com
n_grams.py
in src/datatrove/pipeline/decont
150 6 2024-05-04 2025-04-29 7 3 nostrumg@gmail.com nostrumg@gmail.com
tokenization.py
in src/datatrove/utils
58 6 2024-03-19 2025-04-29 7 3 nostrumg@gmail.com nostrumg@gmail.com
__init__.py
in src/datatrove/pipeline/readers
6 - 2023-06-14 2024-01-19 6 3 nostrumg@gmail.com nostrumg@gmail.com
url_dedup.py
in src/datatrove/pipeline/dedup
301 15 2024-04-12 2025-04-29 6 3 kydlicek.hynek@gmail.com nostrumg@gmail.com
_import_utils.py
in src/datatrove/utils
54 12 2024-01-18 2025-04-30 6 3 mariosasko777@gmail.com nostrumg@gmail.com
Files With Fewest Contributors (Top 50)
Based on the number of unique email addresses found in commits.


File | # lines | # units | created | last modified | # changes (days) | # contributors | first contributor | latest contributor
__init__.py
in src/datatrove/pipeline
1 -
__init__.py
in src/datatrove/utils
1 -
__init__.py
in src/datatrove
1 -
s3_union_find.rs
in src/datatrove/tools/fast_mh3/src
477 4 2024-12-06 2024-12-06 1 1 nostrumg@gmail.com nostrumg@gmail.com
local_union_find.rs
in src/datatrove/tools/fast_mh3/src
272 2 2024-12-06 2024-12-06 1 1 nostrumg@gmail.com nostrumg@gmail.com
japanese_tokenizer.py
in src/datatrove/utils
216 16 2024-12-06 2024-12-06 1 1 nostrumg@gmail.com nostrumg@gmail.com
pii.py
in src/datatrove/pipeline/formatters
68 5 2024-04-16 2024-05-02 2 1 nostrumg@gmail.com nostrumg@gmail.com
line_stats.py
in src/datatrove/pipeline/stats
57 5 2024-06-14 2024-09-02 2 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
word_stats.py
in src/datatrove/pipeline/stats
56 4 2024-06-14 2024-06-14 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
paragraph_stats.py
in src/datatrove/pipeline/stats
51 4 2024-06-14 2024-06-14 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
merger.py
in src/datatrove/pipeline/stats
51 3 2024-06-14 2024-06-14 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
sentence_stats.py
in src/datatrove/pipeline/stats
47 4 2024-06-14 2024-06-14 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
ftfy.py
in src/datatrove/pipeline/formatters
43 2 2024-03-22 2025-01-08 2 1 nostrumg@gmail.com nostrumg@gmail.com
logging.py
in src/datatrove/utils
37 7 2023-11-14 2024-05-17 5 1 nostrumg@gmail.com nostrumg@gmail.com
hashing.py
in src/datatrove/utils
37 8 2024-05-21 2024-05-21 1 1 nostrumg@gmail.com nostrumg@gmail.com
contamination_stats.py
in src/datatrove/pipeline/stats
33 2 2024-06-14 2024-06-14 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
doc_stats.py
in src/datatrove/pipeline/stats
30 2 2024-06-14 2024-06-14 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
Cargo.toml
in src/datatrove/tools/fast_mh3
26 - 2024-12-06 2024-12-06 1 1 nostrumg@gmail.com nostrumg@gmail.com
symbol_lines_remover.py
in src/datatrove/pipeline/formatters
25 2 2024-03-22 2024-12-06 3 1 nostrumg@gmail.com nostrumg@gmail.com
perplexity_stats.py
in src/datatrove/pipeline/stats
25 2 2024-06-14 2024-06-14 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
base.py
in src/datatrove/pipeline/formatters
17 3 2024-03-22 2024-03-22 1 1 nostrumg@gmail.com nostrumg@gmail.com
launch_pickled_pipeline.py
in src/datatrove/tools
13 1 2023-11-14 2024-02-21 3 1 nostrumg@gmail.com nostrumg@gmail.com
config.py
in src/datatrove/pipeline/stats
9 - 2024-06-14 2024-06-14 1 1 kydlicek.hynek@gmail.com kydlicek.hynek@gmail.com
batching.py
in src/datatrove/utils
7 1 2024-07-05 2024-07-05 1 1 nostrumg@gmail.com nostrumg@gmail.com
sha1.py
in src/datatrove/utils/hashes
6 2 2024-05-21 2024-05-21 1 1 nostrumg@gmail.com nostrumg@gmail.com
xxhash.py
in src/datatrove/utils/hashes
5 2 2024-05-21 2024-05-21 1 1 nostrumg@gmail.com nostrumg@gmail.com
__init__.py
in src/datatrove/pipeline/formatters
3 - 2024-03-22 2024-04-16 2 1 nostrumg@gmail.com nostrumg@gmail.com
__init__.py
in src/datatrove/pipeline/writers
3 - 2023-06-14 2024-02-23 4 1 nostrumg@gmail.com nostrumg@gmail.com
__init__.py
in src/datatrove/pipeline/decont
1 - 2024-05-04 2024-05-04 1 1 nostrumg@gmail.com nostrumg@gmail.com
__init__.py
in src/datatrove/tools
1 - 2023-07-31 2023-07-31 1 1 nostrumg@gmail.com nostrumg@gmail.com
text.py
in src/datatrove/utils
254 6 2024-03-22 2024-12-06 7 2 nostrumg@gmail.com nostrumg@gmail.com
dataset.py
in src/datatrove/utils
194 12 2024-04-24 2025-04-30 4 2 nostrumg@gmail.com nostrumg@gmail.com
ray.py
in src/datatrove/executor
160 5 2025-04-29 2025-04-29 1 2 nostrumg@gmail.com nelson-liu@users.noreply.gi...
perplexity.py
in src/datatrove/utils
138 12 2024-06-14 2025-04-29 2 2 kydlicek.hynek@gmail.com nostrumg@gmail.com
megatron_tokenizer.py
in src/datatrove/pipeline/tokens
107 9 2025-01-09 2025-01-09 1 2 nostrumg@gmail.com 74564958+tj-solergibert@use...
huggingface.py
in src/datatrove/pipeline/writers
104 4 2024-02-23 2025-01-30 6 2 nostrumg@gmail.com 2779410+bramvanroy@users.no...
base.py
in src/datatrove/pipeline/stats
91 4 2024-06-14 2024-07-15 2 2 kydlicek.hynek@gmail.com nostrumg@gmail.com
jobs_status.py
in src/datatrove/tools
71 1 2024-03-05 2025-04-29 4 2 nostrumg@gmail.com nostrumg@gmail.com
failed_logs.py
in src/datatrove/tools
55 1 2023-11-22 2025-04-29 5 2 nostrumg@gmail.com nostrumg@gmail.com
binaryio.py
in src/datatrove/utils
54 3 2024-03-22 2024-04-12 2 2 nostrumg@gmail.com kydlicek.hynek@gmail.com
lang_stats.py
in src/datatrove/pipeline/stats
26 2 2024-06-14 2024-06-19 2 2 kydlicek.hynek@gmail.com nostrumg@gmail.com
token_stats.py
in src/datatrove/pipeline/stats
25 2 2024-06-14 2025-01-09 2 2 kydlicek.hynek@gmail.com nostrumg@gmail.com
sampler_filter.py
in src/datatrove/pipeline/filters
17 2 2023-08-21 2024-02-06 4 2 nostrumg@gmail.com nostrumg@gmail.com
__init__.py
in src/datatrove/executor
3 - 2024-01-19 2025-04-29 2 2 nostrumg@gmail.com nelson-liu@users.noreply.gi...
word_tokenizers.py
in src/datatrove/utils
361 60 2024-05-21 2025-01-30 5 3 beme248@users.noreply.githu... nostrumg@gmail.com
url_dedup.py
in src/datatrove/pipeline/dedup
301 15 2024-04-12 2025-04-29 6 3 kydlicek.hynek@gmail.com nostrumg@gmail.com
c4_filters.py
in src/datatrove/pipeline/filters
197 8 2024-04-24 2024-12-06 4 3 nostrumg@gmail.com nostrumg@gmail.com
n_grams.py
in src/datatrove/pipeline/decont
150 6 2024-05-04 2025-04-29 7 3 nostrumg@gmail.com nostrumg@gmail.com
parquet.py
in src/datatrove/pipeline/writers
67 5 2024-02-22 2025-01-30 5 3 nostrumg@gmail.com 2779410+bramvanroy@users.no...
tokenization.py
in src/datatrove/utils
58 6 2024-03-19 2025-04-29 7 3 nostrumg@gmail.com nostrumg@gmail.com