huggingface / tokenizers
File Size

The distribution of file sizes, measured in lines of code.

File Size Overall
Lines per file | 1001+ | 501-1000 | 201-500 | 101-200 | 1-100
All files | 14% | 36% | 24% | 12% | 11%
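The bucketing behind these percentages can be sketched as follows. This is a minimal illustration, not the reporting tool's implementation: the bucket boundaries come from the legend, but the report does not say whether shares are weighted by lines of code or by file count, so the hypothetical `weight_by_lines` flag covers both readings. The names `bucket_for` and `size_distribution` are also assumptions made for this example.

```python
# Bucket labels and inclusive lower bounds, taken from the report's legend.
BUCKETS = [
    ("1001+", 1001),
    ("501-1000", 501),
    ("201-500", 201),
    ("101-200", 101),
    ("1-100", 1),
]


def bucket_for(line_count):
    """Return the legend label for a file with `line_count` lines."""
    for label, lower_bound in BUCKETS:
        if line_count >= lower_bound:
            return label
    return None  # files with zero lines fall outside every bucket


def size_distribution(line_counts, weight_by_lines=True):
    """Return the percentage share of each bucket.

    With weight_by_lines=True each file contributes its line count
    (assumed reading of the report); otherwise each file counts once.
    """
    totals = {label: 0 for label, _ in BUCKETS}
    for count in line_counts:
        label = bucket_for(count)
        if label is None:
            continue
        totals[label] += count if weight_by_lines else 1
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {label: 0.0 for label in totals}
    return {label: 100.0 * value / grand_total for label, value in totals.items()}


if __name__ == "__main__":
    # A few line counts copied from the "Longest Files" table, for illustration only.
    sample = [2002, 1185, 854, 331, 187, 40]
    for label, share in size_distribution(sample).items():
        print(f"{label}: {share:.0f}%")
```

Running the same function over the line counts of every file in a given extension or component would give the per-extension and per-component breakdowns, under the same weighting assumption.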


File Size per Extension
Extension | 1001+ | 501-1000 | 201-500 | 101-200 | 1-100
rs | 16% | 40% | 22% | 10% | 9%
py | 0% | 0% | 41% | 35% | 22%
js | 0% | 0% | 100% | 0% | 0%
pyi | 0% | 0% | 35% | 16% | 48%
css | 0% | 0% | 0% | 100% | 0%
toml | 0% | 0% | 0% | 0% | 100%
cfg | 0% | 0% | 0% | 0% | 100%
in | 0% | 0% | 0% | 0% | 100%
ts | 0% | 0% | 0% | 0% | 100%
File Size per Logical Decomposition
Decomposition: primary
Component | 1001+ | 501-1000 | 201-500 | 101-200 | 1-100
tokenizers | 19% | 36% | 18% | 13% | 12%
bindings | 8% | 35% | 33% | 11% | 11%
Longest Files (Top 50)
File | Folder | # lines | # units
normalizer.rs | tokenizers/src/tokenizer | 2002 | 33
mod.rs | tokenizers/src/tokenizer | 1185 | 31
tokenizer.rs | bindings/python/src | 1060 | 74
template.rs | tokenizers/src/processors | 854 | 35
trainers.rs | bindings/python/src | 836 | 59
model.rs | tokenizers/src/models/bpe | 834 | 26
added_vocabulary.rs | tokenizers/src/tokenizer | 799 | 32
pre_tokenizers.rs | bindings/python/src | 791 | 62
models.rs | bindings/python/src | 785 | 58
encoding.rs | tokenizers/src/tokenizer | 780 | 10
normalizers.rs | bindings/python/src | 760 | 58
trainer.rs | tokenizers/src/models/unigram | 685 | 21
processors.rs | bindings/python/src | 642 | 48
trainer.rs | tokenizers/src/models/bpe | 598 | 15
lattice.rs | tokenizers/src/models/unigram | 588 | 13
decoders.rs | bindings/python/src | 577 | 53
byte_level.rs | tokenizers/src/pre_tokenizers | 513 | 18
model.rs | tokenizers/src/models/unigram | 509 | 18
normalization.rs | bindings/python/src/utils | 477 | 54
tokenizer.rs | bindings/node/src | 463 | 6
serde_pyo3.rs | bindings/python/src/utils | 454 | 44
index.js | bindings/node | 376 | 1
convert.py | bindings/python/scripts | 342 | 32
metaspace.rs | tokenizers/src/pre_tokenizers | 331 | 10
encoding.rs | bindings/node/src | 310 | 5
roberta.rs | tokenizers/src/processors | 309 | 5
models.rs | bindings/node/src | 303 | 11
truncation.rs | tokenizers/src/utils | 279 | 12
word.rs | tokenizers/src/models/bpe | 277 | 6
bert.rs | tokenizers/src/processors | 268 | 5
mod.rs | tokenizers/src/pre_tokenizers | 258 | 2
pretokenization.rs | bindings/python/src/utils | 254 | 18
mod.rs | tokenizers/src/models/wordpiece | 250 | 11
encoding.rs | bindings/python/src | 244 | 27
pre_tokenizers.rs | bindings/node/src | 231 | 3
pre_tokenizer.rs | tokenizers/src/tokenizer | 228 | 5
__init__.pyi | bindings/python/py_src/tokenizers | 221 | 90
parallelism.rs | tokenizers/src/utils | 219 | 19
serialization.rs | tokenizers/src/models/bpe | 217 | 6
serialization.rs | tokenizers/src/tokenizer | 215 | 6
visualizer.py | bindings/python/py_src/tokenizers/tools | 209 | 13
spm_parity_check.py | bindings/python/scripts | 208 | 5
mod.rs | tokenizers/src/models/wordlevel | 204 | 12
pattern.rs | tokenizers/src/tokenizer | 187 | 13
byte_level.rs | tokenizers/src/normalizers | 164 | 3
decoders.rs | bindings/node/src | 161 | 1
normalizers.rs | bindings/node/src | 159 | 1
sequence.rs | tokenizers/src/processors | 156 | 6
trainer.rs | tokenizers/src/models/wordlevel | 151 | 6
trainer.rs | tokenizers/src/models/wordpiece | 147 | 4
Files With Most Units (Top 50)
File | Folder | # lines | # units
__init__.pyi | bindings/python/py_src/tokenizers | 221 | 90
tokenizer.rs | bindings/python/src | 1060 | 74
pre_tokenizers.rs | bindings/python/src | 791 | 62
trainers.rs | bindings/python/src | 836 | 59
models.rs | bindings/python/src | 785 | 58
normalizers.rs | bindings/python/src | 760 | 58
normalization.rs | bindings/python/src/utils | 477 | 54
decoders.rs | bindings/python/src | 577 | 53
processors.rs | bindings/python/src | 642 | 48
serde_pyo3.rs | bindings/python/src/utils | 454 | 44
__init__.pyi | bindings/python/py_src/tokenizers/normalizers | 101 | 43
__init__.pyi | bindings/python/py_src/tokenizers/pre_tokenizers | 90 | 38
template.rs | tokenizers/src/processors | 854 | 35
base_tokenizer.py | bindings/python/py_src/tokenizers/implementations | 131 | 35
__init__.pyi | bindings/python/py_src/tokenizers/models | 93 | 35
normalizer.rs | tokenizers/src/tokenizer | 2002 | 33
added_vocabulary.rs | tokenizers/src/tokenizer | 799 | 32
convert.py | bindings/python/scripts | 342 | 32
mod.rs | tokenizers/src/tokenizer | 1185 | 31
encoding.rs | bindings/python/src | 244 | 27
model.rs | tokenizers/src/models/bpe | 834 | 26
__init__.pyi | bindings/python/py_src/tokenizers/decoders | 56 | 22
trainer.rs | tokenizers/src/models/unigram | 685 | 21
parallelism.rs | tokenizers/src/utils | 219 | 19
byte_level.rs | tokenizers/src/pre_tokenizers | 513 | 18
model.rs | tokenizers/src/models/unigram | 509 | 18
pretokenization.rs | bindings/python/src/utils | 254 | 18
__init__.pyi | bindings/python/py_src/tokenizers/processors | 40 | 17
trainer.rs | tokenizers/src/models/bpe | 598 | 15
lattice.rs | tokenizers/src/models/unigram | 588 | 13
pattern.rs | tokenizers/src/tokenizer | 187 | 13
visualizer.py | bindings/python/py_src/tokenizers/tools | 209 | 13
truncation.rs | tokenizers/src/utils | 279 | 12
mod.rs | tokenizers/src/models/wordlevel | 204 | 12
replace.rs | tokenizers/src/normalizers | 131 | 11
mod.rs | tokenizers/src/models/wordpiece | 250 | 11
models.rs | bindings/node/src | 303 | 11
metaspace.rs | tokenizers/src/pre_tokenizers | 331 | 10
encoding.rs | tokenizers/src/tokenizer | 780 | 10
bert.rs | tokenizers/src/normalizers | 102 | 9
stub.py | bindings/python | 141 | 9
tokenizer.rs | bindings/node/src/tasks | 111 | 8
unicode.rs | tokenizers/src/normalizers | 91 | 7
punctuation.rs | tokenizers/src/pre_tokenizers | 72 | 7
serialization.rs | tokenizers/src/models/unigram | 98 | 7
serialization.rs | tokenizers/src/models/wordlevel | 114 | 7
token.rs | bindings/python/src | 40 | 7
strip.rs | tokenizers/src/normalizers | 133 | 6
iter.rs | tokenizers/src/utils | 71 | 6
pre_tokenizer.rs | tokenizers/src/pre_tokenizers/unicode_scripts | 132 | 6
Files With Long Lines (Top 18)

There are 18 files with lines longer than 120 characters. In total, there are 54 long lines.

File | Folder | # lines | # units | # long lines
pre_tokenizers.rs | bindings/python/src | 791 | 62 | 11
processors.rs | bindings/python/src | 642 | 48 | 9
serialization.rs | tokenizers/src/models/bpe | 217 | 6 | 5
normalizers.rs | bindings/python/src | 760 | 58 | 5
strip.rs | tokenizers/src/normalizers | 133 | 6 | 2
from_pretrained.rs | tokenizers/src/utils | 58 | 1 | 2
mod.rs | tokenizers/src/pre_tokenizers | 258 | 2 | 2
metaspace.rs | tokenizers/src/pre_tokenizers | 331 | 10 | 2
mod.rs | tokenizers/src/models | 59 | 2 | 2
mod.rs | tokenizers/src/tokenizer | 1185 | 31 | 2
trainers.rs | bindings/python/src | 836 | 59 | 2
decoders.rs | bindings/python/src | 577 | 53 | 2
models.rs | bindings/python/src | 785 | 58 | 2
tokenizer.rs | bindings/python/src | 1060 | 74 | 2
encoding.rs | tokenizers/src/tokenizer | 780 | 10 | 1
template.rs | tokenizers/src/processors | 854 | 35 | 1
sentencepiece_unigram.py | bindings/python/py_src/tokenizers/implementations | 120 | 4 | 1
convert.py | bindings/python/scripts | 342 | 32 | 1
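
The long-line count in this section can be approximated with a short script. This is a sketch under stated assumptions: the 120-character threshold comes from the report, but how the reporting tool treats tabs, trailing whitespace, or multi-byte characters is not stated, so this version simply counts Python string characters per line. The function names `count_long_lines` and `long_line_report` are hypothetical.

```python
MAX_LINE_LENGTH = 120  # threshold stated in the report


def count_long_lines(text, limit=MAX_LINE_LENGTH):
    """Count the lines in `text` that are longer than `limit` characters."""
    return sum(1 for line in text.splitlines() if len(line) > limit)


def long_line_report(files, limit=MAX_LINE_LENGTH):
    """Return (file name, long-line count) pairs, highest count first.

    `files` maps a file name to its contents; files without any long
    line are left out, mirroring the table above.
    """
    counts = {name: count_long_lines(text, limit) for name, text in files.items()}
    flagged = [(name, n) for name, n in counts.items() if n > 0]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Tiny in-memory example; real use would read the files from disk.
    demo = {
        "wide.rs": "x" * 121 + "\n" + "y" * 120,
        "narrow.rs": "fn main() {}",
    }
    for name, n in long_line_report(demo):
        print(name, n)
```

A line of exactly 120 characters is not counted, since the report speaks of lines longer than 120 characters.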