facebookresearch / cc_net
File Size

The distribution of file sizes (measured in lines of code).

Intro
  • File size measurements show how the lines of code are distributed across files of different sizes.
  • Files are classified in five categories based on their size (lines of code): 1-100 (very small files), 101-200 (small files), 201-500 (medium size files), 501-1000 (long files), 1001+ (very long files).
  • It is good practice to keep files small. Long files may become "bloaters": code that has grown to such gargantuan proportions that it is hard to work with.
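The size categories above can be expressed as a small lookup. This is a minimal sketch, not the report tool's own code: the function name `size_category` and the constants are hypothetical, and only the bucket boundaries come from the list above.

```python
from bisect import bisect_left

# Inclusive upper bounds of the first four categories (from the list above).
# Anything larger than the last bound falls into "very long".
UPPER_BOUNDS = [100, 200, 500, 1000]
CATEGORIES = ["very small", "small", "medium size", "long", "very long"]


def size_category(lines_of_code: int) -> str:
    """Return the size category for a file with the given lines of code."""
    if lines_of_code < 1:
        raise ValueError("a file must have at least one line of code")
    # bisect_left finds the first bound that is >= lines_of_code,
    # so 100 maps to index 0 and 101 maps to index 1.
    return CATEGORIES[bisect_left(UPPER_BOUNDS, lines_of_code)]
```

For example, `size_category(948)` gives `"long"` and `size_category(100)` gives `"very small"`, which matches the boundaries listed above.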
File Size Overall
  • There are 21 files with 3,781 lines of code.
    • 0 very long files (0 lines of code)
    • 1 long file (948 lines of code)
    • 5 medium size files (1,568 lines of code)
    • 6 small files (929 lines of code)
    • 9 very small files (336 lines of code)
Share of lines of code per size category:
1001+: 0% | 501-1000: 25% | 201-500: 41% | 101-200: 24% | 1-100: 8%
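The overall figures can be reproduced from the per-file line counts listed in this report. The sketch below is an illustration only: the helper names are hypothetical, and the truncation to whole percents is an assumption inferred from the reported values (929 of 3,781 lines is about 24.6%, shown as 24%).

```python
# Per-file line counts copied from the "Longest Files" table of this report.
LINE_COUNTS = [948, 464, 360, 284, 230, 230, 197, 172, 160, 150, 133,
               117, 81, 74, 55, 55, 43, 20, 6, 1, 1]

# (label, inclusive upper bound); None means "no upper bound".
BUCKETS = [("1-100", 100), ("101-200", 200), ("201-500", 500),
           ("501-1000", 1000), ("1001+", None)]


def bucket_label(lines):
    """Return the label of the first bucket whose upper bound fits."""
    for label, upper in BUCKETS:
        if upper is None or lines <= upper:
            return label


def size_distribution(line_counts):
    """Map each bucket label to (lines of code, share in whole percent)."""
    total = sum(line_counts)
    lines_per_bucket = {label: 0 for label, _ in BUCKETS}
    for n in line_counts:
        lines_per_bucket[bucket_label(n)] += n
    # Assumption: shares are truncated, not rounded, to whole percents.
    return {label: (lines, 100 * lines // total)
            for label, lines in lines_per_bucket.items()}


distribution = size_distribution(LINE_COUNTS)
```

With the counts above, `distribution["201-500"]` is `(1568, 41)` and `distribution["1-100"]` is `(336, 8)`, in line with the totals reported in this section.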


File Size per Extension
Extension | 1001+ | 501-1000 | 201-500 | 101-200 | 1-100
py | 0% | 25% | 41% | 24% | 8%
toml | 0% | 0% | 0% | 0% | 100%
File Size per Logical Decomposition
primary
Component | 1001+ | 501-1000 | 201-500 | 101-200 | 1-100
cc_net | 0% | 28% | 40% | 24% | 6%
cc_net/tools | 0% | 0% | 54% | 31% | 13%
ROOT | 0% | 0% | 0% | 0% | 100%
Longest Files (Top 21)
File | Location | # lines | # units
jsonql.py | cc_net | 948 | 97
mine.py | cc_net | 464 | 20
dedup.py | cc_net | 360 | 24
perplexity.py | cc_net | 284 | 26
expand_corpus.py | cc_net/tools | 230 | 15
minify.py | cc_net | 230 | 22
process_wet_file.py | cc_net | 197 | 18
execution.py | cc_net | 172 | 9
flat_hash_set.py | cc_net | 160 | 28
text_normalizer.py | cc_net | 150 | 8
dl_cc_100.py | cc_net/tools | 133 | 6
split_by_lang.py | cc_net | 117 | 10
regroup.py | cc_net | 81 | 5
get_wiki_cirrus.py | cc_net | 74 | 6
make_dmoz_corpus.py | cc_net/tools | 55 | 4
tokenizer.py | cc_net | 55 | 6
setup.py | root | 43 | -
pyproject.toml | root | 20 | -
__main__.py | cc_net | 6 | 1
__init__.py | cc_net/tools | 1 | -
__init__.py | cc_net | 1 | -
Files With Most Units (Top 17)
File | Location | # lines | # units
jsonql.py | cc_net | 948 | 97
flat_hash_set.py | cc_net | 160 | 28
perplexity.py | cc_net | 284 | 26
dedup.py | cc_net | 360 | 24
minify.py | cc_net | 230 | 22
mine.py | cc_net | 464 | 20
process_wet_file.py | cc_net | 197 | 18
expand_corpus.py | cc_net/tools | 230 | 15
split_by_lang.py | cc_net | 117 | 10
execution.py | cc_net | 172 | 9
text_normalizer.py | cc_net | 150 | 8
get_wiki_cirrus.py | cc_net | 74 | 6
dl_cc_100.py | cc_net/tools | 133 | 6
tokenizer.py | cc_net | 55 | 6
regroup.py | cc_net | 81 | 5
make_dmoz_corpus.py | cc_net/tools | 55 | 4
__main__.py | cc_net | 6 | 1
Files With Long Lines (Top 2)

There are 2 files with lines longer than 120 characters. In total, there are 2 long lines.

File | Location | # lines | # units | # long lines
dedup.py | cc_net | 360 | 24 | 1
process_wet_file.py | cc_net | 197 | 18 | 1
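
A check for lines longer than 120 characters can be sketched as follows. This is a hypothetical helper, not the tool that produced this report: the function names and the choice to scan only `*.py` files are assumptions, and the length is measured in characters without the trailing newline.

```python
from pathlib import Path

# Threshold used in this report: lines longer than 120 characters.
MAX_LINE_LENGTH = 120


def count_long_lines(text, limit=MAX_LINE_LENGTH):
    """Count lines in `text` that are strictly longer than `limit` characters."""
    return sum(1 for line in text.splitlines() if len(line) > limit)


def long_line_report(root):
    """Map each Python file under `root` to its number of long lines.

    Files without long lines are left out of the result.
    """
    report = {}
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="replace")
        n = count_long_lines(text)
        if n:
            report[str(path)] = n
    return report
```

Calling `long_line_report("cc_net")` from a checkout of the repository would list the files with at least one long line. The exact result depends on the revision and on how the tool counts characters, so it may differ from the table above.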