huggingface / cosmopedia
File Size

The distribution of file sizes, measured in lines of code. Each percentage is the share of all lines of code that sits in files of the given size range.

File Size Overall
1001+ | 501-1000 | 201-500 | 101-200 | 1-100
0% | 0% | 48% | 28% | 23%
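The bucket percentages can be reproduced from a plain list of per-file line counts. The sketch below is a minimal illustration, not the report tool's own code: it assumes that each percentage is weighted by lines of code and truncated to a whole number, which is consistent with the figures listed in this report. The names `BUCKETS`, `bucket_for`, and `size_distribution` are hypothetical.

```python
# Size ranges used in this report, largest first: (label, lower bound, upper bound).
BUCKETS = [
    ("1001+", 1001, None),
    ("501-1000", 501, 1000),
    ("201-500", 201, 500),
    ("101-200", 101, 200),
    ("1-100", 1, 100),
]


def bucket_for(line_count):
    """Return the label of the size range that contains line_count, or None."""
    for label, low, high in BUCKETS:
        if line_count >= low and (high is None or line_count <= high):
            return label
    return None  # empty files fall outside every range


def size_distribution(line_counts):
    """Return {range label: whole-number percent of all lines}, truncated.

    Assumption: percentages are weighted by lines of code, not by file count.
    """
    totals = {label: 0 for label, _, _ in BUCKETS}
    for n in line_counts:
        label = bucket_for(n)
        if label is not None:
            totals[label] += n
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {label: 0 for label in totals}
    return {label: (100 * lines) // grand_total for label, lines in totals.items()}


# The two classification files in this report have 128 and 64 lines.
print(size_distribution([128, 64]))
```

For the two classification files this yields 66% in the 101-200 range and 33% in the 1-100 range, matching the classification row of the per-component breakdown.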


File Size per Extension
Extension | 1001+ | 501-1000 | 201-500 | 101-200 | 1-100
ipynb | 0% | 0% | 100% | 0% | 0%
py | 0% | 0% | 28% | 39% | 31%
File Size per Logical Decomposition
primary
Component | 1001+ | 501-1000 | 201-500 | 101-200 | 1-100
prompts | 0% | 0% | 71% | 11% | 17%
fulltext_search | 0% | 0% | 0% | 100% | 0%
generation | 0% | 0% | 0% | 88% | 11%
classification | 0% | 0% | 0% | 66% | 33%
decontamination | 0% | 0% | 0% | 0% | 100%
deduplication | 0% | 0% | 0% | 0% | 100%
Longest Files (Top 20)
File | Folder | # lines | # units
1_scraper.ipynb | prompts/stanford | 399 | -
khan_dl.py | prompts/khanacademy/khan_dl | 321 | 21
(name missing) | (folder missing) | 287 | -
filter_and_classify_clusters.py | prompts/web_samples | 211 | 3
llm_swarm_script.py | generation | 195 | -
build_openstax_prompts.py | prompts/openstax | 190 | 4
train_edu_bert.py | classification | 128 | 2
index_docs.py | fulltext_search | 117 | 2
search_sharded.py | fulltext_search | 102 | 3
decontaminate.py | decontamination | 99 | 6
deduplicate_dataset.py | deduplication | 93 | -
main.py | prompts/khanacademy/khan_dl | 66 | 2
run_edu_bert.py | classification | 64 | 1
build_web_prompts.py | prompts/web_samples | 48 | 2
(name missing) | (folder missing) | 47 | 2
(name missing) | (folder missing) | 42 | 2
build_science_prompts.py | prompts/auto_math_text | 38 | 2
generate_textbooks.py | prompts/khanacademy | 33 | -
filter_openhermes.py | prompts/stories | 31 | 2
(name missing) | (folder missing) | 25 | 1
Files With Most Units (Top 15)
File | Folder | # lines | # units
khan_dl.py | prompts/khanacademy/khan_dl | 321 | 21
decontaminate.py | decontamination | 99 | 6
build_openstax_prompts.py | prompts/openstax | 190 | 4
filter_and_classify_clusters.py | prompts/web_samples | 211 | 3
search_sharded.py | fulltext_search | 102 | 3
main.py | prompts/khanacademy/khan_dl | 66 | 2
build_science_prompts.py | prompts/auto_math_text | 38 | 2
(name missing) | (folder missing) | 42 | 2
filter_openhermes.py | prompts/stories | 31 | 2
(name missing) | (folder missing) | 47 | 2
build_web_prompts.py | prompts/web_samples | 48 | 2
train_edu_bert.py | classification | 128 | 2
index_docs.py | fulltext_search | 117 | 2
(name missing) | (folder missing) | 25 | 1
run_edu_bert.py | classification | 64 | 1
Files With Long Lines (Top 7)

There are 7 files with lines longer than 120 characters. In total, there are 31 long lines.

File | Folder | # lines | # units | # long lines
(name missing) | (folder missing) | 287 | - | 13
build_openstax_prompts.py | prompts/openstax | 190 | 4 | 10
1_scraper.ipynb | prompts/stanford | 399 | - | 3
generate_textbooks.py | prompts/khanacademy | 33 | - | 2
llm_swarm_script.py | generation | 195 | - | 1
filter_and_classify_clusters.py | prompts/web_samples | 211 | 3 | 1
index_docs.py | fulltext_search | 117 | 2 | 1
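
The long-line counts above can be approximated with a short script. This is a minimal sketch rather than the report tool's implementation: it assumes a line is "long" when it has more than 120 characters, counting indentation and excluding the line terminator. The function names `count_long_lines` and `long_line_report` are hypothetical.

```python
def count_long_lines(text, limit=120):
    """Count the lines in text that are longer than limit characters."""
    return sum(1 for line in text.splitlines() if len(line) > limit)


def long_line_report(sources, limit=120):
    """Map file name -> long-line count, keeping only files with at least one.

    sources is a dict of {file name: file contents}. The result is ordered
    by count, highest first, like the table in this report.
    """
    counts = {name: count_long_lines(text, limit) for name, text in sources.items()}
    flagged = {name: n for name, n in counts.items() if n > 0}
    return dict(sorted(flagged.items(), key=lambda item: item[1], reverse=True))


# Made-up file contents, for illustration only.
example_sources = {
    "a.py": "x" * 121 + "\n" + "y" * 120,   # one line over the limit
    "b.py": "short line\n",                 # no long lines
    "c.py": ("z" * 130 + "\n") * 2,         # two lines over the limit
}
report = long_line_report(example_sources)
print(len(report), "files,", sum(report.values()), "long lines in total")
```

To run it on a checkout, one could build `sources` by reading each file under the repository root, for example with `pathlib.Path.rglob("*.py")` and `Path.read_text()`.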