huggingface / OBELICS
File Size

The distribution of file sizes, measured in lines of code.

File Size Overall
Size (lines) | 1001+ | 501-1000 | 201-500 | 101-200 | 1-100
Share | 12% | 8% | 23% | 29% | 25%
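
The size buckets used throughout this report can be reproduced with a short script. The following is a minimal sketch, assuming each percentage is a bucket's share of total lines of code rather than of file count (this reading appears consistent with the single 1002-line file accounting for 21% of the obelics component). The names `bucket_for` and `size_distribution` are hypothetical helpers, not part of the tool that generated the report.

```python
# Size buckets as (label, lower bound), ordered from largest to smallest.
BUCKETS = [
    ("1001+", 1001),
    ("501-1000", 501),
    ("201-500", 201),
    ("101-200", 101),
    ("1-100", 1),
]


def bucket_for(line_count):
    """Return the bucket label for a file with the given number of lines."""
    for label, lower_bound in BUCKETS:
        if line_count >= lower_bound:
            return label
    return None  # empty files fall outside every bucket


def size_distribution(line_counts):
    """Return each bucket's share of total lines, as a percentage.

    Assumption: a file contributes its line count (not 1) to its bucket.
    """
    totals = {label: 0 for label, _ in BUCKETS}
    for count in line_counts:
        label = bucket_for(count)
        if label is not None:
            totals[label] += count
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {label: 0.0 for label in totals}
    return {label: 100 * lines / grand_total for label, lines in totals.items()}
```

For example, `size_distribution([50, 150, 300, 500])` puts 800 of the 1000 lines in the 201-500 bucket, so that bucket gets 80.0, while 101-200 gets 15.0 and 1-100 gets 5.0.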


File Size per Extension
Extension | 1001+ | 501-1000 | 201-500 | 101-200 | 1-100
py | 13% | 9% | 24% | 28% | 24%
html | 0% | 0% | 0% | 100% | 0%
yaml | 0% | 0% | 0% | 0% | 100%
File Size per Logical Decomposition
primary
Component | 1001+ | 501-1000 | 201-500 | 101-200 | 1-100
obelics | 21% | 14% | 28% | 25% | 11%
build_obelics | 0% | 0% | 16% | 36% | 46%
Longest Files (Top 50)
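File | # lines | # units
web_document_filtering.py (in obelics/processors) | 1002 | 38
(name missing) | 675 | 8
filtering_utils.py (in obelics/utils) | 427 | -
web_document_extractor.py (in obelics/processors) | 371 | 20
global_visualization.py (in obelics/visualization) | 321 | 7
13_final_processing.py (in build_obelics) | 276 | 11
05_filtering_web_docs.py (in build_obelics) | 225 | 1
(name missing) | 205 | 3
filter_web_documents.py (in obelics/callers) | 196 | 1
dom_tree_simplificator.py (in obelics/processors) | 195 | 16
extract_web_documents.py (in obelics/callers) | 176 | 1
simplification_utils.py (in obelics/utils) | 176 | 6
(name missing) | 175 | 8
DOM_tree_viz.html (in obelics/visualization/assets) | 172 | -
(name missing) | 163 | 1
pre_extraction_simplificator.py (in obelics/processors) | 152 | 11
(name missing) | 144 | 1
(name missing) | 140 | 2
(name missing) | 140 | 1
(name missing) | 123 | 4
(name missing) | 121 | 6
(name missing) | 113 | -
10_final_cleaning.py (in build_obelics) | 113 | 2
(name missing) | 94 | 3
09_06_line_dedup.py (in build_obelics) | 91 | 3
(name missing) | 90 | 1
(name missing) | 90 | 1
01_download_warc.py (in build_obelics) | 89 | 1
(name missing) | 86 | -
08_02_urldedup.py (in build_obelics) | 82 | 3
tags_attributes.py (in obelics/utils) | 82 | -
(name missing) | 79 | 1
(name missing) | 77 | 3
(name missing) | 75 | -
(name missing) | 74 | 3
(name missing) | 68 | 1
(name missing) | 65 | 1
(name missing) | 64 | 1
(name missing) | 64 | -
(name missing) | 64 | -
download_warc.py (in obelics/callers) | 63 | 1
(name missing) | 62 | 1
web_document_visualization.py (in obelics/visualization) | 59 | 6
extract_html.py (in obelics/callers) | 53 | 1
html_extractor.py (in obelics/processors) | 52 | 2
(name missing) | 50 | -
(name missing) | 49 | -
(name missing) | 48 | -
warc_downloader.py (in obelics/processors) | 36 | 4
(name missing) | 34 | -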
Files With Most Units (Top 39)
File | # lines | # units
web_document_filtering.py (in obelics/processors) | 1002 | 38
web_document_extractor.py (in obelics/processors) | 371 | 20
dom_tree_simplificator.py (in obelics/processors) | 195 | 16
13_final_processing.py (in build_obelics) | 276 | 11
pre_extraction_simplificator.py (in obelics/processors) | 152 | 11
(name missing) | 175 | 8
(name missing) | 675 | 8
global_visualization.py (in obelics/visualization) | 321 | 7
simplification_utils.py (in obelics/utils) | 176 | 6
(name missing) | 121 | 6
web_document_visualization.py (in obelics/visualization) | 59 | 6
(name missing) | 123 | 4
warc_downloader.py (in obelics/processors) | 36 | 4
09_06_line_dedup.py (in build_obelics) | 91 | 3
08_02_urldedup.py (in build_obelics) | 82 | 3
(name missing) | 94 | 3
(name missing) | 77 | 3
(name missing) | 74 | 3
(name missing) | 205 | 3
10_final_cleaning.py (in build_obelics) | 113 | 2
(name missing) | 140 | 2
html_extractor.py (in obelics/processors) | 52 | 2
(name missing) | 64 | 1
(name missing) | 144 | 1
01_download_warc.py (in build_obelics) | 89 | 1
(name missing) | 90 | 1
05_filtering_web_docs.py (in build_obelics) | 225 | 1
(name missing) | 163 | 1
(name missing) | 90 | 1
(name missing) | 140 | 1
(name missing) | 62 | 1
(name missing) | 79 | 1
(name missing) | 68 | 1
extract_web_documents.py (in obelics/callers) | 176 | 1
extract_html.py (in obelics/callers) | 53 | 1
filter_web_documents.py (in obelics/callers) | 196 | 1
download_warc.py (in obelics/callers) | 63 | 1
(name missing) | 65 | 1
utils.py (in obelics/utils) | 4 | 1
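
The report does not define what a "unit" is. As a rough sketch, assuming a unit corresponds to a function or method definition, the per-file unit counts for Python files could be approximated with the standard `ast` module. The helper name `count_units` is hypothetical, and the counts it produces may differ from those of the tool that generated this report.

```python
import ast


def count_units(source):
    """Count function and method definitions in Python source text.

    Assumption: one "unit" equals one `def` or `async def`, including
    nested functions and methods inside classes.
    """
    tree = ast.parse(source)
    return sum(
        isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        for node in ast.walk(tree)
    )
```

A file containing only constants or data, such as a table of tags, would yield 0 under this definition, which may be why some rows show "-" in the units column.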
Files With Long Lines (Top 13)

There are 13 files with lines longer than 120 characters. In total, there are 23 long lines.

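File | # lines | # units | # long lines
13_final_processing.py (in build_obelics) | 276 | 11 | 4
(name missing) | 675 | 8 | 3
DOM_tree_viz.html (in obelics/visualization/assets) | 172 | - | 3
(name missing) | 86 | - | 2
(name missing) | 94 | 3 | 2
(name missing) | 74 | 3 | 2
(name missing) | 64 | - | 1
10_final_cleaning.py (in build_obelics) | 113 | 2 | 1
09_06_line_dedup.py (in build_obelics) | 91 | 3 | 1
(name missing) | 90 | 1 | 1
(name missing) | 50 | - | 1
(name missing) | 175 | 8 | 1
(name missing) | 65 | 1 | 1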
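
The long-line counts in this section can be approximated with a small scan over the repository. This is a minimal sketch, assuming a line is "long" when it exceeds 120 characters excluding the newline. The function names, the default file patterns, and the `root` argument are illustrative assumptions, not the report tool's actual configuration.

```python
from pathlib import Path

MAX_LINE_LENGTH = 120  # threshold stated in the report


def count_long_lines(text, limit=MAX_LINE_LENGTH):
    """Return how many lines in `text` are longer than `limit` characters."""
    return sum(1 for line in text.splitlines() if len(line) > limit)


def files_with_long_lines(root, patterns=("*.py", "*.html", "*.yaml"), limit=MAX_LINE_LENGTH):
    """Map each file under `root` with at least one long line to its count.

    Results are ordered from most to fewest long lines.
    """
    results = {}
    for pattern in patterns:
        for path in Path(root).rglob(pattern):
            text = path.read_text(encoding="utf-8", errors="replace")
            long_lines = count_long_lines(text, limit)
            if long_lines:
                results[str(path)] = long_lines
    return dict(sorted(results.items(), key=lambda item: item[1], reverse=True))
```

Calling `files_with_long_lines(".")` from a checkout of the repository would return a mapping comparable to the table above; summing its values gives the total number of long lines.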