huggingface / datasets
File Size

The distribution of file sizes, measured in lines of code.

File Size Overall
Size range (lines) | 1001+ | 501-1000 | 201-500 | 101-200 | 1-100
Share | 31% | 13% | 26% | 13% | 15%
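
The bucketing behind the distribution above can be sketched as follows. This is a minimal illustration, not the reporting tool's implementation: the function names `bucket_for` and `size_distribution` are made up for this example, and it assumes each bucket's share is weighted by lines of code rather than by file count, which the report does not state explicitly. The sample line counts are taken from files listed in this report.

```python
# Bucket boundaries from the report's legend: (label, lower bound, upper bound).
BUCKETS = [
    ("1001+", 1001, None),
    ("501-1000", 501, 1000),
    ("201-500", 201, 500),
    ("101-200", 101, 200),
    ("1-100", 1, 100),
]


def bucket_for(line_count):
    """Return the label of the size bucket a file falls into, or None."""
    for label, low, high in BUCKETS:
        if line_count >= low and (high is None or line_count <= high):
            return label
    return None  # files with zero lines fall outside every bucket


def size_distribution(line_counts):
    """Return each bucket's share of the total lines, in percent.

    Assumption: shares are weighted by lines of code, not by number of files.
    """
    totals = {label: 0 for label, _, _ in BUCKETS}
    for count in line_counts:
        label = bucket_for(count)
        if label is not None:
            totals[label] += count
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {label: 0.0 for label in totals}
    return {label: 100.0 * t / grand_total for label, t in totals.items()}


# Line counts of a few files that appear in this report.
sample = [2750, 942, 470, 116, 52]
dist = size_distribution(sample)
for label, share in dist.items():
    print(f"{label}: {share:.0f}%")
```

Weighting by lines makes a handful of very large files dominate the top bucket, which is consistent with four files over 1000 lines accounting for roughly a third of the total here.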


File Size per Extension
Extension | 1001+ | 501-1000 | 201-500 | 101-200 | 1-100
py | 31% | 13% | 26% | 12% | 15%
yaml | 0% | 0% | 0% | 100% | 0%
toml | 0% | 0% | 0% | 0% | 100%
File Size per Logical Decomposition
primary
Component | 1001+ | 501-1000 | 201-500 | 101-200 | 1-100
src | 32% | 13% | 27% | 12% | 14%
ROOT | 0% | 0% | 0% | 85% | 14%
benchmarks | 0% | 0% | 0% | 26% | 73%
utils | 0% | 0% | 0% | 0% | 100%
Longest Files (Top 50)
File | # lines | # units
iterable_dataset.py (in src/datasets) | 2750 | 234
features.py (in src/datasets/features) | 1354 | 96
builder.py (in src/datasets) | 1185 | 50
dataset_dict.py (in src/datasets) | 1075 | 58
load.py (in src/datasets) | 952 | 26
table.py (in src/datasets) | 942 | 131
file_utils.py (in src/datasets/utils) | 807 | 69
data_files.py (in src/datasets) | 470 | 27
arrow_writer.py (in src/datasets) | 469 | 22
formatting.py (in src/datasets/formatting) | 464 | 74
py_utils.py (in src/datasets/utils) | 402 | 26
search.py (in src/datasets) | 393 | 33
tf_utils.py (in src/datasets/utils) | 390 | 21
folder_based_builder.py (in src/datasets/packaged_modules/folder_based_builder) | 347 | 8
_dill.py (in src/datasets/utils) | 334 | 17
arrow_reader.py (in src/datasets) | 309 | 26
spark.py (in src/datasets/packaged_modules/spark) | 289 | 20
webdataset.py (in src/datasets/packaged_modules/webdataset) | 263 | 12
splits.py (in src/datasets) | 260 | 42
fingerprint.py (in src/datasets) | 258 | 22
info.py (in src/datasets) | 254 | 17
extract.py (in src/datasets/utils) | 253 | 25
image.py (in src/datasets/features) | 250 | 11
video.py (in src/datasets/features) | 186 | 9
metadata.py (in src/datasets/utils) | 180 | 8
config.py (in src/datasets) | 176 | -
download_manager.py (in src/datasets/download) | 172 | 14
_tenbin.py (in src/datasets/packaged_modules/webdataset) | 167 | 21
csv.py (in src/datasets/packaged_modules/csv) | 164 | 6
audio.py (in src/datasets/features) | 163 | 7
inspect.py (in src/datasets) | 149 | 5
json.py (in src/datasets/io) | 148 | 6
pdf.py (in src/datasets/features) | 141 | 8
json.py (in src/datasets/packaged_modules/json) | 141 | 8
setup.py (in root) | 131 | -
csv.py (in src/datasets/io) | 122 | 6
readme_structure.yaml (in src/datasets/utils/resources) | 116 | -
jax_formatter.py (in src/datasets/formatting) | 116 | 9
(file name missing) | 108 | 8
streaming_download_manager.py (in src/datasets/download) | 106 | 11
parquet.py (in src/datasets/io) | 103 | 5
sql.py (in src/datasets/io) | 101 | 6
hub.py (in src/datasets) | 100 | 2
sql.py (in src/datasets/packaged_modules/sql) | 92 | 7
combine.py (in src/datasets) | 90 | 2
polars_formatter.py (in src/datasets/formatting) | 88 | 11
parquet.py (in src/datasets/packaged_modules/parquet) | 86 | 5
__init__.py (in src/datasets/packaged_modules) | 85 | 1
__init__.py (in src/datasets/formatting) | 84 | 4
tf_formatter.py (in src/datasets/formatting) | 83 | 8
Files With Most Units (Top 50)
File | # lines | # units
iterable_dataset.py (in src/datasets) | 2750 | 234
table.py (in src/datasets) | 942 | 131
features.py (in src/datasets/features) | 1354 | 96
formatting.py (in src/datasets/formatting) | 464 | 74
file_utils.py (in src/datasets/utils) | 807 | 69
dataset_dict.py (in src/datasets) | 1075 | 58
builder.py (in src/datasets) | 1185 | 50
splits.py (in src/datasets) | 260 | 42
search.py (in src/datasets) | 393 | 33
data_files.py (in src/datasets) | 470 | 27
py_utils.py (in src/datasets/utils) | 402 | 26
arrow_reader.py (in src/datasets) | 309 | 26
load.py (in src/datasets) | 952 | 26
extract.py (in src/datasets/utils) | 253 | 25
fingerprint.py (in src/datasets) | 258 | 22
arrow_writer.py (in src/datasets) | 469 | 22
tf_utils.py (in src/datasets/utils) | 390 | 21
_tenbin.py (in src/datasets/packaged_modules/webdataset) | 167 | 21
spark.py (in src/datasets/packaged_modules/spark) | 289 | 20
_dill.py (in src/datasets/utils) | 334 | 17
info.py (in src/datasets) | 254 | 17
logging.py (in src/datasets/utils) | 69 | 14
download_manager.py (in src/datasets/download) | 172 | 14
webdataset.py (in src/datasets/packaged_modules/webdataset) | 263 | 12
version.py (in src/datasets/utils) | 52 | 11
polars_formatter.py (in src/datasets/formatting) | 88 | 11
image.py (in src/datasets/features) | 250 | 11
streaming_download_manager.py (in src/datasets/download) | 106 | 11
track.py (in src/datasets/utils) | 45 | 10
(file name missing) | 53 | 9
jax_formatter.py (in src/datasets/formatting) | 116 | 9
video.py (in src/datasets/features) | 186 | 9
(file name missing) | 108 | 8
metadata.py (in src/datasets/utils) | 180 | 8
np_formatter.py (in src/datasets/formatting) | 79 | 8
tf_formatter.py (in src/datasets/formatting) | 83 | 8
torch_formatter.py (in src/datasets/formatting) | 82 | 8
pdf.py (in src/datasets/features) | 141 | 8
json.py (in src/datasets/packaged_modules/json) | 141 | 8
folder_based_builder.py (in src/datasets/packaged_modules/folder_based_builder) | 347 | 8
deprecation_utils.py (in src/datasets/utils) | 70 | 7
audio.py (in src/datasets/features) | 163 | 7
sql.py (in src/datasets/packaged_modules/sql) | 92 | 7
(file name missing) | 41 | 6
csv.py (in src/datasets/io) | 122 | 6
sql.py (in src/datasets/io) | 101 | 6
json.py (in src/datasets/io) | 148 | 6
tqdm.py (in src/datasets/utils) | 40 | 6
patching.py (in src/datasets/utils) | 65 | 6
translation.py (in src/datasets/features) | 52 | 6
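
The two rankings above contain the same per-file records sorted by different keys. The sketch below shows that idea on a handful of rows copied from this report. `FileStat`, `top_by_lines`, and `top_by_units` are illustrative names, not part of any tool, and treating a "-" unit count as "no units measured" is an assumption, since the report does not define the term "units".

```python
from typing import NamedTuple, Optional


class FileStat(NamedTuple):
    name: str
    lines: int
    units: Optional[int]  # None where the report shows "-"


# A few rows copied from the tables in this report.
STATS = [
    FileStat("iterable_dataset.py", 2750, 234),
    FileStat("features.py", 1354, 96),
    FileStat("table.py", 942, 131),
    FileStat("formatting.py", 464, 74),
    FileStat("config.py", 176, None),
]


def top_by_lines(stats, n=50):
    """Rank files by line count, longest first."""
    return sorted(stats, key=lambda s: s.lines, reverse=True)[:n]


def top_by_units(stats, n=50):
    """Rank files by unit count, skipping files with no unit count."""
    with_units = [s for s in stats if s.units is not None]
    return sorted(with_units, key=lambda s: s.units, reverse=True)[:n]


for stat in top_by_units(STATS, n=3):
    print(stat.name, stat.lines, stat.units)
```

Comparing the two orderings highlights files such as table.py, which is only moderately long but packs in many units.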
Files With Long Lines (Top 27)

There are 27 files with lines longer than 120 characters. In total, there are 105 long lines.

File | # lines | # units | # long lines
iterable_dataset.py (in src/datasets) | 2750 | 234 | 20
fingerprint.py (in src/datasets) | 258 | 22 | 10
builder.py (in src/datasets) | 1185 | 50 | 10
dataset_dict.py (in src/datasets) | 1075 | 58 | 9
combine.py (in src/datasets) | 90 | 2 | 8
search.py (in src/datasets) | 393 | 33 | 7
features.py (in src/datasets/features) | 1354 | 96 | 5
folder_based_builder.py (in src/datasets/packaged_modules/folder_based_builder) | 347 | 8 | 5
load.py (in src/datasets) | 952 | 26 | 4
arrow_writer.py (in src/datasets) | 469 | 22 | 3
json.py (in src/datasets/io) | 148 | 6 | 2
_dataset_viewer.py (in src/datasets/utils) | 70 | 2 | 2
sharding.py (in src/datasets/utils) | 56 | 5 | 2
exceptions.py (in src/datasets) | 61 | 1 | 2
formatting.py (in src/datasets/formatting) | 464 | 74 | 2
data_files.py (in src/datasets) | 470 | 27 | 2
json.py (in src/datasets/packaged_modules/json) | 141 | 8 | 2
file_utils.py (in src/datasets/utils) | 807 | 69 | 1
__init__.py (in src/datasets/formatting) | 84 | 4 | 1
translation.py (in src/datasets/features) | 52 | 6 | 1
image.py (in src/datasets/features) | 250 | 11 | 1
keyhash.py (in src/datasets) | 38 | 5 | 1
arrow_reader.py (in src/datasets) | 309 | 26 | 1
sql.py (in src/datasets/packaged_modules/sql) | 92 | 7 | 1
spark.py (in src/datasets/packaged_modules/spark) | 289 | 20 | 1
parquet.py (in src/datasets/packaged_modules/parquet) | 86 | 5 | 1
table.py (in src/datasets) | 942 | 131 | 1
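
A long-line count like the one above can be approximated with a few lines of Python. This is a rough sketch under stated assumptions: it uses the 120-character threshold given in the report, counts every character on a line (including indentation), and the helper names `count_long_lines` and `summarize` are invented for this example. The reporting tool may count differently, for instance by excluding comments or blank lines.

```python
LONG_LINE_THRESHOLD = 120  # threshold stated in the report


def count_long_lines(text, threshold=LONG_LINE_THRESHOLD):
    """Count lines strictly longer than `threshold` characters."""
    return sum(1 for line in text.splitlines() if len(line) > threshold)


def summarize(files, threshold=LONG_LINE_THRESHOLD):
    """Summarize long lines for a mapping of file name -> source text.

    Returns (rows, total), where rows lists (name, long_line_count) for
    files with at least one long line, sorted by count in descending order.
    """
    rows = [(name, count_long_lines(text, threshold)) for name, text in files.items()]
    rows = [(name, count) for name, count in rows if count > 0]
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows, sum(count for _, count in rows)


# Tiny in-memory example; real use would read file contents from disk.
example_files = {
    "a.py": "x" * 121 + "\n" + "y" * 120 + "\n",
    "b.py": "short line\n",
    "c.py": "z" * 200 + "\n" + "w" * 150 + "\n",
}
rows, total = summarize(example_files)
print(rows, total)
```

A line of exactly 120 characters is not counted here, matching the report's wording of "longer than 120 characters".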