apache / tika
The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).
138K lines of main code (1.3K files)
64K lines of test code (938 files)
25K lines of other code (231 files)
18y age (6,613 days)
54% of main code touched in 1 year (75K LOC)
3% new main code in 1 year (4.2K LOC)
124K java
13K xml
1K py
0.1K proto
0.08K html
0.04K yaml

Per-year chart values (the two series labels were not preserved in the source):

Year   Series 1   Series 2
2025   110        7
2024   692        22
2023   838        15
2022   836        21
2021   606        27
2020   324        23
2019   234        20
2018   400        29
2017   701        49
2016   531        44
2015   474        14
2014   398        11
2013   132        9
2012   311        8
2011   465        9
2010   344        7
2009   323        5
2008   160        6
2007   136        6

Generated by sokrates.dev on 2025-05-07