---
layout: default
slug: home
---
<div class="page-title">
<h1>A collection of resources for building low-latency, scalable web crawlers on Apache Storm®</h1>
</div>
</div>
<div class="row row-col">
<p><strong>Apache StormCrawler (Incubating)</strong> is an open source SDK for building distributed web crawlers based on <a href="http://storm.apache.org">Apache Storm®</a>. The project is licensed under the Apache License, Version 2.0, and consists of a collection of reusable resources and components, written mostly in Java.</p>
<p>The aim of Apache StormCrawler (Incubating) is to help build web crawlers that are:</p>
<ul>
<li>scalable</li>
<li>resilient</li>
<li>low latency</li>
<li>easy to extend</li>
<li>polite yet efficient</li>
</ul>
<p><strong>Apache StormCrawler (Incubating)</strong> is a library and collection of resources that developers can leverage to build their own crawlers. The good news is that doing so can be pretty straightforward! Have a look at the <a href="getting-started/">Getting Started</a> section for more details.</p>
<p>Apart from the core components, we provide some <a href="https://github.com/apache/incubator-stormcrawler/tree/main/external">external resources</a> that you can reuse in your project, such as our spout and bolts for <a href="https://opensearch.org/">OpenSearch®</a> or a ParserBolt that uses <a href="http://tika.apache.org">Apache Tika®</a> to parse various document formats.</p>
<p><strong>Apache StormCrawler (Incubating)</strong> is well suited to use cases where the URLs to fetch and parse arrive as streams, but it is also an appropriate solution for large-scale recursive crawls, particularly where low latency is required. The project is used in production by <a href="https://github.com/apache/incubator-stormcrawler/wiki/Powered-By">many organisations</a> and is actively developed and maintained.</p>
<p>The <a href="https://github.com/apache/incubator-stormcrawler/wiki/Presentations">Presentations</a> page links to recent presentations about the project.</p>
</div>
<div class="row row-col">
<div class="used-by-panel">
<h2>Used by</h2>
<a href="https://pixray.com/" target="_blank">
<img src="{{ site.baseurl }}/img/pixray.png" alt="Pixray" height=80>
</a>
<a href="https://www.gov.nt.ca/" target="_blank">
<img src="{{ site.baseurl }}/img/gnwt.png" alt="Government of Northwest Territories">
</a>
<a href="https://www.stolencamerafinder.com/" target="_blank">
<img src="{{ site.baseurl }}/img/stolen-camera-finder.png" alt="StolenCameraFinder">
</a>
<a href="https://www.polecat.com/" target="_blank">
<img src="{{ site.baseurl }}/img/polecat.svg" alt="Polecat" height=70>
</a>
<br>
<a href="http://github.com/apache/incubator-stormcrawler/wiki/Powered-By">and many more...</a>
</div>
</div>