WARNING: The “main” branch of this project is based on a development version of StormCrawler. Stable branches are available for:
- storm-crawler-1.18 running on Storm 1.2.3
For more information about processing (and creating) WARC archives using StormCrawler, see
- related StormCrawler issues: #755
- StormCrawler's WARC module README
- StormCrawler's WARCSpout
- install Apache Storm 2.3.0 - see Storm setup or use Docker (instructions below)
- clone and compile StormCrawler:
git clone https://github.com/DigitalPebble/storm-crawler.git
cd storm-crawler
mvn clean install
cd ..
Maven will deploy the StormCrawler jars into your local Maven repository.
Note: this step is obsolete if a released StormCrawler version is used (see also the stable storm-crawler-x.x branches).
- build this project:
mvn clean package
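After a successful build, the jar used by the storm commands below should be present in the target folder (the exact version suffix may differ):
ls target/warc-crawler-*.jar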
All topologies expect the WARC files to be processed to be listed in text files, one entry per line, using
- either a local file system path (ideally absolute; relative paths may not work in distributed mode)
- or a http:// or https:// URL
The text files are expected in the folder /data/input/. The input folder is defined in the Flux files; please change this location as needed.
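For example, a listing file /data/input/warc-files.txt (the name and entries are only placeholders) could contain:
/data/warc/my-archive-00000.warc.gz
https://example.org/warc/my-archive-00001.warc.gz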
TODO
To submit a Flux file to do the same:
storm local target/warc-crawler-2.2-SNAPSHOT.jar org.apache.storm.flux.Flux topology/warc-crawler-stdout/warc-crawler-stdout.flux
This will run the topology in local mode.
The command storm jar ... is used to run the topology in distributed mode:
storm jar target/warc-crawler-2.2-SNAPSHOT.jar org.apache.storm.flux.Flux topology/warc-crawler-stdout/warc-crawler-stdout.flux
It is best to run the topology in distributed mode to benefit from the Storm UI and logging. In that case, the topology runs continuously, as intended. Note that in local mode, Flux uses a default TTL of 60 seconds for the topology; the command above runs the topology for 24 hours (24*60*60*1000 milliseconds). In distributed mode, the topology runs forever (until it is killed).
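The local-mode lifetime can also be set explicitly via the --local-ttl option of storm local (in seconds), as used in the Java topology example below; a sketch for a 24-hour local Flux run:
storm local target/warc-crawler-2.2-SNAPSHOT.jar --local-ttl 86400 -- org.apache.storm.flux.Flux \
topology/warc-crawler-stdout/warc-crawler-stdout.flux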
A Java topology class can be run using the storm command:
storm local target/warc-crawler-2.2-SNAPSHOT.jar --local-ttl 600 -- org.commoncrawl.stormcrawler.CrawlTopology -conf topology/warc-crawler-stdout/warc-crawler-stdout-conf.yaml
This will launch the crawl topology in local mode for 10 minutes (600 seconds). Use storm jar ... to run the topology in distributed mode. Note: the -- is required to signal that the remaining options (here -conf) are not consumed by storm but are passed to the CrawlTopology as arguments.
Several Flux topologies are provided to test and evaluate the crawling of WARC archives. Each Flux file is accompanied by a configuration file which fits the requirements of running the topology on a single host. You need to modify the Flux file and the configuration if you want to scale up and run the topology on a distributed Storm cluster.
warc-crawler-dev-null runs a single WARCSpout which sends the page captures to a DevNullBolt which (you guessed it) only acks and discards each tuple. Useful to measure the performance of the WARCSpout.
warc-crawler-stdout reads WARC files, parses the content payload, maps content and metadata fields to index fields, and writes the fields (shortened) to the log output:
2020-10-08 14:30:30.113 STDIO Thread-4-index-executor[2 2] [INFO] pagetype article
2020-10-08 14:30:30.113 STDIO Thread-4-index-executor[2 2] [INFO] pageimage 169 chars
2020-10-08 14:30:30.113 STDIO Thread-4-index-executor[2 2] [INFO] keywords Coronavirus
2020-10-08 14:30:30.113 STDIO Thread-4-index-executor[2 2] [INFO] keywords NHS
2020-10-08 14:30:30.113 STDIO Thread-4-index-executor[2 2] [INFO] keywords Politics
2020-10-08 14:30:30.113 STDIO Thread-4-index-executor[2 2] [INFO] keywords Boris Johnson
2020-10-08 14:30:30.113 STDIO Thread-4-index-executor[2 2] [INFO] keywords Matt Hancock
2020-10-08 14:30:30.113 STDIO Thread-4-index-executor[2 2] [INFO] keywords Rishi Sunak
2020-10-08 14:30:30.113 STDIO Thread-4-index-executor[2 2] [INFO] keywords Sunderland
2020-10-08 14:30:30.113 STDIO Thread-4-index-executor[2 2] [INFO] capturetime 1601983220000
2020-10-08 14:30:30.113 STDIO Thread-4-index-executor[2 2] [INFO] description 114 chars
2020-10-08 14:30:30.113 STDIO Thread-4-index-executor[2 2] [INFO] title Coronavirus LIVE updates: Boris Johnson Tory conference ...
2020-10-08 14:30:30.113 STDIO Thread-4-index-executor[2 2] [INFO] publicationdate 2020-10-06T11:05:33Z
This topology can be used to test parsers and extractors without the need to set up any indexer backend. The Java topology class (CrawlTopology) runs an equivalent topology.
warc-crawler-warc-rewrite reads WARC files and sends the content to a WARC writer bolt which stores it again in WARC files. It could be extended with additional bolts to filter and/or enrich the WARC records.
warc-crawler-index-elasticsearch reads WARC files, parses HTML pages, extracts text and metadata and sends documents into Elasticsearch for indexing.
This topology requires that Elasticsearch is running:
- install Elasticsearch (and Kibana) 7.5.0 - higher 7.x versions might also work
- start Elasticsearch
- initialize the Elasticsearch indices by running ES_IndexInit.sh
- adapt the es-conf.yaml file so that Elasticsearch is reachable from the Storm workers – the host name elasticsearch is used in the Docker setup; change the host name to localhost when running in local mode with a local Elasticsearch installation.
See also the documentation of StormCrawler's Elasticsearch module.
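A quick way to verify that Elasticsearch is up and that ES_IndexInit.sh created the indices is the cat API (assuming Elasticsearch listens on localhost:9200):
curl -s 'http://localhost:9200/_cat/indices?v'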
warc-crawler-index-solr reads WARC files, parses HTML pages, extracts text and metadata and sends documents into Solr for indexing.
As a requirement, Solr must be installed and running:
- install Solr 8.10.1
- start Solr
- initialize the cores using StormCrawler's Solr core config
bin/solr create -c status -d storm-crawler/external/solr/cores/status/
bin/solr create -c metrics -d storm-crawler/external/solr/cores/metrics/
bin/solr create -c docs -d storm-crawler/external/solr/cores/docs/
- adapt the solr-conf.yaml file so that Solr is reachable from the Storm workers – the host name solr is used in the Docker setup; change the host name to localhost when running in local mode with a local Solr installation.
See also the documentation of StormCrawler's Solr module.
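To verify that Solr is running and the cores were created, the CoreAdmin API can be queried (assuming Solr listens on the default port 8983):
curl -s 'http://localhost:8983/solr/admin/cores?action=STATUS&wt=json'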
A configuration to run the topologies via docker-compose is provided. The file docker-compose.yaml puts every component (Storm Nimbus, Supervisor and UI, but also Elasticsearch and Solr) into its own container. The topology is launched from a separate container which is linked to the Storm Nimbus container.
By default, WARC input is read from the folder warcdata in the current directory. Another location can be defined by setting the environment variable WARCINPUT:
WARCINPUT=/my/warc/data/path/
export WARCINPUT
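As a sketch, the default input folder can be populated before the containers are started (the WARC file path is only a placeholder):
mkdir -p warcdata
cp /path/to/my-archive.warc.gz warcdata/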
First we launch all components:
docker-compose -f docker-compose.yaml up --build --renew-anon-volumes --remove-orphans
Now we can launch the container storm-crawler
docker-compose run --rm storm-crawler
and, in the running container, our topology:
$warc-crawler/> storm jar warc-crawler.jar org.apache.storm.flux.Flux topology/warc-crawler-dev-null/warc-crawler-dev-null.flux
Let's check whether the topology is running:
$warc-crawler/> storm list
Topology_name Status Num_tasks Num_workers Uptime_secs
-------------------------------------------------------------------
warc-crawler-dev-null ACTIVE 6 1 240
The Storm UI is also available on localhost and provides metrics about the running topology.
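The same information is also exposed by the Storm UI REST API; a sketch, assuming the UI is published on the default port 8080 of localhost:
curl -s 'http://localhost:8080/api/v1/topology/summary'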
To inspect the worker log files, we need to attach to the container running the Storm Supervisor
docker exec -it storm-supervisor /bin/bash
then find the log file and read it:
$> ls /logs/workers-artifacts/*/*/worker.log
/logs/workers-artifacts/warc-crawler-dev-null-1-1603368933/6700/worker.log
$> more /logs/workers-artifacts/warc-crawler-dev-null-1-1603368933/6700/worker.log
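To quickly scan all worker logs for problems:
grep -iE 'error|exception' /logs/workers-artifacts/*/*/worker.log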
When done, we kill the topology
$warc-crawler/> storm kill warc-crawler-dev-null -w 10
1636 [main] INFO o.a.s.c.kill-topology - Killed topology: warc-crawler-dev-null
leave the container (exit) and shut down all running containers:
docker-compose down
Of course, the topology could also be launched with a single command:
docker-compose run --rm storm-crawler storm jar warc-crawler.jar org.apache.storm.flux.Flux topology/warc-crawler-dev-null/warc-crawler-dev-null.flux
First, the Elasticsearch indices need to be initialized by running ES_IndexInit.sh.
Then the Elasticsearch topology can be launched via
docker-compose run --rm storm-crawler storm jar warc-crawler.jar org.apache.storm.flux.Flux \
topology/warc-crawler-index-elasticsearch/warc-crawler-index-elasticsearch.flux
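To check that documents are actually arriving in Elasticsearch, list the indices and their document counts; a sketch, assuming Elasticsearch's port 9200 is published on localhost (otherwise run the command inside the elasticsearch container):
curl -s 'http://localhost:9200/_cat/indices?v'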
To create the Solr cores, the "solr" container needs access to StormCrawler's Solr core config:
- because Solr will write into the core folders, it's recommended to create a copy first and assign the necessary file permissions:
cp -r .../storm-crawler/external/solr/cores /tmp/storm-crawler-solr-conf
chmod -R a+rwx /tmp/storm-crawler-solr-conf/
- point the environment variable STORM_CRAWLER_SOLR_CONF to this folder:
STORM_CRAWLER_SOLR_CONF=/tmp/storm-crawler-solr-conf
export STORM_CRAWLER_SOLR_CONF
- after all docker-compose services are running, create the Solr cores by running:
docker exec -it solr /opt/solr/bin/solr create -c status -d /storm-crawler-solr-conf/status/
docker exec -it solr /opt/solr/bin/solr create -c metrics -d /storm-crawler-solr-conf/metrics/
docker exec -it solr /opt/solr/bin/solr create -c docs -d /storm-crawler-solr-conf/docs/
- finally, launch the Solr topology:
docker-compose run --rm storm-crawler storm jar warc-crawler.jar org.apache.storm.flux.Flux \
topology/warc-crawler-index-solr/warc-crawler-index-solr.flux
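Analogously, indexed documents can be checked with a simple query against the docs core; a sketch, assuming Solr's port 8983 is published on localhost:
curl -s 'http://localhost:8983/solr/docs/select?q=*:*&rows=0'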