Implementations of the URL Frontier Service. There are currently 2 implementations available:
- a simple memory-based one, used primarily for testing
- the default one, which is scalable, persistent and based on RocksDB
Web crawlers can connect to it using the gRPC code generated from the API. There is also a simple client available which can do basic interactions with a Frontier.
To build and run the service from source, compile with mvn clean package, then start the server with:
java -Xmx2G -cp target/urlfrontier-service-*.jar crawlercommons.urlfrontier.service.URLFrontierServer
You can specify the implementation to use for the service and its configuration by passing a configuration file with '-c'.
The configuration file below will set RocksDBService as the implementation to use and configure the path where its data should be stored.
implementation = crawlercommons.urlfrontier.service.rocksdb.RocksDBService
rocksdb.path = /pathToCrawlDir/rocksdb
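As a rough sketch of how the two pieces fit together, the snippet below writes the configuration shown above to a file and wraps the start command in a small helper. The file name config.ini and the function name start_frontier are only illustrative choices, and the helper assumes the jar has already been built with mvn clean package.

```shell
# Write the configuration to a file (config.ini is a hypothetical name)
cat > config.ini <<'EOF'
implementation = crawlercommons.urlfrontier.service.rocksdb.RocksDBService
rocksdb.path = /pathToCrawlDir/rocksdb
EOF

# Hypothetical helper: start the service and pass the config file with -c.
# Assumes it is run from the directory that contains target/.
start_frontier() {
  java -Xmx2G -cp target/urlfrontier-service-*.jar \
    crawlercommons.urlfrontier.service.URLFrontierServer -c config.ini
}
```

Calling start_frontier then launches the service with the RocksDB path taken from the file.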
The key/value pairs from the configuration file can also be passed on the command line. Since RocksDBService is the default implementation, the configuration above is equivalent to the following call without a config file:
java -Xmx2G -cp target/urlfrontier-service-*.jar crawlercommons.urlfrontier.service.URLFrontierServer rocksdb.path=/pathToCrawlDir/rocksdb
If no path is set explicitly for RocksDB, the default value ./rocksdb will be used.
Implementations that support a cluster mode require the parameter -h xxx.xxx.xxx.xxx, set to the private IP or hostname on which the service is running, so that it can report its location with the heartbeat.
Logging is done with Logback. A default configuration is loaded which writes logs to the console at INFO level and above, but the configuration file can be overridden with:
java -Xmx2G -Dlogback.configurationFile=test.xml ...
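For illustration, a minimal Logback file to use with this option could be created as follows. The appender name, pattern and levels are arbitrary choices for this sketch, not the project's defaults; the logger name simply reuses the crawlercommons.urlfrontier package prefix seen in the class names above.

```shell
# Write a small Logback configuration to test.xml (contents are an example only)
cat > test.xml <<'EOF'
<configuration>
  <!-- Send log events to the console -->
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <pattern>%d{HH:mm:ss.SSS} %-5level %logger{36} - %msg%n</pattern>
    </encoder>
  </appender>
  <!-- More detail for the Frontier classes, less for everything else -->
  <logger name="crawlercommons.urlfrontier" level="DEBUG"/>
  <root level="WARN">
    <appender-ref ref="STDOUT"/>
  </root>
</configuration>
EOF
```

With this file in the working directory, the -Dlogback.configurationFile=test.xml option shown above would pick it up.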
Alternatively, the Frontier service has a SetLogLevel endpoint, and the CLI can set the level for a given package from the console.
The service takes a parameter -s, whose value is the port number on which metrics are exposed for Prometheus. A dashboard for Grafana is provided.
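To check by hand that metrics are being exposed, something like the sketch below may help. It assumes the service was started with -s 9100 on the local machine and that the metrics are served under the conventional Prometheus /metrics path; both the path and the helper names metrics_url and fetch_metrics are assumptions made for this example.

```shell
# Build the metrics URL for a given port (defaults to 9100).
# localhost and the /metrics path are assumptions, not taken from the project.
metrics_url() {
  printf 'http://localhost:%s/metrics\n' "${1:-9100}"
}

# Fetch the metrics as text; requires curl and a running service
fetch_metrics() {
  curl -s "$(metrics_url "${1:-9100}")"
}
```

Running fetch_metrics 9100 against a live service should print the metrics in the Prometheus text format.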
The easiest way to run the Frontier is to use Docker:
docker pull crawlercommons/url-frontier
docker run --rm --name frontier -p 7071:7071 -p 9100:9100 crawlercommons/url-frontier -s 9100
The service will run on the default port (7071). Additional parameters can simply be appended to the command, for instance to persist the RocksDB data between runs:
docker run --rm --name frontier -v /pathOnDisk:/crawldir -p 7071:7071 crawlercommons/url-frontier rocksdb.path=/crawldir/rocksdb
To specify a config file, mount it as a volume and pass it with the -c flag:
docker run --rm --name frontier -p 7071:7071 -p 9100:9100 -v /path/to/config.ini:/config/config.ini crawlercommons/url-frontier -s 9100 -c /config/config.ini