ACHE is an implementation of a focused crawler. A focused crawler is a web crawler that collects web pages that satisfy some specific property, by carefully prioritizing the crawl frontier and managing the hyperlink exploration process [1].
You can download ache from Binstar [2] with Conda [3] by running:
conda install -c memex ache
To build ache from source, you can run the following commands in your terminal:
git clone https://github.com/chdoig/ache.git
cd ache
./gradlew clean installApp
which will generate an installation package under /build/install/.
Alternatively, you can build a zip archive:
git clone https://github.com/chdoig/ache.git
cd ache
./gradlew clean distZip
which will generate a zip file of your project under /build/distributions/.
Learn more about Gradle: http://www.gradle.org/documentation.
To run the ache crawler, you'll first need to build a model.
ache buildModel <target storage config path> <training data path> <output path>
<target storage config path> is the path to the configuration of the target storage.
<training data path> is the path to the directory containing positive and negative examples.
<output path> is the new directory where you want to save the generated model files: pageclassifier.model and pageclassifier.features.
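Before building a model, it can help to confirm that the training data directory actually contains both classes of examples. The following is a minimal sketch, not part of ache: the helper name check_training_data and the subdirectory names "positive" and "negative" are assumptions made for illustration, so adjust them to match your own layout.

```shell
#!/bin/sh
# Hypothetical helper (not part of ache): verify that a training data
# directory has one subdirectory per class with at least one example each.
# ASSUMPTION: the subdirectories are named "positive" and "negative".
check_training_data() {
    dir=$1
    for label in positive negative; do
        if [ ! -d "$dir/$label" ]; then
            echo "missing directory: $dir/$label" >&2
            return 1
        fi
        # Require at least one regular file (one example page) per class.
        if [ -z "$(find "$dir/$label" -type f | head -n 1)" ]; then
            echo "no examples in: $dir/$label" >&2
            return 1
        fi
    done
    return 0
}

# Example usage (commented out so the sketch has no side effects):
# check_training_data training_data && \
#     ache buildModel conf/sample_crawl/target_storage.cfg training_data models/sample_model/
```

The function only inspects the file system, so it is safe to run before every buildModel call.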
For example:
ache buildModel conf/sample_crawl/target_storage.cfg training_data models/sample_model/
To start a crawl, run:
ache startCrawl <data output path> <config path> <seed path> <model path> <lang detect profile path>
<data output path> is the path to the directory where you want to store your output.
<config path> is the path to the config directory.
<seed path> is the path to the seed list file.
<model path> is the path to the model directory (containing pageclassifier.model and pageclassifier.features).
<lang detect profile path> is the path to the language detection profile. Note: We are currently refactoring the code, and you'll be able to find the profile under resources in the near future. For now, you can download it here:
https://code.google.com/p/language-detection/wiki/Downloads.
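Because startCrawl takes several paths, a small pre-flight check can catch a missing file before the crawl begins. The sketch below is illustrative only: check_model_dir and check_crawl_inputs are hypothetical helper names, not ache commands. The checks follow the descriptions above (config directory, seed list file, model directory containing pageclassifier.model and pageclassifier.features, and a language detection profile path).

```shell
#!/bin/sh
# Hypothetical helper (not part of ache): confirm the model directory holds
# the two files that "ache buildModel" is described as generating.
check_model_dir() {
    model_dir=$1
    for f in pageclassifier.model pageclassifier.features; do
        if [ ! -f "$model_dir/$f" ]; then
            echo "missing model file: $model_dir/$f" >&2
            return 1
        fi
    done
    return 0
}

# Hypothetical helper (not part of ache): validate the input paths that
# would be passed to "ache startCrawl", in the same order after the output path.
check_crawl_inputs() {
    config_path=$1
    seed_path=$2
    model_path=$3
    profile_path=$4
    if [ ! -d "$config_path" ]; then
        echo "config directory not found: $config_path" >&2
        return 1
    fi
    if [ ! -f "$seed_path" ]; then
        echo "seed list file not found: $seed_path" >&2
        return 1
    fi
    check_model_dir "$model_path" || return 1
    if [ ! -e "$profile_path" ]; then
        echo "language detection profile not found: $profile_path" >&2
        return 1
    fi
    return 0
}

# Example usage (commented out so the sketch has no side effects):
# check_crawl_inputs conf/sample_config seeds/sample_crawl.seeds models/sample_model/ libs/profiles/ && \
#     ache startCrawl sample_crawl conf/sample_config seeds/sample_crawl.seeds models/sample_model/ libs/profiles/
```

The data output path is deliberately not checked, since it may not exist before the first crawl.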
For example:
ache startCrawl sample_crawl conf/sample_config seeds/sample_crawl.seeds models/sample_model/ libs/profiles/
To use ache, you'll need the following:
- JDK 1.6+
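To check the JDK requirement from a script, one option is to reduce the Java version string to its major version and compare it against 6. This is a rough sketch under stated assumptions: java_major_version and meets_jdk_requirement are hypothetical helper names, and the parsing assumes the usual version string shapes, "1.6.0_45" for older releases and "11.0.2" for newer ones.

```shell
#!/bin/sh
# Hypothetical helper (not part of ache): reduce a Java version string to its
# major version number. Legacy strings such as "1.6.0_45" map to 6; newer
# strings such as "11.0.2" map to 11.
java_major_version() {
    version=$1
    case $version in
        1.*) version=${version#1.} ;;   # drop the legacy "1." prefix
    esac
    # Keep only the leading run of digits.
    echo "${version%%[!0-9]*}"
}

# Hypothetical helper: succeed if the version string satisfies "JDK 1.6+".
meets_jdk_requirement() {
    major=$(java_major_version "$1")
    [ -n "$major" ] && [ "$major" -ge 6 ]
}

# Example usage (commented out because it needs a local Java installation).
# "java -version" writes to stderr, with the version in double quotes:
# installed=$(java -version 2>&1 | awk -F '"' '/version/ {print $2; exit}')
# meets_jdk_requirement "$installed" || echo "ache needs JDK 1.6 or newer" >&2
```

Keeping the parsing separate from the call to java makes the comparison easy to test without a JDK installed.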