Running crawlers in separate threads

Speed up crawling

Imagine the following situation:

You need to scrape product information from a large online store
The products are listed in different categories
The crawling takes a very long time to finish

To speed up the process, you could create a separate crawler for each product category and run the crawlers in separate threads.

It is important to note that the Selenium WebDriver is not thread-safe! Do not try to use the same instance from different threads.

Implementation example

// Create the configurations
CrawlerConfiguration electronicsCrawlerConfig = new CrawlerConfigurationBuilder()
        .setOffsiteRequestFilterEnabled(true)
        .addAllowedCrawlDomain("example.com")
        .addCrawlSeed(createDefault("http://example.com/products/electronics"))
        .build();

CrawlerConfiguration foodCrawlerConfig = new CrawlerConfigurationBuilder()
        .setOffsiteRequestFilterEnabled(true)
        .addAllowedCrawlDomain("example.com")
        .addCrawlSeed(createDefault("http://example.com/products/food"))
        .build();

// Create the crawlers using the configurations above
ElectronicsCrawler electronicsCrawler = new ElectronicsCrawler(electronicsCrawlerConfig);
FoodCrawler foodCrawler = new FoodCrawler(foodCrawlerConfig);

// Start the crawlers in separate threads
ExecutorService executorService = Executors.newCachedThreadPool();
executorService.execute(electronicsCrawler::start);
executorService.execute(foodCrawler::start);

// Wait for termination and then shut down the executor service ...
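
The final step above is left open. One possible way to fill it in, using only the standard `java.util.concurrent` API, is sketched below. The `CrawlerRunner` class, the `runAll` helper, and the placeholder tasks in `main` are hypothetical names made up for this illustration and are not part of the library; in real code the tasks would be method references such as `electronicsCrawler::start`. Whether a running crawler reacts to the interrupt sent by `shutdownNow()` is an assumption that is not verified here.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class CrawlerRunner {

    /**
     * Hypothetical helper: runs every task on its own thread and blocks until
     * all of them finish or the timeout elapses.
     *
     * @return true if all tasks completed before the timeout
     */
    public static boolean runAll(List<Runnable> crawlerTasks, long timeout, TimeUnit unit)
            throws InterruptedException {
        ExecutorService executorService = Executors.newCachedThreadPool();
        try {
            for (Runnable task : crawlerTasks) {
                executorService.execute(task);
            }

            // Stop accepting new tasks; tasks that were already submitted keep running
            executorService.shutdown();

            // Block until every task has finished or the timeout elapses
            return executorService.awaitTermination(timeout, unit);
        } finally {
            // Does nothing if the executor has already terminated;
            // otherwise it interrupts the threads that are still running
            executorService.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Placeholder tasks standing in for electronicsCrawler::start and foodCrawler::start
        Runnable electronicsTask = () -> System.out.println("Electronics crawl finished");
        Runnable foodTask = () -> System.out.println("Food crawl finished");

        boolean finished = runAll(List.of(electronicsTask, foodTask), 1, TimeUnit.HOURS);
        System.out.println(finished ? "All crawlers finished" : "Timed out waiting for crawlers");
    }
}
```

Note that `shutdown()` is called before `awaitTermination()`: the executor only reaches the terminated state after a shutdown has been requested, so waiting first would always run into the timeout.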
