
Running crawlers in separate threads

Péter Bencze edited this page May 31, 2019 · 1 revision

Speed up crawling

Imagine the following situation:

  • You need to scrape product information from a large online store
  • The products are listed in different categories
  • The crawling takes a very long time to finish

To speed up the process, you could create a separate crawler for each product category and run each crawler in its own thread.

Note that Selenium WebDriver is not thread-safe. Do not use the same WebDriver instance from more than one thread.

Implementation example

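// Create one configuration per product category
// (createDefault is assumed to be a static import of CrawlRequest.createDefault)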
CrawlerConfiguration electronicsCrawlerConfig = new CrawlerConfigurationBuilder()
        .setOffsiteRequestFilterEnabled(true)
        .addAllowedCrawlDomain("example.com")
        .addCrawlSeed(createDefault("http://example.com/products/electronics"))
        .build();

CrawlerConfiguration foodCrawlerConfig = new CrawlerConfigurationBuilder()
        .setOffsiteRequestFilterEnabled(true)
        .addAllowedCrawlDomain("example.com")
        .addCrawlSeed(createDefault("http://example.com/products/food"))
        .build();

// Create the crawlers using the configurations above
ElectronicsCrawler electronicsCrawler = new ElectronicsCrawler(electronicsCrawlerConfig);
FoodCrawler foodCrawler = new FoodCrawler(foodCrawlerConfig);

// Start the crawlers in separate threads
ExecutorService executorService = Executors.newCachedThreadPool();
executorService.execute(electronicsCrawler::start);
executorService.execute(foodCrawler::start);

// Wait for termination and then shut down the executor service ...
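The final step above is left open. One possible way to fill it in, using only the standard java.util.concurrent API, is sketched below. The CrawlerRunner class, the runAndWait helper and the one-hour timeout are hypothetical names and values chosen for illustration; the two lambdas stand in for electronicsCrawler::start and foodCrawler::start so that the sketch runs without the crawler library.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class CrawlerRunner {

    // Hypothetical helper: runs every task on the pool, then blocks until all
    // tasks have finished or the timeout elapses.
    // Returns true if every task completed within the timeout.
    public static boolean runAndWait(List<Runnable> tasks, long timeout, TimeUnit unit)
            throws InterruptedException {
        ExecutorService executorService = Executors.newCachedThreadPool();
        for (Runnable task : tasks) {
            executorService.execute(task);
        }

        // Stop accepting new tasks; tasks already submitted keep running
        executorService.shutdown();

        // Block until the submitted tasks finish or the timeout is reached
        boolean finished = executorService.awaitTermination(timeout, unit);
        if (!finished) {
            // Timed out: interrupt the tasks that are still running
            executorService.shutdownNow();
        }
        return finished;
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-ins for electronicsCrawler::start and foodCrawler::start
        Runnable electronicsTask = () -> System.out.println("electronics crawl done");
        Runnable foodTask = () -> System.out.println("food crawl done");

        boolean finished = runAndWait(
                Arrays.asList(electronicsTask, foodTask), 1, TimeUnit.HOURS);
        System.out.println("all crawlers finished: " + finished);
    }
}
```

Calling shutdown() before awaitTermination() matters: awaitTermination() only returns true once the executor has been shut down and all submitted tasks have completed. Whether interrupting a running crawler via shutdownNow() stops it cleanly depends on the crawler implementation, so treat the timeout branch as a last resort.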