Running crawlers in separate threads
Péter Bencze edited this page May 31, 2019 · 1 revision
Imagine the following situation:
- You need to scrape product information from a large online store
- The products are listed in different categories
- The crawling takes a very long time to finish
To speed up the process, you could create a separate crawler for each product category and run each crawler in its own thread.
Note that the Selenium WebDriver is not thread-safe: never use the same WebDriver instance from more than one thread.
// Create the configurations
CrawlerConfiguration electronicsCrawlerConfig = new CrawlerConfigurationBuilder()
        .setOffsiteRequestFilterEnabled(true)
        .addAllowedCrawlDomain("example.com")
        .addCrawlSeed(createDefault("http://example.com/products/electronics"))
        .build();

CrawlerConfiguration foodCrawlerConfig = new CrawlerConfigurationBuilder()
        .setOffsiteRequestFilterEnabled(true)
        .addAllowedCrawlDomain("example.com")
        .addCrawlSeed(createDefault("http://example.com/products/food"))
        .build();

// Create the crawlers using the configurations above
ElectronicsCrawler electronicsCrawler = new ElectronicsCrawler(electronicsCrawlerConfig);
FoodCrawler foodCrawler = new FoodCrawler(foodCrawlerConfig);

// Start the crawlers in separate threads
ExecutorService executorService = Executors.newCachedThreadPool();
executorService.execute(electronicsCrawler::start);
executorService.execute(foodCrawler::start);

// Wait for termination and then shut down the executor service ...
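
The final step is left open above. One possible way to handle it, using only the standard `java.util.concurrent` API, is sketched below. The class name `CrawlerThreadsSketch`, the helper `runAndWait`, the one-hour timeout, and the two stand-in `Runnable`s are assumptions made for illustration; in the real program you would pass `electronicsCrawler::start` and `foodCrawler::start` instead.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class CrawlerThreadsSketch {

    // Hypothetical helper: runs every task on its own thread and blocks until
    // all of them finish or the timeout elapses.
    // Returns true only if all tasks completed within the timeout.
    static boolean runAndWait(List<Runnable> tasks, long timeout, TimeUnit unit) {
        ExecutorService executorService = Executors.newCachedThreadPool();
        tasks.forEach(executorService::execute);

        // Stop accepting new tasks; tasks that were already submitted keep running
        executorService.shutdown();
        try {
            if (executorService.awaitTermination(timeout, unit)) {
                return true;
            }
            // Timed out: ask the tasks that are still running to stop via interruption
            executorService.shutdownNow();
            return false;
        } catch (InterruptedException e) {
            // The waiting thread itself was interrupted: cancel and restore the flag
            executorService.shutdownNow();
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Stand-ins for electronicsCrawler::start and foodCrawler::start (assumption)
        Runnable electronics = () -> System.out.println("electronics crawl finished");
        Runnable food = () -> System.out.println("food crawl finished");

        boolean finished = runAndWait(List.of(electronics, food), 1, TimeUnit.HOURS);
        System.out.println("All crawlers finished: " + finished);
    }
}
```

`shutdown()` must be called before `awaitTermination()`, otherwise the wait only ends when the timeout expires. `shutdownNow()` merely interrupts the worker threads; whether a running crawler reacts to that interruption is not covered here, so choose a timeout that is generous enough for the crawl to finish on its own.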