Best practices for multi-domain crawling and per-host configuration #1516
Thank you for using Crawlee and for your kind words about the library! Optimal settings are quite specific to each project, as they depend on the target websites, crawling volumes, infrastructure, and proxy quality. Therefore, it's best to configure the crawler for each specific domain. To address your questions in more detail:
I would divide target sites into groups. Simple HTML sites can be handled by a single crawler through different handlers. Heavy sites with extensive JS and slow performance should use one crawler per domain, especially if this requires something like
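To make the grouping concrete, here is a minimal sketch assuming the Python version of Crawlee (the domains, labels, and handler bodies are made up, not part of this thread): the simple-HTML group shares one crawler whose router dispatches each request to a per-site handler.

```python
import asyncio

from crawlee import Request
from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee.router import Router

# One router, with one handler per "simple HTML" site group (labels are hypothetical).
router = Router[ParselCrawlingContext]()


@router.handler('SHOP')
async def shop_handler(context: ParselCrawlingContext) -> None:
    # Extraction logic for the shop-like sites goes here.
    context.log.info(f'Shop page: {context.request.url}')


@router.handler('BLOG')
async def blog_handler(context: ParselCrawlingContext) -> None:
    # Extraction logic for the blog-like sites goes here.
    context.log.info(f'Blog page: {context.request.url}')


async def main() -> None:
    crawler = ParselCrawler(request_handler=router)
    # The label on each request decides which handler processes it.
    await crawler.run([
        Request.from_url('https://shop.example.com', label='SHOP'),
        Request.from_url('https://blog.example.com', label='BLOG'),
    ])


if __name__ == '__main__':
    asyncio.run(main())
```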
This topic has been coming up more frequently lately. I'd say that #1437 and #1396 are related to per-host concurrency capabilities. However, we don't have native support for this yet. Our current recommendation is to use different crawlers for domains if they require different rate limiting.
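As a sketch of that workaround (again assuming the Python API, with placeholder numbers), each crawler gets its own `ConcurrencySettings`, so the throttling is effectively per host as long as each crawler only visits one host or one group of hosts:

```python
from crawlee import ConcurrencySettings
from crawlee.crawlers import ParselCrawler

# Conservative crawler for a rate-sensitive domain: few parallel requests
# and a hard cap on requests per minute.
gentle_crawler = ParselCrawler(
    concurrency_settings=ConcurrencySettings(
        max_concurrency=2,
        max_tasks_per_minute=30,
    ),
)

# More aggressive crawler for domains that tolerate a higher request rate.
fast_crawler = ParselCrawler(
    concurrency_settings=ConcurrencySettings(
        max_concurrency=20,
        max_tasks_per_minute=600,
    ),
)
```

Running several such crawlers side by side is sketched further below.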
The current cookie implementation in
You can't. Isolate the heavy website into a separate crawler. Thus, my main recommendation is to divide target sites into groups based on their configuration requirements:
- Simple HTML sites: one shared crawler, with a handler per site or site group.
- Heavy, JS-rendered, or slow sites: a dedicated crawler per domain, with its own concurrency settings and rate limits.

This approach provides the best balance between resource efficiency and maintainability for multi-domain crawling scenarios.
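Putting the groups together, here is a sketch of running them side by side (assuming the Python version of Crawlee; the queue names, parameter names such as `request_manager`, and URLs are my assumptions rather than something prescribed in this thread):

```python
import asyncio

from crawlee import ConcurrencySettings
from crawlee.crawlers import (
    ParselCrawler,
    ParselCrawlingContext,
    PlaywrightCrawler,
    PlaywrightCrawlingContext,
)
from crawlee.storages import RequestQueue


async def main() -> None:
    # Named request queues keep the two groups' workloads isolated from each other.
    html_queue = await RequestQueue.open(name='simple-html')
    heavy_queue = await RequestQueue.open(name='heavy-js')

    # Group 1: cheap HTTP-based crawler for the plain HTML sites.
    html_crawler = ParselCrawler(
        request_manager=html_queue,
        concurrency_settings=ConcurrencySettings(max_concurrency=20),
    )

    # Group 2: browser-based crawler reserved for the heavy, JS-rendered site.
    heavy_crawler = PlaywrightCrawler(
        request_manager=heavy_queue,
        concurrency_settings=ConcurrencySettings(max_concurrency=3),
    )

    @html_crawler.router.default_handler
    async def handle_html(context: ParselCrawlingContext) -> None:
        context.log.info(f'HTML page: {context.request.url}')

    @heavy_crawler.router.default_handler
    async def handle_heavy(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Rendered page: {context.request.url}')

    # Each crawler drains only its own queue; a slow site in one group
    # cannot block progress in the other.
    await asyncio.gather(
        html_crawler.run(['https://blog.example.com', 'https://shop.example.com']),
        heavy_crawler.run(['https://heavy-spa.example.com']),
    )


if __name__ == '__main__':
    asyncio.run(main())
```

Because each crawler has its own queue, concurrency settings, and autoscaled pool, the heavy browser-based group can only starve itself, not the cheap HTTP group.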
Hi Crawlee Team,
First off, thank you for this fantastic library.
As I've been integrating Crawlee into a project that requires crawling many different hosts simultaneously, I've noticed that the current documentation and examples primarily focus on single-domain scenarios. While these are great for getting started, they don't address some of the common challenges that arise in large-scale, multi-domain crawling.
It would be incredibly valuable to have a guide or an example that details the recommended strategies for this kind of setup. Specifically, I'm trying to figure out the best approach for the following:
- Resource Allocation: In a browser-based crawling scenario, how do we ensure that a slow, heavily-loaded website doesn't monopolize the entire browser pool, starving out other, faster websites?
Having a "Best Practices" guide for multi-domain crawling would be a fantastic addition to the documentation for anyone moving beyond single-site scraping. In the meantime, any guidance you could provide on the questions above would be greatly appreciated.