Best practices for multi-domain crawling and per-host configuration #1516
Thank you for using Crawlee and for your kind words about the library! Optimal settings are quite specific to each project, as they depend on the target websites, crawling volumes, infrastructure, and proxy quality. Therefore, it's best to configure the crawler for each specific domain. To address your questions in more detail:
I would divide target sites into groups. Simple HTML sites can be handled by a single crawler through different handlers. Heavy sites with extensive JS and slow performance should use one crawler per domain, especially if this requires something like
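To make the grouping concrete, here is a minimal sketch assuming the Python version of Crawlee (the domains, labels, and handler bodies are made up, not part of this thread): the simple-HTML group shares one crawler whose router dispatches each request to a per-site handler.

```python
import asyncio

from crawlee import Request
from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee.router import Router

# One router, with one handler per "simple HTML" site group (labels are hypothetical).
router = Router[ParselCrawlingContext]()


@router.handler('SHOP')
async def shop_handler(context: ParselCrawlingContext) -> None:
    # Extraction logic for the shop-like sites goes here.
    context.log.info(f'Shop page: {context.request.url}')


@router.handler('BLOG')
async def blog_handler(context: ParselCrawlingContext) -> None:
    # Extraction logic for the blog-like sites goes here.
    context.log.info(f'Blog page: {context.request.url}')


async def main() -> None:
    crawler = ParselCrawler(request_handler=router)
    # The label on each request decides which handler processes it.
    await crawler.run([
        Request.from_url('https://shop.example.com', label='SHOP'),
        Request.from_url('https://blog.example.com', label='BLOG'),
    ])


if __name__ == '__main__':
    asyncio.run(main())
```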
This topic has been coming up more frequently lately. I'd say that #1437 and #1396 are related to per-host concurrency capabilities. However, we don't have native support for this yet. Our current recommendation is to use different crawlers for domains if they require different rate limiting.
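As a sketch of that workaround (again assuming the Python API, with placeholder numbers), each crawler gets its own `ConcurrencySettings`, so the throttling is effectively per host as long as each crawler only visits one host or one group of hosts:

```python
from crawlee import ConcurrencySettings
from crawlee.crawlers import ParselCrawler

# Conservative crawler for a rate-sensitive domain: few parallel requests
# and a hard cap on requests per minute.
gentle_crawler = ParselCrawler(
    concurrency_settings=ConcurrencySettings(
        max_concurrency=2,
        max_tasks_per_minute=30,
    ),
)

# More aggressive crawler for domains that tolerate a higher request rate.
fast_crawler = ParselCrawler(
    concurrency_settings=ConcurrencySettings(
        max_concurrency=20,
        max_tasks_per_minute=600,
    ),
)
```

Running several such crawlers side by side is sketched further below.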
The current cookie implementation in
You can't. Isolate the heavy website into a separate crawler. Thus, my main recommendation is to divide target sites into groups based on their configuration requirements:
- Simple HTML sites: one shared crawler, with a handler per site or site group.
- Heavy, JS-rendered, or slow sites: a dedicated crawler per domain, with its own concurrency settings and rate limits.

This approach provides the best balance between resource efficiency and maintainability for multi-domain crawling scenarios.
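Putting the groups together, here is a sketch of running them side by side (assuming the Python version of Crawlee; the queue names, parameter names such as `request_manager`, and URLs are my assumptions rather than something prescribed in this thread):

```python
import asyncio

from crawlee import ConcurrencySettings
from crawlee.crawlers import (
    ParselCrawler,
    ParselCrawlingContext,
    PlaywrightCrawler,
    PlaywrightCrawlingContext,
)
from crawlee.storages import RequestQueue


async def main() -> None:
    # Named request queues keep the two groups' workloads isolated from each other.
    html_queue = await RequestQueue.open(name='simple-html')
    heavy_queue = await RequestQueue.open(name='heavy-js')

    # Group 1: cheap HTTP-based crawler for the plain HTML sites.
    html_crawler = ParselCrawler(
        request_manager=html_queue,
        concurrency_settings=ConcurrencySettings(max_concurrency=20),
    )

    # Group 2: browser-based crawler reserved for the heavy, JS-rendered site.
    heavy_crawler = PlaywrightCrawler(
        request_manager=heavy_queue,
        concurrency_settings=ConcurrencySettings(max_concurrency=3),
    )

    @html_crawler.router.default_handler
    async def handle_html(context: ParselCrawlingContext) -> None:
        context.log.info(f'HTML page: {context.request.url}')

    @heavy_crawler.router.default_handler
    async def handle_heavy(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Rendered page: {context.request.url}')

    # Each crawler drains only its own queue; a slow site in one group
    # cannot block progress in the other.
    await asyncio.gather(
        html_crawler.run(['https://blog.example.com', 'https://shop.example.com']),
        heavy_crawler.run(['https://heavy-spa.example.com']),
    )


if __name__ == '__main__':
    asyncio.run(main())
```

Because each crawler has its own queue, concurrency settings, and autoscaled pool, the heavy browser-based group can only starve itself, not the cheap HTTP group.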
Hi Crawlee Team,
First off, thank you for this fantastic library.
As I've been integrating Crawlee into a project that requires crawling many different hosts simultaneously, I've noticed that the current documentation and examples primarily focus on single-domain scenarios. While these are great for getting started, they don't address some of the common challenges that arise in large-scale, multi-domain crawling.
It would be incredibly valuable to have a guide or an example that details the recommended strategies for this kind of setup. Specifically, I'm trying to figure out the best approach for the following:
- Resource Allocation: In a browser-based crawling scenario, how do we ensure that a slow, heavily-loaded website doesn't monopolize the entire browser pool, starving out other, faster websites?
Having a "Best Practices" guide for multi-domain crawling would be a fantastic addition to the documentation for anyone moving beyond single-site scraping. In the meantime, any guidance you could provide on the questions above would be greatly appreciated.