Sharing the RequestManager instance breaks maxRequestsPerCrawl limits #3330

@barjin

Description

The BasicCrawler.loadHandledRequestCount implementation assumes that the request source is used exclusively by the current crawler instance. It initializes the crawler's counter from the total stored on the request source:

this.handledRequestsCount = await this.requestManager.handledCount();

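If the request source has been used before (i.e. its handledCount is already > 0), the BasicCrawler.maxRequestsPerCrawl limit is not enforced correctly, because requests handled by earlier crawlers count towards the new crawler's limit. See the reproduction below: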

import { RequestQueueV2, CheerioCrawler } from 'crawlee';

// One queue shared by both crawlers, pre-filled with 100 URLs.
const queue = await RequestQueueV2.open();
await queue.addRequestsBatched(
    Array.from({ length: 100 }, (_, i) => `https://example.com/page/${i}`),
);

// Run two crawlers one after another against the same queue.
for (const crawlerId of [1, 2]) {
    const crawler = new CheerioCrawler({
        requestQueue: queue,
        requestHandler: async ({ request }) => {
            console.log(`[${crawlerId}] Crawling ${request.url}...`);
        },
        maxRequestsPerCrawl: 10,
        maxConcurrency: 1,
    });

    await crawler.run();
}
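
To make the counting problem easier to see without running Crawlee, here is a minimal, self-contained sketch of the mechanism. It is a simplified model, not Crawlee's actual code: `SharedQueue`, `ModelCrawler`, and the `countPerRun` flag are hypothetical names invented for illustration. With `countPerRun` set to `false`, the limit is checked against the queue's lifetime handled count, as described above. With `true`, it is checked against the number of requests handled since the run started, which is one possible direction for a fix.

```typescript
// Simplified model of the issue. SharedQueue and ModelCrawler are
// hypothetical stand-ins, not Crawlee classes.
class SharedQueue {
    private pending: string[] = [];
    private handled = 0;

    add(urls: string[]): void {
        this.pending.push(...urls);
    }

    next(): string | undefined {
        return this.pending.shift();
    }

    markHandled(): void {
        this.handled += 1;
    }

    // Lifetime total across every crawler that has used this queue.
    handledCount(): number {
        return this.handled;
    }
}

class ModelCrawler {
    constructor(
        private readonly queue: SharedQueue,
        private readonly maxRequestsPerCrawl: number,
        // Assumption: false models the reported behavior, true models a per-run counter.
        private readonly countPerRun: boolean,
    ) {}

    // Returns how many requests this run processed.
    run(): number {
        // With a per-run counter, requests handled before this run are subtracted.
        const baseline = this.countPerRun ? this.queue.handledCount() : 0;
        let processed = 0;
        while (this.queue.handledCount() - baseline < this.maxRequestsPerCrawl) {
            const url = this.queue.next();
            if (url === undefined) break;
            this.queue.markHandled();
            processed += 1;
        }
        return processed;
    }
}

// Runs two crawlers in sequence on one shared queue of 100 URLs.
function simulate(countPerRun: boolean): number[] {
    const queue = new SharedQueue();
    queue.add(Array.from({ length: 100 }, (_, i) => `https://example.com/page/${i}`));
    return [1, 2].map(() => new ModelCrawler(queue, 10, countPerRun).run());
}

const sharedCounter = simulate(false);
const perRunCounter = simulate(true);
console.log(sharedCounter, perRunCounter);
```

In this model, the second crawler processes nothing when the lifetime count is used, because the first crawler already pushed the shared total up to the limit. With the per-run baseline, each crawler processes its own 10 requests.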

cc @Pijukatel

Labels: t-tooling (issues with this label are in the ownership of the tooling team)
