Open
Labels
t-tooling: Issues with this label are in the ownership of the tooling team.
Description
The BasicCrawler.loadHandledRequestCount implementation assumes that the request source is used exclusively by the current crawler instance:
this.handledRequestsCount = await this.requestManager.handledCount();
If the request source has been used before (that is, its handledCount is greater than 0), the BasicCrawler.maxRequestsPerCrawl limit won't work correctly. The following example runs two crawlers in sequence against the same queue:
import { RequestQueueV2, CheerioCrawler } from 'crawlee';

const queue = await RequestQueueV2.open();
// Seed the shared queue with 100 URLs.
await queue.addRequestsBatched(
    Array.from({ length: 100 }, (_, i) => `https://example.com/page/${i}`),
);

// Run two crawlers in sequence against the same queue.
for (const crawlerId of [1, 2]) {
    const crawler = new CheerioCrawler({
        requestQueue: queue,
        requestHandler: async ({ request }) => {
            console.log(`[${crawlerId}] Crawling ${request.url}...`);
        },
        maxRequestsPerCrawl: 10,
        maxConcurrency: 1,
    });
    await crawler.run();
}
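To make the counting problem concrete without depending on Crawlee internals, here is a minimal, self-contained sketch. Everything in it is hypothetical: `FakeSource`, `simulateRun`, and `compareStrategies` are illustrative names, not Crawlee APIs, and the loop is a simplified model of how a per-crawl limit might be checked. It compares treating the source's total handled count as the crawler's own count against subtracting a baseline captured at the start of the run, which is one possible direction for a fix rather than a confirmed one.

```typescript
// Hypothetical stand-in for a shared request source; NOT part of Crawlee.
class FakeSource {
    private pending: number;
    private handled = 0;

    constructor(pending: number) {
        this.pending = pending;
    }

    // Total number of requests ever handled through this source.
    async handledCount(): Promise<number> {
        return this.handled;
    }

    // Marks one pending request as handled; returns false when none remain.
    async handleNext(): Promise<boolean> {
        if (this.pending === 0) return false;
        this.pending -= 1;
        this.handled += 1;
        return true;
    }
}

// Simulates one crawl and returns how many requests this run handled.
// With subtractBaseline = false, the source's total count is treated as the
// crawler's own count (the assumption described in this issue).
// With subtractBaseline = true, only requests handled during this run count
// toward the limit (an assumed alternative, not a confirmed fix).
async function simulateRun(
    source: FakeSource,
    maxRequestsPerCrawl: number,
    subtractBaseline: boolean,
): Promise<number> {
    const baseline = subtractBaseline ? await source.handledCount() : 0;
    let handledThisRun = 0;
    while ((await source.handledCount()) - baseline < maxRequestsPerCrawl) {
        if (!(await source.handleNext())) break;
        handledThisRun += 1;
    }
    return handledThisRun;
}

// Runs two consecutive simulated crawls per strategy, each against its own
// fresh source with 100 pending requests and a limit of 10.
async function compareStrategies(): Promise<{ absolute: number[]; relative: number[] }> {
    const absoluteSource = new FakeSource(100);
    const relativeSource = new FakeSource(100);
    const absolute: number[] = [];
    const relative: number[] = [];
    for (let run = 0; run < 2; run += 1) {
        absolute.push(await simulateRun(absoluteSource, 10, false));
        relative.push(await simulateRun(relativeSource, 10, true));
    }
    return { absolute, relative };
}
```

In this model, `compareStrategies()` resolves with 10 and then 0 handled requests for the absolute strategy, because the second run starts with the limit already reached. The baseline strategy handles 10 requests in each run. Actual Crawlee behavior may differ in details.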
cc @Pijukatel