Conversation
Related to apify/apify-sdk-js#486. I'm [developing generic sitemap scraper](apify/actor-scraper#205) and it's going to share a big utility function (main chunk of logic) with wcc - `discoverValidSitemaps`. I've asked @barjin if I could factor it out and he told this util could fit into crawlee. It's mainly copied from wcc, but to keep the dependencies unchanged, it's using got-scraping to check for url existence instead of impit (I think it doesn't matter for sitemaps), and `urlExists` is inlined (until we don't add http client to these utils in v4 as @barjin told me). It's also turned into an async generator. Let me know if you see a better place for this util.
…aps (#3370) Based on @nicklamonov 's remarks
There was a problem hiding this comment.
Pull request overview
This PR adds a new discoverValidSitemaps utility function that automatically discovers sitemap URLs for given domains by checking robots.txt files, well-known sitemap paths (sitemap.xml, sitemap.txt, sitemap_index.xml), and recognizing sitemap URLs in the input. The implementation also introduces a mergeAsyncIterables helper function to enable concurrent discovery across multiple domains while preserving order of discovery.
Changes:
- Added
discoverValidSitemapsasync generator function that discovers sitemaps from multiple domains concurrently - Added
mergeAsyncIterableshelper function to merge multiple async iterables with concurrent execution - Added comprehensive test coverage for sitemap discovery scenarios including robots.txt parsing, well-known paths, and multi-domain discovery
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| packages/utils/src/internals/sitemap.ts | Implements the core discoverValidSitemaps function with support for HttpClient and concurrent domain processing |
| packages/utils/src/internals/iterables.ts | Adds mergeAsyncIterables utility for concurrent async iteration, sourced from StackOverflow with proper attribution |
| packages/utils/test/sitemap.test.ts | Comprehensive test suite covering sitemap discovery from robots.txt, well-known paths, input URLs, and multi-domain scenarios |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
I have rebased these to What's the best approach to merging these commits to wdyt @janbuchar @B4nan ? |
|
Should be fine, but let's not use |
discoverValidSitemaps utilitydiscoverValidSitemaps utility
discoverValidSitemaps utilitydiscoverValidSitemaps utility
Rebases #3339 and #3370 on top of
v4and addsHttpClientsupport for discoverValidSitemaps.Related to the discussion under apify/actor-scraper#214