chore: add discoverValidSitemaps utility#3392

Merged
barjin merged 5 commits into v4 from rebase/discover-sitemaps
Feb 6, 2026

barjin (Member) commented Feb 6, 2026

Rebases #3339 and #3370 on top of v4 and adds HttpClient support for discoverValidSitemaps.

Related to the discussion under apify/actor-scraper#214

foxt451 and others added 4 commits February 6, 2026 13:33
Related to apify/apify-sdk-js#486. I'm
[developing a generic sitemap scraper](apify/actor-scraper#205) and it's
going to share a big utility function (the main chunk of logic) with
wcc: `discoverValidSitemaps`. I asked @barjin if I could factor it out
and he told me this util could fit into crawlee. It's mainly copied from
wcc, but to keep the dependencies unchanged, it uses got-scraping to
check for URL existence instead of impit (I don't think it matters for
sitemaps), and `urlExists` is inlined (until we add an HTTP client to
these utils in v4, as @barjin told me). It's also turned into an async
generator. Let me know if you see a better place for this util.
barjin requested a review from Copilot February 6, 2026 12:51
barjin self-assigned this Feb 6, 2026
barjin added the adhoc label (Ad-hoc unplanned task added during the sprint.) Feb 6, 2026

Copilot AI left a comment

Pull request overview

This PR adds a new discoverValidSitemaps utility function that automatically discovers sitemap URLs for given domains by checking robots.txt files, well-known sitemap paths (sitemap.xml, sitemap.txt, sitemap_index.xml), and recognizing sitemap URLs in the input. The implementation also introduces a mergeAsyncIterables helper function to enable concurrent discovery across multiple domains while preserving order of discovery.
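
The per-domain discovery order described above could be sketched roughly as follows. This is an illustration only, not the code merged in this PR: `discoverSitemapsForOrigin`, `sitemapsFromRobotsTxt`, and the injected `fetchText` and `urlExists` callbacks are hypothetical names, and whether the real utility deduplicates or probes well-known paths in this exact order is an assumption.

```typescript
// Hypothetical callbacks standing in for whatever HTTP client is used.
type FetchText = (url: string) => Promise<string | null>;
type UrlExists = (url: string) => Promise<boolean>;

// Well-known sitemap paths named in the review summary.
const WELL_KNOWN_PATHS = ['/sitemap.xml', '/sitemap.txt', '/sitemap_index.xml'];

// Extract the values of `Sitemap:` directives from a robots.txt body.
function sitemapsFromRobotsTxt(body: string): string[] {
    return body
        .split(/\r?\n/)
        .map((line) => line.match(/^\s*sitemap:\s*(\S+)/i)?.[1])
        .filter((url): url is string => url !== undefined);
}

// Yield sitemap URLs for one origin as soon as each one is found:
// robots.txt entries first, then well-known paths that actually exist.
async function* discoverSitemapsForOrigin(
    origin: string,
    fetchText: FetchText,
    urlExists: UrlExists,
): AsyncGenerator<string> {
    const seen = new Set<string>();

    const robotsTxt = await fetchText(new URL('/robots.txt', origin).href);
    for (const url of robotsTxt ? sitemapsFromRobotsTxt(robotsTxt) : []) {
        if (!seen.has(url)) {
            seen.add(url);
            yield url;
        }
    }

    for (const path of WELL_KNOWN_PATHS) {
        const url = new URL(path, origin).href;
        if (!seen.has(url) && (await urlExists(url))) {
            seen.add(url);
            yield url;
        }
    }
}
```

Passing the network calls in as callbacks keeps the sketch independent of any particular HTTP client, which is also what makes it easy to exercise with stubs.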

Changes:

  • Added discoverValidSitemaps async generator function that discovers sitemaps from multiple domains concurrently
  • Added mergeAsyncIterables helper function to merge multiple async iterables with concurrent execution
  • Added comprehensive test coverage for sitemap discovery scenarios including robots.txt parsing, well-known paths, and multi-domain discovery
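
To make the idea of merging async iterables concrete, here is a minimal sketch of one common way to do it with `Promise.race`. It is not the helper added in this PR (that one is adapted from a StackOverflow answer, per the file table); the function name is reused only for illustration, and the cleanup behavior is an assumption.

```typescript
// Merge several async iterables, yielding values in the order they arrive.
async function* mergeAsyncIterables<T>(iterables: AsyncIterable<T>[]): AsyncGenerator<T> {
    const iterators = iterables.map((iterable) => iterable[Symbol.asyncIterator]());

    // At most one in-flight next() call per live iterator, keyed by its index.
    const pending = new Map<number, Promise<{ index: number; result: IteratorResult<T> }>>();
    const advance = (index: number): void => {
        pending.set(
            index,
            iterators[index].next().then((result) => ({ index, result })),
        );
    };
    iterators.forEach((_, index) => advance(index));

    try {
        while (pending.size > 0) {
            // Wait for whichever source produces its next value first.
            const { index, result } = await Promise.race(Array.from(pending.values()));
            if (result.done) {
                pending.delete(index);
            } else {
                advance(index);
                yield result.value;
            }
        }
    } finally {
        // If the consumer stops early, ask the remaining sources to clean up.
        pending.forEach((_, index) => {
            void iterators[index].return?.();
        });
    }
}
```

Because every source always has a `next()` call in flight, a slow domain does not block results from a fast one; values are emitted in arrival order rather than input order.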

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| packages/utils/src/internals/sitemap.ts | Implements the core discoverValidSitemaps function with support for HttpClient and concurrent domain processing |
| packages/utils/src/internals/iterables.ts | Adds mergeAsyncIterables utility for concurrent async iteration, sourced from StackOverflow with proper attribution |
| packages/utils/test/sitemap.test.ts | Comprehensive test suite covering sitemap discovery from robots.txt, well-known paths, input URLs, and multi-domain scenarios |

barjin requested review from B4nan and janbuchar February 6, 2026 13:04

barjin (Member, Author) commented Feb 6, 2026

I have rebased these to v4 to unblock the Sitemap Scraper guys (see apify/actor-scraper#214)

What's the best approach to merging these commits to v4? squash merge / rebase merge? Will either help in any way when rebasing the v4 commits on top of master later on?

wdyt @janbuchar @B4nan ?

B4nan (Member) commented Feb 6, 2026

Should be fine, but let's not use feat/fix so it won't end up twice in the changelog.

barjin changed the title from "feat(utils): add discoverValidSitemaps utility" to "chore(rebase): add discoverValidSitemaps utility" Feb 6, 2026
barjin changed the title from "chore(rebase): add discoverValidSitemaps utility" to "chore: add discoverValidSitemaps utility" Feb 6, 2026
barjin merged commit 5f890ae into v4 Feb 6, 2026
8 checks passed
barjin deleted the rebase/discover-sitemaps branch February 6, 2026 13:26
3 participants