Clarify content types supported by SitemapRequestLoader #1515
Quick question: does `SitemapRequestLoader` officially support only XML and plaintext sitemaps? From the code, it selects a line-based parser for `text/plain` (or a `.txt` extension) and defaults to XML otherwise. It seems that `text/html` "sitemaps" (HTML pages listing links) aren't parsed. Is that intended (I assume yes, since the crawler would probably discover and parse them properly anyway), and could this be clarified in the docs?

Use case for context: I have a raw list of domains that I cannot trust, so I have to test for valid start seeds and also discover potential sitemaps (they may be missing from robots.txt and contain links that cannot be reached by classic crawling). I currently do it this way:

```python
sitemap_loader = SitemapRequestLoader(
    http_client=http_client,
    sitemap_urls=sitemap_urls,
)
request_manager = await sitemap_loader.to_tandem()
await request_manager.add_requests(seeds_urls)
```
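For illustration, the content-type dispatch described above can be sketched roughly like this. This is a minimal stdlib-only illustration of the behavior, not Crawlee's actual implementation; `parse_sitemap` and the exact checks are hypothetical:

```python
# Hypothetical sketch of the parser dispatch: line-based for plaintext,
# XML (per sitemaps.org) otherwise. HTML link pages are NOT handled here.
from xml.etree import ElementTree

SITEMAP_NS = '{http://www.sitemaps.org/schemas/sitemap/0.9}'


def parse_sitemap(body: str, content_type: str, url: str) -> list[str]:
    """Return the URLs listed in a sitemap payload."""
    if content_type.startswith('text/plain') or url.endswith('.txt'):
        # Plaintext sitemap: one URL per line.
        return [line.strip() for line in body.splitlines() if line.strip()]
    # Default: XML sitemap following the sitemaps.org protocol.
    root = ElementTree.fromstring(body)
    return [loc.text.strip() for loc in root.iter(f'{SITEMAP_NS}loc') if loc.text]


xml_body = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    '<url><loc>https://example.com/a</loc></url>'
    '<url><loc>https://example.com/b</loc></url>'
    '</urlset>'
)
print(parse_sitemap(xml_body, 'application/xml', 'https://example.com/sitemap.xml'))
print(parse_sitemap('https://example.com/a\nhttps://example.com/b\n',
                    'text/plain', 'https://example.com/sitemap.txt'))
```

Under this reading, a `text/html` response would fall into the XML branch and fail to parse, which matches the observed behavior.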
Replies: 1 comment
Hello @loic-bellinger. The `SitemapRequestLoader` is intended to handle sitemaps that follow the format described in https://www.sitemaps.org/protocol.html. Sitemaps of the "huge HTML page full of links" kind fall in the scope of regular crawlers and `enqueue_links`, as you correctly assumed. I believe this is something we should make clear in the docs - thank you for the input 🙂
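As an aside on the use case in the original question (untrusted domains whose sitemaps may be missing from robots.txt), building the `sitemap_urls` candidate list can be sketched with plain stdlib code. The function name and the fallback paths below are illustrative assumptions, not part of Crawlee:

```python
# Hedged sketch: derive candidate sitemap URLs for an untrusted domain by
# combining robots.txt "Sitemap:" declarations with common fallback paths.
def candidate_sitemap_urls(domain: str, robots_txt: str = '') -> list[str]:
    declared = [
        line.split(':', 1)[1].strip()
        for line in robots_txt.splitlines()
        if line.lower().startswith('sitemap:')
    ]
    # Common locations to probe when robots.txt declares nothing.
    fallbacks = [
        f'https://{domain}/sitemap.xml',
        f'https://{domain}/sitemap_index.xml',
    ]
    # Preserve order, drop duplicates.
    seen, out = set(), []
    for url in declared + fallbacks:
        if url not in seen:
            seen.add(url)
            out.append(url)
    return out
```

The resulting list could then be passed as `sitemap_urls` to `SitemapRequestLoader`, which tolerates candidates that turn out not to exist.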