Clarify content types supported by SitemapRequestLoader #1515
Quick question: does `SitemapRequestLoader` officially support only XML and plaintext sitemaps? From the code, it selects a line-based parser for `text/plain` (or a `.txt` extension) and defaults to XML otherwise. It seems that `text/html` "sitemaps" (HTML pages listing links) aren't parsed. Is that intended (I assume yes, since the crawler would probably discover and parse them properly anyway), and could this be clarified in the docs?

Use case for context: I have a raw list of domains that I cannot trust, so I have to test for valid start seeds and also discover potential sitemaps (they may be missing from robots.txt and contain links that cannot be reached by classic crawling). I currently do it this way:

```python
sitemap_loader = SitemapRequestLoader(
    http_client=http_client,
    sitemap_urls=sitemap_urls,
)
request_manager = await sitemap_loader.to_tandem()
await request_manager.add_requests(seeds_urls)
```
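For illustration, the content-type dispatch described above can be sketched roughly like this. This is a minimal stdlib-only illustration of the behavior, not Crawlee's actual implementation; `parse_sitemap` and the exact checks are hypothetical:

```python
# Hypothetical sketch of the parser dispatch: line-based for plaintext,
# XML (per sitemaps.org) otherwise. HTML link pages are NOT handled here.
from xml.etree import ElementTree

SITEMAP_NS = '{http://www.sitemaps.org/schemas/sitemap/0.9}'


def parse_sitemap(body: str, content_type: str, url: str) -> list[str]:
    """Return the URLs listed in a sitemap payload."""
    if content_type.startswith('text/plain') or url.endswith('.txt'):
        # Plaintext sitemap: one URL per line.
        return [line.strip() for line in body.splitlines() if line.strip()]
    # Default: XML sitemap following the sitemaps.org protocol.
    root = ElementTree.fromstring(body)
    return [loc.text.strip() for loc in root.iter(f'{SITEMAP_NS}loc') if loc.text]


xml_body = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    '<url><loc>https://example.com/a</loc></url>'
    '<url><loc>https://example.com/b</loc></url>'
    '</urlset>'
)
print(parse_sitemap(xml_body, 'application/xml', 'https://example.com/sitemap.xml'))
print(parse_sitemap('https://example.com/a\nhttps://example.com/b\n',
                    'text/plain', 'https://example.com/sitemap.txt'))
```

Under this reading, a `text/html` response would fall into the XML branch and fail to parse, which matches the observed behavior.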
Replies: 1 comment
Hello @loic-bellinger. The `SitemapRequestLoader` is intended to handle sitemaps that follow the format described in https://www.sitemaps.org/protocol.html. Sitemaps of the "huge HTML page full of links" kind fall in the scope of regular crawlers and `enqueue_links`, as you correctly assumed. I believe this is something we should make clear in the docs - thank you for the input 🙂
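As an aside on the use case in the original question (untrusted domains whose sitemaps may be missing from robots.txt), building the `sitemap_urls` candidate list can be sketched with plain stdlib code. The function name and the fallback paths below are illustrative assumptions, not part of Crawlee:

```python
# Hedged sketch: derive candidate sitemap URLs for an untrusted domain by
# combining robots.txt "Sitemap:" declarations with common fallback paths.
def candidate_sitemap_urls(domain: str, robots_txt: str = '') -> list[str]:
    declared = [
        line.split(':', 1)[1].strip()
        for line in robots_txt.splitlines()
        if line.lower().startswith('sitemap:')
    ]
    # Common locations to probe when robots.txt declares nothing.
    fallbacks = [
        f'https://{domain}/sitemap.xml',
        f'https://{domain}/sitemap_index.xml',
    ]
    # Preserve order, drop duplicates.
    seen, out = set(), []
    for url in declared + fallbacks:
        if url not in seen:
            seen.add(url)
            out.append(url)
    return out
```

The resulting list could then be passed as `sitemap_urls` to `SitemapRequestLoader`, which tolerates candidates that turn out not to exist.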