Ultimate Sitemap Parser (USP) is a performant and robust Python library for parsing and crawling sitemaps.
- Supports all sitemap formats
- Field-tested with ~1 million URLs as part of the Media Cloud project
- Tolerates the more common sitemap bugs
- Tries to find sitemaps not listed in robots.txt
- Uses fast and memory-efficient Expat XML parsing
- Doesn't consume much memory even with massive sitemap hierarchies
- Provides the generated sitemap tree as an easy-to-use object tree
- Supports using a custom web client
- Uses a small number of actively maintained third-party modules
- Reasonably tested
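
To give a feel for the Expat-based streaming mentioned in the list above, here is a minimal sketch that uses only Python's standard `xml.parsers.expat` module. It is not USP's own parser: the function name `iter_locs` and the sample document are assumptions made for illustration. The input is fed to the parser in small chunks, so the whole document never has to be held as a parsed tree.

```python
import xml.parsers.expat


def iter_locs(chunks):
    """Yield the text of each <loc> element as soon as it has been parsed.

    `chunks` is any iterable of bytes objects (hypothetical helper, not USP API).
    """
    found = []    # URLs completed since the last yield
    buffer = []   # character data of the <loc> currently being read
    in_loc = False

    def local_name(name):
        # With a namespace separator set, names look like "<namespace-uri> loc"
        return name.rsplit(" ", 1)[-1]

    def start_element(name, attrs):
        nonlocal in_loc
        if local_name(name) == "loc":
            in_loc = True
            buffer.clear()

    def end_element(name):
        nonlocal in_loc
        if local_name(name) == "loc":
            found.append("".join(buffer).strip())
            in_loc = False

    def char_data(data):
        # Expat may deliver one text node in several pieces
        if in_loc:
            buffer.append(data)

    parser = xml.parsers.expat.ParserCreate(namespace_separator=" ")
    parser.StartElementHandler = start_element
    parser.EndElementHandler = end_element
    parser.CharacterDataHandler = char_data

    for chunk in chunks:
        parser.Parse(chunk, False)
        yield from found
        found.clear()

    # Signal end of input so that truncated documents raise an error
    parser.Parse(b"", True)
    yield from found


SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.org/</loc></url>
  <url><loc>https://www.example.org/about</loc></url>
</urlset>"""

# Feed the document 16 bytes at a time to mimic a streamed download
sample_chunks = [SAMPLE[i:i + 16] for i in range(0, len(SAMPLE), 16)]

for url in iter_locs(sample_chunks):
    print(url)
```

Because the handlers only keep the current `<loc>` text, memory use stays roughly constant regardless of how many URLs the sitemap lists.
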
Install using pip:

    pip install ultimate-sitemap-parser

or using Anaconda:

    conda install -c conda-forge ultimate-sitemap-parser

Usage:

    from usp.tree import sitemap_tree_for_homepage

    # Discover and parse the sitemaps of the given website
    tree = sitemap_tree_for_homepage('https://www.example.org/')

    # Print the URL of every page found in the sitemap hierarchy
    for page in tree.all_pages():
        print(page.url)

sitemap_tree_for_homepage() returns a tree of AbstractSitemap subclass objects that represent the sitemap hierarchy found on the website; see the reference of AbstractSitemap subclasses. AbstractSitemap.all_pages() returns a generator to efficiently iterate over pages without loading the entire tree into memory.
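
The idea behind a generator-based all_pages() can be sketched with plain Python. The classes below (`Page`, `PagesNode`, `IndexNode`) are hypothetical stand-ins invented for this example, not USP's actual classes; they only mirror the general shape of an index sitemap that links to sub-sitemaps and a leaf sitemap that lists pages.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Page:
    """Hypothetical page entry; only the URL is modelled here."""
    url: str


@dataclass
class PagesNode:
    """Hypothetical leaf sitemap that lists pages directly."""
    url: str
    pages: List[Page] = field(default_factory=list)

    def all_pages(self) -> Iterator[Page]:
        yield from self.pages


@dataclass
class IndexNode:
    """Hypothetical index sitemap that only links to other sitemaps."""
    url: str
    sub_sitemaps: list = field(default_factory=list)

    def all_pages(self) -> Iterator[Page]:
        # Delegate to each child in turn; no combined list is ever built
        for sitemap in self.sub_sitemaps:
            yield from sitemap.all_pages()


tree = IndexNode(
    "https://www.example.org/sitemap_index.xml",
    [
        PagesNode(
            "https://www.example.org/sitemap_news.xml",
            [Page("https://www.example.org/news/1")],
        ),
        IndexNode(
            "https://www.example.org/sitemap_archive.xml",
            [
                PagesNode(
                    "https://www.example.org/sitemap_2020.xml",
                    [
                        Page("https://www.example.org/2020/a"),
                        Page("https://www.example.org/2020/b"),
                    ],
                )
            ],
        ),
    ],
)

for page in tree.all_pages():
    print(page.url)
```

Each call to all_pages() yields one page at a time in depth-first order, so a caller can stop early or process a large hierarchy without first collecting every page.
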
For more examples and details, see the documentation.