Description
Multiple problems with Sitemap Scraper have been discovered during testing:
1. Stuck runs (reason unknown)
a. https://console.apify.com/organization/ZscMwFR5H7eCtWtyh/actors/rGeTNESChDZ65EbYh/runs/5wHZZusa6D3hgQbYV#log (https://warehouse-theme-metal.myshopify.com/ - the sitemap is there, but maybe the format of its URLs is not being processed)
b. https://console.apify.com/organization/ZscMwFR5H7eCtWtyh/actors/rGeTNESChDZ65EbYh/runs/hc53VPklsgZN2Qk3B#log (https://seomator.com/ - sitemap looks fine)
c. https://console.apify.com/organization/ZscMwFR5H7eCtWtyh/actors/rGeTNESChDZ65EbYh/runs/XMD4gDBACklbC2rvV#log (https://sitegpt.ai/ - sitemap looks ok; it's a sitemap index, but still valid)
d. https://console.apify.com/organization/ZscMwFR5H7eCtWtyh/actors/rGeTNESChDZ65EbYh/runs/v2iLnieKCh3BqDKsY#log (https://landing-page.io/ - sitemap looks ok; it's a sitemap index, but still valid)
e. https://console.apify.com/organization/ZscMwFR5H7eCtWtyh/actors/rGeTNESChDZ65EbYh/runs/KUDhmzm5yGlAx1oI6#log (https://www.screamingfrog.co.uk/ - sitemap looks ok)
f. https://console.apify.com/organization/ZscMwFR5H7eCtWtyh/actors/rGeTNESChDZ65EbYh/runs/mdOWlrNXTY4pB4sT4#log (https://www.hubspot.com/ - sitemap looks ok)
2. No valid sitemaps discovered
a. https://console.apify.com/organization/ZscMwFR5H7eCtWtyh/actors/rGeTNESChDZ65EbYh/runs/u7cPO18EoHAmMKbfg#log (https://www.alza.cz/ - there are a bunch of sitemaps in robots.txt, which weren't discovered)
a.1. https://console.apify.com/organization/ZscMwFR5H7eCtWtyh/actors/rGeTNESChDZ65EbYh/runs/3xuZgcqm6xwaxWOpb#log (when I give it a direct URL to a sitemap, it also doesn't process it)
b. https://console.apify.com/organization/ZscMwFR5H7eCtWtyh/actors/rGeTNESChDZ65EbYh/runs/wb8pznVQB0KZC0ivI#log (https://www.datart.cz/ - there is a sitemap in robots.txt, which wasn't discovered)
b.1. https://console.apify.com/organization/ZscMwFR5H7eCtWtyh/actors/rGeTNESChDZ65EbYh/runs/ATFC1iAhTdomwsb3C#input (when I give it a direct URL to a sitemap, it also doesn't process it)
c. https://console.apify.com/organization/ZscMwFR5H7eCtWtyh/actors/rGeTNESChDZ65EbYh/runs/RPwTeiK0h9UFBiH7d#log (https://backlinko.com/ - sitemap is mentioned in robots.txt, but is not picked up)
3. Large sitemap throws a bunch of errors
https://console.apify.com/organization/ZscMwFR5H7eCtWtyh/actors/rGeTNESChDZ65EbYh/runs/L3xe0rTA2iYS7NcYN#log (https://www.nytimes.com/ - it picks up the sitemaps ok and starts processing, but after some time it starts to throw many warnings/errors related to performance limits. Also, I'm not sure whether it processes .xml.gz sitemaps correctly; this run was too large to evaluate that)
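For reference, here is a minimal sketch of the behavior expected in points 2 and 3: picking up `Sitemap:` lines from robots.txt, and reading a sitemap body whether or not it is gzipped. It is not the Actor's code and does not diagnose the failures above. The function names `sitemaps_from_robots` and `parse_sitemap` are made up for illustration, Python is used only to keep the example short, and fetching the files over HTTP is left out.

```python
import gzip
import xml.etree.ElementTree as ET

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Hypothetical helper: collect the URLs from 'Sitemap:' lines.

    The field name is matched case-insensitively and trailing
    '#' comments are dropped.
    """
    urls = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")  # split on the first colon only
        if sep and field.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


def parse_sitemap(body: bytes) -> tuple[str, list[str]]:
    """Hypothetical helper: return ("index" | "urlset", [loc, ...]).

    Gzip is detected from the magic bytes rather than the file
    extension, so a '.xml.gz' body and a plain '.xml' body both work.
    """
    if body[:2] == GZIP_MAGIC:
        body = gzip.decompress(body)
    root = ET.fromstring(body)
    kind = "index" if local_name(root.tag) == "sitemapindex" else "urlset"
    locs = []
    for entry in root:        # <url> or <sitemap> entries
        for child in entry:   # direct children only, so nested extension tags are skipped
            if local_name(child.tag) == "loc" and child.text:
                locs.append(child.text.strip())
    return kind, locs
```

When `parse_sitemap` returns `"index"`, each returned URL is itself a sitemap to fetch and parse in turn; when it returns `"urlset"`, the URLs are the pages themselves.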