Description
Seed: https://www.1318m.at
Expected result: every page is crawled once and the crawl finishes (the result should be small, around 10 MB).
Result: The crawler keeps following the same links, which are broken by a defect in the site's JS, so the URLs get concatenated ad infinitum:
Example (Datenschutz, the privacy policy page):
First Iteration URL:
https://www.1318m.at/de/1318m-datenschutz-qr52GAVQ/
Second Iteration URL:
https://www.1318m.at/de/1318m-datenschutz-qr52GAVQ/www.google.com/intl/de/policies/privacy/partners/
Third Iteration URL:
https://www.1318m.at/de/1318m-datenschutz-qr52GAVQ/www.google.com/intl/de/policies/privacy/partners/www.google.com/intl/de/policies/privacy/partners/
Fourth Iteration URL:
https://www.1318m.at/de/1318m-datenschutz-qr52GAVQ/www.google.com/intl/de/policies/privacy/partners/www.google.com/intl/de/policies/privacy/partners/www.google.com/intl/de/policies/privacy/partners/
and so on.
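One mechanism that would produce exactly this pattern is a link whose href lacks a scheme, so it gets resolved as a relative path against the current page. This is an assumption on my side, not something confirmed from the site's source. The `href` value and the `iterate` helper below are hypothetical and only illustrate the effect with Python's standard `urljoin`:

```python
from urllib.parse import urljoin

page = "https://www.1318m.at/de/1318m-datenschutz-qr52GAVQ/"
# Assumed href: without "https://" it is treated as a relative path.
href = "www.google.com/intl/de/policies/privacy/partners/"


def iterate(start: str, relative_href: str, rounds: int) -> list[str]:
    """Resolve the same relative href against each newly discovered URL."""
    urls = [start]
    for _ in range(rounds):
        urls.append(urljoin(urls[-1], relative_href))
    return urls


for url in iterate(page, href, 3):
    print(url)
```

Every resolved URL is new to the crawler and still on the seed host, so it stays in scope for `--scopeType domain` and gets queued again.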
So if I crawl this site with high limits (because I don't know the site, and the crawl is part of an automatically generated domain crawl that should go deep), I have to wait quite a long time until one of the limits is hit, since the pages are only text, and I lose a lot of crawl time that could go to valid crawls.
Proposal:
Do not follow links whose URL contains a second top-level domain after the first one (here: .at followed by .com),
and/or skip links that contain "www." more than once.
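A rough sketch of both checks, to show what I mean. This is not browsertrix-crawler code; all names (`KNOWN_TLDS`, `has_host_in_path`, `has_repeated_www`, `looks_like_trap`) are made up for illustration, and the TLD set is a tiny sample where a real filter would need a full public suffix list:

```python
import re
from urllib.parse import urlsplit

# Illustrative subset only (assumption); not a complete TLD list.
KNOWN_TLDS = {"at", "com", "de", "net", "org"}

# A path segment made of dot-separated labels, capturing the last label.
HOSTLIKE = re.compile(r"^(?:[a-z0-9-]+\.)+([a-z]{2,})$", re.IGNORECASE)


def has_host_in_path(url: str) -> bool:
    """True if a path segment looks like a hostname ending in a known TLD."""
    for segment in urlsplit(url).path.split("/"):
        match = HOSTLIKE.match(segment)
        if match and match.group(1).lower() in KNOWN_TLDS:
            return True
    return False


def has_repeated_www(url: str) -> bool:
    """True if "www." occurs more than once anywhere in the URL."""
    return url.lower().count("www.") > 1


def looks_like_trap(url: str) -> bool:
    """Combine both proposed checks; either one is enough to skip the link."""
    return has_host_in_path(url) or has_repeated_www(url)
```

With these checks the first iteration URL above passes, while the second iteration URL is already rejected. Both heuristics can give false positives, for example on web archive style URLs that legitimately embed another hostname in the path, so this would probably have to be opt-in.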
Docker Start:
docker run -d --name test_1318m -v /home/antares/browsertrix/crawls/:/crawls/ webrecorder/browsertrix-crawler:1.9.2 crawl --scopeType domain --depth 100 --headless --delay 0 --behaviorTimeout 30 --pageLoadTimeout 30 --waitUntil networkidle0 --saveState always --limit 10000 --logging stats,info --sitemap --url https://www.1318m.at
A related question: how can a running crawler be moved out of such a trap? In the K8s deployment I could enter URL block rules directly, and even in Heritrix I could manipulate the frontier or inject a command to get the crawler out of the trap, but how should I proceed in such a case with Browsertrix?