Crawler trap due to invalid links #931

@gitreich

Description

Seed: https://www.1318m.at

Expected result: every page is crawled once and the crawl finishes (it should be small, around 10 MB).

Result: The crawler keeps following the same links, which are broken by a defect in the site's JS, so the URLs get concatenated ad infinitum:

Example (Datenschutz, i.e. privacy policy, page):
First Iteration URL:
https://www.1318m.at/de/1318m-datenschutz-qr52GAVQ/
Second Iteration URL:
https://www.1318m.at/de/1318m-datenschutz-qr52GAVQ/www.google.com/intl/de/policies/privacy/partners/
Third Iteration URL:
https://www.1318m.at/de/1318m-datenschutz-qr52GAVQ/www.google.com/intl/de/policies/privacy/partners/www.google.com/intl/de/policies/privacy/partners/
Fourth Iteration URL:
https://www.1318m.at/de/1318m-datenschutz-qr52GAVQ/www.google.com/intl/de/policies/privacy/partners/www.google.com/intl/de/policies/privacy/partners/www.google.com/intl/de/policies/privacy/partners/

and so on.
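
The growth pattern above is what standard relative-URL resolution produces when a link lacks its scheme. The following is a minimal sketch, assuming the page contains an href of the form `www.google.com/intl/de/policies/privacy/partners/` (without `https://`) and that the site serves the same page for the longer URL. Both are assumptions inferred from the URLs listed, not confirmed from the site's source, and `iterate` is a hypothetical helper.

```python
from typing import List
from urllib.parse import urljoin

# Assumption: the href on the page is missing its scheme, so it is
# treated as a path relative to the current page.
BROKEN_HREF = "www.google.com/intl/de/policies/privacy/partners/"


def iterate(start_url: str, href: str, rounds: int) -> List[str]:
    """Resolve the same relative href against each resulting URL in turn."""
    urls = [start_url]
    for _ in range(rounds):
        urls.append(urljoin(urls[-1], href))
    return urls


if __name__ == "__main__":
    start = "https://www.1318m.at/de/1318m-datenschutz-qr52GAVQ/"
    for url in iterate(start, BROKEN_HREF, 3):
        print(url)
```

Each round appends the href once more, so the URL never repeats and per-URL deduplication alone does not stop the crawl.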

So if I crawl this site with high limits (because I don't know the site, and the crawl is part of an automatically generated domain crawl that should go deep), I have to wait quite a long time, since the pages are only text, until one of the limits is hit, and I lose a lot of crawl time that could go to valid crawls.

Proposal:
Do not follow a link if a second top-level domain appears in the URL after the first one (here: .at followed by .com),
and/or skip links that contain "www." more than once.
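
To make the two proposed rules concrete, here is a rough sketch of how such a check could look. It is not part of the crawler. The function names, the `max_www` parameter, and the short `KNOWN_TLDS` set are all hypothetical, and a real filter would need a complete TLD list.

```python
from urllib.parse import urlsplit

# Illustrative subset only (assumption); a real filter needs a full TLD list.
KNOWN_TLDS = {"at", "com", "de", "net", "org"}


def has_host_in_path(url: str) -> bool:
    """True if a path segment ends in a known TLD, e.g. 'www.google.com'."""
    for segment in urlsplit(url).path.split("/"):
        name, dot, suffix = segment.rpartition(".")
        if dot and name and suffix.lower() in KNOWN_TLDS:
            return True
    return False


def count_www(url: str) -> int:
    """Number of times 'www.' occurs anywhere in the URL."""
    return url.lower().count("www.")


def should_skip(url: str, max_www: int = 1) -> bool:
    """Apply both proposed rules: a TLD inside the path, or repeated 'www.'."""
    return has_host_in_path(url) or count_www(url) > max_www
```

With these rules the first iteration URL above would still be crawled, while the second and all later ones would be skipped. Both rules can produce false positives, for example a legitimate path segment such as `example.com` or a query string that embeds another "www." address, so they would probably need to be opt-in.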

Docker Start:

docker run -d --name test_1318m -v /home/antares/browsertrix/crawls/:/crawls/ webrecorder/browsertrix-crawler:1.9.2 crawl --scopeType domain --depth 100 --headless --delay 0 --behaviorTimeout 30 --pageLoadTimeout 30 --waitUntil networkidle0 --saveState always --limit 10000 --logging stats,info --sitemap --url https://www.1318m.at

A related question: how can a running crawler be moved out of the trap? In the K8s deployment I could enter URL block rules directly, and in Heritrix I could manipulate the frontier or inject a command to get the crawler out of the trap. How should I proceed in such a case with Browsertrix?
