Crawler trap due to invalid links (#931)

Seed: https://www.1318m.at

Expected result: every page is crawled once and the crawl finishes (the result should be small, around 10 MB).

Result: The crawler keeps following the same links, which are defective in the JS, and the URLs get concatenated ad infinitum:

Example (Datenschutz, i.e. the privacy policy page):
First Iteration URL:
https://www.1318m.at/de/1318m-datenschutz-qr52GAVQ/
Second Iteration URL:
https://www.1318m.at/de/1318m-datenschutz-qr52GAVQ/www.google.com/intl/de/policies/privacy/partners/
Third Iteration URL:
https://www.1318m.at/de/1318m-datenschutz-qr52GAVQ/www.google.com/intl/de/policies/privacy/partners/www.google.com/intl/de/policies/privacy/partners/
Fourth Iteration URL:
https://www.1318m.at/de/1318m-datenschutz-qr52GAVQ/www.google.com/intl/de/policies/privacy/partners/www.google.com/intl/de/policies/privacy/partners/www.google.com/intl/de/policies/privacy/partners/

and so on.
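
The growth pattern above can be reproduced with a few lines of Python. This is only a sketch under an assumption: that the page emits the Google link without a scheme (`www.google.com/...` instead of `https://www.google.com/...`), so it is resolved as a relative path against the current page. The issue itself only states that the links are defective in the JS; the names `BROKEN_HREF` and `simulate_trap` are made up for this illustration.

```python
from urllib.parse import urljoin

START = "https://www.1318m.at/de/1318m-datenschutz-qr52GAVQ/"
# Assumption: the link is written without "https://", so it is treated as relative.
BROKEN_HREF = "www.google.com/intl/de/policies/privacy/partners/"


def simulate_trap(start, href, iterations):
    """Resolve the same scheme-less href against each newly reached URL."""
    urls = [start]
    for _ in range(iterations):
        # Each step appends the href to the previous URL's path.
        urls.append(urljoin(urls[-1], href))
    return urls


for url in simulate_trap(START, BROKEN_HREF, 3):
    print(url)
```

If the assumption holds, every page reached this way contains the same broken link again, so the URL never stops growing and a plain "already seen" check never triggers.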

If I crawl this site with high limits (I do not know the site, and the crawl is part of an automatically generated domain crawl that should go deep), I have to wait quite a long time until one of the limits is hit, because the pages are text only. That costs a lot of crawl time that could be spent on valid crawls.

Proposal:
Do not follow a link if its URL contains another top-level domain after the first one (here: .at followed by .com),
and/or skip links that contain "www." more than once.
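
A minimal sketch of how the two proposed checks might look, written in Python purely for illustration. It is not Browsertrix code; `looks_like_trap`, `HOSTLIKE_SEGMENT`, and the short TLD list are hypothetical, and a real filter would need a proper TLD list.

```python
import re
from urllib.parse import urlsplit

# Assumption: a tiny sample of TLDs, just enough for the example.
SAMPLE_TLDS = ("com", "org", "net", "at", "de")

# A path segment that looks like a host name, e.g. "www.google.com".
HOSTLIKE_SEGMENT = re.compile(
    r"^(?:[a-z0-9-]+\.)+(?:" + "|".join(SAMPLE_TLDS) + r")$",
    re.IGNORECASE,
)


def looks_like_trap(url):
    """Return True if the URL matches either of the two proposed heuristics."""
    # Heuristic 1: "www." appears more than once anywhere in the URL.
    if url.lower().count("www.") > 1:
        return True
    # Heuristic 2: a path segment looks like a host name ending in a TLD.
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return any(HOSTLIKE_SEGMENT.match(s) for s in segments)


print(looks_like_trap("https://www.1318m.at/de/1318m-datenschutz-qr52GAVQ/"))
print(looks_like_trap(
    "https://www.1318m.at/de/1318m-datenschutz-qr52GAVQ/"
    "www.google.com/intl/de/policies/privacy/partners/"
))
```

Both checks can produce false positives, for example on web archives or redirect services that legitimately carry a host name in the path, so they would probably have to be opt-in rather than a default.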

Docker Start:

`docker run -d --name test_1318m -v /home/antares/browsertrix/crawls/:/crawls/ webrecorder/browsertrix-crawler:1.9.2 crawl --scopeType domain --depth 100 --headless --delay 0 --behaviorTimeout 30 --pageLoadTimeout 30 --waitUntil networkidle0 --saveState always --limit 10000 --logging stats,info --sitemap --url https://www.1318m.at`

A related question: how can a running crawl be moved out of such a trap? In the K8s deployment I could enter URL block commands directly, and in Heritrix I could manipulate the frontier or inject a command to get the crawler out of the trap. How should I proceed in such a case with Browsertrix?
