
[Bug]: Crawl Configuration Inconsistency: Max Depth and Include Any Linked Page #693

Closed
mona-ul opened this issue Sep 29, 2024 · 4 comments · Fixed by #694
Labels
bug Something isn't working

Comments

@mona-ul

mona-ul commented Sep 29, 2024

Browsertrix Version

v1.11.7-7a61568

What did you expect to happen? What happened instead?

I have a question about the settings Max Depth and Include Any Linked Page.
I think there might be an error in the implementation of the rules.

As I understand it,

  • Max depth applies to pages that are in scope. That means pages beyond the max depth are cut; it limits the pages that are in scope. "Limits how many hops away the crawler can visit while staying within the Start URL Scope."
  • The one hop out adds all pages that are linked from pages in scope (see the sketch after this list).
    "If checked, the crawler will visit pages one link away outside of Start URL Scope."

Right now, the Browsertrix crawler is not behaving as I would expect.

Reproduction instructions

Example:
I've created an example page for illustration: https://monaulrich.online/scopetest/start/

Crawl Config "Scope Domain & Subdomain / Max Depth 1"
Here the crawler includes what I would expect:
the start URL and the 4 linked pages that have the same domain/subdomain are archived.
These pages are in scope.

[Image: example_structure_in_scope]

Crawl Config "Scope Domain & Subdomain / Max Depth 1 / Include Any Linked Page: True"
I would expect that all pages that are in scope based on the settings Scope Domain & Subdomain and Max Depth: 1 are checked for any linked pages. Instead, only the pages 1 depth away from the start URL are included.

Browsertrix Output:
[Image: example_structure_scope_linked_pages_actual]

What I would expect:
[Image: example_structure_scope_linked_pages_expected]

I am not sure if I misunderstand the settings, or if there is an inconsistency in the implementation.
Thanks in advance.

Screenshots / Video

No response

Environment

No response

Additional details

No response

@mona-ul mona-ul added the bug Something isn't working label Sep 29, 2024
@ikreymer
Member

ikreymer commented Sep 29, 2024

Thanks for reporting this! I actually thought that it was working the way you expected as well, and was surprised by the result!
The issue is that the extraHops check was being applied only to the scope, but not the depth! I agree that it should probably apply to both. (We usually just left the depth unlimited, so it didn't matter, but if it is set, the scope check ignores extraHops even when at max depth.)

I have a fix for this but wonder whether we should fix it as is or add a separate flag -- leaning towards fixing as is, as this seems to be the more intuitive way for it to work. @tw4l @Shrinks99 do you have any thoughts?

@ikreymer ikreymer transferred this issue from webrecorder/browsertrix Sep 29, 2024
ikreymer added a commit that referenced this issue Sep 29, 2024
- if extraHops is set, crawler should visit pages beyond maxDepth
- currently returning out of scope at depth limit even if extraHops is set
- adjust isInScope and isAtMaxDepth to account for extraHops
- fixes #693
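
In effect, the adjusted checks extend the depth budget by extraHops. A minimal sketch of that logic (assumed names and signatures, not the crawler's actual code):

```typescript
// Sketch of the adjusted depth/scope checks -- hypothetical
// signatures; the real functions live in the crawler's scope code.
function isAtMaxDepth(depth: number, extraHops: number, maxDepth: number): boolean {
  // The depth budget is extended by extraHops, so a page at
  // maxDepth can still queue links one extra hop out.
  return depth >= maxDepth + extraHops;
}

function isInScope(
  url: string,
  depth: number,
  extraHops: number,
  maxDepth: number,
  matchesScope: (u: string) => boolean,
): boolean {
  // In-scope URLs are crawled up to maxDepth hops away.
  if (matchesScope(url) && depth <= maxDepth) {
    return true;
  }
  // Anything else (out-of-scope URLs, or in-scope URLs past
  // maxDepth) is allowed only within the extraHops allowance.
  // Before the fix, this branch was skipped once maxDepth was
  // reached -- the bug reported above.
  return extraHops > 0 && depth <= maxDepth + extraHops;
}
```
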
@tw4l
Member

tw4l commented Sep 30, 2024

I would agree with fixing it as-is! I also shared the same mental model of what should happen, corresponding to @mona-ul's expected (but not actual) result. My leaning would be to treat this as a bug and patch it without any additional flags. Perhaps we can also be more explicit in the documentation about what's expected - the visual aids here really helped clarify!

@Shrinks99
Member

+1 for fix as is! @mona-ul thanks for the clear explanation and diagrams!

@mona-ul
Author

mona-ul commented Sep 30, 2024

Thank you for your fast reply and clarification.
