
[Bug]: Crawl Configuration Inconsistency: Max Depth and Include Any Linked Page #693

Closed
mona-ul opened this issue Sep 29, 2024 · 4 comments · Fixed by #694
Labels
bug Something isn't working

Comments

@mona-ul

mona-ul commented Sep 29, 2024

Browsertrix Version

v1.11.7-7a61568

What did you expect to happen? What happened instead?

I have a question about the settings Max Depth and Include Any Linked Page.
I think there might be an error in the implementation of the rules.

As I understand it,

  • Max depth applies to pages that are in scope. That means pages beyond the max depth are cut; it limits the pages that are in scope. "Limits how many hops away the crawler can visit while staying within the Start URL Scope."
  • The one hop out adds all pages that are linked from pages in scope (see the sketch after this list).
    "If checked, the crawler will visit pages one link away outside of Start URL Scope."

Right now, the Browsertrix crawler is not behaving as I would expect.

Reproduction instructions

Example:
I've created an example page for illustration: https://monaulrich.online/scopetest/start/

Crawl Config "Scope Domain & Subdomain / Max Depth 1"
Here the crawler includes what I would expect:
the start URL and the 4 linked pages that have the same domain/subdomain are archived.
These pages are in scope.

[Image: example_structure_in_scope]

Crawl Config "Scope Domain & Subdomain / Max Depth 1 / Include Any Linked Page: True"
I would expect that all pages that are in scope based on the settings Scope Domain & Subdomain and Max Depth: 1 are checked for any linked pages. Instead, only the pages 1 depth away from the start URL are included.

Browsertrix Output:
[Image: example_structure_scope_linked_pages_actual]

What I would expect:
[Image: example_structure_scope_linked_pages_expected]

I am not sure if I misunderstand the settings, or if there is an inconsistency in the implementation.
Thanks in advance.

Screenshots / Video

No response

Environment

No response

Additional details

No response

@mona-ul mona-ul added the bug Something isn't working label Sep 29, 2024
@ikreymer
Member

ikreymer commented Sep 29, 2024

Thanks for reporting this! I actually thought that it was working the way you expected as well, and was surprised by the result!
The issue is that the extraHops check was being applied only to the scope, but not the depth! I agree that it should probably apply to both. (We usually just left the depth unlimited, so it didn't matter, but if it is set, the scope check ignores extraHops even when at max depth.)

I have a fix for this but wonder whether we should fix it as is or add a separate flag -- leaning towards fixing as is, as this seems to be the more intuitive way for it to work. @tw4l @Shrinks99 do you have any thoughts?

@ikreymer ikreymer transferred this issue from webrecorder/browsertrix Sep 29, 2024
ikreymer added a commit that referenced this issue Sep 29, 2024
- if extraHops is set, crawler should visit pages beyond maxDepth
- currently returning out of scope at depth limit even if extraHops is set
- adjust isInScope and isAtMaxDepth to account for extraHops
- fixes #693
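
In effect, the adjusted checks extend the depth budget by extraHops. A minimal sketch of that logic (assumed names and signatures, not the crawler's actual code):

```typescript
// Sketch of the adjusted depth/scope checks -- hypothetical
// signatures; the real functions live in the crawler's scope code.
function isAtMaxDepth(depth: number, extraHops: number, maxDepth: number): boolean {
  // The depth budget is extended by extraHops, so a page at
  // maxDepth can still queue links one extra hop out.
  return depth >= maxDepth + extraHops;
}

function isInScope(
  url: string,
  depth: number,
  extraHops: number,
  maxDepth: number,
  matchesScope: (u: string) => boolean,
): boolean {
  // In-scope URLs are crawled up to maxDepth hops away.
  if (matchesScope(url) && depth <= maxDepth) {
    return true;
  }
  // Anything else (out-of-scope URLs, or in-scope URLs past
  // maxDepth) is allowed only within the extraHops allowance.
  // Before the fix, this branch was skipped once maxDepth was
  // reached -- the bug reported above.
  return extraHops > 0 && depth <= maxDepth + extraHops;
}
```
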
@tw4l
Member

tw4l commented Sep 30, 2024

I would agree with fixing it as-is! I also shared the same mental model of what should happen, corresponding to @mona-ul's expected (but not actual) result. My leaning would be to treat this as a bug and patch it without any additional flags. Perhaps we can also be more explicit in the documentation about what's expected - the visual aids here really helped clarify!

@Shrinks99
Member

+1 for fix as is! @mona-ul thanks for the clear explanation and diagrams!

@mona-ul
Author

mona-ul commented Sep 30, 2024

Thank you for your fast reply and clarification.
