[Bug]: Crawl Configuration Inconsistency: Max Depth and Include Any Linked Page #693
Comments
Thanks for reporting this! I actually thought it was working the way you expected as well, and was surprised by the result! I have a fix for this, but wonder whether we should fix it as is or add a separate flag -- leaning towards fixing it as is, since that seems to be the more intuitive behavior. @tw4l @Shrinks99 do you have any thoughts?
- if extraHops is set, crawler should visit pages beyond maxDepth
- currently returning out of scope at depth limit even if extraHops is set
- adjust isInScope and isAtMaxDepth to account for extraHops
- fixes #693
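Roughly, the change described above amounts to something like the following sketch. This is not the actual browsertrix-crawler code -- the function shape and parameter names are assumed for illustration only:

```ts
// Minimal sketch of the intended scope rule, NOT the real isInScope /
// isAtMaxDepth implementation -- parameter names and shapes are assumed.

interface ScopeOpts {
  maxDepth: number;   // "Max Depth" setting
  extraHops: number;  // 1 when "Include Any Linked Page" is checked, else 0
}

// `depth` is how many links the page is away from the start URL;
// `inSeedScope` is whether the URL matches the seed scope
// (e.g. same domain/subdomain).
function isInScope(depth: number, inSeedScope: boolean, opts: ScopeOpts): boolean {
  const { maxDepth, extraHops } = opts;

  // Normal case: in-scope pages up to maxDepth.
  if (inSeedScope && depth <= maxDepth) {
    return true;
  }

  // Extra-hop case: with extraHops set, pages one hop further -- whether
  // outside the seed scope or just beyond maxDepth -- are still crawled.
  // Before the fix, this branch was effectively cut off at maxDepth.
  return extraHops > 0 && depth <= maxDepth + extraHops;
}
```

The key difference from the buggy behavior is that the depth check adds extraHops instead of rejecting everything past maxDepth outright.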
I would agree with fixing it as-is! I also shared the same mental model of what should happen that corresponds to @mona-ul's expected (but not actual) result. My leaning would be to treat this like a bug and patch it without any additional flags. Perhaps we can also be more explicit in the documentation about what's expected - the visual aids here really helped clarify!
+1 for fix as is! @mona-ul thanks for the clear explanation and diagrams!
Thank you for your fast reply and clarification. |
Browsertrix Version
v1.11.7-7a61568
What did you expect to happen? What happened instead?
I have a question about the settings Max Depth and Include Any Linked Page.
I think there might be an error in the implementation of the rules.
As I understand it,
"If checked, the crawler will visit pages one link away outside of Start URL Scope."
Right now, the Browsertrix crawler is not behaving as I would expect.
Reproduction instructions
Example:
I've created an example page for illustration: https://monaulrich.online/scopetest/start/
Crawl Config "Scope Domain & Subdomain / Max Depth 1"
Here the crawler includes what I would expect.
The start URL and the 4 linked pages that have the same domain/subdomain are archived.
These pages are in scope.
Crawl Config "Scope Domain & Subdomain / Max Depth 1 / Include Any Linked Page: True"
I would expect that all the pages that are in scope based on the settings Scope: Domain & Subdomain and Max Depth: 1 are checked for any linked pages. But only the pages one depth away from the start URL are included.
Browsertrix Output:
What I would expect:
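For concreteness, here is a small sketch of the crawl set I would expect with Max Depth: 1 and Include Any Linked Page. The URLs are placeholders, not my real test pages, and the scope check is simplified:

```ts
// Illustration only: a made-up link graph with placeholder URLs, and a
// simplified "same hostname" check standing in for Domain & Subdomain scope.

type Url = string;

const links: Record<Url, Url[]> = {
  "https://example.com/start/": ["https://example.com/a/", "https://example.com/b/"],
  "https://example.com/a/": ["https://example.com/a/deeper/"],
  "https://example.com/b/": ["https://other-site.example/external/"],
};

const inSeedScope = (url: Url): boolean => new URL(url).hostname === "example.com";

// Expected rule: in-scope pages up to maxDepth are crawled and have their
// links extracted; with extraHops = 1, each page they link to is included
// one hop further, even if it is outside the seed scope or beyond maxDepth.
function expectedPages(start: Url, maxDepth: number, extraHops: number): Set<Url> {
  const seen = new Set<Url>([start]);
  let frontier: Url[] = [start];
  for (let depth = 0; depth < maxDepth + extraHops; depth++) {
    const next: Url[] = [];
    for (const url of frontier) {
      // Only in-scope pages within maxDepth contribute further links;
      // extra-hop pages are archived but not followed.
      if (depth > maxDepth || (depth > 0 && !inSeedScope(url))) continue;
      for (const out of links[url] ?? []) {
        if (!seen.has(out)) {
          seen.add(out);
          next.push(out);
        }
      }
    }
    frontier = next;
  }
  return seen;
}

// Expected: start, /a/, /b/, /a/deeper/ and the external page.
// Observed: the crawl stops at the depth-1 pages instead.
console.log(expectedPages("https://example.com/start/", 1, 1));
```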
I am not sure if I misunderstand the settings, or if there is an inconsistency in the implementation.
Thanks in advance.
Screenshots / Video
No response
Environment
No response
Additional details
No response