Add more general support for inferred path discovery #597
ExtractorHTTP already extracts per-host `favicon.ico` by inference and can be set to also discover hosts' root page (`/`). This PR adds a list of paths that should be inferred to exist for each host.

The motivation for this was the need to check if a site has a sitemap (`/sitemap.xml`) even when one isn't listed in the `robots.txt` file. We've encountered several instances of this. Rather than just adding in one more hard-coded path, it seems better to make this configurable.

The existing config for discovering the root path has been marked deprecated in favor of this new setting, but it continues to function as before. So, existing configuration should not be affected by any of these changes.
Because discovery of `favicon.ico` is hard-coded and not configurable, it remains unaffected. I judge changing this to be too disruptive, but ideally these inferences could all be managed through the new list. Indeed, if not for the `favicon.ico` legacy, I might instead suggest that this functionality be broken off into a new "ExtractorInference", as none of it is related to the extraction of links from the HTTP response headers.