Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support linkcheck_ignore in link redirection #11233

Open
ericpre opened this issue Mar 11, 2023 · 5 comments · May be fixed by #13127
Open

Support linkcheck_ignore in link redirection #11233

ericpre opened this issue Mar 11, 2023 · 5 comments · May be fixed by #13127

Comments

@ericpre
Copy link

ericpre commented Mar 11, 2023

Is your feature request related to a problem? Please describe.
Specifying a domain in linkcheck_ignore works well for links containing this domain but it doesnn't for links which redirect to a link to the domain to be ignored.
For example, the following configuration:

linkcheck_ignore = [
    "https://onlinelibrary.wiley.com",  # 403 Client Error: Forbidden for url
]

works perfectly for links like https://onlinelibrary.wiley.com/doi/10.1002/jemt.20597 but not for https://doi.org/10.1002/jemt.20597, which redirect to https://onlinelibrary.wiley.com/doi/10.1002/jemt.20597

Describe the solution you'd like
The linkcheck_ignore configuration parameters should also apply to redirect links.

Additional context
See for example hyperspy/hyperspy#3108. This typically happen for DOI links, which are by design permanent url and redirect to urls which can changed. In this case, the DOI should be used in favour of the redirect url however, the linkcheck_ignore will not be effective on the redirect url.

@francoisfreitag
Copy link
Contributor

Sounds very reasonable to me. When a user expects all links to a domain (or path) to be ignored, linkcheck should also ignore the redirections pointing to that domain.

@AA-Turner AA-Turner added this to the some future version milestone Apr 29, 2023
@goekce
Copy link

goekce commented Jun 13, 2023

I tried to understand why Wiley URLs have this problem but was not successful, so I asked for help upstream: psf/requests#6471

Meanwhile having the option to ignore specific redirect sites of DOI links would a good idea.

Here is what I have tried here if someone is interested If I for instance visit `https://doi.org/10.1002/jccs.200600142` with my browser, everything is fine. But both requests and Sphinx fail:
python -c "import requests; print(requests.head('https://doi.org/10.1002/jccs.200600142', allow_redirects=True)); import sphinx.util.requests; print(sphinx.util.requests.head('https://doi.org/10.1002/jccs.200600142', allow_redirects=True))"
<Response [403]>
<Response [403]>

I also tried accepting cookies and changing the user-agent, which also did not help:

import requests
with requests.Session() as s:
    print(s.get('https://doi.org/10.1002/jccs.200600142', allow_redirects=True, headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/114.0'}))

I posted this upstream: psf/requests#6471

@francoisfreitag
Copy link
Contributor

francoisfreitag commented Jun 13, 2023

Unsure what problem Wiley URLs have? They are always replying with a 403. The issue here is that developers can instruct linkcheck to ignore URLs matching a regexp pattern, but that URL is only ignored if it appears in the documentation, not if it comes as the result of a redirect.

So:

# conf.py
linkcheck_ignore = [
    "https://onlinelibrary.wiley.com",  # 403 Client Error: Forbidden for url
]
.. doc.rst
.. this link is ignored by linkcheck, it matches the pattern from linkcheck_ignore
`direct link <https://onlinelibrary.wiley.com/doi/10.1002/jemt.20597>`_
.. doi.org redirects to Wiley, but linkcheck does not check the linkcheck_ignore during the redirection chain, so the following shows as a broken link for linkcheck
`indirect link <https://doi.org/10.1002/jemt.20597>`_

@jayaddison
Copy link
Contributor

I'm interested in implementing this feature, but there's a detail about the existing implementation that I think is important to consider first:

The Sphinx linkcheck builder currently uses the requests HTTP client to handle following of redirects, and so the logic to determine whether one-or-more-redirects will be followed is encapsulated by that library, and a naive application of linkcheck_ignore rules would only apply after a final response URL is resolved (or a timeout occurs, or an error code is returned, etc).

My intuition for a feature like this is that we'd ideally want to ignore redirections as soon as they suggest navigating through an ignored path -- that is, we'd follow the initial hyperlink, and if it tells us to go to a known-ignorable URL, we'd stop immediately and return the ignore status code for that link. I believe that'd be preferable because it'd imply that we'd generate less network traffic, we'd spend less time linkchecking, and we wouldn't attempt to initiate traffic to domains we've configured as ignored.

To do so, however, we'd probably want to adjust Sphinx's linkchecker so that requests doesn't follow redirects on its behalf. That's OK... but it potentially has other consequences; we'd want to enforce a maximum-redirect limit (similar to the long-standing requests.sessions.DEFAULT_REDIRECT_LIMIT), for example. And requests itself will have had a wealth of experience dealing with unusual behaviours and bugreports related to redirects. So... as a maintainer I'm initially a bit cautious (an understatement) about this.

We could perhaps apply the naive solution, and ignore URLs after requests has followed all redirects. Perhaps that's OK - I just don't love the idea that some traffic may in fact navigate through ignored URLs. Hiding those by marking them as ignored -- even though they weren't -- feels potentially worse than doing nothing.

@jayaddison jayaddison linked a pull request Nov 13, 2024 that will close this issue
@jayaddison
Copy link
Contributor

The concerns I had in my previous comment about achieving this using requests have been resolved after I learned that the requests.Session class -- that we use for all linkcheck HTTP requests -- provides a conveniently-overridable get_redirect_target method.

By overriding that method in #13127 in a Sphinx-internal subclass of requests.Session, I'm suggesting a change that inspects each HTTP redirect as-it-occurs, and compares each against the configured linkcheck_ignore patterns.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants