Support `linkcheck_ignore` in link redirection #11233

ericpre · 2023-03-11T20:57:59Z

Is your feature request related to a problem? Please describe.
Specifying a domain in linkcheck_ignore works well for links containing this domain, but it doesn't work for links that redirect to the ignored domain.
For example, the following configuration:

linkcheck_ignore = [
    "https://onlinelibrary.wiley.com",  # 403 Client Error: Forbidden for url
]

works perfectly for links like https://onlinelibrary.wiley.com/doi/10.1002/jemt.20597 but not for https://doi.org/10.1002/jemt.20597, which redirects to https://onlinelibrary.wiley.com/doi/10.1002/jemt.20597

Describe the solution you'd like
The linkcheck_ignore configuration parameters should also apply to redirect links.

Additional context
See for example hyperspy/hyperspy#3108. This typically happens for DOI links, which are by design permanent URLs that redirect to URLs which can change. In this case, the DOI should be used in favour of the redirect URL; however, linkcheck_ignore will not be effective on the redirect URL.


francoisfreitag · 2023-03-14T16:02:03Z

Sounds very reasonable to me. When a user expects all links to a domain (or path) to be ignored, linkcheck should also ignore the redirections pointing to that domain.

goekce · 2023-06-13T09:42:37Z

I tried to understand why Wiley URLs have this problem but was not successful, so I asked for help upstream: psf/requests#6471

Meanwhile, having the option to ignore specific redirect sites of DOI links would be a good idea.

Here is what I have tried, if someone is interested:

If I for instance visit `https://doi.org/10.1002/jccs.200600142` with my browser, everything is fine. But both requests and Sphinx fail:

python -c "import requests; print(requests.head('https://doi.org/10.1002/jccs.200600142', allow_redirects=True)); import sphinx.util.requests; print(sphinx.util.requests.head('https://doi.org/10.1002/jccs.200600142', allow_redirects=True))"
<Response [403]>
<Response [403]>

I also tried accepting cookies and changing the user-agent, which also did not help:

import requests
with requests.Session() as s:
    print(s.get('https://doi.org/10.1002/jccs.200600142', allow_redirects=True, headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/114.0'}))

I posted this upstream: psf/requests#6471

francoisfreitag · 2023-06-13T12:33:15Z

Unsure what problem Wiley URLs have? They are always replying with a 403. The issue here is that developers can instruct linkcheck to ignore URLs matching a regexp pattern, but that URL is only ignored if it appears in the documentation, not if it comes as the result of a redirect.

So:

# conf.py
linkcheck_ignore = [
    "https://onlinelibrary.wiley.com",  # 403 Client Error: Forbidden for url
]

.. doc.rst
.. this link is ignored by linkcheck, it matches the pattern from linkcheck_ignore
`direct link <https://onlinelibrary.wiley.com/doi/10.1002/jemt.20597>`_
.. doi.org redirects to Wiley, but linkcheck does not check the linkcheck_ignore during the redirection chain, so the following shows as a broken link for linkcheck
`indirect link <https://doi.org/10.1002/jemt.20597>`_
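The mismatch can be reproduced without any network access. The following is a minimal sketch, assuming that the entries of linkcheck_ignore are compiled as regular expressions and matched from the start of the URL written in the documentation; the helper name `is_ignored` is hypothetical and is not part of Sphinx:

```python
import re

# Patterns as they would appear in conf.py (taken from the example above).
linkcheck_ignore = [
    "https://onlinelibrary.wiley.com",
]

# Assumption: each pattern is a regular expression matched against the
# beginning of the URL, which is what re.match does.
compiled = [re.compile(pattern) for pattern in linkcheck_ignore]


def is_ignored(url: str) -> bool:
    """Hypothetical helper: True if any ignore pattern matches ``url``."""
    return any(pattern.match(url) for pattern in compiled)


direct = "https://onlinelibrary.wiley.com/doi/10.1002/jemt.20597"
indirect = "https://doi.org/10.1002/jemt.20597"

print(is_ignored(direct))    # True: the URL itself matches the pattern
print(is_ignored(indirect))  # False: only the redirect target would match
```

Only the URL that appears in the document is compared against the patterns, so the doi.org link is still requested and ends up at the Wiley URL that returns a 403.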

jayaddison · 2024-10-26T13:12:33Z

I'm interested in implementing this feature, but there's a detail about the existing implementation that I think is important to consider first:

The Sphinx linkcheck builder currently uses the requests HTTP client to follow redirects, so the logic that decides whether one or more redirects are followed is encapsulated by that library. A naive application of linkcheck_ignore rules would therefore only apply after a final response URL is resolved (or a timeout occurs, or an error code is returned, etc).

My intuition for a feature like this is that we'd ideally want to ignore redirections as soon as they suggest navigating through an ignored path -- that is, we'd follow the initial hyperlink, and if it tells us to go to a known-ignorable URL, we'd stop immediately and return the ignore status code for that link. I believe that'd be preferable because it'd imply that we'd generate less network traffic, we'd spend less time linkchecking, and we wouldn't attempt to initiate traffic to domains we've configured as ignored.

To do so, however, we'd probably want to adjust Sphinx's linkchecker so that requests doesn't follow redirects on its behalf. That's OK... but it potentially has other consequences; we'd want to enforce a maximum-redirect limit (similar to the long-standing requests.sessions.DEFAULT_REDIRECT_LIMIT), for example. And requests itself will have had a wealth of experience dealing with unusual behaviours and bug reports related to redirects. So... as a maintainer I'm initially a bit cautious (an understatement) about this.

We could perhaps apply the naive solution, and ignore URLs after requests has followed all redirects. Perhaps that's OK - I just don't love the idea that some traffic may in fact navigate through ignored URLs. Hiding those by marking them as ignored -- even though they weren't -- feels potentially worse than doing nothing.
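To make the trade-off concrete, here is a rough sketch of the "stop at the first ignored hop" approach, assuming redirects are followed manually rather than by requests. The names `check_link`, `fetch` and `fake_fetch` are hypothetical; `fetch` stands in for a single HTTP request made with redirects disabled and returns the status code and the Location header, so the logic can be shown without network traffic:

```python
import re
from urllib.parse import urljoin

# Same value as requests.sessions.DEFAULT_REDIRECT_LIMIT.
MAX_REDIRECTS = 30
REDIRECT_CODES = (301, 302, 303, 307, 308)


def check_link(url, ignore, fetch, max_redirects=MAX_REDIRECTS):
    """Follow redirects one hop at a time; stop as soon as a hop is ignored.

    ``fetch`` is a hypothetical callable: url -> (status_code, location or None).
    Returns a (status, last_url) tuple.
    """
    current = url
    for _ in range(max_redirects + 1):
        # Check the ignore rules before sending any request to this URL.
        if any(pattern.match(current) for pattern in ignore):
            return "ignored", current
        status, location = fetch(current)
        if status in REDIRECT_CODES and location:
            # Location may be relative, so resolve it against the current URL.
            current = urljoin(current, location)
            continue
        return ("working" if status < 400 else "broken"), current
    return "broken", current  # redirect limit exceeded


def fake_fetch(url):
    """Stand-in for an HTTP request: doi.org redirects, everything else is 403."""
    routes = {
        "https://doi.org/10.1002/jemt.20597": (
            302,
            "https://onlinelibrary.wiley.com/doi/10.1002/jemt.20597",
        ),
    }
    return routes.get(url, (403, None))


ignore = [re.compile("https://onlinelibrary.wiley.com")]
print(check_link("https://doi.org/10.1002/jemt.20597", ignore, fake_fetch))
# ('ignored', 'https://onlinelibrary.wiley.com/doi/10.1002/jemt.20597')
```

In this sketch no request is ever sent to the ignored domain, which is the property described above. The cost is that the redirect limit and the resolution of relative Location headers have to be handled here instead of by requests.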

jayaddison · 2024-11-17T22:01:24Z

The concerns I had in my previous comment about achieving this using requests have been resolved after I learned that the requests.Session class -- that we use for all linkcheck HTTP requests -- provides a conveniently-overridable get_redirect_target method.

By overriding that method in #13127 in a Sphinx-internal subclass of requests.Session, I'm suggesting a change that inspects each HTTP redirect as-it-occurs, and compares each against the configured linkcheck_ignore patterns.
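As an illustration of the idea only (this is not the code from #13127), a Session subclass along these lines could look as follows. The names `IgnoreAwareSession` and `IgnoredRedirect` are hypothetical; the only requests API relied on is `requests.Session.get_redirect_target(resp)`, which returns the redirect location of a response or None:

```python
import re
from urllib.parse import urljoin

import requests


class IgnoredRedirect(Exception):
    """Hypothetical exception: a redirect pointed at an ignored URL."""

    def __init__(self, destination):
        super().__init__(destination)
        self.destination = destination


class IgnoreAwareSession(requests.Session):
    """Hypothetical Session subclass that inspects every redirect hop."""

    def __init__(self, ignore_patterns=()):
        super().__init__()
        self.ignore_patterns = [re.compile(p) for p in ignore_patterns]

    def get_redirect_target(self, resp):
        # requests calls this for each response while resolving redirects.
        target = super().get_redirect_target(resp)
        if target is not None:
            # The Location header may be relative; resolve it first.
            destination = urljoin(resp.url or "", target)
            if any(p.match(destination) for p in self.ignore_patterns):
                # Abort before any request is sent to the ignored URL.
                raise IgnoredRedirect(destination)
        return target
```

A caller would wrap its request in `try: ... except IgnoredRedirect:` and report the link as ignored instead of broken. Because the check runs inside the redirect loop of requests, the redirect limit and other edge-case handling stay with the library.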

ericpre mentioned this issue Mar 12, 2023: Fix linkchecker documentation hyperspy/hyperspy#3108 (Merged)

francoisfreitag added the builder:linkcheck and type:proposal (a feature suggestion) labels Mar 14, 2023

AA-Turner added this to the "some future version" milestone Apr 29, 2023

melund mentioned this issue Jan 17, 2024: Add user agent and update linkcheck_ignore list AnyBody/ammr#904 (Closed)

jayaddison linked a pull request Nov 13, 2024 that will close this issue: linkcheck: support ignored-URIs for redirects #13127 (Open)
