Description
Scrapy may not follow a meta refresh redirect, depending on the order of the content and http-equiv attributes of the <meta> tag.
Steps to Reproduce
- Create two sample pages and serve them with Python's built-in HTTP server:
echo '<html><head><meta content="0;url=dummy.html" http-equiv="refresh"></head></html>' > index1.html
echo '<html><head><meta http-equiv="refresh" content="0;url=dummy.html"></head></html>' > index2.html
python3 -m http.server -d .
- In another terminal, open the Scrapy shell:
scrapy shell
>>> fetch('http://localhost:8000')
2021-01-29 21:24:22 [scrapy.core.engine] INFO: Spider opened
2021-01-29 21:24:22 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://localhost:8000/robots.txt> (referer: None)
2021-01-29 21:24:22 [protego] DEBUG: Rule at line 2 without any user agent to enforce it on.
2021-01-29 21:24:22 [protego] DEBUG: Rule at line 10 without any user agent to enforce it on.
2021-01-29 21:24:22 [protego] DEBUG: Rule at line 11 without any user agent to enforce it on.
2021-01-29 21:24:22 [protego] DEBUG: Rule at line 12 without any user agent to enforce it on.
2021-01-29 21:24:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://localhost:8000> (referer: None)
>>> fetch('http://localhost:8000/index1.html')
2021-01-29 21:24:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://localhost:8000/index1.html> (referer: None)
>>> fetch('http://localhost:8000/index2.html')
2021-01-29 21:24:30 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (meta refresh) to <GET http://localhost:8000/dummy.html> from <GET http://localhost:8000/index2.html>
2021-01-29 21:24:30 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://localhost:8000/dummy.html> (referer: None)
Expected behavior:
Redirection happens in both cases.
Actual behavior:
Redirection only happens in the second case (http-equiv, content), not in the first (content, http-equiv).
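To illustrate the expected behavior, here is a minimal sketch of meta refresh detection that does not depend on attribute order. It uses only Python's standard html.parser module. It is not how Scrapy or w3lib detect the redirect, and the names MetaRefreshParser and find_meta_refresh are hypothetical, introduced only for this example.

```python
from html.parser import HTMLParser


class MetaRefreshParser(HTMLParser):
    """Hypothetical helper: records the first <meta http-equiv="refresh"> target."""

    def __init__(self):
        super().__init__()
        self.delay = None
        self.url = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.url is not None:
            return
        # Building a dict makes the lookup independent of attribute order.
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("http-equiv", "").strip().lower() != "refresh":
            return
        # The content attribute looks like "0;url=dummy.html".
        delay, sep, target = attr_map.get("content", "").partition(";")
        if not sep:
            return
        target = target.strip()
        if target.lower().startswith("url="):
            target = target[4:].strip()
        try:
            parsed_delay = float(delay)
        except ValueError:
            return
        self.delay = parsed_delay
        self.url = target


def find_meta_refresh(html):
    """Return (delay, url) for the first meta refresh tag, or (None, None)."""
    parser = MetaRefreshParser()
    parser.feed(html)
    return parser.delay, parser.url


# The two sample pages from the steps above.
page1 = '<html><head><meta content="0;url=dummy.html" http-equiv="refresh"></head></html>'
page2 = '<html><head><meta http-equiv="refresh" content="0;url=dummy.html"></head></html>'

print(find_meta_refresh(page1))  # (0.0, 'dummy.html')
print(find_meta_refresh(page2))  # (0.0, 'dummy.html')
```

Both pages yield the same result here because the attributes are read from a mapping rather than matched in a fixed sequence.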
Reproduces how often:
Always.
Versions
Scrapy : 2.4.1
lxml : 4.6.2.0
libxml2 : 2.9.10
cssselect : 1.1.0
parsel : 1.6.0
w3lib : 1.22.0
Twisted : 20.3.0
Python : 3.8.5 (default, Jul 28 2020, 12:59:40) - [GCC 9.3.0]
pyOpenSSL : 20.0.1 (OpenSSL 1.1.1i 8 Dec 2020)
cryptography : 3.3.1
Platform : Linux-4.4.0-18362-Microsoft-x86_64-with-glibc2.29