Redirection may not work depending on order of 'content' and 'http-equiv' in meta tag #162

Closed
@gmargari

Description

Scrapy may not follow a meta refresh redirect, depending on the order of the content and http-equiv attributes in the <meta> tag.

Steps to Reproduce

  1. Create two sample pages and serve them with Python's built-in HTTP server:
echo '<html><head><meta content="0;url=dummy.html" http-equiv="refresh"></head></html>' > index1.html
echo '<html><head><meta http-equiv="refresh" content="0;url=dummy.html"></head></html>' > index2.html
python3 -m http.server -d .
  2. In another terminal, open the Scrapy shell:
scrapy shell
>>> fetch('http://localhost:8000')
2021-01-29 21:24:22 [scrapy.core.engine] INFO: Spider opened
2021-01-29 21:24:22 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://localhost:8000/robots.txt> (referer: None)
2021-01-29 21:24:22 [protego] DEBUG: Rule at line 2 without any user agent to enforce it on.
2021-01-29 21:24:22 [protego] DEBUG: Rule at line 10 without any user agent to enforce it on.
2021-01-29 21:24:22 [protego] DEBUG: Rule at line 11 without any user agent to enforce it on.
2021-01-29 21:24:22 [protego] DEBUG: Rule at line 12 without any user agent to enforce it on.
2021-01-29 21:24:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://localhost:8000> (referer: None)

>>> fetch('http://localhost:8000/index1.html')
2021-01-29 21:24:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://localhost:8000/index1.html> (referer: None)

>>> fetch('http://localhost:8000/index2.html')
2021-01-29 21:24:30 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (meta refresh) to <GET http://localhost:8000/dummy.html> from <GET http://localhost:8000/index2.html>
2021-01-29 21:24:30 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://localhost:8000/dummy.html> (referer: None)

Expected behavior:

Redirection happens in both cases.

Actual behavior:

Redirection happens only in the second case (http-equiv before content), not in the first (content before http-equiv).
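
For comparison, here is a minimal sketch of meta refresh detection that does not depend on attribute order. It uses only the Python standard library and is not Scrapy's or w3lib's implementation; the names `MetaRefreshParser` and `find_meta_refresh` are hypothetical, and the parsing of the content value is deliberately simplified to the `delay;url=target` form used in the sample pages.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class MetaRefreshParser(HTMLParser):
    """Hypothetical helper: record the first <meta http-equiv="refresh"> target."""

    def __init__(self):
        super().__init__()
        self.refresh = None  # becomes (delay_seconds, url) once found

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.refresh is not None:
            return
        # attrs is a list of (name, value) pairs; looking attributes up by
        # name makes their order in the source irrelevant.
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("http-equiv", "").lower() != "refresh":
            return
        # Simplified parsing: expects "<delay>;url=<target>".
        delay, _, rest = attr_map.get("content", "").partition(";")
        key, _, url = rest.strip().partition("=")
        if key.strip().lower() != "url":
            return
        try:
            seconds = float(delay.strip())
        except ValueError:
            return
        self.refresh = (seconds, url.strip().strip("'\""))


def find_meta_refresh(html, base_url):
    """Return (delay_seconds, absolute_url), or None if no refresh is found."""
    parser = MetaRefreshParser()
    parser.feed(html)
    if parser.refresh is None:
        return None
    seconds, url = parser.refresh
    return seconds, urljoin(base_url, url)


# The two sample pages from the steps above.
page1 = '<html><head><meta content="0;url=dummy.html" http-equiv="refresh"></head></html>'
page2 = '<html><head><meta http-equiv="refresh" content="0;url=dummy.html"></head></html>'

print(find_meta_refresh(page1, "http://localhost:8000/index1.html"))
print(find_meta_refresh(page2, "http://localhost:8000/index2.html"))
```

With this approach both pages yield the same redirect target, http://localhost:8000/dummy.html, which matches the expected behavior described above.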

Reproduces how often:

Always.

Versions

Scrapy       : 2.4.1
lxml         : 4.6.2.0
libxml2      : 2.9.10
cssselect    : 1.1.0
parsel       : 1.6.0
w3lib        : 1.22.0
Twisted      : 20.3.0
Python       : 3.8.5 (default, Jul 28 2020, 12:59:40) - [GCC 9.3.0]
pyOpenSSL    : 20.0.1 (OpenSSL 1.1.1i  8 Dec 2020)
cryptography : 3.3.1
Platform     : Linux-4.4.0-18362-Microsoft-x86_64-with-glibc2.29
