Skip to content

[MRG+1] Added: Removing comments before extracting base URLs. Not a solution to #70, but does help in some cases. #77

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Nov 7, 2022

Conversation

starrify
Copy link
Contributor

@starrify starrify commented Oct 25, 2016

Helps resolving the issue in such cases, which does happen in several websites:

>>> from w3lib import html
>>> html.get_base_url("""<!-- <base href="http://example.com/" /> -->""")
'http://example.com/'

Fixes #70 (since the original #70 report is about this scenario; for other scenarios, we should have separate issues)

@codecov-io
Copy link

codecov-io commented Oct 25, 2016

Current coverage is 94.10% (diff: 100%)

Merging #77 into master will increase coverage by 0.01%

@@             master        #77   diff @@
==========================================
  Files             7          7          
  Lines           406        407     +1   
  Methods           0          0          
  Messages          0          0          
  Branches         84         84          
==========================================
+ Hits            382        383     +1   
  Misses           16         16          
  Partials          8          8          

Powered by Codecov. Last update 03c28d2...11b5d26

@redapple
Copy link
Contributor

Can you add tests for this?
Can you provide example websites showing this issue?

@starrify starrify force-pushed the remove-comments-for-base-url branch from 93a90c5 to 11b5d26 Compare October 26, 2016 07:18
@starrify
Copy link
Contributor Author

Thanks for the notice @redapple . A test has been added.

Here's a sample site which triggers this issue: http://planweb01.rother.gov.uk/OcellaWeb/planningSearch

@Gallaecio Gallaecio changed the title Added: Removing comments before extracting base URLs. Not a solution to #70, but does help in some cases. [MRG+1] Added: Removing comments before extracting base URLs. Not a solution to #70, but does help in some cases. Aug 8, 2019
@Gallaecio
Copy link
Member

@kmike Could you have a look?

@Gallaecio
Copy link
Member

This is related to #70

@yozachar
Copy link

Bumping to close outdated PR.

@Gallaecio Gallaecio requested review from kmike and wRAR November 4, 2022 17:57
@wRAR wRAR merged commit fb70566 into scrapy:master Nov 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

It's not a good idead to parse HTML text using regular expressions
6 participants