Fix HTML link rewriting and CSS link rewriting #34

satyamtg · 2020-07-19T18:08:47Z

This fixes the following -

~~Proper rewriting of root-relative links~~ - Dropped as its forbidden in ZIM and shall be handled by scrapers given that we plan to get rid of namespaces
CSS URLs are now found in a non greedy manner

Additional changes -

Changed str to Path in test_fix_target_for() as the function it tests expects a Path

rgaudin · 2020-07-20T10:09:36Z

Thank you. There a a couple things I don't understand:

what's the purpose of the regexp change? I'm not sure I understand what greedy means in this context. I don't understand what the ? will trigger here.
Absolute URLs are forbidden in zim. I think those should not be written in the first time. This rewriting stuff should just take take of handling the namespace but not fixing errors in HTML. libzim will soon drop namespaces so we won't be rewritting anymore and that would be an invalid link. whats's the use case?
You should not have to change existing tests to add your feature. Looking at it, I see that the tests were incorrect as not passing expected Path. That should be explained in the PR.

satyamtg · 2020-07-20T10:26:38Z

what's the purpose of the regexp change? I'm not sure I understand what greedy means in this context. I don't understand what the ? will trigger here.

By greedy I mean, without the ?, the regexp would match the longest match. Say if we have -
url("../../font.ttf") format("TrueType"), the match would return "../../font.ttf") format("TrueType". The question mark allows to match the shortest possible match. You can read more here

You should not have to change existing tests to add your feature. Looking at it, I see that the tests were incorrect as not passing expected Path. That should be explained in the PR.

Agrees. Updated the main comment and will take care of this in the future

Absolute URLs are forbidden in zim. I think those should not be written in the first time. This rewriting stuff should just take take of handling the namespace but not fixing errors in HTML. libzim will soon drop namespaces so we won't be rewritting anymore and that would be an invalid link. whats's the use case?

Okay, so rewriting root-relative URLs made the stuff in openedx a bit easier to implement. I did know that root-relative is forbidden in ZIM, but thought that since we are rewriting links anyway, it could help (didn't know libzim is about to drop namespaces). If that's the case then I think I shall handle root-relative in scrapers itself. So, if we chose to drop root-relatives, openzim/openedx#70 wont work at the moment, unless I put another patch. What do you suggest?

rgaudin · 2020-07-20T10:37:18Z

By greedy I mean, without the ?, the regexp would match the longest match. Say if we have -
url("../../font.ttf") format("TrueType"), the match would return "../../font.ttf") format("TrueType". The question mark allows to match the shortest possible match. You can read more here

Thanks for pointing that out. I did not expect that. Could you please add such a test then?

Okay, so rewriting root-relative URLs made the stuff in openedx a bit easier to implement. I did know that root-relative is forbidden in ZIM, but thought that since we are rewriting links anyway, it could help (didn't know libzim is about to drop namespaces). If that's the case then I think I shall handle root-relative in scrapers itself. So, if we chose to drop root-relatives, openzim/openedx#70 wont work at the moment, unless I put another patch. What do you suggest?

I think we should take care of this now. Where are those links coming from btw? Our templates of edx ?

openzim/libzim#325

codecov · 2020-07-20T11:37:06Z

Codecov Report

Merging #34 into master will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##            master       #34   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           19        19           
  Lines          754       757    +3     
=========================================
+ Hits           754       757    +3

Impacted Files	Coverage Δ
src/zimscraperlib/zim/rewriting.py	`100.00% <100.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a032a69...c5d7368. Read the comment docs.

satyamtg · 2020-07-20T13:36:21Z

Pushed some changes to allow external URLs and raise an exception if root-relative URLs are found.

rgaudin · 2020-07-20T14:03:56Z

Pushed some changes to allow external URLs and raise an exception if root-relative URLs are found.

Hum, I don't see the added value of this. This module's role is to rewrite URLs to handle the namespace stuff. That's it.

What you added is a validation check.
While this can be handy in that it lets you know at scraping time that you have an incorrect link, I don't think it's the path to follow as it implies the scraperlib does some kind of validation while it doesn't and probably won't.

There are many things that can be checked and that's why zimcheck exists. We don't use zimcheck enough for now but we should use it more while developing scrapers ; that's why we have a ticket to integrate it with zimfarm.

I'd prefer we stick to “we rewrite what we know needs rewriting and leave the rest as is”.

Also, changelog messages should be more specific here as this project is developer-oriented and the changelog would be first stop for people with an issue to look for answers.

rgaudin

Good. Simplified the change a bit

satyamtg requested a review from rgaudin July 19, 2020 18:09

satyamtg mentioned this pull request Jul 19, 2020

Use pylibzim to create ZIM openzim/openedx#70

Merged

satyamtg force-pushed the fix_link_rewriting branch from 878e0eb to 111d009 Compare July 20, 2020 11:34

rgaudin approved these changes Jul 20, 2020

View reviewed changes

satyamtg changed the title ~~Fix root-relative link rewriting and CSS link rewriting~~ Fix HTML link rewriting and CSS link rewriting Jul 20, 2020

satyamtg force-pushed the fix_link_rewriting branch from 591a769 to 1de6a8a Compare July 20, 2020 14:17

satyamtg requested a review from rgaudin July 20, 2020 14:20

satyamtg and others added 2 commits July 20, 2020 14:32

Fix root-relative link rewriting and CSS link rewriting

488977c

Do not fix target if link has a netloc

c5d7368

rgaudin force-pushed the fix_link_rewriting branch from 1de6a8a to c5d7368 Compare July 20, 2020 14:35

rgaudin approved these changes Jul 20, 2020

View reviewed changes

rgaudin merged commit 7227754 into master Jul 20, 2020

rgaudin deleted the fix_link_rewriting branch July 20, 2020 14:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Fix HTML link rewriting and CSS link rewriting #34

Fix HTML link rewriting and CSS link rewriting #34

Uh oh!

satyamtg commented Jul 19, 2020 •

edited

Loading

Uh oh!

rgaudin commented Jul 20, 2020

Uh oh!

satyamtg commented Jul 20, 2020

Uh oh!

rgaudin commented Jul 20, 2020

Uh oh!

codecov bot commented Jul 20, 2020 •

edited

Loading

Uh oh!

satyamtg commented Jul 20, 2020

Uh oh!

rgaudin commented Jul 20, 2020

Uh oh!

rgaudin left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Fix HTML link rewriting and CSS link rewriting #34

Fix HTML link rewriting and CSS link rewriting #34

Uh oh!

Conversation

satyamtg commented Jul 19, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rgaudin commented Jul 20, 2020

Uh oh!

satyamtg commented Jul 20, 2020

Uh oh!

rgaudin commented Jul 20, 2020

Uh oh!

codecov bot commented Jul 20, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

satyamtg commented Jul 20, 2020

Uh oh!

rgaudin commented Jul 20, 2020

Uh oh!

rgaudin left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

satyamtg commented Jul 19, 2020 •

edited

Loading

codecov bot commented Jul 20, 2020 •

edited

Loading