Skip to content

Conversation

@satyamtg
Copy link
Contributor

@satyamtg satyamtg commented Jul 19, 2020

This fixes the following -

  • Proper rewriting of root-relative links - Dropped as its forbidden in ZIM and shall be handled by scrapers given that we plan to get rid of namespaces
  • CSS URLs are now found in a non greedy manner

Additional changes -

  • Changed str to Path in test_fix_target_for() as the function it tests expects a Path

@rgaudin
Copy link
Member

rgaudin commented Jul 20, 2020

Thank you. There a a couple things I don't understand:

  • what's the purpose of the regexp change? I'm not sure I understand what greedy means in this context. I don't understand what the ? will trigger here.
  • Absolute URLs are forbidden in zim. I think those should not be written in the first time. This rewriting stuff should just take take of handling the namespace but not fixing errors in HTML. libzim will soon drop namespaces so we won't be rewritting anymore and that would be an invalid link. whats's the use case?
  • You should not have to change existing tests to add your feature. Looking at it, I see that the tests were incorrect as not passing expected Path. That should be explained in the PR.

@satyamtg
Copy link
Contributor Author

  • what's the purpose of the regexp change? I'm not sure I understand what greedy means in this context. I don't understand what the ? will trigger here.

By greedy I mean, without the ?, the regexp would match the longest match. Say if we have -
url("../../font.ttf") format("TrueType"), the match would return "../../font.ttf") format("TrueType". The question mark allows to match the shortest possible match. You can read more here

  • You should not have to change existing tests to add your feature. Looking at it, I see that the tests were incorrect as not passing expected Path. That should be explained in the PR.

Agrees. Updated the main comment and will take care of this in the future

  • Absolute URLs are forbidden in zim. I think those should not be written in the first time. This rewriting stuff should just take take of handling the namespace but not fixing errors in HTML. libzim will soon drop namespaces so we won't be rewritting anymore and that would be an invalid link. whats's the use case?

Okay, so rewriting root-relative URLs made the stuff in openedx a bit easier to implement. I did know that root-relative is forbidden in ZIM, but thought that since we are rewriting links anyway, it could help (didn't know libzim is about to drop namespaces). If that's the case then I think I shall handle root-relative in scrapers itself. So, if we chose to drop root-relatives, openzim/openedx#70 wont work at the moment, unless I put another patch. What do you suggest?

@rgaudin
Copy link
Member

rgaudin commented Jul 20, 2020

By greedy I mean, without the ?, the regexp would match the longest match. Say if we have -
url("../../font.ttf") format("TrueType"), the match would return "../../font.ttf") format("TrueType". The question mark allows to match the shortest possible match. You can read more here

Thanks for pointing that out. I did not expect that. Could you please add such a test then?

Okay, so rewriting root-relative URLs made the stuff in openedx a bit easier to implement. I did know that root-relative is forbidden in ZIM, but thought that since we are rewriting links anyway, it could help (didn't know libzim is about to drop namespaces). If that's the case then I think I shall handle root-relative in scrapers itself. So, if we chose to drop root-relatives, openzim/openedx#70 wont work at the moment, unless I put another patch. What do you suggest?

I think we should take care of this now. Where are those links coming from btw? Our templates of edx ?

openzim/libzim#325

@satyamtg satyamtg force-pushed the fix_link_rewriting branch from 878e0eb to 111d009 Compare July 20, 2020 11:34
@codecov
Copy link

codecov bot commented Jul 20, 2020

Codecov Report

Merging #34 into master will not change coverage.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff            @@
##            master       #34   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           19        19           
  Lines          754       757    +3     
=========================================
+ Hits           754       757    +3     
Impacted Files Coverage Δ
src/zimscraperlib/zim/rewriting.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a032a69...c5d7368. Read the comment docs.

@satyamtg
Copy link
Contributor Author

Pushed some changes to allow external URLs and raise an exception if root-relative URLs are found.

@rgaudin
Copy link
Member

rgaudin commented Jul 20, 2020

Pushed some changes to allow external URLs and raise an exception if root-relative URLs are found.

Hum, I don't see the added value of this. This module's role is to rewrite URLs to handle the namespace stuff. That's it.

What you added is a validation check.
While this can be handy in that it lets you know at scraping time that you have an incorrect link, I don't think it's the path to follow as it implies the scraperlib does some kind of validation while it doesn't and probably won't.

There are many things that can be checked and that's why zimcheck exists. We don't use zimcheck enough for now but we should use it more while developing scrapers ; that's why we have a ticket to integrate it with zimfarm.

I'd prefer we stick to “we rewrite what we know needs rewriting and leave the rest as is”.

Also, changelog messages should be more specific here as this project is developer-oriented and the changelog would be first stop for people with an issue to look for answers.

@satyamtg satyamtg changed the title Fix root-relative link rewriting and CSS link rewriting Fix HTML link rewriting and CSS link rewriting Jul 20, 2020
@satyamtg satyamtg force-pushed the fix_link_rewriting branch from 591a769 to 1de6a8a Compare July 20, 2020 14:17
@satyamtg satyamtg requested a review from rgaudin July 20, 2020 14:20
@rgaudin rgaudin force-pushed the fix_link_rewriting branch from 1de6a8a to c5d7368 Compare July 20, 2020 14:35
Copy link
Member

@rgaudin rgaudin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good. Simplified the change a bit

@rgaudin rgaudin merged commit 7227754 into master Jul 20, 2020
@rgaudin rgaudin deleted the fix_link_rewriting branch July 20, 2020 14:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants