Use pylibzim to create ZIM #70

satyamtg · 2020-07-19T18:52:43Z

This fixes #53 and uses pylibzim to create ZIMs. It currently relies on openzim/python-scraperlib#34, so the requirement is pinned to that branch. It also fixes #24, which was necessary to make pylibzim work.

Openedx instances have many root-relative links. We now rewrite each of them: if the page it points to is present in the ZIM, the link becomes a plain relative link; otherwise it becomes an external URL, built by adding the instance netloc.
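To illustrate the rule above, here is a minimal sketch, not the code in this PR. The function name, its parameters, the `offlined_pages` set and the example instance URL are all assumptions made for the example; ZIM paths are written without a leading slash.

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def fix_root_relative_link(href, current_page, offlined_pages, instance_url):
    """Rewrite a root-relative href found in `current_page` (hypothetical helper)."""
    parts = urlsplit(href)
    # Only root-relative links ("/foo", not "//host/foo" or "https://...") are touched.
    if parts.scheme or parts.netloc or not parts.path.startswith("/"):
        return href
    target = parts.path.lstrip("/")
    if target in offlined_pages:
        # Target is in the ZIM: express it relative to the current page's folder.
        start = posixpath.dirname(current_page) or "."
        relative = posixpath.relpath(target, start)
        return relative + ("#" + parts.fragment if parts.fragment else "")
    # Target is not in the ZIM: point to the online instance instead.
    return urljoin(instance_url, href)


offlined = {"courses/intro/index.html"}  # assumed set of pages present in the ZIM
print(fix_root_relative_link(
    "/courses/intro/index.html", "courses/week1/index.html",
    offlined, "https://edx.example.org",
))
```

With this approach, no link in the ZIM starts with `/`, so pages keep working wherever the ZIM content is served from.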

The following changes are made in scraper.py related to link rewriting:

Rewriting is done just after the dependency download is complete.
To avoid running link rewriting before the list of all offline tabs is available, the previous annex() method is renamed to get_course_tabs(), which only gets the course tabs, and the new annex() method actually downloads the content. get_course_tabs() is reused in rewrite_internal_links().
Internal links are of two types: one points directly to an xblock (with jump_to in the path; see the xblock_json), and the other points to a tab (whose path is determined by get_course_tabs(), as we do not offline all tabs).
handle_jump_to_path() takes a jump_to-type URL, finds the xblock with that URL in the list of xblock_extractor objects, checks whether the xblock is a vertical or a course, and returns the modified link. As only course and vertical xblocks have HTML pages, we also look at the descendants for linkable xblocks here.
relative_dots() prepares a path of backward jumps according to the number of parts in the path.
update_root_relative_path() ensures that no root-relative URLs are left by putting the instance_url in place of the netloc.
rewrite_internal_links() is the main manager method and calls the other functions. For jump_to links, if the first try does not yield a path, we try with the parent, as the link may point to an xblock that the vertical xblock is made of.
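The jump_to handling described above could look roughly like the sketch below. It is a simplified illustration under stated assumptions, not the actual scraper code: the `XBlock` dataclass stands in for the xblock_extractor objects, its fields and `resolve_jump_to()` are hypothetical, and the parent fallback is modelled as a simple walk up the tree. Only the name `relative_dots()` comes from the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class XBlock:
    """Hypothetical stand-in for an xblock_extractor object."""
    jump_to_path: str   # path of the jump_to URL for this xblock
    category: str       # e.g. "course", "vertical", "html"
    folder: str         # ZIM folder of its page (only used for course/vertical)
    parent: Optional["XBlock"] = None


def relative_dots(path):
    """Return one "../" per folder level of a ZIM path such as "a/b/index.html"."""
    depth = len([part for part in path.split("/") if part]) - 1
    return "../" * max(depth, 0)


def find_html_path(block):
    """Walk up from `block` to the nearest xblock that has its own HTML page."""
    while block is not None:
        if block.category in ("course", "vertical"):
            return f"{block.folder}/index.html"
        block = block.parent  # fall back to the parent xblock
    return None


def resolve_jump_to(path, blocks, current_page):
    """Return a relative link for a jump_to path, or None if nothing matches."""
    match = next((b for b in blocks if b.jump_to_path == path), None)
    target = find_html_path(match)
    if target is None:
        return None
    return relative_dots(current_page) + target


vertical = XBlock("/courses/c1/jump_to/v1", "vertical", "course/chapter/seq/v1")
html = XBlock("/courses/c1/jump_to/h1", "html", "", parent=vertical)
print(resolve_jump_to(html.jump_to_path, [vertical, html],
                      "course/chapter/seq/v2/index.html"))
# prints ../../../../course/chapter/seq/v1/index.html
```

The link to the html xblock resolves to the page of its parent vertical, since the html xblock has no page of its own.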

Note: this depends on a future release of zimscraperlib.


rgaudin

did some simplifications so it's easier to read

using path_fixed and fixed_path for different stuff in same method is bad.
I've renamed those.

Also moved relative_dots and update_root_relative_path into rewrite_internal_links.

Codefactor may complain about complexity (my editor didn't, but its config is different).
In that case we'd move those methods back.


satyamtg · 2020-07-21T06:03:12Z

Codefactor may complain about complexity (my editor didn't, but its config is different).
In that case we'd move those methods back.

Codefactor doesn't complain, so I kept them there. However, the change broke link fixing: links starting with path_prefix were never fixed, because the absolute links were fixed before them. I've fixed that.

I also created another module, html_processor.py, and moved all HTML parsing, dependency downloading and link fixing there, because scraper.py had become too long. I renamed dl_dependencies to dl_dependencies_and_fix_links and added docstrings for the moved methods. The wiki and forum checks no longer test whether the attribute exists; they now check the value of self.wiki or self.forum.

rgaudin

OK

satyamtg requested a review from rgaudin July 19, 2020 18:52

Fixes #53 - Use pylibzim to create ZIM

334656e

satyamtg force-pushed the pylibzim_support branch from a422b33 to 334656e July 19, 2020 19:10

satyamtg mentioned this pull request Jul 20, 2020

Fix HTML link rewriting and CSS link rewriting openzim/python-scraperlib#34

Merged

rgaudin requested changes Jul 20, 2020

openedx2zim/scraper.py (outdated)

requirements.txt (outdated)

satyamtg linked an issue Jul 20, 2020 that may be closed by this pull request

Mooc inter-xblocks link are not (always) rewritten properly #24

Closed

satyamtg force-pushed the pylibzim_support branch from 5a0d951 to d0c9880 July 20, 2020 14:02

satyamtg changed the title ~~Fixes #53 - Use pylibzim to create ZIM~~ Use pylibzim to create ZIM Jul 20, 2020

Fixes #24 - Fix internal links and don't allow root-relative links

b5778cd

satyamtg force-pushed the pylibzim_support branch from d0c9880 to b5778cd July 20, 2020 15:28

satyamtg requested a review from rgaudin July 20, 2020 15:35

satyamtg self-assigned this Jul 20, 2020

satyamtg marked this pull request as draft July 20, 2020 15:36

did some simplifications so it's easier to read

9d7bc09

using `path_fixed` and `fixed_path` for different stuff in same method is bad. I've renamed those. Also moved `relative_dots` and `update_root_relative_path` into `rewrite_internal_links`.

rgaudin requested changes Jul 20, 2020

openedx2zim/scraper.py (outdated, 3 review threads)

Move dependency downloading and links fixing to html_processor.py and add docstrings

bdc4c22

satyamtg force-pushed the pylibzim_support branch from a5ed85f to bdc4c22 July 21, 2020 05:50

satyamtg requested a review from rgaudin July 21, 2020 06:03

satyamtg marked this pull request as ready for review July 21, 2020 06:03

rgaudin approved these changes Jul 21, 2020


rgaudin merged commit 7cba0e4 into master Jul 21, 2020

rgaudin deleted the pylibzim_support branch July 21, 2020 08:08
