Skip to content

Conversation

@satyamtg
Copy link
Contributor

@satyamtg satyamtg commented Jul 19, 2020

This fixes #53 and uses pylibzim to create ZIMs. It currently relies on openzim/python-scraperlib#34 and thus has a requirement from that branch itself. Also fixes #24 which was necessary to make pylibzim work.

Openedx instances have many root-relative links and we correctly fix them to be not root relative but just relative if the page that it points is present in the ZIM or else point to an external URL by adding the instance netloc.

The following changes are made in scraper.py related to link rewriting -

  • Rewriting is done just after the dependency download is complete
  • In order to avoid running link rewriting, before having a list of all offline tabs, the previous annex() method is renamed to get_course_tabs() which only gets the course tabs and the new annex() method actually downloads the content. get_course_tabs() is reused in rewrite_internal_links().
  • Internal links are basically of 2 types, one directly pointing to an xblock (with a jump_to in the path (see the xblock_json)), and other one pointing to a tab (for which path is determined by get_course_tabs() as we do not offline all tabs).
  • handle_jump_to_path() compares the jump_to type URL and finds the xblock with that URL from the list of xblock_extractor objects, and checks if the xblock is a vertical or course and returns the modified link. As only course and vertical have HTMLs, we look at the descendants for linkable xblocks too here.
  • relative_dots() prepares a path of backward jumps, according to the number of parts in the path
  • update_root_relative_path() writes ensures that no root relative URLs are left out by putting the instance_url in place of the netloc.
  • rewrite_internal_links() is the main manager method. It calls the other functions. In case of jump_to links, if in the first try we do not get a path, we try with the parent as it may be pointing to an xblock with which the vertical xblock is made.

Note this depends on a future release of zimscraperlib

@satyamtg satyamtg requested a review from rgaudin July 19, 2020 18:52
@satyamtg satyamtg linked an issue Jul 20, 2020 that may be closed by this pull request
@satyamtg satyamtg changed the title Fixes #53 - Use pylibzim to create ZIM Use pylibzim to create ZIM Jul 20, 2020
@satyamtg satyamtg requested a review from rgaudin July 20, 2020 15:35
@satyamtg satyamtg self-assigned this Jul 20, 2020
@satyamtg satyamtg marked this pull request as draft July 20, 2020 15:36
using `path_fixed` and `fixed_path` for different stuff in same method is bad.
I've renamed those.

Also moved `relative_dots` and `update_root_relative_path` into `rewrite_internal_links`.
Copy link
Member

@rgaudin rgaudin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did some simplifications so it's easier to read

using path_fixed and fixed_path for different stuff in same method is bad.
I've renamed those.

Also moved relative_dots and update_root_relative_path into rewrite_internal_links.

Codefactor may complain about complexity (my editor didn't but config is different).
In this case we'd moved those methods back

@satyamtg
Copy link
Contributor Author

Codefactor may complain about complexity (my editor didn't but config is different).
In this case we'd moved those methods back

Codefactor doesn't complain so kept it there. Also, the change broke it as the links starting with path_prefix were never fixed, as the absolute links were fixed before. So fixed that.

Also made another module html_processor.py and moved all HTML parsing, dependency downloading and link fixing there. I did this because the scraper.py went too long. Also renamed dl_dependencies to dl_dependencies_and_fix_links and added docstrings for them. Changed the attribute checking for wiki and forum and check value of self.wiki or self.forum now.

@satyamtg satyamtg requested a review from rgaudin July 21, 2020 06:03
@satyamtg satyamtg marked this pull request as ready for review July 21, 2020 06:03
Copy link
Member

@rgaudin rgaudin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

@rgaudin rgaudin merged commit 7cba0e4 into master Jul 21, 2020
@rgaudin rgaudin deleted the pylibzim_support branch July 21, 2020 08:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add python-libzim based ZIM creation Mooc inter-xblocks link are not (always) rewritten properly

2 participants