Enhance image processing and more #63

Merged
merged 9 commits into from
Nov 11, 2024
@benoit74 (Contributor) commented Nov 7, 2024

Fix #61
Fix #27

Changes:

  • convert all supported images that come from the pages (i.e. not the ones coming from CSS for buttons, ...) to WebP, and optimize them with the WebP Medium preset
    • I also experimented with resizing the big ones, but it barely changed the ZIM size (less than 1%) => not worth it
  • cache optimized images in the S3 cache, and use the cache when available
  • add backoff to retry on network failures, with exponential wait time and jitter
  • add WebP polyfill
  • add joblib to process assets in parallel, 10 at a time by default (currently only images, but it could be something else)
  • use the new file/folder validation methods from scraperlib
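
For readers unfamiliar with the retry strategy mentioned above, here is a minimal sketch of exponential backoff with jitter using only the standard library. It is not the code from this PR: the function name `retry_with_backoff`, its parameters, and the choice of "full jitter" (a random wait between 0 and the exponential cap) are all assumptions made for illustration.

```python
import random
import time


def retry_with_backoff(
    func,
    *,
    max_tries=5,
    base_delay=1.0,
    max_delay=60.0,
    retry_on=(ConnectionError, TimeoutError),
    sleep=time.sleep,
):
    """Call func(), retrying on network-style errors (hypothetical helper)."""
    for attempt in range(max_tries):
        try:
            return func()
        except retry_on:
            if attempt == max_tries - 1:
                raise  # out of attempts: surface the last error
            # Exponential cap: base_delay * 2**attempt, bounded by max_delay.
            cap = min(max_delay, base_delay * 2**attempt)
            # Full jitter: wait a random time in [0, cap] so that many
            # clients failing at once do not all retry at the same moment.
            sleep(random.uniform(0, cap))
```

The `sleep` parameter is only there so the wait can be replaced in tests; in normal use the default `time.sleep` applies.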

The test ZIM (1% of geo.libretexts.org, as mentioned in #61) shrinks from 207M to 107M, and execution time ranges from about 3 minutes (when everything is in the S3 cache) to about 7 minutes (when everything has to be downloaded, optimized and uploaded to the cache).
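
The gap between the cached and uncached timings comes from checking the cache before doing any work. A rough sketch of that flow is below, with a plain dict standing in for the S3 cache. The helper name `get_optimized` and its callable parameters are hypothetical, and how the real scraper builds cache keys is not shown here.

```python
from typing import Callable, MutableMapping


def get_optimized(
    key: str,
    cache: MutableMapping[str, bytes],
    download: Callable[[], bytes],
    optimize: Callable[[bytes], bytes],
) -> bytes:
    """Return the optimized asset for key, reusing the cache when possible."""
    cached = cache.get(key)
    if cached is not None:
        # Cache hit: no download, no optimization, no upload.
        return cached
    optimized = optimize(download())
    # Assumption: this assignment stands in for uploading to the S3 cache.
    cache[key] = optimized
    return optimized
```

On a second run with a warm cache, `download` and `optimize` are never called for that key, which is where the time saving would come from.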

I chose joblib for parallel job execution because:

  • it is widely used
  • it is maintained, even if not very actively
  • it is pure Python
  • it can use multiple backends for executing tasks (threads, various kinds of processes)
  • it stops cleanly upon the first exception in a task and reports the exception properly (even when using processes)
  • it supports return values from parallel tasks (not needed here, more of a general argument)
  • creating and maintaining our own executor seems, in the end, to be a waste of time
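
To illustrate the points above, here is a small sketch of joblib running up to 10 tasks in parallel on the threading backend. `Parallel` and `delayed` are joblib's actual entry points, but `process_asset`, `process_all` and the example URLs are hypothetical stand-ins, not the scraper's code.

```python
from joblib import Parallel, delayed


def process_asset(url: str) -> str:
    """Hypothetical stand-in for downloading and optimizing one asset."""
    if not url.startswith("https://"):
        raise ValueError(f"unsupported asset URL: {url}")
    return url.rsplit("/", 1)[-1]


def process_all(urls, n_jobs=10):
    # Threads suit I/O-bound work such as downloads.
    # Results come back as a list in the same order as the inputs.
    return Parallel(n_jobs=n_jobs, backend="threading")(
        delayed(process_asset)(url) for url in urls
    )
```

If any task raises, the exception propagates out of the `Parallel(...)` call to the caller, so a single failing asset is enough to stop the batch.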

@benoit74 self-assigned this Nov 7, 2024

codecov bot commented Nov 7, 2024

Codecov Report

Attention: Patch coverage is 34.66667% with 98 lines in your changes missing coverage. Please review.

Project coverage is 45.22%. Comparing base (b403f74) to head (a3b696e).
Report is 10 commits behind head on main.

Files with missing lines                     Patch %   Lines
scraper/src/mindtouch2zim/asset.py           32.11%    74 Missing ⚠️
scraper/src/mindtouch2zim/processor.py       30.76%    18 Missing ⚠️
scraper/src/mindtouch2zim/entrypoint.py      20.00%    4 Missing ⚠️
scraper/src/mindtouch2zim/utils.py           75.00%    2 Missing ⚠️

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #63      +/-   ##
==========================================
- Coverage   46.49%   45.22%   -1.27%     
==========================================
  Files          12       13       +1     
  Lines         727      849     +122     
  Branches       94      111      +17     
==========================================
+ Hits          338      384      +46     
- Misses        376      452      +76     
  Partials       13       13              

@benoit74 marked this pull request as ready for review November 8, 2024 13:07
@benoit74 requested a review from rgaudin November 11, 2024 08:13

@rgaudin (Member) left a comment

LGTM

@benoit74 merged commit d571ab8 into main Nov 11, 2024
10 checks passed
@benoit74 deleted the image_compression branch November 11, 2024 10:48
Successfully merging this pull request may close these issues.

  • Better handling of images
  • Use proper zimscraperlib method to validate file is creatable.