-
Notifications
You must be signed in to change notification settings - Fork 1
Preserve images on gallery-style pages #5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Add smart fallback logic to detect when Readability.js strips >50% of images from specific CMS containers and fall back to direct extraction with cleaning. New helper functions: - countImagesInHtml(): Count images in HTML string using DOMParser - cleanContainerHtml(): Strip non-content elements (styles, scripts, promos) - findContentContainer(): Find CMS-specific containers (.rte, .entry-content, etc) Modified extractContent() to detect gallery pages where text-to-image ratio is low and use the CMS container directly instead of Readability's output. This preserves all images while still removing styles and promo content. Fixes issue where Shopify gallery pages (e.g. blambot.com) had images stripped while maintaining clean output for news sites (BBC, TOI, etc).
|
Claude finished @gupsammy's task —— View job Code Review: Image Preservation on Gallery Pages ✅Review SummaryVerdict: Ship it. This PR successfully solves the gallery page problem with clean, pragmatic code. No critical or high-priority issues found. What I ReviewedThe PR adds intelligent fallback logic to detect when Readability.js strips too many images (>50% loss) and falls back to direct CMS container extraction with cleanup. I checked for:
Code Analysis✅ Three new helper functions:
✅ Modified
Security & Safety✅ No security issues
✅ Manifest V3 compliant
Logic & Edge Cases✅ The 50% threshold is reasonable
✅ Fallback safety
Minor Observations (not blockers)Promo selector patterns (lines 81-91): No error handling on Testing CoverageBased on the PR description, you tested:
This is solid coverage for the intended use cases. Final ThoughtsThis is exactly the kind of pragmatic solution that makes sense for a solo project:
No changes needed. Ready to merge. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| for (const selector of selectors) { | ||
| const el = document.querySelector(selector); | ||
| if (el && el.querySelectorAll('img').length > 0) { | ||
| return el; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Choose primary CMS container before triggering fallback
findContentContainer() picks the first matching CMS selector that contains any <img> and feeds it into the >50% image-loss check. On pages with multiple .rte/.entry-content blocks (e.g., Shopify themes with header/footer rich-text widgets or product recommendation sections), the first match can be a small non-article fragment. If Readability drops that fragment’s logo image, imageLossRatio exceeds 0.5 and the new fallback returns only that snippet, replacing the real article/gallery content that Readability extracted. Selecting the most content-rich container (e.g., highest text or image count) rather than the first match would avoid this regression.
Useful? React with 👍 / 👎.
Summary
Fixed an issue where Readability.js strips images from gallery-style pages (like Shopify stores with many images and little text). Added intelligent fallback logic that detects when >50% of images are lost during content extraction and falls back to direct extraction from CMS-specific containers with cleaning.
Changes
Added three new helper functions to
lib/ezycopy.js:Modified the extractContent() function to add smart fallback logic:
Problem Solved
Readability's heuristic removes images when text-to-image ratio is low (p/img < 0.5). This caused:
Testing