Preserve images on gallery-style pages #5

gupsammy · 2025-11-29T06:41:52Z

Summary

Fixed an issue where Readability.js strips images from gallery-style pages (like Shopify stores with many images and little text). Added intelligent fallback logic that detects when >50% of images are lost during content extraction and falls back to direct extraction from CMS-specific containers with cleaning.

Changes

Added three new helper functions to lib/ezycopy.js:

countImagesInHtml() - Counts images in an HTML string using DOMParser to detect image loss
cleanContainerHtml() - Strips non-content elements (style tags, scripts, promo/ad elements) before Turndown processes the HTML
findContentContainer() - Finds CMS-specific content containers (.rte for Shopify, .entry-content for WordPress, etc.) that reliably contain only article content
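
As a rough illustration of the helpers described above, here is a minimal sketch of how countImagesInHtml() and findContentContainer() might look. It is not the merged code: the selector list is limited to the two selectors named in this description, and the `root` parameter is a hypothetical addition that makes the function easier to exercise outside a page.

```javascript
// Hypothetical sketch; not the code merged in lib/ezycopy.js.

// Count <img> elements in an HTML string. DOMParser builds an inert
// document, so scripts contained in the string are not executed.
function countImagesInHtml(html) {
  const doc = new DOMParser().parseFromString(html, 'text/html');
  return doc.querySelectorAll('img').length;
}

// Only the two selectors named in the PR description are listed here;
// the real list is longer ("etc." in the description).
const CMS_SELECTORS = ['.rte', '.entry-content'];

// Return the first CMS-specific container that holds at least one image,
// or fall back to the (generic) document body.
function findContentContainer(root = document) {
  for (const selector of CMS_SELECTORS) {
    const el = root.querySelector(selector);
    if (el && el.querySelectorAll('img').length > 0) {
      return el;
    }
  }
  return root.body;
}
```

In a content script, findContentContainer() would be called with no argument so that it searches the live document.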

Modified the extractContent() function to add smart fallback logic:

Detects when Readability strips >50% of images from a specific CMS container
Falls back to extracting from the CMS container directly with cleaning
Only triggers for CMS-specific containers, not generic ones like body (prevents false positives on news sites)
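
The decision described in the three points above can be sketched as a small pure function. The function name, parameter names, and constant are illustrative assumptions, not identifiers confirmed from lib/ezycopy.js.

```javascript
// Hypothetical sketch of the fallback decision; names are illustrative.
const IMAGE_LOSS_THRESHOLD = 0.5;

function shouldUseFallback(containerImageCount, readabilityImageCount, isSpecificContainer) {
  // Never fall back for generic containers such as <body>, and skip
  // containers without images so the ratio never divides by zero.
  if (!isSpecificContainer || containerImageCount <= 0) {
    return false;
  }
  const imageLossRatio =
    (containerImageCount - readabilityImageCount) / containerImageCount;
  return imageLossRatio > IMAGE_LOSS_THRESHOLD;
}
```

With these assumptions, a gallery where Readability keeps 0 of 24 images triggers the fallback, while keeping 12 of 24 (a loss of exactly 50%) does not, because the comparison is strict.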

Problem Solved

Readability's heuristic removes images when the text-to-image ratio is low (p/img < 0.5). As a result, gallery pages like blambot.com/pages/lettering-tips had all 24 images stripped. With this change:

Gallery pages are now detected and fall back to direct extraction, preserving all images
News sites like BBC and TOI continue to use Readability's clean output (no false positives)
Promo content and CSS styles are stripped from the fallback extraction
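
To make the p/img < 0.5 condition concrete, here is a simplified illustration of the ratio check on its own. Reading p and img as counts of paragraph and image elements is an assumption, and Readability's real cleaning logic weighs other signals as well, so this is not its actual implementation.

```javascript
// Simplified illustration of the ratio check described above; this is
// not Readability's code, which considers several other signals too.
function hasLowParagraphToImageRatio(paragraphCount, imageCount) {
  if (imageCount === 0) {
    return false; // no images, so there is nothing to strip
  }
  return paragraphCount / imageCount < 0.5;
}
```

For example, a block with a hypothetical 2 paragraphs and 24 images has a ratio of about 0.08 and would be flagged, while 10 paragraphs and 3 images would not.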

Testing

Blambot (Shopify gallery): All 24 images now captured
Times of India: Uses Readability, clean output preserved
BBC News: Uses Readability, clean output preserved
Ars Technica: Fallback triggers but CSS/promos are cleaned

Add smart fallback logic to detect when Readability.js strips >50% of images from specific CMS containers and fall back to direct extraction with cleaning.

New helper functions:

- countImagesInHtml(): Count images in HTML string using DOMParser
- cleanContainerHtml(): Strip non-content elements (styles, scripts, promos)
- findContentContainer(): Find CMS-specific containers (.rte, .entry-content, etc)

Modified extractContent() to detect gallery pages where text-to-image ratio is low and use the CMS container directly instead of Readability's output. This preserves all images while still removing styles and promo content.

Fixes issue where Shopify gallery pages (e.g. blambot.com) had images stripped while maintaining clean output for news sites (BBC, TOI, etc).

claude · 2025-11-29T06:42:04Z


Code Review: Image Preservation on Gallery Pages ✅

Review Summary

Verdict: Ship it. This PR successfully solves the gallery page problem with clean, pragmatic code. No critical or high-priority issues found.

What I Reviewed

The PR adds intelligent fallback logic to detect when Readability.js strips too many images (>50% loss) and falls back to direct CMS container extraction with cleanup. I checked for:

Critical issues: Crashes, Chrome API misuse, MV3 violations, security vulnerabilities
High-priority issues: User-facing bugs, memory leaks, race conditions
Chrome Extension specifics: Manifest V3 compliance, content script safety

Code Analysis

✅ Three new helper functions:

countImagesInHtml(html) (lines 62-66)
- Uses DOMParser to safely count images in HTML string
- Safe for content scripts - no script execution
cleanContainerHtml(container) (lines 73-97)
- Clones node to avoid DOM mutations
- Removes scripts, styles, and common promo patterns
- Good defensive coding with the clone
findContentContainer() (lines 104-124)
- Checks CMS-specific selectors (.rte, .entry-content, etc.)
- Only returns container if it has images
- Smart fallback to document.body
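
The clone-then-clean pattern noted for cleanContainerHtml() can be sketched roughly as follows. The selector list and the choice to return an HTML string are assumptions made for illustration; the merged function may differ.

```javascript
// Hypothetical sketch; the selector list is illustrative, not the merged one.
const NON_CONTENT_SELECTORS = ['script', 'style', '[class*="promo"]'];

function cleanContainerHtml(container) {
  // Deep-clone first so removals never touch the live page.
  const clone = container.cloneNode(true);
  for (const selector of NON_CONTENT_SELECTORS) {
    clone.querySelectorAll(selector).forEach((el) => el.remove());
  }
  // Assumed return value: the cleaned markup, ready for Turndown.
  return clone.innerHTML;
}
```

Because all removals happen on the clone, the page the user is viewing is left unchanged.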

✅ Modified extractContent() function (lines 219-244)

Detects image loss: (containerImageCount - readabilityImageCount) / containerImageCount > 0.5
Only triggers fallback for specific CMS containers (not generic body)
Preserves Readability's metadata (article.title, article.byline)

Security & Safety

✅ No security issues

DOMParser is safe - doesn't execute scripts
cloneNode(true) prevents unintended DOM mutations
No Chrome API misuse
No user input validation needed (all DOM-based)

✅ Manifest V3 compliant

All code runs in content script context
No deprecated APIs used
No CSP violations

Logic & Edge Cases

✅ The 50% threshold is reasonable

Catches gallery pages (100% loss) without false positives
Tested on Blambot (Shopify), BBC, TOI, Ars Technica

✅ Fallback safety

isSpecificContainer check prevents triggering on news sites with generic containers
Division by zero handled: containerImageCount > 0 check before ratio calculation
Fallback still uses Readability's article.title and article.byline (good UX)

Minor Observations (not blockers)

Promo selector patterns (lines 81-91):
The attribute substring selectors like [class*="promo"] will match any class containing "promo", such as class="promotion" or class="promoter-bio". This is probably fine for your use case - overly aggressive cleaning is better than preserving junk on gallery pages. If you ever see false positives, you can tighten these later.
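
If tightening ever becomes necessary, one option is the whitespace-token selector [class~="promo"] instead of the substring selector. The sketch below mimics how the two selectors treat a class attribute value without needing a DOM; the function names are made up for illustration.

```javascript
// Illustrative only: mimics how the two attribute selectors match a
// class attribute value, without needing a DOM.

// [class*="promo"]: the attribute value contains the substring anywhere.
function matchesSubstringSelector(classAttr, needle) {
  return classAttr.includes(needle);
}

// [class~="promo"]: one whitespace-separated token equals the word exactly.
function matchesTokenSelector(classAttr, word) {
  return classAttr.split(/\s+/).filter(Boolean).includes(word);
}
```

Under this model, class="promotion-banner" matches the substring form but not the token form, while class="card promo" matches both.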

No error handling on querySelector:
The code assumes container.querySelectorAll('img') and clone.querySelectorAll() won't throw. This is safe: querySelectorAll only throws for a syntactically invalid selector, and the selectors used here are fixed strings in the code.

Testing Coverage

Based on the PR description, you tested:

✅ Blambot (Shopify gallery): 24 images preserved
✅ BBC News: Readability still used (no fallback)
✅ Times of India: Readability still used
✅ Ars Technica: Fallback triggered, promos cleaned

This is solid coverage for the intended use cases.

Final Thoughts

This is exactly the kind of pragmatic solution that makes sense for a solo project:

Solves a real problem (gallery pages losing images)
Doesn't over-engineer (simple 50% threshold)
Preserves existing behavior for normal articles
Clean, readable code

No changes needed. Ready to merge.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.


chatgpt-codex-connector · 2025-11-29T06:46:14Z

lib/ezycopy.js

+  for (const selector of selectors) {
+    const el = document.querySelector(selector);
+    if (el && el.querySelectorAll('img').length > 0) {
+      return el;


Choose primary CMS container before triggering fallback

findContentContainer() picks the first matching CMS selector that contains any <img> and feeds it into the >50% image-loss check. On pages with multiple .rte/.entry-content blocks (e.g., Shopify themes with header/footer rich-text widgets or product recommendation sections), the first match can be a small non-article fragment. If Readability drops that fragment’s logo image, imageLossRatio exceeds 0.5 and the new fallback returns only that snippet, replacing the real article/gallery content that Readability extracted. Selecting the most content-rich container (e.g., highest text or image count) rather than the first match would avoid this regression.
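
One way to read this suggestion is to scan every match for every selector and keep the container with the most images. The sketch below is a hypothetical illustration of that idea, not code from the PR; the selector list, function name, and `root` parameter are assumptions.

```javascript
// Hypothetical sketch of the suggestion; not code from the PR.
const CANDIDATE_SELECTORS = ['.rte', '.entry-content'];

function findRichestContainer(root = document) {
  let best = null;
  let bestCount = 0;
  for (const selector of CANDIDATE_SELECTORS) {
    // querySelectorAll returns every match, not just the first one.
    for (const el of root.querySelectorAll(selector)) {
      const count = el.querySelectorAll('img').length;
      if (count > bestCount) {
        best = el;
        bestCount = count;
      }
    }
  }
  return best; // null when no CMS container holds an image
}
```

A caller could treat a null result the same way the current code treats the document.body fallback. Ranking by text length instead of image count would be another reasonable choice.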


gupsammy merged commit 40ec732 into main Nov 29, 2025
1 check passed


gupsammy deleted the feat/preserve-images-gallery-pages branch November 29, 2025 06:47
