@gupsammy
Owner

Summary

Fixed an issue where Readability.js strips images from gallery-style pages (like Shopify stores with many images and little text). Added intelligent fallback logic that detects when >50% of images are lost during content extraction and falls back to direct extraction from CMS-specific containers with cleaning.

Changes

Added three new helper functions to lib/ezycopy.js:

  1. countImagesInHtml() - Counts images in an HTML string using DOMParser to detect image loss
  2. cleanContainerHtml() - Strips non-content elements (style tags, scripts, promo/ad elements) before Turndown processes the HTML
  3. findContentContainer() - Finds CMS-specific content containers (.rte for Shopify, .entry-content for WordPress, etc.) that reliably contain only article content
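
As a rough illustration of the first helper, the sketch below counts `<img>` elements in an HTML string with DOMParser. It is not the merged code from lib/ezycopy.js: the optional `parser` parameter is an assumption added here so the function can be exercised outside a browser, and the empty-string guard is likewise hypothetical.

```javascript
// Hypothetical sketch of countImagesInHtml(); the merged implementation may differ.
// `parser` is an assumed, injectable dependency. In a content script the
// default `new DOMParser()` is used.
function countImagesInHtml(html, parser = new DOMParser()) {
  if (!html) return 0; // nothing to parse, so no images

  // Documents created by DOMParser have scripting disabled, so scripts
  // embedded in the parsed markup are not executed.
  const doc = parser.parseFromString(html, "text/html");
  return doc.querySelectorAll("img").length;
}
```

In a content script this would be called as `countImagesInHtml(article.content)`, where `article` is assumed to be the object returned by Readability's `parse()`.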

Modified the extractContent() function to add smart fallback logic:

  • Detects when Readability strips >50% of images from a specific CMS container
  • Falls back to extracting from the CMS container directly with cleaning
  • Only triggers for CMS-specific containers, not generic ones like body (prevents false positives on news sites)

Problem Solved

Readability's heuristic removes images when the text-to-image ratio is low (p/img < 0.5). Before this change:

  • Gallery pages like blambot.com/pages/lettering-tips had all 24 images stripped

With the fallback in place:

  • Image loss on such pages is detected and extraction falls back to the CMS container, preserving all images
  • News sites like BBC and TOI continue to use Readability's clean output (no false positives)
  • Promo content and CSS styles are stripped from fallback extraction
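
To make the p/img < 0.5 ratio concrete, here is a simplified illustration of that kind of check. It is not Readability's actual code, which combines the ratio with other signals; the function name and the `imageCount > 1` guard are assumptions made for this sketch.

```javascript
// Simplified illustration of a paragraph-to-image ratio check.
// NOT Readability's implementation; the name and the `imageCount > 1`
// guard are assumptions for this example.
function wouldDropForImageRatio(paragraphCount, imageCount) {
  // A node with many images and few paragraphs looks like non-article
  // content to a text-oriented extractor.
  return imageCount > 1 && paragraphCount / imageCount < 0.5;
}
```

Under this sketch, a gallery block with 2 paragraphs and 24 images has a ratio of about 0.08 and would be dropped, while a typical article block with 10 paragraphs and 3 images would be kept.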

Testing

  • Blambot (Shopify gallery): All 24 images now captured
  • Times of India: Uses Readability, clean output preserved
  • BBC News: Uses Readability, clean output preserved
  • Ars Technica: Fallback triggers but CSS/promos are cleaned

Add smart fallback logic to detect when Readability.js strips >50% of images
from specific CMS containers and fall back to direct extraction with cleaning.

New helper functions:
- countImagesInHtml(): Count images in HTML string using DOMParser
- cleanContainerHtml(): Strip non-content elements (styles, scripts, promos)
- findContentContainer(): Find CMS-specific containers (.rte, .entry-content, etc)

Modified extractContent() to detect gallery pages where text-to-image ratio is
low and use the CMS container directly instead of Readability's output. This
preserves all images while still removing styles and promo content.

Fixes issue where Shopify gallery pages (e.g. blambot.com) had images stripped
while maintaining clean output for news sites (BBC, TOI, etc).
@claude
Contributor

claude bot commented Nov 29, 2025

Claude finished @gupsammy's task


Code Review: Image Preservation on Gallery Pages ✅

Review Summary

Verdict: Ship it. This PR successfully solves the gallery page problem with clean, pragmatic code. No critical or high-priority issues found.


What I Reviewed

The PR adds intelligent fallback logic to detect when Readability.js strips too many images (>50% loss) and falls back to direct CMS container extraction with cleanup. I checked for:

  • Critical issues: Crashes, Chrome API misuse, MV3 violations, security vulnerabilities
  • High-priority issues: User-facing bugs, memory leaks, race conditions
  • Chrome Extension specifics: Manifest V3 compliance, content script safety

Code Analysis

✅ Three new helper functions:

  1. countImagesInHtml(html) (lines 62-66)

    • Uses DOMParser to safely count images in HTML string
    • Safe for content scripts - no script execution
  2. cleanContainerHtml(container) (lines 73-97)

    • Clones node to avoid DOM mutations
    • Removes scripts, styles, and common promo patterns
    • Good defensive coding with the clone
  3. findContentContainer() (lines 104-124)

    • Checks CMS-specific selectors (.rte, .entry-content, etc.)
    • Only returns container if it has images
    • Smart fallback to document.body

✅ Modified extractContent() function (lines 219-244)

  • Detects image loss: (containerImageCount - readabilityImageCount) / containerImageCount > 0.5
  • Only triggers fallback for specific CMS containers (not generic body)
  • Preserves Readability's metadata (article.title, article.byline)
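
The decision described in these bullets can be sketched as a pair of pure functions. The names `imageLossRatio`, `shouldUseContainerFallback`, and `IMAGE_LOSS_THRESHOLD` are hypothetical; only the formula, the 0.5 threshold, and the specific-container condition come from this PR.

```javascript
// Hypothetical sketch of the fallback decision; names are illustrative.
const IMAGE_LOSS_THRESHOLD = 0.5; // the >50% threshold described in this PR

// Fraction of the container's images that Readability's output is missing.
function imageLossRatio(containerImageCount, readabilityImageCount) {
  if (containerImageCount <= 0) return 0; // avoids division by zero
  const lost = Math.max(0, containerImageCount - readabilityImageCount);
  return lost / containerImageCount;
}

// Fall back only for CMS-specific containers that lost more than half
// of their images during extraction.
function shouldUseContainerFallback({
  containerImageCount,
  readabilityImageCount,
  isSpecificContainer,
}) {
  if (!isSpecificContainer) return false; // generic containers such as body never trigger
  return (
    imageLossRatio(containerImageCount, readabilityImageCount) >
    IMAGE_LOSS_THRESHOLD
  );
}
```

Because the comparison is strict, a container that loses exactly half of its images (for example 2 of 4) stays on Readability's output in this sketch.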

Security & Safety

No security issues

  • DOMParser is safe - doesn't execute scripts
  • cloneNode(true) prevents unintended DOM mutations
  • No Chrome API misuse
  • No user input validation needed (all DOM-based)

Manifest V3 compliant

  • All code runs in content script context
  • No deprecated APIs used
  • No CSP violations

Logic & Edge Cases

The 50% threshold is reasonable

  • Catches gallery pages such as Blambot (100% image loss) while the BBC and TOI test pages stay on Readability's output
  • Tested on Blambot (Shopify), BBC, TOI, Ars Technica

Fallback safety

  • isSpecificContainer check prevents triggering on news sites with generic containers
  • Division by zero handled: containerImageCount > 0 check before ratio calculation
  • Fallback still uses Readability's article.title and article.byline (good UX)

Minor Observations (not blockers)

Promo selector patterns (lines 81-91):
The attribute substring selectors like [class*="promo"] match any class attribute that contains that substring, e.g. class="promotion" or class="promoter-bio". This is probably fine for your use case - overly aggressive cleaning is better than preserving junk on gallery pages. If you ever see false positives, you can tighten these later.
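
To show how broad a substring selector is, the sketch below emulates `[class*="promo"]` with plain string operations and contrasts it with one possible tighter rule. Both function names are hypothetical, and the token-prefix rule is only one way the selectors could be tightened, not something this PR implements.

```javascript
// Emulates the CSS selector [class*="promo"]: true when the class
// attribute value contains the substring anywhere.
function matchesSubstring(classAttr, needle) {
  return classAttr.includes(needle);
}

// Hypothetical tighter rule: some whitespace-separated class token is
// exactly `word`, or starts with `word` followed by "-" or "_".
function matchesTokenPrefix(classAttr, word) {
  return classAttr
    .split(/\s+/)
    .filter(Boolean) // drop empty tokens from leading or repeated spaces
    .some(
      (token) =>
        token === word ||
        token.startsWith(word + "-") ||
        token.startsWith(word + "_")
    );
}
```

With these helpers, `"card promotion"` matches the substring rule but not the token-prefix rule, while `"promo-banner hero"` matches both.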

No error handling on querySelector:
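The code assumes container.querySelectorAll('img') and clone.querySelectorAll() won't throw. This is safe - querySelectorAll only throws (a SyntaxError) when it is given an invalid selector string, and the selectors used here are fixed strings in the source.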


Testing Coverage

Based on the PR description, you tested:

  • ✅ Blambot (Shopify gallery): 24 images preserved
  • ✅ BBC News: Readability still used (no fallback)
  • ✅ Times of India: Readability still used
  • ✅ Ars Technica: Fallback triggered, promos cleaned

This is solid coverage for the intended use cases.


Final Thoughts

This is exactly the kind of pragmatic solution that makes sense for a solo project:

  • Solves a real problem (gallery pages losing images)
  • Doesn't over-engineer (simple 50% threshold)
  • Preserves existing behavior for normal articles
  • Clean, readable code

No changes needed. Ready to merge.


@gupsammy gupsammy merged commit 40ec732 into main Nov 29, 2025
1 check passed

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines +117 to +120
    for (const selector of selectors) {
      const el = document.querySelector(selector);
      if (el && el.querySelectorAll('img').length > 0) {
        return el;


P1: Choose primary CMS container before triggering fallback

findContentContainer() picks the first matching CMS selector that contains any <img> and feeds it into the >50% image-loss check. On pages with multiple .rte/.entry-content blocks (e.g., Shopify themes with header/footer rich-text widgets or product recommendation sections), the first match can be a small non-article fragment. If Readability drops that fragment’s logo image, imageLossRatio exceeds 0.5 and the new fallback returns only that snippet, replacing the real article/gallery content that Readability extracted. Selecting the most content-rich container (e.g., highest text or image count) rather than the first match would avoid this regression.
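
One way to act on this suggestion is sketched below: score every matching element and keep the richest one instead of returning the first match. This is an assumption-laden illustration, not code from the repository. `findRichestContainer`, `scoreContainer`, the `root` parameter, and the ranking rule (image count first, text length as tie-breaker) are all hypothetical, and the selector list contains only the two selectors named in this PR.

```javascript
// Hypothetical selector list; only these two selectors are named in the PR.
const CMS_SELECTORS = [".rte", ".entry-content"];

// Measures how content-rich a candidate element is.
function scoreContainer(el) {
  return {
    imageCount: el.querySelectorAll("img").length,
    textLength: (el.textContent || "").trim().length,
  };
}

// Returns the matching element with the most images (ties broken by text
// length), or null when no candidate contains an image.
function findRichestContainer(root, selectors = CMS_SELECTORS) {
  let best = null;
  let bestScore = null;
  for (const selector of selectors) {
    // querySelectorAll (not querySelector) so every match is considered.
    for (const el of root.querySelectorAll(selector)) {
      const score = scoreContainer(el);
      if (score.imageCount === 0) continue; // skip image-less fragments
      const better =
        bestScore === null ||
        score.imageCount > bestScore.imageCount ||
        (score.imageCount === bestScore.imageCount &&
          score.textLength > bestScore.textLength);
      if (better) {
        best = el;
        bestScore = score;
      }
    }
  }
  return best;
}
```

In a content script this would be called as `findRichestContainer(document)`; a caller that wants the existing behavior for pages without a CMS container could fall back to `document.body` when the result is null.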


@gupsammy gupsammy deleted the feat/preserve-images-gallery-pages branch November 29, 2025 06:47

2 participants