Made mg-assetscraper-db default asset scraper #1535

Merged
55sketch merged 6 commits into main from make-db-asset-scraper-default on Feb 5, 2026

Conversation

55sketch (Contributor) commented Feb 5, 2026

  • Updated 11 source files to use @tryghost/mg-assetscraper-db instead of @tryghost/mg-assetscraper
  • Changed asset scraper initialization to use allowAllDomains: true
  • Added async init() call after creating the scraper
  • Changed assetScraper.fetch(ctx) to assetScraper.getTasks()
  • Removed @tryghost/mg-assetscraper dependency from package.json
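
Roughly, the change per source looks like the following sketch, assuming the migration context (ctx) and file cache used by the existing sources; the surrounding task-runner code is omitted and may differ between sources:

import MgAssetScraper from '@tryghost/mg-assetscraper-db';

// Simplified constructor options: the per-type allow flags and sizeLimit are gone
ctx.assetScraper = new MgAssetScraper(ctx.fileCache, {
    allowAllDomains: true
}, ctx);

// New async initialization step after construction
await ctx.assetScraper.init();

// Previously: await ctx.assetScraper.fetch(ctx);
// getTasks() takes no ctx argument and returns the asset work as tasks
const assetTasks = ctx.assetScraper.getTasks();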

Benefits of mg-assetscraper-db:

  • SQLite caching prevents re-downloading assets
  • Auto-converts HEIC/HEIF → JPEG, AVIF → WebP
  • Better performance with modern fetch API
  • Support for base64 image extraction

Note

Medium Risk: Touches core migration scraping behavior and CLI options across multiple sources; mistakes could change what gets downloaded or scraped during migrations and affect performance or outputs.

Overview
Switches the migration CLI’s asset downloading to use @tryghost/mg-assetscraper-db instead of @tryghost/mg-assetscraper, including dropping the old dependency.

Updates all affected sources to initialize the new scraper with allowAllDomains: true, call the new async init(), and run asset work via getTasks(). The --scrape option is simplified to all|web|assets|none while still accepting the legacy img|media|files aliases, and WordPress content processing now treats remote scraping as enabled unless --scrape none is set.
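
As a hedged illustration of the simplified option, assuming a sywac-style enumeration for the option definition (the exact flag aliases, defaults, and wording in each command file may differ):

sywac.enumeration({
    defaultValue: 'all',
    choices: ['all', 'web', 'assets', 'none', 'img', 'media', 'files'],
    flags: '--scrape',
    desc: 'Configure scraping tasks (all = web + assets, web = metadata only, assets = download assets only; img/media/files are legacy aliases for assets)'
});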

Written by Cursor Bugbot for commit 6081b2c. This will update automatically on new commits.

coderabbitai bot commented Feb 5, 2026

Walkthrough

This PR centralizes asset scraping across migrate: source files switch their asset-scraper import to the db-backed package, constructor options are simplified (sizeLimit and the per-type allow flags removed, allowAllDomains added), an async init() call is introduced and awaited, initialization tasks are made async, the per-type scrape flags (images/media/files) are consolidated into a single assets flag, and calls to fetch(ctx) are replaced with getTasks() (no ctx). CLI commands remove the sizeLimit option and add/reorder an assets scrape choice with expanded descriptions. The README, types, and AssetScraper defaults are updated to remove the per-type allow options.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks: 3 passed

  • Title check (✅ Passed): The title 'Made mg-assetscraper-db default asset scraper' clearly and concisely describes the main change: switching the default asset scraper library across the codebase.
  • Docstring Coverage (✅ Passed): No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
  • Description check (✅ Passed): The pull request description clearly relates to the changeset, detailing the migration from @tryghost/mg-assetscraper to @tryghost/mg-assetscraper-db with specific changes and benefits.


coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
packages/migrate/sources/jekyll.js (1)

74-94: ⚠️ Potential issue | 🟠 Major

Asset size limits are not enforced—the CLI sizeLimit option is defined but not wired to MgAssetScraper.

The jekyll.js CLI defines --sizeLimit ("Assets larger than this size (defined in MB) will be ignored"), but MgAssetScraper from @tryghost/mg-assetscraper-db doesn't support a sizeLimit option. This appears to be a regression from the older mg-assetscraper package, which did support this field.

This is a systemic issue affecting all source adapters (wp-xml, wp-api, blogger, squarespace, substack, tinynews, libsyn, letterdrop, medium) that use mg-assetscraper-db. Either add sizeLimit support to mg-assetscraper-db or remove the CLI option and update documentation to clarify that asset size limits are no longer enforced.

🤖 Fix all issues with AI agents
In `@packages/migrate/sources/letterdrop.js`:
- Around line 30-36: The MgAssetScraper is constructed without a size limit, so
pass the CLI-provided limit through by adding sizeLimit: options.sizeLimit to
the scraper config object used when creating ctx.assetScraper (the new
MgAssetScraper(ctx.fileCache, { ... }, ctx) call); update the config keys
(alongside allowAllDomains, allowImages, allowMedia, allowFiles) to include
sizeLimit so the scraper enforces the user-specified limit.
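
For illustration, the suggested change would look roughly like the sketch below in the constructor call; note the outside-diff comment above, which points out that mg-assetscraper-db may not honor sizeLimit at all, so this only has an effect if the package supports the option:

ctx.assetScraper = new MgAssetScraper(ctx.fileCache, {
    sizeLimit: options.sizeLimit, // CLI-provided limit in MB, as with the old scraper
    allowAllDomains: true
}, ctx);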

- Removed allowImages, allowMedia, allowFiles from AssetScraper options (not used in implementation)
- Simplified ctx.allowScrape to use 'assets' flag instead of separate image/media/files flags
- Updated skip conditions to use !ctx.allowScrape.assets
- Updated mg-assetscraper-db README to remove unused options
- Cleaned up types.ts to remove unused option types
- Added explicit 'assets' option to --scrape flag choices
- Updated --scrape flag documentation: all = web + assets, web = metadata only, assets = download assets only
- Kept img/media/files options for backward compatibility (all trigger asset scraping)
- Removed --sizeLimit CLI option (not supported by new asset scraper)
- Updated choices to: all, web, assets, none
- Added descriptions for each option
- Removed --sizeLimit documentation (option no longer supported)
- Fixed example in wp-api README
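
To illustrate the consolidated flag and skip condition described in this list, here is a rough sketch of an asset-fetching task, assuming Listr-style task objects (as the migrate sources use) and that getTasks() returns a compatible sub-task list:

const fetchAssetsTask = {
    title: 'Fetch assets via AssetScraper',
    // One flag replaces the separate images/media/files checks
    skip: () => !ctx.allowScrape.assets,
    task: async (ctx, task) => {
        const assetTasks = ctx.assetScraper.getTasks();
        return task.newListr(assetTasks);
    }
};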
coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@packages/migrate/commands/buttondown.js`:
- Around line 34-38: Update the help text for the '--scrape' flag to clarify
that the choices 'img', 'media', and 'files' are legacy aliases for 'assets' (or
remove them if you prefer not to accept legacy inputs); specifically change the
desc string associated with the flags: '--scrape' and the choices array to
mention that 'img/media/files' are legacy aliases for 'assets' (e.g., "assets
(download assets only) — legacy aliases: img, media, files") so users understand
the mapping, or remove those legacy tokens from the choices array if they should
no longer be user-facing.
🧹 Nitpick comments (1)
packages/migrate/commands/jekyll.js (1)

39-44: Clarify legacy scrape aliases in the CLI help text.

img, media, and files remain as choices, but the description doesn’t say they’re legacy aliases for assets. That can confuse users scanning --help. Consider noting the back-compat behavior in the description.

✍️ Suggested wording tweak
-        desc: 'Configure scraping tasks (all = web + assets, web = metadata only, assets = download assets only)'
+        desc: 'Configure scraping tasks (all = web + assets, web = metadata only, assets = download assets only; img/media/files are legacy aliases for assets)'

Updated help text to note that img, media, files are legacy aliases for assets
- Changed allowRemoteScraping logic to 'if not none' instead of checking for legacy 'media' option
- This allows remote scraping for all, web, and assets options
55sketch merged commit ea718fa into main on Feb 5, 2026 (3 checks passed)
55sketch deleted the make-db-asset-scraper-default branch on February 5, 2026 at 12:40
cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.


-    if (options?.scrape?.includes('all') || options.scrape?.includes('media')) {
-        allowRemoteScraping = true;
-    }
+    let allowRemoteScraping = !options?.scrape?.includes('none');

Remote scraping logic inconsistent with scrape options

Medium Severity

The allowRemoteScraping logic uses !options?.scrape?.includes('none') which is too permissive. When a user specifies --scrape web (expecting only web metadata scraping), this evaluates to true because 'web' doesn't include 'none'. This causes remote scraping of Libsyn podcasts and WordPress bookmark cards when it shouldn't. The logic should match how ctx.allowScrape.assets is computed in the source files, checking for 'all', 'assets', 'img', 'media', or 'files'.
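
A minimal sketch of the stricter check the comment describes, mirroring the modes that enable asset scraping in the source files:

// Enable remote scraping only for the modes that actually request asset work
const assetModes = ['all', 'assets', 'img', 'media', 'files'];
let allowRemoteScraping = assetModes.some(mode => options?.scrape?.includes(mode));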


images: ctx.options.scrape.includes('img') || ctx.options.scrape.includes('all'),
media: ctx.options.scrape.includes('media') || ctx.options.scrape.includes('all'),
files: ctx.options.scrape.includes('files') || ctx.options.scrape.includes('all'),
assets: ctx.options.scrape.includes('all') || ctx.options.scrape.includes('assets') || ctx.options.scrape.includes('img') || ctx.options.scrape.includes('media') || ctx.options.scrape.includes('files'),

Unused all property in ctx.allowScrape object

Low Severity

The all property (all: ctx.options.scrape.includes('all')) is defined in ctx.allowScrape across all 13 migration source files but is never read or used. Only ctx.allowScrape.web and ctx.allowScrape.assets are used in skip conditions. This property should be removed.

