Made mg-assetscraper-db default asset scraper #1535
Conversation
…tools

- Updated 11 source files to use @tryghost/mg-assetscraper-db instead of @tryghost/mg-assetscraper
- Changed asset scraper initialization to use allowAllDomains: true
- Added async init() call after creating the scraper
- Changed assetScraper.fetch(ctx) to assetScraper.getTasks()
- Removed @tryghost/mg-assetscraper dependency from package.json

Benefits of mg-assetscraper-db:

- SQLite caching prevents re-downloading assets
- Auto-converts HEIC/HEIF → JPEG, AVIF → WebP
- Better performance with modern fetch API
- Support for base64 image extraction
Walkthrough

This PR centralizes asset scraping across migrate: source files switch their asset-scraper import to the db-backed package, constructor options are simplified (sizeLimit and per-type allow flags removed, allowAllDomains added), an async init() call is introduced and awaited, initialization tasks are made async, per-type scrape flags (images/media/files) are consolidated into a single assets flag, and calls to fetch(ctx) are replaced with getTasks() (no ctx). CLI commands remove the sizeLimit option and add/reorder an assets scrape choice with expanded descriptions. README, types, and AssetScraper defaults are updated to remove per-type allow options.

Estimated code review effort: 4 (Complex) | ~50 minutes
Pre-merge checks: 3 passed
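The call-site change the walkthrough describes can be sketched with a stub. The names MgAssetScraper, init(), getTasks(), and allowAllDomains come from the PR itself; the class body below is a placeholder standing in for the real @tryghost/mg-assetscraper-db package, not its actual implementation.

```javascript
// Minimal stub sketching the new call pattern: construct, await init(),
// then collect tasks via getTasks() instead of the old fetch(ctx).
class MgAssetScraper {
    constructor(fileCache, options) {
        this.fileCache = fileCache;
        this.options = options;   // e.g. {allowAllDomains: true}
        this.ready = false;
    }

    async init() {                // new: must be awaited before use
        this.ready = true;        // the real package opens its SQLite cache here
    }

    getTasks() {                  // replaces the old fetch(ctx) call; no ctx argument
        if (!this.ready) {
            throw new Error('call init() before getTasks()');
        }
        return [{title: 'Fetch assets', task: async () => {}}];
    }
}

async function setupAssetScraper(fileCache) {
    const scraper = new MgAssetScraper(fileCache, {allowAllDomains: true});
    await scraper.init();         // the old package had no init step
    return scraper.getTasks();
}
```

The key migration hazard this illustrates: getTasks() is only valid after init() resolves, so every source file that constructs the scraper now has to await the init call before building its task list.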
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
packages/migrate/sources/jekyll.js (1)
74-94: ⚠️ Potential issue | 🟠 Major

Asset size limits are not enforced: the CLI sizeLimit option is defined but not wired to MgAssetScraper.

The jekyll.js CLI defines --sizeLimit ("Assets larger than this size (defined in MB) will be ignored"), but MgAssetScraper from @tryghost/mg-assetscraper-db doesn't support a sizeLimit option. This appears to be a regression from the older mg-assetscraper package, which did support this field.

This is a systemic issue affecting all source adapters (wp-xml, wp-api, blogger, squarespace, substack, tinynews, libsyn, letterdrop, medium) that use mg-assetscraper-db. Either add sizeLimit support to mg-assetscraper-db, or remove the CLI option and update the documentation to clarify that asset size limits are no longer enforced.
🤖 Fix all issues with AI agents
In `@packages/migrate/sources/letterdrop.js`:
- Around line 30-36: The MgAssetScraper is constructed without a size limit, so
pass the CLI-provided limit through by adding sizeLimit: options.sizeLimit to
the scraper config object used when creating ctx.assetScraper (the new
MgAssetScraper(ctx.fileCache, { ... }, ctx) call); update the config keys
(alongside allowAllDomains, allowImages, allowMedia, allowFiles) to include
sizeLimit so the scraper enforces the user-specified limit.
- Removed allowImages, allowMedia, allowFiles from AssetScraper options (not used in implementation)
- Simplified ctx.allowScrape to use 'assets' flag instead of separate image/media/files flags
- Updated skip conditions to use !ctx.allowScrape.assets
- Updated mg-assetscraper-db README to remove unused options
- Cleaned up types.ts to remove unused option types
- Added explicit 'assets' option to --scrape flag choices
- Updated --scrape flag documentation: all = web + assets, web = metadata only, assets = download assets only
- Kept img/media/files options for backward compatibility (all trigger asset scraping)
- Removed --sizeLimit CLI option (not supported by new asset scraper)
- Updated choices to: all, web, assets, none
- Added descriptions for each option
- Removed --sizeLimit documentation (option no longer supported)
- Fixed example in wp-api README
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@packages/migrate/commands/buttondown.js`:
- Around line 34-38: Update the help text for the '--scrape' flag to clarify that the choices 'img', 'media', and 'files' are legacy aliases for 'assets' (or remove them if you prefer not to accept legacy inputs). Specifically, change the desc string for '--scrape' to mention the mapping, e.g. "assets (download assets only) — legacy aliases: img, media, files", so users understand it, or remove those legacy tokens from the choices array if they should no longer be user-facing.
🧹 Nitpick comments (1)
packages/migrate/commands/jekyll.js (1)
39-44: Clarify legacy scrape aliases in the CLI help text.
img, media, and files remain as choices, but the description doesn't say they're legacy aliases for assets. That can confuse users scanning --help. Consider noting the back-compat behavior in the description.

✍️ Suggested wording tweak:
- desc: 'Configure scraping tasks (all = web + assets, web = metadata only, assets = download assets only)'
+ desc: 'Configure scraping tasks (all = web + assets, web = metadata only, assets = download assets only; img/media/files are legacy aliases for assets)'
Updated help text to note that img, media, files are legacy aliases for assets
- Changed allowRemoteScraping logic to 'if not none' instead of checking for legacy 'media' option
- This allows remote scraping for all, web, and assets options
Cursor Bugbot has reviewed your changes and found 2 potential issues.
- if (options?.scrape?.includes('all') || options.scrape?.includes('media')) {
-     allowRemoteScraping = true;
- }
+ let allowRemoteScraping = !options?.scrape?.includes('none');
Remote scraping logic inconsistent with scrape options
Medium Severity
The allowRemoteScraping logic uses !options?.scrape?.includes('none') which is too permissive. When a user specifies --scrape web (expecting only web metadata scraping), this evaluates to true because 'web' doesn't include 'none'. This causes remote scraping of Libsyn podcasts and WordPress bookmark cards when it shouldn't. The logic should match how ctx.allowScrape.assets is computed in the source files, checking for 'all', 'assets', 'img', 'media', or 'files'.
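The fix this comment suggests can be sketched as a small helper. This is illustrative only; the flag list mirrors how the comment says ctx.allowScrape.assets is computed in the source files, and the function name is hypothetical.

```javascript
// Flags that should enable remote/asset scraping, matching the set the
// source files use when computing ctx.allowScrape.assets.
const ASSET_SCRAPE_FLAGS = ['all', 'assets', 'img', 'media', 'files'];

// Stricter allowRemoteScraping check: true only when an asset-related flag
// is present, instead of the too-permissive "anything but 'none'".
function allowRemoteScraping(scrape = []) {
    return ASSET_SCRAPE_FLAGS.some(flag => scrape.includes(flag));
}
```

With this version, --scrape web no longer triggers remote scraping of Libsyn podcasts or WordPress bookmark cards, while all of the legacy aliases still do.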
- images: ctx.options.scrape.includes('img') || ctx.options.scrape.includes('all'),
- media: ctx.options.scrape.includes('media') || ctx.options.scrape.includes('all'),
- files: ctx.options.scrape.includes('files') || ctx.options.scrape.includes('all'),
+ assets: ctx.options.scrape.includes('all') || ctx.options.scrape.includes('assets') || ctx.options.scrape.includes('img') || ctx.options.scrape.includes('media') || ctx.options.scrape.includes('files'),
Unused all property in ctx.allowScrape object
Low Severity
The all property (all: ctx.options.scrape.includes('all')) is defined in ctx.allowScrape across all 13 migration source files but is never read or used. Only ctx.allowScrape.web and ctx.allowScrape.assets are used in skip conditions. This property should be removed.
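Dropping the unused all property would leave ctx.allowScrape with only the two flags that are actually read. A sketch of the trimmed construction, assuming the web condition follows the new --scrape semantics (all = web + assets); the helper name is hypothetical:

```javascript
// Builds the trimmed ctx.allowScrape object: only `web` and `assets`,
// with img/media/files kept as legacy aliases for assets.
function buildAllowScrape(scrape) {
    return {
        web: scrape.includes('all') || scrape.includes('web'),
        assets: ['all', 'assets', 'img', 'media', 'files']
            .some(flag => scrape.includes(flag))
    };
}
```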


Benefits of mg-assetscraper-db:
Note
Medium Risk
Touches core migration scraping behavior and CLI options across multiple sources; mistakes could change what gets downloaded or scraped during migrations and impact performance/outputs.
Overview
Switches the migration CLI's asset downloading to use @tryghost/mg-assetscraper-db instead of @tryghost/mg-assetscraper, including dropping the old dependency.

Updates all affected sources to initialize the new scraper with allowAllDomains: true, call the new async init(), and run asset work via getTasks(). The --scrape option is simplified to all|web|assets|none while still accepting legacy img|media|files aliases, and WordPress content processing now treats remote scraping as enabled unless --scrape none is set.

Written by Cursor Bugbot for commit 6081b2c.