Made mg-assetscraper-db default asset scraper #1535

Merged
55sketch merged 6 commits into main from make-db-asset-scraper-default on Feb 5, 2026

Conversation

55sketch (Contributor) commented Feb 5, 2026

  • Updated 11 source files to use @tryghost/mg-assetscraper-db instead of @tryghost/mg-assetscraper
  • Changed asset scraper initialization to use allowAllDomains: true
  • Added async init() call after creating the scraper
  • Changed assetScraper.fetch(ctx) to assetScraper.getTasks()
  • Removed @tryghost/mg-assetscraper dependency from package.json
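
Roughly, the change per source looks like the following sketch, assuming the migration context (ctx) and file cache used by the existing sources; the surrounding task-runner code is omitted and may differ between sources:

import MgAssetScraper from '@tryghost/mg-assetscraper-db';

// Simplified constructor options: the per-type allow flags and sizeLimit are gone
ctx.assetScraper = new MgAssetScraper(ctx.fileCache, {
    allowAllDomains: true
}, ctx);

// New async initialization step after construction
await ctx.assetScraper.init();

// Previously: await ctx.assetScraper.fetch(ctx);
// getTasks() takes no ctx argument and returns the asset work as tasks
const assetTasks = ctx.assetScraper.getTasks();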

Benefits of mg-assetscraper-db:

  • SQLite caching prevents re-downloading assets
  • Auto-converts HEIC/HEIF → JPEG, AVIF → WebP
  • Better performance with modern fetch API
  • Support for base64 image extraction

Note

Medium Risk: Touches core migration scraping behavior and CLI options across multiple sources; mistakes could change what gets downloaded or scraped during migrations and affect performance or outputs.

Overview
Switches the migration CLI’s asset downloading to use @tryghost/mg-assetscraper-db instead of @tryghost/mg-assetscraper, including dropping the old dependency.

Updates all affected sources to initialize the new scraper with allowAllDomains: true, call the new async init(), and run asset work via getTasks(). The --scrape option is simplified to all|web|assets|none while still accepting the legacy img|media|files aliases, and WordPress content processing now treats remote scraping as enabled unless --scrape none is set.
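
As a hedged illustration of the simplified option, assuming a sywac-style enumeration for the option definition (the exact flag aliases, defaults, and wording in each command file may differ):

sywac.enumeration({
    defaultValue: 'all',
    choices: ['all', 'web', 'assets', 'none', 'img', 'media', 'files'],
    flags: '--scrape',
    desc: 'Configure scraping tasks (all = web + assets, web = metadata only, assets = download assets only; img/media/files are legacy aliases for assets)'
});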

Written by Cursor Bugbot for commit 6081b2c. This will update automatically on new commits.

coderabbitai bot commented Feb 5, 2026

Walkthrough

This PR centralizes asset scraping across migrate: source files switch their asset-scraper import to the db-backed package, constructor options are simplified (sizeLimit and the per-type allow flags removed, allowAllDomains added), an async init() call is introduced and awaited, initialization tasks are made async, the per-type scrape flags (images/media/files) are consolidated into a single assets flag, and calls to fetch(ctx) are replaced with getTasks() (no ctx). CLI commands remove the sizeLimit option and add/reorder an assets scrape choice with expanded descriptions. The README, types, and AssetScraper defaults are updated to remove the per-type allow options.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks: 3 passed

  • Title check (✅ Passed): The title 'Made mg-assetscraper-db default asset scraper' clearly and concisely describes the main change: switching the default asset scraper library across the codebase.
  • Docstring Coverage (✅ Passed): No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
  • Description check (✅ Passed): The pull request description clearly relates to the changeset, detailing the migration from @tryghost/mg-assetscraper to @tryghost/mg-assetscraper-db with specific changes and benefits.


coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
packages/migrate/sources/jekyll.js (1)

74-94: ⚠️ Potential issue | 🟠 Major

Asset size limits are not enforced—the CLI sizeLimit option is defined but not wired to MgAssetScraper.

The jekyll.js CLI defines --sizeLimit ("Assets larger than this size (defined in MB) will be ignored"), but MgAssetScraper from @tryghost/mg-assetscraper-db doesn't support a sizeLimit option. This appears to be a regression from the older mg-assetscraper package, which did support this field.

This is a systemic issue affecting all source adapters (wp-xml, wp-api, blogger, squarespace, substack, tinynews, libsyn, letterdrop, medium) that use mg-assetscraper-db. Either add sizeLimit support to mg-assetscraper-db or remove the CLI option and update documentation to clarify that asset size limits are no longer enforced.

🤖 Fix all issues with AI agents
In `@packages/migrate/sources/letterdrop.js`:
- Around line 30-36: The MgAssetScraper is constructed without a size limit, so
pass the CLI-provided limit through by adding sizeLimit: options.sizeLimit to
the scraper config object used when creating ctx.assetScraper (the new
MgAssetScraper(ctx.fileCache, { ... }, ctx) call); update the config keys
(alongside allowAllDomains, allowImages, allowMedia, allowFiles) to include
sizeLimit so the scraper enforces the user-specified limit.
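
For illustration, the suggested change would look roughly like the sketch below in the constructor call; note the outside-diff comment above, which points out that mg-assetscraper-db may not honor sizeLimit at all, so this only has an effect if the package supports the option:

ctx.assetScraper = new MgAssetScraper(ctx.fileCache, {
    sizeLimit: options.sizeLimit, // CLI-provided limit in MB, as with the old scraper
    allowAllDomains: true
}, ctx);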

- Removed allowImages, allowMedia, allowFiles from AssetScraper options (not used in implementation)
- Simplified ctx.allowScrape to use 'assets' flag instead of separate image/media/files flags
- Updated skip conditions to use !ctx.allowScrape.assets
- Updated mg-assetscraper-db README to remove unused options
- Cleaned up types.ts to remove unused option types
- Added explicit 'assets' option to --scrape flag choices
- Updated --scrape flag documentation: all = web + assets, web = metadata only, assets = download assets only
- Kept img/media/files options for backward compatibility (all trigger asset scraping)
- Removed --sizeLimit CLI option (not supported by new asset scraper)
- Updated choices to: all, web, assets, none
- Added descriptions for each option
- Removed --sizeLimit documentation (option no longer supported)
- Fixed example in wp-api README
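
To illustrate the consolidated flag and skip condition described in this list, here is a rough sketch of an asset-fetching task, assuming Listr-style task objects (as the migrate sources use) and that getTasks() returns a compatible sub-task list:

const fetchAssetsTask = {
    title: 'Fetch assets via AssetScraper',
    // One flag replaces the separate images/media/files checks
    skip: () => !ctx.allowScrape.assets,
    task: async (ctx, task) => {
        const assetTasks = ctx.assetScraper.getTasks();
        return task.newListr(assetTasks);
    }
};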
coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@packages/migrate/commands/buttondown.js`:
- Around line 34-38: Update the help text for the '--scrape' flag to clarify
that the choices 'img', 'media', and 'files' are legacy aliases for 'assets' (or
remove them if you prefer not to accept legacy inputs); specifically change the
desc string associated with the flags: '--scrape' and the choices array to
mention that 'img/media/files' are legacy aliases for 'assets' (e.g., "assets
(download assets only) — legacy aliases: img, media, files") so users understand
the mapping, or remove those legacy tokens from the choices array if they should
no longer be user-facing.
🧹 Nitpick comments (1)
packages/migrate/commands/jekyll.js (1)

39-44: Clarify legacy scrape aliases in the CLI help text.

img, media, and files remain as choices, but the description doesn’t say they’re legacy aliases for assets. That can confuse users scanning --help. Consider noting the back-compat behavior in the description.

✍️ Suggested wording tweak
-        desc: 'Configure scraping tasks (all = web + assets, web = metadata only, assets = download assets only)'
+        desc: 'Configure scraping tasks (all = web + assets, web = metadata only, assets = download assets only; img/media/files are legacy aliases for assets)'

Updated help text to note that img, media, files are legacy aliases for assets
- Changed allowRemoteScraping logic to 'if not none' instead of checking for legacy 'media' option
- This allows remote scraping for all, web, and assets options
55sketch merged commit ea718fa into main on Feb 5, 2026 (3 checks passed)
55sketch deleted the make-db-asset-scraper-default branch on February 5, 2026 at 12:40
cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.


-    if (options?.scrape?.includes('all') || options.scrape?.includes('media')) {
-        allowRemoteScraping = true;
-    }
+    let allowRemoteScraping = !options?.scrape?.includes('none');

Remote scraping logic inconsistent with scrape options

Medium Severity

The allowRemoteScraping logic uses !options?.scrape?.includes('none') which is too permissive. When a user specifies --scrape web (expecting only web metadata scraping), this evaluates to true because 'web' doesn't include 'none'. This causes remote scraping of Libsyn podcasts and WordPress bookmark cards when it shouldn't. The logic should match how ctx.allowScrape.assets is computed in the source files, checking for 'all', 'assets', 'img', 'media', or 'files'.
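
A minimal sketch of the stricter check the comment describes, mirroring the modes that enable asset scraping in the source files:

// Enable remote scraping only for the modes that actually request asset work
const assetModes = ['all', 'assets', 'img', 'media', 'files'];
let allowRemoteScraping = assetModes.some(mode => options?.scrape?.includes(mode));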


images: ctx.options.scrape.includes('img') || ctx.options.scrape.includes('all'),
media: ctx.options.scrape.includes('media') || ctx.options.scrape.includes('all'),
files: ctx.options.scrape.includes('files') || ctx.options.scrape.includes('all'),
assets: ctx.options.scrape.includes('all') || ctx.options.scrape.includes('assets') || ctx.options.scrape.includes('img') || ctx.options.scrape.includes('media') || ctx.options.scrape.includes('files'),

Unused all property in ctx.allowScrape object

Low Severity

The all property (all: ctx.options.scrape.includes('all')) is defined in ctx.allowScrape across all 13 migration source files but is never read or used. Only ctx.allowScrape.web and ctx.allowScrape.assets are used in skip conditions. This property should be removed.

