Institutional-Grade Documentation Edition
This repository contains the Ultimate Web Novel & Manga Scraper, a comprehensive WordPress plugin designed to automate the ingestion of manga and web novel content. It is engineered to integrate seamlessly with the Madara theme, transforming a standard WordPress installation into a fully automated content aggregation platform.
- Project Overview
- Feature Inventory
- System Requirements
- Technology Stack
- Directory Overview
- Installation
- Environment Setup
- Configuration
- Database Setup
- Admin & System Usage
- Development Workflow
- Production Deployment
- Security Considerations
- Limitations & Assumptions
- Maintenance
- Licensing
The plugin operates as a "God Object" within the WordPress ecosystem, specifically targeting the Madara manga theme. It acts as a bridge between external content sources (MangaFox, WuxiaWorld, Madara-based sites, etc.) and the local WordPress database.
It handles the entire lifecycle of content acquisition:
- Scheduling: Cron-based execution.
- Fetching: Multi-mode scraping (cURL, PhantomJS, Puppeteer).
- Processing: HTML parsing, cleaning, and text spinning.
- Translation: Automated translation via Google/DeepL/Microsoft.
- Storage: Saving to local FS, DB, or Cloud Storage (S3).
- Multi-Source Scraping: Built-in rules for major manga/novel sites.
- Headless Browser Support: Renders JavaScript-heavy sites using PhantomJS or Puppeteer.
- Translation Pipeline: Converts content language on-the-fly.
- Proxy Support: Rotates proxies to bypass IP bans.
- Cloudflare Bypass: Mechanisms to handle anti-bot protection.
- Madara Enhancements: Specialized module for cloning other Madara sites via AJAX.
- Auto-Update: Updates existing manga with new chapters automatically.
- CMS: WordPress 5.0+
- Theme: Madara (Active)
- Plugin Dependency: Madara Core (WP_MANGA_STORAGE)
- PHP: 7.4+
- Extensions: curl, dom, mbstring, json, libxml
- Optional: Node.js (for Puppeteer), PhantomJS binary, shell_exec enabled
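The requirements above can be checked before activation with a short preflight script. This is a minimal sketch, not part of the plugin: the function name ums_missing_requirements is hypothetical, and it only covers the PHP version and the required extensions listed above. The optional shell_exec check is shown separately because it is only needed for the headless modes.

```php
<?php
// Hypothetical preflight helper (not shipped with the plugin).
// Returns a list of unmet requirements; an empty array means all are met.
function ums_missing_requirements(string $phpVersion, array $loadedExtensions): array
{
    $missing = [];

    // The README states PHP 7.4+ as the minimum.
    if (version_compare($phpVersion, '7.4.0', '<')) {
        $missing[] = 'php>=7.4';
    }

    // Required extensions as listed in the README.
    foreach (['curl', 'dom', 'mbstring', 'json', 'libxml'] as $ext) {
        if (!in_array($ext, $loadedExtensions, true)) {
            $missing[] = $ext;
        }
    }

    return $missing;
}

// Usage: check the running interpreter.
$loaded  = array_map('strtolower', get_loaded_extensions());
$missing = ums_missing_requirements(PHP_VERSION, $loaded);

if ($missing === []) {
    echo "All required components found." . PHP_EOL;
} else {
    echo "Missing: " . implode(', ', $missing) . PHP_EOL;
}

// Optional: shell_exec is only needed for PhantomJS/Puppeteer modes.
// Older PHP versions may still report a disabled function as existing,
// so the disable_functions ini value is checked as well.
$disabled = array_map('trim', explode(',', (string) ini_get('disable_functions')));
$shellExecAvailable = function_exists('shell_exec') && !in_array('shell_exec', $disabled, true);
echo "shell_exec available: " . ($shellExecAvailable ? 'yes' : 'no') . PHP_EOL;
```

Run it with the same PHP binary and ini configuration the web server uses; the CLI often loads a different php.ini.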
- Language: PHP 7/8
- Frontend: jQuery (Admin UI)
- Parsers: PHP Simple HTML DOM Parser, DOMDocument
- Headless: PhantomJS (JS), Puppeteer (Node.js)
- Database: MySQL/MariaDB (WordPress Schema + Madara Custom Tables)
See DIRECTORY_STRUCTURE.md for a complete manifest.
- root: Core logic (ultimate-manga-scraper.php).
- includes/: Madara integration classes.
- res/: Libraries, drivers, and admin UI templates.
- images/, scripts/, styles/: Assets.
See DEPLOYMENT.md for detailed steps.
- Upload the plugin to /wp-content/plugins/.
- Activate via WordPress Admin.
- Ensure the Madara theme is active.
- Permissions: Ensure the web server can write to wp-content/uploads and wp-content/plugins/ultimate-manga-scraper.
- Cron: Disable WP-Cron and set up a system cron job for reliability.
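The cron step above usually comes down to one constant in wp-config.php plus one crontab entry. The fragment below is a sketch under assumptions: DISABLE_WP_CRON is the standard WordPress constant, while the example.com URL and the five-minute interval are placeholders to adapt.

```php
<?php
// wp-config.php fragment: add above the "That's all, stop editing!" line.
// DISABLE_WP_CRON stops WordPress from spawning wp-cron.php on page loads,
// so scheduled rules run only when the system cron calls it.
define('DISABLE_WP_CRON', true);

// Example crontab entry (assumptions: site served at https://example.com,
// a 5-minute interval is frequent enough for your rules):
//
// */5 * * * * curl -s "https://example.com/wp-cron.php?doing_wp_cron" > /dev/null 2>&1
```

A system cron triggers scheduled rules at fixed intervals even when the site has no visitors, which is the reliability gain over the default page-load trigger.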
See CONFIGURATION.md.
Configuration is handled via Ultimate Web Novel & Manga Scraper -> Main Settings. Key areas:
- Headless Settings: Paths to binaries.
- Translation Keys: API credentials.
- Storage Backend: Local vs Cloud.
The plugin utilizes the standard WordPress wp_options table for storing rules and settings. Content is stored in wp_posts (Manga) and wp_postmeta. Chapter data is managed by Madara's storage engine.
- Define Rules: Go to the specific scraper tab (e.g., Manga Scraper).
- Add URL: Paste the TOC URL of the target manga.
- Set Schedule: Define how often to check for updates.
- Run: Click "Run This Rule Now" or wait for Cron.
- Monitor: Watch the "Activity & Logging" tab.
- Architecture: See ARCHITECTURE.md.
- Data Flow: See DATA_FLOW.md.
- Modifying: Edits should primarily be made in ultimate-manga-scraper.php for core logic, or includes/ for Madara-specific logic.
- Security: See SECURITY.md.
- Optimization: Use Redis/Memcached object caching. Use a real Cron job.
- SSRF: The plugin makes outbound requests to user-defined URLs.
- RCE: shell_exec is used for headless browsers. Secure your server accordingly.
- Access Control: Restrict Admin access.
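To illustrate the SSRF point above, here is one way a URL could be screened before an outbound request. This is a hedged sketch, not the plugin's code: ums_is_safe_fetch_url is a hypothetical helper, it only resolves IPv4 addresses for hostnames, and it does not defend against DNS rebinding between the check and the actual fetch.

```php
<?php
// Hypothetical guard (not part of the plugin): allow only http/https URLs
// whose host resolves to public, non-reserved IP addresses.
function ums_is_safe_fetch_url(string $url): bool
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return false;
    }

    // Reject file://, gopher://, ftp:// and similar schemes.
    if (!in_array(strtolower($parts['scheme']), ['http', 'https'], true)) {
        return false;
    }

    // parse_url keeps the brackets around IPv6 literals; strip them.
    $host = trim($parts['host'], '[]');

    if (filter_var($host, FILTER_VALIDATE_IP) !== false) {
        $ips = [$host];                       // literal IP, no DNS lookup
    } else {
        $ips = gethostbynamel($host) ?: [];   // IPv4 A records, [] on failure
    }

    if ($ips === []) {
        return false;
    }

    foreach ($ips as $ip) {
        // Fails for private (10/8, 172.16/12, 192.168/16) and reserved
        // (127/8, 169.254/16, ...) ranges.
        $public = filter_var(
            $ip,
            FILTER_VALIDATE_IP,
            FILTER_FLAG_NO_PRIV_RANGE | FILTER_FLAG_NO_RES_RANGE
        );
        if ($public === false) {
            return false;
        }
    }

    return true;
}

// Usage
var_dump(ums_is_safe_fetch_url('http://169.254.169.254/latest/meta-data/'));
var_dump(ums_is_safe_fetch_url('https://8.8.8.8/'));
```

A check like this narrows the attack surface but does not replace network-level egress filtering on the server.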
- Theme Dependency: Assumes Madara theme structure is present.
- Site Changes: Scrapers rely on DOM structure. Target site changes will break scraping until updated.
- Legal: User is responsible for copyright compliance of scraped content.
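The "Site Changes" limitation above is easiest to see with a concrete selector. The sketch below uses DOMDocument and DOMXPath (the dom extension listed in the requirements) against made-up markup; the chapter-item class, the sample HTML, and the function name ums_extract_chapter_links are all hypothetical and do not reflect the plugin's real rules.

```php
<?php
// Hypothetical extractor: collects chapter links from <li class="chapter-item"><a href>.
function ums_extract_chapter_links(string $html): array
{
    if (trim($html) === '') {
        return [];
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $query = '//li[contains(concat(" ", normalize-space(@class), " "), " chapter-item ")]/a[@href]';

    $links = [];
    foreach ($xpath->query($query) as $anchor) {
        $links[] = $anchor->getAttribute('href');
    }

    return $links;
}

// Markup the selector was written for.
$current = '<ul>'
    . '<li class="chapter-item"><a href="/manga/x/ch-1">Ch 1</a></li>'
    . '<li class="chapter-item"><a href="/manga/x/ch-2">Ch 2</a></li>'
    . '</ul>';

// Same content after a hypothetical site redesign.
$redesigned = '<div class="chapters"><a class="ch" href="/manga/x/ch-1">Ch 1</a></div>';

print_r(ums_extract_chapter_links($current));    // two links
print_r(ums_extract_chapter_links($redesigned)); // empty array: the rule silently finds nothing
```

The redesigned page does not raise an error; the selector simply matches nothing. That makes an unexpected run of empty results in the activity log a useful signal that a rule needs updating.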
- Logs: Rotate logs (auto_clear_logs).
- Updates: Check CHANGELOG.md.
Released into the Public Domain. See LICENSE for details.
Documentation Index: DOCUMENTATION_INDEX.md