feat: Add comprehensive web clipper functionality for issue #38 #42
Open
Implement multi-strategy web page to markdown conversion with:

**Core Features:**
- Multiple extraction strategies: auto, readability, manual, full, structured
- Mozilla Readability integration for article extraction
- Custom CSS selector support for complex sites
- Schema.org structured data extraction
- Auto-strategy detection based on content patterns

**Content Processing:**
- HTML to Markdown conversion with Turndown.js
- Image extraction and processing (skip, link-only, download, base64)
- Link extraction and classification (internal/external)
- Metadata extraction (title, author, published date, description)
- Frontmatter generation with configurable options

**Advanced Features:**
- Custom HTTP headers and authentication support
- Cookie file support for protected content
- Configurable timeouts and redirect handling
- Batch processing from URL files
- Comprehensive error handling and retry logic
- Dry-run mode for preview without file creation

**CLI Interface:**
- Full-featured `markmv clip` command with extensive options
- Support for single URLs, batch processing, and custom output paths
- JSON output format for programmatic usage
- Verbose logging and detailed progress reporting
- Integration with existing markmv command structure

**TypeScript Implementation:**
- Strict type safety without any type coercion
- Comprehensive interfaces for all data structures
- Proper error handling with typed exceptions
- Full test coverage for all functionality

**Testing:**
- Comprehensive test suite for WebClipper core class (28 tests)
- CLI command tests covering all scenarios (14 tests)
- Mocked external dependencies (jsdom, readability, turndown)
- Error condition testing and edge case coverage
- Cross-platform filename sanitization tests

This addresses the need for robust web content extraction that handles various site architectures including SPAs, documentation sites, blogs, and structured content with appropriate strategy selection.
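To make the internal/external link classification mentioned above concrete, here is a minimal sketch. It assumes a link counts as internal when it resolves to the same host as the clipped page; `classifyLink` and `LinkKind` are hypothetical names for illustration and are not taken from the markmv code.

```typescript
// Hypothetical helper: the name, signature, and "same host" rule are
// assumptions for illustration, not markmv's actual implementation.
type LinkKind = "internal" | "external";

function classifyLink(href: string, pageUrl: string): LinkKind {
  // Resolve relative hrefs (e.g. "/docs" or "../about") against the page URL.
  const resolved = new URL(href, pageUrl);
  const page = new URL(pageUrl);
  // Same host (hostname plus port) is treated as internal.
  return resolved.host === page.host ? "internal" : "external";
}
```

Note that `new URL` throws a `TypeError` on malformed input, so a real implementation would likely catch that and decide how to report the link.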
Summary
Implements comprehensive web page to markdown conversion functionality as requested in issue #38. This adds a `clip` command to the markmv CLI that can extract content from various types of web pages using multiple extraction strategies.
Key Features
🚀 Multi-Strategy Content Extraction
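As a rough illustration of how an `auto` strategy might resolve to one of the named strategies (auto, readability, manual, full, structured), consider the sketch below. The strategy names come from the commit message; the `PageHints` shape, the `resolveStrategy` function, and the order of the checks are assumptions, since the actual detection rules are not described here.

```typescript
// Strategy names are from the commit message; everything else is hypothetical.
type Strategy = "auto" | "readability" | "manual" | "full" | "structured";
type ConcreteStrategy = Exclude<Strategy, "auto">;

// Assumed signals a clipper could collect from a parsed page.
interface PageHints {
  hasJsonLd: boolean;        // page embeds schema.org JSON-LD
  hasArticleTag: boolean;    // page contains an <article> element
  customSelectors: string[]; // CSS selectors supplied by the user
}

function resolveStrategy(requested: Strategy, hints: PageHints): ConcreteStrategy {
  // An explicit choice always wins over detection.
  if (requested !== "auto") return requested;
  // Assumed priority: user selectors, then structured data, then article markup.
  if (hints.customSelectors.length > 0) return "manual";
  if (hints.hasJsonLd) return "structured";
  if (hints.hasArticleTag) return "readability";
  return "full";
}
```

Keeping detection in a pure function like this makes the fallback order easy to unit test without fetching any page.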
🌐 Comprehensive Site Support
📋 Advanced Processing Options
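One processing step named in the commit message is frontmatter generation from extracted metadata (title, author, published date, description). The following is a minimal sketch of that idea, not the PR's implementation: the `ClipMetadata` interface, the `source` field, and the key names in the output are all assumptions.

```typescript
// Hypothetical metadata shape; field names are illustrative only.
interface ClipMetadata {
  title: string;
  author?: string;
  publishedDate?: string;
  description?: string;
  source: string; // URL the page was clipped from (assumed field)
}

function buildFrontmatter(meta: ClipMetadata): string {
  const entries: Array<[string, string | undefined]> = [
    ["title", meta.title],
    ["author", meta.author],
    ["published", meta.publishedDate],
    ["description", meta.description],
    ["source", meta.source],
  ];
  const lines = entries
    // Drop fields the page did not provide.
    .filter((entry): entry is [string, string] => entry[1] !== undefined)
    // JSON string syntax is also valid YAML, so this handles quoting and escapes.
    .map(([key, value]) => `${key}: ${JSON.stringify(value)}`);
  return ["---", ...lines, "---", ""].join("\n");
}
```

Quoting every value via `JSON.stringify` is a deliberate simplification: titles containing colons or quotes stay valid YAML without a dedicated YAML library.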
🔧 Professional Features
Implementation Highlights
🏗️ Architecture
🧪 Testing
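The commit message mentions cross-platform filename sanitization tests. A sketch of the kind of function such tests might exercise is shown below; `sanitizeFilename`, its replacement rules, and the `untitled` fallback are assumptions rather than the PR's code.

```typescript
// Hypothetical sanitizer: rules and defaults are illustrative assumptions.
function sanitizeFilename(title: string, maxLength = 100): string {
  const cleaned = title
    // Replace characters that are invalid in Windows filenames, plus control chars.
    .replace(/[<>:"/\\|?*\u0000-\u001f]/g, "-")
    // Collapse whitespace runs into single dashes.
    .replace(/\s+/g, "-")
    // Collapse repeated dashes produced by the steps above.
    .replace(/-+/g, "-")
    // Cap the length before trimming so the result never ends in a dash or dot.
    .slice(0, maxLength)
    // Trim leading/trailing dashes and dots (trailing dots are invalid on Windows).
    .replace(/^[-.]+|[-.]+$/g, "");
  return `${cleaned || "untitled"}.md`;
}
```

This sketch does not handle Windows reserved device names such as `CON` or `NUL`, which a production version would need to consider.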
🔒 Quality Assurance
No `any` types in production code
Usage Examples
CLI Integration
The new `clip` command integrates with the existing markmv CLI.
Test Coverage
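All functionality is covered by tests: 28 tests for the WebClipper core class and 14 CLI command tests, with external dependencies (jsdom, readability, turndown) mocked.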
Technical Implementation
Closes #38
Test Plan