A comprehensive toolset for downloading and archiving blog posts from various platforms. This project includes tools for processing Medium, AWS, and Hugging Face blog posts with image downloads, HTML cleaning, and local archiving.
See the processed results in action:
- AWS Blog Posts - 68 AWS blog posts (2018-2021)
- Hugging Face Blog Posts - 25 Hugging Face blog posts (2021-2024)
- Medium Blog Posts - 25+ Medium blog posts (2015-2021)
These pages demonstrate the clean, organized output that this tool produces.
post-downloader/
├── README.md                          # This documentation
├── extract_juliensimon_urls.py        # URL extraction utilities
├── extract_all_juliensimon_urls.py    # Comprehensive URL extraction
├── extract_single_medium_post.py      # Single Medium post extractor
├── requirements_simple.txt            # Dependencies for single post extractor
├── medium/                            # Medium posts processor
│   ├── README.md                      # Medium-specific documentation
│   ├── process_posts.py               # Main Medium processing script
│   ├── requirements.txt               # Python dependencies
│   ├── posts/                         # Original Medium HTML files
│   └── posts_by_year/                 # Processed posts organized by year
├── aws/                               # AWS blog processor
│   ├── README.md                      # AWS-specific documentation
│   ├── aws_blog_downloader.py         # AWS blog processing script
│   ├── requirements.txt               # Python dependencies
│   ├── aws-blog-urls.txt              # AWS blog URLs
│   └── downloads/                     # Processed AWS posts
└── huggingface/                       # Hugging Face blog processor
    ├── README.md                      # Hugging Face-specific documentation
    ├── huggingface_blog_downloader.py # Hugging Face blog processing script
    ├── requirements.txt               # Python dependencies
    ├── huggingface-blog-urls.txt      # Hugging Face blog URLs
    └── downloads/                     # Processed Hugging Face posts
- Real Date Conversion: Converts relative dates ("2 days ago") to actual calendar dates
- Content Cleaning: Removes unwanted UI elements between subtitle and image
- Clean Metadata: Extracts title, author, real date, and source URL
- Organized Filenames: Uses date and title for better file organization
- Rate Limiting: Respectful delays to avoid being blocked
- Error Handling: Robust error handling with graceful fallbacks
- Image Processing: Downloads remote images and converts them to WebP format
- HTML Cleaning: Removes unwanted Medium-specific attributes and classes
- Internal Link Processing: Updates links between posts to reference local files
- Year Organization: Automatically organizes processed posts by year
- Rate Limiting: Implements aggressive throttling to avoid being blocked
- Comprehensive Download: Downloads AWS blog posts with images
- WebP Conversion: Converts images to WebP format for better compression
- Sequential Naming: Uses sequential naming for images (image01.webp, etc.)
- Error Handling: Robust error handling with detailed logging
- Blog Post Download: Downloads Hugging Face blog posts
- Image Processing: Downloads and processes images from posts
- Local Archiving: Creates self-contained local archives
For extracting individual Medium posts with clean formatting:
Option 1: Using the existing batch processor (recommended)
cd medium
python process_posts.py --single-post "https://medium.com/@username/post-url"

Option 2: Using the standalone extractor
pip install -r requirements_simple.txt
python extract_single_medium_post.py "https://medium.com/@username/post-url"

Features:
- ✅ Real Date Extraction: Converts "2 days ago" to actual dates (e.g., "2025-07-31")
- ✅ Content Cleaning: Removes unwanted UI elements between subtitle and image
- ✅ Individual Folders: Creates organized directory structure with date and title
- ✅ Image Downloading: Downloads and converts images to WebP format
- ✅ Metadata Extraction: Title, author, date, and source URL
- ✅ Rate Limiting: Respectful delays to avoid being blocked
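The relative-date conversion described above can be sketched as follows. This is illustrative only: the real logic lives in extract_single_medium_post.py, and the function name and unit table here are assumptions, not the project's exact code.

```python
import re
from datetime import date, timedelta

def resolve_relative_date(text, today=None):
    """Convert a relative date like '2 days ago' to an ISO calendar date.

    `today` can be injected for deterministic testing; it defaults to
    the current date. Returns None if the text is not a recognized
    relative date, so callers can fall back to other parsing strategies.
    """
    today = today or date.today()
    match = re.match(r"(\d+)\s+(hour|day|week|month)s?\s+ago", text.strip())
    if not match:
        return None
    count, unit = int(match.group(1)), match.group(2)
    # Approximate each unit in days (months are treated as 30 days).
    days = {"hour": 0, "day": 1, "week": 7, "month": 30}[unit] * count
    return (today - timedelta(days=days)).isoformat()
```

For example, with `today` set to 2025-08-02, "2 days ago" resolves to "2025-07-31", matching the example above.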
Medium:

- Navigate to the medium directory:
  cd medium
- Install dependencies:
  pip install -r requirements.txt
- Process all posts:
  python process_posts.py

AWS:

- Navigate to the aws directory:
  cd aws
- Install dependencies:
  pip install -r requirements.txt
- Process AWS blog posts:
  python aws_blog_downloader.py

Hugging Face:

- Navigate to the huggingface directory:
  cd huggingface
- Install dependencies:
  pip install -r requirements.txt
- Process Hugging Face blog posts:
  python huggingface_blog_downloader.py
The project includes utilities for extracting URLs from various sources:
- extract_juliensimon_urls.py: Basic URL extraction from HTML files
- extract_all_juliensimon_urls.py: Comprehensive URL extraction with filtering
- Download: Downloads images from various CDNs with proper headers
- Conversion: Converts images to WebP format for better compression
- Sequential Naming: Uses consistent naming (image01.webp, image02.webp, etc.)
- Error Handling: Continues processing even if some images fail
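The download-convert-continue loop above can be sketched with requests and Pillow. The function name, header value, and quality setting are illustrative assumptions, not the processors' exact code.

```python
from io import BytesIO
from pathlib import Path

import requests
from PIL import Image

def download_as_webp(urls, dest_dir, timeout=30):
    """Download each image and save it as image01.webp, image02.webp, ...

    Failures are logged and skipped so one bad URL does not halt the run.
    Returns the list of filenames that were successfully saved.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, url in enumerate(urls, start=1):
        try:
            response = requests.get(
                url, timeout=timeout,
                headers={"User-Agent": "Mozilla/5.0"},  # some CDNs reject bare clients
            )
            response.raise_for_status()
            image = Image.open(BytesIO(response.content)).convert("RGB")
            target = dest / f"image{index:02d}.webp"
            image.save(target, "WEBP", quality=85)
            saved.append(target.name)
        except Exception as exc:  # keep going on any per-image failure
            print(f"skipping {url}: {exc}")
    return saved
```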
- Cleaning: Removes platform-specific attributes and classes
- Link Updates: Updates image and internal links to reference local files
- Self-Contained: Creates fully local archives that work offline
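This cleaning pass can be illustrated with BeautifulSoup. The attribute names in `NOISY_ATTRS` are stand-ins, not the processors' actual blocklist.

```python
from bs4 import BeautifulSoup

# Hypothetical examples of platform-specific attributes to strip.
NOISY_ATTRS = ("data-selectable-paragraph", "data-testid")

def clean_html(html, noisy_attrs=NOISY_ATTRS):
    """Strip platform-specific attributes/classes and point images at local files."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(True):
        for attr in list(tag.attrs):  # copy keys: we mutate while iterating
            if attr in noisy_attrs or attr == "class":
                del tag.attrs[attr]
    # Rewrite remote image sources to the sequential local filenames.
    for index, img in enumerate(soup.find_all("img"), start=1):
        img["src"] = f"image{index:02d}.webp"
    return str(soup)
```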
- Throttling: Implements delays between downloads to avoid being blocked
- Exponential Backoff: Handles HTTP 429 errors with increasing delays
- Resume Capability: Can resume processing from where it left off
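The exponential-backoff behavior can be sketched like this. The injected `fetch` callable (e.g. `requests.get`) and the delay values are illustrative; the real scripts choose their own retry limits.

```python
import time

def fetch_with_backoff(url, fetch, max_retries=5, base_delay=2.0, sleep=time.sleep):
    """Retry `fetch(url)` with exponentially increasing delays on HTTP 429.

    `fetch` is any callable returning an object with a `status_code`
    attribute (e.g. `requests.get`); injecting it keeps the sketch testable.
    """
    response = None
    for attempt in range(max_retries):
        response = fetch(url)
        if response.status_code != 429:
            return response  # success, or a non-rate-limit error to handle upstream
        sleep(base_delay * (2 ** attempt))  # 2s, 4s, 8s, ...
    return response  # still rate-limited after max_retries attempts
```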
- Comprehensive Test Suites: Each processor includes thorough testing
- Unit Tests: Test individual functions and components
- Integration Tests: Test end-to-end processing workflows
- Mock Testing: Network calls are mocked for reliable testing
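The mocking approach can be illustrated with `unittest.mock` from the standard library. The function under test here is a toy example, not part of the project's suites.

```python
from unittest.mock import MagicMock, patch

import requests

def fetch_title(url):
    """Toy function under test: fetch a page and return its <title> text."""
    html = requests.get(url, timeout=30).text
    start = html.find("<title>") + len("<title>")
    return html[start:html.find("</title>")]

def test_fetch_title_mocked():
    # No network access: requests.get is replaced with a canned response.
    fake = MagicMock()
    fake.text = "<html><title>Hello</title></html>"
    with patch("requests.get", return_value=fake) as mocked:
        assert fetch_title("https://example.com") == "Hello"
        mocked.assert_called_once()
```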
Each processor creates organized output directories:
processed_posts/
├── YYYY-MM-DD_Post-Title/
│   ├── index.html       # Cleaned HTML file
│   ├── image01.webp     # Downloaded and converted images
│   ├── image02.webp
│   └── ...
└── ...
The tools in this project have been used to create clean, organized archives of blog posts:
- AWS Blog Posts - 68 AWS blog posts (2018-2021) processed with the AWS blog downloader
- Hugging Face Blog Posts - 25 Hugging Face blog posts (2021-2024) processed with the Hugging Face blog downloader
- Medium Blog Posts - 25+ Medium blog posts (2015-2021) processed with the Medium posts processor
These pages demonstrate the clean, organized output that these tools produce - self-contained archives with local images, clean HTML, and proper cross-linking between posts.
- beautifulsoup4: HTML parsing and manipulation
- Pillow: Image processing and WebP conversion
- lxml: Fast XML/HTML parser for BeautifulSoup
- requests: HTTP client for downloading content
- urllib: Standard-library HTTP client used by some processors (no installation required)
- The scripts implement automatic throttling and retry logic
- If interrupted, simply run the script again - it will resume from where it left off
- You can adjust delay times in the scripts if needed
- Some images may fail due to network issues or rate limiting
- The scripts log failures and continue processing other content
- Failed downloads don't stop the overall processing
- The scripts process content one at a time to minimize memory usage
- For very large archives, consider processing in smaller batches
The project uses pre-commit hooks to ensure consistent code quality and formatting.
Install and configure pre-commit hooks:
# Install pre-commit
pip install pre-commit
# Install the hooks
pre-commit install
# Or use the setup script
./setup_pre_commit.sh

The following hooks run automatically on every commit:
- black: Code formatting (Python)
- isort: Import sorting (Python)
- trailing-whitespace: Remove trailing whitespace
- end-of-file-fixer: Ensure files end with newline
- check-yaml: Validate YAML files
- check-added-large-files: Prevent large files (>1MB)
- check-merge-conflict: Detect leftover merge-conflict markers
Run formatting manually:
# Format all files
pre-commit run --all-files
# Format specific files
pre-commit run black --files path/to/file.py
pre-commit run isort --files path/to/file.py

- black: 88 character line length, Python 3.8+ compatible
- isort: Black-compatible profile, organized imports
- pyproject.toml: Centralized configuration for all tools
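Based on the settings listed above, the centralized configuration in pyproject.toml presumably looks something like the following (an illustrative sketch, not copied from the repository):

```toml
[tool.black]
line-length = 88
target-version = ["py38"]

[tool.isort]
profile = "black"
```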
The project includes a comprehensive test suite to ensure all functionality works correctly.
Run basic functionality tests (no network access required):
cd medium
python test_basic.py

Run comprehensive tests without network calls:
cd medium
python test_simple.py

Run the complete test suite (includes mocked network tests):
cd medium
python test_medium_processor.py

The test suite covers:
- ✅ Configuration management
- ✅ HTML processing and cleaning
- ✅ Image URL extraction
- ✅ Internal link detection
- ✅ Filename sanitization
- ✅ Image filename generation
- ✅ Post directory name creation
- ✅ Configuration loading and saving
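As an illustration, the filename-sanitization and directory-naming behavior covered by the tests can be sketched like this (helper names and the exact character rules are assumptions, not the tested implementation):

```python
import re

def sanitize_filename(title, max_length=80):
    """Reduce a post title to a safe, readable directory-name component."""
    safe = re.sub(r"[^\w\s-]", "", title)     # drop punctuation
    safe = re.sub(r"\s+", "-", safe.strip())  # collapse whitespace to hyphens
    return safe[:max_length].rstrip("-")

def post_dir_name(date_str, title):
    """Combine an ISO date and sanitized title, e.g. 2021-05-04_My-Post-Title."""
    return f"{date_str}_{sanitize_filename(title)}"
```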
Feel free to submit issues or pull requests to improve the tools. Each platform processor is designed to be extensible for other similar platforms.
This project is provided as-is for educational and archival purposes. Please respect the terms of service of the platforms you're downloading content from.