A high-performance, multi-threaded Python web scraper for downloading images from Unsplash and Pexels.
This scraper uses Selenium WebDriver to crawl image providers, extract image metadata, and download images in parallel across multiple threads. It is designed for efficiency and scalability, with configurable worker counts, batch processing, and persistent storage.
- Multi-threaded Architecture - Configurable crawler and downloader threads
- Multi-Provider Support - Works with both Unsplash and Pexels
- Smart Image Handling - Automatic URL rewriting for specific dimensions
- Deduplication - Automatic duplicate detection and filtering
- Batch Processing - Efficient queue management for large-scale crawling
- Progress Tracking - Real-time logging and status updates
- Flexible Configuration - Extensive CLI options for fine-tuning
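The "Smart Image Handling" feature rewrites image URLs to request a specific width. The exact rewriting rules are provider-specific; the sketch below is an illustrative assumption based on CDN-style `w=` query parameters (the `rewrite_width` helper is hypothetical, not part of the scraper's API):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def rewrite_width(url: str, width: int) -> str:
    """Return the URL with a w=<width> query parameter added.

    Hypothetical helper: many image CDNs (including Unsplash's) accept a
    'w' parameter for server-side resizing. A width of 0 means "original".
    """
    if width <= 0:
        return url
    scheme, netloc, path, query, frag = urlsplit(url)
    params = dict(parse_qsl(query))
    params["w"] = str(width)          # request the resized rendition
    return urlunsplit((scheme, netloc, path, urlencode(params), frag))
```

With `--width 800`, a URL like `https://images.unsplash.com/photo-123?q=80` would become `...?q=80&w=800`, so the server does the resizing instead of the client.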
# Clone the repository
git clone https://github.com/NuLLFiX/scraper.git
cd scraper
# Install dependencies
pip install selenium requests click
# Run with default settings
python image-scraper.py --provider pexels --crawl-limit 5
- Python 3.7+
- Google Chrome browser
- ChromeDriver (matching your Chrome version)
- Clone the repository:
git clone https://github.com/NuLLFiX/scraper.git
cd i_scraper
- Install Python dependencies:
pip install selenium requests click
- Download ChromeDriver and ensure it's in your PATH:
- Download from: https://chromedriver.chromium.org/downloads
# Download from Pexels (5 pages)
python image-scraper.py --provider pexels --crawl-limit 5
# Download from Unsplash (5 pages)
python image-scraper.py --provider unsplash --crawl-limit 5
# Download specific width
python image-scraper.py --provider pexels --width 800 --crawl-limit 5
# Use headless mode
python image-scraper.py --provider unsplash --headless True --crawl-limit 10

| Option | Description | Default |
|---|---|---|
| `--provider` | Image provider (unsplash/pexels) | unsplash |
| `--crawl-limit` | Maximum pages to crawl | 5 |
| `--width` | Image width (0 for original) | 0 |
| `--crawlers` | Number of crawler threads | 2 |
| `--downloaders` | Number of downloader threads | 2 |
| `--categories` | Comma-separated categories | provider defaults |
| `--headless` | Run browser headless | False |
| `--user-agent` | Custom User-Agent string | None |
| `--retries` | Download retry attempts | 3 |
| `--timeout` | Request timeout (seconds) | 30 |
| `--quiet` | Reduce logging verbosity | False |
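The `--retries`, `--timeout`, and `--user-agent` options shape how downloads are attempted. A minimal sketch of that retry loop is below; the function name, `session` injection, and `backoff` parameter are illustrative assumptions, not the scraper's actual internals:

```python
import time
import requests

def download_with_retries(url, path, session=None, retries=3, timeout=30,
                          backoff=2.0, user_agent=None):
    """Fetch url into path, retrying on network errors (hypothetical helper)."""
    session = session or requests.Session()
    headers = {"User-Agent": user_agent} if user_agent else {}
    for attempt in range(1, retries + 1):
        try:
            response = session.get(url, headers=headers, timeout=timeout)
            response.raise_for_status()       # treat HTTP errors as failures too
            with open(path, "wb") as fh:
                fh.write(response.content)
            return True
        except requests.RequestException:
            if attempt < retries:
                time.sleep(backoff * attempt) # back off between attempts
    return False
```

Injecting the session makes the helper easy to test without touching the network; the real downloader threads would simply call it with the defaults.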
i_scraper/
├── image-scraper.py # CLI entry point
├── scraper.log # Log file (generated)
├── README.md # This file
└── src/
├── coordinator.py # Main coordination logic
├── workers.py # Crawler and downloader workers
├── storage.py # Persistent storage management
├── unsplash.py # Unsplash-specific extraction
└── pexels.py # Pexels-specific extraction
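The provider modules (`unsplash.py`, `pexels.py`) extract image URLs and alt text from pages the crawlers render. The real workers drive Selenium; the stdlib sketch below shows only the extraction step on already-rendered HTML, and the markup structure is a simplifying assumption:

```python
from html.parser import HTMLParser

class ImageExtractor(HTMLParser):
    """Collect {url, description} records from <img> tags (illustrative only)."""
    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            a = dict(attrs)
            if "src" in a:  # skip images without a source
                self.images.append({"url": a["src"],
                                    "description": a.get("alt", "")})

def extract_images(html: str):
    parser = ImageExtractor()
    parser.feed(html)
    return parser.images
```

Each provider module would apply its own selectors on top of this idea, since Unsplash and Pexels lay out their galleries differently.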
┌─────────────────────────────────────────────────────────────┐
│ Main Thread (CLI) │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Coordinator Thread │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Crawl Queue │ │ Download Queue │ │
│ └─────────────────┘ └─────────────────┘ │
└─────────────────────────────────────────────────────────────┘
│ │
▼ ▼
┌─────────────────────┐ ┌─────────────────────┐
│ Crawler Threads │ │ Downloader Threads │
│ (Selenium) │ │ (requests) │
└─────────────────────┘ └─────────────────────┘
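The diagram above is a classic two-stage producer/consumer pipeline: crawlers feed a download queue, downloaders drain it. A minimal runnable sketch of that wiring (worker bodies faked, sentinel-based shutdown assumed) looks like this:

```python
import queue
import threading

# Two queues, as in the diagram: pages to visit, and image URLs found.
crawl_queue = queue.Queue()
download_queue = queue.Queue()

def crawler():
    # The real crawler drives Selenium; here each page "yields" three images.
    while True:
        page = crawl_queue.get()
        if page is None:                 # sentinel: shut this worker down
            crawl_queue.task_done()
            break
        for i in range(3):
            download_queue.put(f"{page}/image-{i}.png")
        crawl_queue.task_done()

def downloader(results):
    # The real downloader fetches with requests; here we just record the URL.
    while True:
        url = download_queue.get()
        if url is None:
            download_queue.task_done()
            break
        results.append(url)
        download_queue.task_done()

results = []
crawlers = [threading.Thread(target=crawler) for _ in range(2)]
downloaders = [threading.Thread(target=downloader, args=(results,)) for _ in range(2)]
for t in crawlers + downloaders:
    t.start()

for page in ["page-1", "page-2"]:
    crawl_queue.put(page)
for _ in crawlers:
    crawl_queue.put(None)                # stop crawlers after all pages
for t in crawlers:
    t.join()
for _ in downloaders:
    download_queue.put(None)             # stop downloaders after crawling ends
for t in downloaders:
    t.join()

print(len(results))  # → 6 (2 pages x 3 images)
```

Sentinel `None` values let each stage drain fully before shutting down, which is why the downloaders are only stopped after every crawler has joined.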
This project is based on the crawling-for-images tutorial by amacal:
- Original source: https://github.com/amacal/learning-python
Each visited image is stored as JSON:
- `url` - Original image URL
- `follow` - Provider photo URL
- `description` - Image description/alt text
- `tags` - Optional list of tags
Images are stored as PNG files named by their photo ID in subdirectories organized by first 4 characters.
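The "first 4 characters" bucketing keeps any single directory from holding too many files. A sketch of that layout, with a hypothetical helper for writing the JSON record alongside the image (function names assumed, not taken from `storage.py`):

```python
import json
from pathlib import Path

def storage_path(root: Path, photo_id: str) -> Path:
    # Bucket by the first 4 characters of the photo ID,
    # e.g. "abcd1234" -> root/abcd/abcd1234.png
    return root / photo_id[:4] / f"{photo_id}.png"

def write_metadata(root: Path, photo_id: str, record: dict) -> Path:
    # Hypothetical helper: store the per-image JSON next to the PNG.
    path = storage_path(root, photo_id).with_suffix(".json")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(record, indent=2))
    return path
```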
MIT License - See LICENSE for details.
Copyright (c) 2024 NuLLFiX
This project is for educational purposes only. Please respect the image providers' terms of service and copyright when using this scraper.