
Image Scraper


A high-performance, multi-threaded Python web scraper for downloading images from Unsplash and Pexels.

Overview

This scraper uses Selenium WebDriver to crawl image providers, extract image metadata, and download images in parallel across multiple threads. It is designed for efficiency and scalability, with configurable worker counts, batch processing, and persistent storage.

Features

  • Multi-threaded Architecture - Configurable crawler and downloader threads
  • Multi-Provider Support - Works with both Unsplash and Pexels
  • Smart Image Handling - Automatic URL rewriting for specific dimensions
  • Deduplication - Automatic duplicate detection and filtering
  • Batch Processing - Efficient queue management for large-scale crawling
  • Progress Tracking - Real-time logging and status updates
  • Flexible Configuration - Extensive CLI options for fine-tuning
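The deduplication feature implies a "seen" check shared by many concurrent workers, which must be thread-safe. A minimal sketch of such a check (illustrative only — the project's actual deduplication lives in `src/storage.py` and may differ):

```python
import threading

class SeenUrls:
    """Thread-safe set for dropping duplicate image URLs.

    Illustrative sketch, not the project's actual API: workers call
    add() and only process a URL when it returns True.
    """

    def __init__(self):
        self._seen = set()
        self._lock = threading.Lock()

    def add(self, url):
        """Return True if the URL is new, False if already seen."""
        with self._lock:
            if url in self._seen:
                return False
            self._seen.add(url)
            return True
```

Holding a lock around the check-and-insert keeps two threads from both treating the same URL as new.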

Quick Start

```shell
# Clone the repository
git clone https://github.com/NuLLFiX/scraper.git
cd scraper

# Install dependencies
pip install selenium requests click

# Run with default settings
python image-scraper.py --provider pexels --crawl-limit 5
```

Installation

Prerequisites

  • Python 3.7+
  • Google Chrome browser
  • ChromeDriver (matching your Chrome version)

Setup

1. Clone the repository:

   ```shell
   git clone https://github.com/NuLLFiX/scraper.git
   cd scraper
   ```

2. Install Python dependencies:

   ```shell
   pip install selenium requests click
   ```

3. Download ChromeDriver and ensure it is on your PATH.

Usage

Basic Examples

```shell
# Download from Pexels (5 pages)
python image-scraper.py --provider pexels --crawl-limit 5

# Download from Unsplash (5 pages)
python image-scraper.py --provider unsplash --crawl-limit 5

# Download specific width
python image-scraper.py --provider pexels --width 800 --crawl-limit 5

# Use headless mode
python image-scraper.py --provider unsplash --headless True --crawl-limit 10
```

Command-Line Options

| Option | Description | Default |
|--------|-------------|---------|
| `--provider` | Image provider (`unsplash`/`pexels`) | `unsplash` |
| `--crawl-limit` | Maximum pages to crawl | `5` |
| `--width` | Image width (`0` for original) | `0` |
| `--crawlers` | Number of crawler threads | `2` |
| `--downloaders` | Number of downloader threads | `2` |
| `--categories` | Comma-separated categories | provider defaults |
| `--headless` | Run browser headless | `False` |
| `--user-agent` | Custom User-Agent string | `None` |
| `--retries` | Download retry attempts | `3` |
| `--timeout` | Request timeout (seconds) | `30` |
| `--quiet` | Reduce logging verbosity | `False` |
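As a rough sketch of how these options map onto a parser, here is a standard-library `argparse` stand-in (the real entry point uses `click`; option names and defaults mirror the table above, but the actual implementation may differ):

```python
import argparse

def build_parser():
    """Illustrative stand-in for the project's click-based CLI."""
    p = argparse.ArgumentParser(prog="image-scraper.py")
    p.add_argument("--provider", choices=["unsplash", "pexels"], default="unsplash")
    p.add_argument("--crawl-limit", type=int, default=5)
    p.add_argument("--width", type=int, default=0)
    p.add_argument("--crawlers", type=int, default=2)
    p.add_argument("--downloaders", type=int, default=2)
    p.add_argument("--categories", default=None)
    # Parse "True"/"False" strings explicitly; bare type=bool would
    # treat any non-empty string as True.
    p.add_argument("--headless", type=lambda s: s.lower() == "true", default=False)
    p.add_argument("--user-agent", default=None)
    p.add_argument("--retries", type=int, default=3)
    p.add_argument("--timeout", type=int, default=30)
    p.add_argument("--quiet", type=lambda s: s.lower() == "true", default=False)
    return p

# Example: override two options, keep the rest at their defaults.
args = build_parser().parse_args(["--provider", "pexels", "--crawl-limit", "10"])
```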

Project Structure

```
i_scraper/
├── image-scraper.py      # CLI entry point
├── scraper.log           # Log file (generated)
├── README.md             # This file
└── src/
    ├── coordinator.py    # Main coordination logic
    ├── workers.py        # Crawler and downloader workers
    ├── storage.py        # Persistent storage management
    ├── unsplash.py       # Unsplash-specific extraction
    └── pexels.py         # Pexels-specific extraction
```

Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                      Main Thread (CLI)                      │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                     Coordinator Thread                      │
│  ┌─────────────────┐         ┌─────────────────┐            │
│  │   Crawl Queue   │         │ Download Queue  │            │
│  └─────────────────┘         └─────────────────┘            │
└─────────────────────────────────────────────────────────────┘
           │                                    │
           ▼                                    ▼
┌─────────────────────┐            ┌─────────────────────┐
│   Crawler Threads   │            │  Downloader Threads │
│     (Selenium)      │            │     (requests)      │
└─────────────────────┘            └─────────────────────┘
```

Credits

This project is based on the crawling-for-images tutorial by amacal.

Storage Format

Visited Metadata

Each visited image is stored as JSON:

  • url - Original image URL
  • follow - Provider photo URL
  • description - Image description/alt text
  • tags - Optional list of tags
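A record with those fields might look like the following (the values are invented for illustration; only the field names come from the list above):

```python
import json

# Hypothetical example of one visited-image record.
record = {
    "url": "https://images.example.com/photo-abc123.jpg",
    "follow": "https://www.pexels.com/photo/abc123/",
    "description": "A mountain lake at sunrise",
    "tags": ["nature", "lake"],
}
line = json.dumps(record)  # one JSON object, ready to persist
```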

Downloaded Images

Images are stored as PNG files named by their photo ID, in subdirectories organized by the first 4 characters of the ID.
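That naming scheme can be expressed as a small helper (hypothetical; the project's `src/storage.py` may differ in detail):

```python
from pathlib import Path

def image_path(root: Path, photo_id: str) -> Path:
    """Derive the on-disk location described above: a subdirectory
    named after the first 4 characters of the photo ID, then <id>.png."""
    return root / photo_id[:4] / f"{photo_id}.png"

# Example: photo "a1b2c3d4e5" lands in images/a1b2/a1b2c3d4e5.png
p = image_path(Path("images"), "a1b2c3d4e5")
```

Fanning files out by an ID prefix keeps any single directory from accumulating an unbounded number of entries.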

License

MIT License - See LICENSE for details.

Copyright (c) 2024 NuLLFiX

Disclaimer

This project is for educational purposes only. Please respect the image providers' terms of service and copyright when using this scraper.

