
Image Scraper


A high-performance, multi-threaded Python web scraper for downloading images from Unsplash and Pexels.

Overview

This scraper uses Selenium WebDriver to crawl image providers, extract image metadata, and download images in parallel across multiple threads. It is designed for efficiency and scalability, with configurable worker counts, batch processing, and persistent storage.

Features

  • Multi-threaded Architecture - Configurable crawler and downloader threads
  • Multi-Provider Support - Works with both Unsplash and Pexels
  • Smart Image Handling - Automatic URL rewriting for specific dimensions
  • Deduplication - Automatic duplicate detection and filtering
  • Batch Processing - Efficient queue management for large-scale crawling
  • Progress Tracking - Real-time logging and status updates
  • Flexible Configuration - Extensive CLI options for fine-tuning
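The deduplication feature implies a "seen" check shared by many concurrent workers, which must be thread-safe. A minimal sketch of such a check (illustrative only — the project's actual deduplication lives in `src/storage.py` and may differ):

```python
import threading

class SeenUrls:
    """Thread-safe set for dropping duplicate image URLs.

    Illustrative sketch, not the project's actual API: workers call
    add() and only process a URL when it returns True.
    """

    def __init__(self):
        self._seen = set()
        self._lock = threading.Lock()

    def add(self, url):
        """Return True if the URL is new, False if already seen."""
        with self._lock:
            if url in self._seen:
                return False
            self._seen.add(url)
            return True
```

Holding a lock around the check-and-insert keeps two threads from both treating the same URL as new.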

Quick Start

```shell
# Clone the repository
git clone https://github.com/NuLLFiX/scraper.git
cd scraper

# Install dependencies
pip install selenium requests click

# Run with default settings
python image-scraper.py --provider pexels --crawl-limit 5
```

Installation

Prerequisites

  • Python 3.7+
  • Google Chrome browser
  • ChromeDriver (matching your Chrome version)

Setup

1. Clone the repository:

   ```shell
   git clone https://github.com/NuLLFiX/scraper.git
   cd scraper
   ```

2. Install Python dependencies:

   ```shell
   pip install selenium requests click
   ```

3. Download ChromeDriver and ensure it is on your PATH.

Usage

Basic Examples

```shell
# Download from Pexels (5 pages)
python image-scraper.py --provider pexels --crawl-limit 5

# Download from Unsplash (5 pages)
python image-scraper.py --provider unsplash --crawl-limit 5

# Download specific width
python image-scraper.py --provider pexels --width 800 --crawl-limit 5

# Use headless mode
python image-scraper.py --provider unsplash --headless True --crawl-limit 10
```

Command-Line Options

| Option | Description | Default |
|--------|-------------|---------|
| `--provider` | Image provider (`unsplash`/`pexels`) | `unsplash` |
| `--crawl-limit` | Maximum pages to crawl | `5` |
| `--width` | Image width (`0` for original) | `0` |
| `--crawlers` | Number of crawler threads | `2` |
| `--downloaders` | Number of downloader threads | `2` |
| `--categories` | Comma-separated categories | provider defaults |
| `--headless` | Run browser headless | `False` |
| `--user-agent` | Custom User-Agent string | `None` |
| `--retries` | Download retry attempts | `3` |
| `--timeout` | Request timeout (seconds) | `30` |
| `--quiet` | Reduce logging verbosity | `False` |
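As a rough sketch of how these options map onto a parser, here is a standard-library `argparse` stand-in (the real entry point uses `click`; option names and defaults mirror the table above, but the actual implementation may differ):

```python
import argparse

def build_parser():
    """Illustrative stand-in for the project's click-based CLI."""
    p = argparse.ArgumentParser(prog="image-scraper.py")
    p.add_argument("--provider", choices=["unsplash", "pexels"], default="unsplash")
    p.add_argument("--crawl-limit", type=int, default=5)
    p.add_argument("--width", type=int, default=0)
    p.add_argument("--crawlers", type=int, default=2)
    p.add_argument("--downloaders", type=int, default=2)
    p.add_argument("--categories", default=None)
    # Parse "True"/"False" strings explicitly; bare type=bool would
    # treat any non-empty string as True.
    p.add_argument("--headless", type=lambda s: s.lower() == "true", default=False)
    p.add_argument("--user-agent", default=None)
    p.add_argument("--retries", type=int, default=3)
    p.add_argument("--timeout", type=int, default=30)
    p.add_argument("--quiet", type=lambda s: s.lower() == "true", default=False)
    return p

# Example: override two options, keep the rest at their defaults.
args = build_parser().parse_args(["--provider", "pexels", "--crawl-limit", "10"])
```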

Project Structure

```
i_scraper/
├── image-scraper.py      # CLI entry point
├── scraper.log           # Log file (generated)
├── README.md             # This file
└── src/
    ├── coordinator.py    # Main coordination logic
    ├── workers.py        # Crawler and downloader workers
    ├── storage.py        # Persistent storage management
    ├── unsplash.py       # Unsplash-specific extraction
    └── pexels.py         # Pexels-specific extraction
```

Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                      Main Thread (CLI)                      │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                     Coordinator Thread                      │
│  ┌─────────────────┐         ┌─────────────────┐            │
│  │   Crawl Queue   │         │ Download Queue  │            │
│  └─────────────────┘         └─────────────────┘            │
└─────────────────────────────────────────────────────────────┘
           │                                    │
           ▼                                    ▼
┌─────────────────────┐            ┌─────────────────────┐
│   Crawler Threads   │            │  Downloader Threads │
│     (Selenium)      │            │     (requests)      │
└─────────────────────┘            └─────────────────────┘
```

Credits

This project is based on the crawling-for-images tutorial by amacal.

Storage Format

Visited Metadata

Each visited image is stored as JSON:

  • url - Original image URL
  • follow - Provider photo URL
  • description - Image description/alt text
  • tags - Optional list of tags
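A record with those fields might look like the following (the values are invented for illustration; only the field names come from the list above):

```python
import json

# Hypothetical example of one visited-image record.
record = {
    "url": "https://images.example.com/photo-abc123.jpg",
    "follow": "https://www.pexels.com/photo/abc123/",
    "description": "A mountain lake at sunrise",
    "tags": ["nature", "lake"],
}
line = json.dumps(record)  # one JSON object, ready to persist
```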

Downloaded Images

Images are stored as PNG files named by their photo ID, in subdirectories organized by the first 4 characters of the ID.
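That naming scheme can be expressed as a small helper (hypothetical; the project's `src/storage.py` may differ in detail):

```python
from pathlib import Path

def image_path(root: Path, photo_id: str) -> Path:
    """Derive the on-disk location described above: a subdirectory
    named after the first 4 characters of the photo ID, then <id>.png."""
    return root / photo_id[:4] / f"{photo_id}.png"

# Example: photo "a1b2c3d4e5" lands in images/a1b2/a1b2c3d4e5.png
p = image_path(Path("images"), "a1b2c3d4e5")
```

Fanning files out by an ID prefix keeps any single directory from accumulating an unbounded number of entries.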

License

MIT License - See LICENSE for details.

Copyright (c) 2024 NuLLFiX

Disclaimer

This project is for educational purposes only. Please respect the image providers' terms of service and copyright when using this scraper.

