SkunkScrape

Badges: CI · License: MIT · Python · Code Style: black · Linter: ruff · Tests: pytest · Coverage · Open Issues · Last Commit · Repo Size · Contributions welcome

A modular South Africa (ZA)-focused lead discovery and scraping toolkit with CLI + GUI, proxy rotation, discovery utilities, and an export pipeline.

Table of Contents

  • Features
  • Architecture
  • Project Layout
  • Requirements
  • Quick Start
  • Configuration
  • Plugin Manifest
  • Usage
  • Discovery Utilities
  • Pipeline
  • Scheduling
  • Exports
  • Packaging & Deployment
  • Development
  • Testing
  • Linting & Formatting
  • Logging
  • Troubleshooting
  • Roadmap
  • Contributing
  • License

Features

  • Two entry points: Python CLI (skunkscrape) and Tkinter GUI.
  • Plugin architecture with a simple manifest.json.
  • Proxy rotation supporting JSON and legacy text formats.
  • ZA discovery tools (Certificate Transparency logs, Common Crawl, directories/jobs).
  • Pipeline hooks for normalization, DNC (do-not-call) and HLR enrichment stubs, and exports.
  • Scheduler for recurring jobs (schedule/croniter).
  • Packageable as a single executable (PyInstaller) or container (Docker).

Architecture

  • Core: configuration, logging, shared utilities, exceptions.
  • CLI: orchestrates plugins and batches.
  • GUI: category → plugin selector + proxy picker.
  • Plugins: each scraper is self-contained and exposes main().
  • Discovery: host/source generation for ZA domains and socials.
  • Pipeline: normalization, exporters, and scheduling.

Project Layout


SkunkScrape/
├── pyproject.toml
├── README.md
├── LICENSE
├── .gitignore
├── .env.example
├── requirements.txt
│
├── data/
│   ├── proxies/
│   │   ├── proxies.json
│   │   └── Webshare 10 proxies.txt
│   ├── seeds/
│   ├── logs/
│   ├── cache/
│   └── exports/
│
├── assets/
│   ├── banner.png
│   └── favicon.ico
│
├── skunkscrape/
│   ├── __init__.py
│   ├── core/
│   │   ├── config.py
│   │   ├── logging.py
│   │   ├── utils.py
│   │   └── exceptions.py
│   ├── cli/
│   │   └── main.py
│   ├── gui/
│   │   └── main_gui.py
│   ├── plugins/
│   │   ├── manifest.json
│   │   ├── gumtree_scraper.py
│   │   ├── autotrader_scraper.py
│   │   ├── property24_scraper.py
│   │   └── smart_contact_crawler.py
│   ├── discovery/
│   │   ├── discover_coza_sources.py
│   │   ├── source_generator.py
│   │   └── discovery_runner.py
│   └── pipeline/
│       ├── collector.py
│       ├── exporter.py
│       └── scheduler.py
│
├── tests/
│   ├── test_plugins.py
│   ├── test_utils.py
│   └── test_gui_launcher.py
│
├── scripts/
│   ├── build_exe.ps1
│   ├── run_all_scrapers.ps1
│   ├── fix_plugins.ps1
│   └── scan_project_tree.ps1
│
└── Dockerfile

Requirements

  • Python 3.10+ (3.11 recommended).
  • Windows, macOS, or Linux.
  • Recommended: virtual environment.

Quick Start

# Create & activate venv (Windows PowerShell)
python -m venv .venv
.\.venv\Scripts\Activate.ps1

# Install dependencies
pip install -r requirements.txt

# Copy environment template
copy .env.example .env

List available plugins and run one:

python -m skunkscrape.cli.main list
python -m skunkscrape.cli.main run --name gumtree_scraper

Launch the GUI:

python -m skunkscrape.gui.main_gui

Configuration

.env

Read by skunkscrape/core/config.py:

PROXY_FILE=data/proxies/proxies.json
LOG_LEVEL=INFO
EXPORT_DIR=data/exports
WEBHOOK_URL=
CRM_HUBSPOT_KEY=
CRM_SALESFORCE_KEY=
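
A minimal sketch of how skunkscrape/core/config.py might surface these values, assuming python-dotenv is installed (the attribute names here mirror the .env keys but are illustrative):

import os
from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # read key=value pairs from .env in the working directory

PROXY_FILE = os.getenv("PROXY_FILE", "data/proxies/proxies.json")
LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")
EXPORT_DIR = os.getenv("EXPORT_DIR", "data/exports")
WEBHOOK_URL = os.getenv("WEBHOOK_URL", "")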

Proxies

Preferred: data/proxies/proxies.json

[
  { "host": "198.23.239.134", "port": 6540, "user": "userA", "pass": "secretA" },
  { "host": "45.38.107.97",   "port": 6014, "user": "userB", "pass": "secretB" }
]

Legacy: data/proxies/Webshare 10 proxies.txt (one proxy per line, ip:port:user:pass).
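
A sketch of a loader that accepts both formats and rotates round-robin over the pool (the function name is illustrative, not the actual core/utils API):

import itertools, json
from pathlib import Path

def load_proxies(path):
    """Return a list of proxy dicts from JSON or legacy ip:port:user:pass text."""
    p = Path(path)
    text = p.read_text(encoding="utf-8")
    if p.suffix == ".json":
        return json.loads(text)
    proxies = []
    for line in text.splitlines():
        if not line.strip():
            continue
        host, port, user, password = line.strip().split(":", 3)
        proxies.append({"host": host, "port": int(port), "user": user, "pass": password})
    return proxies

# Simple round-robin rotation over the pool
rotation = itertools.cycle(load_proxies("data/proxies/proxies.json"))
proxy = next(rotation)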

Plugin Manifest

skunkscrape/plugins/manifest.json groups plugins by category:

{
  "categories": {
    "Directories": { "plugins": ["gumtree_scraper","junkmail_scraper","sayellow_scraper"] },
    "Jobs":        { "plugins": ["pnet_scraper","careerjunction_scraper","careers24_scraper"] },
    "Property":    { "plugins": ["property24_scraper","privateproperty_scraper"] },
    "Autos":       { "plugins": ["autotrader_scraper"] }
  }
}
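
For reference, the CLI's list command presumably walks this structure along these lines (a sketch, not the actual CLI source):

import json
from pathlib import Path

manifest = json.loads(Path("skunkscrape/plugins/manifest.json").read_text(encoding="utf-8"))
for category, entry in manifest["categories"].items():
    for plugin in entry["plugins"]:
        print(f"{category}: {plugin}")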

Usage

CLI

# List plugins
python -m skunkscrape.cli.main list

# Run a single plugin
python -m skunkscrape.cli.main run --name pnet_scraper

# Run all plugins defined in the manifest
python -m skunkscrape.cli.main run --all

Plugin contract: each plugin module exports a main(**kwargs) function.
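
A minimal conforming plugin might look like this (the proxy keyword is an assumed example, not a documented parameter):

# skunkscrape/plugins/example_scraper.py (hypothetical)
def main(**kwargs):
    """Entry point invoked by the CLI/GUI; kwargs carry run options."""
    proxy = kwargs.get("proxy")  # e.g. a dict loaded from PROXY_FILE
    # ... fetch pages, parse leads, hand results to the pipeline ...
    return []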

GUI

python -m skunkscrape.gui.main_gui
  • Category → Plugin dropdowns are populated from manifest.json.
  • The proxy dropdown is populated from PROXY_FILE (see the sketch below).
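
Roughly, the GUI wires those dropdowns like this (a Tkinter sketch under the assumptions above, not the actual main_gui code):

import json, tkinter as tk
from pathlib import Path
from tkinter import ttk

manifest = json.loads(Path("skunkscrape/plugins/manifest.json").read_text(encoding="utf-8"))
proxies = json.loads(Path("data/proxies/proxies.json").read_text(encoding="utf-8"))

root = tk.Tk()
category_box = ttk.Combobox(root, values=list(manifest["categories"]))
plugin_box = ttk.Combobox(root)
proxy_box = ttk.Combobox(root, values=[f"{p['host']}:{p['port']}" for p in proxies])

def on_category(_event):
    # Narrow the plugin choices to the selected category
    plugin_box["values"] = manifest["categories"][category_box.get()]["plugins"]

category_box.bind("<<ComboboxSelected>>", on_category)
for widget in (category_box, plugin_box, proxy_box):
    widget.pack(padx=8, pady=4)
root.mainloop()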

Discovery Utilities

python -m skunkscrape.discovery.discover_coza_sources
python -m skunkscrape.discovery.source_generator --out data/seeds/sources.txt --max 100000 --threads 32 --proxy-file "data/proxies/Webshare 10 proxies.txt"

Pipeline

  • collector.py: normalization and DNC/HLR enrichment hooks (stubs); see the sketch below.
  • exporter.py: CSV/CRM/webhook/Discord exporters (stubs).
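
For example, the normalization stage might canonicalize ZA phone numbers before enrichment (a hypothetical helper, not the shipped collector.py):

import re

def normalize_za_msisdn(raw):
    """Normalize a South African number to +27... form; return None if invalid."""
    digits = re.sub(r"\D", "", raw)
    if digits.startswith("27") and len(digits) == 11:
        return "+" + digits
    if digits.startswith("0") and len(digits) == 10:
        return "+27" + digits[1:]
    return None

assert normalize_za_msisdn("082 123 4567") == "+27821234567"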

Scheduling

  • scheduler.py: wrappers around the schedule and croniter libraries (see the sketch below).
  • Drive schedules via environment variables or a small YAML/TOML config file.
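
With the schedule library, a recurring run might look like this (the job body shelling out to the CLI is illustrative, not the actual scheduler.py):

import subprocess, time
import schedule  # third-party "schedule" package

def run_all():
    # Kick off every manifest plugin, as the CLI's --all flag does
    subprocess.run(["python", "-m", "skunkscrape.cli.main", "run", "--all"], check=False)

schedule.every().day.at("02:00").do(run_all)

while True:
    schedule.run_pending()
    time.sleep(60)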

Exports

The default export directory is data/exports. CSV is supported now; CRM/webhook connectors are ready for extension.
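
A CSV export step can be as small as this (the field names are illustrative, not the shipped exporter.py schema):

import csv
from pathlib import Path

def export_csv(leads, out_dir="data/exports", name="leads.csv"):
    """Write a list of lead dicts to the export directory as CSV."""
    path = Path(out_dir) / name
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=["name", "phone", "source"])
        writer.writeheader()
        writer.writerows(leads)
    return path

export_csv([{"name": "Example Co", "phone": "+27821234567", "source": "gumtree"}])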


Packaging & Deployment

PyInstaller

.\scripts\build_exe.ps1
# Output: dist/SkunkScrape.exe

Docker

docker build -t skunkscrape:latest .
# %cd% is cmd.exe syntax; use ${PWD} in PowerShell or $(pwd) on macOS/Linux
docker run --rm -it -v "%cd%/data:/app/data" skunkscrape:latest

Development

pip install -r requirements.txt
pip install -e .[dev]

Testing

pytest -q

Linting & Formatting

ruff check .
black .
isort .

Logging

Logs write to data/logs/. Configure level via .env (LOG_LEVEL=DEBUG|INFO|WARNING).
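
core/logging.py presumably does something along these lines (a sketch; the handler and format details are assumptions):

import logging, os
from pathlib import Path

def setup_logging(log_dir="data/logs"):
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    level = os.getenv("LOG_LEVEL", "INFO").upper()
    logging.basicConfig(
        level=getattr(logging, level, logging.INFO),
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        handlers=[
            logging.FileHandler(Path(log_dir) / "skunkscrape.log", encoding="utf-8"),
            logging.StreamHandler(),
        ],
    )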


Troubleshooting

  • ModuleNotFoundError: skunkscrape → run from the repo root, or install with pip install -e . first.
  • GUI KeyError: 'categories' → ensure plugins/manifest.json exists and is valid JSON.
  • Proxy timeouts → validate credentials, and test endpoints without a proxy first.
  • PyInstaller missing assets → add --add-data "assets;assets" to the build command.

Roadmap

  • Web dashboard (React/Next.js) + Python API.
  • Setuptools entry_points for plugin discovery.
  • CRM connectors (HubSpot, Salesforce, Zoho).
  • Enrichment (HLR/email validation).
  • Cloud scheduler (Cloud Run + Scheduler or GitHub Actions cron).

Contributing

Bug reports and pull requests are welcome on GitHub: https://github.com/SKUNKSCRAPE/skunkscrape


License

This project is licensed under the MIT License — see LICENSE.

Badge quick-setup (optional)

  • CI badge: ensure a workflow at .github/workflows/ci.yml.
  • Codecov: create a project on Codecov and replace the Coverage badge with the Codecov URL they provide.
  • PyPI/Docker badges: uncomment once you publish to PyPI or Docker Hub.
