# SkunkScrape

Modular ZA-focused lead discovery and scraping toolkit with a CLI and GUI, proxy rotation, discovery utilities, and an export pipeline.
- [Features](#features)
- [Architecture](#architecture)
- [Project Layout](#project-layout)
- [Requirements](#requirements)
- [Quick Start](#quick-start)
- [Configuration](#configuration)
- [Usage](#usage)
- [Exports](#exports)
- [Packaging & Deployment](#packaging--deployment)
- [Development](#development)
- [Logging](#logging)
- [Troubleshooting](#troubleshooting)
- [Roadmap](#roadmap)
- [Contributing](#contributing)
- [License](#license)
## Features

- Two entry points: Python CLI (`skunkscrape`) and a Tkinter GUI.
- Plugin architecture with a simple `manifest.json`.
- Proxy rotation supporting JSON and legacy text formats.
- ZA discovery tools (Certificate Transparency, Common Crawl, directories/jobs).
- Pipeline hooks for normalization, DNC/HLR enrichment (stubs), and exports.
- Scheduler for recurring jobs (`schedule`/`croniter`).
- Packageable as a single executable (PyInstaller) or container (Docker).
## Architecture

- Core: configuration, logging, shared utilities, exceptions.
- CLI: orchestrates plugins and batches.
- GUI: category → plugin selector plus proxy picker.
- Plugins: each scraper is self-contained and exposes `main()`.
- Discovery: host/source generation for ZA domains and socials.
- Pipeline: normalization, exporters, and scheduling.
## Project Layout

```text
SkunkScrape/
├── pyproject.toml
├── README.md
├── LICENSE
├── .gitignore
├── .env.example
├── requirements.txt
│
├── data/
│   ├── proxies/
│   │   ├── proxies.json
│   │   └── Webshare 10 proxies.txt
│   ├── seeds/
│   ├── logs/
│   ├── cache/
│   └── exports/
│
├── assets/
│   ├── banner.png
│   └── favicon.ico
│
├── skunkscrape/
│   ├── __init__.py
│   ├── core/
│   │   ├── config.py
│   │   ├── logging.py
│   │   ├── utils.py
│   │   └── exceptions.py
│   ├── cli/
│   │   └── main.py
│   ├── gui/
│   │   └── main_gui.py
│   ├── plugins/
│   │   ├── manifest.json
│   │   ├── gumtree_scraper.py
│   │   ├── autotrader_scraper.py
│   │   ├── property24_scraper.py
│   │   └── smart_contact_crawler.py
│   ├── discovery/
│   │   ├── discover_coza_sources.py
│   │   ├── source_generator.py
│   │   └── discovery_runner.py
│   └── pipeline/
│       ├── collector.py
│       ├── exporter.py
│       └── scheduler.py
│
├── tests/
│   ├── test_plugins.py
│   ├── test_utils.py
│   └── test_gui_launcher.py
│
├── scripts/
│   ├── build_exe.ps1
│   ├── run_all_scrapers.ps1
│   ├── fix_plugins.ps1
│   └── scan_project_tree.ps1
│
└── Dockerfile
```
## Requirements

- Python 3.10+ (3.11 recommended).
- Windows, macOS, or Linux.
- A virtual environment is recommended.
## Quick Start

```powershell
# Create & activate a venv (Windows PowerShell)
python -m venv .venv
.\.venv\Scripts\Activate.ps1

# Install dependencies
pip install -r requirements.txt

# Copy the environment template
copy .env.example .env
```

List available plugins and run one:

```bash
python -m skunkscrape.cli.main list
python -m skunkscrape.cli.main run --name gumtree_scraper
```

Launch the GUI:

```bash
python -m skunkscrape.gui.main_gui
```

## Configuration

### Environment variables

Read by `skunkscrape/core/config.py`:
```ini
PROXY_FILE=data/proxies/proxies.json
LOG_LEVEL=INFO
EXPORT_DIR=data/exports
WEBHOOK_URL=
CRM_HUBSPOT_KEY=
CRM_SALESFORCE_KEY=
```
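For reference, a minimal sketch of what `core/config.py` might look like, assuming `python-dotenv` loads `.env` (the actual module may differ):

```python
# Illustrative sketch of skunkscrape/core/config.py; names mirror .env.example.
import os
from pathlib import Path

from dotenv import load_dotenv  # python-dotenv

load_dotenv()  # read .env from the working directory, if present

PROXY_FILE = Path(os.getenv("PROXY_FILE", "data/proxies/proxies.json"))
LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")
EXPORT_DIR = Path(os.getenv("EXPORT_DIR", "data/exports"))
WEBHOOK_URL = os.getenv("WEBHOOK_URL", "")
CRM_HUBSPOT_KEY = os.getenv("CRM_HUBSPOT_KEY", "")
CRM_SALESFORCE_KEY = os.getenv("CRM_SALESFORCE_KEY", "")
```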
### Proxies

Preferred: `data/proxies/proxies.json`

```json
[
  { "host": "198.23.239.134", "port": 6540, "user": "userA", "pass": "secretA" },
  { "host": "45.38.107.97", "port": 6014, "user": "userB", "pass": "secretB" }
]
```

Legacy: `Webshare 10 proxies.txt` (format `ip:port:user:pass`).
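A rough sketch of a loader that accepts both formats and rotates proxies round-robin (illustrative only; the shipped rotation logic may differ):

```python
# Illustrative proxy loading/rotation; not the actual SkunkScrape module.
import itertools
import json
from pathlib import Path


def load_proxies(path: str) -> list[dict]:
    """Load proxies from JSON (preferred) or legacy ip:port:user:pass text."""
    p = Path(path)
    if p.suffix == ".json":
        return json.loads(p.read_text())
    proxies = []
    for line in p.read_text().splitlines():
        if not line.strip():
            continue
        host, port, user, password = line.strip().split(":")
        proxies.append({"host": host, "port": int(port), "user": user, "pass": password})
    return proxies


def proxy_url(proxy: dict) -> str:
    """Format a proxy entry as an HTTP proxy URL usable by e.g. requests."""
    return f"http://{proxy['user']}:{proxy['pass']}@{proxy['host']}:{proxy['port']}"


# Simple round-robin rotation over the pool
rotation = itertools.cycle(load_proxies("data/proxies/proxies.json"))
next_proxy = proxy_url(next(rotation))
```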
### Plugin manifest

`skunkscrape/plugins/manifest.json` groups plugins by category:

```json
{
  "categories": {
    "Directories": { "plugins": ["gumtree_scraper", "junkmail_scraper", "sayellow_scraper"] },
    "Jobs": { "plugins": ["pnet_scraper", "careerjunction_scraper", "careers24_scraper"] },
    "Property": { "plugins": ["property24_scraper", "privateproperty_scraper"] },
    "Autos": { "plugins": ["autotrader_scraper"] }
  }
}
```
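Illustratively, manifest entries can be resolved to plugin modules with `importlib`; the helper names below are hypothetical, not the CLI's actual internals:

```python
# Hypothetical manifest resolution, sketching how the CLI could work.
import importlib
import json
from pathlib import Path

MANIFEST = Path("skunkscrape/plugins/manifest.json")


def iter_plugins(category: str | None = None):
    """Yield plugin names from the manifest, optionally filtered by category."""
    manifest = json.loads(MANIFEST.read_text())
    for cat, spec in manifest["categories"].items():
        if category is None or cat == category:
            yield from spec["plugins"]


def run_plugin(name: str, **kwargs):
    """Import skunkscrape.plugins.<name> and invoke its main()."""
    module = importlib.import_module(f"skunkscrape.plugins.{name}")
    return module.main(**kwargs)
```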
## Usage

### CLI

```bash
# List plugins
python -m skunkscrape.cli.main list

# Run a single plugin
python -m skunkscrape.cli.main run --name pnet_scraper

# Run all plugins defined in the manifest
python -m skunkscrape.cli.main run --all
```

Plugin contract: each plugin module exports a `main(**kwargs)` function.
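A minimal plugin honoring that contract might look like this (the module name and kwargs are illustrative):

```python
# skunkscrape/plugins/example_scraper.py: a hypothetical minimal plugin.
"""Example plugin honoring the main(**kwargs) contract."""


def main(**kwargs):
    """Entry point called by the CLI/GUI.

    kwargs may carry runtime options such as a proxy dict or an output
    directory; the key names used here are assumptions for illustration.
    """
    proxy = kwargs.get("proxy")                      # e.g. {"host": ..., "port": ...}
    out_dir = kwargs.get("export_dir", "data/exports")
    # ... fetch pages, extract leads, hand records to the pipeline ...
    return {"plugin": "example_scraper", "records": 0, "out_dir": out_dir}
```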
### GUI

```bash
python -m skunkscrape.gui.main_gui
```

- Category → Plugin dropdowns are populated from `manifest.json`.
- The proxy dropdown is populated from `PROXY_FILE`.
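As a sketch, the dropdown wiring could look like this with `tkinter.ttk` (illustrative only; `main_gui.py` may be structured differently):

```python
# Illustrative Tkinter wiring for the category -> plugin dropdowns.
import json
import tkinter as tk
from tkinter import ttk

with open("skunkscrape/plugins/manifest.json", encoding="utf-8") as fh:
    manifest = json.load(fh)
categories = list(manifest["categories"])

root = tk.Tk()
root.title("SkunkScrape")

category_box = ttk.Combobox(root, values=categories, state="readonly")
plugin_box = ttk.Combobox(root, state="readonly")


def on_category(_event):
    # Repopulate the plugin dropdown when the category changes.
    plugin_box["values"] = manifest["categories"][category_box.get()]["plugins"]


category_box.bind("<<ComboboxSelected>>", on_category)
category_box.pack(padx=10, pady=5)
plugin_box.pack(padx=10, pady=5)
root.mainloop()
```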
### Discovery

```bash
python -m skunkscrape.discovery.discover_coza_sources
python -m skunkscrape.discovery.source_generator --out data/seeds/sources.txt --max 100000 --threads 32 --proxy-file "data/proxies/Webshare 10 proxies.txt"
```

### Pipeline

- `collector.py`: normalization and DNC/HLR enrichment (stubs).
- `exporter.py`: CSV/CRM/Webhook/Discord exporters (stubs).
- `scheduler.py`: wrappers around `schedule`/`croniter`; drive schedules via env vars or a small YAML/TOML file (see the sketch below).
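For example, a recurring job built on the `schedule` library could be wired up like this (a sketch; the actual `scheduler.py` wrappers may differ):

```python
# Illustrative recurring job using the `schedule` library.
import time

import schedule


def nightly_run():
    # In the real pipeline this would trigger the CLI's run --all path.
    print("running all scrapers...")


# Run every day at 02:00 local time
schedule.every().day.at("02:00").do(nightly_run)

while True:
    schedule.run_pending()
    time.sleep(60)
```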
## Exports

The default export directory is `data/exports`. CSV is supported now; CRM/Webhook connectors are ready for extension.
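A minimal CSV exporter along these lines (illustrative; `exporter.py`'s real interface may differ):

```python
# Illustrative CSV export helper; function name and signature are assumptions.
import csv
from pathlib import Path


def export_csv(records: list[dict], out_dir: str = "data/exports",
               filename: str = "leads.csv") -> Path:
    """Write a list of lead dicts to a CSV file and return its path."""
    out_path = Path(out_dir) / filename
    out_path.parent.mkdir(parents=True, exist_ok=True)
    if not records:
        return out_path
    with out_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(records[0]))
        writer.writeheader()
        writer.writerows(records)
    return out_path
```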
## Packaging & Deployment

### Windows executable (PyInstaller)

```powershell
.\scripts\build_exe.ps1
# Output: dist/SkunkScrape.exe
```

### Docker

```bash
docker build -t skunkscrape:latest .
docker run --rm -it -v "%cd%/data:/app/data" skunkscrape:latest
```

## Development

```bash
pip install -r requirements.txt
pip install -e .[dev]

# Run tests
pytest -q

# Lint & format
ruff check .
black .
isort .
```

## Logging

Logs are written to `data/logs/`. Configure the level via `.env` (`LOG_LEVEL=DEBUG|INFO|WARNING`).
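A sketch of the kind of setup `core/logging.py` might perform (names and format are illustrative):

```python
# Illustrative logging setup mirroring the role of skunkscrape/core/logging.py.
import logging
import os
from pathlib import Path


def setup_logging() -> logging.Logger:
    """Log to data/logs/ at the level given by LOG_LEVEL (default INFO)."""
    log_dir = Path("data/logs")
    log_dir.mkdir(parents=True, exist_ok=True)
    level = os.getenv("LOG_LEVEL", "INFO").upper()
    logging.basicConfig(
        level=getattr(logging, level, logging.INFO),
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        handlers=[
            logging.FileHandler(log_dir / "skunkscrape.log", encoding="utf-8"),
            logging.StreamHandler(),
        ],
    )
    return logging.getLogger("skunkscrape")
```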
## Troubleshooting

- `ModuleNotFoundError: skunkscrape` → run from the repo root or `pip install -e .`.
- GUI `KeyError: 'categories'` → ensure `plugins/manifest.json` exists and contains valid JSON.
- Proxy timeouts → validate credentials; test endpoints without a proxy first.
- PyInstaller missing assets → add `--add-data "assets;assets"`.
## Roadmap

- Web dashboard (React/Next.js) + Python API.
- Setuptools `entry_points` for plugin discovery.
- CRM connectors (HubSpot, Salesforce, Zoho).
- Enrichment (HLR/email validation).
- Cloud scheduler (Cloud Run + Scheduler, or GitHub Actions cron).
## Contributing

Bug reports and pull requests are welcome on GitHub: https://github.com/SKUNKSCRAPE/skunkscrape
## License

This project is licensed under the MIT License. See `LICENSE` for details.
## Badges

- CI badge: ensure a workflow exists at `.github/workflows/ci.yml`.
- Codecov: create a project on Codecov and replace the coverage badge with the Codecov URL they provide.
- PyPI/Docker badges: uncomment once you publish to PyPI or Docker Hub.