A full-featured job scraper application built with Python Flask, PostgreSQL, and Redis.
The Job Scraper application is designed to scrape job listings from various job boards, store them in a PostgreSQL database, and provide a web interface for browsing, searching, and exporting the collected data.
- Job Scraping: Automated scraping of job listings from multiple sources
- Web Interface: Flask-based dashboard to view, search, and manage job listings
- Data Export/Import: Export job data to CSV/JSON and import from external sources
- Monitoring: Integrated health checks and Prometheus metrics for observability
- Containerization: Docker setup for easy deployment and scaling
The application follows a modular architecture with the following components:
- Web Interface: Flask application with templates and static assets
- Scraper Module: Core scraping functionality with configurable job sources
- Database Layer: PostgreSQL storage with SQLAlchemy ORM
- Caching Layer: Redis for performance optimization
- Monitoring: Prometheus metrics and health check endpoints
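As a rough sketch of how the scraper module might fit together (the source URL, CSS selectors, and field names below are placeholders for illustration, not the project's actual configuration), a single job source could be scraped with requests and Beautiful Soup like this:

```python
# Illustrative only: a minimal scraper for one hypothetical job source.
# The URL and CSS selectors below are placeholders, not real configuration.
import requests
from bs4 import BeautifulSoup


def scrape_source(url: str) -> list[dict]:
    """Fetch a listings page and return one dict per job posting."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    jobs = []
    for card in soup.select("div.job-card"):  # placeholder selector
        jobs.append({
            "title": card.select_one("h2.title").get_text(strip=True),
            "company": card.select_one("span.company").get_text(strip=True),
            "location": card.select_one("span.location").get_text(strip=True),
            "url": card.select_one("a")["href"],
        })
    return jobs


if __name__ == "__main__":
    for job in scrape_source("https://example.com/jobs"):
        print(job["title"], "-", job["company"])
```

In the real application, results like these would be written to PostgreSQL via the database layer rather than printed.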
- Python 3.10+
- PostgreSQL 13+
- Redis 6+
- Docker and Docker Compose (for containerized deployment)
- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/job-scraper.git
  cd job-scraper
  ```

- Create and activate a virtual environment:

  ```bash
  python3 -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Set up environment variables:

  ```bash
  cp .env.example .env
  # Edit .env with your configuration values
  ```

- Initialize the database:

  ```bash
  # Ensure PostgreSQL is running
  python3 -m app.db.init_db
  ```

- Run the application:

  ```bash
  python3 main.py
  ```
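The `python3 -m app.db.init_db` step above is not documented further here; the following is a minimal sketch of what such a script typically does, assuming SQLAlchemy declarative models and a `DATABASE_URL` environment variable (both assumptions, not documented facts):

```python
# Hypothetical sketch of a database initialization script; not the
# project's actual app.db.init_db. Assumes SQLAlchemy models registered
# on a shared declarative Base and a DATABASE_URL environment variable.
import os

from sqlalchemy import create_engine
from sqlalchemy.orm import DeclarativeBase


class Base(DeclarativeBase):
    """In the real project, the ORM models would inherit from this base."""


def init_db() -> None:
    engine = create_engine(os.environ["DATABASE_URL"])
    # Importing the model modules registers their tables on Base.metadata;
    # create_all then creates any tables that are missing.
    Base.metadata.create_all(engine)


if __name__ == "__main__":
    init_db()
```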
For containerized deployment with Docker Compose:

- Build and start the containers:

  ```bash
  docker-compose up -d
  ```

- Access the application at http://localhost:5000
The web interface provides the following features:
- Dashboard: Overview of scraping status and job statistics
- Job Listings: Browse and search collected job listings
- Export: Export job data to various formats
- Import: Import job data from external sources
- Scraper Control: Start, stop, and monitor scraping jobs
The application exposes the following API endpoints:
- `GET /api/jobs`: Get a list of jobs with optional filtering
- `GET /api/jobs/{id}`: Get details for a specific job
- `POST /api/start-scrape`: Start a new scraping job
- `POST /api/stop-scrape`: Stop the current scraping job
- `GET /api/status`: Get the current status of the scraper
- `GET /api/export`: Export job data to CSV/JSON
- `POST /api/import`: Import job data from an external source
- `GET /health`: Health check endpoint
- `GET /metrics`: Prometheus metrics endpoint
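As an illustration of calling these endpoints from a script (the query parameters and response handling shown are assumptions for demonstration, not documented contracts):

```python
# Illustrative client calls against a locally running instance.
import requests

BASE_URL = "http://localhost:5000"

# List jobs, optionally filtered (the "search" parameter is hypothetical).
jobs = requests.get(f"{BASE_URL}/api/jobs", params={"search": "python"}, timeout=10)
print(jobs.status_code, jobs.json())

# Kick off a scraping run, then poll its status.
requests.post(f"{BASE_URL}/api/start-scrape", timeout=10)
status = requests.get(f"{BASE_URL}/api/status", timeout=10)
print(status.json())

# Export collected data as CSV (the "format" parameter is hypothetical).
export = requests.get(f"{BASE_URL}/api/export", params={"format": "csv"}, timeout=30)
with open("jobs_export.csv", "wb") as f:
    f.write(export.content)
```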
Configuration is managed through YAML files in the config/ directory:
- `app_config.yaml`: General application settings
- `api_config.yaml`: API-specific settings
- `logging_config.yaml`: Logging configuration
Environment-specific settings can be overridden using environment variables defined in the .env file.
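A minimal sketch of this layering, loading a YAML file and letting environment variables take precedence (the specific keys shown are placeholders, not the project's actual settings):

```python
# Sketch of layered configuration: YAML defaults overridden by environment
# variables. The file path matches the config/ directory above; the keys
# are placeholders rather than documented settings.
import os

import yaml  # PyYAML


def load_config(path: str = "config/app_config.yaml") -> dict:
    with open(path) as f:
        config = yaml.safe_load(f) or {}

    # Values from the environment (e.g. loaded from .env) win over YAML.
    if "DATABASE_URL" in os.environ:          # hypothetical key
        config["database_url"] = os.environ["DATABASE_URL"]
    if "SCRAPE_INTERVAL" in os.environ:       # hypothetical key
        config["scrape_interval"] = int(os.environ["SCRAPE_INTERVAL"])
    return config
```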
The application includes built-in monitoring capabilities:
- Health Checks: `/health` endpoint for application health status
- Prometheus Metrics: `/metrics` endpoint for Prometheus metrics
- Structured Logging: Detailed logs using the Python logging module
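A minimal sketch of how the `/health` and `/metrics` endpoints might be wired up with Flask and `prometheus_client` (the metric name is illustrative, not one of the application's real metrics):

```python
# Sketch of health and metrics endpoints; not the project's actual routes.
from flask import Flask, Response, jsonify
from prometheus_client import CONTENT_TYPE_LATEST, Counter, generate_latest

app = Flask(__name__)

# Example counter; incremented wherever a job listing is stored.
JOBS_SCRAPED = Counter("jobs_scraped_total", "Total job listings scraped")


@app.route("/health")
def health():
    # A real check would also verify PostgreSQL and Redis connectivity.
    return jsonify({"status": "ok"})


@app.route("/metrics")
def metrics():
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)


if __name__ == "__main__":
    app.run(port=5000)
```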
The core database tables include:
- `jobs`: Stores job listings with details like title, company, location, etc.
- `companies`: Information about companies posting jobs
- `scrape_runs`: Records of scraping jobs with timestamps and statistics
- `users`: User accounts for the web interface (if authentication is enabled)
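For illustration, the first three tables could be modeled with SQLAlchemy roughly as follows; column names beyond those listed above are assumptions, and the optional `users` table is omitted:

```python
# Hypothetical SQLAlchemy models matching the tables described above.
from datetime import datetime
from typing import Optional

from sqlalchemy import DateTime, ForeignKey, Integer, String
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class Base(DeclarativeBase):
    pass


class Company(Base):
    __tablename__ = "companies"
    id: Mapped[int] = mapped_column(Integer, primary_key=True)
    name: Mapped[str] = mapped_column(String(255), unique=True)


class Job(Base):
    __tablename__ = "jobs"
    id: Mapped[int] = mapped_column(Integer, primary_key=True)
    title: Mapped[str] = mapped_column(String(255))
    location: Mapped[Optional[str]] = mapped_column(String(255))
    company_id: Mapped[int] = mapped_column(ForeignKey("companies.id"))
    scraped_at: Mapped[datetime] = mapped_column(DateTime, default=datetime.utcnow)


class ScrapeRun(Base):
    __tablename__ = "scrape_runs"
    id: Mapped[int] = mapped_column(Integer, primary_key=True)
    started_at: Mapped[datetime] = mapped_column(DateTime)
    finished_at: Mapped[Optional[datetime]] = mapped_column(DateTime)
    jobs_found: Mapped[int] = mapped_column(Integer, default=0)
```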
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Beautiful Soup for HTML parsing
- Flask for the web framework
- SQLAlchemy for database ORM
- Prometheus for metrics collection
- Bootstrap for the web interface