JobMiner is a powerful Python-based web scraping toolkit for extracting and organizing job listings from multiple websites into structured data. Built with modularity and extensibility in mind, it provides a robust foundation for job market analysis and automated job searching.
- Modular Architecture: Easy-to-extend scraper system with base classes
- Multiple Output Formats: Export to JSON, CSV, or both
- Database Integration: Optional SQLite/PostgreSQL storage with search capabilities
- CLI Interface: Command-line tool for easy scraping operations
- Configuration Management: Flexible configuration system with environment variables
- Rate Limiting: Built-in delays and respectful scraping practices
- Error Handling: Comprehensive logging and error recovery
- Template Generation: Quick scraper template creation for new job sites
# Clone the repository
git clone https://github.com/beingvirus/JobMiner.git
cd JobMiner
# Install dependencies
pip install -r requirements.txt
# Optional: Install as package
pip install -e .
# List available scrapers
python jobminer_cli.py list-scrapers
# Run demo scraper
python jobminer_cli.py scrape demo-company "python developer" --location "san francisco" --pages 2
# Analyze scraped data
python jobminer_cli.py analyze jobs.json
from scrapers.demo_company.demo_company import DemoCompanyScraper
# Initialize scraper
scraper = DemoCompanyScraper()
# Scrape jobs
jobs = scraper.scrape_jobs(
    search_term="python developer",
    location="san francisco",
    max_pages=2
)
# Save results
scraper.save_to_json(jobs, "jobs.json")
scraper.save_to_csv(jobs, "jobs.csv")
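Because pandas is already a dependency, the saved CSV lends itself to quick exploration. The snippet below is a minimal sketch; the available columns depend on the JobListing fields, so the "company" column is an assumption and guarded accordingly.
import pandas as pd

# Load the CSV written above and take a quick look
df = pd.read_csv("jobs.csv")
print(f"Scraped {len(df)} listings")
print(df.head())

# "company" is an assumed column name; adjust to the actual JobListing fields
if "company" in df.columns:
    print(df["company"].value_counts().head(10))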
JobMiner/
├── base_scraper.py # Base scraper class with common functionality
├── jobminer_cli.py # Command-line interface
├── config.py # Configuration management
├── database.py # Database integration (optional)
├── requirements.txt # Project dependencies
├── setup.py # Package setup
├── .env.example # Environment variables template
├── scrapers/ # Individual scraper implementations
│   └── demo_company/
│       ├── demo_company.py # Demo scraper implementation
│       ├── demo_company_readme.md
│       └── requirements.txt
└── output/ # Default output directory
# Generate a new scraper template
python jobminer_cli.py init
# Follow the prompts to create your scraper
- Create a new directory in scrapers/
- Subclass BaseScraper and implement its required methods:
from base_scraper import BaseScraper, JobListing

class YourScraper(BaseScraper):
    def get_job_urls(self, search_term, location="", max_pages=1):
        # Implement job URL extraction
        pass

    def parse_job(self, job_url):
        # Implement job detail parsing
        return JobListing(...)
- Test your scraper:
python your_scraper.py
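As a rough illustration, a concrete subclass might use requests and BeautifulSoup (both core dependencies) to collect listing URLs and parse each posting. Everything site-specific below (the base URL, the CSS selectors, and the JobListing field names) is hypothetical; check base_scraper.py and the demo scraper for the real JobListing signature and any request or delay helpers the base class provides.
import requests
from bs4 import BeautifulSoup

from base_scraper import BaseScraper, JobListing


class ExampleBoardScraper(BaseScraper):
    """Sketch of a scraper for a hypothetical job board."""

    BASE_URL = "https://jobs.example.com"  # placeholder site

    def get_job_urls(self, search_term, location="", max_pages=1):
        urls = []
        for page in range(1, max_pages + 1):
            response = requests.get(
                f"{self.BASE_URL}/search",
                params={"q": search_term, "l": location, "page": page},
                timeout=30,
            )
            response.raise_for_status()
            soup = BeautifulSoup(response.text, "html.parser")
            # Selector is illustrative; adapt it to the target site's markup
            for link in soup.select("a.job-link"):
                urls.append(self.BASE_URL + link["href"])
        return urls

    def parse_job(self, job_url):
        response = requests.get(job_url, timeout=30)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        # Field names assume a JobListing similar to the demo scraper's
        return JobListing(
            title=soup.select_one("h1.job-title").get_text(strip=True),
            company=soup.select_one(".company-name").get_text(strip=True),
            location=soup.select_one(".job-location").get_text(strip=True),
            description=soup.select_one(".job-description").get_text(strip=True),
            url=job_url,
        )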
Copy .env.example to .env and customize:
# Database
JOBMINER_DATABASE_URL=sqlite:///jobminer.db
# Logging
JOBMINER_LOG_LEVEL=INFO
# Scraper settings
JOBMINER_DEFAULT_DELAY=2.0
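These are ordinary environment variables, so they can be inspected the usual way. The snippet below is illustration only; config.py performs the actual loading, and the fallback values shown are assumptions.
import os

# Illustration only; config.py handles the real loading
database_url = os.environ.get("JOBMINER_DATABASE_URL", "sqlite:///jobminer.db")
log_level = os.environ.get("JOBMINER_LOG_LEVEL", "INFO")
default_delay = float(os.environ.get("JOBMINER_DEFAULT_DELAY", "2.0"))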
JobMiner automatically creates jobminer_config.json with default settings:
{
  "default_output_format": "both",
  "output_directory": "output",
  "default_scraper_config": {
    "delay": 2.0,
    "timeout": 30,
    "max_retries": 3
  }
}
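The file is plain JSON, so it can also be inspected or adjusted directly from Python. A minimal sketch, assuming the keys shown above:
import json
from pathlib import Path

# Bump the per-request delay and write the config back
config_path = Path("jobminer_config.json")
config = json.loads(config_path.read_text())
config["default_scraper_config"]["delay"] = 5.0
config_path.write_text(json.dumps(config, indent=2))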
Enable database storage for persistent job data:
from config import get_config
from database import get_db_manager
# Enable database in config
config = get_config()
config.database.enabled = True
# Save jobs to database
db_manager = get_db_manager()
db_manager.save_jobs(jobs, scraper_name="demo-company")
# Search jobs
results = db_manager.search_jobs("python developer")
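To use PostgreSQL instead of the default SQLite file, point JOBMINER_DATABASE_URL at your server in .env. The connection string below is only an example; it follows the usual SQLAlchemy URL format, and the credentials and host are placeholders.
# Database (PostgreSQL example)
JOBMINER_DATABASE_URL=postgresql://user:password@localhost:5432/jobminer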
# List available scrapers
jobminer list-scrapers
# Scrape jobs
jobminer scrape SCRAPER_NAME "SEARCH_TERM" [OPTIONS]
# Analyze results
jobminer analyze FILE_PATH
# Generate new scraper template
jobminer init
- --location, -l: Search location
- --pages, -p: Number of pages to scrape
- --output, -o: Output filename
- --format, -f: Output format (json/csv/both)
- --delay, -d: Delay between requests
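For example, a run that combines several of these options might look like:
jobminer scrape demo-company "python developer" --location "remote" --pages 3 --format csv --output jobs.csv --delay 1.5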
We welcome contributions! This project is Hacktoberfest-friendly 🎃
- Add new scrapers for popular job sites
- Improve existing scrapers with better parsing
- Add features like advanced filtering or export options
- Fix bugs and improve error handling
- Improve documentation and examples
- Fork the repository
- Create a feature branch:
git checkout -b feature/your-feature
- Make your changes and test thoroughly
- Submit a pull request with a clear description
See CONTRIBUTING.md for detailed guidelines.
Currently implemented scrapers:
- Demo Company - Template/example scraper for testing

Scrapers we would love to see contributed:
- LinkedIn Jobs
- Indeed
- Glassdoor
- AngelList
- Stack Overflow Jobs
- Remote.co
Want to add a scraper for your favorite job site? Check out our contribution guide!
- Python 3.8+
- requests
- beautifulsoup4
- pandas
- click
- sqlalchemy (optional, for database features)
- selenium (optional, for JavaScript-heavy sites)
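If you plan to use the database or Selenium-backed features, install the optional packages as well (assuming they are not already pinned in requirements.txt):
pip install sqlalchemy selenium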
This project is licensed under the MIT License - see the LICENSE file for details.
- Built for the open-source community
- Hacktoberfest 2024 participant
- Inspired by the need for better job market analysis tools
- 🐛 Bug Reports: GitHub Issues
- 💡 Feature Requests: GitHub Discussions
- 📖 Documentation: Project Wiki
Happy Job Mining! 🎯