This is an automated job scraping system designed to collect Operations Research and Optimization job postings from LinkedIn and Indeed using intelligent keyword filtering. The system extracts technical skills using advanced 3-layer pattern matching and stores comprehensive job data for analysis.
Primary Focus: Operations Research, Mathematical Optimization, Supply Chain, Routing/Scheduling, Simulation, and related optimization roles.
- Procedure (run → store → export): Run `run_netherlands_indeed_linkedin.py` to scrape, persist to `data/jobs.db`, then export with `export_to_csv.py` for downstream analytics.
- Precision/Recall via Tiered Filters: Tier 1 (titles) maximizes recall for optimization-themed roles; Tier 2 (descriptions) sharpens precision with solver/OR terms; negatives remove obvious non-OR noise.
- Data Completeness: Indeed yields richer company metadata (including employee counts); LinkedIn exposes lighter company fields via JobSpy (company name/URL mostly).
- Keyword Tuning Loop: Keep only search terms with healthy acceptance (Final/Found) and high Tier2 hits; prune noisy terms after each run based on the filter stats table.
- SEO-friendly insights: Focuses on “operations research jobs”, “optimization jobs”, “supply chain optimization”, “MILP / integer programming”, “Gurobi / OR-Tools”, and “routing/scheduling optimization”.
- Live Dashboard: Processed/NLP-enriched results are visualized at https://joblab.oploy.eu/ .
- System Architecture
- Key Features
- Input Requirements
- Processing Pipeline
- Output & Results
- Usage Procedures
- Database Schema
- Skill Extraction System
- Configuration
The system uses the JobSpy library for direct scraping with intelligent filtering:
- Uses the `python-jobspy` library to scrape LinkedIn and Indeed
- Searches with 8 optimization-related keywords per country
- Applies 3-tier keyword filtering to ensure job relevance
- Extracts skills using 3-layer pattern matching (977 skills)
- Stores complete job data in SQLite database
- Deduplication: Tracks URLs to prevent duplicates
├── Web Scraping: python-jobspy (LinkedIn & Indeed API wrapper)
├── Data Models: Pydantic v2 (validation)
├── Database: SQLite (jobs.db)
├── Skill Extraction: 3-layer regex pattern matching (977 skills)
├── Filtering: 3-tier keyword matching (broad → technical → negative)
├── Export: CSV format
└── Multi-Country: 10 countries with batch mode support
- 10 Countries: USA, Canada, UK, Netherlands, Germany, Denmark, France, Austria, Australia, India
- Market-Based Targeting: Larger markets get more jobs (USA/India: 500, UK: 375, etc.)
- Batch Mode: Sequential scraping with configurable delays (30-60 minutes)
- Multiplier Setting: Scale all country targets with single parameter (default: 2.5x)
The system ensures only Operations Research / Optimization jobs are collected:
"optim", "operations research", "supply chain", "logistics",
"routing", "scheduling", "decision science", "algorithm",
"data scientist", "machine learning", "analytics", "solver", "mathematical""operations research", "linear programming", "integer programming",
"mixed integer", "milp", "mip", "gurobi", "cplex", "or-tools", "ortools",
"constraint programming", "combinatorial optimization",
"mathematical optimization", "network optimization",
"vehicle routing", "routing optimization", "scheduling optimization",
"supply chain optimization", "supply chain", "inventory optimization",
"inventory management", "demand planning", "forecasting",
"heuristic", "metaheuristic", "convex optimization",
"stochastic optimization", "discrete optimization",
"simulation", "prescriptive analytics",
"pulp", "pyomo",
"Industrial Engineering", "Fulfillment Optimization""seo", "search engine", "sales optimization", "marketing optimization",
"conversion optimization", "website optimization", "social media"Filtering Logic: Accept job if (Title matches Tier 1 OR Description matches Tier 2) AND NOT (Title contains Tier 3)
"operation research", # lowercased for stricter matching
"Mathematical Optimization",
"MILP",
"Integer Programming",
"Gurobi",
"Routing Optimization",
"Supply Chain Optimization",
"Simulation Optimization",3-Layer Extraction System:
- Layer 1: Multi-word phrases (e.g., "Machine Learning", "Linear Programming")
- Layer 2: Context-aware extraction with validation
- Layer 3: Direct pattern matching from 977-skill reference database
- Performance: 80-85% accuracy at 0.3s/job (10x faster than spaCy)
- Output: Comma-separated skills stored with each job
- LinkedIn Protection: 10-second delay between queries
- Error Handling: Stops after 3 consecutive errors (rate limit detection)
- Batch Mode: 30-second delay between countries
- Recommended: Use batch mode with 30-60 minute delays for multi-country scraping
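A minimal sketch of the stop-after-consecutive-errors behavior described above. `run_queries` and `fetch` are illustrative names, not the script's actual identifiers; `fetch` stands in for the JobSpy call.

```python
import time

LINKEDIN_SLEEP_SEC = 10.0  # delay between queries (per the settings above)
LINKEDIN_MAX_ERRORS = 3    # stop after N consecutive errors

def run_queries(search_terms, fetch, sleep_sec=LINKEDIN_SLEEP_SEC):
    """Run queries sequentially; abort after 3 consecutive errors,
    treating them as a rate-limit signal."""
    consecutive_errors = 0
    results = []
    for term in search_terms:
        try:
            results.extend(fetch(term))
            consecutive_errors = 0       # any success resets the counter
        except Exception:
            consecutive_errors += 1
            if consecutive_errors >= LINKEDIN_MAX_ERRORS:
                break                    # likely rate-limited: stop this platform
        time.sleep(sleep_sec)            # LinkedIn protection delay
    return results
```

Resetting the counter on success distinguishes transient hiccups from sustained blocking.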
Captures 17 fields per job:
- Basic: Title, description, company, location, URL
- Enhanced: Remote status, job level, job function, industry
- Temporal: Posted date (last 4 weeks), scraped timestamp
- Analytics: Skills (extracted), search term used, country
- URLs: Platform URL (LinkedIn/Indeed) + Direct company URL
- Company size: `company_num_employees` (robust on Indeed; rarely provided by LinkedIn via JobSpy)
- Goal: Maximize recall of relevant optimization jobs while keeping precision high enough to avoid “SEO/marketing optimization” noise.
- Tier 1 (Recall engine): Title keywords (e.g., "optim", "supply chain", "routing", "algorithm") pull in borderline-but-possibly-relevant roles. Expect higher false positives, but few misses.
- Tier 2 (Precision engine): Description keywords (e.g., "MILP", "integer programming", "Gurobi", "OR-Tools", "simulation", "supply chain optimization") confirm true OR/optimization relevance.
- Negative filter: Quickly discards obvious non-OR roles (SEO/marketing/sales optimization), boosting precision without hurting recall.
- Acceptance rule: Accept if (Tier1 OR Tier2) AND NOT negative → balanced recall/precision without heavy ML.
- Stats table: Each run prints per-search-term stats (Found, Neg, T1, T2, Both, NoMatch, Final, Rate). Use this to:
- Drop low-performing search terms (low Final/Found or many NoMatch).
- Strengthen Tier2 by adding solver/tech terms seen in good results.
- Adjust Tier1 when T1-only is high (potential false positives).
- Iterate from data: After each run, remove search terms with <50% acceptance; add new terms from accepted descriptions (solvers, methods, domains).
- Balance tiers: If too many T1-only hits, add more Tier2 technical terms; if recall is low, broaden Tier1 slightly.
- Domain variants: Add industry-specific phrases (e.g., "timetabling", "portfolio optimization", "workforce scheduling") when targeting new niches.
- Language/locale: When expanding countries, include local-language equivalents in Tier1 and Tier2.
- New domains: Replace Tier1/Tier2 keyword lists with your domain’s title cues and technical signals; keep negative list to protect precision.
- Skills extraction: Update `skills_reference_2025.json` with domain skills and regex patterns.
- Markets: Edit `COUNTRIES`, `COUNTRY_JOB_TARGETS`, and `SEARCH_TERMS` in `run_netherlands_indeed_linkedin.py`.
- Outputs: Add new DB columns via `src/db/schema.py` and Pydantic models in `src/models/models.py` if you need extra metadata.
Location: code/src/config/skills_reference_2025.json
Purpose: Used by 3-layer skill extraction system to identify technical skills in job descriptions.
```json
{
  "skills": {
    "Python": {
      "regex_pattern": "\\bpython\\b",
      "category": "Programming Language",
      "priority": 1
    },
    "Linear Programming": {
      "regex_pattern": "linear\\s+programming|\\bLP\\b",
      "category": "Optimization",
      "priority": 2
    },
    "Gurobi": {
      "regex_pattern": "\\bgurobi\\b",
      "category": "Optimization Solver",
      "priority": 1
    }
    // ... 977 total skills
  }
}
```

This is the ONLY configuration file; all other settings are hardcoded in the main script.
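A sketch of how this reference could be loaded and applied, assuming the JSON structure shown above; the function names and the exact loading logic are illustrative, not the extractor's actual code.

```python
import json
import re

def load_skill_patterns(path="code/src/config/skills_reference_2025.json"):
    """Load the skills reference and compile each entry's regex pattern."""
    with open(path, encoding="utf-8") as f:
        reference = json.load(f)
    return {
        skill: re.compile(meta["regex_pattern"], re.IGNORECASE)
        for skill, meta in reference["skills"].items()
    }

def extract_skills(description, patterns):
    """Return the sorted list of skills whose pattern matches the text."""
    return sorted(s for s, rx in patterns.items() if rx.search(description))
```

Compiling patterns once up front is what makes the ~0.3 s/job throughput plausible for 977 patterns.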
These are hardcoded in run_netherlands_indeed_linkedin.py:
```python
COUNTRIES = {
    "USA": {"name": "USA", "indeed_country": "USA", "flag": "🇺🇸"},
    "Netherlands": {"name": "Netherlands", "indeed_country": "Netherlands", "flag": "🇳🇱"},
    # ... 10 total countries
}

SEARCH_TERMS = [
    "operation research",  # lowercased for stricter matching
    "Mathematical Optimization",
    "MILP",
    "Integer Programming",
    "Gurobi",
    "Routing Optimization",
    "Supply Chain Optimization",
    "Simulation Optimization",
]

COUNTRY_JOB_TARGETS = {
    "USA": 200,          # × 2.5 = 500 jobs
    "India": 200,        # × 2.5 = 500 jobs
    "UK": 150,           # × 2.5 = 375 jobs
    "Germany": 100,      # × 2.5 = 250 jobs
    "Netherlands": 50,   # × 2.5 = 125 jobs
    # ... scaled by COUNTRY_JOB_MULTIPLIER = 2.5
}
```

- FILTER_KEYWORDS_TITLE: 13 broad keywords for title matching
- FILTER_KEYWORDS_STRONG: 24 technical keywords for description matching
- FILTER_KEYWORDS_NEGATIVE: 9 keywords to reject non-OR jobs

Users configure via command-line arguments:

- `--jobs`: Fallback job count (default: 50, overridden by country targets)
- `--countries`: Comma-separated country list (default: all 10)
- `--batch`: Enable batch mode with delays
- `--delay`: Minutes between countries in batch mode (default: 60)
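The CLI above can be sketched with `argparse`. Flag names and defaults come from this document; the wrapper function is an illustrative assumption, not the script's actual parser.

```python
import argparse

def build_parser():
    """Build a parser mirroring the documented command-line interface."""
    p = argparse.ArgumentParser(description="OR/optimization job scraper")
    p.add_argument("--jobs", type=int, default=50,
                   help="fallback job count (overridden by country targets)")
    p.add_argument("--countries", type=str, default=None,
                   help="comma-separated country list (default: all 10)")
    p.add_argument("--batch", action="store_true",
                   help="enable batch mode with delays between countries")
    p.add_argument("--delay", type=int, default=60,
                   help="minutes between countries in batch mode")
    return p
```

For example, `--batch --countries "USA,UK" --delay 30` yields `args.batch == True` and `args.countries.split(",") == ["USA", "UK"]`.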
┌─────────────────────────────────────────────────────────────────┐
│ 1. INITIALIZATION │
│ - Load skills reference (977 skills) │
│ - Initialize database (jobs.db) │
│ - Setup JobSpy scraper │
│ - Load country targets and search terms │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ 2. MULTI-PLATFORM SCRAPING (Indeed + LinkedIn) │
│ For each country: │
│ │
│ ┌─ INDEED PLATFORM ─────────────────────────────────────┐ │
│ │ For each of 8 search terms: │ │
│ │ • Search Indeed with JobSpy API │ │
│ │ • Get results (based on country target) │ │
│ │ • Apply 3-tier keyword filter │ │
│ │ • Track URLs (prevent duplicates) │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
│ ┌─ LINKEDIN PLATFORM ──────────────────────────────────┐ │
│ │ Sequential query approach: │ │
│ │ • Run all 8 search terms with 10s delays │ │
│ │ • Fetch full descriptions (slower but complete) │ │
│ │ • Apply 3-tier keyword filter │ │
│ │ • Detect & stop on rate limiting │ │
│ │ • Stop after 3 consecutive errors │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
│ - Deduplication: URL tracking prevents duplicate jobs │
│ - Rate limiting: 10s between LinkedIn queries │
│ - Delay: 30s between countries (prevent overwhelming) │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ 3. KEYWORD FILTERING (3-Tier Quality Control) │
│ For each scraped job: │
│ │
│ ✅ TIER 1: Title Check │
│ - Match any of 13 broad keywords │
│ - Examples: "optim", "operations research", "routing" │
│ │
│ ✅ TIER 2: Description Check │
│ - Match any of 24 technical keywords │
│ - Examples: "linear programming", "gurobi", "milp" │
│ │
│ ❌ TIER 3: Negative Filter │
│ - Reject if title contains 9 negative keywords │
│ - Examples: "seo", "sales optimization", "marketing" │
│ │
│ Decision: Accept if (Tier 1 OR Tier 2) AND NOT Tier 3 │
│ Result: Only OR/Optimization-relevant jobs │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ 4. DATA PROCESSING & ENRICHMENT │
│ For each filtered job: │
│ • Extract metadata from JobSpy results │
│ • Parse location data (city, state, country) │
│ • Parse posted date (last 4 weeks filter) │
│ • Extract remote status, job level, function, industry │
│ • Get both platform URL and company career URL │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ 5. SKILL EXTRACTION (3-Layer Pattern Matching) │
│ For each job description: │
│ │
│ Layer 1: Multi-word phrases │
│ • Match complex terms: "Machine Learning", "OR-Tools" │
│ • Priority: Highest (processed first) │
│ │
│ Layer 2: Context-aware extraction │
│ • Validate with context: "Python" + "programming" │
│ • Reduces false positives │
│ │
│ Layer 3: Direct pattern matching │
│ • Apply 977 regex patterns from skills_reference.json │
│ • Comprehensive coverage across all categories │
│ │
│ Post-processing: │
│ • Filter common words (and, the, with, using) │
│ • Split conjunctions ("Python and SQL" → 2 skills) │
│ • Deduplicate (normalize similar skills) │
│ • Validate against reference (remove false positives) │
│ • Output: Comma-separated skill list │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ 6. DATABASE STORAGE │
│ - Store complete job details to jobs table │
│ - 17 fields: IDs, URLs, content, metadata, skills │
│ - Atomic operations (thread-safe) │
│ - Auto-deduplication by URL │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ 7. BATCH MODE (Multi-Country Sequential) │
│ If batch mode enabled: │
│ • Scrape first country completely │
│ • Wait configured delay (30-60 minutes) │
│ • Scrape next country │
│ • Repeat for all countries │
│ • Total time: 6-10 hours (run overnight) │
│ │
│ Benefits: │
│ • Avoids LinkedIn rate limiting │
│ • Gets 30-50 jobs per country (vs 10 without delays) │
│ • Resumable if interrupted │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ 8. EXPORT & SUMMARY │
│ - Generate console statistics: │
│ • Jobs by country and platform │
│ • Total jobs, unique URLs │
│ • Filtering stats (original vs filtered) │
│ │
│ - CSV Export (manual): │
│ • Run: python export_to_csv.py │
│ • Output: jobs_export_YYYYMMDD_HHMMSS.csv │
│ • Includes all 17 fields + summaries │
└─────────────────────────────────────────────────────────────────┘
Two main tables:

`job_urls` table:

| Field | Type | Description |
|---|---|---|
| job_id | TEXT PRIMARY KEY | MD5 hash of platform + URL |
| platform | TEXT | "LinkedIn" or "Indeed" |
| input_role | TEXT | Normalized search term |
| actual_role | TEXT | Scraped job title |
| url | TEXT UNIQUE | Job posting URL |
| scraped | INTEGER | 0 = pending, 1 = completed |
`jobs` table:

| Field | Type | Description |
|---|---|---|
| job_id | TEXT PRIMARY KEY | Links to job_urls |
| platform | TEXT | Source platform |
| input_role | TEXT | Normalized search term |
| actual_role | TEXT | Job title |
| url | TEXT UNIQUE | Job URL |
| job_description | TEXT | Full description |
| skills | TEXT | Comma-separated skills |
| company_name | TEXT | Company name |
| country | TEXT | Job country |
| location | TEXT | City/region |
| search_term | TEXT | Original search query |
| posted_date | TEXT | ISO format date |
| scraped_at | TEXT | Timestamp |
| is_remote | INTEGER | 1=remote, 0=onsite, NULL=unknown |
| job_level | TEXT | Seniority level |
| job_function | TEXT | Job category |
| company_industry | TEXT | Industry sector |
| company_url | TEXT | Direct application URL |
File: data/jobs_export_YYYYMMDD_HHMMSS.csv
Contains all job data with comprehensive summaries:
- Jobs by country
- Jobs by platform (LinkedIn/Indeed)
- Jobs by search term
- Unique companies count
- Remote vs on-site breakdown
- Job level distribution
Real-time progress tracking:
```
📊 Exporting jobs table...
✅ Exported 450 jobs to: data/jobs_export_20260102_013358.csv

📈 Summary by Country:
- USA: 150 jobs
- UK: 85 jobs
- Germany: 75 jobs
...

📈 Summary by Platform:
- LinkedIn: 320 jobs
- Indeed: 130 jobs

🏢 Total unique companies: 287
```
```shell
# Scrape one country for testing
python code/run_netherlands_indeed_linkedin.py --jobs 50
```

What happens:
- Scrapes 50 jobs from Netherlands (default)
- Stores URLs then details
- Extracts skills automatically
- Saves to jobs.db
Time: ~5 minutes
```shell
# 1. Edit configuration in quick_batch.py
#    Set: SCRAPE_ALL_COUNTRIES = True
#    Set: DELAY_BETWEEN_COUNTRIES = 30  (minutes)

# 2. Run batch scraper
python code/quick_batch.py
```

Alternative: automatically scrape all 10 countries with 60-min delays:

```shell
python code/run_batch_scraper.py
```

Or run the main scraper directly:

```shell
# Batch mode with all countries, 45-min delays
python code/run_netherlands_indeed_linkedin.py --batch --delay 45

# Batch mode with specific countries
python code/run_netherlands_indeed_linkedin.py --batch --countries "USA,UK,Germany" --delay 30
```

What happens:
- Scrapes each country sequentially
- Waits configured delay between countries
- Safe from LinkedIn rate limiting
- Resumable if interrupted
Time: 6-10 hours (run overnight)
```shell
# Export all data to CSV
python code/export_to_csv.py
```

Output: `data/jobs_export_YYYYMMDD_HHMMSS.csv`
Includes:
- All 17 job fields
- Summary statistics
- Breakdowns by country, platform, role
```shell
# View scraping progress
python code/check_current_urls.py
```

Shows:
- Total URLs collected
- URLs scraped (have full details)
- URLs pending (need detail extraction)
- Progress by platform and role
```shell
# Clean duplicate entries
python code/remove_duplicates.py
```

Removes:
- Duplicate job URLs
- Duplicate job descriptions (same job, different URLs)
```shell
# Check if URLs exist in database
python code/check_url_in_db.py

# Verify CSV URLs against database
python code/verify_csv_urls.py

# Check platform-specific URLs
python code/verify_platform_urls.py
```

The system uses a single table for job storage (simplified from two-phase):
┌─────────────────────────────────────────────────┐
│ jobs │
├─────────────────────────────────────────────────┤
│ job_id (PK) TEXT # Unique identifier │
│ platform TEXT # "linkedin"/"indeed"│
│ actual_role TEXT # Job title │
│ url (UNIQUE) TEXT # Platform job URL │
│ job_description TEXT # Full description │
│ skills TEXT # Comma-separated │
│ company_name TEXT # Company name │
│ country TEXT # Country code │
│ location TEXT # City, State │
│ search_term TEXT # Search query used │
│ posted_date TEXT # ISO date │
│ scraped_at TEXT # Scrape timestamp │
│ is_remote INTEGER # 1=remote, 0=onsite│
│ job_level TEXT # Seniority level │
│ job_function TEXT # Job category │
│ company_industry TEXT # Industry sector │
│ company_url TEXT # Direct career URL │
└─────────────────────────────────────────────────┘
| Field | Type | Description | Example |
|---|---|---|---|
| job_id | TEXT | Generated: `platform_country_id_hash` | `linkedin_USA_12345_67890` |
| platform | TEXT | Source platform | `linkedin` or `indeed` |
| actual_role | TEXT | Job title from posting | Operations Research Analyst |
| url | TEXT | Platform job page URL (unique) | https://linkedin.com/jobs/view/123... |
| job_description | TEXT | Full job description | We are seeking an OR specialist... |
| skills | TEXT | Comma-separated extracted skills | Python, Gurobi, Linear Programming |
| company_name | TEXT | Company name | Amazon |
| country | TEXT | Country code | USA |
| location | TEXT | City/State/Country | Seattle, WA, USA |
| search_term | TEXT | Search query that found this job | Operations Research |
| posted_date | TEXT | Job posting date (ISO format) | 2026-01-01 |
| scraped_at | TEXT | When we scraped it | 2026-01-02T14:35:00 |
| is_remote | INTEGER | Remote work option | 1 (yes), 0 (no), NULL (unknown) |
| job_level | TEXT | LinkedIn seniority | Mid-Senior level |
| job_function | TEXT | Job category | Engineering |
| company_industry | TEXT | Industry | IT Services |
| company_url | TEXT | Direct application URL | https://amazon.jobs/... |
```sql
CREATE UNIQUE INDEX idx_jobs_url ON jobs(url);
CREATE INDEX idx_jobs_country ON jobs(country);
CREATE INDEX idx_jobs_platform ON jobs(platform);
CREATE INDEX idx_jobs_search_term ON jobs(search_term);
```

Note: URL deduplication happens in-memory during scraping using a `seen_urls` set.
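A sketch of the two deduplication layers described here: the in-memory `seen_urls` set during a run, plus the UNIQUE index on `jobs.url` catching URLs stored in earlier runs. The `store_job` function and the reduced column set are illustrative assumptions.

```python
import sqlite3

# In-memory database for illustration; the real system uses data/jobs.db
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (job_id TEXT PRIMARY KEY, url TEXT, actual_role TEXT)")
conn.execute("CREATE UNIQUE INDEX idx_jobs_url ON jobs(url)")

seen_urls = set()

def store_job(job_id, url, title):
    """Return True if the job was newly stored, False if deduplicated."""
    if url in seen_urls:                 # layer 1: skip repeats within this run
        return False
    seen_urls.add(url)
    # layer 2: the UNIQUE index silently rejects URLs from earlier runs
    cur = conn.execute(
        "INSERT OR IGNORE INTO jobs (job_id, url, actual_role) VALUES (?, ?, ?)",
        (job_id, url, title),
    )
    return cur.rowcount == 1             # rowcount is 0 when the insert was ignored
```

`INSERT OR IGNORE` keeps the write path idempotent, so re-running a scrape cannot create duplicate rows.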
The system uses 977 skills from skills_reference_2025.json with regex pattern matching:
- Purpose: Catch complex technical terms first
- Examples: "Machine Learning", "Deep Learning", "Linear Programming", "OR-Tools"
- Method: Exact phrase matching with word boundaries
- Priority: Highest (processed first to prevent partial matches)
- Purpose: Validate single-word skills with surrounding context
- Examples: "Python" near "programming", "SQL" near "database"
- Method: Keyword + context validation
- Benefit: Reduces false positives (e.g., "python" the snake vs Python the language)
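One way Layer-2 context validation could work, sketched here under assumptions: the cue lists and window size are illustrative, not the extractor's actual values.

```python
import re

# Hypothetical context cues: a single-word skill only counts when one
# of these words appears near the match.
CONTEXT_CUES = {
    "Python": ["programming", "script", "pandas", "developer"],
    "SQL": ["database", "query", "queries"],
}

def validate_with_context(skill, description, window=60):
    """Accept `skill` only if a context cue occurs within `window`
    characters of a whole-word match (reduces false positives)."""
    text = description.lower()
    for m in re.finditer(r"\b" + re.escape(skill.lower()) + r"\b", text):
        lo, hi = max(0, m.start() - window), m.end() + window
        if any(cue in text[lo:hi] for cue in CONTEXT_CUES.get(skill, [])):
            return True
    return False
```

So "Strong Python programming skills" validates Python, while "a python slithered across the road" does not.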
- Purpose: Comprehensive coverage from 977-skill reference database
- Method: Apply regex patterns from `skills_reference_2025.json`
- Examples: `"\\bgurobi\\b"` matches "Gurobi"; `"linear\\s+programming|\\bLP\\b"` matches "Linear Programming" or "LP"
- Coverage: 977 skills across 15+ categories
Programming Languages: Python, Java, C++, R, SQL, Julia, etc. (50+ skills)
Optimization Tools: Gurobi, CPLEX, OR-Tools, SCIP, FICO Xpress, etc. (25+ skills)
OR Techniques: Linear Programming, MILP, Constraint Programming, etc. (40+ skills)
Data Science: Machine Learning, Deep Learning, NLP, etc. (100+ skills)
Cloud Platforms: AWS, Azure, GCP, etc. (30+ skills)
Databases: PostgreSQL, MongoDB, Redis, Cassandra, etc. (40+ skills)
Frameworks: TensorFlow, PyTorch, Spark, Pandas, etc. (80+ skills)
DevOps: Docker, Kubernetes, CI/CD, Jenkins, etc. (50+ skills)
Analytics: Tableau, Power BI, Excel, Matplotlib, etc. (35+ skills)
Supply Chain: SAP, Warehouse Management, Inventory Optimization, etc. (30+ skills)
Math/Stats: Statistics, Probability, Calculus, Heuristics, etc. (25+ skills)
Web Technologies: REST API, GraphQL, React, Node.js, etc. (60+ skills)
Big Data: Hadoop, Spark, Kafka, Airflow, etc. (35+ skills)
Version Control: Git, GitHub, GitLab, Bitbucket, etc. (15+ skills)
Project Management: Agile, Scrum, Jira, Lean, etc. (20+ skills)
... and more categories covering all technical domains
- Layer 1: Extract multi-word phrases → `["Linear Programming", "Machine Learning"]`
- Layer 2: Extract context-validated terms → `["Python", "SQL", "Gurobi"]`
- Layer 3: Apply 977 regex patterns → `["optimization", "CPLEX", "AWS"]`
- Filter Common Words: Remove grammatical words such as "and", "the", "with"
- Split Conjunctions: `"Python and SQL"` → `["Python", "SQL"]`
- Deduplicate: Merge similar → `"ML"` + `"Machine Learning"` → `"Machine Learning"`
- Validate: Cross-reference with skills_reference → remove false positives
- Output: Comma-separated string → `"Python, Gurobi, Linear Programming, AWS"`
Input Job Description:
We're seeking an Operations Research Engineer with expertise in
linear programming using Gurobi and CPLEX. You'll optimize
supply chain routes using Python and implement ML models.
Extracted Skills:
Linear Programming, Gurobi, CPLEX, Python, Machine Learning,
Operations Research, Supply Chain Optimization
Performance: 0.3 seconds per job, 80-85% accuracy
```python
COUNTRY_JOB_TARGETS = {
    "USA": 200,          # Large market
    "India": 200,        # Large market
    "UK": 150,           # Medium-large
    "Germany": 100,      # Medium
    "Netherlands": 50,   # Small
    # ... 10 total countries
}

# Multiplier to scale all targets
COUNTRY_JOB_MULTIPLIER = 2.5  # Default: 2.5x
# Example: USA gets 200 × 2.5 = 500 jobs
```

To Modify: Edit `COUNTRY_JOB_MULTIPLIER` in the script to scale all targets at once.
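The scaling arithmetic is simple enough to sketch; `scaled_targets` is an illustrative helper, and only a subset of the 10 countries is shown.

```python
COUNTRY_JOB_TARGETS = {"USA": 200, "India": 200, "UK": 150, "Netherlands": 50}
COUNTRY_JOB_MULTIPLIER = 2.5

def scaled_targets(targets=COUNTRY_JOB_TARGETS, mult=COUNTRY_JOB_MULTIPLIER):
    """Apply the single multiplier to every country's base target."""
    return {country: int(base * mult) for country, base in targets.items()}
```

With the default multiplier this reproduces the documented targets: USA 200 × 2.5 = 500, Netherlands 50 × 2.5 = 125.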
```python
SEARCH_TERMS = [
    "operation research",  # lowercased for stricter matching
    "Mathematical Optimization",
    "MILP",
    "Integer Programming",
    "Gurobi",
    "Routing Optimization",
    "Supply Chain Optimization",
    "Simulation Optimization",
]
```

To Modify: Edit the `SEARCH_TERMS` list in the script to add/remove queries.
```python
# Tier 1: Broad title keywords (13 terms)
FILTER_KEYWORDS_TITLE = [
    "optim", "operations research", "supply chain",
    "logistics", "routing", "scheduling", ...
]

# Tier 2: Technical keywords (24 terms)
FILTER_KEYWORDS_STRONG = [
    "linear programming", "integer programming",
    "gurobi", "cplex", "or-tools", ...
]

# Tier 3: Negative keywords (9 terms)
FILTER_KEYWORDS_NEGATIVE = [
    "seo", "search engine", "sales optimization", ...
]
```

To Modify: Edit these lists in the script to adjust filtering criteria.
LinkedIn settings (in `run_netherlands_indeed_linkedin.py`):

```python
LINKEDIN_SLEEP_SEC = 10.0          # Delay between queries
LINKEDIN_MAX_ERRORS = 3            # Stop after N errors
LINKEDIN_FETCH_DESCRIPTION = True  # Get full descriptions
```

Batch settings (in `quick_batch.py`):

```python
# Edit these settings:
SCRAPE_ALL_COUNTRIES = True   # True = all 10, False = specific
# SPECIFIC_COUNTRIES = ["Netherlands", "Germany"]  # If False above
DELAY_BETWEEN_COUNTRIES = 30  # Minutes between countries
```

Batch settings (in `run_batch_scraper.py`):

```python
COUNTRIES = None    # None = all 10, or ["USA", "UK", "Germany"]
DELAY_MINUTES = 60  # Delay between countries
JOBS = 50           # Fallback (overridden by COUNTRY_JOB_TARGETS)
```

All available command-line options:

```shell
python code/run_netherlands_indeed_linkedin.py \
    --jobs 50 \                    # Fallback job count (usually ignored)
    --batch \                      # Enable batch mode (RECOMMENDED)
    --delay 45 \                   # Minutes between countries
    --countries "USA,UK,Germany"   # Comma-separated country list
```

| Scenario | Delay Setting | Expected Results |
|---|---|---|
| Single Country Test | No delay needed | 30-50 jobs, ~5-10 minutes |
| 2-3 Countries | 30 minutes | 100-150 jobs, ~1.5-2 hours |
| 5-10 Countries | 45-60 minutes | 300-500 jobs, 6-10 hours |
| Production (All 10) | 60 minutes | 400-600 jobs, overnight |
Why Delays Are Critical:
- LinkedIn rate limiting: ~90-120 queries before blocking
- Without delays: Blocked after ~10 jobs total (unusable)
- With 30-60 min delays: 30-50 jobs per country (success)
- Delays are between countries, not queries
| Operation | Time | Details |
|---|---|---|
| URL Collection | 2-5 min | 100-200 URLs per country |
| Detail Extraction | 3-8 min | 30-50 jobs per country |
| Skill Extraction | 0.3s/job | 977-skill reference matching |
| CSV Export | <1 sec | All jobs to CSV |
| Single Country | ~5-10 min | Complete pipeline |
| All 10 Countries (batch) | 6-10 hours | With safe delays |
Total Jobs: 450
Countries: 10
Platforms: LinkedIn (70%), Indeed (30%)
Skills Extracted: ~15-25 per job
Companies: 287 unique
Remote Jobs: 35%
LinkedIn rate limiting? Solution: Use batch mode with 30-60 minute delays:

```shell
python code/run_netherlands_indeed_linkedin.py --batch --delay 45
```

Database locked? Solution: Close other scripts accessing jobs.db:

```shell
# Stop all running scrapers
# Then restart
```

Skills not extracted? Solution: Check that skills_reference_2025.json exists:

```shell
ls code/src/config/skills_reference_2025.json
```

URLs missing details? Solution: Run the detail extraction phase:

```shell
# Check pending URLs
python code/check_current_urls.py
# They will be scraped in the next run automatically
```

- ✅ Use batch mode with 45-60 min delays
- ✅ Run overnight (6-10 hours total)
- ✅ Start with 2-3 countries to test
- ❌ Don't scrape all countries without delays
- ✅ Use single country mode
- ✅ Set --jobs 20 for quick tests
- ✅ Verify results with export_to_csv.py
- ❌ Don't use --batch for testing
- ✅ Review keyword filters regularly
- ✅ Update skills_reference_2025.json
- ✅ Run remove_duplicates.py periodically
- ✅ Check job descriptions match expected roles
- ✅ System auto-saves progress (checkpoint-based)
- ✅ Can resume interrupted scraping
- ✅ Database handles duplicates automatically
- ✅ Export data regularly as backup
job-scrapper/
├── README.md # Project overview
├── requirements.txt # Python dependencies
├── JOB_SCRAPER_DOCUMENTATION.md # This file (complete documentation)
│
├── code/ # Main code directory
│ ├── quick_batch.py # ⭐ EASIEST: Edit & run for batch mode
│ ├── run_batch_scraper.py # Auto batch scraper (alternative)
│ ├── run_netherlands_indeed_linkedin.py # 🎯 MAIN SCRAPER
│ ├── export_to_csv.py # Export database to CSV
│ ├── remove_duplicates.py # Clean duplicate entries
│ ├── check_current_urls.py # View scraping progress
│ ├── check_url_in_db.py # Verify if URL exists
│ ├── verify_csv_urls.py # Verify CSV against database
│ ├── BATCH_MODE_GUIDE.py # Usage guide for batch mode
│ │
│ ├── data/
│ │ ├── jobs.db # 📦 SQLite database (auto-created)
│ │ └── jobs_export_*.csv # Exported data files
│ │
│ └── src/
│ ├── config/
│ │ └── skills_reference_2025.json # ✅ ONLY CONFIG FILE (977 skills)
│ │
│ ├── db/
│ │ ├── connection.py # Database connection manager
│ │ ├── operations.py # CRUD operations
│ │ └── schema.py # Table schemas
│ │
│ ├── models/
│ │ └── models.py # Pydantic data models
│ │
│ ├── analysis/
│ │ └── skill_extraction/
│ │ ├── extractor.py # Main 3-layer extractor
│ │ ├── layer3_direct.py # Pattern matching
│ │ ├── advanced_regex_extractor.py # Layer 1 & 2
│ │ ├── normalize.py # Deduplication
│ │ ├── common_words_filter.py # Filter common words
│ │ └── confidence_scorer.py # Skill confidence
│ │
│ └── validation/
│ └── realtime_validator.py # Skill validation
│
└── venv/ # Python virtual environment
| File | Purpose | When to Use |
|---|---|---|
| quick_batch.py | Simplest batch scraper | Edit settings and run (recommended) |
| run_netherlands_indeed_linkedin.py | Main scraper script | Direct CLI usage with args |
| export_to_csv.py | Export jobs to CSV | After scraping completes |
| skills_reference_2025.json | 977 skill patterns | Auto-loaded (don't modify) |
| jobs.db | SQLite database | Auto-created, stores all data |
| remove_duplicates.py | Clean duplicates | If you see duplicate jobs |
```shell
# ============ SCRAPING ============
# Quick test (one country)
python code/run_netherlands_indeed_linkedin.py --jobs 20

# Batch all countries (recommended)
python code/quick_batch.py

# Batch specific countries
python code/run_netherlands_indeed_linkedin.py --batch --countries "USA,UK" --delay 30

# ============ EXPORTING ============
# Export to CSV
python code/export_to_csv.py

# ============ MAINTENANCE ============
# View progress
python code/check_current_urls.py

# Remove duplicates
python code/remove_duplicates.py

# Verify URLs
python code/check_url_in_db.py

# ============ TESTING ============
# Test skill extraction
python code/test_skills.py

# Test real extraction
python code/test_real_extraction.py

# Verify CSV URLs
python code/verify_csv_urls.py
```

This job scraper is a production-ready system for collecting Operations Research and Optimization jobs with:
✅ JobSpy-powered scraping for LinkedIn + Indeed
✅ 3-tier keyword filtering ensuring only OR/Optimization jobs
✅ 3-layer skill extraction with 977-skill reference database
✅ Multi-country support across 10 countries (USA, India, UK, Germany, etc.)
✅ Market-based targeting with 2.5x multiplier (500 jobs for USA, 125 for Netherlands)
✅ Batch mode with delays to avoid LinkedIn rate limiting
✅ Comprehensive metadata with 17 fields per job
✅ CSV export for easy analysis in Excel/Google Sheets
✅ In-memory deduplication preventing duplicate jobs
Simplest Method (Batch Mode):

```shell
# 1. Edit settings in quick_batch.py
#    - SCRAPE_ALL_COUNTRIES = True
#    - DELAY_BETWEEN_COUNTRIES = 30
#
# 2. Run overnight
python code/quick_batch.py
#
# 3. Export to CSV
python code/export_to_csv.py
```

Expected Results:
- Time: 6-10 hours (overnight run)
- Jobs: 400-600 total across 10 countries
- Distribution: USA/India get ~500 jobs, smaller countries ~125 jobs
- Platforms: ~70% LinkedIn, ~30% Indeed
- Skills: 15-25 skills extracted per job
Why This Works:
- ✅ Market-based targets ensure quality over quantity
- ✅ 3-tier filtering removes SEO, marketing, sales jobs
- ✅ Delays prevent LinkedIn blocking
- ✅ Skill extraction provides actionable insights
- ✅ CSV export enables custom analysis
Last Updated: January 2, 2026
Focus: Operations Research & Mathematical Optimization Jobs