Mbehbahani/Oploy-Job-Scrapper

Job Scraper System - Complete Documentation

Overview

This is an automated job scraping system designed to collect Operations Research and Optimization job postings from LinkedIn and Indeed using intelligent keyword filtering. The system extracts technical skills using advanced 3-layer pattern matching and stores comprehensive job data for analysis.

Primary Focus: Operations Research, Mathematical Optimization, Supply Chain, Routing/Scheduling, Simulation, and related optimization roles.


Quick Start & Why It Matters

  • Procedure (run → store → export): Run run_netherlands_indeed_linkedin.py to scrape, persist to data/jobs.db, then export with export_to_csv.py for downstream analytics.
  • Precision/Recall via Tiered Filters: Tier 1 (titles) maximizes recall for optimization-themed roles; Tier 2 (descriptions) sharpens precision with solver/OR terms; negatives remove obvious non-OR noise.
  • Data Completeness: Indeed yields richer company metadata (including employee counts); LinkedIn exposes lighter company fields via JobSpy (company name/URL mostly).
  • Keyword Tuning Loop: Keep only search terms with healthy acceptance (Final/Found) and high Tier2 hits; prune noisy terms after each run based on the filter stats table.
  • SEO-friendly insights: Focuses on “operations research jobs”, “optimization jobs”, “supply chain optimization”, “MILP / integer programming”, “Gurobi / OR-Tools”, and “routing/scheduling optimization”.
  • Live Dashboard: Processed/NLP-enriched results are visualized at https://joblab.oploy.eu/.


Table of Contents

  1. System Architecture
  2. Key Features
  3. Input Requirements
  4. Processing Pipeline
  5. Output & Results
  6. Usage Procedures
  7. Database Schema
  8. Skill Extraction System
  9. Configuration

System Architecture

Single-Phase Direct Scraping

The system uses JobSpy library for direct scraping with intelligent filtering:

How It Works:

  • Uses python-jobspy library to scrape LinkedIn and Indeed
  • Searches with 8 optimization-related keywords per country
  • Applies 3-tier keyword filtering to ensure job relevance
  • Extracts skills using 3-layer pattern matching (977 skills)
  • Stores complete job data in SQLite database
  • Deduplication: Tracks URLs to prevent duplicates

Technology Stack

├── Web Scraping:    python-jobspy (LinkedIn & Indeed API wrapper)
├── Data Models:     Pydantic v2 (validation)
├── Database:        SQLite (jobs.db)
├── Skill Extraction: 3-layer regex pattern matching (977 skills)
├── Filtering:       3-tier keyword matching (broad → technical → negative)
├── Export:          CSV format
└── Multi-Country:   10 countries with batch mode support

Key Features

1. Multi-Country Support

  • 10 Countries: USA, Canada, UK, Netherlands, Germany, Denmark, France, Austria, Australia, India
  • Market-Based Targeting: Larger markets get more jobs (USA/India: 500, UK: 375, etc.)
  • Batch Mode: Sequential scraping with configurable delays (30-60 minutes)
  • Multiplier Setting: Scale all country targets with single parameter (default: 2.5x)

2. 3-Tier Intelligent Keyword Filtering

The system ensures only Operations Research / Optimization jobs are collected:

Tier 1: Broad Title Keywords (Cast Wide Net)

"optim", "operations research", "supply chain", "logistics", 
"routing", "scheduling", "decision science", "algorithm",
"data scientist", "machine learning", "analytics", "solver", "mathematical"

Tier 2: Technical Keywords (High Precision)

"operations research", "linear programming", "integer programming",
"mixed integer", "milp", "mip", "gurobi", "cplex", "or-tools", "ortools",
"constraint programming", "combinatorial optimization",
"mathematical optimization", "network optimization",
"vehicle routing", "routing optimization", "scheduling optimization",
"supply chain optimization", "supply chain", "inventory optimization",
"inventory management", "demand planning", "forecasting",
"heuristic", "metaheuristic", "convex optimization",
"stochastic optimization", "discrete optimization",
"simulation", "prescriptive analytics",
"pulp", "pyomo",
"Industrial Engineering", "Fulfillment Optimization"

Tier 3: Negative Keywords (Reject Non-OR Jobs)

"seo", "search engine", "sales optimization", "marketing optimization",
"conversion optimization", "website optimization", "social media"

Filtering Logic: Accept job if (Title matches Tier 1 OR Description matches Tier 2) AND NOT (Title contains Tier 3)
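The acceptance rule can be sketched as a small predicate. The keyword lists below are abbreviated stand-ins for illustration; the full lists are hardcoded in run_netherlands_indeed_linkedin.py.

```python
# Sketch of the 3-tier acceptance rule. The keyword lists here are
# abbreviated examples, not the full lists from the main script.
TIER1_TITLE = ["optim", "operations research", "routing", "supply chain"]
TIER2_STRONG = ["linear programming", "gurobi", "milp", "or-tools"]
TIER3_NEGATIVE = ["seo", "sales optimization", "marketing optimization"]

def accept_job(title: str, description: str) -> bool:
    """Accept if (Tier 1 title OR Tier 2 description) AND NOT Tier 3 title."""
    title_lc, desc_lc = title.lower(), description.lower()
    tier1 = any(k in title_lc for k in TIER1_TITLE)
    tier2 = any(k in desc_lc for k in TIER2_STRONG)
    negative = any(k in title_lc for k in TIER3_NEGATIVE)
    return (tier1 or tier2) and not negative
```

Note the asymmetry: Tier 1 runs against the title only, Tier 2 against the description only, and the negative filter vetoes regardless of the other two.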

3. 8 Optimization-Specific Search Terms (current)

"operation research",  # lowercased for stricter matching
"Mathematical Optimization",
"MILP",
"Integer Programming",
"Gurobi",
"Routing Optimization",
"Supply Chain Optimization",
"Simulation Optimization",

4. Advanced Skill Extraction

3-Layer Extraction System:

  • Layer 1: Multi-word phrases (e.g., "Machine Learning", "Linear Programming")
  • Layer 2: Context-aware extraction with validation
  • Layer 3: Direct pattern matching from 977-skill reference database
  • Performance: 80-85% accuracy at 0.3s/job (10x faster than spaCy)
  • Output: Comma-separated skills stored with each job

5. Adaptive Rate Limiting

  • LinkedIn Protection: 10-second delay between queries
  • Error Handling: Stops after 3 consecutive errors (rate limit detection)
  • Batch Mode: 30-second delay between countries
  • Recommended: Use batch mode with 30-60 minute delays for multi-country scraping

6. Comprehensive Metadata

Captures 17 fields per job:

  • Basic: Title, description, company, location, URL
  • Enhanced: Remote status, job level, job function, industry
  • Temporal: Posted date (last 4 weeks), scraped timestamp
  • Analytics: Skills (extracted), search term used, country
  • URLs: Platform URL (LinkedIn/Indeed) + Direct company URL
  • Company size: company_num_employees (robust on Indeed; rarely provided by LinkedIn via JobSpy)

Precision vs Recall (Tiered Filtering)

  • Goal: Maximize recall of relevant optimization jobs while keeping precision high enough to avoid “SEO/marketing optimization” noise.
  • Tier 1 (Recall engine): Title keywords (e.g., "optim", "supply chain", "routing", "algorithm") pull in borderline-but-possibly-relevant roles. Expect higher false positives, but few misses.
  • Tier 2 (Precision engine): Description keywords (e.g., "MILP", "integer programming", "Gurobi", "OR-Tools", "simulation", "supply chain optimization") confirm true OR/optimization relevance.
  • Negative filter: Quickly discards obvious non-OR roles (SEO/marketing/sales optimization), boosting precision without hurting recall.
  • Acceptance rule: Accept if (Tier1 OR Tier2) AND NOT negative → balanced recall/precision without heavy ML.
  • Stats table: Each run prints per-search-term stats (Found, Neg, T1, T2, Both, NoMatch, Final, Rate). Use this to:
    • Drop low-performing search terms (low Final/Found or many NoMatch).
    • Strengthen Tier2 by adding solver/tech terms seen in good results.
    • Adjust Tier1 when T1-only is high (potential false positives).

How to Improve Keyword Selection

  • Iterate from data: After each run, remove search terms with <50% acceptance; add new terms from accepted descriptions (solvers, methods, domains).
  • Balance tiers: If too many T1-only hits, add more Tier2 technical terms; if recall is low, broaden Tier1 slightly.
  • Domain variants: Add industry-specific phrases (e.g., "timetabling", "portfolio optimization", "workforce scheduling") when targeting new niches.
  • Language/locale: When expanding countries, include local-language equivalents in Tier1 and Tier2.

Adapting to Other Fields

  • New domains: Replace Tier1/Tier2 keyword lists with your domain’s title cues and technical signals; keep negative list to protect precision.
  • Skills extraction: Update skills_reference_2025.json with domain skills and regex patterns.
  • Markets: Edit COUNTRIES, COUNTRY_JOB_TARGETS, and SEARCH_TERMS in run_netherlands_indeed_linkedin.py.
  • Outputs: Add new DB columns via src/db/schema.py and Pydantic models in src/models/models.py if you need extra metadata.

Input Requirements

1. Configuration File (Skills Reference Only)

skills_reference_2025.json

Location: code/src/config/skills_reference_2025.json

Purpose: Used by 3-layer skill extraction system to identify technical skills in job descriptions.

{
  "skills": {
    "Python": {
      "regex_pattern": "\\bpython\\b",
      "category": "Programming Language",
      "priority": 1
    },
    "Linear Programming": {
      "regex_pattern": "linear\\s+programming|\\bLP\\b",
      "category": "Optimization",
      "priority": 2
    },
    "Gurobi": {
      "regex_pattern": "\\bgurobi\\b",
      "category": "Optimization Solver",
      "priority": 1
    }
    // ... 977 total skills
  }
}

This is the ONLY configuration file - all other settings are hardcoded in the main script.
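Loading the reference and pre-compiling each pattern once is the natural way to consume this file. The inline `sample` dict below mimics the documented JSON layout so the sketch is self-contained; in practice you would `json.load` the real 977-skill file.

```python
import json
import re

# Stand-in for skills_reference_2025.json, following the documented layout.
sample = {
    "skills": {
        "Python": {"regex_pattern": r"\bpython\b",
                   "category": "Programming Language", "priority": 1},
        "Gurobi": {"regex_pattern": r"\bgurobi\b",
                   "category": "Optimization Solver", "priority": 1},
    }
}

def compile_skill_patterns(reference: dict) -> dict:
    """Map skill name -> compiled, case-insensitive regex (compiled once)."""
    return {
        name: re.compile(entry["regex_pattern"], re.IGNORECASE)
        for name, entry in reference["skills"].items()
    }

patterns = compile_skill_patterns(sample)
found = [s for s, rx in patterns.items()
         if rx.search("Experience with Python and Gurobi")]
```

Compiling up front matters at the documented scale: 977 patterns applied per job description.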

2. Built-in Search Configuration

These are hardcoded in run_netherlands_indeed_linkedin.py:

Countries Dictionary (10 countries)

COUNTRIES = {
    "USA": {"name": "USA", "indeed_country": "USA", "flag": "🇺🇸"},
    "Netherlands": {"name": "Netherlands", "indeed_country": "Netherlands", "flag": "🇳🇱"},
    # ... 10 total countries
}

Search Terms (8 OR-specific queries)

SEARCH_TERMS = [
  "operation research",  # lowercased for stricter matching
  "Mathematical Optimization",
  "MILP",
  "Integer Programming",
  "Gurobi",
  "Routing Optimization",
  "Supply Chain Optimization",
  "Simulation Optimization",
]

Country Job Targets (Market-based)

COUNTRY_JOB_TARGETS = {
    "USA": 200,        # × 2.5 = 500 jobs
    "India": 200,      # × 2.5 = 500 jobs
    "UK": 150,         # × 2.5 = 375 jobs
    "Germany": 100,    # × 2.5 = 250 jobs
    "Netherlands": 50, # × 2.5 = 125 jobs
    # ... scaled by COUNTRY_JOB_MULTIPLIER = 2.5
}
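The effective per-country quota is simply the base target times the multiplier, matching the values in the comments above (e.g. USA: 200 × 2.5 = 500):

```python
# Base targets and multiplier as documented; dict abbreviated to 5 countries.
COUNTRY_JOB_TARGETS = {"USA": 200, "India": 200, "UK": 150,
                       "Germany": 100, "Netherlands": 50}
COUNTRY_JOB_MULTIPLIER = 2.5

def effective_target(country: str) -> int:
    """Jobs requested for a country after scaling by the global multiplier."""
    return int(COUNTRY_JOB_TARGETS[country] * COUNTRY_JOB_MULTIPLIER)
```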

Keyword Filters (3-tier filtering)

  • FILTER_KEYWORDS_TITLE: 13 broad keywords for title matching
  • FILTER_KEYWORDS_STRONG: 24 technical keywords for description matching
  • FILTER_KEYWORDS_NEGATIVE: 9 keywords to reject non-OR jobs

3. Runtime Parameters

Users configure via command-line arguments:

  • --jobs: Fallback job count (default: 50, overridden by country targets)
  • --countries: Comma-separated country list (default: all 10)
  • --batch: Enable batch mode with delays
  • --delay: Minutes between countries in batch mode (default: 60)

Processing Pipeline

Complete Workflow

┌─────────────────────────────────────────────────────────────────┐
│ 1. INITIALIZATION                                               │
│    - Load skills reference (977 skills)                         │
│    - Initialize database (jobs.db)                              │
│    - Setup JobSpy scraper                                       │
│    - Load country targets and search terms                      │
└─────────────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────────────┐
│ 2. MULTI-PLATFORM SCRAPING (Indeed + LinkedIn)                  │
│    For each country:                                            │
│                                                                 │
│    ┌─ INDEED PLATFORM ─────────────────────────────────────┐    │
│    │ For each of 8 search terms:                           │    │
│    │   • Search Indeed with JobSpy API                     │    │
│    │   • Get results (based on country target)             │    │
│    │   • Apply 3-tier keyword filter                       │    │
│    │   • Track URLs (prevent duplicates)                   │    │
│    └───────────────────────────────────────────────────────┘    │
│                                                                 │
│    ┌─ LINKEDIN PLATFORM ──────────────────────────────────┐    │
│    │ Sequential query approach:                            │    │
│    │   • Run all 8 search terms with 10s delays           │    │
│    │   • Fetch full descriptions (slower but complete)     │    │
│    │   • Apply 3-tier keyword filter                       │    │
│    │   • Detect & stop on rate limiting                    │    │
│    │   • Stop after 3 consecutive errors                   │    │
│    └───────────────────────────────────────────────────────┘    │
│                                                                 │
│    - Deduplication: URL tracking prevents duplicate jobs       │
│    - Rate limiting: 10s between LinkedIn queries               │
│    - Delay: 30s between countries (prevent overwhelming)       │
└─────────────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────────────┐
│ 3. KEYWORD FILTERING (3-Tier Quality Control)                   │
│    For each scraped job:                                        │
│                                                                 │
│    ✅ TIER 1: Title Check                                       │
│       - Match any of 13 broad keywords                          │
│       - Examples: "optim", "operations research", "routing"     │
│                                                                 │
│    ✅ TIER 2: Description Check                                 │
│       - Match any of 24 technical keywords                      │
│       - Examples: "linear programming", "gurobi", "milp"        │
│                                                                 │
│    ❌ TIER 3: Negative Filter                                   │
│       - Reject if title contains 9 negative keywords            │
│       - Examples: "seo", "sales optimization", "marketing"      │
│                                                                 │
│    Decision: Accept if (Tier 1 OR Tier 2) AND NOT Tier 3       │
│    Result: Only OR/Optimization-relevant jobs                   │
└─────────────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────────────┐
│ 4. DATA PROCESSING & ENRICHMENT                                 │
│    For each filtered job:                                       │
│      • Extract metadata from JobSpy results                     │
│      • Parse location data (city, state, country)               │
│      • Parse posted date (last 4 weeks filter)                  │
│      • Extract remote status, job level, function, industry     │
│      • Get both platform URL and company career URL             │
└─────────────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────────────┐
│ 5. SKILL EXTRACTION (3-Layer Pattern Matching)                  │
│    For each job description:                                    │
│                                                                 │
│    Layer 1: Multi-word phrases                                  │
│      • Match complex terms: "Machine Learning", "OR-Tools"      │
│      • Priority: Highest (processed first)                      │
│                                                                 │
│    Layer 2: Context-aware extraction                            │
│      • Validate with context: "Python" + "programming"          │
│      • Reduces false positives                                  │
│                                                                 │
│    Layer 3: Direct pattern matching                             │
│      • Apply 977 regex patterns from skills_reference.json      │
│      • Comprehensive coverage across all categories             │
│                                                                 │
│    Post-processing:                                             │
│      • Filter common words (and, the, with, using)              │
│      • Split conjunctions ("Python and SQL" → 2 skills)         │
│      • Deduplicate (normalize similar skills)                   │
│      • Validate against reference (remove false positives)      │
│      • Output: Comma-separated skill list                       │
└─────────────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────────────┐
│ 6. DATABASE STORAGE                                             │
│    - Store complete job details to jobs table                   │
│    - 17 fields: IDs, URLs, content, metadata, skills            │
│    - Atomic operations (thread-safe)                            │
│    - Auto-deduplication by URL                                  │
└─────────────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────────────┐
│ 7. BATCH MODE (Multi-Country Sequential)                        │
│    If batch mode enabled:                                       │
│      • Scrape first country completely                          │
│      • Wait configured delay (30-60 minutes)                    │
│      • Scrape next country                                      │
│      • Repeat for all countries                                 │
│      • Total time: 6-10 hours (run overnight)                   │
│                                                                 │
│    Benefits:                                                    │
│      • Avoids LinkedIn rate limiting                            │
│      • Gets 30-50 jobs per country (vs 10 without delays)       │
│      • Resumable if interrupted                                 │
└─────────────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────────────┐
│ 8. EXPORT & SUMMARY                                             │
│    - Generate console statistics:                               │
│      • Jobs by country and platform                             │
│      • Total jobs, unique URLs                                  │
│      • Filtering stats (original vs filtered)                   │
│                                                                 │
│    - CSV Export (manual):                                       │
│      • Run: python export_to_csv.py                             │
│      • Output: jobs_export_YYYYMMDD_HHMMSS.csv                 │
│      • Includes all 17 fields + summaries                       │
└─────────────────────────────────────────────────────────────────┘
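Step 2 of the pipeline boils down to one JobSpy call per search term with a fixed pause between queries. The parameter names below follow python-jobspy's `scrape_jobs` API; treat the whole function as an approximation of what the script does, not its exact code:

```python
import time

def scrape_country(country: str, search_terms: list[str], target: int):
    """Query Indeed + LinkedIn for each search term via JobSpy.

    Sketch only: the real script interleaves filtering and dedup here.
    """
    from jobspy import scrape_jobs  # third-party: pip install python-jobspy

    frames = []
    per_term = max(1, target // len(search_terms))
    for term in search_terms:
        df = scrape_jobs(
            site_name=["indeed", "linkedin"],
            search_term=term,
            country_indeed=country,
            results_wanted=per_term,
            linkedin_fetch_description=True,  # slower, but full descriptions
        )
        frames.append(df)
        time.sleep(10)  # LinkedIn protection: 10 s between queries
    return frames
```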

Output & Results

1. Database (jobs.db)

Two main tables:

Table: job_urls

| Field | Type | Description |
| --- | --- | --- |
| job_id | TEXT PRIMARY KEY | MD5 hash of platform + URL |
| platform | TEXT | "LinkedIn" or "Indeed" |
| input_role | TEXT | Normalized search term |
| actual_role | TEXT | Scraped job title |
| url | TEXT UNIQUE | Job posting URL |
| scraped | INTEGER | 0 = pending, 1 = completed |
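Per the schema, job_id is an MD5 hash of platform + URL. A minimal sketch, assuming plain concatenation of the two parts (the real script may join them differently):

```python
import hashlib

def make_job_id(platform: str, url: str) -> str:
    """MD5 of platform + URL; concatenation order is an assumption."""
    return hashlib.md5(f"{platform}{url}".encode("utf-8")).hexdigest()

jid = make_job_id("LinkedIn", "https://linkedin.com/jobs/view/123")
```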

Table: jobs

| Field | Type | Description |
| --- | --- | --- |
| job_id | TEXT PRIMARY KEY | Links to job_urls |
| platform | TEXT | Source platform |
| input_role | TEXT | Normalized search term |
| actual_role | TEXT | Job title |
| url | TEXT UNIQUE | Job URL |
| job_description | TEXT | Full description |
| skills | TEXT | Comma-separated skills |
| company_name | TEXT | Company name |
| country | TEXT | Job country |
| location | TEXT | City/region |
| search_term | TEXT | Original search query |
| posted_date | TEXT | ISO format date |
| scraped_at | TEXT | Timestamp |
| is_remote | INTEGER | 1=remote, 0=onsite, NULL=unknown |
| job_level | TEXT | Seniority level |
| job_function | TEXT | Job category |
| company_industry | TEXT | Industry sector |
| company_url | TEXT | Direct application URL |

2. CSV Export

File: data/jobs_export_YYYYMMDD_HHMMSS.csv

Contains all job data with comprehensive summaries:

  • Jobs by country
  • Jobs by platform (LinkedIn/Indeed)
  • Jobs by search term
  • Unique companies count
  • Remote vs on-site breakdown
  • Job level distribution

3. Console Output

Real-time progress tracking:

📊 Exporting jobs table...
✅ Exported 450 jobs to: data/jobs_export_20260102_013358.csv

📈 Summary by Country:
  - USA: 150 jobs
  - UK: 85 jobs
  - Germany: 75 jobs
  ...

📈 Summary by Platform:
  - LinkedIn: 320 jobs
  - Indeed: 130 jobs

🏢 Total unique companies: 287

Usage Procedures

Procedure 1: Quick Start (Single Country)

# Scrape one country for testing
python code/run_netherlands_indeed_linkedin.py --jobs 50

What happens:

  1. Scrapes 50 jobs from Netherlands (default)
  2. Stores URLs then details
  3. Extracts skills automatically
  4. Saves to jobs.db

Time: ~5 minutes


Procedure 2: Batch Mode (All Countries)

Option A: Using quick_batch.py (Recommended for Beginners)

# 1. Edit configuration in quick_batch.py
# Set: SCRAPE_ALL_COUNTRIES = True
# Set: DELAY_BETWEEN_COUNTRIES = 30 (minutes)

# 2. Run batch scraper
python code/quick_batch.py

Option B: Using run_batch_scraper.py

# Automatically scrapes all 10 countries with 60-min delays
python code/run_batch_scraper.py

Option C: Direct Command Line

# Batch mode with all countries, 45-min delays
python code/run_netherlands_indeed_linkedin.py --batch --delay 45

# Batch mode with specific countries
python code/run_netherlands_indeed_linkedin.py --batch --countries "USA,UK,Germany" --delay 30

What happens:

  1. Scrapes each country sequentially
  2. Waits configured delay between countries
  3. Safe from LinkedIn rate limiting
  4. Resumable if interrupted

Time: 6-10 hours (run overnight)


Procedure 3: Export Results to CSV

# Export all data to CSV
python code/export_to_csv.py

Output: data/jobs_export_YYYYMMDD_HHMMSS.csv

Includes:

  • All 17 job fields
  • Summary statistics
  • Breakdowns by country, platform, role
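A sketch of what export_to_csv.py does (the real script also prints the summary statistics; `db_path` and the abbreviated column handling are assumptions):

```python
import csv
import sqlite3
from datetime import datetime

def export_jobs(db_path: str = "data/jobs.db") -> str:
    """Dump the jobs table to a timestamped CSV and return the filename."""
    out = f"jobs_export_{datetime.now():%Y%m%d_%H%M%S}.csv"
    conn = sqlite3.connect(db_path)
    cur = conn.execute("SELECT * FROM jobs")
    header = [col[0] for col in cur.description]  # column names from cursor
    rows = cur.fetchall()
    conn.close()
    with open(out, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(header)
        writer.writerows(rows)
    return out
```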

Procedure 4: Check Database Status

# View scraping progress
python code/check_current_urls.py

Shows:

  • Total URLs collected
  • URLs scraped (have full details)
  • URLs pending (need detail extraction)
  • Progress by platform and role

Procedure 5: Remove Duplicates

# Clean duplicate entries
python code/remove_duplicates.py

Removes:

  • Duplicate job URLs
  • Duplicate job descriptions (same job, different URLs)

Procedure 6: Verify URLs

# Check if URLs exist in database
python code/check_url_in_db.py

# Verify CSV URLs against database
python code/verify_csv_urls.py

# Check platform-specific URLs
python code/verify_platform_urls.py

Database Schema

Single Table Design

The system uses a single table for job storage (simplified from two-phase):

┌─────────────────────────────────────────────────┐
│                    jobs                         │
├─────────────────────────────────────────────────┤
│ job_id (PK)        TEXT     # Unique identifier │
│ platform           TEXT     # "linkedin"/"indeed"│
│ actual_role        TEXT     # Job title         │
│ url (UNIQUE)       TEXT     # Platform job URL  │
│ job_description    TEXT     # Full description  │
│ skills             TEXT     # Comma-separated   │
│ company_name       TEXT     # Company name      │
│ country            TEXT     # Country code      │
│ location           TEXT     # City, State       │
│ search_term        TEXT     # Search query used │
│ posted_date        TEXT     # ISO date          │
│ scraped_at         TEXT     # Scrape timestamp  │
│ is_remote          INTEGER  # 1=remote, 0=onsite│
│ job_level          TEXT     # Seniority level   │
│ job_function       TEXT     # Job category      │
│ company_industry   TEXT     # Industry sector   │
│ company_url        TEXT     # Direct career URL │
└─────────────────────────────────────────────────┘

Field Details

| Field | Type | Description | Example |
| --- | --- | --- | --- |
| job_id | TEXT | Generated: platform_country_id_hash | linkedin_USA_12345_67890 |
| platform | TEXT | Source platform | linkedin or indeed |
| actual_role | TEXT | Job title from posting | Operations Research Analyst |
| url | TEXT | Platform job page URL (unique) | https://linkedin.com/jobs/view/123... |
| job_description | TEXT | Full job description | We are seeking an OR specialist... |
| skills | TEXT | Comma-separated extracted skills | Python, Gurobi, Linear Programming |
| company_name | TEXT | Company name | Amazon |
| country | TEXT | Country code | USA |
| location | TEXT | City/State/Country | Seattle, WA, USA |
| search_term | TEXT | Search query that found this job | Operations Research |
| posted_date | TEXT | Job posting date (ISO format) | 2026-01-01 |
| scraped_at | TEXT | When we scraped it | 2026-01-02T14:35:00 |
| is_remote | INTEGER | Remote work option | 1 (yes), 0 (no), NULL (unknown) |
| job_level | TEXT | LinkedIn seniority | Mid-Senior level |
| job_function | TEXT | Job category | Engineering |
| company_industry | TEXT | Industry | IT Services |
| company_url | TEXT | Direct application URL | https://amazon.jobs/... |

Key Indexes

CREATE UNIQUE INDEX idx_jobs_url ON jobs(url);
CREATE INDEX idx_jobs_country ON jobs(country);
CREATE INDEX idx_jobs_platform ON jobs(platform);
CREATE INDEX idx_jobs_search_term ON jobs(search_term);

Note: URL deduplication happens in-memory during scraping using a seen_urls set.
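That in-memory dedup amounts to a membership check against a shared set. A minimal sketch:

```python
# seen_urls is shared across search terms and platforms for the whole run,
# so the same posting found by two queries is stored only once.
seen_urls: set[str] = set()

def is_new(url: str) -> bool:
    """True the first time a URL is seen this run, False afterwards."""
    if url in seen_urls:
        return False
    seen_urls.add(url)
    return True
```

The database's UNIQUE index on `url` is the backstop across runs; the set only avoids redundant work within one run.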


Skill Extraction System

3-Layer Architecture

The system uses 977 skills from skills_reference_2025.json with regex pattern matching:

Layer 1: Multi-Word Phrases (Priority)

  • Purpose: Catch complex technical terms first
  • Examples: "Machine Learning", "Deep Learning", "Linear Programming", "OR-Tools"
  • Method: Exact phrase matching with word boundaries
  • Priority: Highest (processed first to prevent partial matches)

Layer 2: Context-Aware Extraction

  • Purpose: Validate single-word skills with surrounding context
  • Examples: "Python" near "programming", "SQL" near "database"
  • Method: Keyword + context validation
  • Benefit: Reduces false positives (e.g., "python" the snake vs Python the language)

Layer 3: Direct Pattern Matching

  • Purpose: Comprehensive coverage from 977-skill reference database
  • Method: Apply regex patterns from skills_reference_2025.json
  • Examples:
    • "\\bgurobi\\b" matches "Gurobi"
    • "linear\\s+programming|\\bLP\\b" matches "Linear Programming" or "LP"
  • Coverage: 977 skills across 15+ categories
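The Layer 2 idea, accepting an ambiguous single-word skill only when a supporting context word appears nearby, can be sketched like this. The window size and the per-skill context vocabularies are illustrative assumptions:

```python
import re

# Illustrative context vocabularies; the real system's lists may differ.
CONTEXT = {
    "Python": {"programming", "code", "script"},
    "SQL": {"database", "query", "queries"},
}

def context_validated(skill: str, text: str, window: int = 6) -> bool:
    """Accept `skill` only if a context word occurs within `window` words."""
    words = re.findall(r"[a-z]+", text.lower())
    for i, w in enumerate(words):
        if w == skill.lower():
            nearby = set(words[max(0, i - window): i + window + 1])
            if CONTEXT.get(skill, set()) & nearby:
                return True
    return False
```

This is what keeps "python" the snake from being tagged as Python the language.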

Skill Categories (977 Total)

Programming Languages:   Python, Java, C++, R, SQL, Julia, etc. (50+ skills)
Optimization Tools:      Gurobi, CPLEX, OR-Tools, SCIP, FICO Xpress, etc. (25+ skills)
OR Techniques:           Linear Programming, MILP, Constraint Programming, etc. (40+ skills)
Data Science:            Machine Learning, Deep Learning, NLP, etc. (100+ skills)
Cloud Platforms:         AWS, Azure, GCP, etc. (30+ skills)
Databases:               PostgreSQL, MongoDB, Redis, Cassandra, etc. (40+ skills)
Frameworks:              TensorFlow, PyTorch, Spark, Pandas, etc. (80+ skills)
DevOps:                  Docker, Kubernetes, CI/CD, Jenkins, etc. (50+ skills)
Analytics:               Tableau, Power BI, Excel, Matplotlib, etc. (35+ skills)
Supply Chain:            SAP, Warehouse Management, Inventory Optimization, etc. (30+ skills)
Math/Stats:              Statistics, Probability, Calculus, Heuristics, etc. (25+ skills)
Web Technologies:        REST API, GraphQL, React, Node.js, etc. (60+ skills)
Big Data:                Hadoop, Spark, Kafka, Airflow, etc. (35+ skills)
Version Control:         Git, GitHub, GitLab, Bitbucket, etc. (15+ skills)
Project Management:      Agile, Scrum, Jira, Lean, etc. (20+ skills)
... and more categories covering all technical domains

Processing Steps

  1. Layer 1: Extract multi-word phrases → ["Linear Programming", "Machine Learning"]
  2. Layer 2: Extract context-validated terms → ["Python", "SQL", "Gurobi"]
  3. Layer 3: Apply 977 regex patterns → ["optimization", "CPLEX", "AWS"]
  4. Filter Common Words: Remove grammatical words → Remove "and", "the", "with"
  5. Split Conjunctions: "Python and SQL" → ["Python", "SQL"]
  6. Deduplicate: Merge similar → "ML" + "Machine Learning" → "Machine Learning"
  7. Validate: Cross-reference with skills_reference → Remove false positives
  8. Output: Comma-separated string → "Python, Gurobi, Linear Programming, AWS"
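Steps 4-8 above can be sketched in one pass. The stopword and alias tables are abbreviated examples, not the system's actual lists:

```python
# Abbreviated stand-ins for the real stopword and alias tables.
STOPWORDS = {"and", "the", "with", "using"}
ALIASES = {"ML": "Machine Learning"}

def postprocess(raw_skills: list[str]) -> str:
    """Filter stopwords, split conjunctions, normalize aliases, dedupe."""
    out: list[str] = []
    for skill in raw_skills:
        for part in skill.split(" and "):       # split conjunctions
            part = part.strip()
            if not part or part.lower() in STOPWORDS:
                continue                        # drop grammatical words
            part = ALIASES.get(part, part)      # merge known aliases
            if part not in out:                 # dedupe, preserve order
                out.append(part)
    return ", ".join(out)                       # comma-separated output
```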

Example Extraction

Input Job Description:

We're seeking an Operations Research Engineer with expertise in 
linear programming using Gurobi and CPLEX. You'll optimize 
supply chain routes using Python and implement ML models.

Extracted Skills:

Linear Programming, Gurobi, CPLEX, Python, Machine Learning, 
Operations Research, Supply Chain Optimization

Performance: 0.3 seconds per job, 80-85% accuracy


Configuration

1. Built-in Configuration (in run_netherlands_indeed_linkedin.py)

Country Job Targets (Market-Based)

COUNTRY_JOB_TARGETS = {
    "USA": 200,        # Large market
    "India": 200,      # Large market
    "UK": 150,         # Medium-large
    "Germany": 100,    # Medium
    "Netherlands": 50, # Small
    # ... 10 total countries
}

# Multiplier to scale all targets
COUNTRY_JOB_MULTIPLIER = 2.5  # Default: 2.5x
# Examples: USA gets 200 × 2.5 = 500 jobs

To Modify: Edit COUNTRY_JOB_MULTIPLIER in the script to scale all targets at once.

Search Terms (Optimization-Focused)

SEARCH_TERMS = [
  "operation research",  # lowercased for stricter matching
  "Mathematical Optimization",
  "MILP",
  "Integer Programming",
  "Gurobi",
  "Routing Optimization",
  "Supply Chain Optimization",
  "Simulation Optimization",
]

To Modify: Edit SEARCH_TERMS list in the script to add/remove queries.

Keyword Filters (3-Tier System)

# Tier 1: Broad title keywords (13 terms)
FILTER_KEYWORDS_TITLE = [
    "optim", "operations research", "supply chain", 
    "logistics", "routing", "scheduling", ...
]

# Tier 2: Technical keywords (24 terms)
FILTER_KEYWORDS_STRONG = [
    "linear programming", "integer programming", 
    "gurobi", "cplex", "or-tools", ...
]

# Tier 3: Negative keywords (9 terms)
FILTER_KEYWORDS_NEGATIVE = [
    "seo", "search engine", "sales optimization", ...
]

To Modify: Edit these lists in the script to adjust filtering criteria.

LinkedIn Rate Limiting

LINKEDIN_SLEEP_SEC = 10.0       # Delay between queries
LINKEDIN_MAX_ERRORS = 3         # Stop after N errors
LINKEDIN_FETCH_DESCRIPTION = True  # Get full descriptions
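The consecutive-error circuit breaker works roughly as follows: reset the counter on every success, and abandon LinkedIn once LINKEDIN_MAX_ERRORS failures in a row suggest rate limiting. `run_query` is a stand-in for the real scrape call, and the injectable `sleep_sec` is added here for testability:

```python
import time

LINKEDIN_SLEEP_SEC = 10.0
LINKEDIN_MAX_ERRORS = 3

def run_all_queries(terms, run_query, sleep_sec=LINKEDIN_SLEEP_SEC):
    """Run each query with a pause; stop after N consecutive errors."""
    errors, results = 0, []
    for term in terms:
        try:
            results.append(run_query(term))
            errors = 0                      # success resets the streak
        except Exception:
            errors += 1
            if errors >= LINKEDIN_MAX_ERRORS:
                break                       # likely rate limited: give up
        time.sleep(sleep_sec)               # delay between queries
    return results
```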

2. Quick Batch Configuration Files

Option A: quick_batch.py (Easiest)

# Edit these settings:
SCRAPE_ALL_COUNTRIES = True      # True = all 10, False = specific
# SPECIFIC_COUNTRIES = ["Netherlands", "Germany"]  # If False above
DELAY_BETWEEN_COUNTRIES = 30     # Minutes between countries

Option B: run_batch_scraper.py (Advanced)

COUNTRIES = None  # None = all 10, or ["USA", "UK", "Germany"]
DELAY_MINUTES = 60  # Delay between countries
JOBS = 50  # Fallback (overridden by COUNTRY_JOB_TARGETS)

3. Command Line Arguments

# All available options
python code/run_netherlands_indeed_linkedin.py \
  --jobs 50 \
  --batch \
  --delay 45 \
  --countries "USA,UK,Germany"

# --jobs:      fallback job count (usually ignored)
# --batch:     enable batch mode (RECOMMENDED)
# --delay:     minutes between countries
# --countries: comma-separated country list

Rate Limiting Best Practices

| Scenario | Delay Setting | Expected Results |
| --- | --- | --- |
| Single Country Test | No delay needed | 30-50 jobs, ~5-10 minutes |
| 2-3 Countries | 30 minutes | 100-150 jobs, ~1.5-2 hours |
| 5-10 Countries | 45-60 minutes | 300-500 jobs, 6-10 hours |
| Production (All 10) | 60 minutes | 400-600 jobs, overnight |

Why Delays Are Critical:

  • LinkedIn rate limiting: ~90-120 queries before blocking
  • Without delays: Blocked after ~10 jobs total (unusable)
  • With 30-60 min delays: 30-50 jobs per country (success)
  • Delays are between countries, not queries
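Since delays sit between countries rather than between queries, batch mode reduces to a sequential loop with a long pause. A sketch with an injectable `sleep` for testability:

```python
import time

def run_batch(countries, scrape_country, delay_minutes=60, sleep=time.sleep):
    """Scrape countries sequentially with a pause between each pair."""
    for i, country in enumerate(countries):
        scrape_country(country)
        if i < len(countries) - 1:          # no pause after the last one
            sleep(delay_minutes * 60)       # minutes -> seconds
```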

Performance Metrics

Speed Benchmarks

| Operation | Time | Details |
| --- | --- | --- |
| URL Collection | 2-5 min | 100-200 URLs per country |
| Detail Extraction | 3-8 min | 30-50 jobs per country |
| Skill Extraction | 0.3s/job | 977-skill reference matching |
| CSV Export | <1 sec | All jobs to CSV |
| Single Country | ~5-10 min | Complete pipeline |
| All 10 Countries (batch) | 6-10 hours | With safe delays |

Database Statistics (Example Run)

Total Jobs: 450
Countries: 10
Platforms: LinkedIn (70%), Indeed (30%)
Skills Extracted: ~15-25 per job
Companies: 287 unique
Remote Jobs: 35%

Troubleshooting

Common Issues

1. "LinkedIn Blocked - Only Got 10 Jobs"

Solution: Use batch mode with 30-60 minute delays

python code/run_netherlands_indeed_linkedin.py --batch --delay 45

2. "Database Locked Error"

Solution: Close other scripts accessing jobs.db

# Stop all running scrapers
# Then restart
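Two standard SQLite mitigations also help here: open the connection with a busy timeout so a writer waits instead of failing immediately, and enable WAL mode so a reader and a writer can coexist. A sketch (whether the project's connection.py already does this is an assumption worth checking):

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "jobs.db")  # stand-in for data/jobs.db

# timeout=30: wait up to 30 seconds on a locked database instead of
# raising "database is locked" immediately
conn = sqlite3.connect(db_path, timeout=30)

# WAL journal mode allows one writer alongside concurrent readers
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
print(mode)  # wal
```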

3. "No Skills Extracted"

Solution: Check skills_reference_2025.json exists

ls code/src/config/skills_reference_2025.json

4. "URLs Not Scraping Details"

Solution: Run detail extraction phase

# Check pending URLs
python code/check_current_urls.py

# They will be scraped in next run automatically

Best Practices

1. For Production Scraping

  • ✅ Use batch mode with 45-60 min delays
  • ✅ Run overnight (6-10 hours total)
  • ✅ Start with 2-3 countries to test
  • ❌ Don't scrape all countries without delays

2. For Testing

  • ✅ Use single country mode
  • ✅ Set --jobs 20 for quick tests
  • ✅ Verify results with export_to_csv.py
  • ❌ Don't use --batch for testing

3. For Data Quality

  • ✅ Review keyword filters regularly
  • ✅ Update skills_reference_2025.json
  • ✅ Run remove_duplicates.py periodically
  • ✅ Check job descriptions match expected roles

4. For Reliability

  • ✅ System auto-saves progress (checkpoint-based)
  • ✅ Can resume interrupted scraping
  • ✅ Database handles duplicates automatically
  • ✅ Export data regularly as backup
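Automatic duplicate handling of the kind described above is typically implemented with a UNIQUE constraint on the job URL plus INSERT OR IGNORE, so re-scraped postings are silently skipped. A sketch under that assumption (the project's actual schema.py may differ):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE jobs (
        url   TEXT PRIMARY KEY,  -- job URL as the natural dedup key
        title TEXT
    )
""")

postings = [
    ("https://example.com/job/1", "Operations Research Scientist"),
    ("https://example.com/job/2", "Optimization Engineer"),
    ("https://example.com/job/1", "Operations Research Scientist"),  # duplicate
]
# OR IGNORE skips rows whose url already exists instead of raising an error
conn.executemany("INSERT OR IGNORE INTO jobs VALUES (?, ?)", postings)

count = conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]
print(count)  # 2
```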

File Structure Reference

job-scrapper/
├── README.md                          # Project overview
├── requirements.txt                   # Python dependencies
├── JOB_SCRAPER_DOCUMENTATION.md      # This file (complete documentation)
│
├── code/                             # Main code directory
│   ├── quick_batch.py                # ⭐ EASIEST: Edit & run for batch mode
│   ├── run_batch_scraper.py          # Auto batch scraper (alternative)
│   ├── run_netherlands_indeed_linkedin.py  # 🎯 MAIN SCRAPER
│   ├── export_to_csv.py              # Export database to CSV
│   ├── remove_duplicates.py          # Clean duplicate entries
│   ├── check_current_urls.py         # View scraping progress
│   ├── check_url_in_db.py            # Verify if URL exists
│   ├── verify_csv_urls.py            # Verify CSV against database
│   ├── BATCH_MODE_GUIDE.py           # Usage guide for batch mode
│   │
│   ├── data/
│   │   ├── jobs.db                   # 📦 SQLite database (auto-created)
│   │   └── jobs_export_*.csv         # Exported data files
│   │
│   └── src/
│       ├── config/
│       │   └── skills_reference_2025.json   # ✅ ONLY CONFIG FILE (977 skills)
│       │
│       ├── db/
│       │   ├── connection.py         # Database connection manager
│       │   ├── operations.py         # CRUD operations
│       │   └── schema.py             # Table schemas
│       │
│       ├── models/
│       │   └── models.py             # Pydantic data models
│       │
│       ├── analysis/
│       │   └── skill_extraction/
│       │       ├── extractor.py              # Main 3-layer extractor
│       │       ├── layer3_direct.py          # Pattern matching
│       │       ├── advanced_regex_extractor.py  # Layer 1 & 2
│       │       ├── normalize.py              # Deduplication
│       │       ├── common_words_filter.py    # Filter common words
│       │       └── confidence_scorer.py      # Skill confidence
│       │
│       └── validation/
│           └── realtime_validator.py # Skill validation
│
└── venv/                             # Python virtual environment

Key Files Explained

| File | Purpose | When to Use |
| --- | --- | --- |
| quick_batch.py | Simplest batch scraper | Edit settings and run (recommended) |
| run_netherlands_indeed_linkedin.py | Main scraper script | Direct CLI usage with args |
| export_to_csv.py | Export jobs to CSV | After scraping completes |
| skills_reference_2025.json | 977 skill patterns | Auto-loaded (don't modify) |
| jobs.db | SQLite database | Auto-created, stores all data |
| remove_duplicates.py | Clean duplicates | If you see duplicate jobs |

Quick Reference Commands

# ============ SCRAPING ============

# Quick test (one country)
python code/run_netherlands_indeed_linkedin.py --jobs 20

# Batch all countries (recommended)
python code/quick_batch.py

# Batch specific countries
python code/run_netherlands_indeed_linkedin.py --batch --countries "USA,UK" --delay 30


# ============ EXPORTING ============

# Export to CSV
python code/export_to_csv.py


# ============ MAINTENANCE ============

# View progress
python code/check_current_urls.py

# Remove duplicates
python code/remove_duplicates.py

# Verify URLs
python code/check_url_in_db.py


# ============ TESTING ============

# Test skill extraction
python code/test_skills.py

# Test real extraction
python code/test_real_extraction.py

# Verify CSV URLs
python code/verify_csv_urls.py

Summary

This job scraper is a production-ready system for collecting Operations Research and Optimization jobs with:

  • JobSpy-powered scraping for LinkedIn + Indeed
  • 3-tier keyword filtering ensuring only OR/Optimization jobs
  • 3-layer skill extraction with a 977-skill reference database
  • Multi-country support across 10 countries (USA, India, UK, Germany, etc.)
  • Market-based targeting with a 2.5x multiplier (500 jobs for USA, 125 for Netherlands)
  • Batch mode with delays to avoid LinkedIn rate limiting
  • Comprehensive metadata with 17 fields per job
  • CSV export for easy analysis in Excel/Google Sheets
  • In-memory deduplication preventing duplicate jobs

Quick Start

Simplest Method (Batch Mode):

# 1. Edit settings in quick_batch.py
#    - SCRAPE_ALL_COUNTRIES = True
#    - DELAY_BETWEEN_COUNTRIES = 30
#
# 2. Run overnight
python code/quick_batch.py
#
# 3. Export to CSV
python code/export_to_csv.py

Expected Results:

  • Time: 6-10 hours (overnight run)
  • Jobs: 400-600 total across 10 countries
  • Distribution: USA/India get ~500 jobs, smaller countries ~125 jobs
  • Platforms: ~70% LinkedIn, ~30% Indeed
  • Skills: 15-25 skills extracted per job

Why This Works:

  • ✅ Market-based targets ensure quality over quantity
  • ✅ 3-tier filtering removes SEO, marketing, sales jobs
  • ✅ Delays prevent LinkedIn blocking
  • ✅ Skill extraction provides actionable insights
  • ✅ CSV export enables custom analysis

Last Updated: January 2, 2026
Focus: Operations Research & Mathematical Optimization Jobs
