Mbehbahani/Oploy-Job-Scrapper

Job Scraper System - Complete Documentation

Overview

This is an automated job scraping system designed to collect Operations Research and Optimization job postings from LinkedIn and Indeed using intelligent keyword filtering. The system extracts technical skills using advanced 3-layer pattern matching and stores comprehensive job data for analysis.

Primary Focus: Operations Research, Mathematical Optimization, Supply Chain, Routing/Scheduling, Simulation, and related optimization roles.


Quick Start & Why It Matters

  • Procedure (run → store → export): Run run_netherlands_indeed_linkedin.py to scrape, persist to data/jobs.db, then export with export_to_csv.py for downstream analytics.
  • Precision/Recall via Tiered Filters: Tier 1 (titles) maximizes recall for optimization-themed roles; Tier 2 (descriptions) sharpens precision with solver/OR terms; negatives remove obvious non-OR noise.
  • Data Completeness: Indeed yields richer company metadata (including employee counts); LinkedIn exposes lighter company fields via JobSpy (company name/URL mostly).
  • Keyword Tuning Loop: Keep only search terms with healthy acceptance (Final/Found) and high Tier2 hits; prune noisy terms after each run based on the filter stats table.
  • SEO-friendly insights: Focuses on “operations research jobs”, “optimization jobs”, “supply chain optimization”, “MILP / integer programming”, “Gurobi / OR-Tools”, and “routing/scheduling optimization”.
  • Live Dashboard: Processed/NLP-enriched results are visualized at https://joblab.oploy.eu/.


Table of Contents

  1. System Architecture
  2. Key Features
  3. Input Requirements
  4. Processing Pipeline
  5. Output & Results
  6. Usage Procedures
  7. Database Schema
  8. Skill Extraction System
  9. Configuration

System Architecture

Single-Phase Direct Scraping

The system uses JobSpy library for direct scraping with intelligent filtering:

How It Works:

  • Uses python-jobspy library to scrape LinkedIn and Indeed
  • Searches with 8 optimization-related keywords per country
  • Applies 3-tier keyword filtering to ensure job relevance
  • Extracts skills using 3-layer pattern matching (977 skills)
  • Stores complete job data in SQLite database
  • Deduplication: Tracks URLs to prevent duplicates

Technology Stack

├── Web Scraping:    python-jobspy (LinkedIn & Indeed API wrapper)
├── Data Models:     Pydantic v2 (validation)
├── Database:        SQLite (jobs.db)
├── Skill Extraction: 3-layer regex pattern matching (977 skills)
├── Filtering:       3-tier keyword matching (broad → technical → negative)
├── Export:          CSV format
└── Multi-Country:   10 countries with batch mode support

Key Features

1. Multi-Country Support

  • 10 Countries: USA, Canada, UK, Netherlands, Germany, Denmark, France, Austria, Australia, India
  • Market-Based Targeting: Larger markets get more jobs (USA/India: 500, UK: 375, etc.)
  • Batch Mode: Sequential scraping with configurable delays (30-60 minutes)
  • Multiplier Setting: Scale all country targets with single parameter (default: 2.5x)

2. 3-Tier Intelligent Keyword Filtering

The system ensures only Operations Research / Optimization jobs are collected:

Tier 1: Broad Title Keywords (Cast Wide Net)

"optim", "operations research", "supply chain", "logistics", 
"routing", "scheduling", "decision science", "algorithm",
"data scientist", "machine learning", "analytics", "solver", "mathematical"

Tier 2: Technical Keywords (High Precision)

"operations research", "linear programming", "integer programming",
"mixed integer", "milp", "mip", "gurobi", "cplex", "or-tools", "ortools",
"constraint programming", "combinatorial optimization",
"mathematical optimization", "network optimization",
"vehicle routing", "routing optimization", "scheduling optimization",
"supply chain optimization", "supply chain", "inventory optimization",
"inventory management", "demand planning", "forecasting",
"heuristic", "metaheuristic", "convex optimization",
"stochastic optimization", "discrete optimization",
"simulation", "prescriptive analytics",
"pulp", "pyomo",
"Industrial Engineering", "Fulfillment Optimization"

Tier 3: Negative Keywords (Reject Non-OR Jobs)

"seo", "search engine", "sales optimization", "marketing optimization",
"conversion optimization", "website optimization", "social media"

Filtering Logic: Accept job if (Title matches Tier 1 OR Description matches Tier 2) AND NOT (Title contains Tier 3)
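The acceptance rule can be sketched as a small predicate. The keyword lists below are abbreviated stand-ins for illustration; the full lists are hardcoded in run_netherlands_indeed_linkedin.py.

```python
# Sketch of the 3-tier acceptance rule. The keyword lists here are
# abbreviated examples, not the full lists from the main script.
TIER1_TITLE = ["optim", "operations research", "routing", "supply chain"]
TIER2_STRONG = ["linear programming", "gurobi", "milp", "or-tools"]
TIER3_NEGATIVE = ["seo", "sales optimization", "marketing optimization"]

def accept_job(title: str, description: str) -> bool:
    """Accept if (Tier 1 title OR Tier 2 description) AND NOT Tier 3 title."""
    title_lc, desc_lc = title.lower(), description.lower()
    tier1 = any(k in title_lc for k in TIER1_TITLE)
    tier2 = any(k in desc_lc for k in TIER2_STRONG)
    negative = any(k in title_lc for k in TIER3_NEGATIVE)
    return (tier1 or tier2) and not negative
```

Note the asymmetry: Tier 1 runs against the title only, Tier 2 against the description only, and the negative filter vetoes regardless of the other two.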

3. 8 Optimization-Specific Search Terms (current)

"operation research",  # lowercased for stricter matching
"Mathematical Optimization",
"MILP",
"Integer Programming",
"Gurobi",
"Routing Optimization",
"Supply Chain Optimization",
"Simulation Optimization",

4. Advanced Skill Extraction

3-Layer Extraction System:

  • Layer 1: Multi-word phrases (e.g., "Machine Learning", "Linear Programming")
  • Layer 2: Context-aware extraction with validation
  • Layer 3: Direct pattern matching from 977-skill reference database
  • Performance: 80-85% accuracy at 0.3s/job (10x faster than spaCy)
  • Output: Comma-separated skills stored with each job

5. Adaptive Rate Limiting

  • LinkedIn Protection: 10-second delay between queries
  • Error Handling: Stops after 3 consecutive errors (rate limit detection)
  • Batch Mode: 30-second delay between countries
  • Recommended: Use batch mode with 30-60 minute delays for multi-country scraping

6. Comprehensive Metadata

Captures 17 fields per job:

  • Basic: Title, description, company, location, URL
  • Enhanced: Remote status, job level, job function, industry
  • Temporal: Posted date (last 4 weeks), scraped timestamp
  • Analytics: Skills (extracted), search term used, country
  • URLs: Platform URL (LinkedIn/Indeed) + Direct company URL
  • Company size: company_num_employees (robust on Indeed; rarely provided by LinkedIn via JobSpy)

Precision vs Recall (Tiered Filtering)

  • Goal: Maximize recall of relevant optimization jobs while keeping precision high enough to avoid “SEO/marketing optimization” noise.
  • Tier 1 (Recall engine): Title keywords (e.g., "optim", "supply chain", "routing", "algorithm") pull in borderline-but-possibly-relevant roles. Expect higher false positives, but few misses.
  • Tier 2 (Precision engine): Description keywords (e.g., "MILP", "integer programming", "Gurobi", "OR-Tools", "simulation", "supply chain optimization") confirm true OR/optimization relevance.
  • Negative filter: Quickly discards obvious non-OR roles (SEO/marketing/sales optimization), boosting precision without hurting recall.
  • Acceptance rule: Accept if (Tier1 OR Tier2) AND NOT negative → balanced recall/precision without heavy ML.
  • Stats table: Each run prints per-search-term stats (Found, Neg, T1, T2, Both, NoMatch, Final, Rate). Use this to:
    • Drop low-performing search terms (low Final/Found or many NoMatch).
    • Strengthen Tier2 by adding solver/tech terms seen in good results.
    • Adjust Tier1 when T1-only is high (potential false positives).

How to Improve Keyword Selection

  • Iterate from data: After each run, remove search terms with <50% acceptance; add new terms from accepted descriptions (solvers, methods, domains).
  • Balance tiers: If too many T1-only hits, add more Tier2 technical terms; if recall is low, broaden Tier1 slightly.
  • Domain variants: Add industry-specific phrases (e.g., "timetabling", "portfolio optimization", "workforce scheduling") when targeting new niches.
  • Language/locale: When expanding countries, include local-language equivalents in Tier1 and Tier2.

Adapting to Other Fields

  • New domains: Replace Tier1/Tier2 keyword lists with your domain’s title cues and technical signals; keep negative list to protect precision.
  • Skills extraction: Update skills_reference_2025.json with domain skills and regex patterns.
  • Markets: Edit COUNTRIES, COUNTRY_JOB_TARGETS, and SEARCH_TERMS in run_netherlands_indeed_linkedin.py.
  • Outputs: Add new DB columns via src/db/schema.py and Pydantic models in src/models/models.py if you need extra metadata.

Input Requirements

1. Configuration File (Skills Reference Only)

skills_reference_2025.json

Location: code/src/config/skills_reference_2025.json

Purpose: Used by 3-layer skill extraction system to identify technical skills in job descriptions.

{
  "skills": {
    "Python": {
      "regex_pattern": "\\bpython\\b",
      "category": "Programming Language",
      "priority": 1
    },
    "Linear Programming": {
      "regex_pattern": "linear\\s+programming|\\bLP\\b",
      "category": "Optimization",
      "priority": 2
    },
    "Gurobi": {
      "regex_pattern": "\\bgurobi\\b",
      "category": "Optimization Solver",
      "priority": 1
    }
    // ... 977 total skills
  }
}

This is the ONLY configuration file - all other settings are hardcoded in the main script.
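Loading the reference and pre-compiling each pattern once is the natural way to consume this file. The inline `sample` dict below mimics the documented JSON layout so the sketch is self-contained; in practice you would `json.load` the real 977-skill file.

```python
import json
import re

# Stand-in for skills_reference_2025.json, following the documented layout.
sample = {
    "skills": {
        "Python": {"regex_pattern": r"\bpython\b",
                   "category": "Programming Language", "priority": 1},
        "Gurobi": {"regex_pattern": r"\bgurobi\b",
                   "category": "Optimization Solver", "priority": 1},
    }
}

def compile_skill_patterns(reference: dict) -> dict:
    """Map skill name -> compiled, case-insensitive regex (compiled once)."""
    return {
        name: re.compile(entry["regex_pattern"], re.IGNORECASE)
        for name, entry in reference["skills"].items()
    }

patterns = compile_skill_patterns(sample)
found = [s for s, rx in patterns.items()
         if rx.search("Experience with Python and Gurobi")]
```

Compiling up front matters at the documented scale: 977 patterns applied per job description.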

2. Built-in Search Configuration

These are hardcoded in run_netherlands_indeed_linkedin.py:

Countries Dictionary (10 countries)

COUNTRIES = {
    "USA": {"name": "USA", "indeed_country": "USA", "flag": "🇺🇸"},
    "Netherlands": {"name": "Netherlands", "indeed_country": "Netherlands", "flag": "🇳🇱"},
    # ... 10 total countries
}

Search Terms (8 OR-specific queries)

SEARCH_TERMS = [
  "operation research",  # lowercased for stricter matching
  "Mathematical Optimization",
  "MILP",
  "Integer Programming",
  "Gurobi",
  "Routing Optimization",
  "Supply Chain Optimization",
  "Simulation Optimization",
]

Country Job Targets (Market-based)

COUNTRY_JOB_TARGETS = {
    "USA": 200,        # × 2.5 = 500 jobs
    "India": 200,      # × 2.5 = 500 jobs
    "UK": 150,         # × 2.5 = 375 jobs
    "Germany": 100,    # × 2.5 = 250 jobs
    "Netherlands": 50, # × 2.5 = 125 jobs
    # ... scaled by COUNTRY_JOB_MULTIPLIER = 2.5
}
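The effective per-country quota is simply the base target times the multiplier, matching the values in the comments above (e.g. USA: 200 × 2.5 = 500):

```python
# Base targets and multiplier as documented; dict abbreviated to 5 countries.
COUNTRY_JOB_TARGETS = {"USA": 200, "India": 200, "UK": 150,
                       "Germany": 100, "Netherlands": 50}
COUNTRY_JOB_MULTIPLIER = 2.5

def effective_target(country: str) -> int:
    """Jobs requested for a country after scaling by the global multiplier."""
    return int(COUNTRY_JOB_TARGETS[country] * COUNTRY_JOB_MULTIPLIER)
```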

Keyword Filters (3-tier filtering)

  • FILTER_KEYWORDS_TITLE: 13 broad keywords for title matching
  • FILTER_KEYWORDS_STRONG: 24 technical keywords for description matching
  • FILTER_KEYWORDS_NEGATIVE: 9 keywords to reject non-OR jobs

3. Runtime Parameters

Users configure via command-line arguments:

  • --jobs: Fallback job count (default: 50, overridden by country targets)
  • --countries: Comma-separated country list (default: all 10)
  • --batch: Enable batch mode with delays
  • --delay: Minutes between countries in batch mode (default: 60)

Processing Pipeline

Complete Workflow

┌─────────────────────────────────────────────────────────────────┐
│ 1. INITIALIZATION                                               │
│    - Load skills reference (977 skills)                         │
│    - Initialize database (jobs.db)                              │
│    - Setup JobSpy scraper                                       │
│    - Load country targets and search terms                      │
└─────────────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────────────┐
│ 2. MULTI-PLATFORM SCRAPING (Indeed + LinkedIn)                  │
│    For each country:                                            │
│                                                                 │
│    ┌─ INDEED PLATFORM ─────────────────────────────────────┐    │
│    │ For each of 8 search terms:                           │    │
│    │   • Search Indeed with JobSpy API                     │    │
│    │   • Get results (based on country target)             │    │
│    │   • Apply 3-tier keyword filter                       │    │
│    │   • Track URLs (prevent duplicates)                   │    │
│    └───────────────────────────────────────────────────────┘    │
│                                                                 │
│    ┌─ LINKEDIN PLATFORM ──────────────────────────────────┐    │
│    │ Sequential query approach:                            │    │
│    │   • Run all 8 search terms with 10s delays           │    │
│    │   • Fetch full descriptions (slower but complete)     │    │
│    │   • Apply 3-tier keyword filter                       │    │
│    │   • Detect & stop on rate limiting                    │    │
│    │   • Stop after 3 consecutive errors                   │    │
│    └───────────────────────────────────────────────────────┘    │
│                                                                 │
│    - Deduplication: URL tracking prevents duplicate jobs       │
│    - Rate limiting: 10s between LinkedIn queries               │
│    - Delay: 30s between countries (prevent overwhelming)       │
└─────────────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────────────┐
│ 3. KEYWORD FILTERING (3-Tier Quality Control)                   │
│    For each scraped job:                                        │
│                                                                 │
│    ✅ TIER 1: Title Check                                       │
│       - Match any of 13 broad keywords                          │
│       - Examples: "optim", "operations research", "routing"     │
│                                                                 │
│    ✅ TIER 2: Description Check                                 │
│       - Match any of 24 technical keywords                      │
│       - Examples: "linear programming", "gurobi", "milp"        │
│                                                                 │
│    ❌ TIER 3: Negative Filter                                   │
│       - Reject if title contains 9 negative keywords            │
│       - Examples: "seo", "sales optimization", "marketing"      │
│                                                                 │
│    Decision: Accept if (Tier 1 OR Tier 2) AND NOT Tier 3       │
│    Result: Only OR/Optimization-relevant jobs                   │
└─────────────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────────────┐
│ 4. DATA PROCESSING & ENRICHMENT                                 │
│    For each filtered job:                                       │
│      • Extract metadata from JobSpy results                     │
│      • Parse location data (city, state, country)               │
│      • Parse posted date (last 4 weeks filter)                  │
│      • Extract remote status, job level, function, industry     │
│      • Get both platform URL and company career URL             │
└─────────────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────────────┐
│ 5. SKILL EXTRACTION (3-Layer Pattern Matching)                  │
│    For each job description:                                    │
│                                                                 │
│    Layer 1: Multi-word phrases                                  │
│      • Match complex terms: "Machine Learning", "OR-Tools"      │
│      • Priority: Highest (processed first)                      │
│                                                                 │
│    Layer 2: Context-aware extraction                            │
│      • Validate with context: "Python" + "programming"          │
│      • Reduces false positives                                  │
│                                                                 │
│    Layer 3: Direct pattern matching                             │
│      • Apply 977 regex patterns from skills_reference.json      │
│      • Comprehensive coverage across all categories             │
│                                                                 │
│    Post-processing:                                             │
│      • Filter common words (and, the, with, using)              │
│      • Split conjunctions ("Python and SQL" → 2 skills)         │
│      • Deduplicate (normalize similar skills)                   │
│      • Validate against reference (remove false positives)      │
│      • Output: Comma-separated skill list                       │
└─────────────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────────────┐
│ 6. DATABASE STORAGE                                             │
│    - Store complete job details to jobs table                   │
│    - 17 fields: IDs, URLs, content, metadata, skills            │
│    - Atomic operations (thread-safe)                            │
│    - Auto-deduplication by URL                                  │
└─────────────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────────────┐
│ 7. BATCH MODE (Multi-Country Sequential)                        │
│    If batch mode enabled:                                       │
│      • Scrape first country completely                          │
│      • Wait configured delay (30-60 minutes)                    │
│      • Scrape next country                                      │
│      • Repeat for all countries                                 │
│      • Total time: 6-10 hours (run overnight)                   │
│                                                                 │
│    Benefits:                                                    │
│      • Avoids LinkedIn rate limiting                            │
│      • Gets 30-50 jobs per country (vs 10 without delays)       │
│      • Resumable if interrupted                                 │
└─────────────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────────────┐
│ 8. EXPORT & SUMMARY                                             │
│    - Generate console statistics:                               │
│      • Jobs by country and platform                             │
│      • Total jobs, unique URLs                                  │
│      • Filtering stats (original vs filtered)                   │
│                                                                 │
│    - CSV Export (manual):                                       │
│      • Run: python export_to_csv.py                             │
│      • Output: jobs_export_YYYYMMDD_HHMMSS.csv                 │
│      • Includes all 17 fields + summaries                       │
└─────────────────────────────────────────────────────────────────┘
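Step 2 of the pipeline boils down to one JobSpy call per search term with a fixed pause between queries. The parameter names below follow python-jobspy's `scrape_jobs` API; treat the whole function as an approximation of what the script does, not its exact code:

```python
import time

def scrape_country(country: str, search_terms: list[str], target: int):
    """Query Indeed + LinkedIn for each search term via JobSpy.

    Sketch only: the real script interleaves filtering and dedup here.
    """
    from jobspy import scrape_jobs  # third-party: pip install python-jobspy

    frames = []
    per_term = max(1, target // len(search_terms))
    for term in search_terms:
        df = scrape_jobs(
            site_name=["indeed", "linkedin"],
            search_term=term,
            country_indeed=country,
            results_wanted=per_term,
            linkedin_fetch_description=True,  # slower, but full descriptions
        )
        frames.append(df)
        time.sleep(10)  # LinkedIn protection: 10 s between queries
    return frames
```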

Output & Results

1. Database (jobs.db)

Two main tables:

Table: job_urls

| Field | Type | Description |
| --- | --- | --- |
| job_id | TEXT PRIMARY KEY | MD5 hash of platform + URL |
| platform | TEXT | "LinkedIn" or "Indeed" |
| input_role | TEXT | Normalized search term |
| actual_role | TEXT | Scraped job title |
| url | TEXT UNIQUE | Job posting URL |
| scraped | INTEGER | 0 = pending, 1 = completed |
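Per the schema, job_id is an MD5 hash of platform + URL. A minimal sketch, assuming plain concatenation of the two parts (the real script may join them differently):

```python
import hashlib

def make_job_id(platform: str, url: str) -> str:
    """MD5 of platform + URL; concatenation order is an assumption."""
    return hashlib.md5(f"{platform}{url}".encode("utf-8")).hexdigest()

jid = make_job_id("LinkedIn", "https://linkedin.com/jobs/view/123")
```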

Table: jobs

| Field | Type | Description |
| --- | --- | --- |
| job_id | TEXT PRIMARY KEY | Links to job_urls |
| platform | TEXT | Source platform |
| input_role | TEXT | Normalized search term |
| actual_role | TEXT | Job title |
| url | TEXT UNIQUE | Job URL |
| job_description | TEXT | Full description |
| skills | TEXT | Comma-separated skills |
| company_name | TEXT | Company name |
| country | TEXT | Job country |
| location | TEXT | City/region |
| search_term | TEXT | Original search query |
| posted_date | TEXT | ISO format date |
| scraped_at | TEXT | Timestamp |
| is_remote | INTEGER | 1=remote, 0=onsite, NULL=unknown |
| job_level | TEXT | Seniority level |
| job_function | TEXT | Job category |
| company_industry | TEXT | Industry sector |
| company_url | TEXT | Direct application URL |

2. CSV Export

File: data/jobs_export_YYYYMMDD_HHMMSS.csv

Contains all job data with comprehensive summaries:

  • Jobs by country
  • Jobs by platform (LinkedIn/Indeed)
  • Jobs by search term
  • Unique companies count
  • Remote vs on-site breakdown
  • Job level distribution

3. Console Output

Real-time progress tracking:

📊 Exporting jobs table...
✅ Exported 450 jobs to: data/jobs_export_20260102_013358.csv

📈 Summary by Country:
  - USA: 150 jobs
  - UK: 85 jobs
  - Germany: 75 jobs
  ...

📈 Summary by Platform:
  - LinkedIn: 320 jobs
  - Indeed: 130 jobs

🏢 Total unique companies: 287

Usage Procedures

Procedure 1: Quick Start (Single Country)

# Scrape one country for testing
python code/run_netherlands_indeed_linkedin.py --jobs 50

What happens:

  1. Scrapes 50 jobs from Netherlands (default)
  2. Stores URLs then details
  3. Extracts skills automatically
  4. Saves to jobs.db

Time: ~5 minutes


Procedure 2: Batch Mode (All Countries)

Option A: Using quick_batch.py (Recommended for Beginners)

# 1. Edit configuration in quick_batch.py
# Set: SCRAPE_ALL_COUNTRIES = True
# Set: DELAY_BETWEEN_COUNTRIES = 30 (minutes)

# 2. Run batch scraper
python code/quick_batch.py

Option B: Using run_batch_scraper.py

# Automatically scrapes all 10 countries with 60-min delays
python code/run_batch_scraper.py

Option C: Direct Command Line

# Batch mode with all countries, 45-min delays
python code/run_netherlands_indeed_linkedin.py --batch --delay 45

# Batch mode with specific countries
python code/run_netherlands_indeed_linkedin.py --batch --countries "USA,UK,Germany" --delay 30

What happens:

  1. Scrapes each country sequentially
  2. Waits configured delay between countries
  3. Safe from LinkedIn rate limiting
  4. Resumable if interrupted

Time: 6-10 hours (run overnight)


Procedure 3: Export Results to CSV

# Export all data to CSV
python code/export_to_csv.py

Output: data/jobs_export_YYYYMMDD_HHMMSS.csv

Includes:

  • All 17 job fields
  • Summary statistics
  • Breakdowns by country, platform, role
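A sketch of what export_to_csv.py does (the real script also prints the summary statistics; `db_path` and the abbreviated column handling are assumptions):

```python
import csv
import sqlite3
from datetime import datetime

def export_jobs(db_path: str = "data/jobs.db") -> str:
    """Dump the jobs table to a timestamped CSV and return the filename."""
    out = f"jobs_export_{datetime.now():%Y%m%d_%H%M%S}.csv"
    conn = sqlite3.connect(db_path)
    cur = conn.execute("SELECT * FROM jobs")
    header = [col[0] for col in cur.description]  # column names from cursor
    rows = cur.fetchall()
    conn.close()
    with open(out, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(header)
        writer.writerows(rows)
    return out
```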

Procedure 4: Check Database Status

# View scraping progress
python code/check_current_urls.py

Shows:

  • Total URLs collected
  • URLs scraped (have full details)
  • URLs pending (need detail extraction)
  • Progress by platform and role

Procedure 5: Remove Duplicates

# Clean duplicate entries
python code/remove_duplicates.py

Removes:

  • Duplicate job URLs
  • Duplicate job descriptions (same job, different URLs)

Procedure 6: Verify URLs

# Check if URLs exist in database
python code/check_url_in_db.py

# Verify CSV URLs against database
python code/verify_csv_urls.py

# Check platform-specific URLs
python code/verify_platform_urls.py

Database Schema

Single Table Design

The system uses a single table for job storage (simplified from two-phase):

┌─────────────────────────────────────────────────┐
│                    jobs                         │
├─────────────────────────────────────────────────┤
│ job_id (PK)        TEXT     # Unique identifier │
│ platform           TEXT     # "linkedin"/"indeed"│
│ actual_role        TEXT     # Job title         │
│ url (UNIQUE)       TEXT     # Platform job URL  │
│ job_description    TEXT     # Full description  │
│ skills             TEXT     # Comma-separated   │
│ company_name       TEXT     # Company name      │
│ country            TEXT     # Country code      │
│ location           TEXT     # City, State       │
│ search_term        TEXT     # Search query used │
│ posted_date        TEXT     # ISO date          │
│ scraped_at         TEXT     # Scrape timestamp  │
│ is_remote          INTEGER  # 1=remote, 0=onsite│
│ job_level          TEXT     # Seniority level   │
│ job_function       TEXT     # Job category      │
│ company_industry   TEXT     # Industry sector   │
│ company_url        TEXT     # Direct career URL │
└─────────────────────────────────────────────────┘

Field Details

| Field | Type | Description | Example |
| --- | --- | --- | --- |
| job_id | TEXT | Generated: platform_country_id_hash | linkedin_USA_12345_67890 |
| platform | TEXT | Source platform | linkedin or indeed |
| actual_role | TEXT | Job title from posting | Operations Research Analyst |
| url | TEXT | Platform job page URL (unique) | https://linkedin.com/jobs/view/123... |
| job_description | TEXT | Full job description | We are seeking an OR specialist... |
| skills | TEXT | Comma-separated extracted skills | Python, Gurobi, Linear Programming |
| company_name | TEXT | Company name | Amazon |
| country | TEXT | Country code | USA |
| location | TEXT | City/State/Country | Seattle, WA, USA |
| search_term | TEXT | Search query that found this job | Operations Research |
| posted_date | TEXT | Job posting date (ISO format) | 2026-01-01 |
| scraped_at | TEXT | When we scraped it | 2026-01-02T14:35:00 |
| is_remote | INTEGER | Remote work option | 1 (yes), 0 (no), NULL (unknown) |
| job_level | TEXT | LinkedIn seniority | Mid-Senior level |
| job_function | TEXT | Job category | Engineering |
| company_industry | TEXT | Industry | IT Services |
| company_url | TEXT | Direct application URL | https://amazon.jobs/... |

Key Indexes

CREATE UNIQUE INDEX idx_jobs_url ON jobs(url);
CREATE INDEX idx_jobs_country ON jobs(country);
CREATE INDEX idx_jobs_platform ON jobs(platform);
CREATE INDEX idx_jobs_search_term ON jobs(search_term);

Note: URL deduplication happens in-memory during scraping using a seen_urls set.
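That in-memory dedup amounts to a membership check against a shared set. A minimal sketch:

```python
# seen_urls is shared across search terms and platforms for the whole run,
# so the same posting found by two queries is stored only once.
seen_urls: set[str] = set()

def is_new(url: str) -> bool:
    """True the first time a URL is seen this run, False afterwards."""
    if url in seen_urls:
        return False
    seen_urls.add(url)
    return True
```

The database's UNIQUE index on `url` is the backstop across runs; the set only avoids redundant work within one run.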


Skill Extraction System

3-Layer Architecture

The system uses 977 skills from skills_reference_2025.json with regex pattern matching:

Layer 1: Multi-Word Phrases (Priority)

  • Purpose: Catch complex technical terms first
  • Examples: "Machine Learning", "Deep Learning", "Linear Programming", "OR-Tools"
  • Method: Exact phrase matching with word boundaries
  • Priority: Highest (processed first to prevent partial matches)

Layer 2: Context-Aware Extraction

  • Purpose: Validate single-word skills with surrounding context
  • Examples: "Python" near "programming", "SQL" near "database"
  • Method: Keyword + context validation
  • Benefit: Reduces false positives (e.g., "python" the snake vs Python the language)

Layer 3: Direct Pattern Matching

  • Purpose: Comprehensive coverage from 977-skill reference database
  • Method: Apply regex patterns from skills_reference_2025.json
  • Examples:
    • "\\bgurobi\\b" matches "Gurobi"
    • "linear\\s+programming|\\bLP\\b" matches "Linear Programming" or "LP"
  • Coverage: 977 skills across 15+ categories
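The Layer 2 idea, accepting an ambiguous single-word skill only when a supporting context word appears nearby, can be sketched like this. The window size and the per-skill context vocabularies are illustrative assumptions:

```python
import re

# Illustrative context vocabularies; the real system's lists may differ.
CONTEXT = {
    "Python": {"programming", "code", "script"},
    "SQL": {"database", "query", "queries"},
}

def context_validated(skill: str, text: str, window: int = 6) -> bool:
    """Accept `skill` only if a context word occurs within `window` words."""
    words = re.findall(r"[a-z]+", text.lower())
    for i, w in enumerate(words):
        if w == skill.lower():
            nearby = set(words[max(0, i - window): i + window + 1])
            if CONTEXT.get(skill, set()) & nearby:
                return True
    return False
```

This is what keeps "python" the snake from being tagged as Python the language.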

Skill Categories (977 Total)

Programming Languages:   Python, Java, C++, R, SQL, Julia, etc. (50+ skills)
Optimization Tools:      Gurobi, CPLEX, OR-Tools, SCIP, FICO Xpress, etc. (25+ skills)
OR Techniques:           Linear Programming, MILP, Constraint Programming, etc. (40+ skills)
Data Science:            Machine Learning, Deep Learning, NLP, etc. (100+ skills)
Cloud Platforms:         AWS, Azure, GCP, etc. (30+ skills)
Databases:               PostgreSQL, MongoDB, Redis, Cassandra, etc. (40+ skills)
Frameworks:              TensorFlow, PyTorch, Spark, Pandas, etc. (80+ skills)
DevOps:                  Docker, Kubernetes, CI/CD, Jenkins, etc. (50+ skills)
Analytics:               Tableau, Power BI, Excel, Matplotlib, etc. (35+ skills)
Supply Chain:            SAP, Warehouse Management, Inventory Optimization, etc. (30+ skills)
Math/Stats:              Statistics, Probability, Calculus, Heuristics, etc. (25+ skills)
Web Technologies:        REST API, GraphQL, React, Node.js, etc. (60+ skills)
Big Data:                Hadoop, Spark, Kafka, Airflow, etc. (35+ skills)
Version Control:         Git, GitHub, GitLab, Bitbucket, etc. (15+ skills)
Project Management:      Agile, Scrum, Jira, Lean, etc. (20+ skills)
... and more categories covering all technical domains

Processing Steps

  1. Layer 1: Extract multi-word phrases → ["Linear Programming", "Machine Learning"]
  2. Layer 2: Extract context-validated terms → ["Python", "SQL", "Gurobi"]
  3. Layer 3: Apply 977 regex patterns → ["optimization", "CPLEX", "AWS"]
  4. Filter Common Words: Remove grammatical words → Remove "and", "the", "with"
  5. Split Conjunctions: "Python and SQL" → ["Python", "SQL"]
  6. Deduplicate: Merge similar → "ML" + "Machine Learning" → "Machine Learning"
  7. Validate: Cross-reference with skills_reference → Remove false positives
  8. Output: Comma-separated string → "Python, Gurobi, Linear Programming, AWS"
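Steps 4-8 above can be sketched in one pass. The stopword and alias tables are abbreviated examples, not the system's actual lists:

```python
# Abbreviated stand-ins for the real stopword and alias tables.
STOPWORDS = {"and", "the", "with", "using"}
ALIASES = {"ML": "Machine Learning"}

def postprocess(raw_skills: list[str]) -> str:
    """Filter stopwords, split conjunctions, normalize aliases, dedupe."""
    out: list[str] = []
    for skill in raw_skills:
        for part in skill.split(" and "):       # split conjunctions
            part = part.strip()
            if not part or part.lower() in STOPWORDS:
                continue                        # drop grammatical words
            part = ALIASES.get(part, part)      # merge known aliases
            if part not in out:                 # dedupe, preserve order
                out.append(part)
    return ", ".join(out)                       # comma-separated output
```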

Example Extraction

Input Job Description:

We're seeking an Operations Research Engineer with expertise in 
linear programming using Gurobi and CPLEX. You'll optimize 
supply chain routes using Python and implement ML models.

Extracted Skills:

Linear Programming, Gurobi, CPLEX, Python, Machine Learning, 
Operations Research, Supply Chain Optimization

Performance: 0.3 seconds per job, 80-85% accuracy


Configuration

1. Built-in Configuration (in run_netherlands_indeed_linkedin.py)

Country Job Targets (Market-Based)

COUNTRY_JOB_TARGETS = {
    "USA": 200,        # Large market
    "India": 200,      # Large market
    "UK": 150,         # Medium-large
    "Germany": 100,    # Medium
    "Netherlands": 50, # Small
    # ... 10 total countries
}

# Multiplier to scale all targets
COUNTRY_JOB_MULTIPLIER = 2.5  # Default: 2.5x
# Examples: USA gets 200 × 2.5 = 500 jobs

To Modify: Edit COUNTRY_JOB_MULTIPLIER in the script to scale all targets at once.

Search Terms (Optimization-Focused)

SEARCH_TERMS = [
  "operation research",  # lowercased for stricter matching
  "Mathematical Optimization",
  "MILP",
  "Integer Programming",
  "Gurobi",
  "Routing Optimization",
  "Supply Chain Optimization",
  "Simulation Optimization",
]

To Modify: Edit SEARCH_TERMS list in the script to add/remove queries.

Keyword Filters (3-Tier System)

# Tier 1: Broad title keywords (13 terms)
FILTER_KEYWORDS_TITLE = [
    "optim", "operations research", "supply chain", 
    "logistics", "routing", "scheduling", ...
]

# Tier 2: Technical keywords (24 terms)
FILTER_KEYWORDS_STRONG = [
    "linear programming", "integer programming", 
    "gurobi", "cplex", "or-tools", ...
]

# Tier 3: Negative keywords (9 terms)
FILTER_KEYWORDS_NEGATIVE = [
    "seo", "search engine", "sales optimization", ...
]

To Modify: Edit these lists in the script to adjust filtering criteria.

LinkedIn Rate Limiting

LINKEDIN_SLEEP_SEC = 10.0       # Delay between queries
LINKEDIN_MAX_ERRORS = 3         # Stop after N errors
LINKEDIN_FETCH_DESCRIPTION = True  # Get full descriptions
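The consecutive-error circuit breaker works roughly as follows: reset the counter on every success, and abandon LinkedIn once LINKEDIN_MAX_ERRORS failures in a row suggest rate limiting. `run_query` is a stand-in for the real scrape call, and the injectable `sleep_sec` is added here for testability:

```python
import time

LINKEDIN_SLEEP_SEC = 10.0
LINKEDIN_MAX_ERRORS = 3

def run_all_queries(terms, run_query, sleep_sec=LINKEDIN_SLEEP_SEC):
    """Run each query with a pause; stop after N consecutive errors."""
    errors, results = 0, []
    for term in terms:
        try:
            results.append(run_query(term))
            errors = 0                      # success resets the streak
        except Exception:
            errors += 1
            if errors >= LINKEDIN_MAX_ERRORS:
                break                       # likely rate limited: give up
        time.sleep(sleep_sec)               # delay between queries
    return results
```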

2. Quick Batch Configuration Files

Option A: quick_batch.py (Easiest)

# Edit these settings:
SCRAPE_ALL_COUNTRIES = True      # True = all 10, False = specific
# SPECIFIC_COUNTRIES = ["Netherlands", "Germany"]  # If False above
DELAY_BETWEEN_COUNTRIES = 30     # Minutes between countries

Option B: run_batch_scraper.py (Advanced)

COUNTRIES = None  # None = all 10, or ["USA", "UK", "Germany"]
DELAY_MINUTES = 60  # Delay between countries
JOBS = 50  # Fallback (overridden by COUNTRY_JOB_TARGETS)

3. Command Line Arguments

# All available options
python code/run_netherlands_indeed_linkedin.py \
  --jobs 50 \
  --batch \
  --delay 45 \
  --countries "USA,UK,Germany"

# --jobs:      fallback job count (usually ignored)
# --batch:     enable batch mode (RECOMMENDED)
# --delay:     minutes between countries
# --countries: comma-separated country list

Rate Limiting Best Practices

| Scenario | Delay Setting | Expected Results |
| --- | --- | --- |
| Single Country Test | No delay needed | 30-50 jobs, ~5-10 minutes |
| 2-3 Countries | 30 minutes | 100-150 jobs, ~1.5-2 hours |
| 5-10 Countries | 45-60 minutes | 300-500 jobs, 6-10 hours |
| Production (All 10) | 60 minutes | 400-600 jobs, overnight |

Why Delays Are Critical:

  • LinkedIn rate limiting: ~90-120 queries before blocking
  • Without delays: Blocked after ~10 jobs total (unusable)
  • With 30-60 min delays: 30-50 jobs per country (success)
  • Delays are between countries, not queries
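Since delays sit between countries rather than between queries, batch mode reduces to a sequential loop with a long pause. A sketch with an injectable `sleep` for testability:

```python
import time

def run_batch(countries, scrape_country, delay_minutes=60, sleep=time.sleep):
    """Scrape countries sequentially with a pause between each pair."""
    for i, country in enumerate(countries):
        scrape_country(country)
        if i < len(countries) - 1:          # no pause after the last one
            sleep(delay_minutes * 60)       # minutes -> seconds
```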

Performance Metrics

Speed Benchmarks

| Operation | Time | Details |
| --- | --- | --- |
| URL Collection | 2-5 min | 100-200 URLs per country |
| Detail Extraction | 3-8 min | 30-50 jobs per country |
| Skill Extraction | 0.3s/job | 977-skill reference matching |
| CSV Export | <1 sec | All jobs to CSV |
| Single Country | ~5-10 min | Complete pipeline |
| All 10 Countries (batch) | 6-10 hours | With safe delays |

Database Statistics (Example Run)

Total Jobs: 450
Countries: 10
Platforms: LinkedIn (70%), Indeed (30%)
Skills Extracted: ~15-25 per job
Companies: 287 unique
Remote Jobs: 35%

Troubleshooting

Common Issues

1. "LinkedIn Blocked - Only Got 10 Jobs"

Solution: Use batch mode with 30-60 minute delays

python code/run_netherlands_indeed_linkedin.py --batch --delay 45

2. "Database Locked Error"

Solution: Close other scripts accessing jobs.db

# Stop all running scrapers
# Then restart
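Two standard SQLite mitigations also help here: open the connection with a busy timeout so a writer waits instead of failing immediately, and enable WAL mode so a reader and a writer can coexist. A sketch (whether the project's connection.py already does this is an assumption worth checking):

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "jobs.db")  # stand-in for data/jobs.db

# timeout=30: wait up to 30 seconds on a locked database instead of
# raising "database is locked" immediately
conn = sqlite3.connect(db_path, timeout=30)

# WAL journal mode allows one writer alongside concurrent readers
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
print(mode)  # wal
```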

3. "No Skills Extracted"

Solution: Check skills_reference_2025.json exists

ls code/src/config/skills_reference_2025.json

4. "URLs Not Scraping Details"

Solution: Run detail extraction phase

# Check pending URLs
python code/check_current_urls.py

# They will be scraped in next run automatically

Best Practices

1. For Production Scraping

  • ✅ Use batch mode with 45-60 min delays
  • ✅ Run overnight (6-10 hours total)
  • ✅ Start with 2-3 countries to test
  • ❌ Don't scrape all countries without delays

2. For Testing

  • ✅ Use single country mode
  • ✅ Set --jobs 20 for quick tests
  • ✅ Verify results with export_to_csv.py
  • ❌ Don't use --batch for testing

3. For Data Quality

  • ✅ Review keyword filters regularly
  • ✅ Update skills_reference_2025.json
  • ✅ Run remove_duplicates.py periodically
  • ✅ Check job descriptions match expected roles

4. For Reliability

  • ✅ System auto-saves progress (checkpoint-based)
  • ✅ Can resume interrupted scraping
  • ✅ Database handles duplicates automatically
  • ✅ Export data regularly as backup
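Automatic duplicate handling of the kind described above is typically implemented with a UNIQUE constraint on the job URL plus INSERT OR IGNORE, so re-scraped postings are silently skipped. A sketch under that assumption (the project's actual schema.py may differ):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE jobs (
        url   TEXT PRIMARY KEY,  -- job URL as the natural dedup key
        title TEXT
    )
""")

postings = [
    ("https://example.com/job/1", "Operations Research Scientist"),
    ("https://example.com/job/2", "Optimization Engineer"),
    ("https://example.com/job/1", "Operations Research Scientist"),  # duplicate
]
# OR IGNORE skips rows whose url already exists instead of raising an error
conn.executemany("INSERT OR IGNORE INTO jobs VALUES (?, ?)", postings)

count = conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]
print(count)  # 2
```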

File Structure Reference

job-scrapper/
├── README.md                          # Project overview
├── requirements.txt                   # Python dependencies
├── JOB_SCRAPER_DOCUMENTATION.md      # This file (complete documentation)
│
├── code/                             # Main code directory
│   ├── quick_batch.py                # ⭐ EASIEST: Edit & run for batch mode
│   ├── run_batch_scraper.py          # Auto batch scraper (alternative)
│   ├── run_netherlands_indeed_linkedin.py  # 🎯 MAIN SCRAPER
│   ├── export_to_csv.py              # Export database to CSV
│   ├── remove_duplicates.py          # Clean duplicate entries
│   ├── check_current_urls.py         # View scraping progress
│   ├── check_url_in_db.py            # Verify if URL exists
│   ├── verify_csv_urls.py            # Verify CSV against database
│   ├── BATCH_MODE_GUIDE.py           # Usage guide for batch mode
│   │
│   ├── data/
│   │   ├── jobs.db                   # 📦 SQLite database (auto-created)
│   │   └── jobs_export_*.csv         # Exported data files
│   │
│   └── src/
│       ├── config/
│       │   └── skills_reference_2025.json   # ✅ ONLY CONFIG FILE (977 skills)
│       │
│       ├── db/
│       │   ├── connection.py         # Database connection manager
│       │   ├── operations.py         # CRUD operations
│       │   └── schema.py             # Table schemas
│       │
│       ├── models/
│       │   └── models.py             # Pydantic data models
│       │
│       ├── analysis/
│       │   └── skill_extraction/
│       │       ├── extractor.py              # Main 3-layer extractor
│       │       ├── layer3_direct.py          # Pattern matching
│       │       ├── advanced_regex_extractor.py  # Layer 1 & 2
│       │       ├── normalize.py              # Deduplication
│       │       ├── common_words_filter.py    # Filter common words
│       │       └── confidence_scorer.py      # Skill confidence
│       │
│       └── validation/
│           └── realtime_validator.py # Skill validation
│
└── venv/                             # Python virtual environment

Key Files Explained

| File | Purpose | When to Use |
| --- | --- | --- |
| quick_batch.py | Simplest batch scraper | Edit settings and run (recommended) |
| run_netherlands_indeed_linkedin.py | Main scraper script | Direct CLI usage with args |
| export_to_csv.py | Export jobs to CSV | After scraping completes |
| skills_reference_2025.json | 977 skill patterns | Auto-loaded (don't modify) |
| jobs.db | SQLite database | Auto-created, stores all data |
| remove_duplicates.py | Clean duplicates | If you see duplicate jobs |

Quick Reference Commands

# ============ SCRAPING ============

# Quick test (one country)
python code/run_netherlands_indeed_linkedin.py --jobs 20

# Batch all countries (recommended)
python code/quick_batch.py

# Batch specific countries
python code/run_netherlands_indeed_linkedin.py --batch --countries "USA,UK" --delay 30


# ============ EXPORTING ============

# Export to CSV
python code/export_to_csv.py


# ============ MAINTENANCE ============

# View progress
python code/check_current_urls.py

# Remove duplicates
python code/remove_duplicates.py

# Verify URLs
python code/check_url_in_db.py


# ============ TESTING ============

# Test skill extraction
python code/test_skills.py

# Test real extraction
python code/test_real_extraction.py

# Verify CSV URLs
python code/verify_csv_urls.py

Summary

This job scraper is a production-ready system for collecting Operations Research and Optimization jobs with:

  • JobSpy-powered scraping for LinkedIn + Indeed
  • 3-tier keyword filtering ensuring only OR/Optimization jobs
  • 3-layer skill extraction with a 977-skill reference database
  • Multi-country support across 10 countries (USA, India, UK, Germany, etc.)
  • Market-based targeting with a 2.5x multiplier (500 jobs for USA, 125 for Netherlands)
  • Batch mode with delays to avoid LinkedIn rate limiting
  • Comprehensive metadata with 17 fields per job
  • CSV export for easy analysis in Excel/Google Sheets
  • In-memory deduplication preventing duplicate jobs

Quick Start

Simplest Method (Batch Mode):

# 1. Edit settings in quick_batch.py
#    - SCRAPE_ALL_COUNTRIES = True
#    - DELAY_BETWEEN_COUNTRIES = 30
#
# 2. Run overnight
python code/quick_batch.py
#
# 3. Export to CSV
python code/export_to_csv.py

Expected Results:

  • Time: 6-10 hours (overnight run)
  • Jobs: 400-600 total across 10 countries
  • Distribution: USA/India get ~500 jobs, smaller countries ~125 jobs
  • Platforms: ~70% LinkedIn, ~30% Indeed
  • Skills: 15-25 skills extracted per job

Why This Works:

  • ✅ Market-based targets ensure quality over quantity
  • ✅ 3-tier filtering removes SEO, marketing, sales jobs
  • ✅ Delays prevent LinkedIn blocking
  • ✅ Skill extraction provides actionable insights
  • ✅ CSV export enables custom analysis

Last Updated: January 2, 2026
Focus: Operations Research & Mathematical Optimization Jobs
