Skip to content

Predict employee voluntary turnover using a rank-constrained decision tree with isotonic calibration and temporal cross-validation.

Notifications You must be signed in to change notification settings

Bayeko/Employee-turnover-model

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

19 Commits
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Employee Turnover Prediction β€” Final Model

Rank-constrained decision tree with m-estimate smoothing and out-of-fold isotonic calibration.
Implements temporal outer cross-validation (year-based) with inner repeated CV (3Γ—5) and the 1-SE rule for model selection.
Final refit on 2019–2022 data and scoring of the active 2023 cohort to predict 2024 voluntary turnover.

Status Python License


πŸ“ Project Structure

. β”œβ”€ turnover_decisiontree_finalversion.py # main script (training, CV, scoring) β”œβ”€ scripts/ β”‚ └─ download_data.sh # optional data helper └─ data/ # optional folder for CSVs


πŸ“Š Data Requirements

The model expects five yearly CSV snapshots: employee_information_2019.csv employee_information_2020.csv employee_information_2021.csv employee_information_2022.csv employee_information_2023.csv

Expected Columns

Category Examples
Identifiers unique_id, record_date, work_location, voluntary_termination_count
Tenure & Mobility tenure_group / tenure_company_ratio, years_in_position, months_since_last_promotion
Structure hierarchy_level, is_people_manager, fte, age_group
Performance highest_performance_rating / latest_performance_rating, bravo_award, high_potential
Talent / Traits recruitment_source, five_dynamics_personality

gender is excluded by design.
Missing values are imputed; numeric columns get an additional _is_missing flag.
The binary target is derived from voluntary_termination_count (T or 1 β†’ terminated).

Optional helper:

./scripts/download_data.sh [DEST_DIR] # default: ~/Downloads

βš™οΈ Installation

Python β‰₯ 3.9

pip install numpy pandas scikit-learn matplotlib

or

pip install -r requirements.txt

Example requirements.txt:

numpy>=1.24 pandas>=2.0 scikit-learn>=1.3 matplotlib>=3.7

πŸš€ How to Run

From the folder containing your CSV files:

python turnover_decisiontree_finalversion.py

If files are inside ./data:

(cd data && python ../turnover_decisiontree_finalversion.py)

🧠 Model Overview

Verify & load yearly data.

Build targets: last record per employee per year β†’ label as β€œ1” if voluntarily left in t + 1.

Preprocess: median/mode imputation, One-Hot encoding, and rank-based feature gating.

Model: decision tree with rank constraints and m-estimate smoothing at leaves.

Inner CV (3Γ—5): grid search

max_depth ∈ {7, 8, 9}

min_leaf ∈ {40, 60, 80, 120}

m_shrink ∈ {Β½Β·min_leaf, min_leaf, 2Β·min_leaf} β†’ pick the simplest model within one SE of best AUC.

Temporal outer CV: train ≀ year βˆ’ 1, test on {2020, 2021, 2022}.

Isotonic calibration: out-of-fold (no leakage).

Final refit: on 2019–2022 using aggregated parameters (voting).

Predict active 2023 employees β†’ 2024 turnover risk.

Export results.

πŸ“‚ Outputs File Description decision_tree_readable.txt Readable text version of the final tree turnover_probability_distribution.png Histogram of predicted probabilities individual_turnover_probabilities_sorted.csv Employee-level probabilities (sorted) avg_turnover_by_location_sorted.csv Average predicted turnover per site (%) πŸ“ˆ Example Results Task Metric Example Score Comment Predict average turnover by location RMSE 0.089 Lower = better Predict individual voluntary turnover AUC 0.679 0.5 = random Β· 1.0 = perfect

(Example benchmark values β€” your results will depend on your data.)

🧩 Modeling Notes

Rank gating: progressively unlocks feature groups by tree depth (tenure β†’ hierarchy β†’ site β†’ performance β†’ recruitment).

m-estimate smoothing: Bayesian regularization toward global mean.

Out-of-Fold isotonic calibration: monotonic probability correction without data leakage.

1-SE rule: prefers the simplest model within one standard error of the best AUC.

🧰 Troubleshooting

Missing files β†’ script stops and lists required filenames.

Missing columns β†’ safely ignored; complete data improves accuracy.

Minor score variations may appear across scikit-learn versions.

If work_location is missing, per-site averages are skipped.

Random seed fixed at RANDOM_STATE = 42.

⚠️ gender is intentionally excluded for fairness. Always ensure data governance and privacy compliance before re-using this model.


βœ… Instructions:

  1. Copy everything inside the gray code block above.
  2. Paste it into your README.md file.
  3. Commit it β€” GitHub will automatically render headers, bold text, emojis, and tables perfectly.

About

Predict employee voluntary turnover using a rank-constrained decision tree with isotonic calibration and temporal cross-validation.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published