Rank-constrained decision tree with m-estimate smoothing and out-of-fold isotonic calibration.
Implements temporal outer cross-validation (year-based) with inner repeated CV (3Γ5) and the 1-SE rule for model selection.
Final refit on 2019β2022 data and scoring of the active 2023 cohort to predict 2024 voluntary turnover.
. ββ turnover_decisiontree_finalversion.py # main script (training, CV, scoring) ββ scripts/ β ββ download_data.sh # optional data helper ββ data/ # optional folder for CSVs
The model expects five yearly CSV snapshots: employee_information_2019.csv employee_information_2020.csv employee_information_2021.csv employee_information_2022.csv employee_information_2023.csv
| Category | Examples |
|---|---|
| Identifiers | unique_id, record_date, work_location, voluntary_termination_count |
| Tenure & Mobility | tenure_group / tenure_company_ratio, years_in_position, months_since_last_promotion |
| Structure | hierarchy_level, is_people_manager, fte, age_group |
| Performance | highest_performance_rating / latest_performance_rating, bravo_award, high_potential |
| Talent / Traits | recruitment_source, five_dynamics_personality |
genderis excluded by design.
Missing values are imputed; numeric columns get an additional_is_missingflag.
The binary target is derived fromvoluntary_termination_count(Tor1β terminated).
Optional helper:
./scripts/download_data.sh [DEST_DIR] # default: ~/Downloads
βοΈ Installation
Python β₯ 3.9
pip install numpy pandas scikit-learn matplotlib
pip install -r requirements.txt
Example requirements.txt:
numpy>=1.24 pandas>=2.0 scikit-learn>=1.3 matplotlib>=3.7
π How to Run
From the folder containing your CSV files:
python turnover_decisiontree_finalversion.py
If files are inside ./data:
(cd data && python ../turnover_decisiontree_finalversion.py)
π§ Model Overview
Verify & load yearly data.
Build targets: last record per employee per year β label as β1β if voluntarily left in t + 1.
Preprocess: median/mode imputation, One-Hot encoding, and rank-based feature gating.
Model: decision tree with rank constraints and m-estimate smoothing at leaves.
Inner CV (3Γ5): grid search
max_depth β {7, 8, 9}
min_leaf β {40, 60, 80, 120}
m_shrink β {Β½Β·min_leaf, min_leaf, 2Β·min_leaf} β pick the simplest model within one SE of best AUC.
Temporal outer CV: train β€ year β 1, test on {2020, 2021, 2022}.
Isotonic calibration: out-of-fold (no leakage).
Final refit: on 2019β2022 using aggregated parameters (voting).
Predict active 2023 employees β 2024 turnover risk.
Export results.
π Outputs File Description decision_tree_readable.txt Readable text version of the final tree turnover_probability_distribution.png Histogram of predicted probabilities individual_turnover_probabilities_sorted.csv Employee-level probabilities (sorted) avg_turnover_by_location_sorted.csv Average predicted turnover per site (%) π Example Results Task Metric Example Score Comment Predict average turnover by location RMSE 0.089 Lower = better Predict individual voluntary turnover AUC 0.679 0.5 = random Β· 1.0 = perfect
(Example benchmark values β your results will depend on your data.)
π§© Modeling Notes
Rank gating: progressively unlocks feature groups by tree depth (tenure β hierarchy β site β performance β recruitment).
m-estimate smoothing: Bayesian regularization toward global mean.
Out-of-Fold isotonic calibration: monotonic probability correction without data leakage.
1-SE rule: prefers the simplest model within one standard error of the best AUC.
π§° Troubleshooting
Missing files β script stops and lists required filenames.
Missing columns β safely ignored; complete data improves accuracy.
Minor score variations may appear across scikit-learn versions.
If work_location is missing, per-site averages are skipped.
Random seed fixed at RANDOM_STATE = 42.
β Instructions:
- Copy everything inside the gray code block above.
- Paste it into your
README.mdfile. - Commit it β GitHub will automatically render headers, bold text, emojis, and tables perfectly.