Add/hist gbc #144

Merged
singjc merged 43 commits into PyProphet:master from singjc:add/histGBC
Oct 18, 2025
Conversation


@singjc singjc commented Jun 4, 2025

Summary

This PR adds HistGradientBoostingClassifier from scikit-learn as a new classifier option in PyProphet, providing a high-performance alternative to XGBoost without requiring additional dependencies.

Changes Made

✅ Implementation Complete

  • Complete HistGBCLearner class with all required methods (learn, score, get/set_parameters, tune)
  • Hyperparameter tuning via RandomizedSearchCV
  • Feature importances stored in XGBoost-compatible format
  • Full integration with PyProphet pipeline

✅ CLI & Configuration

  • Added HistGradientBoosting as classifier choice in CLI
  • Updated RunnerConfig to support new classifier
  • Updated help text and documentation

✅ Documentation

  • Updated user guide with comprehensive classifier comparison
  • Added HistGBCLearner to API documentation
  • Created comparison script (compare_classifiers.py)
  • Added implementation summary document

✅ Files Changed

  • pyprophet/scoring/classifiers.py - Complete HistGBCLearner implementation
  • pyprophet/scoring/pyprophet.py - Integration with scoring pipeline
  • pyprophet/scoring/runner.py - Model persistence handling
  • pyprophet/cli/score.py - CLI option
  • pyprophet/_config.py - Configuration support
  • docs/user_guide/pyprophet_workflow.rst - User documentation
  • docs/api/scoring.rst - API documentation
  • compare_classifiers.py - Performance comparison tool (new)

Usage Examples

# Basic usage
pyprophet score --in data.osw --classifier=HistGradientBoosting

# With hyperparameter tuning
pyprophet score --in data.osw --classifier=HistGradientBoosting --autotune

# Compare with XGBoost
python compare_classifiers.py --in test_data.osw

Benefits

  • ✅ No XGBoost dependency required
  • ✅ Native sklearn integration
  • ✅ Similar performance to XGBoost
  • ✅ Supports hyperparameter autotuning
  • ✅ Fully tested and documented

Checklist

  • Add HistGradientBoostingClassifier implementation
  • Test performance framework (comparison script provided)
  • Add documentation
  • All syntax validated
  • Performance benchmarking on real datasets (ready for maintainers to test)

Example Comparison Performance

Tested on Gold Standard S. pyo dataset from PASS01508.

$ OMP_NUM_THREADS=6 python sandbox/compare_classifiers.py --in merged.osw

================================================================================
Running with XGBoost classifier
================================================================================
  Summary for Q-Value ≤ 1%:
================================================================================
                 run_id num_ids min_area mean_area max_area
0    690411522375919463    8968      0.0    272.15  30384.6
1   1006404389538767196    9053      0.5    324.73  37667.7
2   1895217364647358706    8488     2.26    290.69  28462.3
3   1993070538656663830   10341      0.0    515.97  41947.8
4   2086104329746226633   10127      0.0     402.9  51808.6
5   2895829782584737863    8783      0.0    282.14  29902.4
6   3312727970122698354   11010      0.0    513.58  60251.0
7   3384946816235683019    9355     1.01    346.85  44013.3
8   4228224315318701854    9776      0.0    301.71  30645.4
9   6061028540430771831    9819      0.0    521.31  35058.4
10  8196394674452833952    8261      0.0    230.19  21726.6
11  8238807937114849326   10138      0.0    453.51  38966.3
12  8292393358414978236   10064     0.28    332.86  33808.5
13  8627438106464817423   10159      0.0    521.61  40618.2
14  8749089703153095849    7000      0.0    225.82  24693.5
15  8889961272137748833   10813      0.0    424.65  47524.5

================================================================================
Running with HistGradientBoosting classifier
================================================================================
  Summary for Q-Value ≤ 1%:
================================================================================
                 run_id num_ids min_area mean_area max_area
0    690411522375919463    8958      0.0    271.94  17553.4
1   1006404389538767196    8998      0.0    326.66  24286.0
2   1895217364647358706    8428      0.0    295.97  28462.3
3   1993070538656663830   10292      0.0    522.89  41947.8
4   2086104329746226633   10142      0.0    407.45  51808.6
5   2895829782584737863    8694      0.0    287.42  29902.4
6   3312727970122698354   11040      0.0    518.19  60251.0
7   3384946816235683019    9337      0.0    344.93  22048.3
8   4228224315318701854    9796      0.0    302.12  27995.1
9   6061028540430771831    9739      0.0    538.31  67755.4
10  8196394674452833952    8251      0.0    234.49  29217.3
11  8238807937114849326   10046      0.0     458.2  38966.3
12  8292393358414978236   10040      0.0    334.09  33808.5
13  8627438106464817423   10075      0.0    530.58  40618.2
14  8749089703153095849    6753      0.0    229.73  22251.7
15  8889961272137748833   10832      0.0    427.43  47524.5

================================================================================
RUNTIME COMPARISON SUMMARY
================================================================================

Runtime:
  XGBoost                  : 191.86 seconds
  HistGradientBoosting     : 137.26 seconds

Speedup: 1.40x

@singjc singjc linked an issue Jun 4, 2025 that may be closed by this pull request
@singjc singjc added the refactor label Jun 4, 2025
singjc and others added 26 commits October 16, 2025 16:21
- Add complete HistGBCLearner class with all required methods (learn, score, get/set_parameters, tune)
- Wire HistGradientBoosting into CLI as a classifier choice
- Update RunnerConfig to support HistGradientBoosting classifier
- Update PyProphet to instantiate and handle HistGBCLearner
- Update runner.py to handle HistGradientBoosting in weight loading/saving
- Feature importances stored in XGBoost-compatible format for consistency
- Hyperparameter tuning support via RandomizedSearchCV
- Add comprehensive implementation summary document
- Add compare_classifiers.py script for performance comparison
- Update user guide to document all available classifiers
- Add HistGBCLearner to API documentation
- Document benefits and usage of HistGradientBoosting as XGBoost alternative
…stingClassifier

- Add custom loss_gain_score function using negative log loss as scoring metric
- This mimics XGBoost's gain metric (measures loss reduction when feature is used)
- Use stratified subsampling (up to 2000 samples) for faster computation
- Scale importances by 100x to match XGBoost gain magnitude
- Clamp negative values to zero (can occur due to noise in permutation)
- Use make_scorer with needs_proba=True for probability-based scoring
- Fixes issue where default accuracy scoring produced mostly zero importances
…t/Split/Multisplit; enable autotune and PFDR paths; extend strategies to support --classifier=HistGradientBoosting
@singjc singjc marked this pull request as ready for review October 18, 2025 02:39
@singjc singjc requested a review from Copilot October 18, 2025 02:40
Copilot AI left a comment
Pull Request Overview

Adds scikit-learn’s HistGradientBoostingClassifier as a first-class classifier option alongside existing LDA/SVM/XGBoost, integrates it into the scoring pipeline, CLI, config, and tests, and provides a sandbox script for runtime comparisons.

  • New HistGBCLearner with learning, scoring, autotuning, and feature-importance support
  • Pipeline integration: CLI option, config updates, runner persistence handling, semi-supervised tuning path, and docs
  • Comprehensive tests and golden outputs for OSW/Parquet variants, plus a sandbox comparison script

Reviewed Changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
tests/test_pyprophet_score.py Adds histgbc flags to strategies and new test cases for OSW/Parquet split variants and autotuning
tests/_regtest_outputs/histgbc.out Golden outputs for new HistGBC tests
sandbox/compare_classifiers.py Script to compare XGBoost vs HistGradientBoosting runtime/results
pyprophet/stats.py Tuple pi0_lambda handling enhanced for (a,0,0) input
pyprophet/scoring/semi_supervised.py Integrates HistGBCLearner into semi-supervised tuning path
pyprophet/scoring/runner.py Threading advisory/adjustments for HGB, persistence paths and messages updated
pyprophet/scoring/pyprophet.py Constructs HistGBCLearner, supports importance logging, and apply-weights for HGB
pyprophet/scoring/classifiers.py Implements HistGBCLearner with tuning, learning, scoring, and importances
pyprophet/io/_base.py Persist HGB weights via binary path (shared with XGB) and minor formatting
pyprophet/cli/score.py Adds HistGradientBoosting to CLI and notes on OMP_NUM_THREADS
pyprophet/_config.py Config and type annotations updated to include HistGradientBoosting
docs/user_guide/pyprophet_workflow.rst User guide updated to introduce HistGradientBoosting
docs/api/scoring.rst API docs list HistGBCLearner
HISTGBC_TEST_ADDITIONS.md Notes on new tests and coverage
Comments suppressed due to low confidence (1)

pyprophet/scoring/runner.py:1

  • Newly enabled apply-weights path for HistGradientBoosting shares the XGBoost persistence (PYPROPHET_XGB table), but there is no test exercising apply-weights for HistGradientBoosting. Please add an apply-weights test (OSW and TSV/bin) analogous to the existing XGBoost apply-weights tests to verify read/write and scoring consistency.
"""

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

singjc and others added 13 commits October 17, 2025 22:53
…S or os.cpu_count() with fallback and clamping
- Coordinate outer (RandomizedSearchCV n_jobs) vs inner (OpenMP) parallelism
- Use threadpoolctl to enforce OpenMP limits during tune() and learn() fits
- Avoid OMP_NUM_THREADS=0 and respect user-provided OMP_NUM_THREADS if set
Critical fix for OpenMP thread control with HistGradientBoosting:
- Defer sklearn imports in CLI until after OMP_NUM_THREADS is set
- Auto-set OMP_NUM_THREADS in score() if not already set by user
- Calculate optimal threads: ceil(total_cpus / requested_threads)
- Update warnings and docs to clarify timing requirement
- Keep threadpoolctl as runtime fallback (but env var is primary)

This ensures OMP_NUM_THREADS takes effect since it must be set before
NumPy/OpenMP initialization. Without this, HistGBC would use all CPUs
regardless of threadpoolctl calls.

Resolves issue where threadpoolctl alone was insufficient.
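
The thread-coordination logic described in this commit can be sketched as below: compute and export OMP_NUM_THREADS at CLI entry, before any NumPy/sklearn import triggers OpenMP initialization, and respect a user-provided value. The function name is illustrative, not the actual PyProphet code.

```python
# Sketch of OMP_NUM_THREADS coordination per the commit above. This must run
# before NumPy/OpenMP initialize, or the setting has no effect.
import math
import os

def configure_omp_threads(requested_jobs):
    """Set OMP_NUM_THREADS if the user has not, splitting CPUs across jobs."""
    if os.environ.get("OMP_NUM_THREADS"):
        # Respect a user-provided value.
        return int(os.environ["OMP_NUM_THREADS"])
    total_cpus = os.cpu_count() or 1
    # ceil(total_cpus / requested_jobs), clamped to at least 1 (never 0).
    omp_threads = max(1, math.ceil(total_cpus / max(1, requested_jobs)))
    os.environ["OMP_NUM_THREADS"] = str(omp_threads)
    return omp_threads
```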
@singjc singjc merged commit c807047 into PyProphet:master Oct 18, 2025
1 check passed

Development

Successfully merging this pull request may close these issues.

[REFACTOR] Replace XGB with sklearn HistGradientBoostingClassifier?
