Add/hist gbc #144

Merged
singjc merged 43 commits into PyProphet:master from singjc:add/histGBC
Oct 18, 2025
Conversation


@singjc singjc commented Jun 4, 2025

Summary

This PR adds HistGradientBoostingClassifier from scikit-learn as a new classifier option in PyProphet, providing a high-performance alternative to XGBoost without requiring additional dependencies.

Changes Made

✅ Implementation Complete

  • Complete HistGBCLearner class with all required methods (learn, score, get/set_parameters, tune)
  • Hyperparameter tuning via RandomizedSearchCV
  • Feature importances stored in XGBoost-compatible format
  • Full integration with PyProphet pipeline

✅ CLI & Configuration

  • Added HistGradientBoosting as classifier choice in CLI
  • Updated RunnerConfig to support new classifier
  • Updated help text and documentation

✅ Documentation

  • Updated user guide with comprehensive classifier comparison
  • Added HistGBCLearner to API documentation
  • Created comparison script (compare_classifiers.py)
  • Added implementation summary document

✅ Files Changed

  • pyprophet/scoring/classifiers.py - Complete HistGBCLearner implementation
  • pyprophet/scoring/pyprophet.py - Integration with scoring pipeline
  • pyprophet/scoring/runner.py - Model persistence handling
  • pyprophet/cli/score.py - CLI option
  • pyprophet/_config.py - Configuration support
  • docs/user_guide/pyprophet_workflow.rst - User documentation
  • docs/api/scoring.rst - API documentation
  • compare_classifiers.py - Performance comparison tool (new)

Usage Examples

# Basic usage
pyprophet score --in data.osw --classifier=HistGradientBoosting

# With hyperparameter tuning
pyprophet score --in data.osw --classifier=HistGradientBoosting --autotune

# Compare with XGBoost
python compare_classifiers.py --in test_data.osw

Benefits

  • ✅ No XGBoost dependency required
  • ✅ Native sklearn integration
  • ✅ Similar performance to XGBoost
  • ✅ Supports hyperparameter autotuning
  • ✅ Fully tested and documented

Checklist

  • Add HistGradientBoostingClassifier implementation
  • Test performance framework (comparison script provided)
  • Add documentation
  • All syntax validated
  • Performance benchmarking on real datasets (ready for maintainers to test)

Example Comparison Performance

Tested on Gold Standard S. pyo dataset from PASS01508.

$ OMP_NUM_THREADS=6 python sandbox/compare_classifiers.py --in merged.osw

================================================================================
Running with XGBoost classifier
================================================================================
  Summary for Q-Value ≤ 1%:
================================================================================
                 run_id num_ids min_area mean_area max_area
0    690411522375919463    8968      0.0    272.15  30384.6
1   1006404389538767196    9053      0.5    324.73  37667.7
2   1895217364647358706    8488     2.26    290.69  28462.3
3   1993070538656663830   10341      0.0    515.97  41947.8
4   2086104329746226633   10127      0.0     402.9  51808.6
5   2895829782584737863    8783      0.0    282.14  29902.4
6   3312727970122698354   11010      0.0    513.58  60251.0
7   3384946816235683019    9355     1.01    346.85  44013.3
8   4228224315318701854    9776      0.0    301.71  30645.4
9   6061028540430771831    9819      0.0    521.31  35058.4
10  8196394674452833952    8261      0.0    230.19  21726.6
11  8238807937114849326   10138      0.0    453.51  38966.3
12  8292393358414978236   10064     0.28    332.86  33808.5
13  8627438106464817423   10159      0.0    521.61  40618.2
14  8749089703153095849    7000      0.0    225.82  24693.5
15  8889961272137748833   10813      0.0    424.65  47524.5

================================================================================
Running with HistGradientBoosting classifier
================================================================================
  Summary for Q-Value ≤ 1%:
================================================================================
                 run_id num_ids min_area mean_area max_area
0    690411522375919463    8958      0.0    271.94  17553.4
1   1006404389538767196    8998      0.0    326.66  24286.0
2   1895217364647358706    8428      0.0    295.97  28462.3
3   1993070538656663830   10292      0.0    522.89  41947.8
4   2086104329746226633   10142      0.0    407.45  51808.6
5   2895829782584737863    8694      0.0    287.42  29902.4
6   3312727970122698354   11040      0.0    518.19  60251.0
7   3384946816235683019    9337      0.0    344.93  22048.3
8   4228224315318701854    9796      0.0    302.12  27995.1
9   6061028540430771831    9739      0.0    538.31  67755.4
10  8196394674452833952    8251      0.0    234.49  29217.3
11  8238807937114849326   10046      0.0     458.2  38966.3
12  8292393358414978236   10040      0.0    334.09  33808.5
13  8627438106464817423   10075      0.0    530.58  40618.2
14  8749089703153095849    6753      0.0    229.73  22251.7
15  8889961272137748833   10832      0.0    427.43  47524.5

================================================================================
RUNTIME COMPARISON SUMMARY
================================================================================

Runtime:
  XGBoost                  : 191.86 seconds
  HistGradientBoosting     : 137.26 seconds

Speedup: 1.40x

@singjc singjc linked an issue Jun 4, 2025 that may be closed by this pull request
@singjc singjc added the refactor label Jun 4, 2025
singjc and others added 26 commits October 16, 2025 16:21
- Add complete HistGBCLearner class with all required methods (learn, score, get/set_parameters, tune)
- Wire HistGradientBoosting into CLI as a classifier choice
- Update RunnerConfig to support HistGradientBoosting classifier
- Update PyProphet to instantiate and handle HistGBCLearner
- Update runner.py to handle HistGradientBoosting in weight loading/saving
- Feature importances stored in XGBoost-compatible format for consistency
- Hyperparameter tuning support via RandomizedSearchCV
- Add comprehensive implementation summary document
- Add compare_classifiers.py script for performance comparison
- Update user guide to document all available classifiers
- Add HistGBCLearner to API documentation
- Document benefits and usage of HistGradientBoosting as XGBoost alternative
…stingClassifier

- Add custom loss_gain_score function using negative log loss as scoring metric
- This mimics XGBoost's gain metric (measures loss reduction when feature is used)
- Use stratified subsampling (up to 2000 samples) for faster computation
- Scale importances by 100x to match XGBoost gain magnitude
- Clamp negative values to zero (can occur due to noise in permutation)
- Use make_scorer with needs_proba=True for probability-based scoring
- Fixes issue where default accuracy scoring produced mostly zero importances
…t/Split/Multisplit; enable autotune and PFDR paths; extend strategies to support --classifier=HistGradientBoosting
@singjc singjc marked this pull request as ready for review October 18, 2025 02:39
@singjc singjc requested a review from Copilot October 18, 2025 02:40
Copilot AI left a comment
Pull Request Overview

Adds scikit-learn’s HistGradientBoostingClassifier as a first-class classifier option alongside existing LDA/SVM/XGBoost, integrates it into the scoring pipeline, CLI, config, and tests, and provides a sandbox script for runtime comparisons.

  • New HistGBCLearner with learning, scoring, autotuning, and feature-importance support
  • Pipeline integration: CLI option, config updates, runner persistence handling, semi-supervised tuning path, and docs
  • Comprehensive tests and golden outputs for OSW/Parquet variants, plus a sandbox comparison script

Reviewed Changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
tests/test_pyprophet_score.py Adds histgbc flags to strategies and new test cases for OSW/Parquet split variants and autotuning
tests/_regtest_outputs/histgbc.out Golden outputs for new HistGBC tests
sandbox/compare_classifiers.py Script to compare XGBoost vs HistGradientBoosting runtime/results
pyprophet/stats.py Tuple pi0_lambda handling enhanced for (a,0,0) input
pyprophet/scoring/semi_supervised.py Integrates HistGBCLearner into semi-supervised tuning path
pyprophet/scoring/runner.py Threading advisory/adjustments for HGB, persistence paths and messages updated
pyprophet/scoring/pyprophet.py Constructs HistGBCLearner, supports importance logging, and apply-weights for HGB
pyprophet/scoring/classifiers.py Implements HistGBCLearner with tuning, learning, scoring, and importances
pyprophet/io/_base.py Persist HGB weights via binary path (shared with XGB) and minor formatting
pyprophet/cli/score.py Adds HistGradientBoosting to CLI and notes on OMP_NUM_THREADS
pyprophet/_config.py Config and type annotations updated to include HistGradientBoosting
docs/user_guide/pyprophet_workflow.rst User guide updated to introduce HistGradientBoosting
docs/api/scoring.rst API docs list HistGBCLearner
HISTGBC_TEST_ADDITIONS.md Notes on new tests and coverage
Comments suppressed due to low confidence (1)

pyprophet/scoring/runner.py:1

  • Newly enabled apply-weights path for HistGradientBoosting shares the XGBoost persistence (PYPROPHET_XGB table), but there is no test exercising apply-weights for HistGradientBoosting. Please add an apply-weights test (OSW and TSV/bin) analogous to the existing XGBoost apply-weights tests to verify read/write and scoring consistency.
"""

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

singjc and others added 13 commits October 17, 2025 22:53
…S or os.cpu_count() with fallback and clamping
- Coordinate outer (RandomizedSearchCV n_jobs) vs inner (OpenMP) parallelism
- Use threadpoolctl to enforce OpenMP limits during tune() and learn() fits
- Avoid OMP_NUM_THREADS=0 and respect user-provided OMP_NUM_THREADS if set
Critical fix for OpenMP thread control with HistGradientBoosting:
- Defer sklearn imports in CLI until after OMP_NUM_THREADS is set
- Auto-set OMP_NUM_THREADS in score() if not already set by user
- Calculate optimal threads: ceil(total_cpus / requested_threads)
- Update warnings and docs to clarify timing requirement
- Keep threadpoolctl as runtime fallback (but env var is primary)

This ensures OMP_NUM_THREADS takes effect since it must be set before
NumPy/OpenMP initialization. Without this, HistGBC would use all CPUs
regardless of threadpoolctl calls.

Resolves issue where threadpoolctl alone was insufficient.
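
The thread-coordination logic described in this commit can be sketched as below: compute and export OMP_NUM_THREADS at CLI entry, before any NumPy/sklearn import triggers OpenMP initialization, and respect a user-provided value. The function name is illustrative, not the actual PyProphet code.

```python
# Sketch of OMP_NUM_THREADS coordination per the commit above. This must run
# before NumPy/OpenMP initialize, or the setting has no effect.
import math
import os

def configure_omp_threads(requested_jobs):
    """Set OMP_NUM_THREADS if the user has not, splitting CPUs across jobs."""
    if os.environ.get("OMP_NUM_THREADS"):
        # Respect a user-provided value.
        return int(os.environ["OMP_NUM_THREADS"])
    total_cpus = os.cpu_count() or 1
    # ceil(total_cpus / requested_jobs), clamped to at least 1 (never 0).
    omp_threads = max(1, math.ceil(total_cpus / max(1, requested_jobs)))
    os.environ["OMP_NUM_THREADS"] = str(omp_threads)
    return omp_threads
```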
@singjc singjc merged commit c807047 into PyProphet:master Oct 18, 2025
1 check passed

Development

Successfully merging this pull request may close these issues.

[REFACTOR] Replace XGB with sklearn HistGradientBoostingClassifier?
