Conversation
- Add complete HistGBCLearner class with all required methods (learn, score, get/set_parameters, tune)
- Wire HistGradientBoosting into CLI as a classifier choice
- Update RunnerConfig to support the HistGradientBoosting classifier
- Update PyProphet to instantiate and handle HistGBCLearner
- Update runner.py to handle HistGradientBoosting in weight loading/saving
- Store feature importances in an XGBoost-compatible format for consistency
- Support hyperparameter tuning via RandomizedSearchCV
- Add comprehensive implementation summary document
- Add compare_classifiers.py script for performance comparison
- Update user guide to document all available classifiers
- Add HistGBCLearner to API documentation
- Document benefits and usage of HistGradientBoosting as an XGBoost alternative
…prevent TypeError
…sifier feature importances
…stingClassifier
- Add custom loss_gain_score function using negative log loss as the scoring metric
- This mimics XGBoost's gain metric (measures the loss reduction when a feature is used)
- Use stratified subsampling (up to 2000 samples) for faster computation
- Scale importances by 100x to match XGBoost gain magnitude
- Clamp negative values to zero (these can occur due to noise in permutation)
- Use make_scorer with needs_proba=True for probability-based scoring
- Fixes issue where default accuracy scoring produced mostly zero importances
…tuning and parallel processing
…event oversubscription
… and HistGBCLearner
…ove model performance
…s for optimization
…t/Split/Multisplit; enable autotune and PFDR paths; extend strategies to support --classifier=HistGradientBoosting
…r across various configurations
Pull Request Overview
Adds scikit-learn’s HistGradientBoostingClassifier as a first-class classifier option alongside existing LDA/SVM/XGBoost, integrates it into the scoring pipeline, CLI, config, and tests, and provides a sandbox script for runtime comparisons.
- New HistGBCLearner with learning, scoring, autotuning, and feature-importance support
- Pipeline integration: CLI option, config updates, runner persistence handling, semi-supervised tuning path, and docs
- Comprehensive tests and golden outputs for OSW/Parquet variants, plus a sandbox comparison script
Reviewed Changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| tests/test_pyprophet_score.py | Adds histgbc flags to strategies and new test cases for OSW/Parquet split variants and autotuning |
| tests/_regtest_outputs/histgbc.out | Golden outputs for new HistGBC tests |
| sandbox/compare_classifiers.py | Script to compare XGBoost vs HistGradientBoosting runtime/results |
| pyprophet/stats.py | Tuple pi0_lambda handling enhanced for (a,0,0) input |
| pyprophet/scoring/semi_supervised.py | Integrates HistGBCLearner into semi-supervised tuning path |
| pyprophet/scoring/runner.py | Threading advisory/adjustments for HGB, persistence paths and messages updated |
| pyprophet/scoring/pyprophet.py | Constructs HistGBCLearner, supports importance logging, and apply-weights for HGB |
| pyprophet/scoring/classifiers.py | Implements HistGBCLearner with tuning, learning, scoring, and importances |
| pyprophet/io/_base.py | Persist HGB weights via binary path (shared with XGB) and minor formatting |
| pyprophet/cli/score.py | Adds HistGradientBoosting to CLI and notes on OMP_NUM_THREADS |
| pyprophet/_config.py | Config and type annotations updated to include HistGradientBoosting |
| docs/user_guide/pyprophet_workflow.rst | User guide updated to introduce HistGradientBoosting |
| docs/api/scoring.rst | API docs list HistGBCLearner |
| HISTGBC_TEST_ADDITIONS.md | Notes on new tests and coverage |
Comments suppressed due to low confidence (1)
pyprophet/scoring/runner.py:1
- Newly enabled apply-weights path for HistGradientBoosting shares the XGBoost persistence (PYPROPHET_XGB table), but there is no test exercising apply-weights for HistGradientBoosting. Please add an apply-weights test (OSW and TSV/bin) analogous to the existing XGBoost apply-weights tests to verify read/write and scoring consistency.
…S or os.cpu_count() with fallback and clamping
- Coordinate outer (RandomizedSearchCV n_jobs) vs inner (OpenMP) parallelism
- Use threadpoolctl to enforce OpenMP limits during tune() and learn() fits
- Avoid OMP_NUM_THREADS=0 and respect user-provided OMP_NUM_THREADS if set
Critical fix for OpenMP thread control with HistGradientBoosting:
- Defer sklearn imports in the CLI until after OMP_NUM_THREADS is set
- Auto-set OMP_NUM_THREADS in score() if not already set by the user
- Calculate optimal threads: ceil(total_cpus / requested_threads)
- Update warnings and docs to clarify the timing requirement
- Keep threadpoolctl as a runtime fallback (the env var is primary)

This ensures OMP_NUM_THREADS takes effect, since it must be set before NumPy/OpenMP initialization. Without this, HistGBC would use all CPUs regardless of threadpoolctl calls. Resolves the issue where threadpoolctl alone was insufficient.
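The thread calculation described above can be sketched as follows. This is a simplified illustration under stated assumptions (the function name `configure_omp_threads` is invented here); the real code additionally coordinates with threadpoolctl and RandomizedSearchCV's n_jobs.

```python
import math
import os


def configure_omp_threads(requested_threads: int) -> None:
    """Set OMP_NUM_THREADS before any NumPy/OpenMP import, if the user hasn't."""
    # Respect a user-provided value; only auto-set when absent.
    if os.environ.get("OMP_NUM_THREADS"):
        return
    total_cpus = os.cpu_count() or 1  # fallback when cpu_count() returns None
    # Split CPUs between outer workers and inner OpenMP threads;
    # clamp to >= 1 because OMP_NUM_THREADS=0 is invalid.
    omp_threads = max(1, math.ceil(total_cpus / max(1, requested_threads)))
    os.environ["OMP_NUM_THREADS"] = str(omp_threads)
```

The key design point from the commit is timing: this must run before NumPy (and hence OpenMP) is first imported, which is why the CLI defers its sklearn imports.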
…ing and update logging in runner
Summary
This PR adds HistGradientBoostingClassifier from scikit-learn as a new classifier option in PyProphet, providing a high-performance alternative to XGBoost without requiring additional dependencies.
Changes Made
✅ Implementation Complete
✅ CLI & Configuration
- HistGradientBoosting as classifier choice in CLI
✅ Documentation
- Performance comparison script (compare_classifiers.py)
✅ Files Changed
- pyprophet/scoring/classifiers.py - Complete HistGBCLearner implementation
- pyprophet/scoring/pyprophet.py - Integration with scoring pipeline
- pyprophet/scoring/runner.py - Model persistence handling
- pyprophet/cli/score.py - CLI option
- pyprophet/_config.py - Configuration support
- docs/user_guide/pyprophet_workflow.rst - User documentation
- docs/api/scoring.rst - API documentation
- compare_classifiers.py - Performance comparison tool (new)
Usage Examples
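A usage example might look like the following. The input path is hypothetical; `--classifier=HistGradientBoosting` is the option this PR adds to the `score` command.

```shell
# Score an OSW file with the new classifier instead of XGBoost
# (hypothetical input file; the --classifier value is the one added by this PR)
pyprophet score --in=data.osw --classifier=HistGradientBoosting
```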
Benefits
Checklist
Example Performance Comparison
Tested on Gold Standard S. pyo dataset from PASS01508.