
[TMVA] Fix RBDT evaluation of XGBoost models by merging RBDT with FastForest #15173

Merged
merged 1 commit on Apr 26, 2024

Conversation

guitargeek
Contributor

@guitargeek guitargeek commented Apr 8, 2024

Consolidate RBDT as specified in the ROOT plan of work 2024.

The backends of RBDT are replaced with a single new backend:
the logic from the FastForest library:
https://github.com/guitargeek/XGBoost-FastForest

The logic in that library was originally taken from the GBRForest in
CMSSW:
https://github.com/cms-sw/cmssw/blob/master/CommonTools/MVAUtils/interface/GBRForestTools.h

The interface remains the same, only that the template parameter
specifying the backend is gone.

This change adds support for unbalanced trees.
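For illustration, here is a minimal Python sketch of how such a flattened, possibly unbalanced tree can be evaluated. The array names mirror the RBDT data members discussed later in this review (cutIndices_, leftIndices_, etc.); the encoding of leaves as negative child indices is an assumption made for this sketch, not necessarily the exact FastForest convention:

```python
import numpy as np

# Hypothetical flattened representation of one small, unbalanced tree:
# inner nodes live in parallel arrays; a negative child index c encodes
# a jump to leaf responses[-c - 1] instead of another inner node.
cut_indices = np.array([0, 1])          # feature tested at each inner node
cut_values = np.array([0.5, -1.0])      # threshold at each inner node
left_children = np.array([1, -1])       # node 0 -> node 1; node 1 -> leaf 0
right_children = np.array([-3, -2])     # node 0 -> leaf 2; node 1 -> leaf 1
responses = np.array([0.1, -0.2, 0.3])  # leaf scores

def evaluate_tree(x):
    """Walk the flat arrays until a negative (leaf) index is reached."""
    node = 0
    while node >= 0:
        if x[cut_indices[node]] < cut_values[node]:
            node = left_children[node]
        else:
            node = right_children[node]
    return responses[-node - 1]

# Descends left twice, ending in the deepest branch of the unbalanced tree.
print(evaluate_tree(np.array([0.2, -2.0])))  # responses[0] = 0.1
```

Because node positions are explicit indices rather than implicit positions in a perfectly balanced array, the left and right subtrees are free to have different depths.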

To see the fix in action, one can run a modified version of the tmva101_Training.py tutorial (remember to run tutorials/tmva/tmva100_DataPreparation.py first):

import ROOT
import numpy as np

from tmva100_DataPreparation import variables

def load_data(signal_filename, background_filename):
    # Read data from ROOT files
    data_sig = ROOT.RDataFrame("Events", signal_filename).AsNumpy()
    data_bkg = ROOT.RDataFrame("Events", background_filename).AsNumpy()

    # Convert inputs to format readable by machine learning tools
    x_sig = np.vstack([data_sig[var] for var in variables]).T
    x_bkg = np.vstack([data_bkg[var] for var in variables]).T
    x = np.vstack([x_sig, x_bkg])

    # Create labels
    num_sig = x_sig.shape[0]
    num_bkg = x_bkg.shape[0]
    y = np.hstack([np.ones(num_sig), np.zeros(num_bkg)])

    # Compute weights balancing both classes
    num_all = num_sig + num_bkg
    w = np.hstack([np.ones(num_sig) * num_all / num_sig, np.ones(num_bkg) * num_all / num_bkg])

    return x, y, w

# Load data
x, y, w = load_data("train_signal.root", "train_background.root")

# Fit xgboost model
from xgboost import XGBClassifier
xbdt = XGBClassifier(max_depth=3, n_estimators=500)
xbdt.fit(x, y, sample_weight=w)

# Save model in TMVA format
ROOT.TMVA.Experimental.SaveXGBoost(xbdt, "myBDT", "tmva101.root", num_inputs=x.shape[1])

# Load test data
x, y_true, w = load_data("test_signal.root", "test_background.root")

# Load trained model
model_file = "tmva101.root"
bdt = ROOT.TMVA.Experimental.RBDT("myBDT", model_file)

# Make prediction
y_pred_ref = xbdt.predict_proba(x)[:, 1]
y_pred = bdt.Compute(x).T

diff = y_pred - y_pred_ref

print("max diff", np.max(np.abs(diff)))

Before this PR, the maximum difference was 0.379578; with it, the difference is 5.9604645e-08, compatible with single-precision floating-point accuracy.
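For reference, 5.9604645e-08 is half of the single-precision machine epsilon, which can be checked with NumPy:

```python
import numpy as np

# Machine epsilon of float32, the precision used for the BDT scores.
eps = float(np.finfo(np.float32).eps)  # 2**-23, about 1.19e-07

# Half an ulp at 1.0 is the expected rounding error scale,
# about 5.96e-08, matching the observed maximum difference.
print(eps, eps / 2)
```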

FYI @martamaja10, @lmoneta, @vepadulano

@guitargeek guitargeek self-assigned this Apr 8, 2024
@guitargeek guitargeek marked this pull request as draft April 8, 2024 18:24

github-actions bot commented Apr 8, 2024

Test Results

12 files, 12 suites, 2d 13h 55m 24s ⏱️
2 629 tests: 2 629 ✅, 0 💤, 0 ❌
29 682 runs: 29 682 ✅, 0 💤, 0 ❌

Results for commit a036e7e.


@guitargeek guitargeek changed the title [TMVA] Consolidate RBDT [TMVA] Merge RBDT with FastForest Apr 8, 2024
@guitargeek guitargeek changed the title [TMVA] Merge RBDT with FastForest [TMVA] Fix RBDT evaluation of XGBoost models by mergin RBDT with FastForest Apr 8, 2024
@guitargeek guitargeek marked this pull request as ready for review April 8, 2024 23:05
@guitargeek guitargeek added this to the 6.32/00 milestone Apr 8, 2024
@hahnjo
Member

hahnjo commented Apr 9, 2024

To be honest, I'm not convinced this should go into 6.32: After branching, feature development is nominally over...

@guitargeek
Contributor Author

guitargeek commented Apr 9, 2024

Thanks for the review!

The only tutorial for that feature silently gives wrong results: IMO this is a bugfix, not a new feature.

Also, there are people who would clearly benefit from this being in 6.32 (the Analysis Grand Challenge).

@guitargeek guitargeek added the bug label Apr 9, 2024
@guitargeek guitargeek changed the title [TMVA] Fix RBDT evaluation of XGBoost models by mergin RBDT with FastForest [TMVA] Fix RBDT evaluation of XGBoost models by merging RBDT with FastForest Apr 9, 2024
Member

@vepadulano vepadulano left a comment

Thank you! This will indeed be nice to have for the AGC implementation (although I am not sure yet if this is a good argument for bringing it to 6.32, the test silently returning wrong data is a much stronger argument).

There is some code modernization and class cleanup to be done (applying ROOT's camelcase everywhere for example).

detail::softmaxTransformInplace(out, nOut);
}

void TMVA::Experimental::RBDT::compute(const Value_t *array, Value_t *out) const
Member

CamelCase. Also, change the C-style array interface

Contributor Author

See this comment about array interface:
#15173 (comment)

}
}

TMVA::Experimental::RBDT::Value_t TMVA::Experimental::RBDT::evaluateBinary(const Value_t *array) const
Member

CamelCase. Also, change the C-style array interface

Contributor Author

See this comment about array interface:
#15173 (comment)


} // namespace

RBDT TMVA::Experimental::RBDT::load_txt(std::string const &txtpath, std::vector<std::string> &features, int nClasses)
Member

Better to spell out the full function signature

Suggested change
RBDT TMVA::Experimental::RBDT::load_txt(std::string const &txtpath, std::vector<std::string> &features, int nClasses)
TMVA::Experimental::RBDT TMVA::Experimental::RBDT::load_txt(std::string const &txtpath, std::vector<std::string> &features, int nClasses)

Member

Also, CamelCase

@guitargeek
Contributor Author

Hi @vepadulano, thanks for your review comments! I have addressed all of them, except for the CamelCase in the implementation details. I think since this code is copied from an external library and I might want to sync it again at some point, it should be kept as-is for now. Would that be okay for you?

@vepadulano
Member

Would that be okay for you?

I would still like the repository to use the same casing everywhere, but probably if we look hard enough we can find other examples of functions not strictly following the CamelCase convention.

For the purposes of this PR, as long as anything outside of Internal namespace or unnamed namespaces in compilation units follows CamelCase, I would be ok with it.

@guitargeek
Contributor Author

For the purposes of this PR, as long as anything outside of Internal namespace or unnamed namespaces in compilation units follows CamelCase, I would be ok with it.

Yep that's the case! Everything not following the convention is in the anonymous namespace.

@guitargeek
Contributor Author

By the way, is xgboost installed in the new CI? If not, we are not testing these developments right now.

@vepadulano
Member

is xgboost installed in the new CI

If it's not in https://github.com/root-project/root/blob/master/requirements.txt then it's not installed

Comment on lines 66 to 75
std::vector<int> rootIndices_;
std::vector<unsigned int> cutIndices_;
std::vector<Value_t> cutValues_;
std::vector<int> leftIndices_;
std::vector<int> rightIndices_;
std::vector<Value_t> responses_;
std::vector<int> treeNumbers_;
std::vector<Value_t> baseResponses_;
Value_t baseScore_ = 0.0;
bool logistic_ = false;
Member

I only noticed now that these data members have this trailing underscore convention, what does it mean exactly? We usually go with fDataMember. Also, they are public data members of a public interface, even more reason to apply the usual convention

Contributor Author

Since I originally followed the CMSSW code style, I used trailing underscores to mark public data members. Coincidentally, the xgboost python package follows almost the same convention (only with snake_case_, as one can see in examples like xgb_model.n_classes_).

The fDataMember convention is for private data members AFAIK. The best solution is probably to make all these data members private anyway, and then we can switch to the fStyle convention.

Contributor Author

I have made all the data members private and use the fDataMember style in my latest update!


namespace detail {

typedef std::unordered_map<int, int> IndexMap;
Member

What is the meaning of the keys and values in this map type?

Contributor Author

I have added some docs now, but I think it's not necessary to go into detail here.

In the future, I think we can get rid of it anyway: at some point the XGBoost devs took inspiration from my library to optimize their own BDT representation, if I remember the discussions on GitHub correctly.

So I think we can avoid this index translation in the future by parsing XGBoost's internal representation directly instead of going over the txt dump.
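As a hypothetical illustration of such an index translation (this is not the actual RBDT parser): when flattening a tree from the txt dump, node IDs from the dump can be remapped to positions in the flat arrays, with inner nodes and leaves numbered separately.

```python
# Toy dump: node id -> node description. Node 0 splits, with left child 2
# and right child 1 (ids in dump order); nodes 1 and 2 are leaves.
dump_nodes = {
    0: ("split", 2, 1),   # (kind, left child id, right child id)
    1: ("leaf", 0.1),     # (kind, response)
    2: ("leaf", -0.2),
}

index_map = {}  # dump node id -> position in the flat inner/leaf arrays
inner, leaves = [], []

for node_id, node in sorted(dump_nodes.items()):
    if node[0] == "split":
        index_map[node_id] = len(inner)   # position among inner nodes
        inner.append(node_id)
    else:
        index_map[node_id] = len(leaves)  # position among leaves
        leaves.append(node_id)

print(index_map)  # {0: 0, 1: 0, 2: 1}
```

Keys are node IDs as they appear in the dump; values are the positions those nodes end up occupying in the flattened arrays.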

@hahnjo
Member

hahnjo commented Apr 9, 2024

But should we explicitly add optional ROOT dependencies there? Not sure. Anyway I put it there temporarily, for the sake of a CI run.

This won't work, dependencies from requirements.txt are installed directly into the CI image which is generated once per night based on master.

@guitargeek
Contributor Author

Ok thanks! Note that I'll also have to change the order of imports in the tutorials because of this issue, which was independently reproduced by @vepadulano:

@martamaja10
Contributor

With help from Jonas, I tested this PR in the context of the AGC (for now locally on my PC with 8 cores) and it works: the ML histograms are produced.

@guitargeek
Contributor Author

guitargeek commented Apr 10, 2024

Finally the changes in this PR are tested. The macOS 13 node had XGBoost installed:

Test #1399: tutorial-tmva-tmva101_Training-py ............................   Passed   22.84 sec
Test #1400: tutorial-tmva-tmva102_Testing-py .............................   Passed    4.02 sec
Test #1083: tutorial-tmva-tmva103_Application ............................   Passed    9.33 sec
Test  #390: pyunittests-rbdt-xgboost .....................................   Passed    4.93 sec

The pyunittests-rbdt-xgboost that had to be disabled before is passing as well. The tutorials were passing before this PR already, because the BDT scores were only plotted and printed, not validated.

@hahnjo
Member

hahnjo commented Apr 10, 2024

The only tutorial for that feature silently gives wrong results: IMO this is a bugfix, not a new feature.

So my understanding is that this never worked, right? Which also makes me doubt that there are any users that try to use this feature. I think in that case, we should seriously ask ourselves if we should support that at all...

In any case, I remain unconvinced that a +528 −939 PR is appropriate for backporting after branching.

@guitargeek
Contributor Author

guitargeek commented Apr 10, 2024

Thanks Jonas for the valid concern.

There are some people who tried to use RBDT, and even people who noticed the problem of wrong values for XGBoost models:

https://root-forum.cern.ch/search?q=Rbdt

Also, the plan to "consolidate RBDT" was announced in the PoW 2024 and at several presentations, so at this point dropping the feature is not an option I think.

I agree with you that a feature giving wrong results in its tutorial, with nobody opening an issue about it for 5 years, doesn't exactly speak for the relevance of the feature (or maybe we didn't advertise it enough?).

@vepadulano
Member

On top of your already very valid comments, I wanted to add that one point in favour of the relevance of such a feature is that it was initially present in CMSSW, so evaluation of BDTs in experiment frameworks is still a thing. Providing a thin layer to import the de-facto standard xgboost models into a C++ execution environment just seems the right thing to do. Since this feature was sort-of-present-but-utterly-broken before, it makes sense imho to fix it in our codebase. Then I guess the question of eventually removing the feature completely and have it as a standalone library is still valid (for ROOT7 perhaps).

@guitargeek
Contributor Author

Actually I think users don't use the GitHub issue tracker that much in general. It's mostly used by devs, packagers, and experiment framework developers. So maybe it's normal we didn't hear about the problem there.

@vepadulano
Member

vepadulano commented Apr 10, 2024

I have summarized the current understanding of the situation with RBDT at #15197

Given that, I believe we should not merge this at the moment (and thus not push for it being in 6.32). An investigation is required regarding how to provide our Python users with what they probably want in this case: evaluating xgboost models in a ROOT application. This may be served by the changes of this PR. But an alternative is also forgoing an internal implementation a la RBDT and just trying to see if we can hook the xgboost Python API directly into e.g. the RDataFrame API.

@guitargeek
Contributor Author

guitargeek commented Apr 10, 2024

Awesome, thanks!

But an alternative is also forgoing an internal implementation a-la RBDT and just try to see if we can hook the xgboost Python API directly into e.g. the RDataFrame API.

I don't think that the XGBoost API is thread safe (it doesn't have to be because of the GIL), so that might be a problem with implicit multi-threading in RDataFrame.

I'm happy to continue the discussion later, but just for the record my 2 cents: I would still consider this a bugfix for 6.32.

@guitargeek
Contributor Author

To summarize what was decided today in the ROOT team meeting: it is very unfortunate that this bug was discovered a week after branching, please avoid this in the future. Nevertheless, since this is fixing a feature that we advertise and the feature is very specific, this should be merged and backported to 6.32 if the CI can be made green (even disabling the failing test would be acceptable, since n - 1 tested platforms are better than zero).

@guitargeek
Contributor Author

I have now implemented the final review comments about the fMember style by @vepadulano.

Furthermore, the alma9 failures should be gone now, because we have deactivated tmva-cpu for this platform. This way, the collision between the OpenBLAS version on the system (picked up by TMVA-CPU) and the one bundled with NumPy should not happen.

Member

@lmoneta lmoneta left a comment

LGTM!
Thanks a lot Jonas for this nice contribution
