
[TMVA] Fix RBDT evaluation of XGBoost models by merging RBDT with FastForest #15173

Merged
merged 1 commit on Apr 26, 2024

Conversation

guitargeek
Contributor

@guitargeek guitargeek commented Apr 8, 2024

Consolidate RBDT as specified in the ROOT plan of work 2024.

The backends of RBDT are replaced with a single new backend:
the logic from the FastForest library:
https://github.com/guitargeek/XGBoost-FastForest

The logic in that library was originally taken from the GBRForest in
CMSSW:
https://github.com/cms-sw/cmssw/blob/master/CommonTools/MVAUtils/interface/GBRForestTools.h

The interface remains the same, only that the template parameter
specifying the backend is gone.

This change adds support for unbalanced trees.
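For illustration, here is a minimal Python sketch of how such a flattened, possibly unbalanced tree can be evaluated. The array names mirror the RBDT data members discussed later in this review (cutIndices_, leftIndices_, etc.); the encoding of leaves as negative child indices is an assumption made for this sketch, not necessarily the exact FastForest convention:

```python
import numpy as np

# Hypothetical flattened representation of one small, unbalanced tree:
# inner nodes live in parallel arrays; a negative child index c encodes
# a jump to leaf responses[-c - 1] instead of another inner node.
cut_indices = np.array([0, 1])          # feature tested at each inner node
cut_values = np.array([0.5, -1.0])      # threshold at each inner node
left_children = np.array([1, -1])       # node 0 -> node 1; node 1 -> leaf 0
right_children = np.array([-3, -2])     # node 0 -> leaf 2; node 1 -> leaf 1
responses = np.array([0.1, -0.2, 0.3])  # leaf scores

def evaluate_tree(x):
    """Walk the flat arrays until a negative (leaf) index is reached."""
    node = 0
    while node >= 0:
        if x[cut_indices[node]] < cut_values[node]:
            node = left_children[node]
        else:
            node = right_children[node]
    return responses[-node - 1]

# Descends left twice, ending in the deepest branch of the unbalanced tree.
print(evaluate_tree(np.array([0.2, -2.0])))  # responses[0] = 0.1
```

Because node positions are explicit indices rather than implicit positions in a perfectly balanced array, the left and right subtrees are free to have different depths.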

To see the fix in action, one can run a modified version of the tmva101_Training.py tutorial (remember to run tutorials/tmva/tmva100_DataPreparation.py first):

import ROOT
import numpy as np

from tmva100_DataPreparation import variables

def load_data(signal_filename, background_filename):
    # Read data from ROOT files
    data_sig = ROOT.RDataFrame("Events", signal_filename).AsNumpy()
    data_bkg = ROOT.RDataFrame("Events", background_filename).AsNumpy()

    # Convert inputs to format readable by machine learning tools
    x_sig = np.vstack([data_sig[var] for var in variables]).T
    x_bkg = np.vstack([data_bkg[var] for var in variables]).T
    x = np.vstack([x_sig, x_bkg])

    # Create labels
    num_sig = x_sig.shape[0]
    num_bkg = x_bkg.shape[0]
    y = np.hstack([np.ones(num_sig), np.zeros(num_bkg)])

    # Compute weights balancing both classes
    num_all = num_sig + num_bkg
    w = np.hstack([np.ones(num_sig) * num_all / num_sig, np.ones(num_bkg) * num_all / num_bkg])

    return x, y, w

# Load data
x, y, w = load_data("train_signal.root", "train_background.root")

# Fit xgboost model
from xgboost import XGBClassifier
xbdt = XGBClassifier(max_depth=3, n_estimators=500)
xbdt.fit(x, y, sample_weight=w)

# Save model in TMVA format
ROOT.TMVA.Experimental.SaveXGBoost(xbdt, "myBDT", "tmva101.root", num_inputs=x.shape[1])

# Load test data
x, y_true, w = load_data("test_signal.root", "test_background.root")

# Load trained model
model_file = "tmva101.root"
bdt = ROOT.TMVA.Experimental.RBDT("myBDT", model_file)

# Make prediction
y_pred_ref = xbdt.predict_proba(x)[:, 1]
y_pred = bdt.Compute(x).T

diff = y_pred - y_pred_ref

print("max diff", np.max(np.abs(diff)))

Before this PR, the maximum difference was 0.379578; with it, the difference is 5.9604645e-08, compatible with single-precision floating-point accuracy.
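For reference, 5.9604645e-08 is half of the single-precision machine epsilon, which can be checked with NumPy:

```python
import numpy as np

# Machine epsilon of float32, the precision used for the BDT scores.
eps = float(np.finfo(np.float32).eps)  # 2**-23, about 1.19e-07

# Half an ulp at 1.0 is the expected rounding error scale,
# about 5.96e-08, matching the observed maximum difference.
print(eps, eps / 2)
```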

FYI @martamaja10, @lmoneta, @vepadulano

@guitargeek guitargeek self-assigned this Apr 8, 2024
@guitargeek guitargeek marked this pull request as draft April 8, 2024 18:24

github-actions bot commented Apr 8, 2024

Test Results

12 files, 12 suites, 2d 13h 55m 24s ⏱️
2 629 tests: 2 629 ✅, 0 💤, 0 ❌
29 682 runs: 29 682 ✅, 0 💤, 0 ❌

Results for commit a036e7e.


@guitargeek guitargeek changed the title [TMVA] Consolidate RBDT [TMVA] Merge RBDT with FastForest Apr 8, 2024
@guitargeek guitargeek changed the title [TMVA] Merge RBDT with FastForest [TMVA] Fix RBDT evaluation of XGBoost models by mergin RBDT with FastForest Apr 8, 2024
@guitargeek guitargeek marked this pull request as ready for review April 8, 2024 23:05
@guitargeek guitargeek added this to the 6.32/00 milestone Apr 8, 2024
@hahnjo
Member

hahnjo commented Apr 9, 2024

To be honest, I'm not convinced this should go into 6.32: After branching, feature development is nominally over...

@guitargeek
Contributor Author

guitargeek commented Apr 9, 2024

Thanks for the review!

The only tutorial for that feature silently gives wrong results: IMO this is a bugfix, not a new feature.

Also, there are people who would clearly benefit from this being in 6.32 (the Analysis Grand Challenge).

@guitargeek guitargeek added the bug label Apr 9, 2024
@guitargeek guitargeek changed the title [TMVA] Fix RBDT evaluation of XGBoost models by mergin RBDT with FastForest [TMVA] Fix RBDT evaluation of XGBoost models by merging RBDT with FastForest Apr 9, 2024
Member

@vepadulano vepadulano left a comment

Thank you! This will indeed be nice to have for the AGC implementation (although I am not sure yet if this is a good argument for bringing it to 6.32, the test silently returning wrong data is a much stronger argument).

There is some code modernization and class cleanup to be done (applying ROOT's camelcase everywhere for example).

detail::softmaxTransformInplace(out, nOut);
}

void TMVA::Experimental::RBDT::compute(const Value_t *array, Value_t *out) const
Member

CamelCase. Also, change the C-style array interface

Contributor Author

See this comment about array interface:
#15173 (comment)

}
}

TMVA::Experimental::RBDT::Value_t TMVA::Experimental::RBDT::evaluateBinary(const Value_t *array) const
Member

CamelCase. Also, change the C-style array interface

Contributor Author

See this comment about array interface:
#15173 (comment)


} // namespace

RBDT TMVA::Experimental::RBDT::load_txt(std::string const &txtpath, std::vector<std::string> &features, int nClasses)
Member

Better to spell out the full function signature

Suggested change
RBDT TMVA::Experimental::RBDT::load_txt(std::string const &txtpath, std::vector<std::string> &features, int nClasses)
TMVA::Experimental::RBDT TMVA::Experimental::RBDT::load_txt(std::string const &txtpath, std::vector<std::string> &features, int nClasses)

Member

Also, CamelCase

@guitargeek
Contributor Author

Hi @vepadulano, thanks for your review comments! I have addressed all of them, except for the CamelCase in the implementation details. I think since this code is copied from an external library and I might want to sync it again at some point, it should be kept as-is for now. Would that be okay for you?

@vepadulano
Member

Would that be okay for you?

I would still like the repository to use the same casing everywhere, but probably if we look hard enough we can find other examples of functions not strictly following the CamelCase convention.

For the purposes of this PR, as long as anything outside of Internal namespace or unnamed namespaces in compilation units follows CamelCase, I would be ok with it.

@guitargeek
Contributor Author

For the purposes of this PR, as long as anything outside of Internal namespace or unnamed namespaces in compilation units follows CamelCase, I would be ok with it.

Yep that's the case! Everything not following the convention is in the anonymous namespace.

@guitargeek
Contributor Author

By the way, is xgboost installed in the new CI? If not, we are not testing these developments right now.

@vepadulano
Member

is xgboost installed in the new CI

If it's not in https://github.com/root-project/root/blob/master/requirements.txt then it's not installed

Comment on lines 66 to 75
std::vector<int> rootIndices_;
std::vector<unsigned int> cutIndices_;
std::vector<Value_t> cutValues_;
std::vector<int> leftIndices_;
std::vector<int> rightIndices_;
std::vector<Value_t> responses_;
std::vector<int> treeNumbers_;
std::vector<Value_t> baseResponses_;
Value_t baseScore_ = 0.0;
bool logistic_ = false;
Member

I only noticed now that these data members have this trailing underscore convention, what does it mean exactly? We usually go with fDataMember. Also, they are public data members of a public interface, even more reason to apply the usual convention

Contributor Author

Since I originally followed the CMSSW code style, I used trailing underscores to mark public data members. Coincidentally, the xgboost python package follows almost the same convention (only with snake_case_, as one can see in examples like xgb_model.n_classes_).

The fDataMember convention is for private data members AFAIK. The best solution is probably to make all these data members private anyway, and then we can switch to the fStyle convention.

Contributor Author

I have made all the data members private and use the fDataMember style in my latest update!


namespace detail {

typedef std::unordered_map<int, int> IndexMap;
Member

What is the meaning of the keys and values in this map type?

Contributor Author

I have added some docs now, but I think it's not necessary to go into detail here.

In the future, I think we can get rid of it anyway: at some point the XGBoost devs took inspiration from my library to optimize their own BDT representation, if I remember the discussions on GitHub correctly.

So I think we can avoid this index translation in the future by parsing XGBoost's internal representation directly instead of going over the txt dump.
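As a hypothetical illustration of such an index translation (this is not the actual RBDT parser): when flattening a tree from the txt dump, node IDs from the dump can be remapped to positions in the flat arrays, with inner nodes and leaves numbered separately.

```python
# Toy dump: node id -> node description. Node 0 splits, with left child 2
# and right child 1 (ids in dump order); nodes 1 and 2 are leaves.
dump_nodes = {
    0: ("split", 2, 1),   # (kind, left child id, right child id)
    1: ("leaf", 0.1),     # (kind, response)
    2: ("leaf", -0.2),
}

index_map = {}  # dump node id -> position in the flat inner/leaf arrays
inner, leaves = [], []

for node_id, node in sorted(dump_nodes.items()):
    if node[0] == "split":
        index_map[node_id] = len(inner)   # position among inner nodes
        inner.append(node_id)
    else:
        index_map[node_id] = len(leaves)  # position among leaves
        leaves.append(node_id)

print(index_map)  # {0: 0, 1: 0, 2: 1}
```

Keys are node IDs as they appear in the dump; values are the positions those nodes end up occupying in the flattened arrays.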

@hahnjo
Member

hahnjo commented Apr 9, 2024

But should we explicitly add optional ROOT dependencies there? Not sure. Anyway I put it there temporarily, for the sake of a CI run.

This won't work, dependencies from requirements.txt are installed directly into the CI image which is generated once per night based on master.

@guitargeek
Contributor Author

Ok thanks! Note that I'll also have to change the order of imports in the tutorials because of this issue, which was independently reproduced by @vepadulano:

@martamaja10
Contributor

With help from Jonas, I tested this PR in the context of the AGC (for now locally on my PC with 8 cores) and it works: the ML histograms are produced.

@guitargeek
Contributor Author

guitargeek commented Apr 10, 2024

Finally the changes in this PR are tested. The macOS 13 node had XGBoost installed:

Test #1399: tutorial-tmva-tmva101_Training-py ............................   Passed   22.84 sec
Test #1400: tutorial-tmva-tmva102_Testing-py .............................   Passed    4.02 sec
Test #1083: tutorial-tmva-tmva103_Application ............................   Passed    9.33 sec
Test  #390: pyunittests-rbdt-xgboost .....................................   Passed    4.93 sec

The pyunittests-rbdt-xgboost that had to be disabled before is passing as well. The tutorials were passing before this PR already, because the BDT scores were only plotted and printed, not validated.

@hahnjo
Member

hahnjo commented Apr 10, 2024

The only tutorial for that feature silently gives wrong results: IMO this is a bugfix, not a new feature.

So my understanding is that this never worked, right? Which also makes me doubt that there are any users that try to use this feature. I think in that case, we should seriously ask ourselves if we should support that at all...

In any case, I remain unconvinced that a +528 −939 PR is appropriate for backporting after branching.

@guitargeek
Contributor Author

guitargeek commented Apr 10, 2024

Thanks Jonas for the valid concern.

There are some people who tried to use RBDT, and even people who noticed the problem of wrong values for XGBoost models:

https://root-forum.cern.ch/search?q=Rbdt

Also, the plan to "consolidate RBDT" was announced in the PoW 2024 and at several presentations, so at this point dropping the feature is not an option I think.

I agree with you that a feature giving wrong results in its tutorial, with nobody opening an issue about it for 5 years, doesn't exactly speak for the relevance of the feature (or maybe we didn't advertise it enough?).

@vepadulano
Member

On top of your already very valid comments, I wanted to add that one point in favour of the relevance of such a feature is that it was initially present in CMSSW, so evaluation of BDTs in experiment frameworks is still a thing. Providing a thin layer to import the de-facto standard xgboost models into a C++ execution environment just seems the right thing to do. Since this feature was sort-of-present-but-utterly-broken before, it makes sense imho to fix it in our codebase. Then I guess the question of eventually removing the feature completely and have it as a standalone library is still valid (for ROOT7 perhaps).

@guitargeek
Contributor Author

Actually I think users don't use the GitHub issue tracker that much in general. It's mostly used by devs, packagers, and experiment framework developers. So maybe it's normal we didn't hear about the problem there.

@vepadulano
Member

vepadulano commented Apr 10, 2024

I have summarized the current understanding of the situation with RBDT at #15197

Given that, I believe we should not merge this at the moment (and thus not push for it being in 6.32). An investigation is required regarding how to provide our Python users with what they probably want in this case: evaluating xgboost models in a ROOT application. This may be served by the changes of this PR. But an alternative is also forgoing an internal implementation a la RBDT and just trying to see if we can hook the xgboost Python API directly into e.g. the RDataFrame API.

@guitargeek
Contributor Author

guitargeek commented Apr 10, 2024

Awesome, thanks!

But an alternative is also forgoing an internal implementation a-la RBDT and just try to see if we can hook the xgboost Python API directly into e.g. the RDataFrame API.

I don't think that the XGBoost API is thread safe (it doesn't have to be because of the GIL), so that might be a problem with implicit multi-threading in RDataFrame.

I'm happy to continue the discussion later, but just for the record my 2 cents: I would still consider this a bugfix for 6.32.

@guitargeek
Contributor Author

To summarize what was decided today in the ROOT team meeting: it is very unfortunate that this bug was discovered a week after branching, please avoid this in the future. Nevertheless, since this is fixing a feature that we advertise and the feature is very specific, this should be merged and backported to 6.32 if the CI can be made green (even disabling the failing test would be acceptable, since n - 1 tested platforms are better than zero).

@guitargeek
Contributor Author

I have now implemented the final review comments about the fMember style by @vepadulano.

Furthermore, the alma9 failures should be gone now, because we have deactivated tmva-cpu for this platform. This way, the collision between the OpenBLAS version on the system (picked up by TMVA-CPU) and the one bundled with NumPy should not happen.

Member

@lmoneta lmoneta left a comment

LGTM!
Thanks a lot Jonas for this nice contribution
