-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[TMVA] Fix RBDT evaluation of XGBoost models by merging RBDT with FastForest #15173
Conversation
Test Results 12 files 12 suites 2d 13h 55m 24s ⏱️ Results for commit a036e7e. ♻️ This comment has been updated with latest results. |
To be honest, I'm not convinced this should go into 6.32: After branching, feature development is nominally over... |
Thanks for the review! The only tutorial for that feature silently gives wrong results: IMO this is a bugfix, not a new feature. Also considering that there are people who would clearly benefit from this being in 6.32 (analysis grand challenge). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! This will indeed be nice to have for the AGC implementation (although I am not sure yet if this is a good argument for bringing it to 6.32, the test silently returning wrong data is a much stronger argument).
There is some code modernization and class cleanup to be done (applying ROOT's camelcase everywhere for example).
bindings/pyroot/pythonizations/python/ROOT/_pythonization/_tmva/_tree_inference.py
Outdated
Show resolved
Hide resolved
bindings/pyroot/pythonizations/python/ROOT/_pythonization/_tmva/_tree_inference.py
Outdated
Show resolved
Hide resolved
bindings/pyroot/pythonizations/python/ROOT/_pythonization/_tmva/_tree_inference.py
Outdated
Show resolved
Hide resolved
bindings/pyroot/pythonizations/python/ROOT/_pythonization/_tmva/_tree_inference.py
Outdated
Show resolved
Hide resolved
bindings/pyroot/pythonizations/python/ROOT/_pythonization/_tmva/_tree_inference.py
Outdated
Show resolved
Hide resolved
tmva/tmva/src/RBDT.cxx
Outdated
detail::softmaxTransformInplace(out, nOut); | ||
} | ||
|
||
void TMVA::Experimental::RBDT::compute(const Value_t *array, Value_t *out) const |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
camelcase. Also, change the c-style array interface
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See this comment about array interface:
#15173 (comment)
tmva/tmva/src/RBDT.cxx
Outdated
} | ||
} | ||
|
||
TMVA::Experimental::RBDT::Value_t TMVA::Experimental::RBDT::evaluateBinary(const Value_t *array) const |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
camelcase. Also, change the c-style array interface
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See this comment about array interface:
#15173 (comment)
tmva/tmva/src/RBDT.cxx
Outdated
|
||
} // namespace | ||
|
||
RBDT TMVA::Experimental::RBDT::load_txt(std::string const &txtpath, std::vector<std::string> &features, int nClasses) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better explicit the full function signature
RBDT TMVA::Experimental::RBDT::load_txt(std::string const &txtpath, std::vector<std::string> &features, int nClasses) | |
TMVA::Experimental::RBDT TMVA::Experimental::RBDT::load_txt(std::string const &txtpath, std::vector<std::string> &features, int nClasses) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, camelcase
Hi @vepadulano, thanks for your review comments! I have addressed all of them, except for the CamelCase in the implementation details. I think since this code is copied from an external library and I might want to sync it again at some point, it should be kept as-is for now. Would that be okay for you? |
I would still like the repository to use the same casing everywhere, but probably if we look hard enough we can find other examples of functions not strictly following the CamelCase convention. For the purposes of this PR, as long as anything outside of |
Yep that's the case! Everything not following the convention is in the anonymous namespace. |
By the way, is xgboost installed in the new CI? If not, we are not testing these developments right now. |
If it's not in https://github.com/root-project/root/blob/master/requirements.txt then it's not installed |
tmva/tmva/inc/TMVA/RBDT.hxx
Outdated
std::vector<int> rootIndices_; | ||
std::vector<unsigned int> cutIndices_; | ||
std::vector<Value_t> cutValues_; | ||
std::vector<int> leftIndices_; | ||
std::vector<int> rightIndices_; | ||
std::vector<Value_t> responses_; | ||
std::vector<int> treeNumbers_; | ||
std::vector<Value_t> baseResponses_; | ||
Value_t baseScore_ = 0.0; | ||
bool logistic_ = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only noticed now that these data members have this trailing underscore convention, what does it mean exactly? We usually go with fDataMember
. Also, they are public data members of a public interface, even more reason to apply the usual convention
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since I originally followed the CMSSW code style, I used trailing underscores to mark public data members. Coincidentally, the xgboost python package follows almost the same convention (only with snake_case_
, as one can see in examples like xgb_model.n_classes_
).
The fDataMember
convention is for private data members AFAIK. Anyway the best solution is probably to make all these data members private anyway, and then we can go for fStyle
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have made all the data members private and use the fDataMember
style in my latest update!
tmva/tmva/src/RBDT.cxx
Outdated
|
||
namespace detail { | ||
|
||
typedef std::unordered_map<int, int> IndexMap; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the meaning of the keys and values in this map type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added some docs now, but I think it's not necessary to go into detail here.
In the future, I think we can get rid of it anyway: at some point the XGBoost devs took inspiration from my library to optimize their own BDT representation, if I remember the discussions on GitHub correctly.
So I think we can avoid this index translation in the future by parsing XGBoosts internal representation directly instead of going over the txt
dump.
This won't work, dependencies from |
Ok thanks! Note that I'll also have to change the order of imports in the tutorials because of this issue, which was independently reproduced by @vepadulano : |
With help from Jonas, I tested this PR in the context of AGC (for now locally on my PC with 8 cores) and it works - the ml histograms are produced. |
e610073
to
6f1de3b
Compare
Finally the changes in this PR are tested. The
The |
So my understanding is that this never worked, right? Which also makes me doubt that there are any users that try to use this feature. I think in that case, we should seriously ask ourselves if we should support that at all... In any case, I remain unconvinced that a |
Thanks Jonas for the valid concern. There are some people who tried to use RBDT, and even people who noticed the problem or wrong values for XGBoost models: https://root-forum.cern.ch/search?q=Rbdt Also, the plan to "consolidate RBDT" was announced in the PoW 2024 and at several presentations, so at this point dropping the feature is not an option I think. I agree with you that if a feature gave wrong results in the tutorial and nobody opened an issue about it for 5 years: doesn't exactly speak for the relevance of the feature (or maybe we didn't advertise it enough?). |
On top of your already very valid comments, I wanted to add that one point in favour of the relevance of such a feature is that it was initially present in CMSSW, so evaluation of BDTs in experiment frameworks is still a thing. Providing a thin layer to import the de-facto standard xgboost models into a C++ execution environment just seems the right thing to do. Since this feature was sort-of-present-but-utterly-broken before, it makes sense imho to fix it in our codebase. Then I guess the question of eventually removing the feature completely and have it as a standalone library is still valid (for ROOT7 perhaps). |
Actually I think users don't use the GitHub issue tracker that much in general. It's mostly used by devs, packagers, and experiment framework developers. So maybe it's normal we didn't hear about the problem there. |
I have summarized the current understanding of the situation with RBDT at #15197 Given that, I believe we should not merge this at the moment (and thus not push for it being in 6.32). An investigation is required regarding how to provide our Python users with what they probably want in this case: evaluating xgboost models in a ROOT application. This may be served by the changes of this PR. But an alternative is also forgoing an internal implementation a-la RBDT and just try to see if we can hook the xgboost Python API directly into e.g. the RDataFrame API. |
Awesome, thanks!
I don't think that the XGBoost API is thread safe (it doesn't have to be because of the GIL), so that might be a problem with implicit multi-threading in RDataFrame. I'm happy to continue the discussion later, but just for the record my 2 cents: I would still consider this a bugfix for 6.32. |
To summarize what was decided today in the ROOT team meeting: it is very unfortunate that this bug was discovered a week after branching, please avoid this in the future. Nevertheless, since this is fixing a feature that we advertise and the feature is very specific, this should be merged and backported to 6.32 if the CI can be made green (even disabling the failing test would be acceptable, since n - 1 tested platforms are better than zero). |
Consolidate RBDT as specified in the ROOT plan of work 2024. The backends of RBDT are replaced with a single new backend: the logic from the FastForest library: https://github.com/guitargeek/XGBoost-FastForest The logic in that library was originally taken from the GBRForest in CMSSW: https://github.com/cms-sw/cmssw/blob/master/CommonTools/MVAUtils/interface/GBRForestTools.h The interface remains the same, only that the template parameter specifying the backend is gone. This change adds support for unbalanced trees.
I have now installed the final review comments about the Furthermore, the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Thanks a lot Jonas for this nice contribution
Consolidate RBDT as specified in the ROOT plan of work 2024.
The backends of RBDT are replaced with a single new backend:
the logic from the FastForest library:
https://github.com/guitargeek/XGBoost-FastForest
The logic in that library was originally taken from the GBRForest in
CMSSW:
https://github.com/cms-sw/cmssw/blob/master/CommonTools/MVAUtils/interface/GBRForestTools.h
The interface remains the same, only that the template parameter
specifying the backend is gone.
This change adds support for unbalanced trees.
To see the fixed logic, as one run a modified version of the
tmva101_Training.py
tutorial (remember to runtutorials/tmva/tmva100_DataPreparation.py
first):Before, the max difference was
0.379578
, and with this PR it is5.9604645e-08
, compatible with floating point precision.FYI @martamaja10, @lmoneta, @vepadulano