-
Notifications
You must be signed in to change notification settings - Fork 564
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add feature to print forest shape in FIL upon importing #3763
Conversation
Should this feature be incorporated into Treelite? |
Yes, that would be great. My understanding is treelite would only have a model dumping function, and simpler stats? In which case, the code to make leaf/branch depth histogram would need to live somewhere. Hence, I thought FIL would serve. |
I also don't know whether the parameters should be part of |
cpp/src/fil/fil.cu
Outdated
int min_depth = -1, leaves_times_depth = 0, total_branches = 0, | ||
total_leaves = 0; | ||
// 64-bit Fowler/Noll/Vo | ||
size_t fingerprint = 14695981039346656037l; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fingerprinting isn't necessary here.
However, if you still want to do it, I would suggest the following approach:
- Build the data structure containing the forest statistics.
- Convert it into an array of bytes by successively appending the bytes representing each integer in the data structure.
- Compute a standard hash function on the array. A standard cryptographic hash function is definitely preferred.
In this way, you can split this code into multiple functions, each performing a single piece of work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to do this, because when sweeping models for benchmarking, it let me quickly understand which models are different and which are the same. Same with putting the results into a spreadsheet - easier to compare runs to each other if the fingerprint is just equal.
Co-authored-by: Andy Adinets <adinetz@gmail.com>
Waiting until #3800 is merged to push the conflict-resolved version |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved provided the comments are addressed.
cpp/src/fil/fil.cu
Outdated
// model_inner is of the concrete type tl::ModelImpl<T, L> | ||
from_treelite(handle, pforest, model_inner, tl_params); | ||
}); | ||
} | ||
|
||
// allocates caller-owned char* using malloc() | ||
template <typename T, typename L, typename N> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is a separate type required for nodes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, this is a function template and node_t is part of its signature. auto is not allowed here.
{{3}, 0xaf63bd4c8601b7dc, 0xaf63bd4c ^ 0x8601b7dc}, | ||
{{1, 2}, 0x08328707b4eb6e38, 0x08328707 ^ 0xb4eb6e38}, // test #5 | ||
{{2, 1}, 0x08328607b4eb6c86, 0x08328607 ^ 0xb4eb6c86}, | ||
{{1, 2, 3}, 0xd949aa186c0c492b, 0xd949aa18 ^ 0x6c0c492b}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aren't there any other test inputs available?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, there are, but those do not exercise the properties that FIL needs. If you want to build an FNV library to cater to all kinds of applications, we can start a design doc and discuss the scope there?
std::vector<fnv_vec_t> fnv_vecs = { | ||
{{}, 14695981039346656037ull, 0xcbf29ce4 ^ 0x84222325}, // test #0 | ||
{{0}, 0xaf63bd4c8601b7df, 0xaf63bd4c ^ 0x8601b7df}, | ||
{{1}, 0xaf63bd4c8601b7de, 0xaf63bd4c ^ 0x8601b7de}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the 32-bit output always provided as a bit-wise xor of 2 different constants? A single constant would be better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because it's intended to be the xor of two halves of the 64-bit output. This is obvious if you look at the numbers in this format, else the test vector will need to be verified manually. Shall I leave a comment about this in the code?
Removing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had 1 very minor comment about a header and one question/comment regarding how the API looks like from a python perspective
Codecov Report
@@ Coverage Diff @@
## branch-21.06 #3763 +/- ##
===============================================
Coverage ? 85.41%
===============================================
Files ? 227
Lines ? 17317
Branches ? 0
===============================================
Hits ? 14791
Misses ? 2526
Partials ? 0
Flags with carried forward coverage won't be shown. Click here to find out more. Continue to review full report at Codecov.
|
@gpucibot merge |
Include cuML header directory that is currently nested one layer too deep (to be fixed by rapidsai/cuml#3901) Ignore unused variable warnings due to unused variable in fil.h introduced by rapidsai/cuml#3763
* Temporary fix for upstream cuML issues Include cuML header directory that is currently nested one layer too deep (to be fixed by rapidsai/cuml#3901) Ignore unused variable warnings due to unused variable in fil.h introduced by rapidsai/cuml#3763 * Remove extra include path following upstream fix * Update to CalVer for cuML
When benchmarking forests, it's important to know what we're benchmarking, regardless of the source of the forest. E.g. are we still importing an equivalent-sized forest after updating xgboost version? This is also specific to understanding FIL performance on a model, as opposed to generic forest statistics. This allows to obtain a couple of numbers from FIL, both pertaining to the model and the chosen storage type, post-import size, etc. A sample output looks like this: ``` Depth hist: depth branches leaves nodes 0 5 0 5 1 10 0 10 2 20 0 20 3 19 21 40 4 15 23 38 5 12 18 30 6 2 22 24 7 0 4 4 Total: branches: 83 leaves: 88 nodes: 171 Avg nodes per tree: 34.2 Leaf depth: min: 3 avg 4.6 max 7 Depth histogram fingerprint: 4752b3a7893d5e22 DENSE model size 0.01 MB ``` The feature is breaking because one new C parameter is added to `treelite_params_t`. Python API is not broken, because it sets the backward-compatible defaults. Authors: - https://github.com/levsnv Approvers: - Andy Adinets (https://github.com/canonizer) - Dante Gama Dessavre (https://github.com/dantegd) URL: rapidsai#3763
When benchmarking forests, it's important to know what we're benchmarking, regardless of the source of the forest. E.g. are we still importing an equivalent-sized forest after updating xgboost version?
This is also specific to understanding FIL performance on a model, as opposed to generic forest statistics.
This allows to obtain a couple of numbers from FIL, both pertaining to the model and the chosen storage type, post-import size, etc.
A sample output looks like this:
The feature is breaking because one new C parameter is added to
treelite_params_t
. Python API is not broken, because it sets the backward-compatible defaults.