Feature request: faster GridsearchCV with XGBoost #605
What if we allow you to supply a DMatrix as X to the sklearn `fit` call?
That will clean up the sklearn/xgboost side nicely, but right now dask-ml can't support a DMatrix earlier in the pipeline for caching. In the PoC I posted, I think it would eliminate the ugly XGBoostWrapper but not the caching changes.
Let me take a look into dask-ml's implementation.
XGBoost is an important case, and if something simple enough can be done to support it, then that might be possible. It's a bit of a trade-off with special-casing, but certainly worth investigating.

Alternatively, we might expand this request and add some generality. Right now the logic for hyper-parameter optimization is fairly tied to Numpy-style slicing. This is great if you're using Numpy, but can introduce challenges if you're using anything else like Pandas, RAPIDS, CuPy, XGBoost, or whatever comes next. We might consider looking into what a minimal set of pluggable operations would look like that would allow for supporting any of these systems (see the sketch after this comment). If we can cleanly separate the cross-validation and hyper-parameter selection logic from the slicing logic, then that would probably make it easy for other teams to add support for additional libraries somewhat cleanly, and in a way that optimizes performance for each library. Naively this seems doable, but I don't have much recent exposure to this codebase. cc'ing @TomAugspurger @stsievert, who might know more.
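The comment above gestures at a minimal pluggable contract without specifying one; here is one possible sketch. Everything in it is hypothetical: nothing named `SliceBackend`, `extract`, or `cache_key` exists in dask-ml today.

```python
from typing import Any, Protocol, Tuple

class SliceBackend(Protocol):
    """Hypothetical contract for library-specific slicing logic."""

    def extract(self, X: Any, y: Any, indices: Any,
                is_train: bool) -> Tuple[Any, Any]:
        """Return the (X, y) slice for one CV fold."""
        ...

    def cache_key(self, X: Any, indices: Any) -> Any:
        """Return a hashable key so fold slices can be cached and reused."""
        ...
```

Under such a contract, dask-ml would own the cross-validation and search logic, while a Numpy, cudf, or DMatrix backend would own only slicing and cache identity.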
Consider also the case where you need a non-standard CV split, like a time-based split for time-series data. Allowing a user-provided splitter (that can populate the cache) would cover all of those weird cases without having to implement them natively. (A small illustration follows.)
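For the time-ordered case specifically, scikit-learn already ships a splitter that can be passed as `cv=`; the open question in this thread is only whether such a splitter can also populate the cache. A toy illustration:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(10, 2)  # toy, time-ordered features
y = np.arange(10)

# Each fold trains only on rows that precede its test window.
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    print("train:", train_idx, "test:", test_idx)
```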
I will try to improve the scikit-learn interface so users don't have to construct the DMatrix. For performance, wouldn't caching all these split matrices blow up GPU memory? Lastly, as we are refactoring the DMatrix, it will only contain the sketches in the future, so for splitting the dataset, relying on cudf/cupy is the better way to go. No need to place specialized code for DMatrix in dask.
@trivialfis what can we do to cache the sketches from run to run on the same data if the DMatrix is not reused?
Hi, @TomAugspurger - I'm wondering if you have any thoughts on generalizing the caching API to allow custom user splitter functions to support non-numpy-style cached items. Should we go ahead and come out with a detailed PR, or are you opposed to the idea overall? Thanks!
Apologies for the delay, I'm a bit busy with pandas things currently. Will take some time to digest this later in the week.
So to be clear, you're now suggesting not that we allow pluggable extraction, but that the CV splitter object itself becomes the pluggable piece?
So I wonder if we might be able to do something like the following:

```python
from sklearn.model_selection._split import StratifiedKFold

class XGBoostStratifiedKFold(StratifiedKFold):
    def split(self, X, y):
        ...  # custom code

search = dask_ml.model_selection.GridSearchCV(cv=XGBoostStratifiedKFold())
```

Or something like that. If that works, then maybe we could also build a few of these out in dask-ml and dispatch on them based on the input type of X? (We do similar dispatching in dask.dataframe and dask.array; see the sketch below.)
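A rough sketch of what that type-based dispatch could look like. The registry and both function names are hypothetical; only `KFold` is a real scikit-learn class:

```python
from sklearn.model_selection import KFold

# Hypothetical registry mapping input types to CV splitter classes.
_CV_DISPATCH = {}

def register_cv(input_type):
    def decorator(cv_cls):
        _CV_DISPATCH[input_type] = cv_cls
        return cv_cls
    return decorator

def default_cv_for(X, n_splits=5):
    # Walk the MRO so subclasses inherit their parent's registration.
    for typ in type(X).__mro__:
        if typ in _CV_DISPATCH:
            return _CV_DISPATCH[typ](n_splits=n_splits)
    return KFold(n_splits=n_splits)
```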
Sorry for being slow here. Let me try to summarize the problem, and please correct me if I'm wrong. The problem is that we want a cache managed by dask, but for user-defined types. Is this correct? I think we encountered something similar before, when we wanted a place to store the concatenated partitions on each worker. So is it possible to design such a data-independent cache in dask?
I don't think that core Dask will ever manage any sort of cache; that's probably out of scope. XGBoost seems to already cache histograms on the DMatrix objects, though, so that should be fine. Things would be OK if we could make dask-ml comfortable consuming DMatrix objects, but that would be incredible scope creep in the library, so instead we're looking into how we can make things pluggable, so that we can easily extend support for different input types. @JohnZed is this also where your head is?
I thought there was a caching API (that is, an API one can use to cache things...). Apologies...
@mrocklin - I wrote sloppily when I said custom splitters, sorry about that. The custom extract is what is needed; GridSearchCV already takes a custom `cv` splitter.
I'm against the idea of making the DMatrix part of the scikit-learn interface.
Making a contract around `_extract` seems plausible. It seems like we might be able to remove the hard-coded extraction here:

dask-ml/dask_ml/model_selection/methods.py, lines 156 to 158 at ff1b0af

and then accept a user-provided replacement in its place. However, if we do this then we're expanding the API contract around caching, so someone would need to think somewhat hard about what the possible use cases for this might be, and make sure that the API we're publishing is unlikely to make any of those future use cases difficult.

To be clear, I'm not green-lighting anything (I don't have enough on-the-ground ML experience to do that confidently). I'm just trying to give a sense of what things look like to me at the moment.
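Concretely, a pluggable extract might look something like the sketch below. The `is_x`/`is_train` flags come from this thread; the `extract=` keyword and the exact signature are assumptions, not existing dask-ml API:

```python
def numpy_style_extract(X, y, indices, is_x=True, is_train=True):
    # Roughly the behavior being made pluggable: slice X or y by the
    # fold's row indices (assumed signature; the real one may differ).
    arr = X if is_x else y
    return None if arr is None else arr[indices]

# Hypothetical usage: GridSearchCV(..., extract=numpy_style_extract),
# letting users swap in a DMatrix- or cudf-aware implementation.
```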
XGBoost constructs a DMatrix object from the input X, y data it receives. On its first call to `train` with this DMatrix, it caches statistics (histograms) about each column in the DMatrix. For wide datasets, computing these stats can be very time-consuming. In one example grid search with a 1000-column matrix (x 100k rows), using GridSearchCV took 2.8x longer than building a DMatrix once manually for each fold and calling `xgboost.train` in a loop (sketched below). Profiling shows that the time goes to both DMatrix building and stats computation.
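For reference, the hand-rolled baseline described above looks roughly like this; the data, parameters, and fold count here are toy stand-ins, not the actual benchmark:

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import KFold, ParameterGrid

X = np.random.rand(1000, 50)
y = np.random.rand(1000)

# Build each fold's DMatrix once so its histogram sketches are computed
# once, then reuse it across every hyper-parameter combination.
folds = []
for train_idx, test_idx in KFold(n_splits=3).split(X):
    folds.append((xgb.DMatrix(X[train_idx], label=y[train_idx]),
                  xgb.DMatrix(X[test_idx], label=y[test_idx])))

for params in ParameterGrid({"max_depth": [4, 6], "eta": [0.1, 0.3]}):
    for dtrain, dtest in folds:
        booster = xgb.train({**params, "objective": "reg:squarederror"},
                            dtrain, num_boost_round=50,
                            evals=[(dtest, "test")], verbose_eval=False)
```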
GridSearchCV's CV caching essentially works with pandas- or numpy-style objects with separate X and y. A DMatrix includes both X and y in the same object, so it needs a slight hack to fit into the current framework.
One option is to allow users to supply a custom replacement for the `_extract` method in methods.py. For XGBoost, a user would supply something like the sketch below. It returns None for the y and, when `is_x` is set, a DMatrix containing both X and y, so it abuses the interface a bit, though it seems to work. Note that this also requires a custom XGBoost wrapper with a `fit` method accepting a DMatrix, but that is very short.
Thoughts? Better ways to do this and cache DMatrix objects? Thanks!
This relates to Issue #443, which is also looking for advice on XGBoost gridsearch.
Tagging @trivialfis for possible better ideas...