
Feature request: faster GridsearchCV with XGBoost #605

Open
JohnZed opened this issue Jan 30, 2020 · 16 comments


JohnZed commented Jan 30, 2020

XGBoost constructs a DMatrix object from the input X, y data it
receives. On the first call to train with this DMatrix, it caches
statistics (histograms) about each column of the DMatrix. For wide
datasets, computing these stats can be very time-consuming. In one
example grid search on a 1000-column x 100k-row matrix, GridSearchCV
took 2.8x longer than manually building a DMatrix once per fold and
calling xgboost.train in a loop. Profiling shows that the time goes to
both DMatrix construction and stats computation.
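
For reference, a minimal sketch of the manual baseline described above (data sizes, the parameter grid, and round counts are all illustrative):

    import numpy as np
    import xgboost as xgb
    from sklearn.model_selection import KFold

    # Roughly the shape described above: 100k rows x 1000 columns.
    X = np.random.rand(100_000, 1000)
    y = np.random.randint(0, 2, size=100_000)

    param_grid = [{"tree_method": "hist", "max_depth": d, "eta": e}
                  for d in (4, 6) for e in (0.1, 0.3)]

    for train_idx, test_idx in KFold(n_splits=3).split(X):
        # Build each fold's DMatrix once; XGBoost caches its column
        # statistics on it, so every train() call below reuses them.
        dtrain = xgb.DMatrix(X[train_idx], label=y[train_idx])
        dtest = xgb.DMatrix(X[test_idx], label=y[test_idx])
        for params in param_grid:
            booster = xgb.train(params, dtrain, num_boost_round=50,
                                evals=[(dtest, "test")])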

GridSearchCV's CV caching essentially assumes pandas- or numpy-style
objects with separate X and y. A DMatrix bundles X and y in the same
object, so it needs a slight hack to fit into the current framework.

One option is to allow users to supply a custom replacement for the
_extract method in methods.py. For XGBoost, a user would supply something like:

    from sklearn.utils import _safe_indexing
    import xgboost as xgb

    def _extract_dmatrix(self, X, y, n, is_x=True, is_train=True):
        # Reuse a previously built DMatrix for this (fold, train/test) pair.
        if self.cache is not None and (n, is_x, is_train) in self.cache:
            return self.cache[n, is_x, is_train]
        # The labels travel inside the DMatrix, so the y-side call has
        # nothing separate to return.
        if not is_x:
            return None

        # splits[n] holds (train_indices, test_indices) for fold n.
        inds = self.splits[n][0] if is_train else self.splits[n][1]
        x_part = _safe_indexing(X, inds)
        y_part = _safe_indexing(y, inds)

        # TODO: in practice, there may be additional params like weights
        result = xgb.DMatrix(x_part, y_part)

        if self.cache is not None:
            self.cache[n, is_x, is_train] = result
        return result

This returns None for the y side and a DMatrix containing both X and y
on the is_x side, so it abuses the interface a bit, though it seems to
work. Note that this also requires a custom XGBoost wrapper whose fit
method accepts a DMatrix, but that wrapper is very short (a sketch follows).
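
For concreteness, one possible shape for that wrapper (the class name matches the XGBoostWrapper referred to below; the scoring scheme is an illustrative assumption):

    import xgboost as xgb
    from sklearn.base import BaseEstimator

    class XGBoostWrapper(BaseEstimator):
        """Minimal estimator whose fit() accepts a DMatrix directly."""

        def __init__(self, max_depth=6, eta=0.3, num_boost_round=50):
            self.max_depth = max_depth
            self.eta = eta
            self.num_boost_round = num_boost_round

        def fit(self, X, y=None):
            # X is the DMatrix returned by _extract_dmatrix; y is None
            # because the labels already live inside the DMatrix.
            params = {"max_depth": self.max_depth, "eta": self.eta}
            self.booster_ = xgb.train(params, X, self.num_boost_round)
            return self

        def score(self, X, y=None):
            # Illustrative: negate the eval metric so higher is better.
            return -float(self.booster_.eval(X).split(":")[-1])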

Thoughts? Better ways to do this and cache DMatrix objects? Thanks!
This relates to Issue #443, which is also looking for advice on XGBoost gridsearch.

Tagging @trivialfis for possible better ideas...

@RAMitchell

What if we allow you to supply a DMatrix as X to the sklearn fit call?
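
Hypothetically, usage could then look like the following (today the sklearn-style estimators expect array-like X and y, so fit(dtrain) is an assumed, not an existing, API):

    import xgboost as xgb

    dtrain = xgb.DMatrix(X, label=y)
    model = xgb.XGBClassifier()
    # Hypothetical: fit() reads both features and labels from the DMatrix.
    model.fit(dtrain)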


JohnZed commented Feb 1, 2020

That would clean up the sklearn/xgboost side nicely, but right now dask-ml can't support a DMatrix earlier in the pipeline for caching. In the PoC I posted, I think it would eliminate the ugly XGBoostWrapper but not the caching changes.

@trivialfis

Let me take a look at dask-ml's implementation.


mrocklin commented Feb 1, 2020

XGBoost is an important case, and if something simple enough can be done to support it, then that might be possible. It's a bit of a trade-off with special-casing, but certainly worth investigating.

Alternatively, we might expand this request and add some generality. Right now the logic for hyper-parameter optimization is fairly tied to Numpy-style slicing. This is great if you're using Numpy, but can introduce challenges if you're using anything else like Pandas, RAPIDS, CuPy, XGBoost, or whatever comes next. We might consider looking into what a minimal set of pluggable operations would look like that would allow for supporting any of these systems.

If we can easily separate the cross validation and hyper-parameter selection logic from the slicing logic then that would probably make it easy for other teams to add support for additional libraries somewhat cleanly, and in a way that optimized performance for that library.

Naively this seems doable, but I don't have much recent exposure to this codebase. cc'ing @TomAugspurger @stsievert , who might know more.


JohnZed commented Feb 3, 2020

Consider also the case where you need a non-standard CV split, like a time-based split for time-series data. Allowing a user-provided splitter (that can populate the cache) would allow all of those weird cases without having to implement them natively. For example:
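
As one concrete case, sklearn's TimeSeriesSplit could be passed straight through (estimator and param_grid are placeholders):

    import dask_ml.model_selection
    from sklearn.model_selection import TimeSeriesSplit

    search = dask_ml.model_selection.GridSearchCV(
        estimator,
        param_grid,
        cv=TimeSeriesSplit(n_splits=5),  # forward-chaining splits, no shuffling
    )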

@trivialfis

I will try to improve the scikit-learn interface so users don't have to construct the DMatrix.

For performance, wouldn't caching all these split matrices blow up GPU memory?

Lastly, as we are refactoring the DMatrix, in the future it will only contain the sketches, so for splitting the dataset, relying on cuDF/CuPy is the better way to go, something like the sketch below. There is no need to place specialized DMatrix code in dask.
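
A minimal sketch of that cuDF/CuPy-side splitting (X_gpu, y_gpu, and train_idx are illustrative names; assumes cupy arrays):

    import cupy as cp

    inds = cp.asarray(train_idx)   # move the fold's indices onto the GPU
    x_train = X_gpu[inds]          # fancy indexing runs entirely on the GPU
    y_train = y_gpu[inds]
    # Construct the DMatrix only at train time, after the split is done.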


JohnZed commented Feb 3, 2020

@trivialfis what can we do to cache the sketches from run to run on the same data if the DMatrix is not reused?


JohnZed commented Feb 5, 2020

Hi @TomAugspurger - I'm wondering if you have any thoughts on generalizing the caching API to allow custom user splitter functions, to support non-numpy-style cached items. Should we go ahead and put together a detailed PR, or are you opposed to the idea overall? Thanks!


TomAugspurger commented Feb 5, 2020 via email


mrocklin commented Feb 5, 2020

Allowing a user-provided splitter (that can populate the cache) would allow all of those weird cases without having to implement them natively

So to be clear, you're now suggesting not that we allow pluggable _extract functions, but that we allow user-defined splitting objects. Is this the same as the cv= input or is it something different? (Please excuse my ignorance here; I haven't logged much time looking at the internals.)


mrocklin commented Feb 5, 2020

So I wonder if we might be able to do something like the following:

    from sklearn.model_selection import StratifiedKFold

    class XGBoostStratifiedKFold(StratifiedKFold):
        def split(self, X, y, groups=None):
            ...  # custom code

    search = dask_ml.model_selection.GridSearchCV(cv=XGBoostStratifiedKFold())

Or something like that

If that works then maybe we could also build a few of these out in dask-ml and dispatch on them based on the input type of X? (we do similar dispatching in dask.dataframe and dask.array)
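
One way that dispatch could look, as a minimal sketch using functools.singledispatch (dask's own Dispatch utility would be another option; XGBoostStratifiedKFold is the class sketched above):

    from functools import singledispatch

    import xgboost as xgb
    from sklearn.model_selection import StratifiedKFold

    @singledispatch
    def default_splitter(X):
        # Plain numpy/pandas inputs fall back to stratified k-fold.
        return StratifiedKFold()

    @default_splitter.register(xgb.DMatrix)
    def _(X):
        # DMatrix inputs get the custom splitter sketched above.
        return XGBoostStratifiedKFold()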

@trivialfis

Sorry for being slow here. Let me try to summarize the problem, and please correct me if I'm wrong: we want a cache managed by dask, but for user-defined types. Is that correct? I think we encountered something similar before, when we wanted a place to store the concatenated partitions on each worker. So is it possible to design such a data-independent cache in dask?


mrocklin commented Feb 5, 2020

I don't think that core Dask will ever manage any sort of cache; that's probably out of scope. XGBoost seems to already cache histograms on the DMatrix objects, though, so that should be fine. Things would be OK if we could make dask-ml comfortable consuming DMatrix objects, but that would be incredible scope creep in the library, so instead we're looking into how we can make things pluggable, so that we can easily extend support for different input types.

@JohnZed is this also where your head is?


trivialfis commented Feb 5, 2020

on generalizing the caching API

I thought there was a caching API (an API one can use to cache things ... ). Apologies...


JohnZed commented Feb 6, 2020

@mrocklin - I wrote sloppily when I said custom splitters, sorry about that. The custom extract is what is needed. GridSearchCV already takes a custom cv parameter that can yield a custom splitting strategy. However, the _extract method in methods.py assumes numpy-style indexing where X and y are cached separately. (This is the part that interacts poorly with DMatrix; the splitting strategy is fine.)

So I think a custom extractor that returns a custom-extracted object (as in the PoC) is not too intrusive but provides a reasonable user experience with an understandable API. And it would use exactly the cache mechanism that lives in dask-ml grid search now, just caching a user-returned object like custom_extract_fn(X, y, indices) rather than the result of _safe_indexing(X, indices).


mrocklin commented Feb 6, 2020

I'm against the idea of making the _extract method in particular dispatchable, just because it is buried so deeply within a bunch of different systems.

  1. User calls GridSearchCV
  2. That calls build_graph_cv
  3. That uses the cv_split function in a graph
  4. Which does one check, and then creates a CVCache object
  5. Which calls extract
  6. Which calls _extract # <--- the thing you want to change

Making a contract around _extract seems unwise, given all of the internal details around it. We should probably attempt to avoid making really deep systems like this when we can, just because it makes things hard to generalize and modify in the future.

It seems like we might be able to remove the cv_split function entirely, and use the CVCache class instead (perhaps moving the checking within the constructor):

    def cv_split(cv, X, y, groups, is_pairwise, cache):
        check_consistent_length(X, y, groups)
        return CVCache(list(cv.split(X, y, groups)), is_pairwise, cache, _num_samples(X))

And then we accept a CVCache class as an optional input? That's only about one step away from the user, and they can modify a bunch of stuff if they want to, not just this one part of it. We're adding more systemic flexibility than a patch for one use case.
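
For illustration, such a subclass might look roughly like this (a sketch only: the cv_cache= keyword does not exist and is purely hypothetical, estimator and param_grid are placeholders, and the extract override reuses the DMatrix logic from the top of the thread):

    import dask_ml.model_selection
    import xgboost as xgb
    from dask_ml.model_selection.methods import CVCache
    from sklearn.utils import _safe_indexing

    class DMatrixCVCache(CVCache):
        """Sketch: cache a whole DMatrix per (fold, train/test) pair."""
        def extract(self, X, y, n, is_x=True, is_train=True):
            if not is_x:
                return None  # labels travel inside the DMatrix
            key = (n, is_x, is_train)
            if self.cache is not None and key in self.cache:
                return self.cache[key]
            inds = self.splits[n][0] if is_train else self.splits[n][1]
            result = xgb.DMatrix(_safe_indexing(X, inds),
                                 _safe_indexing(y, inds))
            if self.cache is not None:
                self.cache[key] = result
            return result

    # Hypothetical API: pass the cache class through to cv_split.
    search = dask_ml.model_selection.GridSearchCV(
        estimator, param_grid, cv_cache=DMatrixCVCache)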

However if we do this then we're expanding the API contract around caching, so now someone would need to think somewhat hard about what the possible use cases for this might be, and make sure that the API we're publishing is unlikely to make any of those future use cases difficult.

To be clear, I'm not green-lighting anything (I don't have enough on-the-ground ML experience to do that confidently). I'm just trying to give a sense of what things look like to me at the moment.
