
Feature request: faster GridsearchCV with XGBoost #605

Open
JohnZed opened this issue Jan 30, 2020 · 16 comments


JohnZed commented Jan 30, 2020

XGBoost constructs a DMatrix object from the input X, y data it
receives. On the first call to train with this DMatrix, it caches
statistics (histograms) about each column of the DMatrix. For wide
datasets, computing these stats can be very time-consuming. In one
example grid search on a 1000-column x 100k-row matrix, GridSearchCV
took 2.8x longer than manually building a DMatrix once per fold and
calling xgboost.train in a loop. Profiling shows that the time goes to
both DMatrix construction and stats computation.
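
For reference, a minimal sketch of the manual baseline described above (data sizes, the parameter grid, and round counts are all illustrative):

    import numpy as np
    import xgboost as xgb
    from sklearn.model_selection import KFold

    # Roughly the shape described above: 100k rows x 1000 columns.
    X = np.random.rand(100_000, 1000)
    y = np.random.randint(0, 2, size=100_000)

    param_grid = [{"tree_method": "hist", "max_depth": d, "eta": e}
                  for d in (4, 6) for e in (0.1, 0.3)]

    for train_idx, test_idx in KFold(n_splits=3).split(X):
        # Build each fold's DMatrix once; XGBoost caches its column
        # statistics on it, so every train() call below reuses them.
        dtrain = xgb.DMatrix(X[train_idx], label=y[train_idx])
        dtest = xgb.DMatrix(X[test_idx], label=y[test_idx])
        for params in param_grid:
            booster = xgb.train(params, dtrain, num_boost_round=50,
                                evals=[(dtest, "test")])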

GridSearchCV's CV caching essentially assumes pandas- or numpy-style
objects with separate X and y. A DMatrix bundles X and y in the same
object, so it needs a slight hack to fit into the current framework.

One option is to allow users to supply a custom replacement for the
_extract method in methods.py. For XGBoost, a user would supply something like:

    from sklearn.utils import _safe_indexing
    import xgboost as xgb

    def _extract_dmatrix(self, X, y, n, is_x=True, is_train=True):
        # Reuse a previously built DMatrix for this (fold, train/test) pair.
        if self.cache is not None and (n, is_x, is_train) in self.cache:
            return self.cache[n, is_x, is_train]
        # The labels travel inside the DMatrix, so the y-side call has
        # nothing separate to return.
        if not is_x:
            return None

        # splits[n] holds (train_indices, test_indices) for fold n.
        inds = self.splits[n][0] if is_train else self.splits[n][1]
        x_part = _safe_indexing(X, inds)
        y_part = _safe_indexing(y, inds)

        # TODO: in practice, there may be additional params like weights
        result = xgb.DMatrix(x_part, y_part)

        if self.cache is not None:
            self.cache[n, is_x, is_train] = result
        return result

This returns None for the y side and a DMatrix containing both X and y
on the is_x side, so it abuses the interface a bit, though it seems to
work. Note that this also requires a custom XGBoost wrapper whose fit
method accepts a DMatrix, but that wrapper is very short (a sketch follows).
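
For concreteness, one possible shape for that wrapper (the class name matches the XGBoostWrapper referred to below; the scoring scheme is an illustrative assumption):

    import xgboost as xgb
    from sklearn.base import BaseEstimator

    class XGBoostWrapper(BaseEstimator):
        """Minimal estimator whose fit() accepts a DMatrix directly."""

        def __init__(self, max_depth=6, eta=0.3, num_boost_round=50):
            self.max_depth = max_depth
            self.eta = eta
            self.num_boost_round = num_boost_round

        def fit(self, X, y=None):
            # X is the DMatrix returned by _extract_dmatrix; y is None
            # because the labels already live inside the DMatrix.
            params = {"max_depth": self.max_depth, "eta": self.eta}
            self.booster_ = xgb.train(params, X, self.num_boost_round)
            return self

        def score(self, X, y=None):
            # Illustrative: negate the eval metric so higher is better.
            return -float(self.booster_.eval(X).split(":")[-1])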

Thoughts? Better ways to do this and cache DMatrix objects? Thanks!
This relates to Issue #443, which is also looking for advice on XGBoost gridsearch.

Tagging @trivialfis for possible better ideas...

@RAMitchell

What if we allow you to supply a DMatrix as X to the sklearn fit call?
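
Hypothetically, usage could then look like the following (today the sklearn-style estimators expect array-like X and y, so fit(dtrain) is an assumed, not an existing, API):

    import xgboost as xgb

    dtrain = xgb.DMatrix(X, label=y)
    model = xgb.XGBClassifier()
    # Hypothetical: fit() reads both features and labels from the DMatrix.
    model.fit(dtrain)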


JohnZed commented Feb 1, 2020

That would clean up the sklearn/xgboost side nicely, but right now dask-ml can't support a DMatrix earlier in the pipeline for caching. In the PoC I posted, I think it would eliminate the ugly XGBoostWrapper but not the caching changes.

@trivialfis

Let me take a look at dask-ml's implementation.


mrocklin commented Feb 1, 2020

XGBoost is an important case, and if something simple enough can be done to support it, then that might be possible. It's a bit of a trade-off with special-casing, but certainly worth investigating.

Alternatively, we might expand this request and add some generality. Right now the logic for hyper-parameter optimization is fairly tied to Numpy-style slicing. This is great if you're using Numpy, but can introduce challenges if you're using anything else like Pandas, RAPIDS, CuPy, XGBoost, or whatever comes next. We might consider looking into what a minimal set of pluggable operations would look like that would allow for supporting any of these systems.

If we can easily separate the cross validation and hyper-parameter selection logic from the slicing logic then that would probably make it easy for other teams to add support for additional libraries somewhat cleanly, and in a way that optimized performance for that library.

Naively this seems doable, but I don't have much recent exposure to this codebase. cc'ing @TomAugspurger @stsievert , who might know more.


JohnZed commented Feb 3, 2020

Consider also the case where you need a non-standard CV split, like a time-based split for time-series data. Allowing a user-provided splitter (that can populate the cache) would allow all of those weird cases without having to implement them natively. For example:
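
As one concrete case, sklearn's TimeSeriesSplit could be passed straight through (estimator and param_grid are placeholders):

    import dask_ml.model_selection
    from sklearn.model_selection import TimeSeriesSplit

    search = dask_ml.model_selection.GridSearchCV(
        estimator,
        param_grid,
        cv=TimeSeriesSplit(n_splits=5),  # forward-chaining splits, no shuffling
    )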

@trivialfis

I will try to improve the scikit-learn interface so users don't have to construct the DMatrix.

For performance, wouldn't caching all these split matrices blow up GPU memory?

Lastly, as we are refactoring the DMatrix, in the future it will only contain the sketches, so for splitting the dataset, relying on cuDF/CuPy is the better way to go, something like the sketch below. There is no need to place specialized DMatrix code in dask.
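
A minimal sketch of that cuDF/CuPy-side splitting (X_gpu, y_gpu, and train_idx are illustrative names; assumes cupy arrays):

    import cupy as cp

    inds = cp.asarray(train_idx)   # move the fold's indices onto the GPU
    x_train = X_gpu[inds]          # fancy indexing runs entirely on the GPU
    y_train = y_gpu[inds]
    # Construct the DMatrix only at train time, after the split is done.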


JohnZed commented Feb 3, 2020

@trivialfis what can we do to cache the sketches from run to run on the same data if the DMatrix is not reused?


JohnZed commented Feb 5, 2020

Hi @TomAugspurger - I'm wondering if you have any thoughts on generalizing the caching API to allow custom user splitter functions, to support non-numpy-style cached items. Should we go ahead and put together a detailed PR, or are you opposed to the idea overall? Thanks!


TomAugspurger commented Feb 5, 2020 via email


mrocklin commented Feb 5, 2020

Allowing a user-provided splitter (that can populate the cache) would allow all of those weird cases without having to implement them natively

So to be clear, you're now suggesting not that we allow pluggable _extract functions, but that we allow user-defined splitting objects. Is this the same as the cv= input or is it something different? (Please excuse my ignorance here; I haven't logged much time looking at the internals.)


mrocklin commented Feb 5, 2020

So I wonder if we might be able to do something like the following:

    from sklearn.model_selection import StratifiedKFold

    class XGBoostStratifiedKFold(StratifiedKFold):
        def split(self, X, y, groups=None):
            ...  # custom code

    search = dask_ml.model_selection.GridSearchCV(cv=XGBoostStratifiedKFold())

Or something like that

If that works then maybe we could also build a few of these out in dask-ml and dispatch on them based on the input type of X? (we do similar dispatching in dask.dataframe and dask.array)
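
One way that dispatch could look, as a minimal sketch using functools.singledispatch (dask's own Dispatch utility would be another option; XGBoostStratifiedKFold is the class sketched above):

    from functools import singledispatch

    import xgboost as xgb
    from sklearn.model_selection import StratifiedKFold

    @singledispatch
    def default_splitter(X):
        # Plain numpy/pandas inputs fall back to stratified k-fold.
        return StratifiedKFold()

    @default_splitter.register(xgb.DMatrix)
    def _(X):
        # DMatrix inputs get the custom splitter sketched above.
        return XGBoostStratifiedKFold()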

@trivialfis

Sorry for being slow here. Let me try to summarize the problem, and please correct me if I'm wrong: we want a cache managed by dask, but for user-defined types. Is that correct? I think we encountered something similar before, when we wanted a place to store the concatenated partitions on each worker. So is it possible to design such a data-independent cache in dask?


mrocklin commented Feb 5, 2020

I don't think that core Dask will ever manage any sort of cache; that's probably out of scope. XGBoost seems to already cache histograms on the DMatrix objects, though, so that should be fine. Things would be OK if we could make dask-ml comfortable consuming DMatrix objects, but that would be incredible scope creep in the library, so instead we're looking into how we can make things pluggable, so that we can easily extend support for different input types.

@JohnZed is this also where your head is?


trivialfis commented Feb 5, 2020

on generalizing the caching API

I thought there was a caching API (an API one can use to cache things ... ). Apologies...


JohnZed commented Feb 6, 2020

@mrocklin - I wrote sloppily when I said custom splitters, sorry about that. The custom extract is what is needed. GridSearchCV already takes a custom cv parameter that can yield a custom splitting strategy. However, the _extract method in methods.py assumes numpy-style indexing where X and y are cached separately. (This is the part that interacts poorly with DMatrix; the splitting strategy is fine.)

So I think a custom extractor that returns a custom-extracted object (as in the PoC) is not too intrusive but provides a reasonable user experience with an understandable API. And it would use exactly the cache mechanism that lives in dask-ml grid search now, just caching a user-returned object like custom_extract_fn(X, y, indices) rather than the result of _safe_indexing(X, indices).


mrocklin commented Feb 6, 2020

I'm against the idea of making the _extract method in particular dispatchable, just because it is buried so deeply within a bunch of different systems.

  1. User calls GridSearchCV
  2. That calls build_graph_cv
  3. That uses the cv_split function in a graph
  4. Which does one check, and then creates a CVCache object
  5. Which calls extract
  6. Which calls _extract # <--- the thing you want to change

Making a contract around _extract seems unwise, given all of the internal details around it. We should probably attempt to avoid making really deep systems like this when we can, just because it makes things hard to generalize and modify in the future.

It seems like we might be able to remove the cv_split function entirely, and use the CVCache class instead (perhaps moving the checking within the constructor):

    def cv_split(cv, X, y, groups, is_pairwise, cache):
        check_consistent_length(X, y, groups)
        return CVCache(list(cv.split(X, y, groups)), is_pairwise, cache, _num_samples(X))

And then we accept a CVCache class as an optional input? That's only about one step away from the user, and they can modify a bunch of stuff if they want to, not just this one part of it. We're adding more systemic flexibility than a patch for one use case.
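
For illustration, such a subclass might look roughly like this (a sketch only: the cv_cache= keyword does not exist and is purely hypothetical, estimator and param_grid are placeholders, and the extract override reuses the DMatrix logic from the top of the thread):

    import dask_ml.model_selection
    import xgboost as xgb
    from dask_ml.model_selection.methods import CVCache
    from sklearn.utils import _safe_indexing

    class DMatrixCVCache(CVCache):
        """Sketch: cache a whole DMatrix per (fold, train/test) pair."""
        def extract(self, X, y, n, is_x=True, is_train=True):
            if not is_x:
                return None  # labels travel inside the DMatrix
            key = (n, is_x, is_train)
            if self.cache is not None and key in self.cache:
                return self.cache[key]
            inds = self.splits[n][0] if is_train else self.splits[n][1]
            result = xgb.DMatrix(_safe_indexing(X, inds),
                                 _safe_indexing(y, inds))
            if self.cache is not None:
                self.cache[key] = result
            return result

    # Hypothetical API: pass the cache class through to cv_split.
    search = dask_ml.model_selection.GridSearchCV(
        estimator, param_grid, cv_cache=DMatrixCVCache)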

However if we do this then we're expanding the API contract around caching, so now someone would need to think somewhat hard about what the possible use cases for this might be, and make sure that the API we're publishing is unlikely to make any of those future use cases difficult.

To be clear, I'm not green-lighting anything (I don't have enough on-the-ground ML experience to do that confidently). I'm just trying to give a sense of what things look like to me at the moment.
