Case Study: Criteo dataset #295

Open
@mrocklin

The Criteo dataset is a 1TB dump of features around advertisements and whether or not someone clicked on the ad. It has both dense and categorical/sparse features. I believe that the data is freely available on Azure.

There are some things that we might want to do with this dataset that are representative of other problems:

  1. Logistic regression on large sparse data. This could use existing algorithms like L-BFGS or ADMM, or it could use the more recent Incremental SGD work. It would be useful to compare the effectiveness of these algorithms.
  2. We could also add hyperparameter optimization.
  3. Gradient boosted trees, presumably with the dask-xgboost connection. This raises a couple of questions. Can XGBoost support categorical data or scipy.sparse arrays? Or perhaps we have to provide a column of integers?
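
To make the encoding question in item 3 concrete, here is a minimal sketch of the two representations mentioned: a column of integer codes, and a one-hot `scipy.sparse` matrix built from those codes. The helper names (`integer_codes`, `one_hot_sparse`) and the sample category strings are made up for illustration; this says nothing about which form XGBoost or dask-xgboost actually accepts.

```python
import numpy as np
import scipy.sparse as sp


def integer_codes(column):
    """Map each distinct category to an integer code in 0..n_categories-1."""
    categories, codes = np.unique(column, return_inverse=True)
    return categories, codes


def one_hot_sparse(codes, n_categories):
    """Build a CSR one-hot matrix with exactly one nonzero per row."""
    n_rows = len(codes)
    data = np.ones(n_rows, dtype=np.float32)
    rows = np.arange(n_rows)
    return sp.csr_matrix((data, (rows, codes)), shape=(n_rows, n_categories))


# Hypothetical category tokens, invented for this example.
column = np.array(["68fd1e64", "05db9164", "68fd1e64", "8cf07265"])

categories, codes = integer_codes(column)   # the "column of integers" option
X = one_hot_sparse(codes, len(categories))  # the scipy.sparse option
```

The integer column stays compact but imposes an arbitrary ordering on the categories, while the sparse one-hot form avoids the ordering at the cost of one column per distinct value.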

As always, it might be a good start to just download a little bit of the Criteo dataset (I think that each day of data is available separately) and work with sklearn directly to establish a baseline.
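
A rough sketch of what such a baseline could look like, using synthetic stand-in rows rather than real Criteo data. The column layout (13 dense and 26 categorical features), the `I*`/`C*` feature names, the hash width, and the helper functions are all assumptions for illustration; the hashing trick is just one plausible way to get a sparse matrix that `LogisticRegression` can consume.

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

N_DENSE, N_CAT = 13, 26  # assumed layout, not stated in this issue


def make_fake_rows(n_rows, seed=0):
    """Generate synthetic stand-in data: labels, dense counts, categorical ids."""
    rng = np.random.default_rng(seed)
    dense = rng.poisson(3.0, size=(n_rows, N_DENSE))
    cats = rng.integers(0, 50, size=(n_rows, N_CAT))
    # Let clicks depend weakly on one categorical column so there is some signal.
    p_click = np.where(cats[:, 0] < 10, 0.6, 0.2)
    labels = (rng.random(n_rows) < p_click).astype(int)
    return labels, dense, cats


def to_feature_dicts(dense, cats):
    """One dict per row: dense values are log-scaled, categoricals become 'name=value' keys."""
    rows = []
    for d, c in zip(dense, cats):
        row = {f"I{i}": float(np.log1p(v)) for i, v in enumerate(d)}
        row.update({f"C{j}={v}": 1.0 for j, v in enumerate(c)})
        rows.append(row)
    return rows


labels, dense, cats = make_fake_rows(2000)

# Hash every feature name into a fixed-width sparse matrix.
hasher = FeatureHasher(n_features=2**16, input_type="dict")
X = hasher.transform(to_feature_dicts(dense, cats))

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0
)

# L-BFGS logistic regression works directly on the sparse input.
clf = LogisticRegression(solver="lbfgs", max_iter=200)
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_test)
test_loss = log_loss(y_test, proba)
```

Swapping `make_fake_rows` for a reader over one downloaded day of data would turn this into an actual baseline to compare the distributed approaches against.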

This came out of a conversation with @ogrisel.

    Labels

    Case Study: Large-scale example as stress tests
