Case Study: Criteo dataset

The Criteo dataset is a 1TB dump of features around advertisements and whether or not someone clicked on the ad.  It has both dense and categorical/sparse data.  I believe that the data is freely available on Azure.

There are some things that we might want to do with this dataset that are representative of other problems:

1.  Logistic regression on large sparse data.  This could use existing algorithms like L-BFGS or ADMM, or it could use the more recent Incremental SGD work.  It would be useful to compare the effectiveness of these algorithms.
2.  We could also add hyperparameter optimization.
3.  Gradient boosted trees, presumably with the dask-xgboost connection.  This raises a couple of questions.  Can XGBoost support categorical data or scipy.sparse arrays?  Or do we perhaps have to provide a column of integers?
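To make the question in item 3 concrete, here is a minimal sketch of the two representations it mentions for a single categorical column: a column of integer codes and a one-hot `scipy.sparse` matrix. The toy values in `raw` are made up for illustration and are not taken from the Criteo data, and the sketch deliberately stops short of calling XGBoost, since which input it accepts is the open question.

```python
import numpy as np
from scipy import sparse

# Hypothetical toy categorical column (illustrative values only;
# the empty string stands in for a missing value).
raw = np.array(["a1f", "07c", "a1f", "", "9e2", "07c"])

# Option A: a column of integers, one code per distinct category.
# np.unique sorts the categories and returns each row's index into them.
categories, codes = np.unique(raw, return_inverse=True)

# Option B: a one-hot scipy.sparse CSR matrix built from the same codes,
# with exactly one nonzero entry per row.
n_rows = raw.shape[0]
one_hot = sparse.csr_matrix(
    (np.ones(n_rows), (np.arange(n_rows), codes)),
    shape=(n_rows, categories.shape[0]),
)
```

Both forms carry the same information (`categories[codes]` recovers `raw`), so the choice mostly depends on what the downstream estimator can consume.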

As always, it might be a good start to just download a little bit of the Criteo dataset (I think that each day of data is available separately) and work with sklearn directly to establish a baseline.
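A rough sketch of what such a baseline could look like follows. It assumes a tab-separated layout of one label, 13 integer columns, and 26 categorical columns; that layout, the synthetic rows from the hypothetical `make_fake_rows` helper, the hash size, and the choice of `FeatureHasher` plus `LogisticRegression` with the L-BFGS solver are all assumptions for illustration, not something settled in this issue.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import LogisticRegression

N_DENSE, N_CAT = 13, 26  # assumed column layout, see the note above


def make_fake_rows(n_rows, seed=0):
    """Hypothetical stand-in for a small downloaded sample of the data."""
    rng = np.random.default_rng(seed)
    rows = []
    for _ in range(n_rows):
        dense = rng.integers(0, 100, size=N_DENSE)
        cats = rng.integers(0, 50, size=N_CAT)
        label = int(cats[0] < 25)  # synthetic signal so there is something to learn
        fields = [str(label)]
        fields += [str(int(v)) for v in dense]
        fields += [format(int(c), "08x") for c in cats]
        rows.append("\t".join(fields))
    return rows


def featurize(rows, n_hash_features=2**12):
    """Turn raw lines into a sparse matrix: log-scaled dense + hashed categoricals."""
    labels, dense, tokens = [], [], []
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        labels.append(int(fields[0]))
        # Empty fields are treated as 0; negatives are clipped before log1p.
        dense.append([np.log1p(max(float(v), 0.0)) if v else 0.0
                      for v in fields[1:1 + N_DENSE]])
        # Prefix each value with its column index so columns do not collide by name.
        tokens.append([f"C{i}={v}"
                       for i, v in enumerate(fields[1 + N_DENSE:]) if v])
    hasher = FeatureHasher(n_features=n_hash_features, input_type="string")
    X_cat = hasher.transform(tokens)
    X = sparse.hstack([sparse.csr_matrix(np.array(dense)), X_cat], format="csr")
    return X, np.array(labels)


rows = make_fake_rows(400)
X, y = featurize(rows)

# Simple holdout split: first 300 rows for training, last 100 for evaluation.
model = LogisticRegression(solver="lbfgs", max_iter=500)
model.fit(X[:300], y[:300])
acc = model.score(X[300:], y[300:])
print(f"held-out accuracy on synthetic rows: {acc:.3f}")
```

On real data the synthetic rows would be replaced by lines read from one day's file, and the regularization strength `C` would be an obvious first candidate for the hyperparameter search in item 2.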

This came out of a conversation with @ogrisel

Case Study: Criteo dataset #295
