The Criteo dataset is a 1TB dump of features describing advertisements and whether or not someone clicked on each ad. It has both dense and categorical/sparse data. I believe that the data is freely available on Azure.
There are some things that we might want to do with this dataset that are representative of other problems:
- Logistic regression on large sparse data. This could use existing algorithms like L-BFGS or ADMM, or it could use the more recent Incremental SGD work. It would be useful to compare the effectiveness of these algorithms.
- We could also add hyperparameter optimization.
- Gradient boosted trees, presumably with the dask-xgboost connection. This raises a couple of questions. Can XGBoost support categorical data or scipy.sparse arrays? Or perhaps we have to provide a column of integers?
As always, it might be a good start to just download a little bit of the Criteo dataset (I think that each day of data is available separately) and work with sklearn directly to establish a baseline.
This came out of conversation with @ogrisel