Fixing alpha_grid for sparse matrices #2788
Conversation
cc @agramfort Please review! |
looks good at first sight. can you bench the speed and check that it fixes the problem reported in #2526? how much slower is it? is it also slower with a really sparse problem? |
@agramfort Fitting the correct alpha_grid doesn't resolve the problem :( I think the slowdown is due to something else.
Speed profile:
In master:
In my branch:
See ElasticNet() & Lasso profiling for dense and sparse data.
Hope this convinces you that it is not due to the centering alone. It may be due to the sparse vs dense data handling. But still this PR can be merged because it fixes the alpha_grid. (I've added a test for that) |
if Xy is None:
    X_sparse = False
    if sparse.isspmatrix(X):
        X_sparse = True
X_sparse = sparse.isspmatrix(X) is enough rather than these 3 lines
That's embarrassing. Sorry.
it would be great if you could profile the code to check where the time is spent. I would also try to bench using a really sparse X matrix, i.e. when X is large and super sparse |
@agramfort : There seems to be a bug for |
@agramfort (Seems the cd solver is slow)
|
ok so sparse CD is actually faster when X is really a sparse matrix. good ! |
@agramfort : No it isn't, it is almost ten times slower according to the profile. |
can you test with a real problem? i.e. thousands of features and hundreds of samples. Even text data if the dense case still fits in memory? |
@agramfort : You are right. I tested it with X = (20000, 500) with 61762 elements. Sparse CD takes 46.7s and CD takes 57.2s. This can be merged. |
  # X can be touched inplace thanks to the above line
- X, y, _, _, _ = center_data(X, y, fit_intercept,
+ X, y, _, _, X_std = center_data(X, y, fit_intercept,
I don't think you need the X_std with dense data.
Lasso and ElasticNet are more relevant in the case of n_features >> n_samples, i.e. X.shape = (500, 20000). can you test this also? a good test would also be on text data where X is super sparse and huge. You would be able to spot unnecessary memory allocations and identify bottlenecks |
@agramfort : Well I did it for (500, 20000). https://gist.github.com/Manoj-Kumar-S/8631580 |
@agramfort : I also tried the to_dense case using the 20 newsgroups data. For around 1k features, my computer became unresponsive. |
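For anyone reproducing this later, here is a rough sketch of the kind of text-data benchmark being discussed; the category choice, max_features and alpha below are arbitrary illustration values, not the settings used above:

```python
# Sketch only: sparse-vs-dense ElasticNet timing on text data.
# Categories, max_features and alpha are arbitrary for illustration.
import time
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import ElasticNet

data = fetch_20newsgroups(subset='train', categories=['sci.space', 'rec.autos'])
X = TfidfVectorizer(max_features=1000).fit_transform(data.data)  # sparse CSR
y = data.target.astype(np.float64)

clf = ElasticNet(alpha=0.01)

t0 = time.time()
clf.fit(X, y)                        # sparse coordinate descent path
print("sparse X: %.3f s" % (time.time() - t0))

t0 = time.time()
clf.fit(X.toarray(), y)              # dense path; may need a lot more memory
print("dense  X: %.3f s" % (time.time() - t0))
```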
what is weird is that on this sparse X and y, ElasticNet is faster on sparse |
@agramfort I profiled for cv = 3. First case: X is dense, normal coordinate descent:
Second case: X is sparse
|
Also do check this https://gist.github.com/Manoj-Kumar-S/8633769 EDIT: Time spent in cd solver for each alpha |
manoj you do like my students :) I ask them to do something and they come back to me with numbers. I am much more |
@agramfort : The numbers are just to prove a point :). From the profile, I suppose it can be assumed that the time taken for sparse cd is greater than that of plain cd, isn't it? |
run this
1 loops, best of 3: 409 ms per loop
now replace ElasticNet by ElasticNetCV and the opposite happens. Why? |
my bad... with ElasticNetCV I get: 1 loops, best of 3: 4.72 s per loop. So the question is why it's slower although there is less computation to do? |
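Since the exact snippet behind these numbers is not reproduced above, here is a rough sketch of this kind of sparse-vs-dense timing comparison; the shapes, density, alpha and cv are arbitrary, and the timings quoted in the thread come from a different setup:

```python
# Sketch only: time ElasticNet / ElasticNetCV on the same data as sparse and dense.
from time import time
import numpy as np
from scipy import sparse
from sklearn.linear_model import ElasticNet, ElasticNetCV

rng = np.random.RandomState(0)
X = sparse.rand(200, 5000, density=0.01, format='csr', random_state=rng)
y = rng.randn(200)

for est in (ElasticNet(alpha=0.1), ElasticNetCV(cv=3)):
    for X_ in (X, X.toarray()):
        kind = "sparse" if sparse.issparse(X_) else "dense"
        t0 = time()
        est.fit(X_, y)
        print("%s on %s X: %.3f s" % (est.__class__.__name__, kind, time() - t0))
```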
the answer seems to be that the sparse CD does some random accesses + does not use blas calls... |
Crappy access patterns to the memory? |
So the bottom line is that fitting the alpha grid properly doesn't speed up anything much, right? |
great
that could definitely be part of a project.
yes they exist for E-Net.
@fabianp any chance you can have a look?
the idea is to avoid the loop of features but to experiment with
indeed... @mblondel should really pitch in here. @mblondel what is your take on porting some of lightning into sklearn? |
@agramfort "the idea is to avoid the loop of features but to experiment with |
E-Net can be seen as a Lasso with a design X that is np.concatenate((X,
I would have to search the literature too. A |
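For reference, the reformulation being alluded to (a standard result from the elastic net literature, restated here from memory): the naive elastic net on (X, y) is a Lasso on an augmented design obtained by stacking a scaled identity under X and zeros under y,

```latex
\min_{\beta}\; \tfrac{1}{2}\,\|y - X\beta\|_2^2 + \lambda_2 \|\beta\|_2^2 + \lambda_1 \|\beta\|_1
\;=\;
\min_{\beta}\; \tfrac{1}{2}\,\Bigl\| \begin{pmatrix} y \\ 0 \end{pmatrix}
 - \begin{pmatrix} X \\ \sqrt{2\lambda_2}\, I_p \end{pmatrix} \beta \Bigr\|_2^2
 + \lambda_1 \|\beta\|_1 ,
```

i.e. in NumPy terms something like np.concatenate((X, np.sqrt(2 * lam2) * np.eye(n_features))) with n_features extra zero entries appended to y.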
I managed to have a detailed look at the research paper (http://www-stat.stanford.edu/~jbien/jrssb2011strong.pdf), and I again have a few questions.
According to the paper, shouldn't it be strong_active_set everywhere? That is, is it because we are assuming the non-zero coefficients are the same across all alphas? But that might not make sense, right?
|
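For context, the sequential strong rule from the linked paper (restated from memory of the paper, so worth double-checking against the text): when moving from lambda_{k-1} to lambda_k along the path, feature j is discarded from the computation if

```latex
\bigl| x_j^{\top} \bigl( y - X \hat{\beta}(\lambda_{k-1}) \bigr) \bigr| \;<\; 2\lambda_k - \lambda_{k-1} ,
```

and the discarded features are then checked against the KKT conditions after the fit, so a rare violation only costs a refit rather than a wrong solution.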
@mblondel : Would you please be able to comment on this, about removing Liblinear and porting lightning for Logistic Regression? I would like to do this as part of my project. |
liblinear implements three solvers for logistic regression:
For 1) you could start from @GaelVaroquaux 's branch (https://github.com/GaelVaroquaux/scikit-learn/blob/logistic_cv/sklearn/linear_model/logistic.py). The nice thing is that the Hessian doesn't need to be entirely materialized, you just need to implement the Hessian vector product. lightning only supports coordinate descent, which is usually not preferred for L2 regularization.
For 2) you could do a straight port of the C++ to Cython. lightning doesn't support this solver at all.
For 3), you could reuse lightning's code, but lightning uses coordinate descent only (not the mix of CD and glmnet) and so might be slower.
I'm not sure whether working on strong rules is a good idea. This is a difficult project for little gains.
I am not very enthusiastic about merging a lot of stuff from lightning. I spent a lot of time on it and I don't want it to be rendered obsolete. Now if you only take the L1 logistic regression solver, that's fine.
Last but not least, I'll be increasingly busy this year so I can't mentor this (I can of course answer occasional questions). |
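To make the Hessian-vector-product remark concrete, here is a minimal sketch (not the code in Gael's branch) for L2-regularized logistic loss with labels in {-1, +1}; the Hessian X^T D X + alpha * I is never formed explicitly:

```python
# Sketch: Hessian-vector product for L2-regularized logistic regression,
# usable by Newton-CG without materializing the Hessian.
import numpy as np
from scipy.special import expit  # logistic sigmoid

def logistic_hessian_vector_product(w, v, X, alpha):
    """Return H @ v with H = X^T D X + alpha * I and D = diag(p * (1 - p))."""
    p = expit(X.dot(w))      # predicted probabilities, shape (n_samples,)
    d = p * (1.0 - p)        # diagonal of D
    return X.T.dot(d * X.dot(v)) + alpha * v
```

The same code works unchanged when X is a scipy sparse matrix, which is the point of never building the dense Hessian.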
|
could fly for me. Adding strong rules to Lasso and ElasticNet should not be that hard anymore. What do others think? ping @GaelVaroquaux @fabianp @larsmans |
@agramfort I definitely think I would be able to do the strong rules. But my fear is that it might break some people's code since some coordinates are discarded on the basis of the rules. Maybe we can have an option strong=True in the init? As for the rest, I haven't yet had a deep look at the Logistic Regression codebase (I will this week). I can happily finish Gael's and @larsmans 's work as part of GSoC (if they allow me to :) ) |
the solution with strong rules should be exactly the same (up to
@GaelVaroquaux and @larsmans ? |
+1. I guess my work could best be merged into Gaël's. |
On Fri, Feb 07, 2014 at 11:58:09AM -0800, Alexandre Gramfort wrote:
In my experience, tron isn't very useful, and it is quite costly in terms
Me too.
I would really push for the strong rules. When I talked to Hastie about |
I am absolutely happy to see my work being finished by somebody else :). |
@manoj-kumar-s : I just stumbled on a reference that might be useful to your GSOC proposal: Also, the Yuan 2010 reference mentioned in the above paper is relevant. |
This is the CD + glmnet mix I mentioned above. This is the solver currently used by liblinear for L1-regularized logistic regression. However, it is fairly complicated. If you choose this solver over CD, I'd recommend a straight port of the C to Cython.
lightning implements the CDN algo of this paper. But lightning also supports a variant without line search. The line search is tricky to implement and adds a lot of code for no speed benefit in my experience. The step sizes for the variant without line search are given in Table 10 of And implemented in lightning here: (The variable |
@agramfort , @GaelVaroquaux Summing up, apart from strong rules, these are the ideas:
Multinomial LR
Is this too ambitious to do in one summer? Which are the 3-4 that have the most priority? |
I would add the easiest thing to try: random order in coordinate descent. I don't think that what you propose is too ambitious. In terms of order, I would start with the random order, then move on to l2 models, then strong rules, then the rest. Make sure that you include parameter selection with regularization path. Sounds exciting to me. |
Just to clarify, CD + glmnet is not in lightning, only CD (with or without line search). |
@mblondel great. @GaelVaroquaux In the randomised coordinate descent method, for every iteration, we randomly pick a feature to update instead of doing a traversal across all features (according to the wiki page). Is there anything more to this method? |
+100 for what gael said: "doing great what is already started and ok" is ambitious enough |
Pure random coordinate selection (with replacement) will need a pure Python function call at every iteration to get the coordinate index. In lightning, I have backported the Cython interface of a subset of numpy.random. An easier solution is to use cyclic coordinate selection with permutation of the order before every outer iteration. This is similar to random but without replacement. |
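A small NumPy sketch of that last idea (purely illustrative, not the Cython code in scikit-learn or lightning): cyclic coordinate descent for the Lasso where the feature order is re-permuted before every outer sweep:

```python
# Sketch: Lasso coordinate descent, visiting features in a freshly
# shuffled order at each outer iteration (cyclic without replacement).
import numpy as np

def lasso_cd_shuffled(X, y, alpha, n_iter=100, random_state=0):
    """Minimize 1/(2 * n_samples) * ||y - Xw||^2 + alpha * ||w||_1."""
    rng = np.random.RandomState(random_state)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    residual = y.copy()                  # residual for w = 0
    col_sq = (X ** 2).sum(axis=0)        # ||x_j||^2 for each feature
    for _ in range(n_iter):
        for j in rng.permutation(n_features):   # shuffled sweep order
            if col_sq[j] == 0.0:
                continue
            residual += X[:, j] * w[j]           # remove feature j's contribution
            rho = X[:, j].dot(residual)
            w[j] = np.sign(rho) * max(abs(rho) - alpha * n_samples, 0.0) / col_sq[j]
            residual -= X[:, j] * w[j]           # put the updated contribution back
    return w
```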
@mblondel : Is there anything wrong with using rand? For example:
There will be modulo bias, but that can also be taken care of I think. |
Indeed that's possible but you will need a clean way to handle random_state. You could also reuse some code by the tree growers: |
The problems with
(The replacement in the tree code has modulo bias, but that's not nearly as bad as the other problems with |
@larsmans Then I can use the tree code? |
Sure. Just copy it over for now; we can refactor into a (I planned to do a big refactor of |
@agramfort So finally my exams are over. I'll submit a first draft proposal in a few days. By the way, coming back to what this PR was about: does it make sense to profile the sparse_enet_coordinate_descent method to find where exactly the time is spent, since it seems to be really slow? |
clfs = LassoCV(max_iter=100, cv=10, normalize=normalize)
ignore_warnings(clfs.fit)(X, y)
clfd = LassoCV(max_iter=100, cv=10, normalize=normalize)
ignore_warnings(clfd.fit)(X.todense(), y)
Coming late to the party, sorry for that.
Nice test, however @manoj-kumar-s could you please avoid using X.todense() and rather use X.toarray() instead in the future? todense returns a numpy.matrix instance which has a very confusing matlab-like API, and we rather want to only deal with numpy.ndarray instances throughout the code base.
Here this is just a test so this is not a big deal, but for the sake of consistency it's better to take good habits as early as possible.
@ogrisel Thanks for the heads-up! |
Fixes #2526 partially.
Centering of the sparse matrices is done indirectly. It gives the correct output and is marginally faster than in master, though still much slower than with dense matrices.
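For context on the indirect centering: the fix only needs the identity (X - 1 x̄ᵀ)ᵀ y = Xᵀ y - x̄ · Σᵢ yᵢ, so the correlations used by alpha_grid can be computed as if X were centered without ever densifying it. A small self-contained check of that identity (illustrative only, not the actual _alpha_grid code):

```python
# Illustrative check: X^T y for "centered" X computed without densifying X.
import numpy as np
from scipy import sparse

rng = np.random.RandomState(0)
X = sparse.rand(50, 20, density=0.1, format='csc', random_state=rng)
y = rng.randn(50)

X_mean = np.asarray(X.mean(axis=0)).ravel()     # column means, shape (n_features,)
Xy_implicit = X.T.dot(y) - X_mean * y.sum()     # (X - X_mean)^T y, sparse-friendly

Xy_explicit = (X.toarray() - X_mean).T.dot(y)   # the expensive, densifying way
assert np.allclose(Xy_implicit, Xy_explicit)
```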