Constant Stochastic Gradient Descent #2544
Conversation
I am unable to understand what is meant by the statement. I want to replicate the results from Figs. 1, 2, and 3.
This is a great start @shkr. To do the PCA projection you simply run SVD on the covariance matrix of the posterior. That gives you M, V, and M.T. You can just take the first and last columns of the V matrix and make that your projection matrix. Let me know if that's not clear.
Okay. Sigma = {P - mean(P)} * {P - mean(P)}.T, of shape (Q x Q). Then I selected the first and last rows of V.H from the SVD decomposition Sigma = U S V.H. Projecting the samples onto these two directions gives me S two-dimensional vectors. Is that what is being done? I am trying to interpret these projections, but I think I am doing something incorrect. Can you confirm the above steps?
Force-pushed from d16a6cc to 3ce1537.
@twiecki any comments?
@shkr Sorry, I've been on vacation, will try to take a look soon.
@shkr I think that's close. However, I think they just compute the principal components of the posterior once (e.g. on the NUTS samples) and then project the individual traces into that.
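A minimal NumPy sketch of my reading of this projection step (illustrative only, not code from the PR): compute the principal directions once from a reference set of posterior samples (e.g. the NUTS trace), keep the eigenvectors with the largest and smallest eigenvalues, and project every other trace into that fixed 2-D plane. Names and shapes are assumptions.

```python
import numpy as np

def pca_projection_matrix(reference_samples):
    """reference_samples: (S, Q) array of posterior draws, e.g. from NUTS."""
    centered = reference_samples - reference_samples.mean(axis=0)
    sigma = centered.T @ centered / (len(reference_samples) - 1)  # (Q, Q) sample covariance
    U, s, Vh = np.linalg.svd(sigma)                               # sigma = U @ diag(s) @ Vh
    # Directions of largest and smallest posterior variance.
    return np.column_stack([U[:, 0], U[:, -1]])                   # (Q, 2)

def project(samples, W):
    """Project (S, Q) samples onto the fixed 2-D plane W."""
    return samples @ W                                            # (S, 2)

# W is computed once from the reference (NUTS) samples, then reused for the SGFS/CSG traces:
# W = pca_projection_matrix(nuts_samples)
# xy_csg = project(csg_samples, W)
```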
It does seem to work much better than SGFS. What are your conclusions?
I actually used this as an example of extending PyMC3 in a presentation last week. It worked really well!
@twiecki I have updated the notebook. The CSG samples do give a good approximation of the posterior; the figures are falling in line with the paper. I expected independent projections onto the largest and smallest eigenvectors of the sample covariance matrix, but that does not hold for SGFS, possibly because of two untuned hyperparameters in SGFS. No such tuning is required for CSG, since there are theoretically optimal values for all of its hyperparameters. I want to try using CSG for hyperparameter optimization of the lasso model before drawing final conclusions and requesting a merge.
@shkr I had somewhat forgotten about this, but I'm fairly excited about the work you've put in here. Do you think we should include the sampler in the code base? It seems preferable to SGFS.
@twiecki Yes. I was busy with some other work, so I was unable to push the update here. I will push a commit this weekend at the latest; it will then be ready for review/merge.
And yes, I agree that CSG is preferable to SGFS.
@twiecki @fonnesbeck some debugging help required: I am unable to understand why Theano is throwing a disconnected input error here. As per the model
That question is non-blocking for this PR. I just ran into the error while trying to implement the hyperparameter section of the paper. But, having thought about a few use cases, it does not make sense anyway, since hyperparameters such as the number of nodes in a neural network are non-differentiable. So I am a bit unclear how the EM routine is helpful for general problems.
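As an aside for readers hitting the same error, here is a minimal Theano snippet, unrelated to the actual model in this PR, showing how a disconnected-input error arises and how it can be silenced (variable names are made up):

```python
import theano.tensor as tt

x = tt.vector('x')
y = tt.vector('y')          # never used in the cost, so it is "disconnected"
cost = (x ** 2).sum()

# tt.grad(cost, y) raises theano.gradient.DisconnectedInputError,
# because no path in the computation graph connects `cost` to `y`.

# Either drop `y` from `wrt`, or tell Theano to return a zero gradient for it:
g = tt.grad(cost, y, disconnected_inputs='ignore')
```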
I have updated sgmcmc.py and created a new notebook showing usage of ConstantStochasticGradient. This PR is ready for merge.
pymc3/step_methods/sgmcmc.py (Outdated)
        Theano variables, default continuous vars
    kwargs: passed to BaseHMC
    """
    super(ConstantStochasticGradient, self).__init__(vars, **kwargs)
Should add an experimental warning.
The experimental warning is present in this line: https://github.com/shkr/pymc3/blob/45a45ab0f78b480b8accb27f168164bb213cd280/pymc3/step_methods/sgmcmc.py#L112
Can you also add a note to the
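For context, an experimental warning of that sort is typically a one-liner emitted from the step method's `__init__` (a generic sketch, not the exact code at the linked line):

```python
import warnings

warnings.warn('ConstantStochasticGradient is an experimental step method; '
              'its interface may change in future releases.')
```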
pymc3/step_methods/sgmcmc.py (Outdated)
@@ -298,3 +311,98 @@ def competence(var, has_grad):
        if var.dtype in continuous_types and has_grad:
            return Competence.COMPATIBLE
        return Competence.INCOMPATIBLE

class ConstantStochasticGradient(BaseStochasticGradient):
maybe a shorter name?
I can change it to CSG, just like what I did with SGFS. Is that okay?
Yeah, it's not great but for consistency probably the best option.
I have renamed it.
@twiecki I have inserted a line in the RELEASE-NOTES and added my name as a community member. Let me know if that's what you wanted.
RELEASE-NOTES.md (Outdated)
@@ -249,6 +250,7 @@ Patricio Benavente <patbenavente@gmail.com>
Raymond Roberts
Rodrigo Benenson <rodrigo.benenson@gmail.com>
Sergei Lebedev <superbobry@gmail.com>
Shashank Shekhar <shashank.f1@gmail.com>
Can you add yourself instead to a new Contributors section for the upcoming 3.3 release?
Done
Can you also add the NB to the docs? And make sure you only have one top-level heading.
@twiecki Done! I added the three notebooks I have created for stochastic algorithms to the examples doc.
Great job @shkr! Just a nitpick: in constant_stochastic_gradient.ipynb you still have
at the end. You should do
@junpenglao done!
Would also be curious how CSG does on the neural network, but this doesn't have to be part of this PR.
@twiecki Yes, I agree. I will put up a follow-up PR with CSG on the neural net and other notebook updates to the stochastic gradient docs.
RELEASE-NOTES.md (Outdated)
@@ -7,7 +7,8 @@
- Improve NUTS initialization `advi+adapt_diag_grad` and add `jitter+adapt_diag_grad` (#2643)
- Update loo, new improved algorithm (#2730)
- New CSG (Constant Stochastic Gradient) approximate posterior sampling algorithm added
Link to PR like above.
docs/source/examples.rst (Outdated)
Stochastic Gradient
=====================
extra ==
pymc3/step_methods/sgmcmc.py (Outdated)
https://github.com/pymc-devs/pymc3/tree/master/docs/source/notebooks/constant_stochastic_gradient.ipynb

Parameters
-----
Make line as long as text above it
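That is, the numpydoc-style underline should be as wide as the section header:

```
Parameters
----------
```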
Thanks @shkr, this is a significant contribution!
Just tried running this on a larger NN but
* add csg
* Fig 1 and likelihood plotted
* posterior comparison
* csg nb and python file updated
* ConstantStochasticGradient renamed as CSG
* inserted update in RELEASE-NOTES
* nb updated and added to examples
Hey,
I recently came across the publication "Stochastic Gradient Descent as Approximate Bayesian Inference" (https://arxiv.org/pdf/1704.04289v1.pdf), which I found interesting.
In comparison to Stochastic Gradient Fisher Scoring, which uses a preconditioning matrix to sample from the posterior even with decreasing learning rates, this work uses optimal constant learning rates such that the Kullback-Leibler divergence between the stationary distribution of SGD and the posterior is minimized.
It approximates stochastic variational inference, whereas SGFS and many other MCMC techniques converge towards the exact posterior. In comparison to SGFS, the paper derives the optimal preconditioning matrix from a variational argument, so the preconditioning matrix is not an input.
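To spell out the objective described above (my paraphrase of the paper's high-level idea, not a reproduction of its equations): constant-rate SGD is treated as a Markov chain with stationary distribution $q_{\varepsilon}$, and the learning rate (and, in the preconditioned variant, the preconditioning matrix) is chosen variationally:

$$
\varepsilon^{*} \;=\; \underset{\varepsilon}{\arg\min}\ \mathrm{KL}\big(q_{\varepsilon}(\theta)\,\|\,p(\theta \mid \mathcal{D})\big)
$$

This is why, unlike SGFS, the preconditioner does not have to be supplied by the user.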
I have implemented it by extending the BaseStochasticGradient class introduced in the SGFS PR.
I am submitting this PR before it is complete, to get feedback on this algorithm.