
Constant Stochastic Gradient Descent #2544

Merged (8 commits, Dec 2, 2017)

Conversation

@shkr (Contributor) commented Sep 5, 2017

Hey,

I recently came across the publication Stochastic Gradient Descent as Approximate Bayesian Inference (https://arxiv.org/pdf/1704.04289v1.pdf), which I found interesting.

In comparison to Stochastic Gradient Fisher Scoring (SGFS), which uses a preconditioning matrix to sample from the posterior even with decreasing learning rates, this work uses an optimal constant learning rate chosen so that the Kullback-Leibler divergence between the stationary distribution of SGD and the posterior is minimized.

It approximates stochastic variational inference, whereas SGFS and many other MCMC techniques converge towards the exact posterior. In contrast to SGFS, the paper derives the optimal preconditioning matrix from a variational-inference argument, so the preconditioning matrix is not a user-supplied input.

I have implemented it by extending the BaseStochasticGradient class introduced in the SGFS PR.

I am submitting this PR before it is complete to get feedback on this algorithm.
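For orientation, here is a rough usage sketch of how a stochastic-gradient step method of this kind would be driven in PyMC3. The CSG name and the constructor arguments shown are assumptions modelled on the existing SGFS interface, not necessarily the final API of this PR:

import numpy as np
import pymc3 as pm

# Toy linear-regression data
N, D = 10000, 3
X = np.random.randn(N, D)
w_true = np.array([1.0, -2.0, 0.5])
y = X.dot(w_true) + 0.1 * np.random.randn(N)

# Minibatched views of the data for the stochastic-gradient updates
X_mb = pm.Minibatch(X, batch_size=100)
y_mb = pm.Minibatch(y, batch_size=100)

with pm.Model():
    w = pm.Normal('w', mu=0, sd=10, shape=D)
    pm.Normal('obs', mu=pm.math.dot(X_mb, w), sd=0.1,
              observed=y_mb, total_size=N)
    # Hypothetical constructor call, mirroring how SGFS is instantiated
    step = pm.CSG(batch_size=100, total_size=N)
    trace = pm.sample(draws=5000, step=step)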

@shkr (Contributor, Author) commented Sep 5, 2017

I am unable to understand what is meant by the statement "We show projections on the smallest and largest principal component of the posterior in Figure 1." Any help on how to calculate these projections of a posterior?

I want to replicate the results from Figures 1, 2 and 3.

@twiecki (Member) commented Sep 5, 2017

This is a great start @shkr.

To do the PCA projection you simply run SVD on the covariance matrix of the posterior. Since the covariance is symmetric, that gives you a decomposition M V M.T with V diagonal. You can just take the first and last columns of M and make that your projection matrix. Let me know if that's not clear.
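A minimal numpy sketch of that projection, assuming reference_samples is an (S, Q) array of posterior draws (the function and variable names here are illustrative, not from the PR):

import numpy as np

def pca_projection_matrix(reference_samples):
    """Return a (Q, 2) matrix whose columns are the largest and smallest
    principal directions of the posterior sample covariance."""
    cov = np.cov(reference_samples, rowvar=False)   # (Q, Q) sample covariance
    M, v, Mt = np.linalg.svd(cov)                   # cov = M @ diag(v) @ Mt
    return M[:, [0, -1]]                            # first and last principal directions

# Project draws from another trace (also (S, Q)) onto the two directions:
# projected = other_samples @ pca_projection_matrix(reference_samples)   # shape (S, 2)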

@shkr (Contributor, Author) commented Sep 10, 2017

Okay. I calculated the covariance of the posterior P of shape (Q x S), where Q is the number of parameters and S is the number of samples in the trace:

Sigma = { P - mean(P) } * { P - mean(P) }.T, of shape (Q x Q)

Then I selected the first and last rows of V.h from the SVD decomposition Sigma = U S V.h, and projected the S samples of size (Q x 1) from the trace onto the first and last components.

This gives me S two-dimensional vectors.

Is that what is being done? I am trying to interpret these projections, but I think I am doing something incorrect. Can you confirm the above steps?

@shkr force-pushed the csgb branch 3 times, most recently from d16a6cc to 3ce1537 on September 16, 2017
@shkr (Contributor, Author) commented Sep 20, 2017

@twiecki any comments?

@twiecki (Member) commented Sep 25, 2017

@shkr Sorry, I've been on vacation, will try to take a look soon.

@twiecki (Member) commented Sep 26, 2017

@shkr I think that's close. However, I think they just compute the principal components of the posterior once (e.g. on the NUTS samples) and then project the individual traces onto those.

@twiecki (Member) commented Sep 28, 2017

It does seem to work much better than SGFS. What are your conclusions?

@fonnesbeck (Member) commented:

I actually used this as an example of extending PyMC3 in a presentation last week. It worked really well!

@shkr (Contributor, Author) commented Oct 3, 2017

@twiecki I have updated the notebook. The posterior from CSG does give a good approximation to the true posterior, and the figures are falling in line with the paper. I expected independent projections onto the largest and smallest eigenvectors of the sample covariance matrix; however, that does not hold for SGFS, possibly because of its two untuned hyperparameters. No such tuning is required for CSG, since there are theoretically optimal values for all of its hyperparameters. I want to try using CSG for hyperparameter optimization of the lasso model before drawing final conclusions and requesting a merge.

@twiecki (Member) commented Nov 9, 2017

@shkr I somewhat forgot about this but I'm fairly excited about the work you've put in here. Do you think we should include the sampler in the code base? Seems like it's preferable over SGFS.

@shkr (Contributor, Author) commented Nov 9, 2017

@twiecki Yes. I was busy with some other work, so I was unable to push the update here. I will be pushing a commit this weekend at the latest. It will be ready for review/merge.

@shkr (Contributor, Author) commented Nov 9, 2017

And yes, I agree that CSG is preferable to SGFS.

@shkr (Contributor, Author) commented Nov 10, 2017

@twiecki @fonnesbeck some debugging help required.

[two screenshots of the Theano traceback ending in a disconnected input error]

I am unable to understand why Theano is throwing a disconnected input error here.

As per the model, mu follows a Laplace distribution and s is the regularizer parameter of that distribution. I would expect the gradient of obs_var, which depends on mu, to also have a gradient with respect to s.
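For what it's worth, here is a minimal Theano sketch (deliberately not the model from the screenshots) of when this error is raised: theano.grad raises DisconnectedInputError whenever the wrt variable does not appear anywhere in the cost's symbolic graph. In a PyMC3 model the observation term is built from the free random variable mu rather than directly from the hyperparameter s that only enters mu's prior, which is presumably why the gradient with respect to s comes out disconnected.

import theano
import theano.tensor as tt

s = tt.dscalar('s')
mu = tt.dscalar('mu')
cost = mu ** 2                 # the cost depends on mu only, not on s

g_mu = theano.grad(cost, mu)   # fine
g_s = theano.grad(cost, s)     # raises theano.gradient.DisconnectedInputError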

@shkr (Contributor, Author) commented Nov 14, 2017

That question is non-blocking for this PR. I just ran into the error while trying to implement the hyperparameter-optimization section of the paper. But having thought about a few use cases, it does not make sense in general, since hyperparameters such as the number of nodes in a neural network are non-differentiable. So I am a bit unclear how the EM routine is helpful for general problems.

@shkr (Contributor, Author) commented Nov 14, 2017

I have updated sgmcmc.py and created a new notebook showing usage of ConstantStochasticGradient. This PR is ready for merge.

Theano variables, default continuous vars
kwargs: passed to BaseHMC
"""
super(ConstantStochasticGradient, self).__init__(vars, **kwargs)
Member:

Should add an experimental warning.
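One way such a warning could look (an illustrative sketch only; the import path follows the sgmcmc.py module from the SGFS PR, and the wording is not taken from the merged code):

import warnings

from pymc3.step_methods.sgmcmc import BaseStochasticGradient


class ConstantStochasticGradient(BaseStochasticGradient):
    """Constant SGD approximate posterior sampler (experimental)."""

    def __init__(self, vars=None, **kwargs):
        # Emit the experimental warning suggested in this review comment
        warnings.warn('ConstantStochasticGradient is experimental and its '
                      'interface may change in future releases.', UserWarning)
        super(ConstantStochasticGradient, self).__init__(vars, **kwargs)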


@twiecki (Member) commented Nov 16, 2017

Can you also add a note to the RELEASE-NOTES?

@@ -298,3 +311,98 @@ def competence(var, has_grad):
    if var.dtype in continuous_types and has_grad:
        return Competence.COMPATIBLE
    return Competence.INCOMPATIBLE


class ConstantStochasticGradient(BaseStochasticGradient):
Member:

maybe a shorter name?

Contributor Author:

I can change it to CSG, just like what I did with SGFS. Is that okay?

Member:

Yeah, it's not great, but for consistency it's probably the best option.

Contributor Author:

I have renamed it.

@shkr (Contributor, Author) commented Nov 27, 2017

@twiecki I have inserted a line in the RELEASE-NOTES and added my name as a community member. Let me know if that's what you wanted.

RELEASE-NOTES.md Outdated
@@ -249,6 +250,7 @@ Patricio Benavente <patbenavente@gmail.com>
Raymond Roberts
Rodrigo Benenson <rodrigo.benenson@gmail.com>
Sergei Lebedev <superbobry@gmail.com>
Shashank Shekhar <shashank.f1@gmail.com>
Member:

Can you add yourself instead to a new section Contributors for the upcoming 3.3 release?

Contributor Author:

Done

@twiecki (Member) commented Nov 28, 2017

Can you also add the NB to the docs? And make sure you only have one top-level heading # in the NB.

@shkr (Contributor, Author) commented Nov 30, 2017

@twiecki Done! I added the 3 notebooks I have created for the stochastic gradient algorithms to the examples docs.

@junpenglao (Member) commented:

Great job @shkr! Just a nitpick: in constant_stochastic_gradient.ipynb you still have a top-level heading # at the end. You should change # Result --> ## Result.

@shkr (Contributor, Author) commented Nov 30, 2017

@junpenglao done!

@twiecki (Member) commented Nov 30, 2017

************* Module pymc3.sampling
pymc3/sampling.py:13: [W0611(unused-import), ] Unused CSG imported from step_methods
pymc3/sampling.py:13: [W0611(unused-import), ] Unused SGFS imported from step_methods

Would also be curious how CSG does on the neural network, but this doesn't have to be part of this PR.

@shkr (Contributor, Author) commented Nov 30, 2017

@twiecki Yes, I agree. I will put up a follow-up PR with CSG on the neural net and other notebook updates to the stochastic gradient docs.

RELEASE-NOTES.md Outdated
@@ -7,7 +7,8 @@

- Improve NUTS initialization `advi+adapt_diag_grad` and add `jitter+adapt_diag_grad` (#2643)
- Update loo, new improved algorithm (#2730)

- New CSG (Constant Stochastic Gradient) approximate posterior sampling
algorithm added
Member:

Link to PR like above.



Stochastic Gradient
=====================
Member:

Extra == in the underline.

https://github.com/pymc-devs/pymc3/tree/master/docs/source/notebooks/constant_stochastic_gradient.ipynb

Parameters
-----
Member:

Make the underline as long as the text above it.

@twiecki merged commit 0a72bca into pymc-devs:master on Dec 2, 2017
@twiecki (Member) commented Dec 2, 2017

Thanks @shkr, this is a significant contribution!

@twiecki (Member) commented Dec 2, 2017

Just tried running this on a larger NN, but the per-observation gradient loop

output, _ = theano.scan(lambda i, logX=logL, v=var: theano.grad(logX[i], v).flatten(),
                        sequences=[tt.arange(logL.shape[0])])

seems to take forever. Is there no way to vectorize this?

jordan-melendez pushed a commit to jordan-melendez/pymc3 that referenced this pull request on Feb 6, 2018:
* add csg

* Fig 1 and likelihood plotted

* posterior comparison

* csg nb and python file updated

* ConstantStochasticGradient renamed as CSG

* inserted update in RELEASE-NOTES

* nb updated and added to examples
@shkr deleted the csgb branch on May 13, 2018