
Added Momentum and Nesterov modifications #148


Merged
merged 23 commits into from
Jul 14, 2023

Conversation

Spnetic-5
Collaborator

@Spnetic-5 Spnetic-5 commented Jun 22, 2023

Solves #137

@Spnetic-5 Spnetic-5 requested a review from milancurcic June 23, 2023 00:00
@Spnetic-5
Collaborator Author

@milancurcic Please review the implementation of the suggested changes to the code; I can see the difference in the MSE of SGD after adding the momentum and nesterov arguments.

@Spnetic-5 Spnetic-5 marked this pull request as ready for review June 23, 2023 21:36
@milancurcic
Member

@Spnetic-5 I refactored the subroutines in the quadratic_fit example to accept xtest and ytest. This allows the subroutines to report the RMSE every given number of epochs. For example, this is what the output looks like now:

...
Stochastic gradient descent
----------------------------------
Epoch:  100/1000, RMSE =  0.132655
Epoch:  200/1000, RMSE =  0.131642
Epoch:  300/1000, RMSE =  0.131308
Epoch:  400/1000, RMSE =  0.131156
Epoch:  500/1000, RMSE =  0.131071
Epoch:  600/1000, RMSE =  0.131018
Epoch:  700/1000, RMSE =  0.130981
Epoch:  800/1000, RMSE =  0.130953
Epoch:  900/1000, RMSE =  0.130932
Epoch: 1000/1000, RMSE =  0.130915
 
Batch gradient descent
----------------------------------
Epoch:  100/1000, RMSE =  0.474867
Epoch:  200/1000, RMSE =  0.406169
Epoch:  300/1000, RMSE =  0.356077
Epoch:  400/1000, RMSE =  0.320101
Epoch:  500/1000, RMSE =  0.294075
Epoch:  600/1000, RMSE =  0.274908
Epoch:  700/1000, RMSE =  0.260488
Epoch:  800/1000, RMSE =  0.249400
Epoch:  900/1000, RMSE =  0.240700
Epoch: 1000/1000, RMSE =  0.233746

Minibatch gradient descent
----------------------------------
Epoch:  100/1000, RMSE =  0.189657
Epoch:  200/1000, RMSE =  0.186826
Epoch:  300/1000, RMSE =  0.185389
Epoch:  400/1000, RMSE =  0.183885
Epoch:  500/1000, RMSE =  0.181658
Epoch:  600/1000, RMSE =  0.177690
Epoch:  700/1000, RMSE =  0.170329
Epoch:  800/1000, RMSE =  0.159604
Epoch:  900/1000, RMSE =  0.149710
Epoch: 1000/1000, RMSE =  0.143441
 
RMSProp optimizer
----------------------------------
Epoch:  100/1000, RMSE =  0.198733
Epoch:  200/1000, RMSE =  0.167841
Epoch:  300/1000, RMSE =  0.138085
Epoch:  400/1000, RMSE =  0.134165
Epoch:  500/1000, RMSE =  0.132821
Epoch:  600/1000, RMSE =  0.132069
Epoch:  700/1000, RMSE =  0.131591
Epoch:  800/1000, RMSE =  0.131288
Epoch:  900/1000, RMSE =  0.131106
Epoch: 1000/1000, RMSE =  0.130999
 

This allows us to see how the RMSE evolves during training.

While doing this, I also discovered a small but serious error in the calculation of xtest. Before the fix, it was evaluated as an integer expression, so the xtest values were only zeros and ones, with no values in between. That means we had so far been evaluating at only those two points, which gave us a false idea of the RMSE.
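The exact Fortran expression isn't shown in this thread, but the failure mode is easy to reproduce; here is a Python sketch of the integer-division bug and its fix:

```python
n = 10

# Buggy: integer division truncates, so every point collapses to 0
# except the endpoint, which becomes 1.
xtest_bad = [i // n for i in range(n + 1)]

# Fixed: real division gives evenly spaced points in [0, 1].
xtest_good = [i / n for i in range(n + 1)]

print(sorted(set(xtest_bad)))  # [0, 1]
print(xtest_good[:3])          # [0.0, 0.1, 0.2]
```

Evaluating only at x = 0 and x = 1 makes the reported RMSE meaningless for the rest of the interval, which is why the fix changed the numbers above so much.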

@milancurcic
Member

I'll continue with the remainder of the review over the weekend.

@milancurcic
Member

I've added the necessary data structures to the optimizer type definitions to allow keeping track of velocity in the case of SGD and RMS gradients in the case of RMSProp. The code doesn't work yet--we need to correctly initialize these structures and correctly reference them during each successive minimization call. The main challenge is that the minimize method is currently an elemental subroutine that works on a per-layer basis, so it's not aware of which layer it's working on. I'm not yet clear on the best approach, but here are some ideas I've been thinking about:

  1. Rather than minimizing at the layer level, minimize at the network level; we loop over layers and have all the information we need to reference the velocity/RMS values correctly--basically the RMSProp subroutine approach in examples/quadratic.f90.
  2. Pass the layer number/name to optimizer % minimize() to let it know which layer it's on. The subroutine can then correctly address the velocity/RMS arrays.
  3. Possibly the simplest approach: treat all model parameters and gradients as 1-d arrays, get their size via net % get_num_params(), and update the weights via net % set_params(). The downside is that getting the gradients from all layers in one place will require a new allocation (maybe, let's see).
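A minimal sketch of approach 3, in Python for brevity (the layer shapes and the TinyNet name are invented for illustration; the real network exposes get_num_params, get_params, and set_params):

```python
import numpy as np

class TinyNet:
    """Stand-in for the network type: two 'layers' with arbitrary shapes."""

    def __init__(self):
        self.layers = [np.ones((2, 3)), np.ones(3)]

    def get_num_params(self):
        return sum(w.size for w in self.layers)

    def get_params(self):
        # Gathering everything into one flat array needs a new
        # allocation -- the downside mentioned above.
        return np.concatenate([w.ravel() for w in self.layers])

    def set_params(self, flat):
        offset = 0
        for i, w in enumerate(self.layers):
            self.layers[i] = flat[offset:offset + w.size].reshape(w.shape)
            offset += w.size

net = TinyNet()
params = net.get_params()
grads = np.full(net.get_num_params(), 0.5)  # stand-in gradients
net.set_params(params - 0.1 * grads)        # one flat SGD step
```

The appeal of this approach is that the optimizer never needs to know which layer it is updating; it sees a single flat parameter vector and a matching flat gradient vector.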

@milancurcic
Member

I took approach 3 above for the time being. We also need layer % get_gradients() and network % get_gradients() methods, which are not implemented yet, but they will be essentially the same procedures as layer % get_params() and network % get_params().

Because velocity in SGD and rms_gradient in RMSProp are allocatable arrays, they can't be used in the arithmetic inside elemental minimize methods, so minimize is no longer elemental.

Once getting the layer gradients is implemented and everything is hooked up correctly, it should just work for both SGD (with or without classic or Nesterov momentum) and RMSProp. Once it does, this huge PR can be merged and we'll make a new minor release.
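For reference, the update rules being wired in can be sketched in Python as follows (these are the standard textbook formulations; the function and parameter names are illustrative, not the library's API):

```python
import numpy as np

def sgd_step(w, grad, velocity, lr=0.1, momentum=0.9, nesterov=False):
    """One SGD step with classic or Nesterov momentum."""
    velocity = momentum * velocity - lr * grad
    if nesterov:
        # Look-ahead form: apply the momentum term once more on top
        # of the plain gradient step.
        w = w + momentum * velocity - lr * grad
    else:
        w = w + velocity
    return w, velocity

def rmsprop_step(w, grad, rms_grad, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSProp step; rms_grad is the running mean of squared gradients."""
    rms_grad = decay * rms_grad + (1 - decay) * grad**2
    w = w - lr * grad / (np.sqrt(rms_grad) + eps)
    return w, rms_grad

# Minimize f(w) = w**2 (gradient 2w) with Nesterov momentum.
w, v = np.array([1.0]), np.zeros(1)
for _ in range(50):
    w, v = sgd_step(w, 2 * w, v, nesterov=True)
```

Both rules make the need for persistent state obvious: velocity and rms_grad must survive from one call to the next, which is exactly why the freshly-created-optimizer pattern discussed later in this thread doesn't work.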

@milancurcic
Member

We're almost there. Now we just need to add a test suite for the optimizer classes to ensure that they both converge on a simple problem, e.g. the one in examples/simple.f90.

@Spnetic-5
Collaborator Author

Sure, I'll try to implement the test suite.

@milancurcic
Member

@Spnetic-5 Thanks! To clarify: since the default SGD is already tested implicitly in e.g. examples/test_dense_network.f90 and elsewhere (the simple and sine examples), for this test suite let's focus on:

  • SGD with classic momentum
  • SGD with momentum + Nesterov
  • RMSProp
  • A minimal training example, like the one in examples/simple.f90, which fits 2 constant inputs to 3 constant outputs.

For ideas on how to write the test suite program, follow how other test programs are written.
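A convergence test along these lines can be sketched as follows (Python for illustration; the shapes mirror the 2-input, 3-output setup mentioned above, while the data values, learning rate, and tolerance are invented):

```python
import numpy as np

x = np.array([0.2, 0.4])        # 2 constant inputs
y = np.array([0.1, 0.5, 0.9])   # 3 constant outputs

rng = np.random.default_rng(0)
w = rng.normal(size=(3, 2))     # single linear layer, 3x2 weights
b = np.zeros(3)
v_w, v_b = np.zeros_like(w), np.zeros_like(b)
lr, momentum = 0.5, 0.9

# Train with classic-momentum SGD and check the loss drops below
# a tolerance -- the shape of the test, one block per optimizer.
for _ in range(200):
    err = w @ x + b - y
    v_w = momentum * v_w - lr * np.outer(err, x)
    v_b = momentum * v_b - lr * err
    w, b = w + v_w, b + v_b

loss = np.mean((w @ x + b - y) ** 2)
assert loss < 1e-3, "SGD with momentum failed to converge"
```

Each optimizer variant (classic momentum, Nesterov, RMSProp) would get its own block of this form, differing only in the update rule inside the loop.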

@milancurcic
Member

BTW, I wrote in d27bf09 that the tests pass in debug mode but segfault in release mode. After I pushed it, both debug and release modes passed in CI. Then I went back and saw that I had been working locally with GFortran v9. When I switched to v10, both modes passed the tests.

@Spnetic-5
Collaborator Author

Spnetic-5 commented Jul 12, 2023

@milancurcic I have added a draft test suite; please confirm whether I'm on the right track and share any suggestions before I proceed further.

Also, are we continuing the test suite in this PR, or should I open a separate one?

@milancurcic
Member

Thanks @Spnetic-5, yes, the structure of the program is good. I see that you're not actually testing for convergence yet; I assume that's on a TODO.

I realized that for optimizers that need to store state (velocity in SGD with momentum, and RMSProp in general), we actually need to store the optimizer instance as part of the network type so that it can preserve state between successive net % update() calls. Currently we always pass a freshly created optimizer instance, so no state is stored. As you're working on the test suite, can you also add a

class(optimizer_base_type), allocatable :: optimizer

member to the network type? Then, in train and update, we need to store it on the first call but leave it untouched on successive calls.

You can use allocated(self % optimizer) to check whether it has already been allocated, to determine whether it needs to be set for the first time.
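The store-on-first-call logic looks roughly like this (a Python analogue, where None plays the role of the unallocated member; the class and method names are stand-ins, not the library's exact API):

```python
class SGD:
    """Stand-in optimizer whose state must survive between updates."""

    def __init__(self, learning_rate=0.01):
        self.learning_rate = learning_rate
        self.calls = 0  # stands in for stored velocity state

    def minimize(self):
        self.calls += 1

class Network:
    def __init__(self):
        self.optimizer = None  # like an unallocated optimizer member

    def update(self, optimizer=None):
        # Store the optimizer on the first call only; later calls
        # reuse the stored instance so its state persists.
        if self.optimizer is None:
            self.optimizer = optimizer if optimizer is not None else SGD()
        self.optimizer.minimize()

net = Network()
net.update(SGD(learning_rate=0.1))  # stored on first call
net.update()                        # reuses the same instance and state
```

Because the second call reuses the stored instance, any velocity or RMS-gradient arrays it holds carry over between updates, which is the whole point of making the optimizer a member of the network.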

@milancurcic
Member

@Spnetic-5 I went ahead and added the optimizer instance as a member of the network as it may not be a trivial thing to do, so you can focus on implementing the convergence tests.

@milancurcic milancurcic added the enhancement New feature or request label Jul 12, 2023
@Spnetic-5
Collaborator Author

> @Spnetic-5 I went ahead and added the optimizer instance as a member of the network as it may not be a trivial thing to do, so you can focus on implementing the convergence tests.

Thanks @milancurcic, sure I'm working on convergence tests.

Member

@milancurcic milancurcic left a comment


Thank you @Spnetic-5; I left a few comments. We're almost there.

@Spnetic-5 Spnetic-5 requested a review from milancurcic July 14, 2023 05:49
Member

@milancurcic milancurcic left a comment


Thank you @Spnetic-5, this is now in good shape.
