
Apply optimizer to model weights without data copy #222


Draft · wants to merge 4 commits into main

Conversation

@milancurcic (Member) commented May 28, 2025

  • Layers:
    • dense
    • conv1d
    • conv2d
    • locally_connected1d
  • Optimizer:
    • SGD without memory (i.e., no momentum)
    • SGD and other methods (RMSProp, Adam, Adagrad) that keep memory

Alternative approach to #184

Currently this is implemented only for the dense layer, so many other features are not yet working. We now pass pointers to each layer's weights and biases to the optimizer, where they are updated in place.
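A minimal sketch of the idea, with hypothetical names (the actual interface in this PR may differ): the optimizer's minimize routine receives the layer's weight and bias arrays directly and updates them in place, so no flattened copy of the parameters is ever made.

```fortran
! Hypothetical sketch only; names and signatures are illustrative.
! The caller passes pointers to the dense layer's weights and biases;
! the assignment below modifies the layer's storage directly.
subroutine minimize_sgd(self, weights, biases, dw, db)
  class(sgd), intent(inout) :: self
  real, pointer, intent(inout) :: weights(:,:), biases(:)
  real, intent(in) :: dw(:,:), db(:)
  ! Plain SGD step, applied in place with no intermediate copy
  weights = weights - self % learning_rate * dw
  biases  = biases  - self % learning_rate * db
end subroutine minimize_sgd
```

Because the dummy arguments alias the layer's own arrays, the get_params/set_params round trip (and its copies) is no longer needed for the update step.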

Since this approach runs the optimizer on a layer-by-layer basis, optimizer memory such as velocity, RMS gradients, etc. is not implemented yet, as it originally assumed a whole-network update. Additional bookkeeping is needed there. A possible approach to this bookkeeping is to require the caller of optimizer % minimize() to pass explicit start and end indices over which the optimizer will run. This may seem tedious, but helper functions can make it easy.
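The start/end-index idea could look roughly like this (hypothetical sketch; istart/iend and the velocity layout are assumptions, not the PR's actual code): the optimizer keeps one flat memory array sized for the whole network, and each per-layer call addresses only its own slice.

```fortran
! Hypothetical sketch only. The optimizer's velocity array spans the
! whole network's flattened parameters; istart and iend select the
! slice belonging to the layer currently being updated.
subroutine minimize_sgd(self, params, grads, istart, iend)
  class(sgd), intent(inout) :: self
  real, intent(inout) :: params(:)  ! this layer's parameters (in place)
  real, intent(in) :: grads(:)      ! this layer's gradients, same size
  integer, intent(in) :: istart, iend
  ! Update the momentum memory for this layer's slice only
  self % velocity(istart:iend) = self % momentum * self % velocity(istart:iend) &
    - self % learning_rate * grads
  params = params + self % velocity(istart:iend)
end subroutine minimize_sgd
```

A helper that maps each layer to its (istart, iend) pair would hide this bookkeeping from the user.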

On MNIST training (examples/dense_mnist), the difference in run time is only ~1%. However, running the profiler on dense_mnist from both the main branch and this branch shows that the data copies (mostly in get_params, where two redundant copies of all model parameters were made) are now gone:

main branch dense_mnist:

$ head -15 dense_mnist_orig_fast_analysis.txt
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 89.75      5.43     5.43   998400     0.00     0.00  __nf_dense_layer_MOD_backward
  3.14      5.62     0.19  1418400     0.00     0.00  __nf_dense_layer_MOD_forward
  2.15      5.75     0.13   499200     0.00     0.00  __nf_random_MOD_shuffle
  1.32      5.83     0.08     3900     0.00     0.00  __nf_network_MOD_get_params
  0.66      5.87     0.04     7800     0.00     0.00  __nf_layer_MOD_get_params
  0.33      5.89     0.02  1707600     0.00     0.00  __nf_activation_MOD_eval_1d_softmax
  0.33      5.91     0.02     7800     0.00     0.00  __nf_dense_layer_MOD_set_params
  0.33      5.93     0.02     3900     0.00     0.00  __nf_network_MOD_update
  0.33      5.95     0.02     3900     0.00     0.00  __nf_optimizers_MOD_minimize_sgd
  0.33      5.97     0.02                             _init

This PR dense_mnist:

$ head -15 dense_mnist_new_fast_analysis.txt
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 93.84      4.72     4.72   998400     0.00     0.00  __nf_dense_layer_MOD_backward
  2.98      4.87     0.15  1418400     0.00     0.00  __nf_dense_layer_MOD_forward
  0.99      4.92     0.05  1707600     0.00     0.00  __nf_activation_MOD_eval_1d_softmax
  0.99      4.97     0.05   499200     0.00     0.00  __nf_random_MOD_shuffle
  0.40      4.99     0.02     7800     0.00     0.00  __nf_optimizers_MOD_minimize_sgd_2d
  0.40      5.01     0.02     3900     0.00     0.00  __nf_network_MOD_update
  0.20      5.02     0.01   709200     0.00     0.00  __nf_network_MOD_forward_1d
  0.20      5.03     0.01                             _init
  0.00      5.03     0.00  2127600     0.00     0.00  __nf_layer_MOD_forward
  0.00      5.03     0.00  1497600     0.00     0.00  __nf_layer_MOD_backward_1d

The above runs were compiled with gfortran 14.2.0 using the -pg -Ofast flags.

@jvdp1 (Collaborator) commented May 30, 2025

Since this approach runs the optimizer on a layer-by-layer basis, optimizer memory such as velocity,
rms gradients, etc. are not implemented yet, as they originally assumed whole-network update.

Another issue I just found: I think the optimizer arrays, e.g. m and v in Adam, must also be defined per layer, and separately for each weight and bias array within a layer.
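Concretely, the per-layer state could be grouped in a derived type like the following (a hypothetical sketch; the type and component names are assumptions):

```fortran
! Hypothetical sketch only: Adam's first and second moment estimates
! kept separately for each layer's weight and bias arrays, shaped to
! match the arrays they accumulate moments for.
type :: adam_layer_state
  real, allocatable :: m_weights(:,:), v_weights(:,:)  ! moments for weights
  real, allocatable :: m_biases(:), v_biases(:)        ! moments for biases
end type adam_layer_state
```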

@milancurcic (Member, Author) commented

@jvdp1 Indeed, that's the extra bookkeeping needed. However, I'm now wondering if we should instantiate an optimizer per layer rather than per network. This would take care of the bookkeeping altogether, because the "memory" arrays like m and v would then hold state per layer, which is what is needed. This approach would only be problematic if, in the future, we needed to implement an optimizer that performs some kind of global reduction of "memory" across layers. I'm not aware of such optimizers, but I'll do some research.
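The per-layer-optimizer idea could be sketched like this (hypothetical; whether neural-fortran's layer types would hold the optimizer this way is an open design question in this thread):

```fortran
! Hypothetical sketch only: each layer owns its own optimizer instance,
! so memory arrays like m and v are automatically scoped to that
! layer's parameters and no global index bookkeeping is needed.
type :: dense_layer
  real, allocatable :: weights(:,:), biases(:)
  class(optimizer_base_type), allocatable :: optimizer  ! per-layer state
end type dense_layer
```

With this layout, network % update() would simply loop over layers and call each layer's own optimizer % minimize() on that layer's arrays.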
