
Commit d8c6dd2

examples/finetune -opt SGD (stochastic gradient descent) memory opt
Support the new finetune arg -opt SGD (or sgd).

Llama 3.2-1B-F32 result: observed 11 GB GPU RAM (45 sec/epoch) with SGD instead of 19 GB (55 sec/epoch) with AdamW. (Getting the right learning rate for SGD is trickier than for AdamW: too high and you overshoot and oscillate, too low and you waste compute slowly approaching convergence.) SGD quickly reaches 99%+ train accuracy on a tiny Wikipedia training set (~58% token accuracy on held-out eval, which is reasonable).

Note: the objective loss is probably not directly comparable between AdamW and SGD; check perplexity or accuracy, or compare relative improvements, when judging convergence.

Also note that a logical batch size larger than the physical batch (gradient accumulation) appears unsupported for optimization: the logical batch is limited to the physical batch, unlike in ppx (perplexity), and is also limited to ctx-size. Training quality/convergence could be improved by implementing it, at the cost of some memory, though that can be made up by using a much smaller physical batch for a net memory savings. Presumably it is the physical batch that should be limited to ctx-size? See llama_context::opt_epoch.

New finetune args: -wd 1e-9 enables weight decay in SGD or AdamW, and -epochs N sets the maximum number of epochs (default 2, as before).

Cache (1 - wd*alpha) in the 'adamw' opt struct, and cache the computed per-epoch optimizer opts (formerly they were computed twice per epoch).

Add unit-tested GGML_OPT_OPTIMIZER_SGD to ggml; it avoids allocating the m and v tensors. Make ggml_opt_init aware of the optimization method. Since optimizer memory is pre-allocated, ggml_opt_get_optimizer_params would probably be able to switch between SGD and AdamW with each epoch, but it would need to use AdamW for the first epoch (unconfirmed; there is no arg to set such a policy yet).

100 lines:
train: ... loss=0.00231±0.00032 acc=99.99±0.01% t=00:00:05
val:   ... loss=3.91926±nan acc=58.40±2.18%

On more training data (500 lines) there is additional catastrophic forgetting before train reaches 99.9% accuracy:
train: data=0000140/0000140 loss=0.02611±0.00077 acc=99.82±0.02% t=00:00:45
val:   data=0000008/0000008 loss=4.11112±0.22526 acc=46.36±0.78%

Increasing batch+ctx sizes to 1536 (double what fits in memory for AdamW) gets apparently better validation, but that could be an artifact of continuing training from previous weights, i.e. what counts as train vs. val probably depends on batch size. Also amusing: it is faster due to the larger batch even though a larger context would be slower?
train: data=0000045/0000045 loss=0.02010±0.00138 acc=99.85±0.01% t=00:00:40
val:   data=0000003/0000003 loss=1.96829±1.09488 acc=72.44±0.66%
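For illustration, a possible command line exercising the new options. Only -opt, -wd and -epochs are introduced by this commit; the binary name, model/data paths and the context, batch and GPU-offload options are the usual llama.cpp finetune arguments with placeholder values, assumed here rather than taken from the commit:

    ./llama-finetune -m Llama-3.2-1B-F32.gguf -f train.txt \
        -c 512 -b 512 -ngl 99 \
        -opt sgd -wd 1e-9 -epochs 2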
1 parent aa59aa3 commit d8c6dd2
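To make the memory comparison concrete, a minimal sketch (illustrative only, not the ggml implementation; names are made up) of an SGD step with decoupled weight decay: each parameter is scaled by the cached (1 - wd*alpha) factor and then stepped against its gradient, so no per-parameter optimizer state is kept, whereas AdamW stores first and second moment (m, v) tensors per parameter.

    #include <stdint.h>

    // Illustrative SGD step with decoupled weight decay.
    // alpha = learning rate, wd = weight decay.
    static void sgd_step_sketch(float * w, const float * g, int64_t n, float alpha, float wd) {
        const float keep = 1.0f - wd * alpha;   // cacheable once per step, cf. the (1 - wd*alpha) cached above
        for (int64_t i = 0; i < n; ++i) {
            w[i] = keep * w[i] - alpha * g[i];  // no m/v moment tensors needed
        }
    }

For a ~1B-parameter F32 model, the two AdamW moment tensors alone are roughly 2 x 1e9 x 4 bytes ≈ 8 GB, which lines up with the 19 GB vs 11 GB figures reported in the message.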


20 files changed: +1731 −1654 lines


.clang-format

Lines changed: 1 addition & 3 deletions

@@ -23,7 +23,7 @@ AllowShortLambdasOnASingleLine: Inline
 AllowShortLoopsOnASingleLine: false
 AlwaysBreakBeforeMultilineStrings: true
 BinPackArguments: true
-BinPackParameters: true # OnePerLine
+BinPackParameters: true
 BitFieldColonSpacing: Both
 BreakBeforeBraces: Custom # Attach
 BraceWrapping:
@@ -45,7 +45,6 @@ BraceWrapping:
   SplitEmptyFunction: false
   SplitEmptyRecord: false
   SplitEmptyNamespace: false
-# BreakAdjacentStringLiterals: true
 BreakAfterAttributes: Never
 BreakBeforeBinaryOperators: None
 BreakBeforeInlineASMColon: OnlyMultiline
@@ -158,4 +157,3 @@ TabWidth: 4
 UseTab: Never
 WhitespaceSensitiveMacros: ['STRINGIZE']
 ...
-

CMakeLists.txt

Lines changed: 2 additions & 0 deletions

@@ -12,6 +12,8 @@ if (NOT XCODE AND NOT MSVC AND NOT CMAKE_BUILD_TYPE)
     set_property(CACHE CMAKE_BUILD_TYPE PROPERTY STRINGS "Debug" "Release" "MinSizeRel" "RelWithDebInfo")
 endif()

+message("CMAKE_BUILD_TYPE=${CMAKE_BUILD_TYPE}")
+
 # Add path to modules
 list(APPEND CMAKE_MODULE_PATH "${CMAKE_CURRENT_SOURCE_DIR}/cmake/")

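As a usage note (standard CMake behavior, nothing specific to this commit): the added message() call prints the effective build type during configuration, for example

    cmake -B build -DCMAKE_BUILD_TYPE=Release
    # configure output includes: CMAKE_BUILD_TYPE=Release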
