finetune.cpp command-line arg #13873

Open · graehl wants to merge 3 commits into master from finelayer

Conversation

@graehl commented May 28, 2025

Add to ggml-opt a learning-rate (AdamW alpha) command-line argument, and an optimizer enum defaulting to AdamW, in preparation for SGD support.

These are exposed in common args as a set of optimizer options active only for the new FINETUNE example (which, as a precaution, includes all the previous finetune.cpp PERPLEXITY options).

Perhaps breaking with precedent, the ggml_opt_optimizer_params struct is included directly in the args; if desired, we can instead add just the learning rate and optimizer type to a struct independent of ggml-opt.h.

As proposed in #13835.
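
For concreteness, a rough sketch of the ggml-opt side of this change. The enum and field names below are illustrative assumptions, not necessarily the PR's exact identifiers; the adamw sub-struct mirrors the existing ggml_opt_optimizer_params fields:

```cpp
// Sketch only: an optimizer selector next to the existing AdamW parameters
// in ggml-opt.h. The GGML_OPT_OPTIMIZER_* spellings are assumptions.
enum ggml_opt_optimizer {
    GGML_OPT_OPTIMIZER_ADAMW, // default, matches current behavior
    GGML_OPT_OPTIMIZER_SGD,   // to be implemented in a follow-on
};

struct ggml_opt_optimizer_params {
    enum ggml_opt_optimizer optimizer; // assumed new field

    struct {
        float alpha; // learning rate, newly settable from the command line
        float beta1;
        float beta2;
        float eps;
        float wd;    // weight decay
    } adamw;
};
```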

@graehl requested a review from JohannesGaessler as a code owner on May 28, 2025 20:26
@github-actions bot added the examples and ggml (changes relating to the ggml tensor library for machine learning) labels on May 28, 2025
@graehl (Author) commented May 28, 2025

Perhaps there's no need to review until I have an actual SGD implementation in a follow-on, @JohannesGaessler, but a few general questions about contributing:

  1. Is it OK to make small retouches to ggml/ sources in this (llama.cpp) project with the expectation of getting the changes into the actual ggml repo later? Are there any plans to submodule a ggml-in-llama branch to keep things straight(er)?
  2. Is what I've got here the expected way to add example-specific command-line arguments (see the sketch after this list)? For finetune we definitely at least want to be able to vary the learning rate, which was formerly hard-coded.
  3. Were the PERPLEXITY args, which I blindly added to the new FINETUNE example, actually doing anything interesting? Perhaps some should be dropped from finetune.
  4. Could you direct me to a .clang-format style file that might save me from accidentally re-indenting? I know I can set up clang-format to operate only on regions I've already changed ...
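
(Regarding question 2: a sketch of what example-gated registration can look like in common args, patterned on llama.cpp's add_opt/set_examples convention. The flag spelling, help text, params.lr field, and LLAMA_EXAMPLE_FINETUNE value are illustrative assumptions, not the PR's exact API:)

```cpp
// Illustrative sketch only: register a finetune-specific learning-rate
// option, active only for the finetune example.
add_opt(common_arg(
    {"-lr", "--learning-rate"}, "ALPHA",
    "adamw/sgd learning rate (replaces the formerly hard-coded value)",
    [](common_params & params, const std::string & value) {
        params.lr = std::stof(value); // hypothetical field on common_params
    }
).set_examples({LLAMA_EXAMPLE_FINETUNE}));
```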

@WilliamTambellini (Contributor) left a comment

You should keep that change as is; it takes time to get more feedback/approval.

@JohannesGaessler (Collaborator) commented

> Is it OK to make small retouches to ggml/ sources in this (llama.cpp) project with the expectation of getting the changes into the actual ggml repo later? Are there any plans to submodule a ggml-in-llama branch to keep things straight(er)?

Any changes made to the ggml source in this repository will eventually be synced to the ggml repository and vice versa; it is completely fine. I think the issue of a git submodule was previously brought up and rejected.

> Is what I've got here the expected way to add example-specific command-line arguments? For finetune we definitely at least want to be able to vary the learning rate, which was formerly hard-coded.

My opinion is that people serious about training should write a program rather than use a command-line tool. Still, I think it's good to make things such as the learning rate configurable in the provided example program.

> Were the PERPLEXITY args, which I blindly added to the new FINETUNE example, actually doing anything interesting? Perhaps some should be dropped from finetune.

I don't remember whether those args were put in by me when I copy-pasted code or by Georgi when he later refactored it, but I myself definitely did not make an intentional choice to use these exact arguments.

> Could you direct me to a .clang-format style file that might save me from accidentally re-indenting? I know I can set up clang-format to operate only on regions I've already changed ...

I don't know, sorry.

@WilliamTambellini (Contributor) commented

@ggerganov

@JohannesGaessler (Collaborator) left a comment

None of the previous perplexity-specific arguments are needed.

@JohannesGaessler (Collaborator) commented

For adding an SGD optimizer, add a new ggml op like OPT_STEP_SGD. Add a CPU implementation as a fallback for any backend without an implementation. Add a CUDA implementation, since that is (I assume) the backend which you intend to use in production. Add a test to tests/test-backend-ops.cpp to assert that the CPU and CUDA backends produce consistent results. Extend ggml-opt.cpp to conditionally use the new SGD optimizer step, and condition the allocation of the optimizer momenta on the optimizer type.
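
(For reference, the numerical core of such an op is small. A minimal CPU sketch of one SGD step with decoupled weight decay, using illustrative names rather than the eventual ggml kernel signature:)

```cpp
#include <cstdint>

// Sketch: one SGD step over n contiguous float parameters.
// x: parameters, g: gradients, alpha: learning rate, wd: weight decay.
static void opt_step_sgd_f32(float * x, const float * g, int64_t n,
                             float alpha, float wd) {
    const float keep = 1.0f - alpha * wd; // decoupled weight decay factor
    for (int64_t i = 0; i < n; ++i) {
        x[i] = keep * x[i] - alpha * g[i];
    }
}
```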

@graehl (Author) commented May 29, 2025

> For adding an SGD optimizer, add a new ggml op like OPT_STEP_SGD. Add a CPU implementation as a fallback for any backend without an implementation. Add a CUDA implementation, since that is (I assume) the backend which you intend to use in production. Add a test to tests/test-backend-ops.cpp to assert that the CPU and CUDA backends produce consistent results. Extend ggml-opt.cpp to conditionally use the new SGD optimizer step, and condition the allocation of the optimizer momenta on the optimizer type.

Yes, will do. Should the actual SGD implementation be a subsequent pull request (or several, e.g. starting with just the CPU implementation), or do you want it all in one pull request?

@JohannesGaessler (Collaborator) commented

Either way would be fine with me, as long as at no point there are broken or unfinished features on master.

@graehl force-pushed the finelayer branch 2 times, most recently from e752031 to e689af8 on May 29, 2025 17:07
@matiaslin (Contributor) left a comment

Looking forward to the next PR(s).

Commit: add to ggml-opt a learning-rate (adamw alpha) cmdline arg and an optimizer enum defaulting to adamw, including string->id mapping, preparatory to work to support SGD.

These are in common args a set of optimizer options active only for the new FINETUNE example (but we drop all the previous finetune.cpp PERPLEXITY options, which we're told are unused/accidental).

Perhaps breaking with precedent, the ggml_opt_optimizer_params struct is included directly as args; if desired, we can instead just add learning rate and optimizer type to a struct independent of ggml-opt.h.

As proposed in ggml-org#13835.
@graehl (Author) commented May 29, 2025

You should see frivolous clang-format changes (using the project's .clang-format) only on lines changed in the PR (via git-clang-format). If there's something undesirable, we can figure out what in the format config causes it.

@JohannesGaessler (Collaborator) commented

Don't autoformat code en masse unless it's done in a dedicated PR; it makes it unnecessarily difficult to track what was actually changed in a PR.

@JohannesGaessler (Collaborator) commented

Sorry, I didn't read the

> only on lines changed in the PR

part.

@github-actions bot added the build (Compilation issues), testing (Everything test related), and Nvidia GPU (Issues specific to Nvidia GPUs) labels on May 30, 2025
@graehl force-pushed the finelayer branch 3 times, most recently from 7534bbf to 48a16bf on May 30, 2025 16:57
@graehl (Author) commented May 30, 2025

Hi @WilliamTambellini, @JohannesGaessler: I think this is usable now, and I'm inviting code nitpicks etc. :)
I'm pretty new to the GitHub interface, honestly, so let me know if this needs to be two separate PRs, one for each commit, or if it's reasonable to just review both commits here. (Obviously better to merge separately: the first doesn't break any behavior; the second changes the finetune cmdline default learning rate, but that should hurt no one.)

@graehl (Author) commented May 30, 2025

The second (the actually usable SGD) commit is 48a16bf (also shown above).

@WilliamTambellini (Contributor) left a comment

This mixes up different projects: the CLI change/renaming and SGD. It needs to be split into 2 PRs.
@slaren?

@graehl force-pushed the finelayer branch 13 times, most recently from 914f336 to d8c6dd2 on May 31, 2025 00:27
@graehl (Author) commented May 31, 2025

OK, per request we are back to calling get_opt_pars(ud) twice per epoch; that shouldn't be noticeable, and I apologize for the churn.

@graehl force-pushed the finelayer branch 2 times, most recently from 62f86f9 to 51867fe on May 31, 2025 00:47
Commit: support finetune arg -opt SGD (or sgd). llama 3.2-1b-F32 result: observed 11 GB GPU RAM (45 sec/epoch) when using SGD instead of 19 GB (55 sec/epoch) using adamw.

(Getting the right learning rate for SGD is trickier than for adamw: too high and you overshoot and oscillate; too low and you waste compute slowly approaching convergence.)

SGD (or adamw) quickly reaches 99%+ train accuracy. Note: the objective loss may not be directly comparable between adamw and sgd; check perplexity or accuracy, or consider relative improvements, for convergence.

Also note that a logical batch size larger than the physical batch (gradient accumulation) seems unsupported for optimization (it is limited to the physical batch, unlike in perplexity, and also limited to ctx-size). Training quality/convergence could be improved by implementing it (at the cost of some memory, though you can make that up by using a much smaller physical batch, for a net memory savings). Presumably it's the physical batch that should be limited to ctx-size? See llama_context::opt_epoch.

New finetune args: -wd 1e-9 to enable weight decay in sgd or adamw, and max -epochs N (default 2, as before).

Cache (1 - wd*alpha) in the 'adamw' opt struct; no noticeable perf benefit.

Cache the computed per-epoch optimizer opts (formerly they were computed twice per epoch).

Add a unit-tested GGML_OPT_OPTIMIZER_SGD to ggml; it avoids allocating the m, v tensors. Make ggml_opt_init aware of the optimization method.

Since opt memory is pre-allocated, ggml_opt_get_optimizer_params would probably be able to change between SGD and AdamW with each epoch, but it would need to use adamw for the first (unconfirmed; there is no arg to set such a policy yet).

100 lines of wikipedia train:
train: ... loss=0.00231±0.00032 acc=99.99±0.01% t=00:00:05
val:   ... loss=3.91926±nan acc=58.40±2.18%

On more training data (500 lines), additional catastrophic forgetting before train reaches 99.9% accuracy:
train: data=0000140/0000140 loss=0.02611±0.00077 acc=99.82±0.02% t=00:00:45
val:   data=0000008/0000008 loss=4.11112±0.22526 acc=46.36±0.78%

Increasing batch+ctx sizes to 1536 (double what fits in memory for adamw) gets apparently better validation, but that could be an artifact of continuing training from previous weights, i.e. what's train vs. val probably depends on batch size. Also amusing: it is faster due to the larger batch, even though the larger context should be slower?
train: data=0000045/0000045 loss=0.01722±0.00103 acc=99.90±0.01% t=00:00:40
val:   data=0000003/0000003 loss=1.96829±1.09488 acc=72.44±0.66%
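
(To illustrate the "avoids allocating m, v tensors" point in the commit message above, a sketch of conditioning momenta allocation on the optimizer type in ggml_opt_init; the context fields and enum values here are illustrative assumptions, not ggml's actual internals:)

```cpp
// Sketch: AdamW keeps first/second moment buffers (m, v) per parameter
// tensor; plain SGD keeps none, which is where the observed memory saving
// (11 GB vs 19 GB above) comes from. Names are illustrative.
static void alloc_momenta(ggml_opt_context * opt_ctx, ggml_context * ctx,
                          ggml_tensor * params_tensor) {
    if (opt_ctx->optimizer == GGML_OPT_OPTIMIZER_ADAMW) {
        opt_ctx->m = ggml_dup_tensor(ctx, params_tensor); // first moment
        opt_ctx->v = ggml_dup_tensor(ctx, params_tensor); // second moment
    }
    // GGML_OPT_OPTIMIZER_SGD: nothing to allocate.
}
```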