Comparing changes

base repository: ggml-org/llama.cpp
base: master
head repository: ggml-org/llama.cpp
compare: 0cc4m/vulkan-op-opt-step-sgd
  • 4 commits
  • 25 files changed
  • 3 contributors

Commits on Jul 22, 2025

  1. examples/finetune -opt SGD (stochastic gradient descent) memory opt

    add unit-tested GGML_OPT_OPTIMIZER_SGD to ggml - unlike adamw it avoids
    allocating the per-parameter m, v moment tensors.
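
    to make the memory difference concrete, a minimal sketch of the two
    update rules in plain C++ (illustrative only, not the ggml kernels;
    the function names are made up). SGD touches only the weights and
    gradients, while AdamW also needs m and v buffers, each the same size
    as the weights:

        #include <cmath>
        #include <cstddef>
        #include <cstdint>
        #include <vector>

        // plain SGD with decoupled weight decay: no per-parameter optimizer state
        void sgd_step(std::vector<float> & w, const std::vector<float> & g,
                      float alpha, float wd) {
            for (size_t i = 0; i < w.size(); ++i) {
                w[i] = (1.0f - alpha * wd) * w[i] - alpha * g[i];
            }
        }

        // AdamW keeps two extra buffers (m, v) the same size as the parameters;
        // that is where the additional optimizer memory goes
        void adamw_step(std::vector<float> & w, const std::vector<float> & g,
                        std::vector<float> & m, std::vector<float> & v,
                        float alpha, float beta1, float beta2, float eps,
                        float wd, int64_t t) {
            for (size_t i = 0; i < w.size(); ++i) {
                m[i] = beta1 * m[i] + (1.0f - beta1) * g[i];
                v[i] = beta2 * v[i] + (1.0f - beta2) * g[i] * g[i];
                const float mhat = m[i] / (1.0f - std::pow(beta1, (float) t));
                const float vhat = v[i] / (1.0f - std::pow(beta2, (float) t));
                w[i] = (1.0f - alpha * wd) * w[i]
                     - alpha * mhat / (std::sqrt(vhat) + eps);
            }
        }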
    
    support finetune.cpp arg -opt SGD (or sgd); the default stays adamw as before.
    
    llama 3.2-1b-F32 result: SGD uses 11 GB GPU RAM (41 sec/epoch)
    vs 19 GB (55 sec/epoch) with adamw
    (finetuning on 100 lines of wikipedia).
    
    (
    using the same GPU memory, adamw can only fit 512 batch/context
    before OOM, reaching:
    train: [███████▉] data=0000140/0000140 loss=0.02575±0.00099 acc=99.52±0.03% t=00:00:47 ETA=00:00:00
    val:   [███████▉] data=0000008/0000008 loss=4.76565±0.28810 acc=41.46±0.77% t=00:00:00 ETA=00:00:00
    
    SGD is superior: although it converges more slowly, it fits up to 1728
    batch/context before OOM (note especially the better validation perf):
    train: [███████▉] data=0000039/0000039 loss=0.00371±0.00010 acc=99.96±0.01% t=00:00:41 ETA=00:00:00
    val:   [███████▉] data=0000003/0000003 loss=5.11406±0.76034 acc=48.01±0.69% t=00:00:01 ETA=00:00:00
    )
    
    note: when finetuning long enough (or with a large enough -lr),
    validation accuracy *eventually* drops ('catastrophic forgetting')
    
    the -lr-half (half-life) option is useful for SGD to avoid oscillation or
    very slow underdamped learning (it makes setting -lr more forgiving).
    the terminal -lr is for now set by -lr-halvings, i.e. if you want at most
    1/8 of the initial -lr, set -lr-halvings 3.
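
    one plausible reading of that schedule, as an illustrative sketch only
    (not the finetune.cpp code; it assumes the half-life is counted in epochs
    and that the floor comes from -lr-halvings):

        #include <algorithm>
        #include <cmath>

        // learning rate with half-life decay, floored at lr0 / 2^halvings,
        // so -lr-halvings 3 never lets the rate fall below 1/8 of -lr
        float lr_at_epoch(float lr0, float lr_half, int halvings, float epoch) {
            const float decayed = lr0 * std::pow(0.5f, epoch / lr_half);
            const float lr_min  = lr0 / std::pow(2.0f, (float) halvings);
            return std::max(decayed, lr_min);
        }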
    
    note: the objective loss may not be directly comparable between adamw and
    sgd - check perplexity or accuracy, or compare relative improvements, to
    judge convergence
    
    new finetune args: -wd 1e-9 enables weight decay in sgd or adamw, and
    -epochs N caps the number of epochs (default 2 as before)
    
    caching (1 - wd*alpha) in the 'adamw' opt struct gave no noticeable perf
    benefit, so it is disabled there (it is still done for the new SGD)
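
    for illustration only (not the ggml code): the decay factor is identical
    for every element in a given step, so it can be hoisted out of the inner
    loop and cached once per step:

        #include <cstddef>

        // the decoupled weight-decay factor (1 - wd*alpha) is constant within
        // a step, so compute it once and reuse it for every element
        void sgd_step_cached(float * w, const float * g, size_t n,
                             float alpha, float wd) {
            const float keep = 1.0f - wd * alpha;   // cached once per step
            for (size_t i = 0; i < n; ++i) {
                w[i] = keep * w[i] - alpha * g[i];
            }
        }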
    
    since optimizer memory is pre-allocated, ggml_opt_get_optimizer_params
    could probably switch between SGD and AdamW each epoch, as long as the
    first epoch uses adamw (unconfirmed - there is no cmdline arg to set such
    a policy yet)
    
    test-opt checks adamw as before and now also sgd (a few sgd-only tests are
    disabled; they probably just need their values logged and alternate
    reference values added); tolerance on the 'regression' test is broader for
    sgd so we don't need many more epochs
    graehl committed Jul 22, 2025
    commit bc39aa6

Commits on Aug 4, 2025

  1. commit 50e83ea

Commits on Aug 5, 2025

  1. commit 9d03124
  2. commit 2ec70c9