Train mem usage and other improvements #2439

Merged · 104 commits · Aug 28, 2023

Commits on Jul 28, 2023

  1. 5d124d0
  2. remove unnecessary Adam(W) optimizer tensors

    reduces optimizer memory overhead from 7*modelsize to 2*modelsize.

    additionally allows optimizing models with more than 2^31 parameters by replacing int with int64_t.

    bumps the training checkpoint file version, but old checkpoints can still be read.
    the new version with fewer tensors is saved.

    xaedes committed Jul 28, 2023 · d39c8e6
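
    For reference, a minimal AdamW step in plain C over flat float arrays (an illustrative sketch, not the ggml implementation; names are assumptions): the only persistent optimizer state is the first moment m and the second moment v, i.e. 2*modelsize floats, and int64_t indexing allows more than 2^31 parameters.

    ```c
    #include <math.h>
    #include <stdint.h>

    // One AdamW update over n parameters. Persistent state: m and v only.
    void adamw_step(float * w, const float * g, float * m, float * v,
                    int64_t n, int64_t t,
                    float alpha, float beta1, float beta2, float eps, float wd) {
        const float b1t = 1.0f - powf(beta1, (float) t);   // bias-correction terms
        const float b2t = 1.0f - powf(beta2, (float) t);
        for (int64_t i = 0; i < n; ++i) {
            m[i] = beta1 * m[i] + (1.0f - beta1) * g[i];
            v[i] = beta2 * v[i] + (1.0f - beta2) * g[i] * g[i];
            const float mh = m[i] / b1t;
            const float vh = v[i] / b2t;
            // decoupled weight decay (AdamW); wd == 0 gives plain Adam
            w[i] -= alpha * (mh / (sqrtf(vh) + eps) + wd * w[i]);
        }
    }
    ```
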
  3. add gradient clipping to AdamW

    xaedes committed Jul 28, 2023 · d395b19
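
    A hedged sketch of gradient clipping by global norm, a common way to clip gradients for Adam(W); the function name and the exact scheme used in ggml are assumptions.

    ```c
    #include <math.h>
    #include <stdint.h>

    // Scale the whole gradient vector down if its L2 norm exceeds max_norm.
    void clip_gradients(float * g, int64_t n, float max_norm) {
        double sum = 0.0;
        for (int64_t i = 0; i < n; ++i) {
            sum += (double) g[i] * (double) g[i];
        }
        const double norm = sqrt(sum);
        if (norm > max_norm) {
            const float scale = (float) (max_norm / (norm + 1e-6));
            for (int64_t i = 0; i < n; ++i) {
                g[i] *= scale;
            }
        }
    }
    ```
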
  4. d7003a9
  5. implement gradient checkpointing for training

    reduces memory overhead from O(n_layer) to O(sqrt(n_layer)),
    as explained in the readme of https://github.com/cybertronai/gradient-checkpointing

    xaedes committed Jul 28, 2023 · 6e3f95b
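
    A toy sketch of the idea under the assumption of a simple chain of layers (not the ggml graph code): keep only every `step`-th activation during the forward pass and recompute the missing ones from the nearest checkpoint when the backward pass needs them. With step ≈ sqrt(n_layer), the stored activations grow like O(sqrt(n_layer)) instead of O(n_layer).

    ```c
    typedef float (*layer_fn)(int layer, float x);   // hypothetical per-layer forward

    // forward pass: store activations only at checkpoint layers
    void forward_with_checkpoints(layer_fn f, float x0, int n_layer, int step,
                                  float * ckpt /* size: n_layer/step + 1 */) {
        float x = x0;
        for (int l = 0; l < n_layer; ++l) {
            if (l % step == 0) {
                ckpt[l / step] = x;          // checkpoint the segment input
            }
            x = f(l, x);
        }
    }

    // recompute the activation feeding layer l from the nearest checkpoint;
    // the backward pass calls this instead of reading a stored activation
    float recompute_activation(layer_fn f, const float * ckpt, int step, int l) {
        float x = ckpt[l / step];
        for (int k = (l / step) * step; k < l; ++k) {
            x = f(k, x);
        }
        return x;
    }
    ```
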
  6. remove unused compute buffer 3

    xaedes committed Jul 28, 2023 · e05e441
  7. add and use function ggml_build_backward_expand to avoid stack overflows with large maximum number of nodes

    GGML_API void ggml_build_backward_expand(struct ggml_context * ctx, struct ggml_cgraph * gf, struct ggml_cgraph * gb, bool keep);

    xaedes committed Jul 28, 2023 · ed4319e
  8. change AdamW decay parameter to work like the torch AdamW decay parameter

    It is now relative to the Adam learning rate `alpha*sched`.
    Before this change it was relative to `sched` only.

    `alpha` is the maximum learning rate and `sched` is a scaling parameter in [0..1].

    xaedes committed Jul 28, 2023 · a80f184
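
    A hedged before/after sketch of the decay term (names illustrative, not the ggml code): the decoupled weight-decay step is now scaled by the effective learning rate `alpha*sched`, as in torch.optim.AdamW, instead of by `sched` alone.

    ```c
    // before: decay relative to sched only
    static inline float decay_step_old(float w, float decay, float alpha, float sched) {
        (void) alpha;
        return w - sched * decay * w;
    }

    // after: decay relative to alpha*sched, matching torch.optim.AdamW
    static inline float decay_step_new(float w, float decay, float alpha, float sched) {
        return w - alpha * sched * decay * w;
    }
    ```
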
  9. f175ead
  10. change default AdamW weight decay parameter defined in ggml to 0.0, making Adam the default instead of AdamW

    For reference, the default weight decay parameter for torch.optim.AdamW is 0.01.

    xaedes committed Jul 28, 2023 · 97964a4
  11. bug fixes for cross entropy loss

    ggml_cross_entropy_loss: sums were not correctly added in the workload of each thread
    ggml_cross_entropy_loss_back: simplify the backward pass, reducing numerical issues

    guard usage of the f16 exp lookup in cross entropy behind #define GGML_CROSS_ENTROPY_EXP_FP16

    cross entropy loss is only used once during training, but it is quite sensitive to the numerical errors introduced by the f16 exp lookup,
    so the f16 exp lookup is disabled by default for cross entropy loss, trading very slightly worse runtime performance for better gradients.

    xaedes committed Jul 28, 2023 · 2c6985f
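
    For context, a minimal sketch of a numerically stable cross entropy for one row of logits against target probabilities p (p sums to 1), using the log-sum-exp trick so exp() never overflows; this is illustrative, not the ggml kernel.

    ```c
    #include <math.h>

    // -sum_i p[i] * log(softmax(logits)[i]), computed in a stable way
    float cross_entropy_row(const float * logits, const float * p, int n) {
        float max = logits[0];
        for (int i = 1; i < n; ++i) {
            if (logits[i] > max) max = logits[i];
        }
        double sum = 0.0;
        for (int i = 0; i < n; ++i) {
            sum += exp((double)(logits[i] - max));
        }
        const double log_z = log(sum) + max;   // log of the softmax normalizer
        double loss = 0.0;
        for (int i = 0; i < n; ++i) {
            loss -= (double) p[i] * ((double) logits[i] - log_z);
        }
        return (float) loss;
    }
    ```
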
  12. fix test-grad0 for cross_entropy_loss

    the second argument to cross_entropy_loss must sum to 1 for each row

    xaedes committed Jul 28, 2023 · 2d1e6e0
  13. fix test-grad0 for soft_max

    don't use only sum as the aggregation, because the sum of a softmax is always 1, so finite differences cannot work.
    instead use sum(log(soft_max()*(1-eps)+eps)); the eps avoids log(0).

    xaedes committed Jul 28, 2023 · 864e7e3
  14. 87febee
  15. change cross_entropy_loss to output the average over all rows

    this helps keep the loss and gradients in a sane range

    xaedes committed Jul 28, 2023 · 51dc770
  16. improve gradient checkpointing

    sqrt(n_layers) is only the best checkpoint step when the memory size of a checkpoint and the memory size of a layer are equal.
    since layers require more memory than the single-tensor checkpoints we use, the optimal value is computed differently:

    ```
      given: n, u, v
      objective: minimize(a*u + b*v) where a*b = n, a > 0, b > 0
      b = n/a
      minimize(a*u + v*n/a)
      diff(a*u + v*n/a, a) = u - (v*n/a)/a
      diff(a*u + v*n/a, a) == 0
      u - (v*n/a)/a == 0
      u == v*n/(a*a)
      u*a*a = v*n
      a*a = v*n/u
      a = sqrt(n*v/u)
    ```

    this change results in more checkpoints, requiring fewer layers to store between checkpoints, improving overall memory usage.

    xaedes committed Jul 28, 2023 · 3744a9b
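
    As a worked sketch of the formula above (assuming u is the memory per checkpoint and v the memory per stored layer, as in the derivation):

    ```c
    #include <math.h>

    // a = sqrt(n*v/u): the number of checkpoints minimizing a*u + (n/a)*v
    int optimal_n_checkpoints(int n_layer, double mem_per_checkpoint, double mem_per_layer) {
        double a = sqrt((double) n_layer * mem_per_layer / mem_per_checkpoint);
        return a < 1.0 ? 1 : (int) (a + 0.5);   // round to nearest, at least 1
    }
    ```

    With u == v this reduces to sqrt(n_layer); when a layer costs more than a single-tensor checkpoint (v > u), the formula yields more checkpoints, matching the commit description.
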
  17. fc379a2
  18. d0fbb7d
  19. add more training parameters:

    --enable-restart N         Only for Adam optimizer. Enable restarts of cos-decay
    --disable-restart N        Only for Adam optimizer. Disable restarts of cos-decay
    --opt-past N               Number of optimization iterations to track for delta convergence test. Disabled when zero.
    --opt-delta N              Maximum delta for delta convergence test. Disabled when <= zero.
    --opt-max-no-improvement N Maximum number of optimization iterations with no improvement. Disabled when <= zero.
    --adam-epsf N              AdamW epsilon for convergence test. Disabled when <= zero.
    --adam-min-alpha N         Adam minimum learning rate alpha, usually 0.1 * alpha

    xaedes committed Jul 28, 2023 · c6a18e1
  20. replace memcpy with a reshape operation so that the graph is not cut at the input

    this makes it possible to store other values into the input tensor and then simply recompute the graph without rebuilding it

    xaedes committed Jul 28, 2023 · ce937bc
  21. ff759d9
  22. e843d6e
  23. add optimization callback to ggml_opt_resume_g

    this callback is called before each iteration with custom data and a pointer to the learning schedule parameter (only used in Adam(W)).

    it can be used for a dynamic learning schedule and for setting input data for batches before each iteration

    xaedes committed Jul 28, 2023 · bfc3119
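
    A purely hypothetical illustration of such a callback (the real ggml callback type is not shown in this log, so the signature, struct, and names here are assumptions): before each iteration the optimizer hands the callback its user data and a pointer to the schedule value, so the caller can update the learning-rate schedule and stage the next batch.

    ```c
    struct my_train_state {            // hypothetical user data
        int   iter;
        float warmup_iters;
    };

    // called before each optimizer iteration
    static void my_opt_callback(void * vdata, float * sched) {
        struct my_train_state * data = (struct my_train_state *) vdata;
        // linear warmup, then constant; writing through sched scales the
        // maximum learning rate alpha for this iteration
        *sched = (float) data->iter < data->warmup_iters
            ? (float) data->iter / data->warmup_iters
            : 1.0f;
        // here one would also copy the next batch into the (reshaped) input tensor
        data->iter++;
    }
    ```
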
  24. use optimization callback in training

    allows a dynamic learning schedule and different batch data for each iteration without relying on low n_iter and high n_examples parameters

    reduces runtime by avoiding restarts of the optimization function and improves training convergence by providing a different batch for each iteration

    xaedes committed Jul 28, 2023 · d7aa4d9
  25. add minimum number of tensor dimensions to apply weight decay (default 2)

    this allows weight decay to be skipped for bias parameters

    xaedes committed Jul 28, 2023 · e6ff072
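
    A hedged sketch of the rule (field and function names are illustrative, not ggml's): decay is applied only to tensors with at least `min_dim` dimensions, so 1-D bias vectors are excluded by the default of 2.

    ```c
    struct param_info { int n_dims; };   // stand-in for a parameter tensor

    // returns the decay to apply to this parameter, or 0 for low-dim tensors
    static float decay_for(const struct param_info * p, float decay, int min_dim) {
        return p->n_dims >= min_dim ? decay : 0.0f;
    }
    ```
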
  26. rename training parameter cos-decay-alpha to cos-decay-min and clarify that adam-min-alpha also applies to warmup

    xaedes committed Jul 28, 2023 · 58024d3
  27. fix increase of model.train_samples and model.train_tokens

    now that each optimizer iteration gets its own batch, we need to multiply by the number of opt iterations

    xaedes committed Jul 28, 2023 · 17a0898
  28. change sampling parameters for prediction after training to the defaults of common.h

    and clarify what is context for prediction and what are generated tokens

    xaedes committed Jul 28, 2023 · 24a4b09
  29. 1065c3b
  30. add conditional compilation for using F16 exp in flash attention

    uncomment `// #define GGML_FLASH_ATTN_EXP_FP16` to enable usage of f16 exp in flash attention

    xaedes committed Jul 28, 2023 · dbbc263
  31. 47055c9
  32. 0f6a8ab
  33. remove commented-out vectorized code of opt_adam

    the vectorized code might be a bit faster for a low number of parameters, but it had a big memory usage overhead

    xaedes committed Jul 28, 2023 · 87035b9
  34. ecdc161
  35. c1a5e11
  36. remove trailing whitespace

    xaedes committed Jul 28, 2023 · 22cb368

Commits on Aug 6, 2023

  1. d43af4b
  2. 2bf422e

Commits on Aug 14, 2023

  1. in train function replace add_inplace with regular add

    because using add_inplace seems to result in different gradients

    xaedes committed Aug 14, 2023 · fc826c8
  2. don't allocate hash_map on the context

    because the context has no_alloc=True when using the memory allocator, resulting in NULL data pointers

    xaedes committed Aug 14, 2023 · d437415
  3. cfddc36
  4. 0dd496c
  5. 52c92c0
  6. correctly clone view tensors by setting data pointers

    without this, checkpointing would only work when used together with the memory allocator

    xaedes committed Aug 14, 2023 · 345f516
  7. fix variable names

    xaedes committed Aug 14, 2023 · 5a11b75
  8. swap arguments to commutative ops to be the same as in `forward_batch_wo_cache_flash_attn`

    xaedes committed Aug 14, 2023 · b2f1310
  9. add input tensors as checkpoints

    so that the recursive tensor cloning of gradient checkpointing terminates on input tensors

    xaedes committed Aug 14, 2023 · 5884b43
  10. 9716eb8
  11. make sure some tensors are not reallocated by inserting new temporary nodes depending on them:

    output and parameter gradient tensors need to be available at the end of graph execution

    parameter gradient tensors also need to be available before graph execution, because they are set to zero before each optimizer iteration

    checkpoint tensors are allocated all together to reduce memory allocator fragmentation

    afterwards, in addition to the temporary nodes, we also need to reset the temporary leafs

    xaedes committed Aug 14, 2023 · 38f4438
  12. d6c5b03
  13. 4ed096c
  14. integrate unified training function which may use the memory allocator

    the unified training function also takes arguments for whether to use flash attention and/or gradient checkpointing

    xaedes committed Aug 14, 2023 · 865c4cd
  15. 3e99a8d
  16. 75baed2
  17. fe788a1
  18. c954f41
  19. 271e4d6
  20. remove trailing whitespace

    xaedes committed Aug 14, 2023 · 6f161c7
  21. remove unused train params: mem_compute1_gb & mem_compute2_gb

    mem_compute_gb is used for compute when the automatic memory allocator is not enabled; otherwise it can be very small, holding only the tensor definitions
    mem_compute0_gb is used for the automatic memory allocator (as long as measurement of the maximum required size is not implemented)

    xaedes committed Aug 14, 2023 · 3794dce
  22. 6e280b2
  23. add debug asserts in ggml_allocr_alloc to catch some common pitfalls when using this function directly

    xaedes committed Aug 14, 2023 · faf3e21
  24. 098654c
  25. fix test for when to create the temporary backward graph

    a temporary backward graph is only necessary when using checkpointing

    xaedes committed Aug 14, 2023 · 3e6468b
  26. fix memory "leak" in optimizers

    each iteration, a new cplan with new memory for work data was allocated.
    now cplan creation happens only at the start of optimization, with each iteration reusing the cplan and its work data.

    xaedes committed Aug 14, 2023 · 5622846
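
    A generic, hedged sketch of the fix (not the ggml code): allocate the work buffer once before the optimization loop and reuse it every iteration, instead of allocating a fresh one per iteration.

    ```c
    #include <stdlib.h>
    #include <stddef.h>

    void optimize(size_t work_size, int n_iter) {
        void * work_data = malloc(work_size);   // created once, outside the loop
        if (!work_data) return;
        for (int it = 0; it < n_iter; ++it) {
            // ... run the graph computation reusing the same work_data ...
            (void) work_data;
        }
        free(work_data);                        // released once at the end
    }
    ```
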
  27. reverse order of for loop in ggml_build_backward_expand to save memory when using gradient checkpointing and allocator

    with this loop order, gradient checkpointing with the allocator saves 13% memory on a 16-layer model and 2% memory on a 2-layer model.

    the computation results are the same

    xaedes committed Aug 14, 2023 · 3b5515b

Commits on Aug 24, 2023

  1. Merge branch 'master' into pr-train-mem-usage-improvements

    # Conflicts:
    #	examples/train-text-from-scratch/train-text-from-scratch.cpp

    xaedes committed Aug 24, 2023 · 0c52c65
  2. 4072f20
  3. implement llama model file saving using gguf

    checkpoint loading and saving disabled, to be replaced by loading and saving via gguf

    xaedes committed Aug 24, 2023 · f51c5d7

Commits on Aug 25, 2023

  1. 5407981

Commits on Aug 26, 2023

  1. bug fixes

    xaedes committed Aug 26, 2023 · 6a20f7a
  2. 167dd2d
  3. 2978e03

Commits on Aug 27, 2023

  1. 0c494cc
  2. 3a91c97
  3. a6f3a47
  4. add gguf arch and ftype

    xaedes committed Aug 27, 2023 · cb42324
  5. 495a62a
  6. ef899fb
  7. d71069c
  8. 91a4cca
  9. 0b2c85b
  10. fix memory corruption bug in gguf

    ctx->kv and ctx->infos were reallocated using an unaligned realloc but freed with an aligned free.
    to fix this, a GGML_ALIGNED_REALLOC was added, but there is no posix_memalign_realloc function,
    so on non-Windows and non-MinGW32 platforms we fall back to an aligned malloc, followed by copying
    and freeing the old data.

    xaedes committed Aug 27, 2023 · ca5b344
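
    A hedged sketch of that fallback (not the exact ggml macro; the function name is an assumption): since POSIX has no aligned realloc, grow by allocating a new aligned block, copying the old contents, and freeing the old block.

    ```c
    #include <stdlib.h>
    #include <string.h>

    // POSIX-only sketch: aligned "realloc" built from posix_memalign + memcpy + free
    void * aligned_realloc_fallback(void * old_ptr, size_t old_size,
                                    size_t new_size, size_t alignment) {
        void * new_ptr = NULL;
        if (posix_memalign(&new_ptr, alignment, new_size) != 0) {
            return NULL;                       // allocation failed, old_ptr untouched
        }
        if (old_ptr) {
            memcpy(new_ptr, old_ptr, old_size < new_size ? old_size : new_size);
            free(old_ptr);                     // old block was also aligned-allocated
        }
        return new_ptr;
    }
    ```
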
  11. add gguf example cmake file

    xaedes committed Aug 27, 2023 · 5d94997
  12. bug fixes in tokenize_file

    xaedes committed Aug 27, 2023 · 76d2794
  13. 4882ff0
  14. 152cfaa

Commits on Aug 28, 2023

  1. bug fix in read_tensor_by_name

    xaedes committed Aug 28, 2023 · 1f83343
  2. 3d8d884
  3. e86b3e3
  4. daa0b6c
  5. remove trailing whitespace

    xaedes committed Aug 28, 2023 · f97f92b
  6. c690c20
  7. 5f27ade
  8. temporarily add code to write old checkpoint files

    used to verify that old checkpoint files are correctly converted to gguf

    xaedes committed Aug 28, 2023 · e8df9e6
  9. 31c093c
  10. 63bf200
  11. remove trailing whitespace

    xaedes committed Aug 28, 2023 · 3155019
  12. remove prediction-related code

    use main for prediction; it is better optimized

    xaedes committed Aug 28, 2023 · 3e7dfd0
  13. 17ab46d
  14. 12c4e5b
  15. a925e93
  16. 440d221
  17. remove GGML_ALIGNED_REALLOC and use normal malloc/realloc/free for gguf ctx->kv & ctx->infos

    xaedes committed Aug 28, 2023 · f6828cb
  18. 93535a4