Train mem usage and other improvements #2439
Commits on Jul 28, 2023
5d124d0
d39c8e6 remove unnecessary Adam(W) optimizer tensors.
Reduces optimizer memory overhead from 7*modelsize to 2*modelsize. Additionally allows optimizing models with more than 2^31 parameters by replacing int with int64_t. Bumps the training checkpoint file version, but old checkpoints can still be read; the new version with fewer tensors is saved.

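A minimal sketch of the idea (hypothetical names, not the actual ggml code): a decoupled AdamW step only needs the first and second moment per parameter, so the persistent optimizer state can shrink to two buffers of model size.

```c
#include <math.h>
#include <stdint.h>

/* Hypothetical sketch: AdamW state reduced to two buffers (m and v), so the
 * optimizer overhead is 2*modelsize instead of keeping further per-parameter
 * tensors around between iterations. */
typedef struct {
    float * m;   /* first moment,  one value per parameter */
    float * v;   /* second moment, one value per parameter */
} adamw_state;

static void adamw_step(float * w, const float * g, adamw_state * s,
                       int64_t n,   /* int64_t allows more than 2^31 parameters */
                       float alpha, float beta1, float beta2,
                       float eps, float wd, int64_t t) {
    for (int64_t i = 0; i < n; ++i) {
        s->m[i] = beta1*s->m[i] + (1.0f - beta1)*g[i];
        s->v[i] = beta2*s->v[i] + (1.0f - beta2)*g[i]*g[i];
        const float mh = s->m[i] / (1.0f - powf(beta1, (float) t));
        const float vh = s->v[i] / (1.0f - powf(beta2, (float) t));
        w[i] -= alpha * (mh / (sqrtf(vh) + eps) + wd*w[i]);
    }
}
```
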
d395b19
d7003a9
6e3f95b implement gradient checkpointing for training
Reduces memory overhead from O(n_layer) to O(sqrt(n_layer)), as explained in the readme of https://github.com/cybertronai/gradient-checkpointing.

e05e441
ed4319e add and use function ggml_build_backward_expand to avoid stack overflows with a large maximum number of nodes
`GGML_API void ggml_build_backward_expand(struct ggml_context * ctx, struct ggml_cgraph * gf, struct ggml_cgraph * gb, bool keep);`

a80f184 change AdamW decay parameter to work like the torch AdamW decay parameter
It is now relative to the Adam learning rate `alpha*sched`; before, it was relative to `sched` only. `alpha` is the maximum learning rate and `sched` is a scaling parameter in [0..1].

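Written out as a hedged sketch in standard decoupled AdamW notation (not quoted from the ggml code), "relative to `alpha*sched`" means the decay term is scaled by the effective learning rate:

```latex
\theta_{t+1} = \theta_t - \alpha\,\mathrm{sched}\left(\frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon} + \lambda\,\theta_t\right)
\quad\text{instead of}\quad
\theta_{t+1} = \theta_t - \alpha\,\mathrm{sched}\,\frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon} - \mathrm{sched}\,\lambda\,\theta_t
```
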
f175ead change default AdamW weight decay parameter used in training to 0.1, as used in nanoGPT

97964a4 change default AdamW weight decay parameter defined in ggml to 0.0, making Adam the default instead of AdamW
For comparison, the default weight decay parameter of torch.optim.AdamW is 0.01.

2c6985f bug fixes for cross entropy loss
ggml_cross_entropy_loss: sums were not correctly added across the workload of each thread. ggml_cross_entropy_loss_back: simplify the backward process, reducing numerical issues. Guard usage of the f16 exp lookup in cross entropy with #define GGML_CROSS_ENTROPY_EXP_FP16. Cross entropy loss is only used once during training, but it is quite sensitive to the numerical errors introduced by the exp-f16 lookup, so the lookup is disabled by default for cross entropy loss, trading better gradients for very slightly worse runtime performance.

2d1e6e0 fix test-grad0 for cross_entropy_loss
The second argument to cross_entropy_loss must sum to 1 for each row.

864e7e3 don't use only sum as aggregation, because the sum of softmax is always 1, so finite differences would not work
Instead use sum(log(soft_max()*(1-eps)+eps)); eps avoids log(0).

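Written out, the scalar used for the finite-difference check is the same formula as in the commit message; since the plain sum of softmax outputs is identically 1, its derivative would be zero everywhere and the gradient test would be meaningless.

```latex
L(x) = \sum_{i} \log\bigl(\operatorname{softmax}(x)_i \,(1-\epsilon) + \epsilon\bigr),
\qquad \epsilon > 0 \text{ avoids } \log(0).
```
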
87febee
51dc770 change cross_entropy_loss to output the average over all rows
This helps keep the loss and gradients in a sane range.

3744a9b improve gradient checkpointing
sqrt(n_layers) is only the best checkpoint step when the memory size of the checkpoints and the memory size of the layers are equal. Since layers require more memory than the single-tensor checkpoints we use, the optimal value is computed differently:

```
given: n, u, v
objective: minimize(a*u + b*v) where a*b = n, a > 0, b > 0
b = n/a
minimize(a*u + v*n/a)
diff(a*u + v*n/a, a) = u - (v*n/a)/a
diff(a*u + v*n/a, a) == 0
u - (v*n/a)/a == 0
u == v*n/(a*a)
u*a*a = v*n
a*a = v*n/u
a = sqrt(n*v/u)
```

This change results in more checkpoints, requiring fewer layers to store between checkpoints, overall improving memory usage.

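For intuition, a worked example with made-up numbers (n, u, v here are illustrative, not measured values from the PR):

```latex
n = 32,\; u = 1,\; v = 4:\qquad
a^\* = \sqrt{n v / u} = \sqrt{128} \approx 11.3
```
```latex
a^\* u + \frac{n}{a^\*}\,v \approx 11.3 + 11.3 = 22.6
\quad\text{vs.}\quad
\sqrt{n}\,u + \sqrt{n}\,v \approx 5.7 + 22.6 = 28.3
```

so placing more, cheaper checkpoints reduces the total in this example by roughly 20%.
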
fc379a2
d0fbb7d
c6a18e1
--enable-restart N            Only for Adam optimizer. Enable restarts of cos-decay.
--disable-restart N           Only for Adam optimizer. Disable restarts of cos-decay.
--opt-past N                  Number of optimization iterations to track for delta convergence test. Disabled when zero.
--opt-delta N                 Maximum delta for delta convergence test. Disabled when <= zero.
--opt-max-no-improvement N    Maximum number of optimization iterations with no improvement. Disabled when <= zero.
--adam-epsf N                 AdamW epsilon for convergence test. Disabled when <= zero.
--adam-min-alpha N            Adam minimum learning rate alpha, usually 0.1 * alpha.

ce937bc replace memcpy with a reshape operation so that the graph is not cut at the input
This makes it possible to store other values into the input tensor and then simply recompute the graph without rebuilding it.

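A hedged sketch of the pattern (variable names are illustrative, not the exact code from the PR): the graph references the persistent input tensor through a reshape, which is a view, instead of memcpy-ing the data into a detached tensor. New batches can then be written into `tokens_input->data` and the same graph recomputed as-is.

```c
#include "ggml.h"

/* Sketch: keep one persistent input tensor and let the graph reference it
 * through a reshape instead of copying data into a separate tensor. */
struct ggml_tensor * make_token_input(struct ggml_context * ctx, int n_tokens, int n_batch) {
    struct ggml_tensor * tokens_input =
        ggml_new_tensor_1d(ctx, GGML_TYPE_I32, (int64_t) n_tokens * n_batch);

    /* a reshape keeps the graph connected to tokens_input; a memcpy would not */
    return ggml_reshape_2d(ctx, tokens_input, n_tokens, n_batch);
}
```
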
ff759d9
e843d6e
bfc3119 add optimization callback to ggml_opt_resume_g
This callback is called before each iteration with custom data and a pointer to the learning schedule parameter (only used in Adam(W)). It can be used for a dynamic learning schedule and for setting the input data of a batch before each iteration.

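A hedged sketch of how such a callback might look and be used; the exact callback signature here is an assumption, not quoted from the PR.

```c
/* Assumed callback shape: invoked before each optimizer iteration with user
 * data and a pointer to the schedule value the optimizer will use. */
typedef void (*opt_callback)(void * data, float * sched);

struct train_state {
    int iter;    /* current iteration, advanced by the callback */
    int warmup;  /* iterations of linear warmup                 */
    /* ...pointers to the graph's input tensors for the next batch... */
};

static void my_opt_callback(void * data, float * sched) {
    struct train_state * st = (struct train_state *) data;
    /* dynamic learning schedule: linear warmup, then the full rate */
    *sched = st->iter < st->warmup ? (float) st->iter / (float) st->warmup : 1.0f;
    /* ...fill the input tensors with the next batch here... */
    st->iter++;
}
```
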
d7aa4d9 use optimization callback in training
Allows a dynamic learning schedule and different batch data for each iteration without relying on low n_iter and high n_examples parameters. Reduces runtime by avoiding restarts of the optimization function and improves training convergence by providing a different batch for each iteration.

e6ff072 add minimum number of tensor dimensions to apply weight decay (default 2)
This makes it possible to not apply weight decay to bias parameters.

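A minimal sketch of the rule (the helper and parameter names are assumptions): weight decay is applied only to parameters with at least the threshold number of dimensions, which excludes 1-D tensors such as biases.

```c
#include "ggml.h"

/* Hedged sketch, not the actual ggml code: return the decay to apply to a
 * parameter tensor, or 0 for tensors below the dimension threshold. */
static float effective_decay(const struct ggml_tensor * p, float decay, int decay_min_ndim) {
    int ndim = 1;
    for (int i = 1; i < GGML_MAX_DIMS; ++i) {
        if (p->ne[i] > 1) ndim = i + 1;   /* count trailing non-1 dimensions */
    }
    return ndim >= decay_min_ndim ? decay : 0.0f;
}
```
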
58024d3 rename training parameter cos-decay-alpha to cos-decay-min and clarify that adam-min-alpha also applies to warmup

17a0898 fix increase of model.train_samples and model.train_tokens
Now that each optimizer iteration gets its own batch, we need to multiply by the number of optimizer iterations.

24a4b09 change sampling parameters for prediction after training to the defaults of common.h, and clarify what is context for prediction and what are generated tokens

1065c3b
dbbc263 add conditional compilation of using F16 exp in flash attention
Uncomment `// #define GGML_FLASH_ATTN_EXP_FP16` to enable usage of f16 exp in flash attention.

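The guard follows the usual compile-time switch pattern; a hedged sketch (only the define name comes from the commit, the helper name is illustrative):

```c
#include <math.h>

// #define GGML_FLASH_ATTN_EXP_FP16   /* uncomment to use the f16 exp lookup */

static inline float attn_exp(float x) {
#ifdef GGML_FLASH_ATTN_EXP_FP16
    return exp_f16_lookup(x);   /* illustrative fast path via an f16 lookup table */
#else
    return expf(x);             /* default: full-precision exp, better gradients */
#endif
}
```
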
47055c9
0f6a8ab
87035b9 remove commented-out vectorized code of opt_adam
The vectorized code might be a bit faster for a low number of parameters, but it had a big memory usage overhead.

ecdc161
c1a5e11
22cb368
Commits on Aug 6, 2023
d43af4b
2bf422e add train function using automatic gradient checkpointing backward pass and allocator

Commits on Aug 14, 2023
fc826c8 in train function replace add_inplace by regular add
Using add_inplace seems to result in different gradients.

d437415 don't allocate hash_map on context
The context has no_alloc=True when using the memory allocator, resulting in NULL data pointers.

cfddc36
0dd496c
52c92c0
345f516 correctly clone view tensors by setting data pointers
Without this, checkpointing would only work when used together with the memory allocator.

5a11b75
b2f1310 swap arguments to commutative ops to be the same as in `forward_batch_wo_cache_flash_attn`

5884b43 add input tensors as checkpoints
This ensures the recursive tensor cloning of gradient checkpointing terminates on input tensors.

9716eb8
38f4438 make sure some tensors are not reallocated by inserting new temporary nodes depending on them
Output and parameter gradient tensors need to be available at the end of graph execution. Parameter gradient tensors also need to be available before graph execution, because they are set to zero before each optimizer iteration. Checkpoint tensors are allocated all together to reduce memory allocator fragmentation. Afterwards, in addition to the temporary nodes, we also need to reset the temporary leafs.

d6c5b03
4ed096c
865c4cd integrate unified training function which may use the memory allocator
The unified training function also supports arguments for whether to use flash attention and/or gradient checkpointing.

3e99a8d
75baed2
fe788a1
c954f41
271e4d6
6f161c7
3794dce remove unused train params: mem_compute1_gb & mem_compute2_gb
mem_compute_gb is used for compute when the automatic memory allocator is not enabled; otherwise it can be very small, only holding the tensor definitions. mem_compute0_gb is used for the automatic memory allocator (as long as measurement of the maximum required size is not implemented).

6e280b2
faf3e21 add debug asserts in ggml_allocr_alloc for some common pitfalls when using this function directly

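The commit does not list the asserts; a typical pitfall when calling an allocator directly is passing a tensor that already has data, so a hedged sketch of such a check could look like this (the exact asserts in ggml_allocr_alloc may differ):

```c
#include "ggml.h"

/* Hedged sketch of a defensive check before handing a tensor to an allocator. */
static void check_allocatable(const struct ggml_tensor * t) {
    /* a tensor that already has data must not be allocated a second time;
     * likewise, view tensors share their source's memory and should not get
     * their own allocation */
    GGML_ASSERT(t->data == NULL);
}
```
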
098654c
3e6468b fix test of when to create the temporary backward graph
The temporary backward graph is only necessary when using checkpointing.

5622846 fix memory "leak" in optimizers
Each iteration, a new cplan with new memory for work data was allocated. Now cplan creation only happens at the start of optimization, with each iteration reusing the cplan and its work data.

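A hedged sketch of the reuse pattern (simplified, error handling omitted): build the cplan and its work buffer once, then reuse both for every optimizer iteration instead of reallocating per iteration.

```c
#include "ggml.h"
#include <stdint.h>
#include <stdlib.h>

/* Sketch: plan once, reuse the work buffer for every iteration. */
void run_iterations(struct ggml_cgraph * gb, int n_threads, int n_iter) {
    struct ggml_cplan cplan = ggml_graph_plan(gb, n_threads);
    uint8_t * work_data = cplan.work_size > 0 ? malloc(cplan.work_size) : NULL;
    cplan.work_data = work_data;

    for (int it = 0; it < n_iter; ++it) {
        /* ...update inputs / zero gradients here... */
        ggml_graph_compute(gb, &cplan);   /* same cplan and work buffer each time */
    }

    free(work_data);
}
```
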
3b5515b reverse order of for loop in ggml_build_backward_expand to save memory when using gradient checkpointing and allocator
With this loop order, gradient checkpointing with the allocator saves 13% memory on a 16-layer model and 2% memory on a 2-layer model. The computation results are the same.

Commits on Aug 24, 2023
0c52c65 Merge branch 'master' into pr-train-mem-usage-improvements
# Conflicts:
#   examples/train-text-from-scratch/train-text-from-scratch.cpp

4072f20
f51c5d7 implement llama model file saving using gguf
Checkpoint loading and saving is disabled, to be replaced by loading and saving via gguf.

Commits on Aug 25, 2023
5407981
Commits on Aug 26, 2023
6a20f7a
167dd2d
2978e03
Commits on Aug 27, 2023
0c494cc
3a91c97
a6f3a47
cb42324
495a62a
ef899fb
d71069c
91a4cca
0b2c85b
ca5b344 fix memory corruption bug in gguf
ctx->kv and ctx->infos were reallocated using a non-aligned realloc, but freed with an aligned free. To fix this, a GGML_ALIGNED_REALLOC was added, but there is no posix_memalign_realloc function, so on non-Windows and non-mingw32 platforms we fall back to an aligned malloc, followed by copying and freeing the old data.

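A hedged sketch of the fallback described above (simplified; the real macro and its error handling differ): allocate a new aligned block, copy the old contents, then free the old block.

```c
#include <stdlib.h>
#include <string.h>

/* Sketch of an "aligned realloc" fallback for platforms without one:
 * aligned alloc + copy + free of the old block. old_size must be tracked by
 * the caller, since there is no portable way to query an allocation's size. */
static void * aligned_realloc_fallback(void * old_ptr, size_t old_size,
                                       size_t new_size, size_t alignment) {
    void * new_ptr = NULL;
    if (posix_memalign(&new_ptr, alignment, new_size) != 0) {
        return NULL;
    }
    if (old_ptr != NULL) {
        memcpy(new_ptr, old_ptr, old_size < new_size ? old_size : new_size);
        free(old_ptr);
    }
    return new_ptr;
}
```
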
5d94997
76d2794
4882ff0
152cfaa
Commits on Aug 28, 2023
1f83343
3d8d884
e86b3e3
daa0b6c
f97f92b
c690c20
5f27ade
e8df9e6 temporarily add code to write old checkpoint files
Used to verify that old checkpoint files are correctly converted to gguf.

31c093c bug fixes for convert-train-checkpoint-to-gguf.py loading checkpoints with opt_version=0

63bf200
3155019
3e7dfd0 remove prediction related code
Use main for prediction; it is better optimized.

17ab46d
12c4e5b
a925e93
440d221
f6828cb remove GGML_ALIGNED_REALLOC and use normal malloc/realloc/free for gguf ctx->kv & ctx->infos

93535a4