Fix issue #4791 alloc causes compute_size to be calculated incorrectly in train-text-from-scratch, end result core dump #5033

Merged (2 commits into ggerganov:master on Jan 19, 2024)

Conversation

@bzuzo (Contributor) commented Jan 19, 2024

Between releases b1618 and b1680, train-text-from-scratch broke: compute_size comes back as exactly the same size as ggml_allocr_max_size, which of course is not possible.

I ran into this when trying to train, and it has been reported in a few places, one of them being #4791.

Current master as of Jan 18, 2024:
main: compute_size = 140477909098560 bytes (133970176.0 MB)
main: evaluation order = RIGHT_TO_LEFT
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted (core dumped)

After the fix, the sizes are back to normal and reasonable:

main: input_size = 131076128 bytes (125.0 MB)
main: compute_size = 701759840 bytes (669.3 MB)
main: evaluation order = LEFT_TO_RIGHT
main: tokenize training data
tokenize_file: total number of samples: 27520
main: number of training tokens: 27584
main: begin training
main: work_size = 768376 bytes (0.7 MB)
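
For illustration, a minimal standalone sketch of one way a measured size can come out as garbage like the 140477909098560 bytes above: querying a "measure" allocator's maximum size is only meaningful while the allocator is still alive. The sketch uses a hypothetical mock measure_alloc type, not ggml's real allocator API, and the exact interaction inside ggml-alloc is not spelled out in this thread; the merged change (see the commit messages further down) simply removes a ggml_allocr_free call.

```cpp
// Illustrative only: a hypothetical mock "measure" allocator, not ggml's API.
#include <cstdio>
#include <cstddef>

struct measure_alloc {
    size_t max_size; // running maximum of all sizes recorded while measuring
};

static measure_alloc * measure_new()                                { return new measure_alloc{0}; }
static void            measure_record(measure_alloc * a, size_t n)  { if (n > a->max_size) a->max_size = n; }
static size_t          measure_max_size(const measure_alloc * a)    { return a->max_size; }
static void            measure_free(measure_alloc * a)              { delete a; }

int main() {
    measure_alloc * alloc = measure_new();

    // Pretend to measure the compute graph of one evaluation order.
    measure_record(alloc, 701759840); // ~669 MB, matching the fixed run above

    // Correct order: read the measured size first, free the allocator afterwards.
    const size_t compute_size = measure_max_size(alloc);
    measure_free(alloc);

    // Reading measure_max_size(alloc) *after* measure_free(alloc) would be a
    // use-after-free: undefined behaviour that can yield a huge garbage size
    // and a later std::bad_alloc, consistent with the broken run above.
    printf("compute_size = %zu bytes (%.1f MB)\n",
           compute_size, compute_size / (1024.0 * 1024.0));
    return 0;
}
```

Built with `g++ -std=c++11`, this prints `compute_size = 701759840 bytes (669.3 MB)`, matching the fixed run.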

@bzuzo changed the title from "Fix issue with alloc causing max_compute_size to be calculated" to "Fix issue #4791 with alloc causing max_compute_size to be calculated" on Jan 19, 2024
@bzuzo changed the title from "Fix issue #4791 with alloc causing max_compute_size to be calculated" to "Fix issue #4791 alloc causes compute_size to be calculated incorrectly in train-text-from-scratch, end result core dump" on Jan 19, 2024
@ggerganov (Owner)

Instead of this, let's implement the fix from this comment: #4791 (comment)

If you can confirm it works in your case, we can merge

@bzuzo (Contributor, Author) commented Jan 19, 2024

Okay, I'll try that suggestion.

> Instead of this, let's implement the fix from this comment: #4791 (comment)
>
> If you can confirm it works in your case, we can merge

I implemented the fix as suggested, and it works.

main: compute_size = 701759840 bytes (669.3 MB)
main: evaluation order = LEFT_TO_RIGHT
main: tokenize training data
tokenize_file: total number of samples: 755165
main: number of training tokens: 755229
main: begin training
main: work_size = 768376 bytes (0.7 MB)
train_opt_callback: iter= 256 sample=4113/755165 sched=0.862144 loss=0.000000 |->
train_opt_callback: iter= 257 sample=4129/755165 sched=0.861124 loss=5.805915 dt=00:00:04 eta=00:19:10 |->

@ggerganov merged commit 381ee19 into ggerganov:master on Jan 19, 2024
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request on Feb 3, 2024:
* Fix issue with alloc causing max_compute_size to be calculated
* remove ggml_allocr_free as suggested in issue ggerganov#4791

hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request on Apr 1, 2024:
* Fix issue with alloc causing max_compute_size to be calculated
* remove ggml_allocr_free as suggested in issue ggerganov#4791