
llama.cpp : split llama_context_params into model and context params #3301

Merged: 17 commits into master from llama-model-params, Sep 28, 2023

Conversation

@slaren (Collaborator) commented Sep 21, 2023

Currently, llama_load_model_from_file and llama_new_context_with_model receive the same parameters, but it is not specified which parameters are relevant for each object. This change splits the parameters into llama_model_params and llama_context_params so that it is clear what parameters can be modified per-context.

Additionally:

  • Uses the model n_ctx_train as the default context size when llama_context_params::n_ctx == 0 or -c 0
  • Moves n_threads to llama_context_params, adds n_threads_batch and the corresponding command line argument -tb, --threads-batch for batches/prompt processing
  • Removes the low_vram parameter and the corresponding -lv, --low-vram command line argument
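
For illustration, a minimal sketch of how a caller might use the split API after this change. The default-initializer helpers llama_model_default_params() and llama_context_default_params(), and the exact field placement (e.g. n_gpu_layers living in the model params), are assumptions for this sketch rather than details spelled out here.

#include "llama.h"

int main(void) {
    // Model-level parameters: fixed once the weights are loaded.
    struct llama_model_params mparams = llama_model_default_params();      // assumed helper name
    mparams.n_gpu_layers = 35;                                              // assumed model-level field

    struct llama_model * model = llama_load_model_from_file("model.gguf", mparams);
    if (model == NULL) {
        return 1;
    }

    // Context-level parameters: may differ between contexts created from the same model.
    struct llama_context_params cparams = llama_context_default_params();  // assumed helper name
    cparams.n_ctx           = 0;   // 0 -> fall back to the model's n_ctx_train
    cparams.n_threads       = 8;   // threads for generation
    cparams.n_threads_batch = 16;  // threads for batch/prompt processing (-tb, --threads-batch)

    struct llama_context * ctx = llama_new_context_with_model(model, cparams);

    // ... use ctx ...

    llama_free(ctx);
    llama_free_model(model);
    return 0;
}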

@slaren added the breaking change and refactoring labels on Sep 21, 2023
@slaren marked this pull request as ready for review on September 21, 2023 23:33
@slaren (Collaborator, Author) commented Sep 22, 2023

Despite what the prints say, it looks like the KV cache is offloaded regardless of the value of low_vram. Am I missing something here, or did this functionality get lost somewhere along the way? Tagging @JohannesGaessler.

llama.cpp/llama.cpp, lines 2323 to 2338 at bc9d3e3:

if (n_gpu_layers > (int) hparams.n_layer + 1) {
    if (low_vram) {
        LLAMA_LOG_INFO("%s: cannot offload v cache to GPU due to low VRAM option\n", __func__);
    } else {
        LLAMA_LOG_INFO("%s: offloading v cache to GPU\n", __func__);
        vram_kv_cache += hparams.kv_size() / 2;
    }
}
if (n_gpu_layers > (int) hparams.n_layer + 2) {
    if (low_vram) {
        LLAMA_LOG_WARN("%s: cannot offload k cache to GPU due to low VRAM option\n", __func__);
    } else {
        LLAMA_LOG_INFO("%s: offloading k cache to GPU\n", __func__);
        vram_kv_cache += hparams.kv_size() / 2;
    }
}

llama.cpp/llama.cpp, lines 1229 to 1236 at bc9d3e3:

#ifdef GGML_USE_CUBLAS
    if (n_gpu_layers > n_layer + 1) {
        ggml_cuda_assign_buffers_no_scratch(cache.v);
    }
    if (n_gpu_layers > n_layer + 2) {
        ggml_cuda_assign_buffers_no_scratch(cache.k);
    }
#endif // GGML_USE_CUBLAS

@JohannesGaessler (Collaborator) commented:

Both the --low-vram option and KV cache offloading were added in #1827. It seems that the program logic and the prints were inconsistent from the start. On master the results are still correct when combining --low-vram with KV cache offloading, but performance degrades relative to offloading only the repeating and non-repeating layers because of a much higher number of CPU<->GPU copies. I don't remember for certain what I did several months ago, but I think the prints say that the combination of --low-vram and KV cache offloading is not allowed because supporting the combination would not make sense in terms of performance, and I wanted to reduce the implementation and maintenance work.

I'm not 100% sure how to proceed. The results happen to be correct, but, as I said, supporting the combination does not make sense and I would not consider it in scope. So I personally would still prefer to disallow it and adjust the program logic to match the prints.

@slaren (Collaborator, Author) commented Sep 22, 2023

The reason this is relevant to this PR is that I currently have low_vram duplicated in both llama_model_params and llama_context_params, since the parameter is supposed to be used both when loading the model and when creating the KV cache. I don't think this is good, and I would prefer to remove it from llama_context_params. It would still be possible to do that and match the logic of the prints by storing the value of low_vram in llama_model, but I am not convinced it is worth doing, since the same behavior can be achieved simply by using a smaller value of n_gpu_layers.

There is also the issue that the only documentation of low_vram (in the command line help) is that it disables the VRAM scratch buffer. However, I don't think disabling the scratch buffer makes much sense at the moment, since it can be reduced to almost nothing by using a low batch size, so the performance cost is probably disproportionate to the amount of VRAM it saves. I am not even sure that disabling the VRAM scratch buffer does anything now, since this memory will still be needed to store the results of the operations; it will just be allocated from the CUDA memory pool instead.

@JohannesGaessler (Collaborator) commented:

Alright. I do not feel strongly about this, and my reasoning revolves around what is easier to implement and maintain anyway (assuming we want to keep --low-vram, that would be to disallow the combination with KV cache offloading). Intuitively I would agree that, thanks to your improvements to VRAM allocation (and performance optimizations in general), --low-vram probably no longer makes sense, and I would be fine with just removing the option. I cannot 100% guarantee that there isn't still some use case where --low-vram helps, but if there are any they should be very rare.

@slaren (Collaborator, Author) commented Sep 22, 2023

I ran some tests with low_vram, and besides some unintended consequences of disabling the scratch buffer while still offloading the KV (some mat muls during attention wouldn't be offloaded anymore), I couldn't observe any meaningful differences in VRAM usage. After removing the effect of low_vram on the VRAM scratch buffer, the only remaining effect was preventing the offloading of the output norm weight, which is tiny anyway, just 16 KB on 7B. So I went ahead and removed the low_vram option entirely.

@ggerganov (Owner) left a comment:

We can merge this together with #3228 to make the API change in one go

@ggerganov added the high priority label on Sep 22, 2023
@netrunnereve (Contributor) commented Sep 22, 2023

Thanks for completing #2534! It looks and runs fine on my end.

Note there are a few things from my PR that you may want to add here, such as README updates and including the thread count in the system info line.

@slaren (Collaborator, Author) commented Sep 26, 2023

While we are changing the API, should we also remove all the _with_model functions, and make the regular functions take a llama_model instead? I have already added a llama_get_model API to obtain the model of a context, so that users that only have a llama_context for whatever reason can still use these functions.
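
For illustration, a small sketch of the pattern this enables. llama_get_model is the accessor mentioned above; treating a query such as llama_n_vocab as taking the model rather than the context is an assumption made for this example.

// A caller that only holds a llama_context can still reach model-level
// queries through the llama_get_model accessor.
static int vocab_size_from_ctx(struct llama_context * ctx) {
    const struct llama_model * model = llama_get_model(ctx);
    return llama_n_vocab(model);   // assumed model-taking variant of the query
}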

@ggerganov (Owner) commented:

> While we are changing the API, should we also remove all the _from_model functions

Do you mean _with_model?
Yes, we should do that. Let's resolve conflicts and merge.

Comment on lines +6568 to +6569
/*.n_threads =*/ GGML_DEFAULT_N_THREADS, // TODO: better default
/*.n_threads_batch =*/ GGML_DEFAULT_N_THREADS,
@slaren (Collaborator, Author) commented:

Should we change this to something like std::thread::hardware_concurrency(), or get_num_physical_cores() from common.cpp? It shouldn't affect most of our examples, but a better default could help downstream users.
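
For illustration, a rough sketch of the kind of runtime default being discussed, using std::thread::hardware_concurrency() as a stand-in; it reports logical cores, unlike get_num_physical_cores() in common.cpp, so this is only an assumption-laden example, not the existing implementation.

#include <thread>

#include "ggml.h"   // GGML_DEFAULT_N_THREADS

// Prefer the runtime-detected logical core count; fall back to the
// compile-time constant when detection is unavailable (returns 0).
static int default_n_threads() {
    const unsigned int n = std::thread::hardware_concurrency();
    return n > 0 ? (int) n : GGML_DEFAULT_N_THREADS;
}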

@ggerganov (Owner) replied:

The best option would be to move get_num_physical_cores() to ggml and use it.
But that can be done in a separate PR.

@slaren (Collaborator, Author) commented Sep 28, 2023

If the API changes look good I think this can be merged.

@ggerganov merged commit 16bc66d into master on Sep 28, 2023
33 checks passed
@ggerganov deleted the llama-model-params branch on September 28, 2023 19:42
joelkuiper added a commit to vortext/llama.cpp that referenced this pull request Oct 2, 2023
…example

* 'master' of github.com:ggerganov/llama.cpp:
  ggml-cuda : perform cublas mat mul of quantized types as f16 (ggerganov#3412)
  llama.cpp : add documentation about rope_freq_base and scale values (ggerganov#3401)
  train : fix KQ_pos allocation (ggerganov#3392)
  llama : quantize up to 31% faster on Linux and Windows with mmap (ggerganov#3206)
  readme : update hot topics + model links (ggerganov#3399)
  readme : add link to grammars app (ggerganov#3388)
  swift : fix build on xcode 15 (ggerganov#3387)
  build : enable more non-default compiler warnings (ggerganov#3200)
  ggml_tensor: update the structure comments. (ggerganov#3283)
  ggml : release the requested thread pool resource (ggerganov#3292)
  llama.cpp : split llama_context_params into model and context params (ggerganov#3301)
  ci : multithreaded builds (ggerganov#3311)
  train : finetune LORA (ggerganov#2632)
  gguf : basic type checking in gguf_get_* (ggerganov#3346)
  gguf : make token scores and types optional (ggerganov#3347)
  ci : disable freeBSD builds due to lack of VMs (ggerganov#3381)
  llama : custom attention mask + parallel decoding + no context swaps (ggerganov#3228)
  docs : mark code as Bash (ggerganov#3375)
  readme : add Mistral AI release 0.1 (ggerganov#3362)
  ggml-cuda : perform cublas fp16 matrix multiplication as fp16 (ggerganov#3370)
yusiwen pushed a commit to yusiwen/llama.cpp that referenced this pull request Oct 7, 2023
…gerganov#3301)

* llama.cpp : split llama_context_params into model and context params

ggml-ci

* fix metal build

* fix freq_base/scale default to model value

* llama-bench : keep the same model between tests when possible

* move n_threads to llama_context_params, add n_threads_batch

* fix mpi build

* remove kv_size(), cuda scratch fixes

* remove low-vram option

* add n_threads_batch to system info, refactor to get_system_info()

* add documentation about --threads-batch to the READMEs

* llama-bench fix

* main : fix rope freq/scale warning

* llama.cpp : add llama_get_model
common : add llama_tokenize from model

* remove duplicated ctx/model functions

ggml-ci

* cuda : print total VRAM used
Labels: breaking change, high priority, refactoring
4 participants