llama : refactor llama_context, llama_kv_cache, llm_build_context (v2) #12181
Conversation
force-pushed from 766edbf to 62ba774
Planning to merge this tomorrow unless there are any suggestions for improvements.
force-pushed from 62ba774 to a170669
llama : refactor llama_context, llama_kv_cache, llm_build_context (ggml-org#12181)

* llama : refactor llama_context, llama_kv_cache, llm_build_context
* graph : don't mutate the KV cache during defrag
* context : reduce virtuals + remove test function
* context : move interface implementation to source file + factory
* graph : move KV cache build functions to llama_context impl
* graph : remove model reference from build_pooling
* graph : remove llama_model reference
* kv_cache : provide rope factors
* graph : rework inputs to use only unique_ptr, remove attn input abstraction
* context : remove llama_context_i abstraction
* context : clean-up
* graph : clean-up
* llama : remove redundant keywords (struct, enum)
* model : adapt gemma3
* graph : restore same attention ops as on master
* llama : remove TODO + fix indent
@ggerganov I noticed that T5 models no longer work correctly after merging this PR, so I investigated possible causes. I see that you removed … (line 1381 in 01e8f21), but you pass a "3D" V tensor here (line 9566 in 01e8f21). This results in … I found that removing this single line fixed the problem. I'd correct it myself, but I'm not sure how you intend to handle …
    
          
Thanks for catching that. I broke this in commit 70ef653. The reason was that I wanted the PR to produce the same graphs as on master. Maybe the user code should explicitly set the attention type? Btw, this probably explains the differences that I referred to in this #12181 (comment).
    
          
Do you mean something like this? I just tested it and it works fine. Maybe an extra assert in encode() that would print some info if causal_attn is set to true would be good too; otherwise, existing code will silently stop working correctly for no apparent reason.
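The exact change tested here is not shown; purely as an illustration of what "explicitly set the attention type" could look like from user code (this assumes llama_set_causal_attn() is the mechanism meant, and is not necessarily the change that was tested):

```cpp
#include "llama.h"
#include <vector>

// Sketch: run the T5 encoder with non-causal attention, then switch back to
// causal attention before decoding.
static void encode_then_decode(llama_context * ctx, std::vector<llama_token> & enc_tokens) {
    llama_set_causal_attn(ctx, false); // encoder uses bidirectional attention
    llama_encode(ctx, llama_batch_get_one(enc_tokens.data(), (int32_t) enc_tokens.size()));

    llama_set_causal_attn(ctx, true);  // decoder: back to causal attention
    // ... llama_decode(...) calls for generation follow here ...
}
```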
    
Yes, that's what I have in mind. But it is too cumbersome and error prone. Maybe temporarily we should set … Ideally, we need to have separate contexts for the encoder and the decoder of such models so that we can configure them independently, but this is not ready yet.
    
          
@ggerganov I guess the "cleanest" solution would be to add … It could always create a non-causal mask, since I don't know of any models that use causal attention in the encoder. If one appears, handling it would be a matter of adding a new …
    
It's hard to decide how to do it exactly. For now, here is a simple patch that should work: …
    
@ggerganov There seems to be another problem with the refactor that manifests when using the CUDA backend with T5 models. From what I understand, the problem is that you copy the encoder output here (llama.cpp/src/llama-context.cpp, line 1149 in c6af216) without making sure the encoder graph has finished computation. When I added a ggml_synchronize() call earlier, it started working correctly: …
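The actual diff is not shown above; as a rough, non-authoritative sketch of the general pattern being described (using the public ggml_backend_sched_* calls; the helper function and its parameters are hypothetical):

```cpp
#include "ggml-backend.h"

// Hypothetical helper: make sure the asynchronously launched encoder graph has
// finished before its output tensor is copied out for cross-attention.
static void compute_and_copy_encoder_output(ggml_backend_sched_t sched,
                                            ggml_cgraph * gf,
                                            ggml_tensor * t_embd,
                                            void * dst) {
    ggml_backend_sched_graph_compute_async(sched, gf); // launch the encoder graph
    ggml_backend_sched_synchronize(sched);             // wait for it to complete
    // only now is it safe to read the encoder output back for the decoder
    ggml_backend_tensor_get(t_embd, dst, 0, ggml_nbytes(t_embd));
}
```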
    
I'm getting a segmentation fault when using … Here's a simple reproduction:

```cpp
void repro() {
    llama_backend_init();

    auto model_params = llama_model_default_params();
    model_params.n_gpu_layers = 0;

    // https://huggingface.co/bartowski/Meta-Llama-3-8B-Instruct-GGUF/blob/main/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf
    auto model_path = "/home/user/models/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf";
    auto model = llama_model_load_from_file(model_path, model_params);
    fputs("model loaded\n", stdout);
    fflush(stdout);

    // https://huggingface.co/ngxson/test_gguf_lora_adapter/blob/main/lora-Llama-3-Instruct-abliteration-LoRA-8B-f16.gguf
    auto lora_path = "/home/user/models/lora-Llama-3-Instruct-abliteration-LoRA-8B-f16.gguf";
    auto lora = llama_adapter_lora_init(model, lora_path);
    fputs("lora created\n", stdout);
    fflush(stdout);

    llama_adapter_lora_free(lora);
    llama_model_free(model);
    llama_backend_free();
}
```

Here's a stack trace from …
    
@ggerganov I've run some tests and you're right, #12332 is indeed the cause. Sorry for the confusion.
    
The promised API change is now available - see … Changing the … You don't have to reload the model to change …
    
Thanks for the quick reply!
    
Yes, you have to free it with … Note that if you want to move the existing memory data (i.e. the cache that you have in the old context) to the new context, you can use the …
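The function names are elided above; as a hedged sketch of the general pattern being described (it assumes the public llama_state_seq_* session API is the intended mechanism and that the old context is released with llama_free()):

```cpp
#include "llama.h"
#include <vector>

// Sketch: create a new context on the same model, move sequence 0's cache
// state from the old context into it, then free the old context.
static llama_context * recreate_context(llama_model * model, llama_context * ctx_old,
                                        llama_context_params new_params) {
    llama_context * ctx_new = llama_init_from_model(model, new_params);

    // serialize the state of sequence 0 from the old context ...
    std::vector<uint8_t> buf(llama_state_seq_get_size(ctx_old, 0));
    llama_state_seq_get_data(ctx_old, buf.data(), buf.size(), 0);

    // ... and restore it into the new context as sequence 0
    llama_state_seq_set_data(ctx_new, buf.data(), buf.size(), 0);

    llama_free(ctx_old); // the old context must be freed explicitly
    return ctx_new;
}
```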
    
Overview
The implementation in #11213 became too complicated, trying to make a lot of changes at once. This is an alternative implementation, which does not involve the abstraction of the llama_context. The PR introduces some new abstractions, improves the graph build handling and is an initial step for the next changes listed in the "Next" section below:

- llm_build_context refactored into the new llm_graph_context, implemented in llama-graph.h/.cpp
- llm_graph_input_... classes for handling graph inputs in a safer and cleaner way
- llm_graph_result for extracting important tensors such as embeddings and logits, instead of searching for them by tensor name (see the sketch after this list)
- llm_memory_i concept that will abstract different cache/memory mechanisms. For now we have only llama_kv_cache as a type of memory
- llama_io_write_i and llama_io_read_i interfaces
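A hypothetical sketch of the design idea behind llm_graph_result (the type below is a mock, not the real class; only ggml_graph_get_tensor() and the "result_output" tensor name are taken from existing code):

```cpp
#include "ggml.h"

// Mock result object: the graph build step fills these fields while it
// constructs the graph, so consumers no longer search tensors by name.
struct graph_result_sketch {
    ggml_tensor * t_logits = nullptr; // final output logits
    ggml_tensor * t_embd   = nullptr; // output embeddings
};

// before: fragile name-based lookup in the built graph
static ggml_tensor * logits_by_name(ggml_cgraph * gf) {
    return ggml_graph_get_tensor(gf, "result_output");
}

// after: read the tensor directly from the result object
static ggml_tensor * logits_from_result(const graph_result_sketch & res) {
    return res.t_logits;
}
```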
API changes

The current changes are only necessary to make the API more consistent in following the naming convention. To migrate, simply replace the old API calls with the new ones, as in the sketch below.

- llama_kv_cache_... API renamed to llama_kv_self_... API
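A minimal migration sketch (assuming a 1:1 rename; check llama.h for the full list of affected functions):

```cpp
#include "llama.h"

// old calls (before this PR) shown in comments, new llama_kv_self_... calls below
static void clear_and_trim(llama_context * ctx, llama_seq_id seq_id, llama_pos p0, llama_pos p1) {
    // llama_kv_cache_clear(ctx);
    // llama_kv_cache_seq_rm(ctx, seq_id, p0, p1);
    llama_kv_self_clear(ctx);                  // new name
    llama_kv_self_seq_rm(ctx, seq_id, p0, p1); // new name
}
```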
Next

- Add class llama_kv_cache_recurrent and remove all recurrent logic from the existing class llama_kv_cache_unified. Simplify llama_kv_cache_unified.