How to use KV cache? #10223

manimathma · 2024-11-09T01:14:17Z


Hi Team,

I couldn't find any documentation on how to use the KV cache. Any pointers would help.

I assumed that --prompt-cache /tmp/llama3cache --prompt-cache-all would work, but it didn't.

Iteration 1:

 ./llama-cli  -m ~/Downloads/llama-3.2-1b-instruct-q8_0.gguf   -p   "<|start_header_id|>system<|end_header_id|> You are an expert in controlling smart home devices. Your task is to generate the appropriate function calls to control these devices based on the user's request.\n\nHere are the available devices and their current states in JSON format:\n\n[ { 'type': 'Thermostat', 'name': 'Living Room Thermostat', 'status': 'Cool', 'current_temp': 68 }, { 'type': 'Light', 'name': 'Kitchen Light', 'status': 'Off' } ]\n\nYou are given a list of functions that can be called to control these devices. Each function has a name, description, and required parameters.\n\n[{'name': 'control_thermostat', 'description': 'Control the thermostat temperature and mode.', 'parameters': {'type': 'dict', 'required': ['device_name', 'mode'], 'properties': {'device_name': {'type': 'string', 'description': 'Name of the thermostat device.'}, 'mode': {'type': 'string', 'description': 'Mode to set, either 'cool', 'heat' or 'off'.'}, 'target_temp': {'type': 'integer', 'description': 'Target temperature in Fahrenheit.'}}}}, {'name': 'control_light', 'description': 'Turn a light on or off.', 'parameters': {'type': 'dict', 'required': ['device_name', 'state'], 'properties': {'device_name': {'type': 'string', 'description': 'Name of the light device.'}, 'state': {'type': 'string', 'description': 'State to set, either 'on' or 'off'.'}}}]\n\nBased on the user's request, you need to invoke the appropriate function(s) with the correct parameters in the following format:\n\n[func_name1(param1=value1, param2=value2), func_name2(param3=value3)]\n\nYou should not include any other text in the response.\n\n<|eot_id|><|start_header_id|>user<|end_header_id|> Its too cold. <|eot_id|><|start_header_id|>assistant<|end_header_id|>" --prompt-cache /tmp/llama3cache --prompt-cache-all

You can control the Thermostat's temperature to a higher value. Here's the function call:

[ 'control_thermostat', {'device_name': 'Living Room Thermostat', 'mode': 'heat', 'target_temp': 75 } ] [end of text]

main: saving final output to session file '/tmp/llama3cache'

llama_print_timings:        load time =    2080.24 ms
llama_print_timings:      sample time =       3.26 ms /    52 runs   (    0.06 ms per token, 15970.52 tokens per second)
llama_print_timings: prompt eval time =    2860.22 ms /   404 tokens (    7.08 ms per token,   141.25 tokens per second)
llama_print_timings:        eval time =    3453.48 ms /    51 runs   (   67.72 ms per token,    14.77 tokens per second)
llama_print_timings:       total time =    6371.01 ms /   455 tokens
Log end

Iteration 2:

Log start
main: build = 3505 (b72c20b8)
main: built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for x86_64-linux-gnu
main: seed  = 1731114386
llama_model_loader: loaded meta data with 30 key-value pairs and 147 tensors from /Downloads/llama-3.2-1b-instruct-q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 1B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 1B
llama_model_loader: - kv   6:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   7:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   8:                          llama.block_count u32              = 16
llama_model_loader: - kv   9:                       llama.context_length u32              = 131072
llama_model_loader: - kv  10:                     llama.embedding_length u32              = 2048
llama_model_loader: - kv  11:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  12:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  13:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  14:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  15:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  16:                 llama.attention.key_length u32              = 64
llama_model_loader: - kv  17:               llama.attention.value_length u32              = 64
llama_model_loader: - kv  18:                          general.file_type u32              = 7
llama_model_loader: - kv  19:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  20:                 llama.rope.dimension_count u32              = 64
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   34 tensors
llama_model_loader: - type q8_0:  113 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_layer          = 16
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 512
llm_load_print_meta: n_embd_v_gqa     = 512
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 1.24 B
llm_load_print_meta: model size       = 1.22 GiB (8.50 BPW) 
llm_load_print_meta: general.name     = Llama 3.2 1B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.07 MiB
llm_load_tensors:        CPU buffer size =  1252.41 MiB
.............................................................
llama_new_context_with_model: n_ctx      = 131072
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =  4096.00 MiB
llama_new_context_with_model: KV self size  = 4096.00 MiB, K (f16): 2048.00 MiB, V (f16): 2048.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.49 MiB
llama_new_context_with_model:        CPU compute buffer size =  8464.01 MiB
llama_new_context_with_model: graph nodes  = 518
llama_new_context_with_model: graph splits = 1

system_info: n_threads = 4 / 8 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
main: attempting to load saved session from '/tmp/llama3cache'
main: loaded a session with prompt size of 455 tokens
main: warning: session file has low similarity to prompt (1 / 18 tokens); will mostly be reevaluated
sampling: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 131072, n_batch = 2048, n_predict = -1, n_keep = 1



user Decrease the temperature by 1 assistant

A temperature decrease of 1 degree Celsius would be:

Temperature = 20°C
Decrease = 1°C
New Temperature = 20°C - 1°C = 19°C [end of text]

main: saving final output to session file '/tmp/llama3cache'

llama_print_timings:        load time =    1400.77 ms
llama_print_timings:      sample time =       3.32 ms /    40 runs   (    0.08 ms per token, 12051.82 tokens per second)
llama_print_timings: prompt eval time =     207.04 ms /    17 tokens (   12.18 ms per token,    82.11 tokens per second)
llama_print_timings:        eval time =    2722.87 ms /    39 runs   (   69.82 ms per token,    14.32 tokens per second)
llama_print_timings:       total time =    3201.13 ms /    56 tokens
Log end
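
Side note on the "KV self size = 4096.00 MiB" line in the log: as far as I can tell, that number follows directly from the context size and model shape printed above it. Here is a rough sketch of the arithmetic, assuming the cache stores one f16 element (2 bytes) per K and per V entry for every layer and every context position. The function name is mine, not something from llama.cpp.

```python
def kv_cache_mib(n_ctx, n_layer, n_embd_k_gqa, n_embd_v_gqa, bytes_per_elem=2):
    """Estimate K and V cache sizes in MiB (assumes f16, i.e. 2 bytes per element)."""
    k_bytes = n_ctx * n_layer * n_embd_k_gqa * bytes_per_elem
    v_bytes = n_ctx * n_layer * n_embd_v_gqa * bytes_per_elem
    return k_bytes / 2**20, v_bytes / 2**20


# Values copied from the log: n_ctx = 131072, n_layer = 16,
# n_embd_k_gqa = 512, n_embd_v_gqa = 512.
k_mib, v_mib = kv_cache_mib(131072, 16, 512, 512)
print(k_mib, v_mib, k_mib + v_mib)

# Same model shape with a hypothetical smaller context of 4096 tokens.
print(kv_cache_mib(4096, 16, 512, 512))
```

With the logged values this reproduces the 2048 MiB for K and 2048 MiB for V, so the 4 GiB buffer seems to come from n_ctx being set to the full 131072 training context rather than from anything my prompt does.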

Also, is there a way to KV-cache just the query and not the response? One idea I have is to cache the static information (the system prompt) once and then reuse it for all LLM interactions.
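
My current reading of the Iteration 2 warning ("low similarity to prompt (1 / 18 tokens)") is that the session file is only reused for the leading tokens that are identical between the cached sequence and the new prompt. In Iteration 2 only 1 token matched (presumably the BOS token) and the remaining 17 tokens were evaluated from scratch, which agrees with the "prompt eval time ... / 17 tokens" line. Below is a minimal sketch of that assumption. The token ids and variable names are made up for illustration, and this is not the actual llama.cpp code.

```python
def common_prefix_len(cached_tokens, prompt_tokens):
    """Count how many leading tokens are identical in both sequences."""
    n = 0
    for cached, new in zip(cached_tokens, prompt_tokens):
        if cached != new:
            break
        n += 1
    return n


BOS = 128000  # BOS token id reported in the log

# Hypothetical token ids: a cached run of [BOS, system prompt..., user turn...].
cached = [BOS, 11, 12, 13, 14, 15]

# New prompt that repeats the same system prompt, then a different user turn.
with_system = [BOS, 11, 12, 13, 99, 100]

# New prompt that contains only the user turn (what I seem to have sent in Iteration 2).
without_system = [BOS, 99, 100]

print(common_prefix_len(cached, with_system))     # shared system prefix is reusable
print(common_prefix_len(cached, without_system))  # only BOS matches
```

If this reading is right, the second prompt would need to start with the exact same system block for the cached prefix to be reused. Whether --prompt-cache-ro is the right way to keep the response from being written back into the cache file is something I have not verified.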
