Description
What happened?
Hi there.
As described in the documentation, -n specifies the number of tokens to predict (default: -1; -1 = infinity, -2 = until the context is filled), and -c specifies the context size.
However, when I use the following command to start a server:
./llama.cpp-b3938/build_gpu/bin/llama-server -m ../models/Meta-Llama-3-8B-Instruct-Q4_0.gguf -ngl 99 -n -2 -c 256
and send a request with the following command:
curl --request POST --url http://localhost:8080/completion --header "Content-Type: application/json" --data '{"prompt": "What is the meaning of life?"}'
I get only one token of output in the response:
{
  "content": " I",
  "id_slot": 0,
  "stop": true,
  "model": "../models/Meta-Llama-3-8B-Instruct-Q4_0.gguf",
  "tokens_predicted": 1,
  "tokens_evaluated": 7,
  "generation_settings": {
    "n_ctx": 256,
    "n_predict": -2,
    "model": "../models/Meta-Llama-3-8B-Instruct-Q4_0.gguf",
    "seed": 4294967295,
    "seed_cur": 3394087514,
    "temperature": 0.800000011920929,
    "dynatemp_range": 0.0,
    "dynatemp_exponent": 1.0,
    "top_k": 40,
    "top_p": 0.949999988079071,
    "min_p": 0.05000000074505806,
    "xtc_probability": 0.0,
    "xtc_threshold": 0.10000000149011612,
    "tfs_z": 1.0,
    "typical_p": 1.0,
    "repeat_last_n": 64,
    "repeat_penalty": 1.0,
    "presence_penalty": 0.0,
    "frequency_penalty": 0.0,
    "mirostat": 0,
    "mirostat_tau": 5.0,
    "mirostat_eta": 0.10000000149011612,
    "penalize_nl": false,
    "stop": [],
    "max_tokens": -1,
    "n_keep": 0,
    "n_discard": 0,
    "ignore_eos": false,
    "stream": false,
    "n_probs": 0,
    "min_keep": 0,
    "grammar": "",
    "samplers": ["top_k", "tfs_z", "typ_p", "top_p", "min_p", "xtc", "temperature"]
  },
  "prompt": "What is the meaning of life?",
  "has_new_line": false,
  "truncated": false,
  "stopped_eos": false,
  "stopped_word": false,
  "stopped_limit": true,
  "stopping_word": "",
  "tokens_cached": 7,
  "timings": {
    "prompt_n": 7,
    "prompt_ms": 27.275,
    "prompt_per_token_ms": 3.8964285714285714,
    "prompt_per_second": 256.64527956003667,
    "predicted_n": 1,
    "predicted_ms": 0.005,
    "predicted_per_token_ms": 0.005,
    "predicted_per_second": 200000.0
  },
  "index": 0
}
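For reference, my understanding from the server README is that the /completion endpoint also accepts an n_predict field in the JSON body; a request that sets it explicitly (128 below is an arbitrary value, just for illustration) would look like this, in case it helps narrow down whether the behaviour is specific to the -n -2 default:
curl --request POST --url http://localhost:8080/completion --header "Content-Type: application/json" --data '{"prompt": "What is the meaning of life?", "n_predict": 128}'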
Is there something wrong with the way I'm using it? Or is this a bug?
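In case it helps with reproduction, the server can presumably also be started with an explicit positive limit instead of -2 (again, 128 is an arbitrary value); I have not verified whether that case behaves differently:
./llama.cpp-b3938/build_gpu/bin/llama-server -m ../models/Meta-Llama-3-8B-Instruct-Q4_0.gguf -ngl 99 -n 128 -c 256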
Name and Version
./llama.cpp-b3938/build_gpu/bin/llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
version: 7 (d9a33c5)
built with cc (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0 for x86_64-linux-gnu
What operating system are you seeing the problem on?
Linux
Relevant log output
No response