
Bug: Unexpected output length (only one token in response) when setting "-n -2 -c 256" for llama-server #9933

Open
@morgen52

Description


What happened?

Hi there.
As described in the documentation, -n sets the number of tokens to predict (default: -1; -1 = infinity, -2 = until context filled), and -c sets the context size.
However, when I use the following command to start a server:

./llama.cpp-b3938/build_gpu/bin/llama-server     -m ../models/Meta-Llama-3-8B-Instruct-Q4_0.gguf     -ngl 99 -n -2 -c 256

Then I send a request with the following command:

curl --request POST     --url http://localhost:8080/completion     --header "Content-Type: application/json"     --data '{"prompt": "What is the meaning of life?"}'

I get only one token of output in the response:

{"content":" I","id_slot":0,"stop":true,"model":"../models/Meta-Llama-3-8B-Instruct-Q4_0.gguf","tokens_predicted":1,"tokens_evaluated":7,"generation_settings":{"n_ctx":256,"n_predict":-2,"model":"../models/Meta-Llama-3-8B-Instruct-Q4_0.gguf","seed":4294967295,"seed_cur":3394087514,"temperature":0.800000011920929,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.949999988079071,"min_p":0.05000000074505806,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"tfs_z":1.0,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"penalize_nl":false,"stop":[],"max_tokens":-1,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":false,"n_probs":0,"min_keep":0,"grammar":"","samplers":["top_k","tfs_z","typ_p","top_p","min_p","xtc","temperature"]},"prompt":"What is the meaning of life?","has_new_line":false,"truncated":false,"stopped_eos":false,"stopped_word":false,"stopped_limit":true,"stopping_word":"","tokens_cached":7,"timings":{"prompt_n":7,"prompt_ms":27.275,"prompt_per_token_ms":3.8964285714285714,"prompt_per_second":256.64527956003667,"predicted_n":1,"predicted_ms":0.005,"predicted_per_token_ms":0.005,"predicted_per_second":200000.0},"index":0}

Is there something wrong with the way I'm using it? Or is this a bug?
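For comparison, the completion endpoint also accepts a per-request n_predict field in the JSON body (per the server README); presumably a positive value there is honored, which would narrow the problem down to the handling of -2:

curl --request POST     --url http://localhost:8080/completion     --header "Content-Type: application/json"     --data '{"prompt": "What is the meaning of life?", "n_predict": 128}'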

Name and Version

./llama.cpp-b3938/build_gpu/bin/llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
version: 7 (d9a33c5)
built with cc (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

No response

Metadata

    Labels

    bug (Something isn't working), good first issue (Good for newcomers), low severity (Used to report low severity bugs in llama.cpp, e.g. cosmetic issues, non critical UI glitches)
