Description
Prerequisites
- I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new bug or useful enhancement to share.
Hi, thanks for the continued effort with llama.cpp. I cloned the repo, then built with make as usual.
Expected Behavior
Run ./server without error messages. This issue was not present in #2009; unfortunately, I'm receiving errors during inference with ./server at #2116. I'll test other builds.
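To narrow it down, I plan to bisect between a known-good checkout and the current one. A minimal sketch of what I'll run (the good commit hash is a placeholder I still need to look up):

```sh
# Mark the failing checkout and a known-good commit, then let git
# walk the range; rebuild and retest the server at each step.
git bisect start
git bisect bad HEAD
git bisect good <last-known-good-commit>   # placeholder

# At each commit git checks out:
make clean && make
./server -m ~/wizardlm-7b-v1.0-uncensored.ggmlv3.q4_0.bin -t 4 -b 10
# ...interact a few times, then record the result:
git bisect good   # or: git bisect bad
```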
Current Behavior
Errors during ./server inference:
llama_eval_internal: first token must be BOS
llama_eval: failed to eval
It's abrupt and cuts off the response in the middle of a sentence. Here's an example:
~/ollama (master)> ./server -m ~/wizardlm-7b-v1.0-uncensored.ggmlv3.q4_0.bin -t 4 -b 10
{"timestamp":1688607679,"level":"INFO","function":"main","line":1085,"message":"build info","build":796,"commit":"31cfbb1"}
{"timestamp":1688607679,"level":"INFO","function":"main","line":1090,"message":"
system info","n_threads":4,"total_threads":8,"system_info":"AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 | "}
llama.cpp: loading model from /data/data/com.termux/files/home/wizardlm-7b-v1.0-uncensored.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.08 MB
llama_model_load_internal: mem required = 5407.72 MB (+ 1026.00 MB per state)
llama_new_context_with_model: kv self size = 256.00 MB
llama server listening at http://127.0.0.1:8080
{"timestamp":1688607679,"level":"INFO","function":"main","line":1305,"message":"HTTP server listening","hostname":"127.0.0.1","port":8080}
{"timestamp":1688607685,"level":"INFO","function":"log_server_request","line":1058,"message":"request","remote_addr":"127.0.0.1","remote_port":37210,"status":200,"method":"GET","path":"/","params":{}}
{"timestamp":1688607685,"level":"INFO","function":"log_server_request","line":1058,"message":"request","remote_addr":"127.0.0.1","remote_port":37210,"status":200,"method":"GET","path":"/completion.js","params":{}}
{"timestamp":1688607685,"level":"INFO","function":"log_server_request","line":1058,"message":"request","remote_addr":"127.0.0.1","remote_port":37212,"status":200,"method":"GET","path":"/index.js","params":{}}
{"timestamp":1688607685,"level":"INFO","function":"log_server_request","line":1058,"message":"request","remote_addr":"127.0.0.1","remote_port":37210,"status":404,"method":"GET","path":"/favicon.ico","params":{}}
llama_print_timings: load time = 2102.17 ms
llama_print_timings: sample time = 3291.18 ms / 355 runs ( 9.27 ms per token, 107.86 tokens per second)
llama_print_timings: prompt eval time = 10480.78 ms / 49 tokens ( 213.89 ms per token, 4.68 tokens per second)
llama_print_timings: eval time = 124335.87 ms / 354 runs ( 351.23 ms per token, 2.85 tokens per second)
llama_print_timings: total time = 138282.27 ms
{"timestamp":1688607964,"level":"INFO","function":"log_server_request","line":1058,"message":"request","remote_addr":"127.0.0.1","remote_port":37214,"status":200,"method":"POST","path":"/completion","params":{}}
llama_eval_internal: first token must be BOS
llama_eval: failed to eval
{"timestamp":1688608023,"level":"ERROR","function":"nextToken","line":360,"message":"failed to eval","n_eval":10,"n_past":0,"n_threads":4,"embd":"
rare ingredients for potions, and even delved into dangerous dungeons filled with
treacherous monsters. Along the way, she made friends with other creatures who shared her passion for knowledge and
adventure, including dragons, unicorns, and even mermaids.\nAs time passed, Luna grew stronger both physically and mentally,
becoming an extraordinary creature capable of performing incredible feats. And yet,
despite all her newfound powers, she never forgot where she came from or the humble roots that first led her down this path.
For Luna always remained true to her llama nature, using her abilities only for good and spreading joy wherever she went.\n
User: Thanks. Describe Lunas appearance please.\n
llama: As a young llama, Luna was adorable with soft brown fur, long eyelashes, and a friendly smile. But as she embarked on her
journey towards greatness, her physical features began to change in mysterious ways. Her eyes
became more intense, glowing like crystals themselves, while her body developed powerful
muscles and a shimmering golden coat. She now stood taller than any ordinary ll"}
llama_print_timings: load time = 2102.17 ms
llama_print_timings: sample time = 936.31 ms / 93 runs ( 10.07 ms per token, 99.33 tokens per second)
llama_print_timings: prompt eval time = 4246.50 ms / 16 tokens ( 265.41 ms per token, 3.77 tokens per second)
llama_print_timings: eval time = 29930.84 ms / 92 runs ( 325.34 ms per token, 3.07 tokens per second)
llama_print_timings: total time = 35164.16 ms
{"timestamp":1688608023,"level":"INFO","function":"log_server_request","line":1058,"message":"request","remote_addr":"127.0.0.1","remote_port":37216,"status":200,"method":"POST","path":"/completion","params":{}}
^C
Environment and Context
uname -a
Linux localhost 4.14.190-23725627-abG975WVLS8IWD1 #2 SMP PREEMPT Mon Apr 10 18:16:39 KST 2023 aarch64 Android
lscpu
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: Qualcomm
Model name: Kryo-4XX-Silver
Model: 14
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Stepping: 0xd
CPU(s) scaling MHz: 62%
CPU max MHz: 1785.6000
CPU min MHz: 300.0000
BogoMIPS: 38.40
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
Model name: Kryo-4XX-Gold
Model: 14
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 2
Stepping: 0xd
CPU(s) scaling MHz: 74%
CPU max MHz: 2841.6001
CPU min MHz: 710.4000
BogoMIPS: 38.40
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
Python 3.11.4
GNU Make 4.4.1
clang version 16.0.6
Target: aarch64-unknown-linux-android24
Thread model: posix
InstalledDir: /data/data/com.termux/files/usr/bin
Failure Information (for bugs)
llama_eval_internal: first token must be BOS
llama_eval: failed to eval
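For context, this message appears to come from a guard in llama_eval_internal in llama.cpp. Paraphrasing the check as I understand it (not verbatim from the source), it rejects any evaluation that starts at position 0 without a BOS token:

```cpp
// Paraphrased guard from llama_eval_internal (exact code may differ):
// an eval batch that starts at n_past == 0 must begin with BOS.
if (n_past == 0 && tokens[0] != llama_token_bos()) {
    fprintf(stderr, "%s: first token must be BOS\n", __func__);
    return false;  // the caller then reports "llama_eval: failed to eval"
}
```

The failing log entry above shows "n_eval":10,"n_past":0, which fits: after the context fills up, the server seems to re-evaluate from position 0 with a batch (size 10, matching -b 10) that doesn't begin with BOS.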
Steps to Reproduce
- git clone https://github.com/ggerganov/llama.cpp
- make
- ./server -m ~/wizardlm-7b-v1.0-uncensored.ggmlv3.q4_0.bin -t 4 -b 10
- Then interact with the model 2-3 times (via the web UI, or directly against the API as sketched below).
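Interacting directly against the API reproduces it as well; a minimal sketch using the server's /completion endpoint (the prompt text here is just a placeholder):

```sh
# POST a completion request to the running server; after 2-3 of
# these the "failed to eval" error shows up.
curl -s http://127.0.0.1:8080/completion \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "User: Tell me a story about a llama.\nllama:", "n_predict": 128}'
```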
git log | head -1
commit 31cfbb1013a482e89c72146e2063ac4362becae7
Thank you!