
llama-cpp-python server for LLaVA: slow tokens per second #1354

Open
@Kev1ntan

Description

Darwin Feedloops-Mac-Studio-2.local 23.3.0 Darwin Kernel Version 23.3.0: Wed Dec 20 21:31:00 PST 2023; root:xnu-10002.81.5~7/RELEASE_ARM64_T6020 arm64

command:

```
python -m llama_cpp.server --model ./llava-v1.6-mistral-7b.Q8_0.gguf --port 9007 --host localhost --n_gpu_layers 33 --chat_format chatml --clip_model_path ./mmproj-mistral7b-f16.gguf
```
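As a sanity check, the same model can also be loaded directly through the Python API, bypassing the server (a minimal sketch, assuming the same file paths; `Llava15ChatHandler` is the LLaVA handler llama-cpp-python documents, and `verbose=True` prints the offload log, which should confirm whether all 33 layers actually land on the Metal GPU):

```python
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

# Same files and offload setting as the server command above.
chat_handler = Llava15ChatHandler(clip_model_path="./mmproj-mistral7b-f16.gguf")
llm = Llama(
    model_path="./llava-v1.6-mistral-7b.Q8_0.gguf",
    chat_handler=chat_handler,
    n_gpu_layers=33,   # offload all layers, as in the server run
    n_ctx=2048,        # increased to leave room for the image embedding
    logits_all=True,   # documented as required for the LLaVA chat handlers
    verbose=True,      # prints offload info and llama_print_timings
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": [{"type": "text", "text": "hello"}]}],
    max_tokens=100,
    temperature=0,
)
print(out["choices"][0]["message"]["content"])
```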

```
curl --location 'http://localhost:9007/v1/chat/completions' \
--header 'Authorization: Bearer 1n66q24dexb1cc8abc62b185dee0dd802pn92' \
--header 'Content-Type: application/json' \
--data '{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "hello"
        }
      ]
    }
  ],
  "max_tokens": 1000,
  "temperature": 0
}'
```
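The same request through the `openai` Python client, for anyone reproducing this (a sketch; the `model` value is a placeholder, since the server answers for the single model it was started with, and the key mirrors the bearer token above):

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:9007/v1",
    api_key="1n66q24dexb1cc8abc62b185dee0dd802pn92",
)

resp = client.chat.completions.create(
    model="llava-v1.6-mistral-7b",  # placeholder name; the server loads one model
    messages=[{"role": "user", "content": [{"type": "text", "text": "hello"}]}],
    max_tokens=1000,
    temperature=0,
)
print(resp.choices[0].message.content)
```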

```
INFO: Started server process [71075]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://localhost:9007 (Press CTRL+C to quit)

llama_print_timings: load time        =  1491.98 ms
llama_print_timings: sample time      =     2.17 ms /    26 runs   (    0.08 ms per token, 12009.24 tokens per second)
llama_print_timings: prompt eval time =  1491.90 ms /    37 tokens (   40.32 ms per token,    24.80 tokens per second)
llama_print_timings: eval time        = 66226.55 ms /    25 runs   ( 2649.06 ms per token,     0.38 tokens per second)
llama_print_timings: total time       = 67791.77 ms /    62 tokens
INFO: ::1:55485 - "POST /v1/chat/completions HTTP/1.1" 200 OK
```
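The slow part is the eval (generation) phase; recomputing the throughput from the log makes the gap concrete:

```python
# Numbers copied from the llama_print_timings output above.
prompt_tps = 37 / (1491.90 / 1000)   # ~24.80 tokens/s during prompt eval
eval_tps   = 25 / (66226.55 / 1000)  # ~0.38 tokens/s during generation
print(prompt_tps, eval_tps)          # generation is ~65x slower than prompt eval
```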

Can someone help? Thanks.
