
llama-cpp-python server for LLaVA: slow tokens per second #1354

Open
@Kev1ntan

Description

Darwin Feedloops-Mac-Studio-2.local 23.3.0 Darwin Kernel Version 23.3.0: Wed Dec 20 21:31:00 PST 2023; root:xnu-10002.81.5~7/RELEASE_ARM64_T6020 arm64

command:

```
python -m llama_cpp.server --model ./llava-v1.6-mistral-7b.Q8_0.gguf --port 9007 --host localhost --n_gpu_layers 33 --chat_format chatml --clip_model_path ./mmproj-mistral7b-f16.gguf
```
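As a sanity check, the same model can also be loaded directly through the Python API, bypassing the server (a minimal sketch, assuming the same file paths; `Llava15ChatHandler` is the LLaVA handler llama-cpp-python documents, and `verbose=True` prints the offload log, which should confirm whether all 33 layers actually land on the Metal GPU):

```python
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

# Same files and offload setting as the server command above.
chat_handler = Llava15ChatHandler(clip_model_path="./mmproj-mistral7b-f16.gguf")
llm = Llama(
    model_path="./llava-v1.6-mistral-7b.Q8_0.gguf",
    chat_handler=chat_handler,
    n_gpu_layers=33,   # offload all layers, as in the server run
    n_ctx=2048,        # increased to leave room for the image embedding
    logits_all=True,   # documented as required for the LLaVA chat handlers
    verbose=True,      # prints offload info and llama_print_timings
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": [{"type": "text", "text": "hello"}]}],
    max_tokens=100,
    temperature=0,
)
print(out["choices"][0]["message"]["content"])
```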

```
curl --location 'http://localhost:9007/v1/chat/completions' \
--header 'Authorization: Bearer 1n66q24dexb1cc8abc62b185dee0dd802pn92' \
--header 'Content-Type: application/json' \
--data '{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "hello"
        }
      ]
    }
  ],
  "max_tokens": 1000,
  "temperature": 0
}'
```
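The same request through the `openai` Python client, for anyone reproducing this (a sketch; the `model` value is a placeholder, since the server answers for the single model it was started with, and the key mirrors the bearer token above):

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:9007/v1",
    api_key="1n66q24dexb1cc8abc62b185dee0dd802pn92",
)

resp = client.chat.completions.create(
    model="llava-v1.6-mistral-7b",  # placeholder name; the server loads one model
    messages=[{"role": "user", "content": [{"type": "text", "text": "hello"}]}],
    max_tokens=1000,
    temperature=0,
)
print(resp.choices[0].message.content)
```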

```
INFO: Started server process [71075]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://localhost:9007 (Press CTRL+C to quit)

llama_print_timings: load time        =  1491.98 ms
llama_print_timings: sample time      =     2.17 ms /    26 runs   (    0.08 ms per token, 12009.24 tokens per second)
llama_print_timings: prompt eval time =  1491.90 ms /    37 tokens (   40.32 ms per token,    24.80 tokens per second)
llama_print_timings: eval time        = 66226.55 ms /    25 runs   ( 2649.06 ms per token,     0.38 tokens per second)
llama_print_timings: total time       = 67791.77 ms /    62 tokens
INFO: ::1:55485 - "POST /v1/chat/completions HTTP/1.1" 200 OK
```
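The slow part is the eval (generation) phase; recomputing the throughput from the log makes the gap concrete:

```python
# Numbers copied from the llama_print_timings output above.
prompt_tps = 37 / (1491.90 / 1000)   # ~24.80 tokens/s during prompt eval
eval_tps   = 25 / (66226.55 / 1000)  # ~0.38 tokens/s during generation
print(prompt_tps, eval_tps)          # generation is ~65x slower than prompt eval
```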

Can someone help? Thanks.
