Incredibly slow response time #49

Closed
@MartinPJB

Description


Hello.
I am still new to llama-cpp and I was wondering whether it is normal for it to take an incredibly long time to respond to my prompt.

FYI, I assume it runs on my CPU; here are my specs:

  • I have 16.0 GB of RAM
  • I am using an AMD Ryzen 7 1700X Eight-Core Processor rated at 3.40 GHz
  • Just in case, my GPU is an NVIDIA GeForce RTX 2070 SUPER.

Everything else seems to work fine; the model loads correctly (or at least it seems to).
I did a first test using the code showcased in the README.md:

from llama_cpp import Llama
llm = Llama(model_path="models/7B/...")
output = llm("Q: Name the planets in the solar system? A: ", max_tokens=32, stop=["Q:", "\n"], echo=True)
print(output)

which returned this:

[screenshot of the output]
The output is what I expected (even though Uranus, Neptune, and Pluto were missing), but the total time is extremely long: 1124707.08 ms, roughly 18.7 minutes.
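
If it helps narrow things down, here is what I plan to try next to estimate the per-token cost from the returned dict (I am assuming the "usage" field follows the OpenAI-style token counts that llama-cpp-python seems to return):

import time
from llama_cpp import Llama

llm = Llama(model_path="models/7B/...")  # same model path as above

start = time.time()
output = llm("Q: Name the planets in the solar system? A: ", max_tokens=32, stop=["Q:", "\n"], echo=True)
elapsed = time.time() - start

# "usage" holds OpenAI-style token counts (my assumption for this version)
usage = output["usage"]
print("prompt tokens:    ", usage["prompt_tokens"])
print("completion tokens:", usage["completion_tokens"])
print("seconds per token: %.2f" % (elapsed / max(usage["total_tokens"], 1)))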

I wrote this second script to try to narrow down what could be causing the insanely long response time, but I don't know what's going on.

from llama_cpp import Llama
import time

print("Model loading")
llm = Llama(model_path="./model/ggml-model-q4_0_new.bin")

while True:
    prompt = input("Prompt> ")
    start_time = time.time()

    prompt = f"Q: {prompt} A: "
    print("Your prompt:", prompt, "Start time:", start_time)

    output = llm(prompt, max_tokens=1, stop=["Q:", "\n"], echo=True)
    end_time = time.time()  # capture once so the printed end time matches the duration
    print("Output:", output)
    print("End time:", end_time)
    print("--- Prompt reply duration: %s seconds ---" % (end_time - start_time))

I may have done things wrong since I am still new to all of this, but do any of you have an idea of how I could speed up the process? I searched for solutions through Google, GitHub, and various forums, but nothing seems to work.

PS: For those interested, here is the CLI output when the model loads:

llama_model_load: loading model from './model/ggml-model-q4_0_new.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: type    = 1
llama_model_load: ggml map size = 4017.70 MB
llama_model_load: ggml ctx size =  81.25 KB
llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)
llama_model_load: loading tensors from './model/ggml-model-q4_0_new.bin'
llama_model_load: model size =  4017.27 MB / num tensors = 291
llama_init_from_file: kv self size  =  512.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
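
(In case it's useful, I believe the same capability line can be printed from Python through the low-level binding, though I am not sure it is exposed in every version of llama-cpp-python:)

import llama_cpp

# Low-level binding around llama.cpp's system-info string; it returns bytes.
print(llama_cpp.llama_print_system_info().decode())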

I apologize in advance if my English doesn't make sense sometimes; it is not my native language.
Thanks in advance for the help, regards. 👋
