
Bug: sample time becomes very long when using Llama-3 #7554

Closed
@kooWZ

Description


What happened?

I was running Llama-3 on a 3090 and hit the same performance problem as in #1376.
When using a grammar file, sample time becomes very long and GPU utilization drops from over 70% (without a grammar) to around 10%.
I tried two different fine-tuned versions of Llama-3 and the problem remains.
With Llama-2 there is no such problem, so I believe this is a bug in llama.cpp.
I offloaded all layers to the GPU and believe llama.cpp is configured properly. A repro invocation would look something like this (the flags are from llama.cpp's main example; the model and grammar file names are placeholders):
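
./main -m llama-3-8b-instruct.Q8_0.gguf -ngl 99 --grammar-file json.gbnf -p "Describe the RTX 3090 as JSON." -n 128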

Name and Version

version: 2998 (9588f19)
built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

Llama-3-8B-Instruct with grammar:
llama_print_timings:        load time =     195.81 ms
llama_print_timings:      sample time =    7656.05 ms /    90 runs   (   85.07 ms per token,    11.76 tokens per second)
llama_print_timings: prompt eval time =     192.27 ms /   410 tokens (    0.47 ms per token,  2132.44 tokens per second)
llama_print_timings:        eval time =     944.78 ms /    89 runs   (   10.62 ms per token,    94.20 tokens per second)
llama_print_timings:       total time =    9298.97 ms /   499 tokens

Llama-3-8B-Instruct without grammar:
llama_print_timings:        load time =     193.30 ms
llama_print_timings:      sample time =     387.66 ms /   233 runs   (    1.66 ms per token,   601.04 tokens per second)
llama_print_timings: prompt eval time =     192.93 ms /   410 tokens (    0.47 ms per token,  2125.09 tokens per second)
llama_print_timings:        eval time =    2355.86 ms /   232 runs   (   10.15 ms per token,    98.48 tokens per second)
llama_print_timings:       total time =    3277.20 ms /   642 tokens

Llama-2-7B with grammar:
llama_print_timings:        load time =     210.30 ms
llama_print_timings:      sample time =     354.68 ms /    54 runs   (    6.57 ms per token,   152.25 tokens per second)
llama_print_timings: prompt eval time =     209.69 ms /   464 tokens (    0.45 ms per token,  2212.84 tokens per second)
llama_print_timings:        eval time =     492.42 ms /    53 runs   (    9.29 ms per token,   107.63 tokens per second)
llama_print_timings:       total time =    1128.22 ms /   517 tokens

Llama-2-7B without grammar:
llama_print_timings:        load time =     194.85 ms
llama_print_timings:      sample time =     153.25 ms /   367 runs   (    0.42 ms per token,  2394.76 tokens per second)
llama_print_timings: prompt eval time =     194.44 ms /   464 tokens (    0.42 ms per token,  2386.38 tokens per second)
llama_print_timings:        eval time =    3512.26 ms /   366 runs   (    9.60 ms per token,   104.21 tokens per second)
llama_print_timings:       total time =    4094.80 ms /   830 tokens
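
The per-token sample time gap with a grammar (85.07 ms vs 6.57 ms) is consistent with grammar sampling doing work proportional to vocabulary size on every step: Llama-3's vocabulary (~128k tokens) is about 4x larger than Llama-2's (~32k), and each candidate token's text has to be checked against the grammar. A minimal sketch of that access pattern (hypothetical code, not the actual llama.cpp implementation):

// Hypothetical sketch of grammar-constrained sampling that validates
// every candidate token on each step. Not the llama.cpp source.
#include <cstdio>
#include <string>
#include <vector>

// Stand-in for a compiled GBNF grammar; matches() is assumed to cost
// roughly O(token text length) per call.
struct Grammar {
    bool matches(const std::string& token_text) const {
        return !token_text.empty(); // placeholder acceptance test
    }
};

// Filter the candidate list before sampling. This loop runs once per
// generated token and touches the whole vocabulary, so per-token cost
// grows linearly with vocab size: ~32k entries for Llama-2 vs ~128k
// for Llama-3 already predicts a ~4x slowdown, before any difference
// in per-token match cost.
std::vector<int> grammar_filter(const std::vector<std::string>& vocab,
                                const Grammar& grammar) {
    std::vector<int> allowed;
    for (int id = 0; id < (int)vocab.size(); ++id) {
        if (grammar.matches(vocab[id])) {
            allowed.push_back(id);
        }
    }
    return allowed;
}

int main() {
    std::vector<std::string> vocab(128256, "tok"); // Llama-3-sized vocab
    Grammar g;
    printf("allowed candidates: %zu\n", grammar_filter(vocab, g).size());
}

Under this model, sampling cost scales with vocabulary size while eval time (the GPU-bound part) is unaffected, which matches the logs above: eval speed stays near 100 tokens per second in all four runs, but sampling dominates the total only for Llama-3 with a grammar.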

Labels: bug-unconfirmed, medium severity