
Why is llama_synchronize called? #6366

Closed
@EricLBuehler

Description


Hello all,

I was reading through the codebase and saw that llama_synchronize is called when the logits are retrieved:

https://github.com/ggerganov/llama.cpp/blob/cfc4d75df6399b36153ef739f2c1abee4c114bb8/ggml-cuda.cu#L2492

During my work on inference, I noticed that after the model runs, any synchronizing operation blocks for some time before it can complete. If I add an explicit synchronization right after the decode, the later operation obviously no longer blocks. However, this confuses me: why are the logits returned before the GPU is done "working"? Which operations are still pending at that point? I would appreciate any help!
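
To make sure I am reading the flow correctly, this is roughly the pattern I mean (a minimal sketch against the llama.cpp C API; the comments reflect my understanding of the scheduling, not confirmed behavior):

```c
#include "llama.h"

void decode_and_read(struct llama_context * ctx, struct llama_batch batch) {
    // my understanding: llama_decode() only schedules the compute graph on
    // the backend and can return before the GPU has finished executing it
    if (llama_decode(ctx, batch) != 0) {
        return; // decode failed
    }

    // ... CPU-side work here could overlap with GPU execution ...

    // llama_get_logits() calls llama_synchronize() internally, so this is
    // where the wait for the pending GPU work actually shows up
    float * logits = llama_get_logits(ctx);
    (void) logits;
}
```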

Edit: When I run a flamegraph, I get this:
[flamegraph image]
It seems like avoiding the sync would be very beneficial!
