
Why is llama_synchronize called? #6366

Closed
@EricLBuehler

Description


Hello all,

I was reading through the codebase and saw that llama_synchronize is called when the logits are retrieved:

https://github.com/ggerganov/llama.cpp/blob/cfc4d75df6399b36153ef739f2c1abee4c114bb8/ggml-cuda.cu#L2492

During my work on inference, I noticed that after the model runs, any synchronizing operation blocks for some time before it can complete. If I add an explicit synchronization right after the decode, the later operation obviously no longer blocks. However, this confuses me: why are the logits returned before the GPU is done "working"? Which operations are still pending at that point? I would appreciate any help!
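
To make sure I am reading the flow correctly, this is roughly the pattern I mean (a minimal sketch against the llama.cpp C API; the comments reflect my understanding of the scheduling, not confirmed behavior):

```c
#include "llama.h"

void decode_and_read(struct llama_context * ctx, struct llama_batch batch) {
    // my understanding: llama_decode() only schedules the compute graph on
    // the backend and can return before the GPU has finished executing it
    if (llama_decode(ctx, batch) != 0) {
        return; // decode failed
    }

    // ... CPU-side work here could overlap with GPU execution ...

    // llama_get_logits() calls llama_synchronize() internally, so this is
    // where the wait for the pending GPU work actually shows up
    float * logits = llama_get_logits(ctx);
    (void) logits;
}
```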

Edit: When I run a flamegraph, I get this:
[flamegraph image]
It seems like avoiding the sync would be very beneficial!
