Closed
Hello all,
I was reading through the codebase and saw that llama_synchronize is called when the logits are retrieved.
While working on inference, I noticed that after the model runs, whichever synchronizing operation comes next blocks for some time before it completes. If I add an explicit synchronization beforehand, that later operation no longer blocks, as expected. This still confuses me, though: why are the logits returned before the GPU has finished "working"? Which operations cause this? I would appreciate any help!
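To make the behavior I am describing concrete, here is a minimal sketch of what I think is going on, assuming the backend queues work asynchronously and only waits when something asks for the result. This is not llama.cpp code: `FakeBackend`, `submit`, `synchronize`, `get_logits`, and `elapsed_ms` are hypothetical names, and a sleeping thread stands in for the GPU.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <thread>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical stand-in for a backend that runs work asynchronously.
// It is an analogy only and does not use any llama.cpp API.
struct FakeBackend {
    std::future<std::vector<float>> pending;

    // Analogous to launching a decode: start the work and return immediately.
    void submit(std::chrono::milliseconds work_time) {
        pending = std::async(std::launch::async, [work_time] {
            std::this_thread::sleep_for(work_time);  // pretend GPU work
            return std::vector<float>{0.1f, 0.7f, 0.2f};
        });
    }

    // Analogous to an explicit sync: block until the queued work is done.
    void synchronize() {
        assert(pending.valid());
        pending.wait();
    }

    // Reading the result has to wait for completion, so the cost of the
    // unfinished work shows up here if nobody synchronized earlier.
    std::vector<float> get_logits() {
        assert(pending.valid());
        return pending.get();
    }
};

// Helper: wall-clock milliseconds spent inside a callable.
template <typename F>
long long elapsed_ms(F&& f) {
    const auto t0 = Clock::now();
    f();
    return std::chrono::duration_cast<std::chrono::milliseconds>(
               Clock::now() - t0).count();
}
```

Timing `submit`, then `synchronize`, then `get_logits` with `elapsed_ms` should show the launch returning almost immediately and the waiting time landing on whichever blocking call comes first. The total wait is the same either way; only the place where it shows up moves.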
Edit: When I run a flamegraph, I get this:
It seems like avoiding the sync would be very beneficial!