Description
Due diligence
- I have done my due diligence in trying to find the answer myself.
Topic
The PyTorch implementation
Question
Hello!
First of all, congrats! I've been doing some research on open-source speech-to-speech models, and yours is by far the most natural one – I'm really excited to see your upcoming developments!
My question is about the high latency I'm experiencing when I start the server with `python -m moshi.server` on a GCP VM instance with an L4 GPU. In the README.md, you state that Moshi achieves a theoretical latency of 160 ms (80 ms for the frame size of Mimi + 80 ms of acoustic delay), with a practical overall latency as low as 200 ms on an **L4 GPU**.
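For reference, here is how I understand the theoretical figure from the README – a quick sanity check of the arithmetic, assuming Mimi's published 12.5 Hz frame rate on 24 kHz audio (please correct me if those constants are wrong):

```python
# Sanity check of the latency figures quoted from the README.
# Assumed constants: Mimi runs at 12.5 Hz on 24 kHz audio.
SAMPLE_RATE_HZ = 24_000
FRAME_RATE_HZ = 12.5

frame_ms = 1000 / FRAME_RATE_HZ       # one Mimi frame -> 80.0 ms
acoustic_delay_ms = 80.0              # acoustic delay, per the README
theoretical_latency_ms = frame_ms + acoustic_delay_ms  # -> 160.0 ms

samples_per_frame = int(SAMPLE_RATE_HZ / FRAME_RATE_HZ)  # 1920 samples
print(frame_ms, theoretical_latency_ms, samples_per_frame)
```

So each 80 ms frame carries 1920 audio samples, and anything far above ~200 ms end-to-end should be overhead on my side rather than the model itself.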
As you can see in the image below, I'm experiencing latencies of up to 11 s. The latency increases as the conversation progresses; it reached 11 s at about 1 min 42 s into the conversation.

Do you know what I'm doing wrong?
Note: I'm still a noob on these topics, but very excited and eager to learn!
Thank you in advance!