Description
Due diligence
- I have done my due diligence in trying to find the answer myself.
Topic
The PyTorch implementation
Question
Hello!
First of all, congrats! I've been doing some research on open-source speech-to-speech models, and yours is by far the most natural one – I'm really excited to see your upcoming developments!
My question is about the high latency I'm experiencing when I start the server with `python -m moshi.server` on a GCP VM instance with an L4 GPU. In the README.md, you state that Moshi achieves a theoretical latency of 160 ms (80 ms for the frame size of Mimi + 80 ms of acoustic delay), with a practical overall latency as low as 200 ms on an **L4 GPU**.
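For reference, here is how I understand the theoretical figure from the README – a quick sanity check of the arithmetic, assuming Mimi's published 12.5 Hz frame rate on 24 kHz audio (please correct me if those constants are wrong):

```python
# Sanity check of the latency figures quoted from the README.
# Assumed constants: Mimi runs at 12.5 Hz on 24 kHz audio.
SAMPLE_RATE_HZ = 24_000
FRAME_RATE_HZ = 12.5

frame_ms = 1000 / FRAME_RATE_HZ       # one Mimi frame -> 80.0 ms
acoustic_delay_ms = 80.0              # acoustic delay, per the README
theoretical_latency_ms = frame_ms + acoustic_delay_ms  # -> 160.0 ms

samples_per_frame = int(SAMPLE_RATE_HZ / FRAME_RATE_HZ)  # 1920 samples
print(frame_ms, theoretical_latency_ms, samples_per_frame)
```

So each 80 ms frame carries 1920 audio samples, and anything far above ~200 ms end-to-end should be overhead on my side rather than the model itself.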
As you can see in the image below, I'm experiencing latencies of up to 11 s. The latency increases as the conversation progresses; it reached 11 s at about 1 min 42 s into the conversation.

Do you know what I'm doing wrong?
Note: I'm still a noob on these topics, but very excited and eager to learn!
Thank you in advance!