Closed
Description
I am running WizardLM2-8x22B IQ4_XS on an AWS g5.12xlarge, split across its 4 A10s, with the model fully offloaded to the GPUs. I haven't run a model of this size before, but I am getting around 95 t/s prompt processing and 14 t/s generation. What I noticed is that the ratio of prompt-processing speed to generation speed is much lower for this model than for the smaller models I've run. Can anyone explain why this is the case? And are there any suggestions for running the model faster on this system without too much quality loss? Thanks.