
Low prompt processing speed with Mixtral? #6740

Closed
@LiquidGunay

Description


I am running WizardLM2-8x22B IQ4_XS on an AWS g5.12xlarge (split across 4 A10s). I haven't run a model of this size before, but I am getting around 95 t/s for prompt processing and 14 t/s for generation (fully offloaded to the GPUs). What I noticed is that the ratio of prompt processing speed to token generation speed is much lower for this model than for smaller models. Can anyone explain why this is the case? Any suggestions for running the model faster on this system without too much quality loss? Thanks.
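
For concreteness, here is a minimal sketch of the ratio being discussed, using only the throughput figures reported above; the script itself is illustrative and not part of the original report:

```python
# Figures reported above for WizardLM2-8x22B IQ4_XS on 4x A10, fully offloaded.
pp_speed = 95.0  # tokens/s, prompt processing (prefill over the whole prompt in one batch)
tg_speed = 14.0  # tokens/s, generation (one token per forward pass)

ratio = pp_speed / tg_speed
print(f"prompt processing / generation ratio: {ratio:.1f}x")  # ~6.8x
```

The question is why this gap between prefill and generation throughput is noticeably smaller here than what the same setup shows for smaller models.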
