Closed
Description
I am running WizardLM2-8x22B IQ4_XS on an AWS g5.12xlarge, split across its 4 A10s, with the model fully offloaded to the GPUs. I haven't run a model of this size before, but I am getting around 95 t/s prompt processing and 14 t/s generation. What I noticed is that the ratio of prompt-processing speed to generation speed is much lower for this model than for the smaller models I've run. Can anyone explain why this is the case? And are there any suggestions for running the model faster on this system without too much quality loss? Thanks.