Potentially slow when running quantized versions on Desktop CPU #627

Open
@mergennachin

Description

I tried the following four quantization configurations on my MacBook Pro M1.

(1) - Really slow

python3 torchchat.py generate llama3 --prompt "Hello, my name is" --quantize '{"embedding": {"bitwidth": 4, "groupsize":32}, "linear:int8": {"groupsize" : 64}}' 

Average tokens/sec: 0.56

(2) - Acceptable?

python3 torchchat.py generate llama3 --prompt "Hello, my name is" --quantize '{"embedding" : {"bitwidth": 8, "groupsize": 0}}'

Average tokens/sec: 4.26

(3) - Slow

python3 torchchat.py generate llama3 --prompt "Hello, my name is" --quantize '{"embedding" : {"bitwidth": 4, "groupsize": 32}}'

Average tokens/sec: 2.31

(4) - Slow

python3 torchchat.py generate llama3 --prompt "Hello, my name is" --quantize '{"linear:int4": {"groupsize" : 256}}'

Average tokens/sec: 2.94
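For quick side-by-side comparison, the four runs above can be tabulated with a short script (a hypothetical summary helper, not part of torchchat; the config JSON strings and tokens/sec figures are copied verbatim from the runs above):

```python
import json

# (JSON string as passed to --quantize, average tokens/sec), copied from the runs above
runs = [
    ('{"embedding": {"bitwidth": 4, "groupsize": 32}, "linear:int8": {"groupsize": 64}}', 0.56),
    ('{"embedding": {"bitwidth": 8, "groupsize": 0}}', 4.26),
    ('{"embedding": {"bitwidth": 4, "groupsize": 32}}', 2.31),
    ('{"linear:int4": {"groupsize": 256}}', 2.94),
]

for cfg_str, tps in runs:
    cfg = json.loads(cfg_str)  # --quantize expects a valid JSON object
    schemes = ", ".join(f"{name}{params}" for name, params in cfg.items())
    print(f"{tps:5.2f} tok/s  {schemes}")
```

Sorting the printed rows makes the pattern stand out: the 4-bit embedding configs are the slow ones, and combining 4-bit embedding with int8 linear is by far the slowest.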

Setup:

git commit: 695a581
python version: 3.10.0
hardware: MacBook Pro M1

Internal Task: T187752023
