quantize to F32/F16/Q8_0 can result in a Q6_K output tensor #5818

Closed
@cebtenzzre

Description
Running quantize with a target dtype of F32, F16, or Q8_0 can result in a Q6_K output tensor without --pure (ref #5631 (comment)). This is surprising, as I would expect converting to F32 and then quantizing to F16 to produce similar results to converting directly to F16.

I suggest that the k-quant mixture logic should never attempt to decrease the quality of the output tensor, only increase it.
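The suggested fix could be sketched as a guard in the tensor-type selection: rank types by output quality and accept the mixture logic's per-tensor pick only if it is at least as good as the user's requested target. This is a minimal illustrative sketch, not llama.cpp's actual API; the enum, `quality_rank`, and `choose_tensor_type` are all hypothetical names.

```cpp
#include <cassert>

// Hypothetical ranking of a few quant types by output quality
// (higher = better). Not the real ggml_type enum.
enum QuantType { Q6_K, Q8_0, F16, F32 };

static int quality_rank(QuantType t) {
    switch (t) {
        case Q6_K: return 0; // 6-bit k-quant
        case Q8_0: return 1; // 8-bit quant
        case F16:  return 2; // 16-bit float
        case F32:  return 3; // 32-bit float
    }
    return -1;
}

// Guard for the k-quant mixture logic: keep the mixture's pick only
// when it does not reduce quality below the requested target dtype.
static QuantType choose_tensor_type(QuantType target, QuantType mixture_pick) {
    return quality_rank(mixture_pick) >= quality_rank(target)
         ? mixture_pick
         : target;
}
```

With this guard, a target of F16 would never be silently downgraded to Q6_K, while an upgrade (e.g. Q6_K target, F16 pick for a sensitive tensor) would still be allowed.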

Metadata

Labels: bug (Something isn't working), good first issue (Good for newcomers)
