Description
Inspired by trying to push through #1409 and by GPTQ experiments.
It's incredibly frustrating to fix one failure in a quantization workflow only to have a new failure force you to redo hours of quantization.
Some stories:
- AWQ quantization is done on CPU and then on GPU. I tried to work around [AWQ] Insane memory requirement: over 900GB for 32B model #1409 by increasing swap space, but:
  - When the process is killed by the Linux OOM killer, the whole terminal is killed, so there is no trace of which sample ID led to the OOM. If the run was left unattended, I have to guess because no log survives (see the progress-log sketch after this list).
  - After the CPU OOM was fixed, I got a GPU OOM and had to redo 40 min of CPU computation (on an overclocked Ryzen 9 9950X). It would be much more user-friendly to do a GPU test run at the very beginning, verifying ahead of time that all hardware requirements are met instead of surprising people after they have committed the time (see the preflight sketch after this list).
- GPTQ quantization can fail due to numerical instability, e.g. when no solution can be found for the Hessian.
  - Those errors should definitely be logged to disk.
  - It would save a great deal of time to be able to restart GPTQ from the layers that already converged, and to try more samples, longer sequence lengths, or a different dampening fraction for the follow-up layers that failed (see the checkpointing sketch below).
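
A minimal sketch of the persistent trace idea, assuming the calibration loop iterates over numbered samples; the file name and the `run_sample` callback are hypothetical, purely for illustration:

```python
# Hypothetical sketch: append the current sample ID (and any error) to a file
# before processing it and flush immediately, so the last line on disk still
# identifies the offending sample even if the OOM killer SIGKILLs the process.
from pathlib import Path

def calibrate_with_trace(samples, run_sample, log_path="calibration_progress.log"):
    with Path(log_path).open("a") as log:
        for sample_id, sample in enumerate(samples):
            log.write(f"starting sample {sample_id}\n")
            log.flush()                  # handed to the OS now; survives a SIGKILL of this process
            try:
                run_sample(sample)       # hypothetical per-sample calibration step
            except Exception as err:
                log.write(f"sample {sample_id} failed: {err!r}\n")
                log.flush()
                raise
```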
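
For the "GPU test run first" suggestion, a rough sketch that pushes one representative layer and one calibration-sized batch through the GPU before the long CPU stage starts; the layer/batch arguments are assumptions, not the project's actual API:

```python
# Hypothetical preflight check: fail fast on GPU OOM before investing ~40 min
# of CPU work. The caller passes one representative layer and one calibration
# batch shaped like the real workload.
import torch

def gpu_preflight(layer: torch.nn.Module, batch: torch.Tensor, device: str = "cuda:0") -> bool:
    try:
        layer = layer.to(device)
        with torch.no_grad():
            layer(batch.to(device))      # one forward pass at full calibration size
        return True
    except torch.cuda.OutOfMemoryError:
        return False
    finally:
        layer.to("cpu")                  # free the GPU again for the real run
        torch.cuda.empty_cache()
```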
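
And a sketch of the per-layer restart idea for GPTQ, assuming some `quantize_layer` routine that quantizes one layer in place and may raise on numerical failures; the function name and checkpoint layout are made up for illustration:

```python
# Hypothetical resume logic: layers that already have a checkpoint on disk are
# reloaded instead of re-quantized, so a failure on layer k only forces layer k
# and later layers to be redone, possibly with more samples or a different
# dampening fraction.
import os
import torch

def quantize_with_checkpoints(layers, quantize_layer, ckpt_dir="gptq_checkpoints"):
    os.makedirs(ckpt_dir, exist_ok=True)
    for idx, layer in enumerate(layers):
        path = os.path.join(ckpt_dir, f"layer_{idx:03d}.pt")
        if os.path.exists(path):
            layer.load_state_dict(torch.load(path))   # converged in an earlier run
            continue
        quantize_layer(layer)                          # may raise (numerical instability)
        torch.save(layer.state_dict(), path)
```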