Feature description
Currently, computations in the Attention layer are done in 32-bit floating point, while the rest of the layers can run integer computations (8-bit and 16-bit). It would be great if the computations in the Attention layer could also happen in 8-bit.
We already have intgemm for 8-bit integer GEMM operations in the other layers, and the same can be used for the Attention layer as well (see the sketch below).
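To make this concrete, here is a minimal sketch (not a proposed implementation) of routing one attention GEMM through intgemm's 8-bit interface. The `PrepareA`/`PrepareB`/`Multiply` calls and the `UnquantizeAndWrite` callback are intgemm's documented API; the function name `AttentionGemm8`, the header paths, and the quantization multipliers passed in by the caller are illustrative assumptions, and intgemm's dimension-multiple requirements on `width` and `B_cols` are assumed to hold.

```cpp
// Minimal sketch: one 8-bit GEMM as it could be used inside attention,
// e.g. for the Q * K^T scores (with K already laid out as width x B_cols).
// AttentionGemm8 and the *_quant_mult parameters are hypothetical; the
// intgemm calls below follow its documented Int8 interface.
#include "intgemm/intgemm.h"
#include "intgemm/aligned.h"

void AttentionGemm8(const float* A, const float* B, float* C,
                    intgemm::Index A_rows, intgemm::Index width,
                    intgemm::Index B_cols,
                    float a_quant_mult,   // typically 127.0f / max|A|
                    float b_quant_mult) { // typically 127.0f / max|B|
  // intgemm requires aligned buffers; AlignedVector provides that.
  intgemm::AlignedVector<int8_t> A_prepared(A_rows * width);
  intgemm::AlignedVector<int8_t> B_prepared(width * B_cols);

  // Quantize activations on the fly; a static operand (e.g. weights)
  // could be prepared once and cached instead.
  intgemm::Int8::PrepareA(A, A_prepared.begin(), a_quant_mult, A_rows, width);
  intgemm::Int8::PrepareB(B, B_prepared.begin(), b_quant_mult, width, B_cols);

  // Multiply in int8, unquantizing back to float on write-out to C.
  intgemm::Int8::Multiply(
      A_prepared.begin(), B_prepared.begin(), A_rows, width, B_cols,
      intgemm::callbacks::UnquantizeAndWrite(
          1.0f / (a_quant_mult * b_quant_mult), C));
}
```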
Some advantages of doing this:
- Faster inference
- Removal of an sgemm (32-bit float GEMM) library dependency for consumers who only want 8-bit integer GEMM