Update on "[ET-VK] Store weights transposed for int8 linear"
## Context
The weight tensor of a linear layer is usually stored in a transposed manner, such that when computing the matrix multiplication, the reduction traverses along the rows of the weight tensor as opposed to the columns. This results in a better memory access pattern for CPUs.
However, for GPUs, I have found that "un-transposing" the weight tensor results in better performance. This is likely because GPUs compute multiple output elements in parallel, so reading along the columns allows memory loads to be coalesced among threads in a work group.
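To make the access-pattern argument concrete, here is a rough CPU-side illustration (not the actual ET-VK shader code) of `out = in * W` with the two weight layouts; the function names and layouts are illustrative only:

```cpp
#include <cstddef>
#include <vector>

// in:  [M, K], out: [M, N], all row-major.
// Transposed weight (CPU-friendly):    W_t has shape [N, K]; each row n is contiguous.
// Un-transposed weight (GPU-friendly): W   has shape [K, N]; each row k is contiguous.

void linear_transposed_weight(
    const std::vector<float>& in, const std::vector<float>& w_t,
    std::vector<float>& out, size_t M, size_t K, size_t N) {
  for (size_t m = 0; m < M; ++m) {
    for (size_t n = 0; n < N; ++n) {
      float acc = 0.f;
      for (size_t k = 0; k < K; ++k) {
        // The reduction walks along a row of W_t: good locality for a
        // single CPU thread computing one output element at a time.
        acc += in[m * K + k] * w_t[n * K + k];
      }
      out[m * N + n] = acc;
    }
  }
}

void linear_untransposed_weight(
    const std::vector<float>& in, const std::vector<float>& w,
    std::vector<float>& out, size_t M, size_t K, size_t N) {
  for (size_t m = 0; m < M; ++m) {
    for (size_t n = 0; n < N; ++n) {
      float acc = 0.f;
      for (size_t k = 0; k < K; ++k) {
        // At a fixed k, neighbouring output columns n, n+1, ... read
        // adjacent elements w[k*N + n], w[k*N + n + 1], ...; on a GPU,
        // where adjacent threads in a work group each own one n, these
        // loads can be coalesced.
        acc += in[m * K + k] * w[k * N + n];
      }
      out[m * N + n] = acc;
    }
  }
}
```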
## Changes
* Introduce the ability to transpose the height and width dims when transferring tensor data to the GPU.
* Prepack the weight tensor "un-transposed" for the int8 quantized linear operator (see the sketch below).
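A minimal sketch of what transposing the height and width dims during staging could look like, assuming a simple row-major host buffer; this is illustrative and not the actual ET-VK prepacking code:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: swap the height and width dims of a 2D weight
// while filling the staging buffer, so a weight stored as [N, K] on the
// host is uploaded as [K, N] ("un-transposed") to the GPU.
std::vector<float> stage_transposed(
    const std::vector<float>& src, size_t height, size_t width) {
  std::vector<float> dst(width * height);
  for (size_t h = 0; h < height; ++h) {
    for (size_t w = 0; w < width; ++w) {
      // Element (h, w) of the source lands at (w, h) in the staging buffer.
      dst[w * height + h] = src[h * width + w];
    }
  }
  return dst;
}
```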
Differential Revision: [D72066588](https://our.internmc.facebook.com/intern/diff/D72066588/)
[ghstack-poisoned]