For various reasons, I don't want to make the block size larger than 128.
At the same time, it looks like the unpack function could use 256-bit registers to reduce the number of loads, stores, and instructions.
Am I wrong? Or does this idea simply not provide a speedup?
Or was that just not the goal?
Maybe it's not a good idea to mix registers of different widths; I'm not sure.
But at least for blocks with an even bit width, it should be possible to use only 256-bit instructions/registers.
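
To make the arithmetic behind this concrete, here is a rough sketch in C. It assumes 32-bit integers and a packed format in which a block of 128 integers at bit width b occupies exactly b 128-bit words; that is my assumption about the layout, not something confirmed from the code. The function names are made up for illustration. It only counts memory operations and says nothing about whether the lane layout would stay compatible with 256-bit registers.

```c
#include <assert.h>

/* Assumption: one block holds 128 integers of 32 bits each, so packing at
   bit width b yields 128 * b bits, i.e. exactly b 128-bit words. */
enum { BLOCK_INTS = 128, INT_BITS = 32 };

/* Number of full-width loads needed to read one packed block
   with registers of reg_bits bits (128 or 256). */
int packed_loads(int bit_width, int reg_bits) {
    assert(bit_width >= 1 && bit_width <= INT_BITS);
    return (BLOCK_INTS * bit_width) / reg_bits;
}

/* Bits left over after the full-width loads; a nonzero result means
   a narrower load is still needed for the tail of the packed block. */
int leftover_bits(int bit_width, int reg_bits) {
    assert(bit_width >= 1 && bit_width <= INT_BITS);
    return (BLOCK_INTS * bit_width) % reg_bits;
}

/* Number of full-width stores needed to write the unpacked block. */
int unpacked_stores(int reg_bits) {
    return (BLOCK_INTS * INT_BITS) / reg_bits;
}
```

Under these assumptions, an even bit width such as 6 needs 3 loads with 256-bit registers instead of 6, with nothing left over, and the stores drop from 32 to 16. An odd bit width such as 7 leaves a 128-bit tail, which is where the two register widths would have to be mixed.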