See #1782 for background on this request.
We would like to add support in certain CUDA kernels/ops for overall tensor sizes > INT_MAX (i.e., more than 2^31 − 1 elements).
High priority ops:
- 4bit blockwise quantization and dequantization
- 4bit GEMV
- LLM.int8() quantization
- LLM.int8() matmul and dequantization
Medium priority ops:
- 8bit dynamic blockwise quantization
Low priority ops:
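
For illustration, the core change in each kernel is switching index arithmetic from 32-bit to 64-bit. A minimal sketch, assuming a grid-stride loop pattern; the kernel name, parameters, and dequantization math below are placeholders, not the actual bitsandbytes kernels:

```cuda
#include <cstdint>

// Before (overflows once n > INT_MAX):
//   int idx = blockIdx.x * blockDim.x + threadIdx.x;
//
// After: grid-stride loop with 64-bit index arithmetic.
__global__ void kDequantizeBlockwise64(const unsigned char* __restrict__ in,
                                       float* __restrict__ out,
                                       const float* __restrict__ absmax,
                                       int blocksize,
                                       int64_t n)
{
    // Promote to 64-bit BEFORE multiplying; the product of two 32-bit
    // values can overflow even if each factor fits in 32 bits.
    int64_t stride = (int64_t)gridDim.x * blockDim.x;
    for (int64_t i = (int64_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += stride)
    {
        // The per-block scale lookup also needs 64-bit division.
        int64_t block = i / blocksize;
        out[i] = (float)in[i] * absmax[block];  // placeholder dequant math
    }
}
```

Note this is only the indexing change; the real 4-bit kernels pack two values per byte, so their element math differs from the placeholder above.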