See #1782 for background on this request.
We would like to add support in certain CUDA kernels/ops for overall tensor sizes > INT_MAX (i.e., more than 2^31 − 1 elements).
High priority ops:
- 4bit blockwise quantization and dequantization
- 4bit GEMV
- LLM.int8() quantization
- LLM.int8() matmul and dequantization
Medium priority ops:
- 8bit dynamic blockwise quantization
Low priority ops:
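
For illustration, the core change in each kernel is switching index arithmetic from 32-bit to 64-bit. A minimal sketch, assuming a grid-stride loop pattern; the kernel name, parameters, and dequantization math below are placeholders, not the actual bitsandbytes kernels:

```cuda
#include <cstdint>

// Before (overflows once n > INT_MAX):
//   int idx = blockIdx.x * blockDim.x + threadIdx.x;
//
// After: grid-stride loop with 64-bit index arithmetic.
__global__ void kDequantizeBlockwise64(const unsigned char* __restrict__ in,
                                       float* __restrict__ out,
                                       const float* __restrict__ absmax,
                                       int blocksize,
                                       int64_t n)
{
    // Promote to 64-bit BEFORE multiplying; the product of two 32-bit
    // values can overflow even if each factor fits in 32 bits.
    int64_t stride = (int64_t)gridDim.x * blockDim.x;
    for (int64_t i = (int64_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += stride)
    {
        // The per-block scale lookup also needs 64-bit division.
        int64_t block = i / blocksize;
        out[i] = (float)in[i] * absmax[block];  // placeholder dequant math
    }
}
```

Note this is only the indexing change; the real 4-bit kernels pack two values per byte, so their element math differs from the placeholder above.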