Description
Discussed in #1126
Originally posted by pkese October 28, 2023
If anyone is interested...
I made a small language model inspired by https://github.com/karpathy/nanoGPT in both PyTorch and TorchSharp.
The model has 2 transformer layers totalling 150k parameters and is trained on Shakespeare's text.
I found that moving to smaller data types improves training time, as does PyTorch's torch.compile, which is not available in TorchSharp.
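For context, here is a minimal sketch of what opting into the PyTorch 2.x compiler looks like; the placeholder module below is an assumption for illustration, not the actual nanoGPT port:

```python
import torch
import torch.nn as nn

# Placeholder module standing in for the 2-layer transformer (illustration only).
model = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)

# A single call opts the model into the PyTorch 2.x compiler; the first steps are
# slower while kernels are generated, later steps run the optimized code.
model = torch.compile(model)
```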
Here are some benchmarks of model training times (minutes and seconds) with CUDA on a small GPU (RTX 3070).
|  | default | tf32 | bf16 |
|---|---|---|---|
| TorchSharp 0.100.7 | 6:46 | 5:20 | N/A |
| PyTorch 2.0.1 | 5:31 | 5:27 | 4:28 |
| PyTorch 2.0.1 + torch.compile | 4:04 | 3:57 | 2:26 |
For `bf16` I used PyTorch's autocast:

```python
import torch
from torch.cuda.amp import autocast

with autocast(dtype=torch.bfloat16):
    ...  # <train code>
```
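For reference, a slightly fuller sketch of that pattern with assumed placeholder model/optimizer names (not the actual training code): the forward pass and loss sit inside the autocast region, while the backward pass and optimizer step run outside it, and bfloat16 needs no GradScaler (unlike fp16).

```python
import torch
import torch.nn as nn
from torch.cuda.amp import autocast

# Placeholder model, optimizer and batch (illustration only, not the real setup).
model = nn.Linear(64, 64).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
x = torch.randn(8, 64, device="cuda")
y = torch.randn(8, 64, device="cuda")

with autocast(dtype=torch.bfloat16):
    # Eligible ops (matmuls etc.) run in bfloat16 inside this region.
    loss = nn.functional.mse_loss(model(x), y)

# Backward pass and optimizer step run outside the autocast region;
# bfloat16 does not require a GradScaler, unlike fp16.
loss.backward()
optimizer.step()
optimizer.zero_grad()
```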
I couldn't achieve the same `bf16` functionality with TorchSharp.
I don't quite understand why default TorchSharp code is slower than default PyTorch code.
After I set `torch.backends.cuda.matmul.allow_tf32 = true` in both Python and TorchSharp, I get comparable performance (compare the default and tf32 columns above).
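For anyone reproducing this, the TF32 switch on the Python side is just the flag quoted above; the cuDNN flag below is a related option I'm adding for completeness, not something stated in the original setup:

```python
import torch

# Allow matmuls to use TensorFloat-32 on Ampere-class GPUs
# (TorchSharp exposes the same flag: torch.backends.cuda.matmul.allow_tf32 = true;).
torch.backends.cuda.matmul.allow_tf32 = True

# Separate TF32 switch for cuDNN convolutions (optional, not part of the setup above).
torch.backends.cudnn.allow_tf32 = True
```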
If someone is interested, I can publish the code.
(I was also trying to get TorchScript models to work on both sides, which messed up the code quite a bit ... and I might want to revert that.)
BTW, the TorchScript model was 1% slower to train in PyTorch and crashed in TorchSharp.