Fast CUDA implementation of soft-DTW for PyTorch.
Based on pytorch-softdtw but can run up to 100x faster!
Both the forward() and backward() passes are implemented using CUDA.
This implementation is partly inspired by "Developing a pattern discovery method in time series data and its GPU acceleration", which proposes a diagonal-based implementation of the Bellman recursion.
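To make the diagonal idea concrete, here is a minimal pure-Python sketch of the soft-DTW recursion swept along anti-diagonals. It is an illustration only, not the CUDA kernel in this repository: the names softmin and soft_dtw_diagonal are hypothetical, and the input is assumed to be a precomputed pairwise distance matrix for a single pair of sequences.

```python
import math


def softmin(values, gamma):
    """Smoothed minimum: -gamma * log(sum(exp(-v / gamma)))."""
    finite = [v for v in values if v != math.inf]
    if not finite:
        return math.inf
    lowest = min(finite)  # shift by the minimum for numerical stability
    total = sum(math.exp(-(v - lowest) / gamma) for v in finite)
    return lowest - gamma * math.log(total)


def soft_dtw_diagonal(dist, gamma=1.0):
    """Soft-DTW value for one distance matrix, swept by anti-diagonals."""
    n, m = len(dist), len(dist[0])
    # R has one extra border row and column; R[0][0] = 0 seeds the recursion.
    R = [[math.inf] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0.0
    for p in range(n + m - 1):  # p = (i - 1) + (j - 1)
        # Cells on one anti-diagonal depend only on the two previous
        # anti-diagonals, so this inner loop could run as parallel threads.
        for i in range(max(1, p - m + 2), min(n, p + 1) + 1):
            j = p - i + 2
            prev = (R[i - 1][j - 1], R[i - 1][j], R[i][j - 1])
            R[i][j] = dist[i - 1][j - 1] + softmin(prev, gamma)
    return R[n][m]
```

Because every cell on an anti-diagonal is independent of the others on that diagonal, a GPU version can assign one thread per cell and synchronize between diagonals, which is the property the diagonal-based formulation exploits.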
This code depends on PyTorch and Numba.
Just include soft_dtw_cuda.py in your projects, and you should be good to go!
You can also run the included profiler/test (tested with Python v3.6), and see the speedups you'd get:
git clone https://github.com/Maghoumi/pytorch-softdtw-cuda
cd pytorch-softdtw-cuda
python soft_dtw_cuda.py
Sample code is already provided in the script. Here's a quick example:
import torch
from soft_dtw_cuda import SoftDTW
# Create the sequences
batch_size, len_x, len_y, dims = 8, 15, 12, 5
x = torch.rand((batch_size, len_x, dims), requires_grad=True)
y = torch.rand((batch_size, len_y, dims))
# Transfer tensors to the GPU
x = x.cuda()
y = y.cuda()
# Create the "criterion" object
sdtw = SoftDTW(use_cuda=True, gamma=0.1)
# Compute the loss value
loss = sdtw(x, y) # Just like any torch.nn.xyzLoss()
# Aggregate and call backward()
loss.mean().backward()
Check out DeepNAG, our deep non-adversarial gesture generator. We show that an RNN-based gesture generator trained with soft DTW can outperform the same generator trained using a GAN framework.
If you use this code in your research, please cite the following publications:
@phdthesis{maghoumi2020dissertation,
title={{Deep Recurrent Networks for Gesture Recognition and Synthesis}},
author={Mehran Maghoumi},
year={2020},
school={University of Central Florida Orlando, Florida}
}
@inproceedings{maghoumi2021deepnag,
title={DeepNAG: Deep Non-Adversarial Gesture Generation},
author={Maghoumi, Mehran and Taranta, Eugene Matthew and LaViola, Joseph},
booktitle={26th International Conference on Intelligent User Interfaces},
pages={213--223},
year={2021}
}
Consider starring this repository if you find it helpful. Also, don't forget to thank the author of pytorch-softdtw for his CPU implementation.
Also, please consider contributing to this project by improving the performance, addressing existing limitations, etc. PRs are greatly welcome!
Yes, pruning is supported! Use the bandwidth argument to specify the Sakoe-Chiba bandwidth to use for pruning.
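For readers unfamiliar with Sakoe-Chiba pruning, the sketch below shows which cells of the alignment matrix a band of a given width keeps. It is a standalone illustration: sakoe_chiba_mask and kept_fraction are hypothetical helpers, and the exact way the library treats cells on the band edge is an assumption here.

```python
def sakoe_chiba_mask(len_x, len_y, bandwidth):
    """Return mask[i][j] == True for cells within `bandwidth` of the diagonal."""
    return [
        [abs(i - j) <= bandwidth for j in range(len_y)]
        for i in range(len_x)
    ]


def kept_fraction(mask):
    """Fraction of alignment-matrix cells that survive the pruning."""
    cells = [cell for row in mask for cell in row]
    return sum(cells) / len(cells)


mask = sakoe_chiba_mask(4, 4, bandwidth=1)
print(kept_fraction(mask))  # 10 of the 16 cells are kept
```

A narrower band skips more cells but also forbids alignments that stray far from the diagonal. If the two sequence lengths differ by more than the bandwidth, the final cell falls outside the band, so the band should be at least as wide as that difference.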
The speedup depends on your batch size and sequence length. The longer the sequences and the larger the batch size, the greater the speedup over the CPU implementation.
Here's what I get with Intel Core-i7 12700K and Titan RTX:
Profiling forward() + backward() times for batch_size=128, seq_len_a=17, seq_len_b=15, dims=2...
CPU: 0.004228143487125635
GPU: 0.0014472737908363341
Speedup: 2.9214537801325924
Profiling forward() + backward() times for batch_size=512, seq_len_a=64, seq_len_b=64, dims=2...
CPU: 0.023894597217440604
GPU: 0.003414902277290821
Speedup: 6.997154025853163
Profiling forward() + backward() times for batch_size=512, seq_len_a=256, seq_len_b=256, dims=2...
CPU: 0.5894654761068523
GPU: 0.0343648319132626
Speedup: 17.153160463425888
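The speedup figures above are simply the CPU time divided by the GPU time. A minimal sketch of that kind of measurement follows; average_seconds and speedup are hypothetical helpers rather than the profiler shipped in soft_dtw_cuda.py, and the repeat count is an arbitrary choice.

```python
import time


def average_seconds(fn, repeats=5):
    """Average wall-clock seconds per call of fn() over several repeats."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats


def speedup(cpu_seconds, gpu_seconds):
    """How many times faster the GPU run was than the CPU run."""
    return cpu_seconds / gpu_seconds


# Reproduces the last figure reported above from its two timings.
print(speedup(0.5894654761068523, 0.0343648319132626))
```

When timing GPU code, the callable passed to average_seconds should wait for the device to finish (for example by calling torch.cuda.synchronize()) before returning, because CUDA kernels launch asynchronously and would otherwise appear faster than they are.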
Note that there are tons of opportunities for optimizing this code further (e.g. various CUDA optimizations such as the use of shared memory). Contributions/improvements are greatly appreciated!
Numerical accuracy depends on the length of your inputs. Because of the sequential nature of this code, the longer your input sequences are, the larger the numerical errors become due to accumulation. In the backward() call especially, you could see floating point errors of up to 1e-3 in the resulting derivative tensor on uniform random inputs in the range [0, 1).
The unit tests included in soft_dtw_cuda.py verify the results against the CPU implementation.
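To see why error grows with sequence length, the sketch below emulates a sequential single-precision sum in pure Python and measures how far it drifts from a double-precision reference. It is not the soft-DTW kernel, and it assumes float32 arithmetic; to_float32 and float32_running_sum_error are hypothetical helpers written for this illustration.

```python
import struct


def to_float32(value):
    """Round a Python float (double precision) to the nearest float32."""
    return struct.unpack("f", struct.pack("f", value))[0]


def float32_running_sum_error(step, count):
    """Absolute error of a sequential float32 sum of `count` copies of `step`."""
    step32 = to_float32(step)
    total = 0.0
    for _ in range(count):
        total = to_float32(total + step32)  # round after every addition
    # Reference: the same sum computed with a single double-precision multiply.
    return abs(total - step32 * count)


for count in (10, 1000, 100000):
    print(count, float32_running_sum_error(0.1, count))
```

The error stays negligible for short chains of additions and becomes clearly visible for long ones, which mirrors the behavior described above: each step of the recursion builds on rounded results from earlier steps.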
Some limitations are:
- All sequences in the same batch should have the same length / number of features.
- Inputs cannot have lengths longer than 1024 (due to CUDA limitations on the maximum block size). The code will warn if your sequences are too long, and will fall back to the CPU implementation.
- You may run out of CUDA resources if your inputs are long (but still less than 1024). See below.
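The length limit in the list above can be checked up front in calling code. The sketch below is a hypothetical guard, not part of soft_dtw_cuda.py: pick_backend and MAX_CUDA_SEQ_LEN are illustrative names, and the 1024 threshold is taken directly from the limitation stated above.

```python
MAX_CUDA_SEQ_LEN = 1024  # assumed limit, taken from the limitation listed above


def pick_backend(len_x, len_y, cuda_available=True):
    """Return "cuda" or "cpu" for a pair of sequence lengths."""
    if not cuda_available:
        return "cpu"
    if max(len_x, len_y) > MAX_CUDA_SEQ_LEN:
        return "cpu"  # too long for a single thread block, so fall back
    return "cuda"


print(pick_backend(15, 12))    # short sequences fit on the GPU
print(pick_backend(2000, 12))  # too long, falls back to the CPU
```

A check like this only covers the hard length limit. It cannot predict the resource exhaustion mentioned in the last limitation, which depends on the specific GPU.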
If the kernel launch fails because CUDA runs out of resources, your sequences are too long and your GPU cannot spawn a sufficient number of threads. This is related to point 3 in the limitations above. I'm not sure if it's possible to query the CUDA device in Numba to see if launching the kernel is possible given the number of necessary threads. In these cases, consider using the CPU implementation.
This project is licensed under the MIT License.