Orthogonal Matching Pursuit (OMP), implemented using BLAS (CPU) and PyTorch (GPU).
Both implementations are considerably faster than the ones in scikit-learn; the PyTorch version on GPU is over 100 times faster.
A demo along with the implementation can be found in this Colab.
Associated paper, written with Sebastian Praesius: Efficient Batched CPU/GPU OMP.
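
For readers new to the algorithm, here is a minimal single-target sketch of plain Orthogonal Matching Pursuit in NumPy. It is not the batched BLAS or PyTorch code from this repository; the function name `omp_sketch` and its arguments are hypothetical and chosen only for illustration. Each step greedily picks the dictionary column most correlated with the current residual, then re-fits least squares on all columns selected so far.

```python
import numpy as np

def omp_sketch(X, y, n_nonzero_coefs):
    """Illustrative OMP for one target vector (hypothetical helper, not this repo's API).

    X: (n_samples, n_features) dictionary, y: (n_samples,) target.
    Returns a coefficient vector with at most n_nonzero_coefs nonzeros.
    """
    n_features = X.shape[1]
    selected = np.zeros(n_features, dtype=bool)  # marks columns already chosen
    support = []                                 # chosen column indices, in order
    residual = y.astype(float)
    coef = np.zeros(n_features)
    sol = None
    for _ in range(n_nonzero_coefs):
        # Greedy step: column with the largest absolute correlation to the residual.
        correlations = np.abs(X.T @ residual)
        correlations[selected] = -np.inf         # never pick the same column twice
        k = int(np.argmax(correlations))
        selected[k] = True
        support.append(k)
        # Orthogonal step: least-squares fit on every selected column.
        sol, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        residual = y - X[:, support] @ sol
    if support:
        coef[support] = sol
    return coef

# Tiny example with an identity dictionary, where exact recovery is easy to verify.
X_demo = np.eye(6)
y_demo = np.array([0.0, 2.0, 0.0, 0.0, -3.0, 0.0])
coef_demo = omp_sketch(X_demo, y_demo, n_nonzero_coefs=2)
```

This version solves a fresh least-squares problem at every iteration, which keeps the sketch short but is slower than incremental factorization updates, and it handles only one target at a time.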