🤖FFPA: extends FlashAttention-2 with Split-D, achieving ~O(1) SRAM complexity for large headdim and a 1.8x–3x speedup 🎉 vs SDPA EA (see the Split-D sketch after this list).
⚡️Write HGEMM from scratch using Tensor Cores with the WMMA, MMA, and CuTe APIs, achieving peak ⚡️ performance (see the WMMA sketch after this list).
General Matrix Multiplication using NVIDIA Tensor Cores
Vulkan & GLSL implementation of FlashAttention-2
A benchmarking framework for correlators of FX telescope arrays
Neural Network C is an advanced neural network implementation in pure C, optimized for high performance on CPUs and NVIDIA GPUs.
The MNIST classification problem is a fundamental machine learning task: recognizing handwritten digits (0-9) from a dataset of 70,000 grayscale images (28x28 pixels each). It serves as a benchmark for evaluating machine learning models, particularly neural networks.
TsuruTune is a comprehensive deep learning model optimization tool designed specifically for NVIDIA Jetson platforms and other edge devices. It leverages Tensor Core acceleration and memory-bandwidth alignment to achieve optimal deep learning inference performance.
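The FFPA entry above centers on Split-D tiling. As a rough illustration of the idea only (this is not FFPA's actual kernel, and the kernel name, tile sizes, and layouts are assumptions), the CUDA sketch below computes one BRxBC tile of the attention scores S = Q·Kᵀ while staging only a fixed DK-wide slice of the head dimension in shared memory at a time, so SRAM usage stays constant as headdim d grows:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BR 16   // query-tile rows
#define BC 16   // key-tile rows
#define DK 16   // fixed head-dim slice kept in SRAM

// Computes one BRxBC tile of S = Q * K^T, streaming the head dimension
// through shared memory DK columns at a time: SRAM use is O(1) in d.
// Assumes d % DK == 0 and blockDim = (BC, BR) with BR == BC == DK.
__global__ void splitd_qk(const float* Q, const float* K, float* S, int d) {
    __shared__ float q_s[BR][DK];
    __shared__ float k_s[BC][DK];
    int c = threadIdx.x, r = threadIdx.y;
    float acc = 0.0f;
    for (int k0 = 0; k0 < d; k0 += DK) {
        q_s[r][c] = Q[r * d + k0 + c];   // stage current Q slice
        k_s[c][r] = K[c * d + k0 + r];   // stage current K slice
        __syncthreads();
        for (int j = 0; j < DK; ++j)     // partial dot over this slice
            acc += q_s[r][j] * k_s[c][j];
        __syncthreads();
    }
    S[r * BC + c] = acc;                 // one score per thread
}

int main() {
    const int d = 256;                   // "large" headdim
    float *Q, *K, *S;
    cudaMallocManaged(&Q, BR * d * sizeof(float));
    cudaMallocManaged(&K, BC * d * sizeof(float));
    cudaMallocManaged(&S, BR * BC * sizeof(float));
    for (int i = 0; i < BR * d; ++i) Q[i] = 1.0f;
    for (int i = 0; i < BC * d; ++i) K[i] = 1.0f;
    splitd_qk<<<1, dim3(BC, BR)>>>(Q, K, S, d);
    cudaDeviceSynchronize();
    printf("S[0][0] = %.1f (expect %d)\n", S[0], d);  // all-ones dot product
    return 0;
}
```

A full FFPA-style kernel fuses this with the online-softmax and PV stages of FlashAttention-2; the point here is only that the shared-memory working set is fixed at (BR + BC) * DK floats regardless of d.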
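The HGEMM entry above walks through the WMMA API. As a minimal sketch of that approach (my illustration, not the repo's code), the kernel below has each warp compute one 16x16 tile of C = A * B on Tensor Cores, with half inputs and float accumulation; it assumes row-major matrices, M, N, K all multiples of 16, and sm_70 or newer:

```cuda
#include <cstdio>
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp per 16x16 output tile of C = A * B (half in, float accumulate).
// Assumes row-major A and B with M, N, K multiples of 16.
__global__ void wmma_hgemm(const half* A, const half* B, float* C,
                           int M, int N, int K) {
    int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize; // tile row
    int warpN =  blockIdx.y * blockDim.y + threadIdx.y;             // tile col
    if (warpM * 16 >= M || warpN * 16 >= N) return;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
    wmma::fill_fragment(acc, 0.0f);

    for (int k = 0; k < K; k += 16) {                  // march along K
        wmma::load_matrix_sync(a_frag, A + warpM * 16 * K + k, K);
        wmma::load_matrix_sync(b_frag, B + k * N + warpN * 16, N);
        wmma::mma_sync(acc, a_frag, b_frag, acc);      // Tensor Core MAC
    }
    wmma::store_matrix_sync(C + warpM * 16 * N + warpN * 16, acc, N,
                            wmma::mem_row_major);
}

int main() {
    const int M = 64, N = 64, K = 64;
    half *A, *B; float *C;
    cudaMallocManaged(&A, M * K * sizeof(half));
    cudaMallocManaged(&B, K * N * sizeof(half));
    cudaMallocManaged(&C, M * N * sizeof(float));
    for (int i = 0; i < M * K; ++i) A[i] = __float2half(1.0f);
    for (int i = 0; i < K * N; ++i) B[i] = __float2half(1.0f);
    dim3 block(128, 4);   // 4 warps in x, 4 in y -> 4x4 tiles per block
    dim3 grid((M / 16 + 3) / 4, (N / 16 + 3) / 4);
    wmma_hgemm<<<grid, block>>>(A, B, C, M, N, K);
    cudaDeviceSynchronize();
    printf("C[0] = %.1f (expect %d)\n", C[0], K);      // all-ones dot product
    return 0;
}
```

Peak-performance HGEMM, as the repo pursues, layers shared-memory staging, double buffering, and swizzled layouts on top of this skeleton; the sketch shows only the core fragment load/MMA/store pattern.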