A simple, unoptimized Flash Attention implementation (fa) and an optimized Flash Attention V2 implementation (fa_v2) in CUDA.
To compile the main program:

    nvcc -o flash flash.cu

To run the program:

    ./flash

To compile the benchmark program that compares fa and fa_v2:

    nvcc -o ben benchmark.cu flash_kernels.cu

To run the benchmark:

    ./ben

To save the benchmark results to a report file:

    ./ben > reports/benchmark_result.md

To profile with Nsight Systems:

    nsys profile --stats=true -o /data/coding/flash_attn_cuda/reports ./flash

To analyze further with Nsight Compute:

    ncu --set detailed -o /your_own_path/flash_attn_cuda/reports/ncu_result ./flash

Note: Nsight Compute may fail with a permission error when run inside a Docker container:

    ==PROF== Connected to process 1722 (/...../flash_attn_cuda/flash)
    ==ERROR== ERR_NVGPUCTRPERM - The user does not have permission to access NVIDIA GPU Performance Counters on the target device 0. For instructions on enabling permissions and to get more information see https://developer.nvidia.com/ERR_NVGPUCTRPERM

The benchmark compares three implementations:
- CPU reference implementation
- fa (Flash Attention) CUDA implementation
- fa_v2 (Flash Attention V2) optimized CUDA implementation
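
For orientation, a CPU reference of this kind can be sketched as plain single-head attention, O = softmax(Q K^T / sqrt(d)) V. This is a minimal sketch, not the repository's code: the function name `attention_cpu`, the row-major `[N, d]` layout, and the single-head restriction are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the repository): single-head attention
// O = softmax(Q K^T / sqrt(d)) V, with Q, K, V stored row-major as [N, d].
std::vector<float> attention_cpu(const std::vector<float>& Q,
                                 const std::vector<float>& K,
                                 const std::vector<float>& V,
                                 int N, int d) {
    std::vector<float> O(static_cast<std::size_t>(N) * d, 0.0f);
    std::vector<float> scores(N);
    const float scale = 1.0f / std::sqrt(static_cast<float>(d));

    for (int i = 0; i < N; ++i) {
        // Scaled dot product of query row i with every key row.
        float row_max = -INFINITY;
        for (int j = 0; j < N; ++j) {
            float s = 0.0f;
            for (int k = 0; k < d; ++k) s += Q[i * d + k] * K[j * d + k];
            scores[j] = s * scale;
            row_max = std::max(row_max, scores[j]);
        }
        // Numerically stable softmax: subtract the row maximum before exp.
        float denom = 0.0f;
        for (int j = 0; j < N; ++j) {
            scores[j] = std::exp(scores[j] - row_max);
            denom += scores[j];
        }
        // Output row i is the softmax-weighted sum of the value rows.
        for (int j = 0; j < N; ++j) {
            const float w = scores[j] / denom;
            for (int k = 0; k < d; ++k) O[i * d + k] += w * V[j * d + k];
        }
    }
    return O;
}
```

A straightforward loop like this materializes a full row of scores at a time, which is exactly what the GPU kernels avoid, but it makes a convenient ground truth for checking their output.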
The benchmark measures:
- Execution time
- GFLOPS (billions of floating-point operations per second)
- Accuracy compared to the CPU reference
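
As a rough illustration of how the last two metrics could be derived, the sketch below converts a measured time into GFLOPS and computes the largest element-wise deviation from the reference. The helper names and the FLOP model (4·N²·d per head, counting only the two matrix products and ignoring the softmax) are assumptions; the benchmark in this repository may count operations or report accuracy differently.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper. Assumed FLOP model: Q K^T and P V each cost
// 2*N*N*d FLOPs per head (one multiply and one add per term);
// softmax work is ignored.
double attention_gflops(int batch, int heads, int N, int d, double millis) {
    const double flops = 4.0 * batch * heads * static_cast<double>(N) * N * d;
    return flops / (millis * 1e-3) / 1e9;
}

// Hypothetical helper: largest absolute element-wise difference between
// a GPU result and the CPU reference (compared over the shorter length).
float max_abs_error(const std::vector<float>& gpu,
                    const std::vector<float>& ref) {
    float err = 0.0f;
    const std::size_t n = std::min(gpu.size(), ref.size());
    for (std::size_t i = 0; i < n; ++i)
        err = std::max(err, std::fabs(gpu[i] - ref[i]));
    return err;
}
```

Under this model, a single head with N = 1024 and d = 64 that finishes in 1 ms corresponds to roughly 268 GFLOPS.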
Example results show that fa_v2 achieves approximately 16-18% better performance than fa, and both GPU implementations significantly outperform the CPU reference implementation.
Only the Nsight Systems analysis was completed. Nsight Compute profiling failed with the ERR_NVGPUCTRPERM error because the Docker container lacked permission to access the GPU performance counters.