
Flash Attention CUDA Implementation

An easy, naive Flash Attention implementation (without optimizations) based on the original paper, plus a Flash Attention V2 implementation with optimizations.
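
Both variants build on the same "online softmax" trick: instead of materializing the full N×N score matrix, each query row keeps a running maximum, a running softmax denominator, and a rescaled output accumulator while streaming over the keys and values. The sketch below illustrates that recurrence with one thread per query row; it is an illustrative assumption, not the code in flash.cu or flash_kernels.cu (real flash-attention kernels additionally tile K and V through shared memory and process them block by block). The head-dimension bound of 128 and the launch configuration are arbitrary choices for the example.

```cuda
// online_softmax_sketch.cu -- illustrative only, not this repo's kernels.
// Each thread computes one query row of O = softmax(Q K^T / sqrt(d)) V by
// streaming over the keys, so the N x N score matrix is never stored.
#include <cuda_runtime.h>
#include <cmath>
#include <cstdio>
#include <vector>

__global__ void attention_row_kernel(const float* Q, const float* K,
                                     const float* V, float* O,
                                     int N, int d) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // query row for this thread
    if (i >= N) return;

    const float scale = rsqrtf((float)d);
    float m = -INFINITY;   // running max of scores seen so far
    float l = 0.0f;        // running softmax denominator
    float acc[128];        // output accumulator (assumes d <= 128)
    for (int x = 0; x < d; ++x) acc[x] = 0.0f;

    for (int j = 0; j < N; ++j) {                   // stream over keys/values
        float s = 0.0f;                             // s = (q_i . k_j) * scale
        for (int x = 0; x < d; ++x) s += Q[i * d + x] * K[j * d + x];
        s *= scale;

        float m_new = fmaxf(m, s);
        float correction = expf(m - m_new);         // rescale old state to new max
        float p = expf(s - m_new);                  // unnormalized weight of key j
        l = l * correction + p;
        for (int x = 0; x < d; ++x)
            acc[x] = acc[x] * correction + p * V[j * d + x];
        m = m_new;
    }
    for (int x = 0; x < d; ++x) O[i * d + x] = acc[x] / l;  // final normalization
}

int main() {
    const int N = 256, d = 64;                      // arbitrary example sizes
    std::vector<float> hQ(N * d), hK(N * d), hV(N * d), hO(N * d);
    for (int t = 0; t < N * d; ++t) {
        hQ[t] = 0.01f * (t % 97);
        hK[t] = 0.02f * (t % 89);
        hV[t] = 0.03f * (t % 83);
    }
    float *Q, *K, *V, *O;
    size_t bytes = (size_t)N * d * sizeof(float);
    cudaMalloc(&Q, bytes); cudaMalloc(&K, bytes);
    cudaMalloc(&V, bytes); cudaMalloc(&O, bytes);
    cudaMemcpy(Q, hQ.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(K, hK.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(V, hV.data(), bytes, cudaMemcpyHostToDevice);

    attention_row_kernel<<<(N + 127) / 128, 128>>>(Q, K, V, O, N, d);
    cudaMemcpy(hO.data(), O, bytes, cudaMemcpyDeviceToHost);
    printf("O[0][0] = %f\n", hO[0]);

    cudaFree(Q); cudaFree(K); cudaFree(V); cudaFree(O);
    return 0;
}
```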

Compilation

Main Program

To compile the main program:

nvcc -o flash flash.cu

To run the program:

./flash

Benchmark Program

To compile the benchmark program, which compares the performance of fa and fa_v2:

nvcc -o ben benchmark.cu flash_kernels.cu

To run the benchmark:

./ben

To save the benchmark results to a report file:

./ben > reports/benchmark_result.md

Performance Analysis

Nsight Systems

To profile with Nsight Systems:

nsys profile --stats=true -o /data/coding/flash_attn_cuda/reports ./flash

Nsight Compute

For further analysis and optimization with Nsight Compute:

ncu --set detailed -o /your_own_path/flash_attn_cuda/reports/ncu_result ./flash

Note: only the Nsight Systems analysis was completed for this project; running Nsight Compute inside a Docker container can fail with a permissions error like the following:

==PROF== Connected to process 1722 (/...../flash_attn_cuda/flash)
==ERROR== ERR_NVGPUCTRPERM - The user does not have permission to access NVIDIA GPU Performance Counters on the target device 0. For instructions on enabling permissions and to get more information see [https://developer.nvidia.com/ERR_NVGPUCTRPERM]

Benchmark Results

The benchmark compares three implementations:

  1. CPU reference implementation (a rough sketch of such a reference follows this list)
  2. fa: the Flash Attention CUDA implementation
  3. fa_v2: the optimized Flash Attention V2 CUDA implementation
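
For reference, a CPU implementation of O = softmax(Q Kᵀ / √d) V typically looks like the sketch below. This is a hedged illustration of what such a reference might look like, not the actual CPU code in benchmark.cu; it uses the standard row-by-row, max-subtracted (numerically stable) softmax.

```cuda
// cpu_reference_sketch.cu -- hypothetical CPU reference, not this repo's code.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// O = softmax(Q K^T / sqrt(d)) V, one query row at a time, with the usual
// max-subtraction for a numerically stable softmax.
void attention_cpu(const float* Q, const float* K, const float* V,
                   float* O, int N, int d) {
    std::vector<float> s(N);
    const float scale = 1.0f / std::sqrt((float)d);
    for (int i = 0; i < N; ++i) {
        float m = -INFINITY;
        for (int j = 0; j < N; ++j) {               // scores for query row i
            float dot = 0.0f;
            for (int x = 0; x < d; ++x) dot += Q[i * d + x] * K[j * d + x];
            s[j] = dot * scale;
            m = std::max(m, s[j]);
        }
        float l = 0.0f;                             // stable softmax
        for (int j = 0; j < N; ++j) { s[j] = std::exp(s[j] - m); l += s[j]; }
        for (int x = 0; x < d; ++x) {               // weighted sum of values
            float acc = 0.0f;
            for (int j = 0; j < N; ++j) acc += s[j] * V[j * d + x];
            O[i * d + x] = acc / l;
        }
    }
}

int main() {
    const int N = 4, d = 2;                         // tiny example
    std::vector<float> Q(N * d, 0.5f), K(N * d, 0.25f), V(N * d, 1.0f), O(N * d);
    attention_cpu(Q.data(), K.data(), V.data(), O.data(), N, d);
    printf("O[0][0] = %f\n", O[0]);                 // all values 1.0 -> output 1.0
    return 0;
}
```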

The benchmark measures:

  • Execution time
  • GFLOPS (giga floating-point operations per second)
  • Accuracy compared to the CPU reference (see the measurement sketch after this list)
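
A hedged sketch of how such metrics are commonly gathered is shown below (assumed methodology; the actual benchmark.cu may differ): kernel time averaged over repeated launches with CUDA events, GFLOPS derived from the roughly 4·N²·d floating-point operations of the two matrix products (QKᵀ and PV), and accuracy reported as the maximum absolute difference from the CPU reference output.

```cuda
// benchmark_sketch.cu -- hypothetical measurement harness, not benchmark.cu.
#include <cuda_runtime.h>
#include <cmath>
#include <cstdio>

__global__ void dummy_kernel(float* x, int n) {     // stand-in for fa / fa_v2
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = sqrtf(x[i] + 1.0f);
}

// Accuracy metric: max absolute difference between GPU and CPU outputs.
float max_abs_error(const float* gpu, const float* cpu, int n) {
    float e = 0.0f;
    for (int i = 0; i < n; ++i) e = std::fmax(e, std::fabs(gpu[i] - cpu[i]));
    return e;
}

int main() {
    const int N = 1024, d = 64, iters = 100;        // assumed problem size
    float* buf;
    cudaMalloc(&buf, (size_t)N * d * sizeof(float));

    // Execution time: average over repeated launches using CUDA events.
    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        dummy_kernel<<<(N * d + 255) / 256, 256>>>(buf, N * d);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    ms /= iters;

    // GFLOPS: attention performs two N x N x d matrix products (Q K^T and P V),
    // i.e. about 4 * N^2 * d floating-point operations per forward pass.
    double gflops = (4.0 * N * N * d) / (ms * 1e-3) / 1e9;
    printf("avg time = %.3f ms, ~%.1f GFLOPS (for the attention FLOP count)\n",
           ms, gflops);

    // Accuracy would then be max_abs_error(gpu_output, cpu_output, N * d).
    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(buf);
    return 0;
}
```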

Example results show that fa_v2 achieves roughly 16-18% better performance than fa, and that both GPU implementations significantly outperform the CPU reference.

