LLM Inference, AI Infra, CUDA
- Tsinghua University
- https://www.zhihu.com/people/mu-zi-zhi-6-28
- https://bruce-lee-ly.medium.com
Pinned
- decoding_attention: Decoding Attention is specially optimized for multi-head attention (MHA) using CUDA cores for the decoding stage of LLM inference (an illustrative kernel sketch follows this list).
- flash_attention_inference: Benchmarks the performance of the C++ interfaces of Flash Attention and Flash Attention v2 in large language model (LLM) inference scenarios (a generic timing-harness sketch appears below).
- cuda_hgemm: Several optimization methods for half-precision general matrix multiplication (HGEMM) using tensor cores with the WMMA API and MMA PTX instructions (see the baseline WMMA sketch below).
- cuda_hgemv: Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores (see the baseline kernel sketch below).
- cutlass_gemm: Multiple GEMM operators constructed with CUTLASS to support LLM inference (a minimal device-level GEMM sketch follows below).
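The core of the decoding stage that decoding_attention targets is attention over a single query token per head. Below is a minimal, unoptimized CUDA-core sketch of that computation; it is not the repository's code, and the kernel name, memory layouts, and launch configuration (one thread block per head, contiguous [seq_len, head_dim] K/V per head, fp32 accumulation) are assumptions for illustration only.

```cuda
#include <cuda_fp16.h>
#include <math.h>

// Illustrative single-query (decode-time) MHA kernel; not taken from decoding_attention.
// Assumed layouts: q is [num_heads, head_dim]; k and v are [num_heads, seq_len, head_dim];
// out is [num_heads, head_dim]. Assumes seq_len * sizeof(float) fits in shared memory.
// Example launch: decode_mha_naive<<<num_heads, 128, seq_len * sizeof(float)>>>(...);
__global__ void decode_mha_naive(const half* __restrict__ q,
                                 const half* __restrict__ k,
                                 const half* __restrict__ v,
                                 half* __restrict__ out,
                                 int seq_len, int head_dim, float scale) {
    extern __shared__ float scores[];  // one attention logit per key position
    const int head = blockIdx.x;
    const half* qh = q + head * head_dim;
    const half* kh = k + (size_t)head * seq_len * head_dim;
    const half* vh = v + (size_t)head * seq_len * head_dim;
    half* oh       = out + head * head_dim;

    // 1) q . k_i for a strided subset of key positions per thread, accumulated in fp32.
    for (int pos = threadIdx.x; pos < seq_len; pos += blockDim.x) {
        float dot = 0.0f;
        for (int d = 0; d < head_dim; ++d)
            dot += __half2float(qh[d]) * __half2float(kh[(size_t)pos * head_dim + d]);
        scores[pos] = dot * scale;
    }
    __syncthreads();

    // 2) Numerically stable softmax statistics, recomputed by every thread for simplicity.
    float max_s = -INFINITY;
    for (int pos = 0; pos < seq_len; ++pos) max_s = fmaxf(max_s, scores[pos]);
    float sum = 0.0f;
    for (int pos = 0; pos < seq_len; ++pos) sum += expf(scores[pos] - max_s);

    // 3) Weighted sum over v, one strided subset of output dimensions per thread.
    for (int d = threadIdx.x; d < head_dim; d += blockDim.x) {
        float acc = 0.0f;
        for (int pos = 0; pos < seq_len; ++pos)
            acc += (expf(scores[pos] - max_s) / sum) *
                   __half2float(vh[(size_t)pos * head_dim + d]);
        oh[d] = __float2half(acc);
    }
}
```

The fp32 accumulation and stable softmax mirror the usual precision choices for half-precision attention; an optimized version would replace the redundant per-thread softmax with warp or block reductions.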
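For the kind of latency comparison flash_attention_inference describes, kernel timing is typically done with CUDA events. The harness below is a generic sketch, not the repository's code; the callable passed to it is a stand-in for whichever C++ attention entry point is being measured.

```cuda
#include <cuda_runtime.h>

// Generic CUDA-event timing harness: warm up, then return the mean latency in ms.
// `fn` is whatever host-side call launches the kernels under test
// (a hypothetical stand-in here, not an API of any specific library).
template <typename Fn>
float benchmark_ms(Fn&& fn, int warmup_iters = 10, int timed_iters = 100) {
    for (int i = 0; i < warmup_iters; ++i) fn();  // warm up caches, JIT, clocks
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < timed_iters; ++i) fn();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float total_ms = 0.0f;
    cudaEventElapsedTime(&total_ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return total_ms / timed_iters;
}

// Usage sketch (the lambda body is hypothetical):
//   float ms = benchmark_ms([&] { /* launch attention kernel here */ });
```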
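As a baseline for what the WMMA path in cuda_hgemm builds on, here is a minimal, unoptimized tensor-core HGEMM kernel using the nvcuda::wmma API. It is not the repository's code; it assumes row-major A/C, column-major B, and M, N, K that are multiples of 16, with each warp computing one 16x16 output tile.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Naive tensor-core HGEMM sketch: C = A * B, all half precision.
// A: MxK row-major, B: KxN column-major, C: MxN row-major; M, N, K multiples of 16.
// Requires sm_70 or newer. Example launch:
//   dim3 block(128, 4);                                   // 4x4 warp tiles per block
//   dim3 grid((M + 63) / 64, (N + 63) / 64);
//   wmma_hgemm_naive<<<grid, block>>>(A, B, C, M, N, K);
__global__ void wmma_hgemm_naive(const half* A, const half* B, half* C,
                                 int M, int N, int K) {
    // 16x16 tile coordinates owned by this warp.
    int warp_m = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
    int warp_n = blockIdx.y * blockDim.y + threadIdx.y;
    if (warp_m * 16 >= M || warp_n * 16 >= N) return;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, half> c_frag;
    wmma::fill_fragment(c_frag, __float2half(0.0f));

    // Walk along K, accumulating 16x16x16 MMA steps on the tensor cores.
    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(a_frag, A + warp_m * 16 * K + k, K);
        wmma::load_matrix_sync(b_frag, B + warp_n * 16 * K + k, K);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    }
    wmma::store_matrix_sync(C + warp_m * 16 * N + warp_n * 16, c_frag, N,
                            wmma::mem_row_major);
}
```

The optimizations the repository describes (shared-memory tiling, double buffering, raw MMA PTX instead of WMMA) all start from this per-warp tile structure.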
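For cuda_hgemv, the baseline technique is a warp-per-row dot product on CUDA cores. The kernel below is an illustrative sketch under assumed layouts (row-major A, fp32 accumulation), not the repository's implementation.

```cuda
#include <cuda_fp16.h>

// Naive half-precision GEMV sketch on CUDA cores: y = A * x.
// A: MxN row-major, x: length N, y: length M. One warp per output row; each lane
// accumulates a strided slice of the dot product, then a warp-shuffle reduction
// combines the partial sums.
// Example launch (128 threads per block -> 4 rows per block):
//   hgemv_naive<<<(M + 3) / 4, 128>>>(A, x, y, M, N);
__global__ void hgemv_naive(const half* __restrict__ A,
                            const half* __restrict__ x,
                            half* __restrict__ y, int M, int N) {
    const int lane = threadIdx.x % warpSize;
    const int row  = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
    if (row >= M) return;

    float acc = 0.0f;  // accumulate in fp32 for accuracy
    for (int col = lane; col < N; col += warpSize)
        acc += __half2float(A[(size_t)row * N + col]) * __half2float(x[col]);

    // Tree reduction across the 32 lanes of the warp.
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        acc += __shfl_down_sync(0xffffffff, acc, offset);

    if (lane == 0) y[row] = __float2half(acc);
}
```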
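To give an idea of the device-level CUTLASS building block that operators like those in cutlass_gemm start from, the sketch below instantiates a basic fp16 GEMM with fp32 accumulation on tensor cores. The layouts, architecture tag (Sm80), and default tile shapes are assumptions for illustration, not the repository's configuration.

```cuda
#include <cutlass/gemm/device/gemm.h>

// Basic device-level CUTLASS GEMM sketch: D = alpha * A * B + beta * C.
// fp16 operands, fp32 accumulation, tensor cores, Ampere (Sm80); tile shapes
// fall back to CUTLASS's defaults for this configuration.
using Gemm = cutlass::gemm::device::Gemm<
    cutlass::half_t, cutlass::layout::RowMajor,     // A
    cutlass::half_t, cutlass::layout::ColumnMajor,  // B
    cutlass::half_t, cutlass::layout::RowMajor,     // C and D
    float,                                          // accumulator
    cutlass::arch::OpClassTensorOp,                 // use tensor cores
    cutlass::arch::Sm80>;                           // target architecture

// Hypothetical wrapper name; the argument pattern follows CUTLASS's basic GEMM examples.
cutlass::Status run_gemm(int M, int N, int K,
                         cutlass::half_t const* A, int lda,
                         cutlass::half_t const* B, int ldb,
                         cutlass::half_t* C, int ldc,
                         float alpha = 1.0f, float beta = 0.0f) {
    Gemm gemm_op;
    Gemm::Arguments args({M, N, K},       // problem size (GemmCoord)
                         {A, lda},        // TensorRef for A
                         {B, ldb},        // TensorRef for B
                         {C, ldc},        // TensorRef for C (source)
                         {C, ldc},        // TensorRef for D (destination)
                         {alpha, beta});  // linear-combination epilogue
    return gemm_op(args);                 // initializes and launches the kernel
}
```

LLM-serving GEMMs are usually specialized further (fused epilogues, split-K, mixed layouts per weight/activation), which is where constructing multiple operators on top of this template becomes useful.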