📖A curated list of Awesome LLM Inference Paper with codes, TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, Continuous Batching, FlashAttention, PagedAttention etc.
Updated Nov 1, 2024
🎉 Modern CUDA Learn Notes with PyTorch: CUDA Cores, Tensor Cores, fp32/tf32, fp16/bf16, fp8/int8, flash_attn, rope, sgemm, hgemm, sgemv, warp/block reduce, elementwise, softmax, layernorm, rmsnorm.
Shush is an app that deploys a WhisperV3 model with Flash Attention v2 on Modal and makes requests to it via a Next.js app.
Triton implementation of FlashAttention-2 that adds support for custom masks.
Performance of the C++ interface of FlashAttention and FlashAttention-2 in large language model (LLM) inference scenarios.
Uses the WhisperS2T and CTranslate2 libraries to batch-transcribe multiple files.
Flash Attention Implementation with Multiple Backend Support and Sharding: this module provides a flexible implementation of Flash Attention with support for different backends (GPU, TPU, CPU) and platforms (Triton, Pallas, JAX).
Poplar implementation of FlashAttention for IPU
Toy Flash Attention implementation in torch
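For readers unfamiliar with what these toy implementations compute, here is a minimal sketch of the tiled Flash Attention forward pass in plain PyTorch, using the online-softmax trick to avoid materializing the full attention matrix. It assumes a single head, no causal mask, and fp32; it is an illustration of the technique, not code from any of the repositories listed here.

```python
import torch

def flash_attention_forward(q, k, v, block_size=64):
    """q, k, v: (seq_len, head_dim). Returns softmax(q @ k.T / sqrt(d)) @ v,
    computed block by block over the key/value sequence."""
    seq_len, head_dim = q.shape
    scale = head_dim ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((seq_len, 1), float("-inf"))  # running max per query row
    row_sum = torch.zeros(seq_len, 1)                  # running softmax denominator

    for start in range(0, seq_len, block_size):
        k_blk = k[start:start + block_size]            # (B, d)
        v_blk = v[start:start + block_size]
        scores = (q @ k_blk.T) * scale                 # (seq_len, B)

        blk_max = scores.max(dim=-1, keepdim=True).values
        new_max = torch.maximum(row_max, blk_max)
        # Rescale previously accumulated output and denominator to the new max.
        correction = torch.exp(row_max - new_max)
        p = torch.exp(scores - new_max)                # unnormalized block probabilities

        out = out * correction + p @ v_blk
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        row_max = new_max

    return out / row_sum

# Quick check against the naive reference.
q, k, v = (torch.randn(128, 32) for _ in range(3))
ref = torch.softmax((q @ k.T) * 32 ** -0.5, dim=-1) @ v
assert torch.allclose(flash_attention_forward(q, k, v), ref, atol=1e-4)
```

The real kernels fuse this loop on-chip (SRAM) and parallelize over query blocks and heads; the toy version above only shows the numerically stable streaming accumulation.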
Transcribe audio in minutes with OpenAI's WhisperV3 and Flash Attention v2 + Transformers without relying on third-party providers and APIs. Host it yourself or try it out.
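As a hedged sketch of the pattern these Whisper transcription projects typically build on (not code from any of them), the snippet below loads WhisperV3 with Flash Attention 2 through Hugging Face Transformers and runs chunked, batched transcription. It assumes a CUDA GPU with the flash-attn package installed; the file path "audio.mp3" is a placeholder.

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

model_id = "openai/whisper-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # requires the flash-attn package and a supported GPU
)
processor = AutoProcessor.from_pretrained(model_id)

asr = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,          # chunked long-form transcription
    batch_size=8,               # batch chunks together for throughput
    torch_dtype=torch.float16,
    device=0,
)

print(asr("audio.mp3")["text"])  # placeholder input file
```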