OpenAI Triton implementation of StreamingLLM
You can now run batched inference with StreamingLLM, and the backward pass is supported too :D
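To make the supported behavior concrete, here is a minimal PyTorch reference for what the fused kernel computes: batched attention under the StreamingLLM mask (attention sinks plus a sliding window), with gradients flowing to q, k, and v. The shapes, parameter names, and default sink/window sizes are illustrative assumptions, not this repo's exported API.

```python
import torch

def streaming_mask(seq_len: int, num_sinks: int = 4, window: int = 256) -> torch.Tensor:
    """Boolean mask [seq_len, seq_len]: True where attention is allowed."""
    i = torch.arange(seq_len).unsqueeze(1)       # query positions
    j = torch.arange(seq_len).unsqueeze(0)       # key positions
    causal = j <= i
    keep = (j < num_sinks) | (i - j < window)    # sink tokens or recent window
    return causal & keep

B, H, T, D = 2, 8, 1024, 64                      # assumed batch/head/seq/head_dim
q = torch.randn(B, H, T, D, device="cuda", dtype=torch.float16, requires_grad=True)
k = torch.randn_like(q, requires_grad=True)
v = torch.randn_like(q, requires_grad=True)

mask = streaming_mask(T).to(q.device)
scores = (q @ k.transpose(-1, -2)) / D**0.5
scores = scores.masked_fill(~mask, float("-inf"))
out = torch.softmax(scores.float(), dim=-1).to(v.dtype) @ v  # softmax in fp32 for stability

out.sum().backward()   # backward works because everything is plain PyTorch
print(q.grad.shape, k.grad.shape, v.grad.shape)
```

The fused Triton kernel avoids materializing the dense `[T, T]` score matrix; this dense version is only a correctness reference.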
This implementation stores the attention scores of the StreamingLLM scheme in sparse COO format.
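A rough sketch of what that means: only the (query, key) pairs kept by the scheme (sink tokens plus the sliding window) are materialized, and they can be packed as COO index/value pairs. The helper below is an illustrative assumption written on the CPU side, not the kernel's actual data path.

```python
import torch

def streaming_coo_indices(seq_len: int, num_sinks: int = 4, window: int = 256):
    """Return a [2, nnz] LongTensor of (query, key) pairs kept by the scheme."""
    rows, cols = [], []
    for i in range(seq_len):
        # every query attends to the sink tokens ...
        keys = set(range(min(num_sinks, i + 1)))
        # ... plus the most recent `window` tokens (causal)
        keys.update(range(max(0, i - window + 1), i + 1))
        for j in sorted(keys):
            rows.append(i)
            cols.append(j)
    return torch.tensor([rows, cols], dtype=torch.long)

indices = streaming_coo_indices(seq_len=16, num_sinks=2, window=4)
values = torch.randn(indices.shape[1])               # stand-in for attention scores
scores = torch.sparse_coo_tensor(indices, values, size=(16, 16))
print(scores)  # only sink + sliding-window entries are stored
```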
RoPE is also fused into the GPU kernel, so you do not need to apply the rotary embeddings yourself!
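In plain PyTorch terms, the rotation below is what gets applied to q and k inside the kernel, so callers pass un-rotated projections. This uses the common rotate-half formulation; the exact convention and base are assumptions for illustration.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """x: [batch, heads, seq, head_dim] with even head_dim."""
    b, h, t, d = x.shape
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, device=x.device).float() / d))
    angles = torch.arange(t, device=x.device).float()[:, None] * inv_freq[None, :]  # [t, d/2]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., : d // 2], x[..., d // 2 :]
    rotated = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return rotated.to(x.dtype)

q = torch.randn(1, 8, 128, 64)
print(apply_rope(q).shape)  # rotation preserves the shape: [1, 8, 128, 64]
```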
- Utilizes Tensor Cores
- L2 cache optimization (see the sketch after this list)
- and more
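The L2 cache optimization here is typically the grouped program-ID ordering popularized by the Triton matmul tutorial: programs are launched so that a small group of tile rows finishes before moving on, keeping their operand tiles resident in L2. Below is that index remapping written in plain Python so it can be inspected without a GPU; whether the kernel uses exactly this group size is an assumption. Tensor Core use itself comes from `tl.dot` on fp16/bf16 tiles inside the kernel, and the swizzle only decides which tiles each program works on.

```python
def swizzled_pid(pid: int, num_pid_m: int, num_pid_n: int, group_size_m: int = 8):
    """Map a flat program id to (pid_m, pid_n) in grouped (L2-friendly) order."""
    num_pid_in_group = group_size_m * num_pid_n
    group_id = pid // num_pid_in_group
    first_pid_m = group_id * group_size_m
    group_size = min(num_pid_m - first_pid_m, group_size_m)
    pid_m = first_pid_m + (pid % group_size)
    pid_n = (pid % num_pid_in_group) // group_size
    return pid_m, pid_n

# Walk the first few program ids to see the grouped visiting order.
for pid in range(12):
    print(pid, swizzled_pid(pid, num_pid_m=4, num_pid_n=4, group_size_m=2))
```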