# Streaming-LLM with OpenAI Triton

An OpenAI Triton implementation of StreamingLLM.

You can now run batched inference with StreamingLLM, and the backward pass is supported :D

## Implementation

This implementation stores the attention scores of the StreamingLLM scheme in sparse COO format.
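
To illustrate the idea (this is not this repo's API), here is a minimal plain-PyTorch sketch of how the StreamingLLM attention pattern, attention sinks plus a causal sliding window, can be laid out as COO indices and stored in a sparse tensor. The names `streaming_attention_coo`, `num_sink`, and `window` are illustrative, not the actual function or parameter names used by this implementation.

```python
import torch


def streaming_attention_coo(seq_len: int, num_sink: int = 4, window: int = 256):
    """Build COO indices for the StreamingLLM attention pattern:
    each query attends to the first `num_sink` tokens (attention sinks)
    plus a causal sliding window of the most recent `window` tokens."""
    rows, cols = [], []
    for q in range(seq_len):
        keep = set(range(min(num_sink, q + 1)))             # attention sinks
        keep |= set(range(max(0, q - window + 1), q + 1))    # causal local window
        for k in sorted(keep):
            rows.append(q)
            cols.append(k)
    return torch.tensor([rows, cols], dtype=torch.long)


# Example: hold the (sparse) attention scores in a torch COO tensor.
indices = streaming_attention_coo(seq_len=8, num_sink=2, window=3)
values = torch.rand(indices.shape[1])  # placeholder scores, one per kept (q, k) pair
scores = torch.sparse_coo_tensor(indices, values, size=(8, 8))
```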

RoPE is also fused inside the GPU kernel, so you do not need to apply the rotary embeddings yourself!
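
For intuition, below is a minimal standalone sketch (not the repository's kernel) of how RoPE can be computed inline inside a Triton kernel instead of relying on a precomputed cos/sin cache. The kernel and wrapper names (`rope_kernel`, `apply_rope`) and their parameters are hypothetical and only assume a contiguous `(seq_len, head_dim)` CUDA tensor.

```python
import math

import torch
import triton
import triton.language as tl


@triton.jit
def rope_kernel(
    x_ptr, out_ptr,       # (seq_len, head_dim) input / output
    half_dim,             # head_dim // 2
    stride_s, stride_d,   # strides of x (out is assumed to match)
    log_theta,            # math.log of the RoPE base, e.g. log(10000.0)
    BLOCK_D: tl.constexpr,
):
    # One program per sequence position; rotate (even, odd) channel pairs.
    pos = tl.program_id(0).to(tl.float32)
    d = tl.arange(0, BLOCK_D)
    mask = d < half_dim
    d_f = d.to(tl.float32)

    # theta_i = base ** (-i / half_dim), computed inline -- no cos/sin cache needed.
    inv_freq = tl.exp(-(d_f / half_dim) * log_theta)
    angle = pos * inv_freq
    cos, sin = tl.cos(angle), tl.sin(angle)

    even_off = tl.program_id(0) * stride_s + (2 * d) * stride_d
    odd_off = even_off + stride_d
    x_even = tl.load(x_ptr + even_off, mask=mask, other=0.0).to(tl.float32)
    x_odd = tl.load(x_ptr + odd_off, mask=mask, other=0.0).to(tl.float32)

    tl.store(out_ptr + even_off, x_even * cos - x_odd * sin, mask=mask)
    tl.store(out_ptr + odd_off, x_even * sin + x_odd * cos, mask=mask)


def apply_rope(x: torch.Tensor, theta: float = 10000.0) -> torch.Tensor:
    """Apply rotary embeddings to a contiguous (seq_len, head_dim) CUDA tensor."""
    assert x.is_cuda and x.is_contiguous() and x.shape[-1] % 2 == 0
    seq_len, head_dim = x.shape
    half_dim = head_dim // 2
    out = torch.empty_like(x)
    rope_kernel[(seq_len,)](
        x, out, half_dim,
        x.stride(0), x.stride(1),
        math.log(theta),
        BLOCK_D=triton.next_power_of_2(half_dim),
    )
    return out
```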

This code is not fully optimized yet, but it works correctly! Remaining optimizations include:

- Utilize Tensor Cores
- L2 cache optimization
- and more