Streaming-LLM with OpenAI Triton

An OpenAI Triton implementation of Streaming-LLM.

You can now run batched inference with Streaming-LLM, and the kernel also supports the backward pass :D
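A minimal usage sketch of what a batched call with backward might look like. The entry point name, import path, and keyword arguments below are assumptions for illustration; the actual API in this repo may differ.

```python
# Hypothetical usage sketch -- the real entry point in this repo may differ.
import torch

# Placeholder for the repo's Triton-backed streaming attention; name is assumed.
from streaming_attention import streaming_attention  # hypothetical import

B, H, T, D = 2, 8, 4096, 64                      # batch, heads, tokens, head dim
q = torch.randn(B, H, T, D, device="cuda", requires_grad=True)
k = torch.randn(B, H, T, D, device="cuda", requires_grad=True)
v = torch.randn(B, H, T, D, device="cuda", requires_grad=True)

# Streaming-LLM keeps a few "sink" tokens plus a sliding window of recent tokens.
out = streaming_attention(q, k, v, num_sink=4, window_size=1024)

# The kernel supports backward, so it can be used during training.
out.sum().backward()
```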

Implementation

This implementation stores the attention scores of the streaming-LLM attention pattern (a few attention-sink tokens plus a sliding window of recent tokens) in sparse COO format.
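To illustrate the sparsity pattern (this is an illustrative layout, not necessarily the exact tensors the kernel consumes), the surviving (query, key) pairs of the streaming-LLM scheme can be enumerated as COO indices, with one score slot per kept pair:

```python
# Sketch of the streaming-LLM sparsity pattern in COO form (illustrative only).
import torch

def streaming_coo_indices(seq_len, num_sink=4, window_size=256):
    rows, cols = [], []
    for q in range(seq_len):
        # Attention-sink tokens: the first few tokens are always attended to.
        for k in range(min(num_sink, q + 1)):
            rows.append(q); cols.append(k)
        # Sliding window: the most recent tokens up to and including q.
        start = max(num_sink, q - window_size + 1)
        for k in range(start, q + 1):
            rows.append(q); cols.append(k)
    return torch.tensor([rows, cols])  # shape (2, nnz)

idx = streaming_coo_indices(seq_len=8, num_sink=2, window_size=3)
scores = torch.zeros(idx.shape[1])                 # one slot per kept (q, k) pair
mask = torch.sparse_coo_tensor(idx, scores, size=(8, 8))
```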

Also, RoPE (rotary position embedding) is applied inline inside the GPU kernel, so you do not need to apply RoPE to the queries and keys yourself!
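For reference, below is a plain PyTorch version of the rotation that such a kernel would fuse, using the common interleaved-channel RoPE convention. It is shown only to make clear what "inlined RoPE" saves you from doing outside the kernel; the kernel's exact convention may differ.

```python
# Reference (non-fused) RoPE, shown only to illustrate what the kernel inlines.
import torch

def rope(x, base=10000.0):
    # x: (..., seq_len, head_dim); rotate channel pairs by position-dependent angles.
    seq_len, dim = x.shape[-2], x.shape[-1]
    pos = torch.arange(seq_len, device=x.device, dtype=x.dtype)
    inv_freq = base ** (-torch.arange(0, dim, 2, device=x.device, dtype=x.dtype) / dim)
    theta = pos[:, None] * inv_freq[None, :]          # (seq_len, dim / 2)
    cos, sin = theta.cos(), theta.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]               # even / odd channels
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(1, 8, 128, 64)    # (batch, heads, seq, head_dim)
q_rot = rope(q)                   # with the fused kernel, this step is not needed
```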

This code is not fully optimized yet ... but it works correctly! Remaining optimizations include:

  • utilizing Tensor Cores
  • L2 cache optimization
  • and more
