This roadmap presents the features we're currently working on and planning to support. Your feedback is highly valued, so please don't hesitate to comment or reach out if there's anything you'd like to add or discuss. We're committed to delivering the best possible experience with ScaleLLM.
Q1-Q2 2024
Efficiency
Adding flash decoding with paged KV cache support [Done]
Introducing an attention kernel capable of supporting speculative decoding [Ongoing]
Exploring the feasibility of adopting the flashinfer library [Ongoing]
Implementing speculative decoding (see the sketch after this list) [Done]
Enabling CUDA graph for decoding to improve performance [Done]
Implementing dynamic split-fuse to reduce latency [Done]
Exploring lookahead decoding support
Implementing fused FFN (Feed-Forward Network) to enhance efficiency
Introducing a ring attention mechanism for handling long contexts
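For context on the speculative decoding item above, here is a minimal sketch of the draft-then-verify accept/reject step. The names (speculative_step, target_probs, draft_probs) are illustrative assumptions, not ScaleLLM's actual API, and a real implementation would operate on batched GPU tensors rather than Python lists.

```python
# Illustrative sketch of one speculative decoding step: a small draft model
# proposes k tokens, the target model scores them in a single pass, and each
# proposal is accepted with probability min(1, p_target / p_draft). On the
# first rejection, a correction token is sampled from the residual
# distribution max(0, p_target - p_draft) and decoding stops for this step.
import random

def speculative_step(target_probs, draft_probs, draft_tokens, vocab_size):
    """target_probs[i][t] / draft_probs[i][t]: probability of token t at draft
    position i under the target / draft model. draft_tokens[i]: the token the
    draft model sampled at position i. Returns the accepted tokens (plus one
    correction token if a proposal was rejected)."""
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p_t, p_d = target_probs[i][tok], draft_probs[i][tok]
        if random.random() < min(1.0, p_t / max(p_d, 1e-9)):
            accepted.append(tok)  # draft token agrees well enough with target
        else:
            # Resample from the residual distribution; fall back to the
            # target distribution if the residual is entirely zero.
            residual = [max(0.0, target_probs[i][t] - draft_probs[i][t])
                        for t in range(vocab_size)]
            weights = residual if sum(residual) > 0 else target_probs[i]
            accepted.append(random.choices(range(vocab_size), weights=weights)[0])
            return accepted  # stop at the first rejection
    return accepted  # all draft tokens accepted
```

Because the target model verifies all k draft positions in one forward pass, each accepted token saves a full decode step on the large model, which is where the speedup comes from.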
Cache
Implementing stateful conversation to avoid recomputation across chat turns (sketched below) [Done]
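As a rough illustration of the idea behind stateful conversation: cached KV blocks can be keyed by the token prefix they cover, so a follow-up turn that extends an earlier conversation only needs prefill for the newly appended tokens. The class and method names below (PrefixKVCache, match_prefix, run_prefill) are hypothetical, not ScaleLLM's implementation.

```python
# Hypothetical sketch of prefix reuse for chat sessions: KV blocks are keyed
# by a hash of the block-aligned token prefix they were computed for.
import hashlib

class PrefixKVCache:
    def __init__(self, block_size: int = 16):
        self.block_size = block_size
        self.blocks: dict[str, object] = {}  # prefix hash -> cached KV block

    @staticmethod
    def _key(tokens: list[int]) -> str:
        return hashlib.sha256(str(tokens).encode("utf-8")).hexdigest()

    def match_prefix(self, tokens: list[int]) -> int:
        """Return how many leading tokens are already covered by cached blocks."""
        covered = 0
        for end in range(self.block_size, len(tokens) + 1, self.block_size):
            if self._key(tokens[:end]) in self.blocks:
                covered = end
            else:
                break
        return covered

    def insert(self, tokens: list[int], kv_block) -> None:
        """Register the KV block computed for a block-aligned prefix."""
        self.blocks[self._key(tokens)] = kv_block

# Usage (hypothetical engine call): on a new chat turn, only the uncached
# suffix of the prompt needs prefill.
# cache = PrefixKVCache()
# cached_len = cache.match_prefix(prompt_tokens)
# run_prefill(prompt_tokens[cached_len:])
```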
New Models
New Devices
Usability
New GPU Architecture
Structural Decoding
Quantization
Supported Operating Systems
Misc