ScaleLLM Roadmap #84

Open · 9 of 31 tasks

guocuimi (Collaborator) commented Mar 16, 2024

We're excited to share the features we're currently working on and planning to support in this roadmap. Your feedback is highly valued, so please don't hesitate to comment or reach out if there's anything you'd like to add or discuss. We're committed to delivering the best possible experience with ScaleLLM.

Q1-Q2 2024

Efficiency

  • Adding flash decoding with paged KV cache support [Done]
  • Introducing an attention kernel capable of supporting speculative decoding [Ongoing]
    • Exploring the feasibility of adopting the flashinfer library [Ongoing]
  • Implementing speculative decoding [Done] (see the sketch after this list)
  • Enabling CUDA graph for decoding to improve performance [Done]
  • Implementing dynamic split-fuse to reduce latency [Done]
  • Exploring lookahead decoding support
  • Implementing fused FFN (Feed-Forward Network) to enhance efficiency
  • Introducing a ring attention mechanism for handling long contexts
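
To make the speculative decoding item above concrete, here is a minimal, self-contained sketch of the greedy verify-and-accept loop. The `draft_next` and `target_next` callables are toy stand-ins, not ScaleLLM APIs, and a real engine verifies all draft positions in a single batched target pass rather than one at a time.

```python
# Toy sketch of greedy speculative decoding: a cheap draft model proposes
# k tokens, the target model verifies them, and we keep the longest
# agreeing prefix plus one corrected token.

def speculative_step(prompt, draft_next, target_next, k=4):
    """One speculative-decoding step (greedy variant).

    draft_next/target_next: fn(token_list) -> next token id.
    Returns the tokens accepted this step.
    """
    # 1) Draft k tokens autoregressively with the cheap model.
    draft, ctx = [], list(prompt)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)

    # 2) Verify: accept draft tokens until the first disagreement.
    accepted, ctx = [], list(prompt)
    for t in draft:
        expected = target_next(ctx)
        if expected == t:
            accepted.append(t)          # draft agreed with target: keep it
            ctx.append(t)
        else:
            accepted.append(expected)   # first disagreement: take target's token
            break
    else:
        accepted.append(target_next(ctx))  # all k accepted: one bonus token
    return accepted

# Toy models: draft echoes last token + 1; target does too, except it
# emits 0 every 3rd position, so some drafts get rejected.
draft_next = lambda ctx: ctx[-1] + 1
target_next = lambda ctx: 0 if len(ctx) % 3 == 0 else ctx[-1] + 1

print(speculative_step([5, 6], draft_next, target_next, k=4))  # -> [7, 0]
```

The speedup comes from the draft's acceptance rate: every accepted token saves one sequential target-model step.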

Cache

  • Implementing stateful conversations to avoid recomputation in chat sessions [Done] (see the sketch below)
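
The idea behind stateful conversations, sketched below under assumed names (`SessionCache` and the `kv_state` placeholder are illustrative, not ScaleLLM internals): keep the KV state keyed by each session's token prefix, so a follow-up turn only pays prefill cost for the newly appended tokens.

```python
class SessionCache:
    def __init__(self):
        self._sessions = {}  # session_id -> (token_prefix, kv_state)

    def reusable_prefix(self, session_id, tokens):
        """Return (n, kv_state): the first n tokens need no recomputation."""
        cached, kv = self._sessions.get(session_id, ([], None))
        n = 0
        for a, b in zip(cached, tokens):
            if a != b:
                break
            n += 1
        return n, kv

    def store(self, session_id, tokens, kv_state):
        self._sessions[session_id] = (list(tokens), kv_state)

cache = SessionCache()
turn1 = [1, 2, 3, 4]
cache.store("chat-1", turn1, kv_state="kv(turn1)")
turn2 = turn1 + [5, 6]                      # follow-up appends two tokens
n, kv = cache.reusable_prefix("chat-1", turn2)
print(f"reuse {n} cached tokens, prefill only {len(turn2) - n}")  # reuse 4, prefill 2
```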

New Models

  • Integrating Google Gemma [Done]
  • Integrating Llama3 [Done]
  • Incorporating the Mixtral MoE model [Ongoing]
    • Implementing MoE (Mixture of Experts) kernels (see the routing sketch after this list)
  • Introducing the Mamba model
  • Introducing multi-modal models [Ongoing]
    • LLaVA model
  • LoRA & QLoRA
    • S-LoRA: Serving thousands of LoRA adapters
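
For the MoE kernels item above, here is a small sketch of the top-k routing a Mixtral-style layer performs, written as a plain NumPy reference with illustrative shapes; the real work is fusing this gather/scatter into efficient kernels.

```python
# Top-k expert routing: a router scores each token against every expert,
# keeps the top-k experts per token, and mixes their outputs with
# renormalized softmax weights.
import numpy as np

def moe_forward(x, router_w, experts, k=2):
    """x: (tokens, dim); router_w: (dim, n_experts); experts: list of fns."""
    logits = x @ router_w                       # (tokens, n_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]  # top-k expert ids per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = logits[t, topk[t]]
        w = np.exp(sel - sel.max())
        w /= w.sum()                            # softmax over the k selected
        for weight, e in zip(w, topk[t]):
            out[t] += weight * experts[e](x[t])
    return out

# Tiny demo: 4 scaling "experts", 3 tokens of dim 8.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8))
router_w = rng.normal(size=(8, 4))
experts = [lambda v, s=s: s * v for s in (1.0, 0.5, 2.0, -1.0)]
print(moe_forward(x, router_w, experts).shape)  # (3, 8)
```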

New Devices

  • Adding support for Apple chips
  • Exploring other chips such as TPU, etc.

Usability

  • Developing a Python wrapper for easier integration [Done] (usage sketch after this list)
  • Enhancing documentation for improved usability [Ongoing]
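
A hypothetical usage sketch for the Python wrapper, following the vLLM-style pattern such wrappers commonly adopt; the import path, class names, and parameters below are assumptions rather than the confirmed ScaleLLM API, so please check the documentation for the real interface.

```python
# Hypothetical usage of the Python wrapper; all names are assumptions.
from scalellm import LLM, SamplingParams  # assumed import path and classes

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")    # assumed ctor arg
params = SamplingParams(temperature=0.7, max_tokens=128)  # assumed fields

outputs = llm.generate(["Explain paged KV caching in one sentence."], params)
for output in outputs:
    print(output)
```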

New GPU Architecture

  • Turing architecture (sm75)

Structural Decoding

  • Function Calling

Quantization

  • Supporting FP8 for both model weights and KV caches (see the sketch below)
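
To illustrate the FP8 item, here is a rough sketch of per-tensor e4m3 quantization for a KV-cache block, assuming a simple max-based scale; NumPy has no FP8 dtype, so the 3-bit mantissa rounding below is a crude stand-in for a native hardware cast.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite e4m3 value

def fp8_round(x, mantissa_bits=3):
    """Round to the nearest 3-bit-mantissa step (crude e4m3 stand-in;
    ignores subnormals and special values for brevity)."""
    x = np.asarray(x, dtype=np.float64)
    out = np.zeros_like(x)
    nz = x != 0
    exp = np.floor(np.log2(np.abs(x[nz])))
    step = 2.0 ** (exp - mantissa_bits)
    out[nz] = np.round(x[nz] / step) * step
    return out

def quantize_kv_fp8(x):
    scale = np.abs(x).max() / FP8_E4M3_MAX   # per-tensor scale
    q = fp8_round(np.clip(x / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX))
    return q, scale                          # q fits in 8 bits; keep the scale

kv = np.random.default_rng(1).normal(size=(16, 64))
q, scale = quantize_kv_fp8(kv)
err = np.abs(q * scale - kv).max()
print(f"scale={scale:.4g}  max abs error={err:.3g}")
```

Production kernels typically refine this with per-channel or per-block scales, which trade a little metadata for noticeably lower quantization error.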

Supported Operating Systems

  • Extending support to macOS and Windows platforms

Misc

  • Conducting benchmarking to compare performance with other open-source projects [Ongoing]
  • Adding more benchmarks and unit tests for kernels and dependencies [Ongoing]
  • Adding more Prometheus metrics and creating a Grafana dashboard for monitoring (example after this list)
  • Loosening the coupling with PyTorch for easier deployment
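
As an illustration of the metrics item, here is a minimal exporter using the prometheus_client library; the metric names below are invented for this example and are not ScaleLLM's actual metrics.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("llm_requests_total", "Completed generation requests")
LATENCY = Histogram("llm_request_seconds", "End-to-end request latency")

def handle_request():
    with LATENCY.time():                        # observe wall-clock latency
        time.sleep(random.uniform(0.01, 0.05))  # stand-in for generation work
    REQUESTS.inc()

if __name__ == "__main__":
    start_http_server(9090)                     # Prometheus scrapes :9090/metrics
    while True:
        handle_request()
```
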
omarmhaimdat commented:

I think LLaMA 3 should be added as well, and it should probably be high priority.

guocuimi (Collaborator, Author) commented Apr 19, 2024

> I think LLaMA 3 should be added as well, and it should probably be high priority.

Yes, Llama3 is already supported; please check the latest release: https://github.com/vectorch-ai/ScaleLLM/releases/tag/v0.0.8

omarmhaimdat commented:

Wow @guocuimi, thank you for the quick update! You guys rock!
