- Port vllm/main features to ROCm
- Support Llama/Llama-2 models for v0.2.x
- Support SqueezeLLM
- Support YaRN
- Merge into upstream vllm ([Continuation] Merge EmbeddedLLM/vllm-rocm into vLLM main vllm-project/vllm#1836)
- Look into supporting multi-LoRA on ROCm (Add multi-LoRA support vllm-project/vllm#1804); see the multi-LoRA sketch after this checklist
- Support GGML Kernel (GGUF Quantization on ROCm) (https://github.com/EmbeddedLLM/vllm/tree/ggml-rocm) ([Kernel][Hardware][AMD] Add support for GGUF quantization on ROCm vllm-project/vllm#10254); see the GGUF sketch after this checklist
- Prompt caching via LMCache (https://github.com/LMCache/LMCache)
- Add ROCm support to torchac_cuda (https://github.com/EmbeddedLLM/torchac_rocm)
- Validate rocTX usage in Python (skip for now)
- Support AQLM Kernel (https://github.com/EmbeddedLLM/vllm/tree/aqlm-rocm)
- Upstream Cross-Attention kernel to support the Llama 3.2 Vision Model
  - BLOCKER: when passing text-only input, the LLM engine crashes.
- Upstream new features
  - Add context parallelism support through Star-Attention
- Benchmarks
  - Real-world distribution benchmark (https://blog.vllm.ai/2024/10/23/vllm-serving-amd.html)
  - Benchmark GGUF support
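
For the multi-LoRA item above, the sketch below shows how the upstream multi-LoRA API is typically exercised, which is what would need validating on ROCm. It is a minimal sketch: the base model, adapter name, and adapter path are placeholders, and it assumes the `enable_lora` engine flag and `LoRARequest` interface introduced by vllm-project/vllm#1804.

```python
# Minimal multi-LoRA sketch (model name and adapter path are placeholders).
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# enable_lora switches on LoRA adapter loading in the engine.
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)

sampling_params = SamplingParams(temperature=0.0, max_tokens=64)

# Each request can carry its own adapter via LoRARequest(name, id, path).
outputs = llm.generate(
    ["Generate a SQL query listing all users created in 2023."],
    sampling_params,
    lora_request=LoRARequest("sql_adapter", 1, "/path/to/sql-lora"),
)
print(outputs[0].outputs[0].text)
```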
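
Likewise, for the GGUF quantization item, the sketch below shows how a GGUF checkpoint is loaded through vLLM and could be used to smoke-test the ROCm kernel path. The file path and tokenizer repo are placeholders, and it assumes the upstream behaviour where a local `.gguf` file is passed as the model while the original Hugging Face tokenizer is supplied separately.

```python
# Minimal GGUF loading sketch (file path and tokenizer repo are placeholders).
from vllm import LLM, SamplingParams

llm = LLM(
    model="./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",   # local GGUF checkpoint
    tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0",   # tokenizer of the original model
)

outputs = llm.generate(
    ["What does ROCm stand for?"],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```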
Interesting works and features regarding:
- long-context mechanisms on vLLM:
  - [Bugfix] Fix evict v2 with long context length (vllm-project/vllm#5411)
  - [Model] Implement DualChunkAttention for Qwen2 Models (vllm-project/vllm#6139)
- compute kernels on vLLM:
- quantization schemes on vLLM:
- Disaggregated prefill feature on vLLM:
- vLLM v0.7.0 tracker ([Release]: v0.7.0 Release Tracker vllm-project/vllm#11218)