We summarize the issues we have received and our planned features in this issue. We will keep updating this issue.
Latest issue tracked: #677
Software Quality
- Code formatter Add code formatting script & Add CI to check code format #57
- Tests for model correctness Add tests for models #101
- Tests for samplers Add tests for sampler #108
- Pypi CD Add CD to PyPI #97
- CI
Installation
- CUDA version Build failure due to CUDA version mismatch #129
- Pre-built CUDA Wheels Publish wheels with pre-built CUDA binaries #139 Request for creation of a wheel for vllm #695
- Support ROCM Installing with ROCM #621
- Windows/WSL installation Bug: Windows installation #179 WSL Ubuntu installation issue #192
- H100 Add support for H100 #199 RuntimeError: attn_bias is not correctly aligned #407
- Support CUDA 12 cuda 12 #385
- Dockerfile feature request: Dockerfile #390
- All other issues with the `Installation` label
Documentation
- Documentation CD
- Documentation on LLMEngine and AsyncLLMEngine
- Documentation on user interfaces and the APIs How to set ParallelConfig and SchedulerConfig? #361 Where is the API reference? #395
- Documentation on distributed execution Documentation on distributed execution #206 When can I support multi graphics cards? #228 model parallelism #243 Multi-GPU inference and Specify which GPUs to be used during inference #250 多gpus如何使用? (How to use multiple GPUs?) #581
- More detailed guide on adding a new model (possibly with simplifications in the code), especially how to modify the `forward` function. How integrate with hf with minial modification? #242
- Include latency benchmark results.
- On memory usage. Question regarding the nearly double GPU memory consumption. #241 GPU consumption #550
- How to specify which GPU to use How to specify which gpu to use? #691 os.environ['CUDA_VISIBLE_DEVICES'] = '1' does not work in jupyter #571
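For the GPU-selection questions above, the usual workaround is to restrict the visible devices before vLLM (and therefore CUDA) is initialized; in a notebook this must happen before the first import that touches the GPU, which is why setting the variable later (as in #571) has no effect. A minimal sketch, assuming a single-process script; the model name is only an example:

```python
import os

# Must be set before CUDA is initialized, i.e. before importing vllm/torch
# triggers device discovery; setting it afterwards is silently ignored.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # expose only physical GPU 1

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # example model
outputs = llm.generate("Hello, my name is", SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```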
New Models
Decoder-only models
- BLOOM Support BLOOM #61
- Falcon Support for Falcon-7B / 40B models #195 Any plans to support Falcon? #197 Anyone adapting falcon 40B&7B models now? #356
- GPT-J Add support for GPTJ #198
- MPT Support for MPT-7B and MPT-30B #218 feature request: support mpt-30b #332
- LongChat Support for longchat-7b-16k #358
- Baichuan-7B why not support baichuan-7b? #303 baichuan-7b return value of apiserver is garbled #400 Support for baichuan models #428
- Baichuan-13B
- LLaMA-2 Support LLaMA-2 #501
Encoder-decoder models
- Whisper Whisper support #180
- T5 Adding support for encoder-decoder models, like T5 or BART #187 Support for fastchat-t5-3b-v1.0 #223 T5 model support #404 Finetuned Flan-T5 #434 T5 like encoder-decoder model support #668
- BART Adding support for encoder-decoder models, like T5 or BART #187
- GLM Support for chatglm-6b #231 when to support chatglm2-6b? #247
Other techniques:
- Quantized models: see Kernels/Quantized PagedAttention
- LoRA: Would it be possible to support LoRA fine-tuned models? #182
- Multi-modal models: [Question] Usage with Multimodal LLM #307
Frontend Features
vLLM demo frontends:
- List of inputs as OpenAI input Langchain passes `prompt` as a `list` instead of `str` #186 Possibility of Passing Prompts as List[str] to AsyncEngine.generate() #279 (see the sketch after this list)
- Echo Implementing Echo in OpenAI endpoint #201
- Support `ChatCompletion` Endpoint Support `ChatCompletion` Endpoint in OpenAI demo server #311
- Use soft embeddings as input does vicuna support embedding input? #369
- Support `logit_bias` [Feature] Add support for `logit_bias` #379 I want use the function prefix_allowed_tokens_fn of huggingface model.generate(), where of vllm's source code shall I modify? #415
- User-defined conversation template feature request: Support user-defined conversation template #408
- Specify GPU to run on How to specify which GPU the model inference on? #352 Specify GPUs bug (torch.distributed.all_reduce(torch.zeros(1).cuda())) #470
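For the "List of inputs" item at the top of this list, the offline `LLM` API already takes a list of prompts and batches them internally; the requests above are about getting the same behavior through the OpenAI-compatible server and `AsyncLLMEngine.generate()`. A minimal sketch of the offline path (the model name is only an example):

```python
from vllm import LLM, SamplingParams

prompts = [
    "The capital of France is",
    "The most popular programming language is",
]
sampling_params = SamplingParams(temperature=0.8, max_tokens=32)

llm = LLM(model="facebook/opt-125m")  # example model
# The offline API accepts List[str] directly and schedules the requests together.
for output in llm.generate(prompts, sampling_params):
    print(output.prompt, "->", output.outputs[0].text)
```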
Integration with other frontends:
- FastChat (merged)
- Ray Serve (merged)
- NVIDIA Triton NVIDIA Triton support #541
- SkyPilot
- LangChain (Support from LangChain) LangChain and LlamaIndex support #233 Langchain passes `prompt` as a `list` instead of `str` #186 Langchain/LLAMA_INDEX #553
Engine Optimization and New Features
- Smoothen the process of adding a new model Support custom models #112 Require a "Wrapper" feature #258 Best effort support for all Hugging Face transformers models #616
- User-specified tokenizer Support custom tokenizer #111 Why vllm does not support Chinese input #246 How to mannually Set use_fast for tokenizer to False? #259 The hf-internal-testing/llama-tokenizer do not support Chinese prompt #270 garbage output from h2oai/h2ogpt-gm-oasst1-en-2048-open-llama-13b #281
- Implement models in C++ to reduce Python overhead Modify the current PyTorch model to C++ #42 Tensor Parallelism vs Data Parallelism #367
- Pipeline parallel support pipeline parallel support in the future? #387
- Prefix sharing support Question about efficient memory sharing (prefix sharing) #227
- Classifier Free Guidance Is there a way to add classifier free guidance (CFG) to vllm while maintaining super fast inference? #620
- Speculative decoding Scope for assisted generation? #439 (see the sketch after this list)
- Distributed inference with other frameworks Remove Ray for the dependency #208 question: Is it possible to avoid ray in single machine multiple GPUs serving? #391 Support Kuberenetes for Distributed Serving #457
- Better model loading Faster model loading #474 Increase code robustness #519 Llama2 answers is noise #615
- More flexible stop criteria Support custom stop function? #551
- Random Python overheads Consider optimizing the API server #580
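For the speculative decoding item above, here is a minimal, framework-agnostic sketch of the greedy "assisted generation" scheme asked about in #439: a small draft model proposes a few tokens, the target model verifies all of them in one forward pass, and the longest agreeing prefix is kept. `draft_step` and `target_step` are hypothetical stand-ins for real model calls, not vLLM APIs:

```python
def assisted_generate(target_step, draft_step, prompt_ids, k=4, max_new_tokens=32):
    """Greedy assisted generation (a simplified form of speculative decoding).

    draft_step(ids)  -> the draft model's argmax next token for the sequence `ids`.
    target_step(ids) -> a list `preds` where preds[i] is the target model's argmax
                        next token given ids[:i + 1] (one scoring pass over `ids`).
    Both callables are hypothetical stand-ins for real model forward passes.
    """
    ids = list(prompt_ids)
    while len(ids) - len(prompt_ids) < max_new_tokens:
        # 1) The cheap draft model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_step(ids + draft))

        # 2) The target model scores prompt + draft in a single forward pass.
        preds = target_step(ids + draft)

        # 3) Accept the longest draft prefix that matches what the target
        #    model would have generated itself.
        n_accept = 0
        while n_accept < k and draft[n_accept] == preds[len(ids) - 1 + n_accept]:
            n_accept += 1
        ids += draft[:n_accept]

        # 4) Always append one token from the target model, so each iteration
        #    makes progress even if no draft token was accepted.
        ids.append(preds[len(ids) - 1])
    return ids
```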
Kernels
- Multi-query attention How does this compare to MQA (multi-query attention)? #169
- PagedAttention kernel with multiple query positions. Fix the rushed out multi-query kernel #44
- Quantized PagedAttention GPTQ / Quantization support? #174 What is the correct way to use quantized versions of vicuna or guanco? #210 8-bit quantization support #214 Not able to used qlora models with vllm #252 8bit support #295 support for quantized models? #316 Loading quantized models #392
- Sampling kernels Implement custom kernels for top-k and top-p sampling #125 Question about sampler. It takes too much time #249
- Condensed RotaryEmbeddings Support for Condensed RotaryEmbeddings #333 supporting superhot models? #388 RoPE scaling support? #464 Request: NTK rope support #479 Does vllm support vicuna-13b-v1.5-16k ? #674 Add AliBi context scaling into vllm for Baichuan13B #686 (see the sketch after this list)
- Flash Attention V2 Flash Attention V2 #485
- FP8 Kernel TE FP8 support? #448
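The Condensed RotaryEmbeddings / RoPE scaling requests above all come down to remapping positions before the rotary frequencies are computed, so that a context longer than the trained length folds back into the trained position range. A minimal sketch of the linear ("condensed" / SuperHOT-style) variant; the function name and signature are illustrative, not vLLM's kernel. NTK-style scaling (#479) instead rescales the `base` rather than the positions:

```python
import torch

def scaled_rotary_cos_sin(seq_len, head_dim, base=10000.0, scaling_factor=1.0):
    # Linear ("condensed") RoPE scaling: compress positions by scaling_factor,
    # e.g. scaling_factor=4.0 lets a model trained on 2k positions cover 8k.
    positions = torch.arange(seq_len, dtype=torch.float32) / scaling_factor
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    freqs = torch.outer(positions, inv_freq)  # [seq_len, head_dim // 2]
    return freqs.cos(), freqs.sin()           # applied to the query/key pairs as usual
```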
Bugs
- Floating point comparison Dangerous floating point comparison #71
- Check input length Check whether the input request is too long #113 Prompt size limits? It keeps hanging with prompts longer than 120 tokens #276 Long context will cause the vLLM stop #286 scheduler max-length #447
- Do not init process groups when using a single GPU Do not initialize process group when using a single GPU #117 How to initialize two LLMs in one service? #565 Running two different models on the same machine #654
- Ray tensor parallel bugs ray OOM in tensor parallel #322 Stuck while inferring with WizardCoder model #366 [MPT-30B] OutOfMemoryError: CUDA out of memory #372 Cuda failure 'peer access is not supported between these two devices' #406
- Performance comparison with TGI TGI performance is better than vllm on A800 #262 higher latency than TGI #335 Outdated benchmarks #381
- All other issues with the `Bug` label