V0.1.5 Iteration Plan
New Model Support
Feature Support
- Remove the `pycuda` dependency; [Question]: Question about KV-cache storage #20
- Change the `flash_attn` dependency to optional (see the import-guard sketch after this list); [Question]: Is A6000 supported? #23 @liyucheng09
- Add unittest. @liyucheng09
  - Supported in Feature(MInference): add unittest #31
- Support multi-GPU; [Question]: RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cuda:1) #25
- Add an end-to-end benchmark script using vLLM (a timing sketch follows below); [Question]: MInference Pre filling is slower than the vllm original version #18
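For the optional `flash_attn` item, a minimal sketch of one common pattern is an import guard with a plain PyTorch fallback, as below; the function name `attn`, the tensor layouts, and the fallback choice are assumptions for illustration, not the actual MInference code.

```python
# Sketch: make flash_attn optional via an import guard (illustrative only).
import torch.nn.functional as F

try:
    from flash_attn import flash_attn_func
    HAS_FLASH_ATTN = True
except ImportError:
    HAS_FLASH_ATTN = False

def attn(q, k, v, causal=True):
    """q, k, v: (batch, seqlen, nheads, headdim), the layout flash_attn expects."""
    if HAS_FLASH_ATTN:
        return flash_attn_func(q, k, v, causal=causal)
    # Fallback: PyTorch SDPA takes (batch, nheads, seqlen, headdim),
    # so transpose into and out of the flash_attn layout.
    out = F.scaled_dot_product_attention(
        q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2), is_causal=causal
    )
    return out.transpose(1, 2)
```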
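And for the end-to-end benchmark item, a rough timing sketch against the public vLLM API could look like the following; the model name, prompt, and sampling settings are placeholders, not the plan's actual script.

```python
# Sketch: time end-to-end generation through vLLM (placeholders throughout).
import time
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # placeholder model
params = SamplingParams(temperature=0.0, max_tokens=1)  # max_tokens=1 keeps the run prefill-dominated

prompt = "..."  # a long-context prompt, where pre-filling dominates latency
start = time.perf_counter()
llm.generate([prompt], params)
print(f"end-to-end latency: {time.perf_counter() - start:.2f}s")
```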
Bugfix
- Fix the `apply_rotary_pos_emb_single` function (a hedged sketch of the device fix follows this list); [Question]: RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cuda:1) #25
- Fix the missing `warnings` import in `setup.py`; [Bug]: missing warnings import in setup.py #28
- Fix compatibility with vLLM >= 0.4.1 (see the version-guard sketch below); [Bug]: NameError: name 'cache_ops' is not defined #42
- Fix the `is_flash_attn_2_available` issue
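As a rough illustration of the device-mismatch fix behind #25: the sketch below moves the index tensor onto the device of the cos/sin caches before indexing. The body of `apply_rotary_pos_emb_single` here is reconstructed from the standard rotary-embedding pattern and may differ from the repo's actual code.

```python
# Sketch: keep index tensors on the same device as the tensors they index
# (reconstructed from the standard rotary-embedding pattern; illustrative only).
import torch

def apply_rotary_pos_emb_single(x, cos, sin, position_ids):
    # On multi-GPU runs (e.g. cuda:1), position_ids can end up on a
    # different device than the cos/sin caches, which raises
    # "indices should be either on cpu or on the same device ...".
    position_ids = position_ids.to(cos.device)
    cos = cos[position_ids].unsqueeze(1)  # (bsz, 1, seq_len, head_dim)
    sin = sin[position_ids].unsqueeze(1)
    # rotate_half, inlined: split the last dim and swap halves with a sign flip.
    x1, x2 = x.chunk(2, dim=-1)
    rotated = torch.cat((-x2, x1), dim=-1)
    return (x * cos) + (rotated * sin)
```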
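Similarly, the `cache_ops` NameError in #42 stems from vLLM moving its cache kernels between releases; a version guard like the one below is one way to stay compatible, though the exact module paths are an assumption about vLLM's internals and should be verified per version.

```python
# Sketch: import vLLM's cache kernels across the 0.4.1 module move
# (module paths are assumptions about vLLM internals; verify per version).
try:
    # vLLM >= 0.4.1 exposes the kernels through a wrapper module.
    from vllm import _custom_ops as cache_ops
except ImportError:
    # Older vLLM exported them directly from the compiled extension.
    from vllm._C import cache_ops
```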