Open
Labels
feature request, question
Description
Describe the issue
Thank you for the amazing work!
- Does the model keep the entire KV cache from prefilling and generation on the device? If so, how can the device hold the KV values for 1M tokens in memory? If not, how did you reduce the overhead of moving KV values between host and device?
- What exactly is meant by "(1) FlashAttention-2 (2) Triton == 2.1.0 are requirements"? I ran `pip install Minference` without having FlashAttention-2 and Triton == 2.1.0 installed, and it failed with `ERROR: Failed building wheel for pycuda`.
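For a sense of scale behind the first question, here is a rough back-of-the-envelope estimate of the KV cache size at 1M tokens. The configuration is an assumption for illustration only (a Llama-3-8B-style model: 32 layers, 8 KV heads with grouped-query attention, head dimension 128, fp16 values), and `kv_cache_bytes` is a hypothetical helper, not part of MInference.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value):
    """Estimate the size of a dense KV cache in bytes.

    The factor of 2 accounts for storing one key tensor and one value
    tensor per layer.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# Assumed configuration (Llama-3-8B-style, fp16); adjust for the actual model.
total = kv_cache_bytes(
    num_tokens=1_000_000,
    num_layers=32,
    num_kv_heads=8,
    head_dim=128,
    bytes_per_value=2,
)

print(f"{total / 2**30:.1f} GiB")  # prints "122.1 GiB"
```

Under these assumptions the cache alone is about 128 KiB per token, which exceeds the memory of a single 80 GB GPU at 1M tokens, hence the question about whether the cache stays on the device or is offloaded to the host.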