Open
Description
openedon Jul 6, 2024
Describe the issue
Thank you for the amazing work!
-
Does the model store the whole kv-cache of prefilling and generation on device? If so, how can the device hold the memory of 1M kv values; if not, how did you reduce the overhead of loading kv-values from host to device, and vice versa?
-
What exactly does it mean by "(1) FlashAttention-2 (2) Triton == 2.1.0 are requirements"? I tried to use
pip install Minference
w/t havingFlashAttention-2
andTriton == 2.1.0
installed, and then it outputtedERROR: Failed building wheel for pycuda
.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment