[Question]: Question about KV-cache storage #20

Open

Describe the issue

Thank you for the amazing work!

  1. Does the model store the whole KV cache from prefilling and generation on the device? If so, how can the device hold the memory for 1M KV values (see the rough memory estimate after this list)? If not, how do you reduce the overhead of moving KV values from host to device and back?

  2. What exactly is meant by "(1) FlashAttention-2 (2) Triton == 2.1.0 are requirements"? I tried `pip install minference` without FlashAttention-2 and Triton == 2.1.0 installed, and it failed with `ERROR: Failed building wheel for pycuda` (a prerequisite-check sketch follows below).
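
To make question 1 concrete, here is a rough back-of-the-envelope estimate of how large a dense KV cache becomes at a 1M-token context. The layer count, KV-head count, head dimension, and dtype below are assumptions (roughly LLaMA-3-8B-style GQA), not the actual configuration used by MInference:

```python
# Rough KV-cache size estimate for a 1M-token context.
# All dimensions below are assumed (LLaMA-3-8B-like GQA), not taken
# from the MInference paper or code.
num_layers = 32
num_kv_heads = 8        # grouped-query attention KV heads
head_dim = 128
bytes_per_elem = 2      # fp16 / bf16
context_len = 1_000_000

# Factor of 2 accounts for storing both keys and values.
kv_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * context_len
print(f"Dense KV cache at 1M tokens: ~{kv_bytes / 1e9:.0f} GB")  # ~131 GB
```

Even with grouped-query attention, a dense cache of this size exceeds a single GPU's memory, which is why the on-device vs. host-offload question matters.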
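Regarding question 2, here is a small sketch of how one might check that the stated prerequisites are already present before running the install. The package names and the idea of checking up front are my assumptions, not documented MInference behaviour:

```python
# Hypothetical pre-install check: confirm flash-attn and triton==2.1.0
# are importable before attempting `pip install minference`.
import importlib.metadata as md

for dist, wanted in [("flash-attn", None), ("triton", "2.1.0")]:
    try:
        found = md.version(dist)
        note = f" (README asks for {wanted})" if wanted and found != wanted else ""
        print(f"{dist} {found} is installed{note}")
    except md.PackageNotFoundError:
        print(f"{dist} is missing -- install it first, then retry `pip install minference`")
```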

Labels: feature request, question