Replies: 3 comments
-
Yes, that has been attempted in the past; it is very slow.
-
For generating tokens you're I/O bound. Loading the data from RAM to VRAM and then from VRAM into the GPU cores is going to be slower than just loading the weights from RAM into the CPU.
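A rough back-of-envelope comparison shows why; every figure below is an assumption for a typical desktop, not a measurement:

```python
# Rough throughput ceilings from memory bandwidth alone; all figures are
# assumptions for a typical desktop setup, not measurements.

model_bytes = 7e9 * 0.5   # assume a ~7B-parameter model at ~4 bits per weight
pcie_bw = 16e9            # assumed PCIe 4.0 x16 effective bandwidth, bytes/s
dram_bw = 50e9            # assumed dual-channel DDR4/DDR5 bandwidth, bytes/s
vram_bw = 400e9           # assumed mid-range GPU VRAM bandwidth, bytes/s

# Each generated token has to read (roughly) all weights once, so the slowest
# link the weights cross per token caps the token rate.
print(f"stream RAM->VRAM each token: {pcie_bw / model_bytes:6.1f} tok/s")  # ~4.6
print(f"CPU reading from DRAM:       {dram_bw / model_bytes:6.1f} tok/s")  # ~14.3
print(f"weights resident in VRAM:    {vram_bw / model_bytes:6.1f} tok/s")  # ~114.3
```

With numbers in that ballpark, streaming the weights over PCIe every token is the slowest option of the three.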
-
Would on-device dynamic decompression be worth looking into? Model parameters, KV cache etc. are fairly compressible even before quantization. So one would load a compressed model to VRAM (faster I/O) and dynamically decompress the next layer's weights & cached KV in parallel with the inference step. For batches of size N, keep up to N decompressed layers in a circular buffer as values propagate, given the data hazard between layers. Furthermore, a just-in-time prefetcher that loads the next compressed portion of a large model RAM->VRAM, overwriting stale data, could compensate for the I/O latency. Does this make any sense?
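A minimal sketch of that overlap, assuming weights already sit compressed in VRAM and that some GPU-side decompression kernel exists; `decompress_layer()` is a placeholder and PyTorch streams are used only to illustrate the scheduling:

```python
# Decompress layer i+1 on a side stream while layer i computes, keeping a
# small circular buffer of decompressed layers. decompress_layer() is a
# placeholder for a real GPU decompression kernel (e.g. something nvCOMP-like).
import torch

N_BUF = 2                                  # circular buffer depth (double buffering)
copy_stream = torch.cuda.Stream()          # side stream for decompression
buffers = [None] * N_BUF                   # decompressed layer weights

def decompress_layer(compressed):
    # placeholder: stands in for a real GPU decompression kernel
    return compressed.float()

def forward(x, compressed_layers):
    buffers[0] = decompress_layer(compressed_layers[0])   # prime the pipeline
    for i in range(len(compressed_layers)):
        if i + 1 < len(compressed_layers):
            with torch.cuda.stream(copy_stream):
                # decompress layer i+1 while layer i is computing below
                buffers[(i + 1) % N_BUF] = decompress_layer(compressed_layers[i + 1])
        x = x @ buffers[i % N_BUF]          # stand-in for the real layer forward
        # make sure layer i+1's buffer is ready before the next iteration uses it;
        # a real implementation also has to guard buffer reuse across streams
        torch.cuda.current_stream().wait_stream(copy_stream)
    return x
```

This only shows the scheduling; whether it helps in practice depends on how much compressibility is left after quantization and on the decompression kernel's throughput relative to PCIe and VRAM bandwidth.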
-
I have a 12 GB GPU and 128 GB of system RAM. I can do ~64 tok/s fully on the GPU, but as soon as one layer is on the CPU it drops down to ~12 tok/s.
I couldn't find any discussion or approaches for dynamic paging of model layers - e.g. load the first 12 layers, compute the 12th layer's output, load the next 12 layers, compute the 24th layer's output - all on the GPU. Is doing such paging really slower than letting the CPU crunch through the numbers?
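A sketch of that paging scheme, assuming the layer weights live in pinned host memory and using PyTorch for illustration; `GROUP`, `page_in()` and the matmul stand-in are hypothetical, not llama.cpp internals:

```python
# Page layers onto the GPU in groups: copy group g+1 RAM->VRAM on a side
# stream while group g computes on the default stream.
import torch

GROUP = 12                                   # layers per page, as in the question
copy_stream = torch.cuda.Stream()            # side stream for RAM->VRAM copies

def page_in(group):
    # async host->device copies; needs pin_memory() host tensors to truly overlap
    with torch.cuda.stream(copy_stream):
        return [w.to("cuda", non_blocking=True) for w in group]

def forward(x, layers):
    groups = [layers[i:i + GROUP] for i in range(0, len(layers), GROUP)]
    resident = page_in(groups[0])
    for g in range(len(groups)):
        torch.cuda.current_stream().wait_stream(copy_stream)  # group g has arrived
        nxt = page_in(groups[g + 1]) if g + 1 < len(groups) else None
        for w in resident:                   # compute group g while group g+1 copies
            x = x @ w                        # stand-in for the real layer forward
        resident = nxt
    return x
```

The copy and the compute do overlap, but for single-token decoding the GPU finishes a 12-layer group much faster than PCIe can deliver the next one, so the pipeline ends up gated by the transfer bandwidth - which is the slowdown the earlier replies describe. It only starts to pay off when there is a lot of compute per transferred byte, e.g. large batches or long prompt processing.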