The documentation for the server says that the `-t N` option is not used if model layers are offloaded to the GPU. However, when only some layers are offloaded, I still see CPU load grow to 1200% with `-t 12` during inference, while GPU load stays very low, occurring in short bursts of up to 10% or so. But if the model is small enough that ALL layers can be offloaded to the GPU, then CPU load does not exceed 100% and GPU load reaches 100%.
So my guess is that the documentation is supposed to say "the `-t N` option is not used when ALL layers are offloaded to GPU", right?
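For reference, a minimal sketch of the two scenarios, using the server's `-m`, `-ngl` (number of layers to offload), and `-t` (CPU threads) flags; the model path and layer counts here are placeholders:

```sh
# Partial offload: 20 layers on GPU, the rest run on CPU,
# so the -t 12 worker threads are still heavily used during inference.
./server -m models/llama-2-7b.Q4_K_M.gguf -ngl 20 -t 12

# Full offload: all layers fit on the GPU (a large -ngl caps at the
# model's layer count), so -t has little effect on generation load.
./server -m models/llama-2-7b.Q4_K_M.gguf -ngl 99 -t 12
```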