
Description of "-t N" option for server is inaccurate #7355

Closed
tigran123 opened this issue May 17, 2024 · 1 comment · Fixed by #7362
@tigran123

The documentation for the server says that the `-t N` option is not used if model layers are offloaded to the GPU. However, when only some layers are offloaded to the GPU, I still see CPU load grow to 1200% with `-t 12` during inference, while GPU load stays very small, occurring in short bursts of up to 10% or so. But if the model is small enough that ALL layers can be offloaded to the GPU, then CPU load does not exceed 100% and GPU load reaches 100%.

So, my guess is that the documentation is supposed to say "-t N option not used when ALL layers are offloaded to GPU", right?
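For reference, a minimal sketch of the two scenarios I observed (the model path and layer counts are illustrative; `-ngl N` sets how many layers are offloaded to the GPU):

```sh
# Partial offload: only some layers on the GPU, so the remaining layers
# run on the CPU and -t 12 drives CPU load up to ~1200% during inference.
./server -m models/model.gguf -ngl 20 -t 12

# Full offload: all layers fit on the GPU (layer count depends on the model),
# so CPU load stays below 100% regardless of -t.
./server -m models/model.gguf -ngl 99 -t 12
```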

@JohannesGaessler
Collaborator

You're right, the documentation is wrong, see #7362.
