
perf: serve quantized versions of phi2, neuralchat, and psyfighter1 #80

Merged — 5 commits merged into main from quantize-all-others on Mar 19, 2024

Conversation

@sambarnes (Collaborator) commented Mar 18, 2024

all of them are downloaded on dev, and initial tests show they function on a single A10G

i still need to run experiments to pick an appropriate concurrent_inputs parameter for each model -- so we can see how actual performance differs from the original models & config

edit: ahh this discussion has great info.. perf suffers greatly at 8+ batch size
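For context, a minimal sketch of how that cap might be expressed, assuming the service is deployed on Modal (suggested by the `concurrent_inputs` parameter name); the app name and handler below are hypothetical, not taken from this PR:

```python
import modal

stub = modal.Stub("quantized-llm")  # hypothetical app name

@stub.function(
    gpu="a10g",                  # single A10G, matching the dev tests above
    allow_concurrent_inputs=4,   # kept under 8, where perf reportedly degrades
)
def generate(prompt: str) -> str:
    # inference code elided -- the point here is only the concurrency cap
    ...
```

Sweeping `allow_concurrent_inputs` over a few values (e.g. 2 / 4 / 8) against a fixed prompt set would show where per-request latency starts to fall off.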

uses the following from TheBloke:

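As a rough illustration of serving one of TheBloke's quantized builds, a vLLM load might look like the following; the model ID, quantization format, and engine choice are assumptions for the sketch, not taken from this PR:

```python
from vllm import LLM, SamplingParams

# model ID is illustrative -- TheBloke publishes GPTQ/AWQ/GGUF variants of many models
llm = LLM(model="TheBloke/phi-2-GPTQ", quantization="gptq", dtype="float16")

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```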
Code of Conduct

  • I agree to follow this project's Code of Conduct
  • I agree to license this contribution under the MIT LICENSE
  • I checked the current PR for duplication.

@sambarnes changed the title from "perf: serve quantized versions of phi2, neuralchat, and psyfighters" to "perf: serve quantized versions of phi2, neuralchat, and psyfighter1" on Mar 19, 2024
@sambarnes marked this pull request as ready for review on Mar 19, 2024, 14:40
@sambarnes merged commit 2b1cb4a into main on Mar 19, 2024 (3 checks passed)
@sambarnes deleted the quantize-all-others branch on Mar 19, 2024, 14:44