perf: serve quantized versions of phi2, neuralchat, and psyfighter1 #80

sambarnes · 2024-03-18T19:44:47Z

Details

all of them are downloaded on dev, and initial tests show they run on a single A10G

~~i still need to run experiments to pick an appropriate concurrent_inputs parameter for the model -- so we can see how actual performance differs from original models & config~~

edit: ahh, this discussion has great info: perf suffers greatly at batch sizes of 8 or more
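For illustration only, here is a minimal sketch of what capping concurrent inputs looks like in plain `asyncio`. It is not the code in this PR: `ConcurrencyLimiter`, `fake_generate`, and `serve_all` are hypothetical names, and the model call is a stand-in. The only value taken from this PR is the limit of 5, from the commit "limit max concurrency to 5 and drop batch size to 4".

```python
import asyncio

# Limit taken from this PR's commit message; the constant name is hypothetical.
MAX_CONCURRENT_INPUTS = 5


class ConcurrencyLimiter:
    """Caps how many coroutines run at once and records the peak observed."""

    def __init__(self, limit: int) -> None:
        self._sem = asyncio.Semaphore(limit)
        self.in_flight = 0
        self.peak = 0

    async def run(self, coro_fn, *args):
        # Wait for a free slot, then track in-flight work while the call runs.
        async with self._sem:
            self.in_flight += 1
            self.peak = max(self.peak, self.in_flight)
            try:
                return await coro_fn(*args)
            finally:
                self.in_flight -= 1


async def fake_generate(prompt: str) -> str:
    # Stand-in for a real model call (assumption: the real call is awaitable).
    await asyncio.sleep(0.01)
    return f"echo: {prompt}"


async def serve_all(prompts, limit: int = MAX_CONCURRENT_INPUTS):
    # Create the limiter inside the running event loop, then fan out all prompts.
    limiter = ConcurrencyLimiter(limit)
    results = await asyncio.gather(
        *(limiter.run(fake_generate, p) for p in prompts)
    )
    return results, limiter.peak


if __name__ == "__main__":
    results, peak = asyncio.run(serve_all([f"p{i}" for i in range(20)]))
    print(len(results), "responses; peak in-flight:", peak)
```

The sketch only models the concurrency cap. Batch size is a setting of the inference engine and is not represented here.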

uses the following from TheBloke:

Code of Conduct

I agree to follow this project's Code of Conduct
I agree to license this contribution under the MIT LICENSE
I checked the current PR for duplication.

sambarnes added 3 commits March 18, 2024 13:43

perf: serve quantized versions of phi2, neuralchat, and psyfighters

4c58baa

revert: psyfighter2 back to unquantized & A100

d2fcb67

refactor: remove unnecessary usage fields & code

96704ce

sambarnes changed the title ~~perf: serve quantized versions of phi2, neuralchat, and psyfighters~~ perf: serve quantized versions of phi2, neuralchat, and psyfighter1 Mar 19, 2024

perf: limit max concurrency to 5 and drop batch size to 4

f03a12d

sambarnes marked this pull request as ready for review March 19, 2024 14:40

chore: clean up dead code

bd2581b

sambarnes merged commit 2b1cb4a into main Mar 19, 2024
3 checks passed

sambarnes deleted the quantize-all-others branch March 19, 2024 14:44
