Description
First of all, I want to thank you for bringing such a great instrument into people's hands. Half the world's countries are already blocked from ChatGPT, and many people still don't realize this.
Secondly, I advise everyone not to waste time on the 7B and 13B models; the real ChatGPT-like experience starts only with the 30B model. It can hold a discussion pattern, has some short-term memory of things said earlier, and if you call out its mistakes (for example, it can't always determine the current time) it can turn it all into a joke (the 13B model can do none of this).
I have to say it is incredibly well optimized: for comparison, I wasn't able to run even the 1.5B GPT-2 model on my GPU, only the 774M one. In my opinion, 13B produces roughly the same amount of gibberish as the 774M GPT-2.
Now about the problems. There is clearly some CPU limit, perhaps intended for low-end hardware (since at higher speed its RAM use also grows faster; 30B grows to 24-25 GB)? With the 13B model it used at most 17% of the CPU, and with the 30B model it still uses only 17% at most. This limit seriously ruins the whole 30B experience, making it about 2x slower than 13B in response time and even in the speed at which it writes out words (it types like some ancient IBM machine). On powerful hardware the limit should be removable: I have plenty of resources, with 128 GB of RAM in quad-channel mode and a 14-core Xeon (on my machine, 30B plus Google Chrome together use only 20% of RAM).
But I don't see any way to remove the CPU limit myself; the distributed files are pure machine code.
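For what it's worth, the "limit" is most likely not a hard cap but the default worker-thread count: in llama.cpp-style runners (the ggml error message quoted in this report suggests that is what this is), generation historically defaulted to a small fixed number of threads, and a handful of threads on a 14-core / 28-logical-thread Xeon works out to roughly the observed ~17% CPU. A hypothetical sketch, assuming the llama.cpp `main` binary; the model path and prompt are placeholders:

```shell
# Hypothetical invocation (model path and prompt are placeholders).
# -t / --threads raises the worker-thread count from the small
# default to the number of physical cores, so all 14 cores are used:
./main -m ./models/30B/ggml-model-q4_0.bin -t 14 -p "Hello"
```

If this applies, no rebuild of the "pure machine code" files is needed; the thread count is a runtime flag. Note that using physical cores (14) rather than all logical threads (28) is usually recommended for this kind of memory-bandwidth-bound workload.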
There is also some memory limit that makes 30B crash after a certain amount of work; it always ends the discussion abruptly, around the 5th-7th prompt, with this message:
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 537269808, available 536870912)