Description
I've been really enjoying using both llama.cpp-python and the original llama.cpp. These are amazing developments, especially for folks without massively powerful GPUs.
There's a really nice feature that was implemented in llama.cpp in January to allow self-extend (à la LongLLM's approach). It works well with both main.cpp and server.cpp, and plenty of folks have noted that self-extend is especially useful with Mistral/Mixtral, Gemma, and Phi-2.
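For context, the knobs on the llama.cpp side are the two group-attention parameters (a group factor and a window width). My current workaround from Python looks roughly like the sketch below; the `--grp-attn-n` / `--grp-attn-w` flag names are from memory, so please double-check them against your build's help output:

```python
import subprocess

# Rough sketch of my current workaround: shell out to the llama.cpp binary
# and enable self-extend via the group-attention flags (flag names from
# memory; they may differ in newer builds).
result = subprocess.run(
    [
        "./main",
        "-m", "models/mistral-7b-instruct.Q4_K_M.gguf",
        "-c", "8192",             # extended context length
        "--grp-attn-n", "4",      # group-attention factor
        "--grp-attn-w", "2048",   # group-attention width
        "-f", "article.txt",      # the 'just-slightly-too-long' input
    ],
    capture_output=True,
    text=True,
)
print(result.stdout)
```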
It appears someone else might have been asking about this earlier here. Right now, I'm having to move in and out of python (as sketched above) when I want to run summarization on a 'just-slightly-too-long' article with self-extend. Would you consider implementing self-extend as an option in llama.cpp-python?
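To be concrete about what I had in mind, something along these lines would be ideal. This is purely a sketch, not an existing API; the `grp_attn_*` keyword names are just placeholders mirroring the llama.cpp flags:

```python
from llama_cpp import Llama

# Hypothetical API sketch -- grp_attn_n / grp_attn_w do not exist today;
# the names simply mirror llama.cpp's group-attention options.
llm = Llama(
    model_path="models/mistral-7b-instruct.Q4_K_M.gguf",
    n_ctx=8192,        # extended context via self-extend
    grp_attn_n=4,      # hypothetical: group-attention factor
    grp_attn_w=2048,   # hypothetical: group-attention width
)

# Then the usual completion call, no subprocess juggling needed.
summary = llm("Summarize the following article:\n" + open("article.txt").read())
print(summary["choices"][0]["text"])
```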