You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed). -- I searched for xglm but couldn't find any mention at all in the whole project.
I reviewed the Discussions, and have a new bug or useful enhancement to share.
Feature Description
Add support for facebook's XGLM models (e.g. xglm-564M), including converting them to gguf and then running them with llama.cpp. The HuggingFace implementation docs are here, while the models specs are here, for reference.
Motivation
XGLM models have good performance on specific tasks, despite their size. For example, the 564M model is small enough to run on any device, and can understand quite well the context of a sentence and extract specific parts (a useful use-case for an assistant). I could run xglm-564M using HuggingFace's framework from within termux, however it doesn't support (efficient) quantization on CPU, and so the model ends up using 3 GB of RAM and being quite slow. Also, I would like to embed an XGLM model in an Android app, and doing so with llama.cpp would be much simpler (and more efficient) than packaging python, transformers and all other dependencies of the HuggingFace implementation.
Possible Implementation
The convert-hf-to-gguf.py script should add support for XGLMForCausalLM and probably something more needs to be implemented in llama.cpp if some new layer type needs to be implemented. I would be open to help with implementation, however I don't know much about LLM architecture in general, and about the llama.cpp project specifically. I looked at one recently merged PR that added support for a new model structure, but couldn't really understand what was going on. Do you have some documentation on how to add new model types? Would it be simple to add XGLM or does it have a nonstandard architecture?
Prerequisites
Feature Description
Add support for facebook's XGLM models (e.g.
xglm-564M
), including converting them togguf
and then running them withllama.cpp
. The HuggingFace implementation docs are here, while the models specs are here, for reference.Motivation
XGLM models have good performance on specific tasks, despite their size. For example, the 564M model is small enough to run on any device, and can understand quite well the context of a sentence and extract specific parts (a useful use-case for an assistant). I could run xglm-564M using HuggingFace's framework from within termux, however it doesn't support (efficient) quantization on CPU, and so the model ends up using 3 GB of RAM and being quite slow. Also, I would like to embed an XGLM model in an Android app, and doing so with llama.cpp would be much simpler (and more efficient) than packaging python,
transformers
and all other dependencies of the HuggingFace implementation.Possible Implementation
The
convert-hf-to-gguf.py
script should add support forXGLMForCausalLM
and probably something more needs to be implemented inllama.cpp
if some new layer type needs to be implemented. I would be open to help with implementation, however I don't know much about LLM architecture in general, and about the llama.cpp project specifically. I looked at one recently merged PR that added support for a new model structure, but couldn't really understand what was going on. Do you have some documentation on how to add new model types? Would it be simple to add XGLM or does it have a nonstandard architecture?These are the layers in the XGLM-564M model:
This is the
config.json
file for the 564M model (taken from here):The text was updated successfully, but these errors were encountered: