Support for XGLM models #5097

Closed
Stypox opened this issue Jan 23, 2024 · 2 comments
Labels: enhancement (New feature or request), stale

Stypox commented Jan 23, 2024

Prerequisites

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed). (I searched for xglm but couldn't find any mention of it anywhere in the project.)
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Feature Description

Add support for Facebook's XGLM models (e.g. xglm-564M), including converting them to GGUF and running them with llama.cpp. The HuggingFace implementation docs are here, and the model specs are here, for reference.

Motivation

XGLM models perform well on specific tasks despite their size. For example, the 564M model is small enough to run on any device, and it can understand the context of a sentence quite well and extract specific parts of it (a useful capability for an assistant). I was able to run xglm-564M with HuggingFace's framework from within Termux; however, that framework doesn't support (efficient) quantization on CPU, so the model ends up using 3 GB of RAM and running quite slowly. I would also like to embed an XGLM model in an Android app, and doing so with llama.cpp would be much simpler (and more efficient) than packaging Python, transformers, and all the other dependencies of the HuggingFace implementation.

Possible Implementation

The convert-hf-to-gguf.py script would need to add support for XGLMForCausalLM, and something more would probably need to be implemented in llama.cpp itself if the architecture requires a new layer type. I would be open to helping with the implementation; however, I don't know much about LLM architectures in general, or about the llama.cpp project specifically. I looked at one recently merged PR that added support for a new model structure, but couldn't really understand what was going on. Do you have any documentation on how to add new model types? Would XGLM be simple to add, or does it have a nonstandard architecture?
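
A tensor listing like the one below can be generated with a few lines of standard transformers code (XGLMForCausalLM and the facebook/xglm-564M checkpoint are both public; the snippet itself is just an illustration):

# Prints the tensor names of the model, as listed below.
from transformers import XGLMForCausalLM

model = XGLMForCausalLM.from_pretrained("facebook/xglm-564M")
for name in model.state_dict():
    print(name)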

These are the tensors in the XGLM-564M model. All 24 decoder layers (model.layers.0 through model.layers.23) have the same structure, shown here once for layer 0:
model.embed_tokens.weight
model.layers.0.self_attn.k_proj.weight
model.layers.0.self_attn.k_proj.bias
model.layers.0.self_attn.v_proj.weight
model.layers.0.self_attn.v_proj.bias
model.layers.0.self_attn.q_proj.weight
model.layers.0.self_attn.q_proj.bias
model.layers.0.self_attn.out_proj.weight
model.layers.0.self_attn.out_proj.bias
model.layers.0.self_attn_layer_norm.weight
model.layers.0.self_attn_layer_norm.bias
model.layers.0.fc1.weight
model.layers.0.fc1.bias
model.layers.0.fc2.weight
model.layers.0.fc2.bias
model.layers.0.final_layer_norm.weight
model.layers.0.final_layer_norm.bias
... (model.layers.1 through model.layers.23 repeat the same 16 tensors)
model.layer_norm.weight
model.layer_norm.bias
lm_head.weight
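
If the architecture were added, these per-layer tensor names would also need entries in gguf-py's tensor-name mapping. A hypothetical sketch, following the "{bid}" template convention of the existing mappings in gguf-py's tensor_mapping.py (the MODEL_TENSOR constants are real; applying them to an XGLM architecture is an assumption):

# Hypothetical additions to gguf-py's tensor_mapping.py for XGLM.
MODEL_TENSOR.ATTN_Q:    ("model.layers.{bid}.self_attn.q_proj",),
MODEL_TENSOR.ATTN_K:    ("model.layers.{bid}.self_attn.k_proj",),
MODEL_TENSOR.ATTN_V:    ("model.layers.{bid}.self_attn.v_proj",),
MODEL_TENSOR.ATTN_OUT:  ("model.layers.{bid}.self_attn.out_proj",),
MODEL_TENSOR.ATTN_NORM: ("model.layers.{bid}.self_attn_layer_norm",),
MODEL_TENSOR.FFN_UP:    ("model.layers.{bid}.fc1",),
MODEL_TENSOR.FFN_DOWN:  ("model.layers.{bid}.fc2",),
MODEL_TENSOR.FFN_NORM:  ("model.layers.{bid}.final_layer_norm",),
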
This is the config.json file for the 564M model (taken from here):
{
  "activation_dropout": 0,
  "activation_function": "gelu",
  "architectures": [
    "XGLMForCausalLM"
  ],
  "attention_dropout": 0.1,
  "attention_heads": 16,
  "bos_token_id": 0,
  "d_model": 1024,
  "decoder_start_token_id": 2,
  "dropout": 0.1,
  "eos_token_id": 2,
  "ffn_dim": 4096,
  "init_std": 0.02,
  "layerdrop": 0.0,
  "max_position_embeddings": 2048,
  "model_type": "xglm",
  "num_layers": 24,
  "pad_token_id": 1,
  "scale_embedding": true,
  "transformers_version": "4.16.0.dev0",
  "use_cache": true,
  "vocab_size": 256008
}
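
Based on this config, a converter class for convert-hf-to-gguf.py might look roughly like the sketch below. It follows the pattern of the existing Model subclasses (the exact hooks vary by version), and MODEL_ARCH.XGLM is hypothetical, since gguf-py defines no such architecture yet:

# Hypothetical sketch only: XGLM is not a supported architecture, and
# MODEL_ARCH.XGLM would have to be added to gguf-py's constants first.
class XGLMModel(Model):
    model_arch = gguf.MODEL_ARCH.XGLM  # hypothetical new enum value

    def set_gguf_parameters(self):
        hp = self.hparams  # the parsed config.json shown above
        self.gguf_writer.add_context_length(hp["max_position_embeddings"])  # 2048
        self.gguf_writer.add_embedding_length(hp["d_model"])                # 1024
        self.gguf_writer.add_feed_forward_length(hp["ffn_dim"])             # 4096
        self.gguf_writer.add_block_count(hp["num_layers"])                  # 24
        self.gguf_writer.add_head_count(hp["attention_heads"])              # 16

The GGUFWriter calls above all exist in gguf-py; the harder part would likely be the compute graph in llama.cpp itself, since XGLM uses (scaled) sinusoidal position embeddings rather than RoPE, which may be the "nonstandard" part of the architecture.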
Stypox added the enhancement label on Jan 23, 2024
github-actions bot commented Mar 18, 2024

This issue is stale because it has been open for 30 days with no activity.

github-actions bot added the stale label on Mar 18, 2024

github-actions bot commented Apr 2, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed on Apr 2, 2024