Reading GGUF metadata with gguf-dump.py does not work for i-quants #5809

@countzero

Description

The gguf-dump.py script in llama.cpp release b2297 does not support i-quants: it fails on GGUF files that use an IQ* quantization type.

Steps to reproduce

  1. Create or download a GGUF file in any IQ* format (e.g., miqu-1-70b-Requant-b2131-iMat-c32_ch400-IQ1_S_v3.gguf)
  2. Copy the file to .\models\miqu-1-70b-sf.IQ1_S.gguf
  3. Execute the following command:
python .\gguf-py\scripts\gguf-dump.py --no-tensors .\models\miqu-1-70b-sf.IQ1_S.gguf
  4. See the error:
ValueError: 19 is not a valid GGMLQuantizationType
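
The message has the shape Python produces when an integer is converted to an Enum class that has no member for that value. The following is a minimal sketch of that mechanism and of a tolerant fallback; `KnownQuantType` and `describe_type` are hypothetical names with a deliberately incomplete member list, not the actual gguf-py definitions.

```python
from enum import IntEnum


class KnownQuantType(IntEnum):
    """Hypothetical, deliberately incomplete stand-in for a quantization enum."""
    F32 = 0
    F16 = 1
    Q4_0 = 2


def describe_type(raw: int) -> str:
    """Return the member name, or a placeholder when the id is not in the enum."""
    try:
        return KnownQuantType(raw).name
    except ValueError:
        # KnownQuantType(19) raises "ValueError: 19 is not a valid KnownQuantType"
        return f"UNKNOWN({raw})"


print(describe_type(1))
print(describe_type(19))
```

A reader that only needs key/value metadata could fall back like this instead of aborting, at the cost of not knowing the block layout of the unrecognised tensor type.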

Expected behaviour

I expect the Python gguf-py library to support all possible GGUF formats.

Working example for k-quants:

python .\gguf-py\scripts\gguf-dump.py --no-tensors .\models\miqu-1-70b-sf.Q5_K_M.gguf
* Loading: .\models\miqu-1-70b-sf.Q5_K_M.gguf
* File is LITTLE endian, script is running on a LITTLE endian host.

* Dumping 26 key/value pair(s)
      1: UINT32     |        1 | GGUF.version = 3
      2: UINT64     |        1 | GGUF.tensor_count = 723
      3: UINT64     |        1 | GGUF.kv_count = 23
      4: STRING     |        1 | general.architecture = 'llama'
      5: STRING     |        1 | general.name = 'R:\\AI\\LLM\\source'
      6: UINT32     |        1 | llama.context_length = 32764
      7: UINT32     |        1 | llama.embedding_length = 8192
      8: UINT32     |        1 | llama.block_count = 80
      9: UINT32     |        1 | llama.feed_forward_length = 28672
     10: UINT32     |        1 | llama.rope.dimension_count = 128
     11: UINT32     |        1 | llama.attention.head_count = 64
     12: UINT32     |        1 | llama.attention.head_count_kv = 8
     13: FLOAT32    |        1 | llama.attention.layer_norm_rms_epsilon = 9.999999747378752e-06
     14: FLOAT32    |        1 | llama.rope.freq_base = 1000000.0
     15: UINT32     |        1 | general.file_type = 17
     16: STRING     |        1 | tokenizer.ggml.model = 'llama'
     17: [STRING]   |    32000 | tokenizer.ggml.tokens
     18: [FLOAT32]  |    32000 | tokenizer.ggml.scores
     19: [INT32]    |    32000 | tokenizer.ggml.token_type
     20: UINT32     |        1 | tokenizer.ggml.bos_token_id = 1
     21: UINT32     |        1 | tokenizer.ggml.eos_token_id = 2
     22: UINT32     |        1 | tokenizer.ggml.padding_token_id = 0
     23: BOOL       |        1 | tokenizer.ggml.add_bos_token = True
     24: BOOL       |        1 | tokenizer.ggml.add_eos_token = False
     25: STRING     |        1 | tokenizer.chat_template = "{{ bos_token }}{% for message in messages %}{% if (message['"
     26: UINT32     |        1 | general.quantization_version = 2

Use-Case

I am extracting the metadata from any given GGUF model to automatically calculate the optimal runtime arguments for the server in the following PowerShell script: https://github.com/countzero/windows_llama.cpp/blob/v1.12.0/examples/server.ps1#L104
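
For illustration, the dump lines shown above are regular enough to be turned into a lookup table. This is a rough sketch that assumes the `index: TYPE | count | key = value` layout from the k-quants example; `parse_dump_lines` is a hypothetical helper, and values are kept as raw strings exactly as printed.

```python
import re
from typing import Dict, Iterable, Optional

# Assumed layout, taken from the example output:
#   "      6: UINT32     |        1 | llama.context_length = 32764"
_LINE = re.compile(
    r"^\s*(\d+):\s+(\[?\w+\]?)\s+\|\s+(\d+)\s+\|\s+(\S+)(?:\s+=\s+(.*))?$"
)


def parse_dump_lines(lines: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each metadata key to its printed value (None for arrays without one)."""
    result: Dict[str, Optional[str]] = {}
    for line in lines:
        match = _LINE.match(line.rstrip())
        if match:  # header lines such as "* Loading: ..." are skipped
            result[match.group(4)] = match.group(5)
    return result


sample = [
    "* Dumping 26 key/value pair(s)",
    "      6: UINT32     |        1 | llama.context_length = 32764",
    "     17: [STRING]   |    32000 | tokenizer.ggml.tokens",
]
print(parse_dump_lines(sample))
```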

Question

@ggerganov Is there another way to dump only the metadata of a given GGUF model? Perhaps this could be an --inspect option of the gguf binary?
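
As a possible workaround, the key/value section can be read directly, because it precedes the tensor descriptions in the file and therefore involves no quantization type lookup. Below is a minimal sketch under stated assumptions: a little-endian GGUF file of version 2 or 3, with value type ids as given in the GGUF specification. `read_gguf_metadata` and its helpers are hypothetical names, not part of gguf-py.

```python
import struct
from typing import Any, BinaryIO, Dict

# Scalar value types from the GGUF specification: type id -> struct format.
_SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
            6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
_STRING, _ARRAY = 8, 9


def _unpack(f: BinaryIO, fmt: str) -> Any:
    size = struct.calcsize(fmt)
    data = f.read(size)
    if len(data) != size:
        raise EOFError("unexpected end of file")
    return struct.unpack(fmt, data)[0]


def _read_string(f: BinaryIO) -> str:
    length = _unpack(f, "<Q")  # strings are length-prefixed with a uint64
    return f.read(length).decode("utf-8", errors="replace")


def _read_value(f: BinaryIO, vtype: int) -> Any:
    if vtype in _SCALARS:
        return _unpack(f, _SCALARS[vtype])
    if vtype == _STRING:
        return _read_string(f)
    if vtype == _ARRAY:
        elem_type = _unpack(f, "<I")
        count = _unpack(f, "<Q")
        return [_read_value(f, elem_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {vtype}")


def read_gguf_metadata(f: BinaryIO) -> Dict[str, Any]:
    """Read the header and key/value pairs; stop before the tensor descriptions."""
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    version = _unpack(f, "<I")
    if version < 2:
        raise ValueError("version 1 uses 32-bit counts; not handled in this sketch")
    tensor_count = _unpack(f, "<Q")
    kv_count = _unpack(f, "<Q")
    meta: Dict[str, Any] = {"GGUF.version": version,
                            "GGUF.tensor_count": tensor_count,
                            "GGUF.kv_count": kv_count}
    for _ in range(kv_count):
        key = _read_string(f)
        vtype = _unpack(f, "<I")
        meta[key] = _read_value(f, vtype)
    return meta
```

It could be called as `read_gguf_metadata(open(path, "rb"))`. Token arrays are read fully into Python lists, which is acceptable for vocabularies of the size shown above but is not optimised.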
