
GGUF file format specification #302

Merged: 29 commits, Nov 1, 2023

Changes from 1 commit

Commits (29)
- d2fbcb2 docs: gguf spec first pass (philpax, Jun 25, 2023)
- 23eda2e docs(gguf): update with review comments (philpax, Jun 26, 2023)
- b303293 docs(gguf): update with review comments (philpax, Jun 27, 2023)
- 2bcd348 docs(gguf): quant version optional for unquant (philpax, Jun 28, 2023)
- 576e306 docs(gguf): normalize naming, add whisper (philpax, Jul 9, 2023)
- 44af6b8 Merge branch 'master' of https://github.com/ggerganov/ggml into gguf-… (philpax, Jul 9, 2023)
- 24260bf docs(gguf): more review updates (philpax, Jul 23, 2023)
- 0133f2e docs(gguf): add norm eps and added_tokens (philpax, Jul 25, 2023)
- e9988f7 docs(gguf): move padding (philpax, Jul 26, 2023)
- f4c4d6a docs(gguf): remove migration tool (philpax, Jul 27, 2023)
- 39da254 docs(gguf): make offset base explicit (philpax, Jul 27, 2023)
- a6d1cc1 docs(gguf): fix replace oops (philpax, Jul 27, 2023)
- 1d134ec docs(gguf): alignment metadata+tensor name len max (philpax, Aug 6, 2023)
- 2a90bbf docs(gguf): clarification, fixes, tensor names (philpax, Aug 14, 2023)
- 3d4507e docs(gguf): clarify license (philpax, Aug 15, 2023)
- 39d6377 docs(gguf): minor tweaks (philpax, Aug 15, 2023)
- e36b4ca docs(gguf): data layout, GQA eq, no ft, LE GGUF (philpax, Aug 20, 2023)
- d5cfb55 docs(gguf): fix magic order (philpax, Aug 20, 2023)
- aa8d0ba docs(gguf): match impl (philpax, Aug 20, 2023)
- f3e7632 docs(gguf): specify fallback alignment (philpax, Aug 20, 2023)
- 2fe03e5 docs(gguf): remove TensorInfo::n_elements (philpax, Aug 20, 2023)
- 2b65fba docs(gguf): filetype, rope base/linear scale (philpax, Aug 24, 2023)
- b021b25 docs(gguf): v2 - uint64 all the things (philpax, Aug 26, 2023)
- 2da80c1 docs(gguf): tweak extensibility wording (philpax, Aug 28, 2023)
- 574b408 docs(gguf): fix spec discrepancies (philpax, Sep 9, 2023)
- 4ea9317 Merge branch 'master' into gguf-spec (philpax, Oct 31, 2023)
- 78faa7b docs(gguf): v3 + other fixes (philpax, Oct 31, 2023)
- 0da010d fix(editorconfig): use 2-space tabs for markdown (philpax, Oct 31, 2023)
- ad95988 docs(gguf): clarify big-endian (philpax, Oct 31, 2023)
372 changes: 372 additions & 0 deletions docs/gguf.md
@@ -0,0 +1,372 @@
# GGUF

GGUF is a file format for storing models for inference with GGML and executors based on GGML. GGUF is a binary format that is designed for fast loading and saving of models, and for ease of reading. Models are traditionally developed using PyTorch or another framework, and then converted to GGUF for use in GGML.

It is a successor file format to GGML, GGMF and GGJT, and is designed to be unambiguous by containing all the information needed to load a model. It is also designed to be extensible, so that new features can be added to GGML without breaking compatibility with older models.

For more information about the motivation behind GGUF, see [Current State of Affairs](#current-state-of-affairs).

## Specification

GGUF is based on the existing GGJT format, but makes a few changes to make it more extensible and easier to use. The following features are desired:

- Single-file deployment: models can be easily distributed and loaded, and do not require any external files for additional information.
- Extensible: new features can be added to GGML without breaking compatibility with existing models.
- `mmap` compatibility: models can be loaded and saved using `mmap` for fast access.
- Easy to use: models can be loaded and saved with a small amount of code, with no need for external libraries, regardless of the language used.
- Full information: all information needed to load a model is contained in the model file, and no additional information needs to be provided by the user.

The key difference between GGJT and GGUF is the use of a key-value structure for the hyperparameters (now referred to as metadata), rather than a list of untyped values. This allows for new metadata to be added without breaking compatibility with existing models, and to annotate the model with additional information that may be useful for inference or for identifying the model.
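The difference can be sketched in Python (an illustration only; these names and values are hypothetical and not part of the format):

```python
# GGJT-style: a fixed, ordered list of untyped values. Readers must know the
# position and meaning of every entry, so inserting a new field breaks every
# existing reader.
ggjt_hparams = [4096, 32, 32, 10000]

# GGUF-style: typed key-value pairs. Unknown keys can be skipped, and new
# keys can be added without disturbing existing readers.
gguf_metadata = {
    "llama.context_length": ("u32", 4096),
    "llama.num_layers": ("u32", 32),
    "llama.attention.num_heads": ("u32", 32),
}

def read_u32(metadata, key, default=None):
    """Fetch a u32 metadata value, falling back to a default if the key is absent."""
    _type, value = metadata.get(key, ("u32", default))
    return value
```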

### File Structure

GGUF files are structured as follows. They assume the use of a global `ALIGNMENT` constant, which is the alignment of the model data. This is currently 64 bytes, but may change in the future. [^1] To achieve this, where relevant, the file is padded with `0x00` bytes to the next multiple of `ALIGNMENT`.

[^1]: This may be moved to a per-model key-value pair in the future.
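The padding rule amounts to rounding offsets up to the next multiple of `ALIGNMENT`. A minimal sketch (illustrative, not normative):

```python
ALIGNMENT = 64  # current global alignment, in bytes

def align_offset(offset: int) -> int:
    """Round `offset` up to the next multiple of ALIGNMENT."""
    return offset + (ALIGNMENT - (offset % ALIGNMENT)) % ALIGNMENT
```

A writer would emit `align_offset(pos) - pos` bytes of `0x00` before any aligned section.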

```c
enum ggml_type {
    GGML_TYPE_F32 = 0,
    GGML_TYPE_F16 = 1,
```

> **Review (Contributor):** What about BF16?
>
> **klosax (Contributor, Aug 21, 2023):** Currently if you want the highest quality you have to double the tensor sizes by using F32. My guess is that BF16 is not natively supported by many platform architectures yet. I would also like to see support for BF16 in ggml. I wonder if BF16 emulation really is slower than F32, since it is in fact a truncated version of F32. @ggerganov ?
>
> **ggerganov (Owner):** No plans for adding BF16 support - it would be too big a change for what I think is too small a benefit.

```c

    GGML_TYPE_Q4_0 = 2,
    GGML_TYPE_Q4_1 = 3,
    // GGML_TYPE_Q4_2 = 4, support has been removed
    // GGML_TYPE_Q4_3 (5) support has been removed
    GGML_TYPE_Q5_0 = 6,
    GGML_TYPE_Q5_1 = 7,
    GGML_TYPE_Q8_0 = 8,
    GGML_TYPE_Q8_1 = 9,
    // k-quantizations
    GGML_TYPE_Q2_K = 10,
    GGML_TYPE_Q3_K = 11,
    GGML_TYPE_Q4_K = 12,
    GGML_TYPE_Q5_K = 13,
    GGML_TYPE_Q6_K = 14,
    GGML_TYPE_Q8_K = 15,
    GGML_TYPE_I8,
    GGML_TYPE_I16,
    GGML_TYPE_I32,
    GGML_TYPE_COUNT,
};

enum gguf_metadata_value_type: uint32_t {
    /// The value is an 8-bit unsigned integer.
    GGUF_METADATA_VALUE_TYPE_UINT8 = 0,
    /// The value is an 8-bit signed integer.
    GGUF_METADATA_VALUE_TYPE_INT8 = 1,
    /// The value is a 16-bit unsigned little-endian integer.
    GGUF_METADATA_VALUE_TYPE_UINT16 = 2,
    /// The value is a 16-bit signed little-endian integer.
    GGUF_METADATA_VALUE_TYPE_INT16 = 3,
    /// The value is a 32-bit unsigned little-endian integer.
    GGUF_METADATA_VALUE_TYPE_UINT32 = 4,
    /// The value is a 32-bit signed little-endian integer.
    GGUF_METADATA_VALUE_TYPE_INT32 = 5,
    /// The value is a 32-bit IEEE754 floating point number.
    GGUF_METADATA_VALUE_TYPE_FLOAT32 = 6,
    /// The value is a boolean.
    /// 1-byte value where 0 is false and 1 is true.
    /// Anything else is invalid, and should be treated as either the model being invalid or the reader being buggy.
    GGUF_METADATA_VALUE_TYPE_BOOL = 7,
    /// The value is a UTF-8 non-null-terminated string, with length prepended.
    GGUF_METADATA_VALUE_TYPE_STRING = 8,
    /// The value is an array of other values, with the length and type prepended.
    GGUF_METADATA_VALUE_TYPE_ARRAY = 9,
};

/// A string in GGUF.
struct gguf_string_t {
    /// The length of the string, in bytes.
    uint32_t len;
    /// The string as a UTF-8 non-null-terminated string.
    char string[len];
};

union gguf_metadata_value_t {
    uint8_t uint8;
    int8_t int8;
    uint16_t uint16;
    int16_t int16;
    uint32_t uint32;
    int32_t int32;
    float float32;
    bool bool_;
    gguf_string_t string;
    struct {
        uint32_t len;
        gguf_metadata_value_type type;
        gguf_metadata_value_t array[len];
    } array;
};

struct gguf_metadata_kv_t {
    /// A standard GGUF string, with the following caveats:
    /// - It must be a valid ASCII string.
    /// - It must be a hierarchical key, where each segment is `lower_snake_case` and separated by a `.`.
    /// - It must be at most 2^16-1 bytes long.
    /// Any keys that do not follow these rules are invalid.
    gguf_string_t key;

    /// The length of the value, in bytes.
    uint32_t value_len;
    /// The type of the value.
    /// Must be one of the `gguf_metadata_value_type` values.
    gguf_metadata_value_type value_type;
    /// The value.
    gguf_metadata_value_t value;
};

struct gguf_header_t {
    // Magic number to announce that this is a GGUF file.
    // Must be `'GGUF'`/`0x47475546`.
```

> **klosax (Contributor, Aug 17, 2023):** This should be reversed so it is written as GGUF in the model file. 0x46554747
>
> **philpax (Author):** It's written as 'GGUF' if you look at it bytewise (0x47 0x47 0x55 0x46 is G G U F), but not if you look at it as a 32-bit little-endian integer. I think it's better to keep it this way to ensure that little/big-endian isn't a concern?
>
> **klosax (Contributor, Aug 20, 2023):** In ggerganov/llama.cpp#2398 we already changed this. But it could possibly be reversed. @ggerganov
>
> **philpax (Author):** ...too late it seems, llama.cpp's implementation seems to do little-endian. OK.
>
> **philpax (Author):** I've updated this, but I think it makes more sense the way it was specified before (i.e. as a 4-byte string, not as a little-endian 4-byte integer).
>
> **philpax (Author):** ...okay, it turns out it actually is GGUF in the resulting file right now; it's just that the little-endian cancels out:
>
> ```
> # hexdump -C models/llama2/llama-2-7b-f16.gguf | head -n 1
> 00000000  47 47 55 46 01 00 00 00  23 01 00 00 0c 00 00 00  |GGUF....#.......|
> ```
>
> This is because the header is written as a little-endian integer:
>
> ```python
> self.fout.write(struct.pack("<I", GGUF_MAGIC))
> ```
>
> but the constant is already little-endian:
>
> ```python
> GGUF_MAGIC = 0x46554747
> ```
>
> so in the write it gets reversed and becomes 'GGUF'. I'm going to mention that's what it is in the spec, but both the Python and C/C++ should probably be updated to not treat them as integers O_o
>
> **klosax (Contributor, Aug 20, 2023):** Yes, it would probably be less confusing to write the magic as 4 bytes instead.

```c
    uint32_t magic;
    // The version of the format implemented.
    // Must be `1` for the version described in this spec.
    //
    // This version should only be increased for structural changes to the format.
    // Changes that do not affect the structure of the file should instead update the metadata
    // to signify the change.
    uint32_t version;
    // The number of tensors in the file.
    // This is explicit, instead of being included in the metadata, to ensure it is always present
    // for loading the tensors.
    uint32_t tensor_count;
    // The number of metadata key-value pairs.
    uint32_t metadata_kv_count;
    // The metadata key-value pairs.
    gguf_metadata_kv_t metadata_kv[metadata_kv_count];
};

struct gguf_tensor_info_t {
    /// The name of the tensor.
    gguf_string_t name;
    /// The number of dimensions in the tensor.
    /// Currently at most two, but this may change in the future.
    uint32_t n_dimensions;
    /// The dimensions of the tensor.
    uint32_t dimensions[n_dimensions];
    /// The number of elements in the tensor.
    uint32_t n_elements;
    /// The type of the tensor.
    ggml_type type;
    /// The offset of the tensor's data in this file in bytes.
    /// Must be a multiple of `ALIGNMENT`.
    uint64_t offset;
};

struct gguf_file_t {
    // The header of the file.
    gguf_header_t header;

    // Padding to the nearest multiple of `ALIGNMENT`.
    uint8_t _padding[ALIGNMENT - (sizeof(header) % ALIGNMENT)];

    // Tensor infos, which can be used to locate the tensor data.
    gguf_tensor_info_t tensor_infos[header.tensor_count];

    // Tensor data.
    //
    // This is arbitrary binary data corresponding to the weights of the model. This data should be close
    // or identical to the data in the original model file, but may be different due to quantization or
    // other optimizations for inference. Any such deviations should be recorded in the metadata or as
    // part of the architecture definition.
    //
    // Each tensor's data must be stored within this array, and located through its `tensor_infos` entry.
    // The offset of each tensor's data must be a multiple of `ALIGNMENT`, and the space between tensors
    // should be padded to `ALIGNMENT` bytes.
    uint8_t tensor_data[];
};
```
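As a sketch of how the fixed-size part of `gguf_header_t` maps to bytes (illustrative Python, not a normative implementation; the v1 fields are little-endian `uint32`s):

```python
import struct

GGUF_MAGIC = b"GGUF"  # bytes 0x47 0x47 0x55 0x46 on disk

def read_header(data: bytes):
    """Parse the fixed-size fields of gguf_header_t (v1: all u32, little-endian)."""
    magic = data[:4]
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file: {magic!r}")
    version, tensor_count, metadata_kv_count = struct.unpack_from("<III", data, 4)
    return version, tensor_count, metadata_kv_count

# Writing the magic as the little-endian u32 0x46554747 produces the same
# bytes as the string 'GGUF', per the review discussion above:
assert struct.pack("<I", 0x46554747) == GGUF_MAGIC
```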

## Standardized key-value pairs

The following key-value pairs are standardized. This list may grow in the future as more use cases are discovered. Where possible, names are shared with the original model definitions to make it easier to map between the two.

Not all of these are required, but they are all recommended. Keys that are required are bolded. For omitted pairs, the reader should assume that the value is unknown and either default or error as appropriate.

### General

- **`general.architecture: string`**: describes what architecture this model implements. All lowercase ASCII, with only `[a-z0-9]+` characters allowed. Known values include:
- `llama`
- `mpt`
- `gptneox`
- `gptj`
- `gpt2`
- `bloom`
- `falcon`
- `rwkv`
- **`general.quantization_version: u32`**: version of quantization scheme
- `general.file_type: string`: type of the majority of the tensors in the file. This shouldn't have any semantic meaning and should be purely informational, hence the use of `string`.
- `general.license: string`: SPDX license of the model
> **Review (Contributor):** I think this should be uint32 and an enum value instead of string. Executors may choose to generate human-readable descriptions based on that value. I can see that custom values in this can easily lead to confusion or leave this metadata as redundant.
>
> **philpax (Author):** My reasoning for this is that it's what Rust/Cargo does, and it seems to work quite well and leave the door open for future expansion / non-standard licenses (as is relatively common in ML). I could be convinced otherwise, but I don't see a strong reason to install that restriction.
>
> **Review (Contributor):** Maybe clarify whether the string should contain the license name / identifier, a link to the document, or even the whole license document?
>
> **philpax (Author):** Good point - I intended to refer to SPDX license expressions, but I didn't use that verbiage. Fixed.

- `general.description: string`: information about the model, including provenance
- `general.url.original_source: string`: path to the original model that this GGML file was created from

### LLM

In the following, `[llm]` stands in for the name of a specific LLM architecture. These keys are used in each architecture's section.

- `[llm].context_length: u32`: size of the maximum supported context
- `[llm].hidden_size: u32`: embedding layer size
- `[llm].num_layers: u32`: number of layers
- `[llm].num_rotary: u32`: `int(hparams["rotary_pct"]*(hparams["hidden_size"]//hparams["num_attention_heads"]))`
- `[llm].use_parallel_residual: bool`: whether or not the parallel residual logic should be used
- `[llm].max_seq_len: u32`: maximum sequence length
- `[llm].attention.num_heads: u32`: number of attention heads
- `[llm].attention.alibi_bias_max: f32`: the maximum bias to use for ALiBi
- `[llm].attention.clip_kqv: f32`: **TODO**: what is this?
- `[llm].num_mult: u32`: **TODO**: what is this?
- `[llm].rot: u32`: **TODO**: what is this?
- `[llm].num_rot: u32`: **TODO**: what is this?

#### Models

The following sections describe the metadata for each model architecture. Each key specified _must_ be present.

##### LLaMA

- `llama.context_length`
- `llama.hidden_size`
- `llama.num_layers`
- `llama.num_mult`
- `llama.rot`
- `llama.attention.num_heads`

##### MPT

- `mpt.max_seq_len`
- `mpt.hidden_size`
- `mpt.num_layers`
- `mpt.attention.num_heads`
- `mpt.attention.alibi_bias_max`
- `mpt.attention.clip_kqv`

##### GPT-NeoX

- `gptneox.context_length`
- `gptneox.hidden_size`
- `gptneox.num_layers`
- `gptneox.num_rot`
- `gptneox.use_parallel_residual`
- `gptneox.attention.num_heads`

##### GPT-J

- `gptj.context_length`
- `gptj.hidden_size`
- `gptj.num_layers`
- `gptj.num_rot`
- `gptj.attention.num_heads`

##### GPT-2

- `gpt2.context_length`
- `gpt2.hidden_size`
- `gpt2.num_layers`
- `gpt2.attention.num_heads`

##### BLOOM

- `bloom.context_length`
- `bloom.hidden_size`
- `bloom.num_layers`
- `bloom.num_mult`
- `bloom.attention.num_heads`

##### Falcon

**TODO**.

##### RWKV

**TODO**.

#### Prompting

**TODO**: Include prompt format, and/or metadata about how it should be used (instruction, conversation, autocomplete, etc).

### Tokenizer

The following keys are used to describe the tokenizer of the model. It is recommended that model authors support as many of these as possible, as it will allow for better tokenization quality with supported executors.

#### Embedded

GGML supports an embedded vocabulary that may be lossily compressed from a more complete tokenizer. This should enable inference with the model, but it may not fully capture the nuances of tokenization. When a more accurate tokenizer is available and supported, it should be used instead.

**TODO**: Add more details about how this works, and what kind of tokenizer it's expecting. Should this be called something more specific instead?

- `tokenizer.embedded.tokens: array[string]`: A list of tokens.
- `tokenizer.embedded.scores: array[f32]`: If present, the score/probability of each token. If not present, all tokens are assumed to have equal probability. Must be the same length as `tokens`.

#### Hugging Face

Hugging Face maintains their own `tokenizers` library that supports a wide variety of tokenizers. If your executor uses this library, it may be able to use the model's tokenizer directly.

- `tokenizer.huggingface.json: string`: the entirety of the HF `tokenizer.json` for a given model (e.g. <https://huggingface.co/mosaicml/mpt-7b-instruct/blob/main/tokenizer.json>). Included for compatibility with executors that support HF tokenizers directly.
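Since the value is the verbatim contents of `tokenizer.json`, any JSON parser can inspect it before handing it to a tokenizer implementation. A minimal sketch; the `sample` string below is hypothetical and heavily abridged, not taken from a real model:

```python
import json

def tokenizer_model_type(tokenizer_json: str) -> str:
    """Return the tokenizer model type (e.g. 'BPE') from an HF tokenizer.json string."""
    return json.loads(tokenizer_json)["model"]["type"]

# Hypothetical, heavily abridged example of what the metadata string might contain:
sample = '{"version": "1.0", "model": {"type": "BPE", "vocab": {"a": 0}, "merges": []}}'
```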

#### Other

Other tokenizers may be used, but are not necessarily standardized. They may be executor-specific. They will be documented here as they are discovered/further developed.

- `tokenizer.rwkv.world: string`: a RWKV World tokenizer, like [this](https://github.com/BlinkDL/ChatRWKV/blob/main/tokenizer/rwkv_vocab_v20230424.txt). This text file should be included verbatim.

### Computation graph

This is a future extension and still needs to be discussed, and may necessitate a new GGUF version. At the time of writing, the primary blocker is the stabilization of the computation graph format.

A sample computation graph of GGML nodes could be included in the model itself, allowing an executor to run the model without providing its own implementation of the architecture. This would allow for a more consistent experience across executors, and would allow for more complex architectures to be supported without requiring the executor to implement them.

## Migration

All existing Python conversion scripts will be consolidated to use one `gguf` library. They will take models from Hugging Face or elsewhere and produce compliant GGUF files with all of the recommended metadata.

Existing models do not have enough information to be directly converted to GGUF. Instead, a migration tool may be built that takes an existing GGML/GGMF/GGJT file and prompts the user for the missing information. This tool will be executor-agnostic, and will be able to produce a GGUF file that can be used by any executor. This tool may hardcode settings for models with known hashes to ease the migration process, such that a user can run `./migrate nous-hermes-13b.ggmlv3.q5_1.bin` and obtain a `nous-hermes-13b.ggmlv3.q5_1.gguf` file that is ready to use and consistent with uploaded models.
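The known-hash lookup such a tool might use can be sketched as follows (entirely hypothetical; no hash table or lookup function is defined by this spec):

```python
import hashlib

# Hypothetical table mapping file hashes to the metadata a migration tool
# would otherwise have to prompt the user for.
KNOWN_MODELS = {
    # "<sha256 hex digest>": {"general.architecture": "llama", ...}
}

def lookup_known_metadata(file_bytes):
    """Return pre-filled metadata for a known model file, or None if its hash is unknown."""
    digest = hashlib.sha256(file_bytes).hexdigest()
    return KNOWN_MODELS.get(digest)
```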

---

## Current State of Affairs

The following information is provided for context, but is not necessary to understand the rest of this document.

### Overview

At present, there are three GGML file formats floating around for LLMs:

- **GGML** (unversioned): baseline format, with no versioning or alignment.
- **GGMF** (versioned): the same as GGML, but with versioning. Only one version exists.
- **GGJT**: aligns the tensors to allow for use with `mmap`, which requires alignment. v1, v2 and v3 are structurally identical, but the latter versions use a quantization scheme that is incompatible with previous versions.

GGML is primarily used by the examples in `ggml`, while GGJT is used by `llama.cpp` models. Other executors may use any of the three formats, but this is not 'officially' supported.

These formats share the same fundamental structure:

- a magic number with an optional version number
- model-specific hyperparameters, including
- metadata about the model, such as the number of layers, the number of heads, etc.
- a `ftype` that describes the type of the majority of the tensors,
- for GGML files, the quantization version is encoded in the `ftype`, as `ftype` divided by 1000 (the remainder being the actual file type)
- an embedded vocabulary, which is a list of strings with length prepended. The GGMF/GGJT formats embed a f32 score next to the strings.
- finally, a list of tensors with their length-prepended name, type, and (aligned, in the case of GGJT) tensor data
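The legacy `ftype` packing described above can be sketched as follows (illustrative; `1000` is the factor mentioned in the bullet on GGML files):

```python
GGML_QNT_VERSION_FACTOR = 1000

def split_ftype(packed_ftype: int):
    """Split a legacy GGML ftype into (base ftype, quantization version)."""
    return (packed_ftype % GGML_QNT_VERSION_FACTOR,
            packed_ftype // GGML_QNT_VERSION_FACTOR)
```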

Notably, this structure does not identify what model architecture the model belongs to, nor does it offer any flexibility for changing the structure of the hyperparameters. This means that the only way to add new hyperparameters is to add them to the end of the list, which is a breaking change for existing models.

### Drawbacks

Unfortunately, over the last few months, a few issues have become apparent with the existing models:

- There's no way to identify which model architecture a given model is for, because that information isn't present
- Similarly, existing programs cannot intelligently fail upon encountering new architectures
- Adding or removing any new hyperparameters is a breaking change, which is impossible for a reader to detect without using heuristics
- Each model architecture requires its own conversion script to their architecture's variant of GGML
- Maintaining backwards compatibility without breaking the structure of the format requires clever tricks, like packing the quantization version into the ftype, which are not guaranteed to be picked up by readers/writers, and are not consistent between the two formats

### Why not other formats?

There are a few other formats that could be used, but issues include:

- requiring additional dependencies to load or save the model, which is complicated in a C environment
- limited or no support for 4-bit quantization
- existing cultural expectations (e.g. whether or not the model is a directory or a file)
- lack of support for embedded vocabularies
- lack of control over direction of future development

Ultimately, it is likely that GGUF will remain necessary for the foreseeable future, and it is better to have a single format that is well-documented and supported by all executors than to contort an existing format to fit the needs of GGML.