Bug: JSON Schema - enum behind a $ref generates an object with unrestricted properties #8073

Closed
@cikkle

Description

What happened?

I'm using the json_schema feature in llama-server. With a simple prompt like "Write a dialog between Alice and Biff", if I send a schema like:

{
    "type": "array",
    "minItems": 15,
    "maxItems": 15,
    "items": { "$ref": "#/$defs/TALK" },

    "$defs": {
        "TALK": {
            "type": "object",
            "required": [ "character", "emote", "dialog" ],
            "properties": {
                "character": { "enum": [ "Alice", "Biff"] },
                "emote": { "enum": ["EXCLAMATION", "CONFUSION", "CHEERFUL", "LOVE", "ANGRY", "NERVOUS", "ANNOYED", "SILENCE", "INSPIRED", "SLEEPING"] },
                "dialog": {
                    "type": "string",
                    "minLength": 1,
                    "maxLength": 200
                }
            }
        }
    }
}

I get back an array of responses in the format I'd expect, like:

{ "character": "Alice", "emote": "SILENCE", "dialog": "I'm just saying, it's not like you to be so... quiet. Is everything alright?" }
{"character": "Biff", "emote": "NERVOUS", "dialog": "Yeah, everything's fine. Just... busy. You know how it is." }
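For anyone trying to reproduce this, here is a minimal sketch of how such a request body could be assembled. The report does not say which endpoint was used. This sketch assumes the llama-server `/completion` endpoint with its `json_schema` field, and the port 5000 shown in the log output. The helper name `build_completion_request` is made up for illustration.

```python
import json

# The first schema from this report, with the enums written inline.
TALK_SCHEMA = {
    "type": "array",
    "minItems": 15,
    "maxItems": 15,
    "items": {"$ref": "#/$defs/TALK"},
    "$defs": {
        "TALK": {
            "type": "object",
            "required": ["character", "emote", "dialog"],
            "properties": {
                "character": {"enum": ["Alice", "Biff"]},
                "emote": {"enum": ["EXCLAMATION", "CONFUSION", "CHEERFUL", "LOVE", "ANGRY",
                                   "NERVOUS", "ANNOYED", "SILENCE", "INSPIRED", "SLEEPING"]},
                "dialog": {"type": "string", "minLength": 1, "maxLength": 200},
            },
        }
    },
}


def build_completion_request(prompt, schema):
    """Hypothetical helper: build the JSON body for a schema-constrained completion."""
    return {"prompt": prompt, "json_schema": schema}


body = build_completion_request("Write a dialog between Alice and Biff", TALK_SCHEMA)
payload = json.dumps(body).encode("utf-8")

# Sending is left commented out because it needs a running server
# (URL and port are assumptions taken from the log output):
# import urllib.request
# req = urllib.request.Request("http://localhost:5000/completion", data=payload,
#                              headers={"Content-Type": "application/json"})
# print(urllib.request.urlopen(req).read().decode("utf-8"))
```

Swapping `TALK_SCHEMA` for the second schema in this report should be enough to compare the two behaviours with the same prompt.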

Things stop working right if I try to put the enums in separate definitions. The following schema:

{
    "type": "array",
    "minItems": 15,
    "maxItems": 15,
    "items": { "$ref": "#/$defs/TALK" },

    "$defs": {
        "characters": { "enum": ["Biff", "Alice"] },
        "emotes": { "enum": ["EXCLAMATION", "CONFUSION", "CHEERFUL", "LOVE", "ANGRY"] },

        "TALK": {
            "type": "object",
            "required": [ "character", "emote", "dialog" ],
            "properties": {
                "character": { "$ref": "#/$defs/characters" },
                "emote": { "$ref": "#/$defs/emotes" },
                "dialog": {
                    "type": "string",
                    "minLength": 1,
                    "maxLength": 200
                }
            }
        }
    }
}

...gives me arbitrary things like:

{ "character": {"name": "Alice","description": "Alice, a young woman, has a bright and curious expression on her face."},
{"emotion": "curious"}
 { "character": {"name": "Biff","description": "Biff, a friendly-looking man, has a warm smile and a hint of mischief in his eyes."},
{"emotion": "amused"}

Both schemas should produce output in the same format, but with the second one I get an object with arbitrary properties in place of each enum value, and possibly more arbitrary content afterward (in this run it was an extra object tagging along, but it varies).
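To make "the format I'd expect" concrete, here is a small hand-written check for a single TALK item. It is only a sketch and not a full JSON Schema validator: it hard-codes the enum values from the first schema, and the function name `talk_problems` is invented for this example.

```python
CHARACTERS = ("Alice", "Biff")
EMOTES = ("EXCLAMATION", "CONFUSION", "CHEERFUL", "LOVE", "ANGRY",
          "NERVOUS", "ANNOYED", "SILENCE", "INSPIRED", "SLEEPING")


def talk_problems(item):
    """Return a list of reasons why `item` does not look like a TALK object."""
    if not isinstance(item, dict):
        return ["not an object"]
    # Every property listed under "required" must be present.
    problems = [f"missing required property: {key}"
                for key in ("character", "emote", "dialog") if key not in item]
    # Tuples are used for the enums so that unhashable values (dicts) can be compared.
    if "character" in item and item["character"] not in CHARACTERS:
        problems.append("character is not one of the enum values")
    if "emote" in item and item["emote"] not in EMOTES:
        problems.append("emote is not one of the enum values")
    if "dialog" in item:
        dialog = item["dialog"]
        if not isinstance(dialog, str) or not 1 <= len(dialog) <= 200:
            problems.append("dialog is not a string of 1 to 200 characters")
    return problems


expected = {"character": "Biff", "emote": "NERVOUS",
            "dialog": "Yeah, everything's fine. Just... busy. You know how it is."}
unexpected = {"character": {"name": "Alice", "description": "..."}}

print(talk_problems(expected))
print(talk_problems(unexpected))
```

An item shaped like the first schema's output yields no problems, while an item whose "character" is an object (as in the output above) is flagged for both the enum mismatch and the missing properties.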

Notably, if I reorder the properties to put "dialog" before "character", I do get the dialog property and string I asked for, so things only seem to go off the rails once generation reaches one of the referenced enums.
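Since the inline form of the enums works, one possible client-side workaround is to inline local `$ref`s before sending the schema. The following is a minimal sketch of that idea and has not been verified against llama-server. The function names are invented for illustration, it handles only local `#/...` pointers, and it treats keywords next to a `$ref` as simple overrides, which is a simplification of JSON Schema semantics.

```python
def resolve_pointer(root, ref):
    """Look up a local JSON Pointer reference such as '#/$defs/emotes'."""
    if not ref.startswith("#/"):
        raise ValueError(f"only local references are handled: {ref}")
    node = root
    for part in ref[2:].split("/"):
        # Undo JSON Pointer escaping (~1 is '/', ~0 is '~').
        part = part.replace("~1", "/").replace("~0", "~")
        node = node[part]
    return node


def inline_refs(node, root, seen=()):
    """Return a copy of `node` with every local $ref replaced by its target."""
    if isinstance(node, dict):
        if "$ref" in node:
            ref = node["$ref"]
            if ref in seen:
                raise ValueError(f"recursive reference cannot be inlined: {ref}")
            target = inline_refs(resolve_pointer(root, ref), root, seen + (ref,))
            siblings = {k: inline_refs(v, root, seen)
                        for k, v in node.items() if k != "$ref"}
            return {**target, **siblings}
        return {k: inline_refs(v, root, seen) for k, v in node.items()}
    if isinstance(node, list):
        return [inline_refs(v, root, seen) for v in node]
    return node


def inline_schema(schema):
    """Inline all local references and drop the now-unused top-level $defs."""
    body = {k: v for k, v in schema.items() if k != "$defs"}
    return inline_refs(body, schema)


schema = {
    "type": "array",
    "minItems": 15,
    "maxItems": 15,
    "items": {"$ref": "#/$defs/TALK"},
    "$defs": {
        "characters": {"enum": ["Biff", "Alice"]},
        "emotes": {"enum": ["EXCLAMATION", "CONFUSION", "CHEERFUL", "LOVE", "ANGRY"]},
        "TALK": {
            "type": "object",
            "required": ["character", "emote", "dialog"],
            "properties": {
                "character": {"$ref": "#/$defs/characters"},
                "emote": {"$ref": "#/$defs/emotes"},
                "dialog": {"type": "string", "minLength": 1, "maxLength": 200},
            },
        },
    },
}

flat = inline_schema(schema)
```

Here `flat` no longer contains `$ref` or `$defs`, and its "character" and "emote" properties carry the enum lists directly, as in the schema that produces the expected output.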

I'm aware json_schema currently has some known bugs and features yet to be implemented, but I didn't see anything in the readme that I thought this would fall under. The terminal output from llama-server doesn't appear to show anything relevant, but it's included for completeness.

Name and Version

o0@hades:/ai/llama.cpp$ ./llama-cli --version
version: 3203 (b5a5f34)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

INFO [                    main] build info | tid="139722331939776" timestamp=1719130838 build=3203 commit="b5a5f34e"
INFO [                    main] system info | tid="139722331939776" timestamp=1719130838 n_threads=12 n_threads_batch=-1 total_threads=24 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from ../models/text/L3-8B-Stheno-v3.2-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = L3-8B-Stheno-v3.2
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 7
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - kv  22:                      quantize.imatrix.file str              = /models/L3-8B-Stheno-v3.2-GGUF/L3-8B-...
llama_model_loader: - kv  23:                   quantize.imatrix.dataset str              = /training_data/calibration_datav3.txt
llama_model_loader: - kv  24:             quantize.imatrix.entries_count i32              = 224
llama_model_loader: - kv  25:              quantize.imatrix.chunks_count i32              = 125
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q8_0:  226 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.8000 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 7.95 GiB (8.50 BPW)
llm_load_print_meta: general.name     = L3-8B-Stheno-v3.2
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 2 ROCm devices:
  Device 0: Radeon RX 7900 XTX, compute capability 11.0, VMM: no
  Device 1: Radeon RX 7900 XTX, compute capability 11.0, VMM: no
llm_load_tensors: ggml ctx size =    0.44 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      ROCm0 buffer size =  3757.53 MiB
llm_load_tensors:      ROCm1 buffer size =  3847.80 MiB
llm_load_tensors:        CPU buffer size =   532.31 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      ROCm0 KV buffer size =   416.50 MiB
llama_kv_cache_init:      ROCm1 KV buffer size =   367.50 MiB
llama_new_context_with_model: KV self size  =  784.00 MiB, K (q8_0):  272.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:  ROCm_Host  output buffer size =     0.98 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
llama_new_context_with_model:      ROCm0 compute buffer size =   640.01 MiB
llama_new_context_with_model:      ROCm1 compute buffer size =   640.02 MiB
llama_new_context_with_model:  ROCm_Host compute buffer size =    72.02 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 3
INFO [                    init] initializing slots | tid="139722331939776" timestamp=1719130849 n_slots=1
INFO [                    init] new slot | tid="139722331939776" timestamp=1719130849 id_slot=0 n_ctx_slot=8192
INFO [                    main] model loaded | tid="139722331939776" timestamp=1719130849
INFO [                    main] chat template | tid="139722331939776" timestamp=1719130849 chat_example="<|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHello<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nHi there<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHow are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" built_in=true
INFO [                    main] HTTP server listening | tid="139722331939776" timestamp=1719130849 n_threads_http="23" port="5000" hostname="0.0.0.0"
INFO [            update_slots] all slots are idle | tid="139722331939776" timestamp=1719130849
INFO [   launch_slot_with_task] slot is processing task | tid="139722331939776" timestamp=1719131079 id_slot=0 id_task=0
INFO [            update_slots] kv cache rm [p0, end) | tid="139722331939776" timestamp=1719131079 id_slot=0 id_task=0 p0=0
INFO [           print_timings] prompt eval time     =     111.88 ms /    55 tokens (    2.03 ms per token,   491.61 tokens per second) | tid="139722331939776" timestamp=1719131141 id_slot=0 id_task=0 t_prompt_processing=111.878 n_prompt_tokens_processed=55 t_token=2.0341454545454547 n_tokens_second=491.60692897620623
INFO [           print_timings] generation eval time =   61940.54 ms /  1522 runs   (   40.70 ms per token,    24.57 tokens per second) | tid="139722331939776" timestamp=1719131141 id_slot=0 id_task=0 t_token_generation=61940.538 n_decoded=1522 t_token=40.696805519053875 n_tokens_second=24.57195318516607
INFO [           print_timings]           total time =   62052.42 ms | tid="139722331939776" timestamp=1719131141 id_slot=0 id_task=0 t_prompt_processing=111.878 t_token_generation=61940.538 t_total=62052.416
INFO [            update_slots] slot released | tid="139722331939776" timestamp=1719131141 id_slot=0 id_task=0 n_ctx=8192 n_past=1576 n_system_tokens=0 n_cache_tokens=0 truncated=false
INFO [            update_slots] all slots are idle | tid="139722331939776" timestamp=1719131141
INFO [            update_slots] all slots are idle | tid="139722331939776" timestamp=1719131141
INFO [            update_slots] all slots are idle | tid="139722331939776" timestamp=1719131253

Metadata

Labels

bug-unconfirmed, low severity, stale
