JSON schema conversion: ⚡️ faster repetitions, min/maxLength for strings, cap number length #6555

Merged
merged 17 commits into from
Apr 12, 2024

Conversation

ochafik
Collaborator

@ochafik ochafik commented Apr 8, 2024

Another followup on #5978

  • Bug fixes:

    • Grammars generated for the following schemas will no longer trigger combinatorial explosions during inference (also see llama : speed-up grammar sampling #4218 (comment)):

      • {"items": {"type": "number"}, "minItems": 10, "maxItems": 1000}: this used to hang forever, it's now running smoothly

        Show command

        Before:

         ./main -m --grammar-file \
           <( echo '{"items": {"type": "number"}, "minItems": 10, "maxItems": 100}' | \
           python examples/json-schema-to-grammar.py - \
         ) -p "List of 50 numbers"           
         > [0,1,2,3,4,5,6,7,8,9, <...hangs...>

        After (notice python script uses underscores now):

         ./main -m --grammar-file \
           <( echo '{"items": {"type": "number"}, "minItems": 10, "maxItems": 100}' | \
           python examples/json_schema_to_grammar.py - \
         ) -p "List of 50 numbers"           
         > [1234, 5678, 1010, 1111, 1212, 1313, 1414, 1515, 1616, 1717, 1818, 1919, 2020, 2121, 2222, 2323, 2424, 2525, 2626, 2727, 2828, 2929, 3030, 3131, 3232, 3333, 3434, 3535, 3636, 3737, 3838, 3939, 4040, 4141, 4242, 4343, 4444, 4545, 4646, 4747, 4848, 4949, 5050]
      • {"type": "string", "pattern": "^a{10,100}$"}

    • Numbers & integers now have a capped precision (JSON itself allows arbitrary-precision numbers, but there's no point in exceeding JavaScript's precision of roughly 15 digits; zealous LLMs may otherwise generate an infinite sequence 0.33333333333... when prompted for "one third")

    • Allow null in untyped JSON objects

  • New features:

    • Support string length constraints: {"type": "string", "minLength": 10, "maxLength": 100}

    • Python converter can be imported more easily (underscored name)
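
To illustrate why bounded repetitions could blow up, here is a minimal Python sketch. It is not the converter's actual code: the function names are hypothetical, and the output only mimics the GBNF shapes quoted in this PR's commit messages. With a flat run of independent optionals, a string that uses `k` of `n` optional slots can be derived in `C(n, k)` ways, whereas the nested form leaves exactly one derivation. The same idea would apply to `minItems`/`maxItems`, `minLength`/`maxLength` and regex quantifiers such as `a{10,100}`.

```python
import math


def flat_optionals(item: str, max_items: int) -> str:
    """Naive form: every copy is independently optional, e.g. a? a? a?"""
    return " ".join(f"{item}?" for _ in range(max_items))


def nested_optionals(item: str, max_items: int) -> str:
    """Nested form: each copy is only reachable after the previous one."""
    rule = ""
    for _ in range(max_items):
        rule = f"({item} {rule})?" if rule else f"({item})?"
    return rule


def bounded_repetition(item: str, min_items: int, max_items: int) -> str:
    """Mandatory copies first, then nested optionals for the remainder."""
    parts = [item] * min_items
    optional = nested_optionals(item, max_items - min_items)
    if optional:
        parts.append(optional)
    return " ".join(parts)


def flat_derivations(optional_slots: int, used: int) -> int:
    """Number of ways to pick which flat optional slots are used."""
    return math.comb(optional_slots, used)


print(flat_optionals('"a"', 3))         # "a"? "a"? "a"?
print(nested_optionals('"a"', 3))       # ("a" ("a" ("a")?)?)?
print(bounded_repetition('"a"', 2, 4))  # "a" "a" ("a" ("a")?)?
print(flat_derivations(90, 40))         # a very large number of equivalent derivations
```

Whether the sampler really explores all of those derivations depends on its implementation, so treat the count as an upper-bound intuition rather than a measurement.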

I've hopefully simplified the code by adding a simple dependency mechanism for primitive rules and unifying all repetition code.

I've also updated the GBNF doc to mention the performance gotchas, and have documented the server response_format parameter for schema-constrained JSON output

Contributor

github-actions bot commented Apr 8, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 450 iterations 🚀

Expand details for performance related PR only
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=10489.27ms p(95)=26481.41ms fails=, finish reason: stop=394 truncated=56
  • Prompt processing (pp): avg=110.66tk/s p(95)=487.97tk/s
  • Token generation (tg): avg=26.2tk/s p(95)=36.03tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=json-faster-repetitions2 commit=9c33ee99302caac14c79f12c43e7a61462dc0730

[Charts omitted: four time-series plots titled "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3 duration=10m 450 iterations", one each for llamacpp:prompt_tokens_seconds, llamacpp:predicted_tokens_seconds, llamacpp:kv_cache_usage_ratio and llamacpp:requests_processing, with the x-axis spanning timestamps 1712674161 to 1712674793.]

@ochafik ochafik marked this pull request as ready for review April 9, 2024 08:44
@ochafik ochafik changed the title JSON schema conversion: faster repetitions, min/maxLength for strings, cap number length JSON schema conversion: ⚡️ faster repetitions, min/maxLength for strings, cap number length Apr 9, 2024
@HanClinto
Collaborator

How much more effort would it be to benchmark these improvements?

@ochafik
Collaborator Author

ochafik commented Apr 10, 2024

> How much more effort would it be to benchmark these improvements?

@HanClinto It can quickly become an unfair fight depending on the schema and the model's generation choices (in some cases the speed may appear the same, but the worst case is now bounded).

If you expand the "Show command" drawer in the PR's description, the simple example I gave goes from being essentially stuck until the end of the universe to something that is still sluggish but makes interactive progress (edit: I originally wrote "very smooth").

Here's how to benchmark any schema you'd like (using hyperfine):

git clone https://github.com/ochafik/llama.cpp --branch json-faster-repetitions2 llama.cpp-faster-rep
cd llama.cpp-faster-rep && git pull

echo '{"items": {"type": "number"}, "maxItems": 10}' > schema.json && \
  git checkout json-faster-repetitions2 && \
  python examples/json_schema_to_grammar.py schema.json > fast.grammar && \
  git checkout master && \
  python examples/json-schema-to-grammar.py schema.json > slow.grammar && \
  make clean && make -j LLAMA_CURL=1 main && \
  mkdir -p models/7B && \
  hyperfine --warmup 1 -L speed fast,slow './main -mu https://huggingface.co/NousResearch/Hermes-2-Pro-Mistral-7B-GGUF/resolve/main/Hermes-2-Pro-Mistral-7B.Q5_K_M.gguf --grammar-file {speed}.grammar -p "List of 10 numbers" --seed 1234'

# The warmup run will download the model, will take a while & use up 5.2GB

This gives an 8x speedup for that specific seed & model (other values may not show improvements, or the master branch may time out)

Show output
Benchmark 1: ./main --grammar-file fast.grammar -p "List of 10 numbers" --seed 1234
  Time (mean ± σ):      2.645 s ±  0.058 s    [User: 1.113 s, System: 0.232 s]
  Range (min … max):    2.604 s …  2.800 s    10 runs
 
Benchmark 2: ./main --grammar-file slow.grammar -p "List of 10 numbers" --seed 1234
  Time (mean ± σ):     20.999 s ±  0.285 s    [User: 16.764 s, System: 2.612 s]
  Range (min … max):   20.656 s … 21.480 s    10 runs
 
Summary
  ./main --grammar-file fast.grammar -p "List of 10 numbers" --seed 1234 ran
    7.94 ± 0.20 times faster than ./main --grammar-file slow.grammar -p "List of 10 numbers" --seed 1234
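
As a sanity check on the summary line, the ratio and its uncertainty can be recomputed from the two means and standard deviations above. The sketch below uses standard first-order error propagation for a quotient; that this is exactly how hyperfine derives its figure is an assumption, although the numbers do agree.

```python
import math


def speedup_with_uncertainty(slow_mean, slow_sd, fast_mean, fast_sd):
    """Return (ratio, uncertainty) for slow/fast using first-order error propagation."""
    ratio = slow_mean / fast_mean
    relative_error = math.sqrt((slow_sd / slow_mean) ** 2 + (fast_sd / fast_mean) ** 2)
    return ratio, ratio * relative_error


# Means and standard deviations copied from the hyperfine output above (seconds).
ratio, err = speedup_with_uncertainty(20.999, 0.285, 2.645, 0.058)
print(f"{ratio:.2f} ± {err:.2f} times faster")  # 7.94 ± 0.20 times faster
```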

Lemme know if you'd like me to test any specific kind of schema

Collaborator

Not sure I'm on board with the rename to use underscores -- while there are a few other files with underscores (such as pydantic_models_to_grammar.py), most seem to use hyphens (pydantic-models-to-grammar-examples.py, etc), and it seems like the old filename is possibly better?

Owner

Originally I wanted all filenames in the repo to use hyphens. But later I found out that Python does not work well when there are hyphens in the filenames (e.g. I think you cannot import a Python file that has hyphens in its name). So I think it's better to eventually rename all Python files to use underscores in their filenames

Collaborator Author

Tbh I did all this as a prerequisite for #6389, in which I need to import the converter from Python. I also found out llama-cpp-python inlines that file in their codebase, since it's hard / not trivial to import (short of using importlib, which feels dirty).
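
For context, this is roughly what the importlib workaround looks like for a file whose name contains hyphens. It is a self-contained sketch: `demo-converter.py` and its `convert` function are hypothetical stand-ins written to a temporary directory, not the real converter.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_module_from_path(module_name: str, path: Path):
    """Load a Python file as a module even if its filename is not a valid identifier."""
    spec = importlib.util.spec_from_file_location(module_name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


with tempfile.TemporaryDirectory() as tmp:
    # `import demo-converter` would be a syntax error, so the file is loaded by path.
    hyphenated = Path(tmp) / "demo-converter.py"
    hyphenated.write_text("def convert(schema):\n    return schema.upper()\n")
    converter = load_module_from_path("demo_converter", hyphenated)
    result = converter.convert("number")

print(result)  # NUMBER
```

With an underscored filename on the import path, a plain `import` statement is enough and none of this plumbing is needed.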

}

RESERVED_NAMES = set(["root", *PRIMITIVE_RULES.keys(), *DATE_RULES.keys()])
DOTALL = '[\\U00000000-\\U0010FFFF]'
DOT = '[\\U00000000-\\x09\\x0B\\x0C\\x0E-\\U0010FFFF]'

Collaborator

Not sure if it would be more performant or not, but I'm curious if:
DOT = '[^\\x0A\\x0D]'
would be easier / faster to process.

Collaborator Author

Updated to simpler negative range, thanks!!
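
The two character classes discussed here should accept the same characters: everything except line feed (`\x0A`) and carriage return (`\x0D`). The sketch below checks that with Python's `re` module on a sample of code points. Python's regex syntax stands in for GBNF here, which is an assumption, and the sample is not exhaustive.

```python
import re

# Explicit ranges, mirroring the original DOT definition.
EXPLICIT_DOT = re.compile(r"[\U00000000-\x09\x0B\x0C\x0E-\U0010FFFF]")
# Negated class, mirroring the suggested simplification.
NEGATED_DOT = re.compile(r"[^\x0A\x0D]")


def same_matches(code_points) -> bool:
    """True if both classes agree on every sampled code point."""
    return all(
        bool(EXPLICIT_DOT.fullmatch(chr(cp))) == bool(NEGATED_DOT.fullmatch(chr(cp)))
        for cp in code_points
    )


# All of Latin-1 plus a few code points from higher planes.
sample = list(range(0x00, 0x100)) + [0x2028, 0xFFFF, 0x10000, 0x10FFFF]
print(same_matches(sample))  # True
```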

@@ -89,3 +89,13 @@ This guide provides a brief overview. Check out the GBNF files in this directory
```
./main -m <model> --grammar-file grammars/some-grammar.gbnf -p 'Some prompt'
```

## Troubleshooting

Collaborator

I like this section. After this gets merged in, I'll write a section on the dangers of left-recursion.

Collaborator Author

We should probably also document the json->grammar converters here, I'll send that separately

@ochafik
Collaborator Author

ochafik commented Apr 12, 2024

After (#6609 &) @HanClinto's ✨ magical #6616 ✨, this PR is still required, and we can dramatically increase the # of repetitions without impacting sampling performance.

At 200 repetitions the PR is 18x faster (switched to phi-2), and from 500 reps master is astronomically slow. Since the bottleneck then became the finite stack in the recursive repetition rule generator, I've rewritten it and we can now go to 10k repetitions smoothly 🤯 (at 100k, the C++ server segfaults, which I suggest we keep for a follow-up investigation).
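
The stack limit mentioned above is easy to reproduce in isolation. The sketch below is hypothetical and is not the PR's `opt_repetitions` code. It builds the same nested-optional rule text once recursively and once without recursion. With CPython's default recursion limit (typically 1000), the recursive version is expected to raise `RecursionError` at 10k repetitions, while the non-recursive one is unaffected.

```python
def nested_recursive(item: str, n: int) -> str:
    """Recursive builder: one Python stack frame per repetition."""
    if n == 0:
        return ""
    inner = nested_recursive(item, n - 1)
    return f"({item} {inner})?" if inner else f"({item})?"


def nested_iterative(item: str, n: int) -> str:
    """Same output without recursion: n openers, then n closers."""
    if n == 0:
        return ""
    return f"({item} " * (n - 1) + f"({item}" + ")?" * n


# Both builders agree on small inputs.
small = nested_iterative('"a"', 3)
assert nested_recursive('"a"', 3) == small

try:
    nested_recursive('"a"', 10_000)
    recursive_ok = True
except RecursionError:
    # Expected under the default recursion limit.
    recursive_ok = False

big = nested_iterative('"a"', 10_000)
print(small, recursive_ok, len(big))
```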

Show benchmark commands for 10k reps
echo '{"items": {"type": "number"}, "maxItems": 10000}' > schema.json && \
  git checkout json-faster-repetitions2 && \
  python examples/json_schema_to_grammar.py schema.json > fast.grammar && \
  git checkout master && \
  python examples/json-schema-to-grammar.py schema.json > slow.grammar && \
  make clean && make -j LLAMA_CURL=1 main && \
  mkdir -p models/7B && \
  hyperfine --warmup 1 -L speed fast,slow './main -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf --grammar-file {speed}.grammar -p "List of 10 numbers" --seed 1234'

Collaborator

@HanClinto HanClinto left a comment

Looks good to me! Great work on this PR -- this is really stellar work!

The only other suggestion I can think of is that it might be worth adding integration tests to compare the output of the Python json_schema_to_grammar.py vs. json-schema-to-grammar.cpp -- but I'm not sure we care enough about ensuring lockstep equivalency to bother wrapping it.

Overall this looks great, and I'm very impressed with it -- GREAT work on all of this @ochafik !

@ochafik
Collaborator Author

ochafik commented Apr 12, 2024

> it might be worth adding integration tests to compare the output of the Python json_schema_to_grammar.py vs. json-schema-to-grammar.cpp -- but I'm not sure we care enough about ensuring lockstep equivalency to bother wrapping it.

Fully agree, so much so that I've done this already in #5978 :-D: https://github.com/ggerganov/llama.cpp/blob/master/tests/test-json-schema-to-grammar.cpp (it also tests the JS version)

> Overall this looks great, and I'm very impressed with it -- GREAT work on all of this @ochafik !

Thank you so much for your help, ideas & proactive reviews! (and your own speedups) Love this team work 👍

@ochafik ochafik merged commit ab9a324 into ggerganov:master Apr 12, 2024
53 of 59 checks passed
@HanClinto
Collaborator

> Fully agree, so much so that I've done this already in #5978 :-D: https://github.com/ggerganov/llama.cpp/blob/master/tests/test-json-schema-to-grammar.cpp (it also tests the JS version)

haha -- very nicely done. 😄

Again, awesome job -- super happy to see this merged in!

tybalex pushed a commit to rubra-ai/tools.cpp that referenced this pull request Apr 17, 2024
…ngs, cap number length (ggerganov#6555)

* json: rename python schema converter to make import easier

* server: skip null json_schema / grammar fields

* json: deps management for primitive rules (+ allow null values)

* json: optimize repetitions for minItems/maxItems and regexps: `a{,3}` goes from `"a"? "a"? "a"?` (explosive combos) to `(a (a (a)?)?)?`

* grammars: add troubleshooting section to readme

* json: cap length of numbers to 15 digits before/after decimal point

(avoids infinite gen, e.g. "one third" -> `0.333333333333...`)

* json: unify all repetition code (w/ or w/o sep)

* json: support string minLength/maxLength

* server+json: update server/README w/ result_format

* nits

* json: fix type error w/ python 3.8

* json: fix server/README (json_schema in /completion vs. result_format in /v1/chat/completions)

* json: simplify DOT `{"type": "string", "pattern": "^.$"}`

* json: remove recursion in opt_repetitions (avoids Python stack overflow)

* json: rm dead code

* json: rm useless assert & ggml.h import
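
The number-length cap from the commit list can be pictured with a regular expression. The pattern below is an illustrative approximation in Python's `re` syntax. It is not the grammar the converter emits, and the exact digit bounds (`MAX_DIGITS` on each side of the decimal point) are an assumption based on the "15 digits before/after decimal point" wording.

```python
import re

MAX_DIGITS = 15  # assumed cap, taken from the commit message wording

# Optional sign, capped integral part, optional capped fractional part.
CAPPED_NUMBER = re.compile(
    rf"-?(?:0|[1-9][0-9]{{0,{MAX_DIGITS - 1}}})(?:\.[0-9]{{1,{MAX_DIGITS}}})?"
)


def is_capped_number(text: str) -> bool:
    """True if text is a plain decimal number within the digit caps."""
    return CAPPED_NUMBER.fullmatch(text) is not None


print(is_capped_number("0." + "3" * 15))  # True: exactly at the cap
print(is_capped_number("0." + "3" * 16))  # False: one digit too many
```

Because the fractional part is bounded, a model constrained this way has to stop emitting digits eventually, instead of continuing `0.3333...` until it runs out of context.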