llama : add option to render special/control tokens #6807
Conversation
Performance dropped - maybe generation does not stop properly after the #6745 EOG changes?

Very likely, because we're using the phi-2 model, which does not have native support for ChatML.

I think we are incorrectly using a base model instead of an instruction-tuned one for this test: https://huggingface.co/microsoft/phi-2

Ah yeah, that's right. We can use dolphin-phi2 then. Here is the link: https://huggingface.co/TheBloke/dolphin-2_6-phi-2-GGUF
* make : fix common dep on llama.h
* llama : add option to render special tokens
* readme : add API change notice (ggml-ci)
* swift : fix build
Fixes #6770
Setting `special == true` in `llama_token_to_piece()` will cause special/control tokens' text to be rendered in the output:

llama.cpp/llama.h, lines 827 to 837 at 1f45c2a
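For reference, the declaration that those lines point at looks roughly like the sketch below (reconstructed from the PR description; the exact doc comments in llama.h at commit 1f45c2a may differ slightly):

```c
/// @details Convert the token id into a piece of text.
///          Does not write a null terminator to the buffer.
/// @param special If true, special tokens are rendered in the output.
LLAMA_API int32_t llama_token_to_piece(
          const struct llama_model * model,
                       llama_token   token,
                              char * buf,
                           int32_t   length,
                              bool   special);
```

And a minimal usage sketch; `print_token` is a hypothetical helper, not part of the library:

```c
#include <stdio.h>
#include "llama.h"

// Hypothetical helper: print one token as text, rendering control
// tokens such as <|im_start|> / <|im_end|> instead of skipping them.
static void print_token(const struct llama_model * model, llama_token token) {
    char buf[256];
    // special == true -> special/control tokens are rendered in the output
    const int32_t n = llama_token_to_piece(model, token, buf, sizeof(buf), true);
    if (n >= 0) {
        printf("%.*s", (int) n, buf); // the piece is not null-terminated
    }
}
```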