llama : add Deepseek support #5981 #6252

Closed
wants to merge 12 commits

Conversation

dragnil1 (Contributor)

ref #5981

unicode.h Outdated
Comment on lines 29 to 31
std::vector<std::wstring> get_gpt2_regex();
std::vector<std::wstring> get_deepseek_coder_regex();
std::vector<std::wstring> get_deepseek_llm_regex();
ggerganov (Owner) commented Mar 23, 2024

I'm thinking the interface here should be:

std::vector<std::string> unicode_regex_split(const std::string & text, const std::vector<std::string> & regexes);

The implementation should be something like what regex_bpe_preprocess currently is. It loops through the regex strings; if we have a known unicode representation (e.g. "\\s?\\p{L}+" -> std::wstring), we apply it with std::wregex. Else, if we have a custom implementation (for example, see the GPT2 preprocess function), we apply that.

The unicode module should not have any kind of notion about GPT2, Deepseek or other model-related stuff. This information should be in llama.cpp.

llama.cpp Outdated
Comment on lines 10097 to 10099
std::vector<std::string> bpe_deepseek_coder_preprocess(const std::string & text) {
return regex_bpe_preprocess(text, get_deepseek_coder_regex());
}
ggerganov (Owner)

Following my previous comment, this should eventually become:

Suggested change

before:
std::vector<std::string> bpe_deepseek_coder_preprocess(const std::string & text) {
    return regex_bpe_preprocess(text, get_deepseek_coder_regex());
}

after:
std::vector<std::string> bpe_deepseek_coder_preprocess(const std::string & text) {
    return unicode_regex_split(text, {
        "[\\p{P}\\$\\+<=>\\^~\\|]+",
        "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)",
        "[0-9][0-9][0-9]",
        "\\s?\\p{L}+",
        "\\s?\\p{P}+",
        "\\p{N}",
    });
}

llama.cpp Outdated
const llama_vocab & vocab;

std::vector<llm_symbol> symbols;
std::vector<llm_symbol> symbols_final;

llm_bigram_bpe::queue work_queue;

const std::vector<std::wstring> gpt2_regex = {
ggerganov (Owner) commented Apr 1, 2024

I had a different idea about this - let me try to explain again:

In llama.cpp, we want to keep the original regex strings as they have been specified by the model creators. For example:

  • 's|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)
  • \s?\p{L}+
  • etc.

Now, my understanding is that in C++ we cannot simply perform some of those regex matches due to the lack of support for some of the regex patterns in the standard library. So to solve this issue, we create the unicode module, which takes the regex strings from above as they are and performs a few different strategies to split the target string:

  • If we have a known unicode representation generated in some way, we apply that using std::wregex. I.e. we check a constant std::map<std::string, std::wstring> for the presence of the regex
  • If not, we then check if we have a custom implementation of the regex via a function call (see bpe_gpt2_preprocess() on master which is a custom implementation of regex split with 's|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+)
  • Else, we just apply std::regex and hope for the best, or throw an error
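A minimal, hedged sketch of that three-way strategy. The helper names (k_known_regexes, apply_wregex, to_wide/to_utf8, unicode_has_custom_split) are hypothetical, only the first strategy is fleshed out, and the matching keeps only matched spans for brevity; the real unicode_regex_split() that eventually landed in llama.cpp differs.

#include <codecvt>   // deprecated in C++17, used here only to keep the sketch short
#include <locale>
#include <map>
#include <regex>
#include <stdexcept>
#include <string>
#include <vector>

static std::wstring to_wide(const std::string & s) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.from_bytes(s);
}

static std::string to_utf8(const std::wstring & ws) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.to_bytes(ws);
}

// split every element of `pieces` further, keeping only the matches
// (a real pre-tokenizer would also keep the text between matches)
static std::vector<std::string> apply_wregex(const std::vector<std::string> & pieces, const std::wregex & re) {
    std::vector<std::string> out;
    for (const auto & piece : pieces) {
        const std::wstring wpiece = to_wide(piece);
        for (std::wsregex_iterator it(wpiece.begin(), wpiece.end(), re), end; it != end; ++it) {
            out.push_back(to_utf8(it->str()));
        }
    }
    return out;
}

std::vector<std::string> unicode_regex_split(const std::string & text, const std::vector<std::string> & regexes) {
    // 1) raw regex string -> equivalent pattern that std::wregex understands
    //    (illustrative entry only; the real table would be much larger)
    static const std::map<std::string, std::wstring> k_known_regexes = {
        { "\\p{N}", L"[0-9]" },
    };

    std::vector<std::string> pieces = { text };
    for (const auto & regex : regexes) {
        const auto it = k_known_regexes.find(regex);
        if (it != k_known_regexes.end()) {
            pieces = apply_wregex(pieces, std::wregex(it->second));   // strategy 1
        } else if (false /* unicode_has_custom_split(regex) */) {
            // strategy 2: hand-written splitter, e.g. the GPT2 pre-tokenizer
        } else {
            // strategy 3: plain std::regex and hope for the best, or give up
            throw std::runtime_error("unsupported regex: " + regex);
        }
    }
    return pieces;
}

On the llama.cpp side, a function like bpe_deepseek_coder_preprocess() would then simply forward the raw regex strings, as in the suggested change above.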

dragnil1 (Contributor, Author)

Hello, in unicode.cpp I have implemented the unicode_regex_split() function, which iterates through the given regexes and, if a match is found, uses that regex. Otherwise, it uses the modified bpe_gpt2_preprocess() function, renamed unicode_custom_preprocess(). Now I have some questions regarding bpe_gpt2_preprocess(). Can it handle the input of Deepseek Coder and Deepseek LLM? If not, do I have to write custom functions for them for when a regex is not found?

ggerganov (Owner)

Can it handle the input of Deepseek Coder and Deepseek LLM?

No, AFAIK it is not compatible with the deepseek regex. The way I understand it is that bpe_gpt2_preprocess() (i.e. unicode_custom_preprocess()) works only for the following regex (based on the comment in the code):

's|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+

So the logic in unicode_regex_split() has to check if this is the input regex and only apply this specific custom implementation in that case. For other regexes, we might want to implement more custom implementations in other functions and use them in unicode_regex_split() in the future.

Note that this is my understanding of how this part of the tokenizer is supposed to work. I could be wrong, so don't take all of these suggestions for granted.

In any case, the huge unicode constants like gpt2_regex should not be located in llama.cpp, but instead should be in unicode.cpp.
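For illustration, a minimal sketch of the exact-match check described above, with hypothetical names; unicode_custom_preprocess() (the renamed bpe_gpt2_preprocess()) is only declared here so the snippet stands alone.

#include <string>
#include <vector>

// implemented elsewhere in the unicode module
std::vector<std::string> unicode_custom_preprocess(const std::string & text);

// returns true and fills `out` only when the requested regex is exactly the
// pattern that the hand-written GPT2 splitter implements
static bool try_custom_split(const std::string & regex, const std::string & text, std::vector<std::string> & out) {
    static const std::string k_gpt2_regex =
        "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)|\\s+";
    if (regex != k_gpt2_regex) {
        return false; // no custom implementation for this regex (yet)
    }
    out = unicode_custom_preprocess(text);
    return true;
}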

github-actions bot commented Apr 16, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 424 iterations 🚀

Expand details for performance related PR only
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=11118.74ms p(95)=30395.38ms fails=, finish reason: stop=368 truncated=56
  • Prompt processing (pp): avg=123.28tk/s p(95)=545.28tk/s
  • Token generation (tg): avg=25.14tk/s p(95)=34.55tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=unicode-refactor-regex commit=d58d9d80f8152edb5ac913d4f97fea129e3c4d93

[Benchmark charts omitted: prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, requests_processing; each plotted for "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 424 iterations".]

ggerganov (Owner)

@dragnil1 Thank you for the help. With LLaMA v3 now switching to BPE tokenizer, this functionality becomes very important (see #6914).

I'll focus on finalizing it ASAP and will likely use the work in this PR as a starting point. Will probably move the branch in the llama.cpp repo so that we can run ggml-ci as well.

The tokenization and unicode handling are definitely not my strongest or favourite part of the codebase, so if you or anyone else has any insights, don't hesitate to share or help out. I think I understand how to implement BPE pre-processing support, but I could very well be missing something.

dragnil1 (Contributor, Author) commented Apr 26, 2024

Thanks for letting me work on this PR. Sorry for the delay; I was working on getting it to work on Windows. The recent commit passed the tests on Ubuntu, but the tokenizer tests failed on Windows. I had the idea of using std::wregex when the wchar_t size is 32 bits and std::regex when the wchar_t size is 16 bits. But using std::regex gives a SEGFAULT in the tokenizer tests on Windows. This is probably because the regex pattern from the ReFlex library is much larger than the regex pattern used for wregex. While doing some research on it, I found that the most efficient way would be to use the Boost library with ICU support, but that would hamper the minimal-dependency goal of llama.cpp. Alternatively, we could use the standalone Boost regex library. I was also thinking of converting the regex pattern produced by the ReFlex library to a UTF-32 pattern and a UTF-16 pattern, which work on Ubuntu and probably on Windows, respectively.
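A rough illustration of that engine-selection idea, assuming a naive ASCII-only conversion and an illustrative digit pattern rather than the real ReFlex output; it does not address the pattern-size problem mentioned above.

#include <regex>
#include <string>
#include <vector>

// pick the regex engine based on the width of wchar_t:
// 4 bytes on Linux/macOS, 2 bytes on Windows
static std::vector<std::string> split_digits(const std::string & text) {
    std::vector<std::string> out;
    if (sizeof(wchar_t) == 4) {
        // UTF-32 wchar_t: every codepoint fits in one wchar_t -> std::wregex
        const std::wstring wtext(text.begin(), text.end()); // ASCII-only for the sketch
        const std::wregex re(L"[0-9]+");
        for (std::wsregex_iterator it(wtext.begin(), wtext.end(), re), end; it != end; ++it) {
            const std::wstring m = it->str();
            out.emplace_back(m.begin(), m.end());
        }
    } else {
        // UTF-16 wchar_t (Windows): fall back to std::regex over the UTF-8 bytes
        const std::regex re("[0-9]+");
        for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it) {
            out.push_back(it->str());
        }
    }
    return out;
}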

ggerganov (Owner) commented Apr 26, 2024

But using std::regex gives a SEGFAULT in the tokenizer tests on Windows. This is probably because the regex pattern from the ReFlex library is much larger than the regex pattern used for wregex.

In the latest version #6920, I changed the order of the regexes: first look for a custom implementation and then look for a known equivalent regex. I also disabled the DeepSeek code-paths temporarily until we get the tests running as they were on master. So with these changes, I don't think we apply large regexes, but it still crashes (based on the Windows CIs).

I don't have a Windows environment to work on, so it's gonna take me some time to figure out where it goes wrong

Edit: apparently the Windows build failures were unrelated - hopefully we have a baseline that works now.

dragnil1 (Contributor, Author) commented Apr 26, 2024

OK, I have found the reason for the tests failing on Windows: some regex ranges are not valid on Windows. Here is an example that will run.

#include <iostream>
#include <string>
#include <regex>

int main() {
    std::wregex pattern2(L"[\U00000041-\U0000005A]"); // will run
    return 0;
}

Here is an example that will not run.

#include <iostream>
#include <string>
#include <regex>

int main() {
    std::wregex pattern1(L"[\U00011700-\U0001171A]"); // will not run
    return 0;
}

Both regex ranges are taken from the GPT2 regex.

ggerganov (Owner)

Yes, I just noticed the error in the CI:

6: llama_model_load: error loading model: error loading model vocabulary: regex_error(error_range): The expression contained an invalid character range, such as [b-a] in most encodings.

https://github.com/ggerganov/llama.cpp/actions/runs/8850389799/job/24304574061?pr=6920#step:12:1392

Any ideas how to resolve?

dragnil1 (Contributor, Author) commented Apr 26, 2024


We cannot use ranges that contain codepoints requiring more than 2 bytes. We have to convert the 3-byte or 4-byte ranges to individual values, but this may cause the regex pattern to become large enough to result in a SEGFAULT. I will let you know if I can find a viable solution.
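A minimal sketch of that workaround, with hypothetical names: expand a range of codepoints into an explicit alternation, encoding each codepoint above U+FFFF as a UTF-16 surrogate pair so the pattern stays valid when wchar_t is 16 bits wide. As noted above, this can blow up the pattern size.

#include <cstdint>
#include <string>

// build "(?:X|Y|...)" where each alternative is one codepoint from [lo, hi],
// written as a surrogate pair when it does not fit in a single 16-bit unit
static std::wstring expand_range_utf16(uint32_t lo, uint32_t hi) {
    std::wstring out = L"(?:";
    for (uint32_t cp = lo; cp <= hi; ++cp) {
        if (cp > lo) {
            out += L'|';
        }
        if (cp >= 0x10000) {
            const uint32_t v = cp - 0x10000;
            out += wchar_t(0xD800 | (v >> 10));   // high surrogate
            out += wchar_t(0xDC00 | (v & 0x3FF)); // low surrogate
        } else {
            out += wchar_t(cp);
        }
    }
    out += L')';
    return out;
}

// usage: replace the invalid class range [\U00011700-\U0001171A] with
// expand_range_utf16(0x11700, 0x1171A) when building the pattern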

mofosyne added the "Review Complexity : Medium" and "enhancement" labels on May 10, 2024
Galunid added the "obsolete?" label on Jun 15, 2024
Galunid closed this on Jun 15, 2024