
Add Unicode model filename support for Windows #5927

Closed

Conversation

@BruceMacD (Contributor) commented Mar 7, 2024

This change adds Unicode support for model file paths on Windows.

Example: C:\Users\Željko\mistral-7b-instruct-v0.2.Q4_K_S.gguf

Before:

 error: failed to load model 'C:\Users\bruce_macdonald\Desktop\┼╜eljko\blobs\sha256-c1864a5eb19305c40519da12cc543519e48a0697ecd30e15d5ac228644957d12'
{"function":"load_model","level":"ERR","line":395,"model":"C:\\Users\\bruce_macdonald\\Desktop\\┼╜eljko\\blobs\\sha256-c1864a5eb19305c40519da12cc543519e48a0697ecd30e15d5ac228644957d12","msg":"unable to load model","tid":"180","timestamp":1709843627}

Now it should work.

Fixes #5802
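
On Windows, the narrow-string fopen interprets char * paths in the active ANSI code page, so paths containing characters outside that code page cannot be opened. The usual fix, and roughly the approach taken here, is to convert the UTF-8 path to UTF-16 and open the file with the wide-character API. A minimal sketch of the idea, assuming the incoming path is UTF-8 (the helper name and error handling are illustrative, not the actual diff):

    #ifdef _WIN32
    #include <windows.h>
    #include <stdio.h>
    #include <stdlib.h>

    // Hypothetical helper: open a UTF-8 encoded path via the wide-char CRT API.
    static FILE * fopen_utf8(const char * fname, const char * mode) {
        // With cbMultiByte == -1, the first call returns the required buffer
        // length in wchar_t's, including the terminating NUL.
        int wlen = MultiByteToWideChar(CP_UTF8, 0, fname, -1, NULL, 0);
        if (wlen <= 0) {
            return NULL;
        }
        wchar_t * wfname = (wchar_t *) malloc(wlen * sizeof(wchar_t));
        if (!wfname) {
            return NULL;
        }
        MultiByteToWideChar(CP_UTF8, 0, fname, -1, wfname, wlen);

        // The mode string is plain ASCII ("rb", "wb", ...), so widening it
        // character-by-character is sufficient.
        wchar_t wmode[8] = {0};
        for (int i = 0; i < 7 && mode[i]; i++) {
            wmode[i] = (wchar_t) mode[i];
        }

        FILE * f = _wfopen(wfname, wmode);
        free(wfname);
        return f;
    }
    #endif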

@slaren (Collaborator) commented Mar 8, 2024

When using llama.cpp from the command line, this does not change anything for me; I am already able to open models with Unicode characters on master. How are you testing this? The model file is also opened in llama.cpp, in llama_file with fopen; why does that not require the same change?

@cebtenzzre (Collaborator) commented:

> When using llama.cpp from the command line, for me this does not change anything, I am already able to open models with unicode characters in master. How are you testing this?

This might be related to what code page is being used for the "ANSI" Win32 functions on your system. It might even be UTF-8. But generally, you do need to use the wide version of fopen to support Unicode on older versions of Windows.
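
For reference, GetACP() reports which ANSI code page the narrow Win32/CRT calls use for char * paths; only when it is CP_UTF8 (65001) can the narrow fopen accept arbitrary Unicode. A small illustrative check, not part of the PR:

    #include <windows.h>
    #include <stdio.h>

    int main(void) {
        UINT acp = GetACP();
        // 65001 == CP_UTF8: narrow fopen() then takes UTF-8 paths directly.
        // On a legacy code page (e.g. 437 or 1252), characters outside that
        // page cannot be represented, so the wide _wfopen() path is required.
        printf("active ANSI code page: %u%s\n", acp, acp == CP_UTF8 ? " (UTF-8)" : "");
        return 0;
    }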

@BruceMacD (Contributor, Author) commented Mar 8, 2024

@slaren I was able to test this out on a different Windows machine today (running Windows 11) and, as you observed, I did not see the Unicode problem in that case. The machine where I originally saw the behaviour was older; here is a dump of some of the system info from the machine where I can reproduce the issue:

$ systeminfo

OS Name:                   Microsoft Windows Server 2022 Datacenter
OS Version:                10.0.20348 N/A Build 20348
Registered Owner:          N/A
Registered Organization:   N/A
System Locale:             en-us;English (United States)
Input Locale:              en-us;English (United States)

$ chcp
Active code page: 437

For reference, I was testing by running the server example:
.\bin\Release\server.exe -m C:\Users\example\Željko\llama2.gguf

You're correct that the logic in llama_file will need similar changes; I didn't notice it since it's outside my workflow. I tested it in this environment and it has the same issue. It will be a bit more code to fix since it also needs to handle writing the file. I'll try to take a look at that over the next couple of days.
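
Assuming a wrapper along the lines of the fopen_utf8 sketch above, the write path needs no extra conversion machinery; the same helper works with a write mode:

    // Hypothetical usage: reads and writes both go through the UTF-8-aware wrapper.
    FILE * fin  = fopen_utf8("C:\\Users\\example\\Željko\\llama2.gguf", "rb");
    FILE * fout = fopen_utf8("C:\\Users\\example\\Željko\\out.gguf",   "wb");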

@slaren (Collaborator) commented Mar 9, 2024

My locale is the same as yours, but I have no problem loading models with unicode characters. Here is an example:

C:\llama.cpp>systeminfo

OS Name:                   Microsoft Windows 11 Pro
OS Version:                10.0.22621 N/A Build 226
System Locale:             en-us;English (United States)
Input Locale:              en-us;English (United States)

C:\llama.cpp>chcp
Active code page: 437

C:\llama.cpp>build\bin\Release\server --port 12345 -m models\Željkoñomic-embed-f16.gguf
{"build":2381,"commit":"77d1ac7e","function":"main","level":"INFO","line":2702,"msg":"build info","tid":"2768","timestamp":1710019099}
{"function":"main","level":"INFO","line":2709,"msg":"system info","n_threads":16,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | ","tid":"2768","timestamp":1710019099,"total_threads":32}
llama_model_loader: loaded meta data with 24 key-value pairs and 112 tensors from models\Äeljko±omic-embed-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = nomic-bert
llama_model_loader: - kv   1:                               general.name str              = nomic-embed
llama_model_loader: - kv   2:                     nomic-bert.block_count u32              = 12
llama_model_loader: - kv   3:                  nomic-bert.context_length u32              = 2048
llama_model_loader: - kv   4:                nomic-bert.embedding_length u32              = 768
llama_model_loader: - kv   5:             nomic-bert.feed_forward_length u32              = 3072
llama_model_loader: - kv   6:            nomic-bert.attention.head_count u32              = 12
llama_model_loader: - kv   7:    nomic-bert.attention.layer_norm_epsilon f32              = 0.000000
llama_model_loader: - kv   8:                          general.file_type u32              = 1
llama_model_loader: - kv   9:                nomic-bert.attention.causal bool             = false
llama_model_loader: - kv  10:                    nomic-bert.pooling_type u32              = 1
llama_model_loader: - kv  11:                  nomic-bert.rope.freq_base f32              = 1000.000000
llama_model_loader: - kv  12:            tokenizer.ggml.token_type_count u32              = 2
llama_model_loader: - kv  13:                tokenizer.ggml.bos_token_id u32              = 101
llama_model_loader: - kv  14:                tokenizer.ggml.eos_token_id u32              = 102
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = bert
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,30522]   = ["[PAD]", "[unused0]", "[unused1]", "...
llama_model_loader: - kv  17:                      tokenizer.ggml.scores arr[f32,30522]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,30522]   = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:            tokenizer.ggml.unknown_token_id u32              = 100
llama_model_loader: - kv  20:          tokenizer.ggml.seperator_token_id u32              = 102
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  22:                tokenizer.ggml.cls_token_id u32              = 101
llama_model_loader: - kv  23:               tokenizer.ggml.mask_token_id u32              = 103
llama_model_loader: - type  f32:   51 tensors
llama_model_loader: - type  f16:   61 tensors

The only difference that I can see is that I am using Windows 11 instead of Windows Server 2022.

A Collaborator commented in a review on ggml.c:

Instead of copy-pasting this code across several locations, let's use a function. Maybe we could put a static function in a utility header and use it everywhere?

The repository Owner replied:

We can expose the function through the ggml API:

    FILE * ggml_fopen(const char * fname, const char * mode);

Unfortunately, we would have to include <stdio.h>, which is not great, but I guess it is the best option.
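
A sketch of what that wrapper could look like, reusing the hypothetical fopen_utf8 helper from earlier for the Windows branch (an assumed shape, not the final patch):

    FILE * ggml_fopen(const char * fname, const char * mode) {
    #ifdef _WIN32
        return fopen_utf8(fname, mode);  // UTF-8 -> UTF-16, then _wfopen
    #else
        return fopen(fname, mode);       // POSIX systems pass UTF-8 bytes through
    #endif
    }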

@BruceMacD (Contributor, Author) commented Mar 14, 2024
Closing this for now, as I was running into some build issues. I'll re-open when I have time to come back to it.

Successfully merging this pull request may close these issues.

Cannot load model with unicode character in path