
use _wfopen instead of fopen on Windows #6248

Merged (3 commits into master, Mar 23, 2024)

Conversation

cebtenzzre
Collaborator

This PR uses _wfopen() instead of fopen() on Windows, to make it more convenient for downstream projects to use llama.cpp with UTF-8 paths.

(This doesn't fix llama_model_quantize, which uses std::ofstream, but AFAIK there is no way to open a std::ofstream with a wchar_t path on MinGW, and llama_model_quantize should really only be used by the quantize example anyway.)

Supersedes #5927

Tested with llama-cpp-python. Without this change:

>>> from llama_cpp import Llama
>>> path = b'\xF0\x9F\x98\x8A'.decode()  # '😊'
>>> llm = Llama(model_path=path, verbose=True)
llama_model_load: error loading model: failed to open 😊: No such file or directory
llama_load_model_from_file: failed to load model
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\msys64\home\Jared\llama-cpp-python\llama_cpp\llama.py", line 314, in __init__
    self._model = _LlamaModel(
                  ^^^^^^^^^^^^
  File "C:\msys64\home\Jared\llama-cpp-python\llama_cpp\_internals.py", line 55, in __init__
    raise ValueError(f"Failed to load model from file: {path_model}")
ValueError: Failed to load model from file: 😊

With this change, the model loads successfully.


See the Wikipedia page "Unicode in Microsoft Windows" for historical context that explains why Unicode is such a headache on Windows.

On recent versions of Windows, one can set the "ANSI" code page used by the *A functions[1] to UTF-8: for the current process using setlocale(), for an executable or DLL using a manifest, or system-wide. However, for llama.cpp to be a portable and easy-to-use library, it shouldn't rely on this internally.
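For illustration only, a minimal sketch of the setlocale() route just mentioned (assuming a recent UCRT; this is precisely the mechanism the PR avoids depending on):

#include <clocale>
#include <cstdio>

int main() {
    // Per-process opt-in described above: after this call, narrow-string
    // CRT functions treat char* text as UTF-8, including (per the
    // paragraph above) paths passed to fopen().
    setlocale(LC_ALL, ".UTF-8");

    // "😊.gguf" spelled as explicit UTF-8 bytes; the file name is
    // hypothetical and only serves the example.
    FILE * f = fopen("\xF0\x9F\x98\x8A.gguf", "rb");
    if (f) { fclose(f); }
    return 0;
}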

The approach used in this PR works regardless of the above. The standard C mbstowcs() depends on the process locale as set by setlocale(), but since this translation is only necessary on Windows, this PR uses MultiByteToWideChar with CP_UTF8 directly.
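A minimal sketch of that approach (fopen_utf8 is a hypothetical name; the PR's actual helper lives in the diff):

#ifdef _WIN32
#include <windows.h>
#include <cstdio>
#include <cstdlib>

// Convert the UTF-8 path to UTF-16 with MultiByteToWideChar(CP_UTF8, ...),
// which does not depend on the process locale, then open it with _wfopen().
static FILE * fopen_utf8(const char * path, const wchar_t * mode) {
    // With cbMultiByte == -1, the returned count includes the terminating NUL.
    int n = MultiByteToWideChar(CP_UTF8, 0, path, -1, NULL, 0);
    if (n <= 0) {
        return NULL;
    }
    // Size the buffer in bytes: n wide characters times sizeof(wchar_t).
    wchar_t * wpath = (wchar_t *) malloc(n * sizeof(wchar_t));
    if (!wpath) {
        return NULL;
    }
    MultiByteToWideChar(CP_UTF8, 0, path, -1, wpath, n);
    FILE * f = _wfopen(wpath, mode);
    free(wpath);
    return f;
}
#endif

Call sites would pass a wide mode string, e.g. fopen_utf8(fname, L"rb"); the PR itself widens a narrow mode string instead, which is where the follow-up heap-corruption fix discussed below comes in.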

NOTE: When the system code page is not UTF-8, the examples and tests will still fail on Unicode paths. The simplest fix for this that will also work on MinGW is for all of them to call setlocale(), and to use wmain or GetCommandLineW to get argv in UTF-8. If there is interest, this can be implemented in a future PR.
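For reference, a sketch of the GetCommandLineW route described in the note (get_utf8_argv is a hypothetical name and not part of this PR; per-allocation error checks are elided for brevity):

#ifdef _WIN32
#include <windows.h>
#include <shellapi.h> // CommandLineToArgvW; link against Shell32
#include <cstdlib>

// Fetch the command line as UTF-16 regardless of the console code page,
// split it, and convert each argument to UTF-8 with
// WideCharToMultiByte(CP_UTF8, ...).
static char ** get_utf8_argv(int * argc_out) {
    int argc = 0;
    wchar_t ** wargv = CommandLineToArgvW(GetCommandLineW(), &argc);
    if (!wargv) {
        return NULL;
    }
    char ** argv = (char **) calloc(argc + 1, sizeof(char *));
    if (!argv) {
        LocalFree(wargv);
        return NULL;
    }
    for (int i = 0; i < argc; i++) {
        // With cchWideChar == -1, the returned byte count includes the NUL.
        int n = WideCharToMultiByte(CP_UTF8, 0, wargv[i], -1, NULL, 0, NULL, NULL);
        argv[i] = (char *) malloc(n);
        WideCharToMultiByte(CP_UTF8, 0, wargv[i], -1, argv[i], n, NULL, NULL);
    }
    LocalFree(wargv);
    *argc_out = argc;
    return argv; // NULL-terminated, like the argv passed to main()
}
#endif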

Footnotes

  1. There is also the "OEM" code page used by console functions such as ReadConsoleA/WriteConsoleA; this is what chcp changes.

@cebtenzzre cebtenzzre requested a review from ggerganov March 23, 2024 02:24
@cebtenzzre cebtenzzre changed the title use wide versions of file path functions on Windows use _wfopen instead of fopen on Windows Mar 23, 2024
@cebtenzzre
Collaborator Author

A related problem is that this doesn't work correctly on MSVC:

llama.cpp/llama.cpp

Lines 3974 to 3976 in 50ccaf5

const std::vector<int> ids = llama_tokenize_internal(vocab, "\u010A", false);
GGML_ASSERT(!ids.empty() && "model vocab missing newline token");
vocab.linefeed_id = ids[0];

This warning is important:

  C:\msys64\home\Jared\llama-cpp-python\vendor\llama.cpp\llama.cpp(3974,69): warning C4566: character represented by universal-character-name '\u010A' cannot be represented in the current code page (1252) [C:\Users\Jared\AppData\Local\Temp\tmpapjgk6fk\build\vendor\llama.cpp\llama.vcxproj

MSVC uses the system ANSI code page to encode char* string literals by default, and Windows-1252 does not contain Ċ. This can be fixed by using at least /execution-charset:utf-8. Alternatively, we could explicitly encode the UTF-8 representation in the source using \x; this would be "\xC4\x8A".
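For clarity, the quoted line with the \x spelling applied (0xC4 0x8A is the UTF-8 encoding of U+010A, so the bytes are exact):

// Same literal as "\u010A" (Ċ), but written as explicit UTF-8 bytes so
// the result no longer depends on the compiler's execution charset (C4566):
const std::vector<int> ids = llama_tokenize_internal(vocab, "\xC4\x8A", false);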

Owner

@ggerganov ggerganov left a comment


Alternatively, we could explicitly encode the UTF-8 representation in the source using \x - this would be "\xC4\x8A".

Sounds good

@cebtenzzre cebtenzzre merged commit 94d1b3b into master Mar 23, 2024
57 of 58 checks passed
TheFlipbook added a commit to TheFlipbook/llama.cpp that referenced this pull request Mar 24, 2024
* would throw error on VS2022 on GGML_FREE(wmode)
* wchar_t is usually 2 bytes, but malloc wants bytes
  * therefore `*wmode_p++ = (wchar_t)*mode;` could write off the end of the allocation
* Fixes error possibly introduced by ggerganov#6248
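In other words, a simplified sketch of the pattern those bullets describe (widen_mode is a hypothetical name, not the verbatim code):

#include <cstdlib>
#include <cstring>

// malloc() takes a size in bytes, but the loop below stores wchar_t
// (2 bytes on Windows), so the buffer must be sized in wchar_t units
// multiplied by sizeof(wchar_t).
static wchar_t * widen_mode(const char * mode) {
    // BUG (pre-#6272): malloc(strlen(mode) + 1) allocates enough chars but
    // not enough bytes, so the widening writes run off the end.
    wchar_t * wmode = (wchar_t *) malloc((strlen(mode) + 1) * sizeof(wchar_t)); // fixed
    if (!wmode) {
        return NULL;
    }
    wchar_t * wmode_p = wmode;
    do {
        *wmode_p++ = (wchar_t) *mode; // widen each mode char, including the NUL
    } while (*mode++);
    return wmode; // caller frees
}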
slaren pushed a commit that referenced this pull request Mar 24, 2024

Fix heap corruption from wmode out-of-bound writes on windows (#6272): the fix described above, merged to master as commit a32b77c.
cebtenzzre added a commit to nomic-ai/llama.cpp that referenced this pull request Mar 25, 2024
also fix missing #defines before windows.h, and BPE LF token on MSVC

(cherry picked from commit 94d1b3b)
Signed-off-by: Jared Van Bortel <jared@nomic.ai>
cebtenzzre pushed a commit to nomic-ai/llama.cpp that referenced this pull request Mar 25, 2024 (cherry-pick of the #6272 fix, signed off by Jared Van Bortel).

The same two commits were subsequently mirrored or cherry-picked into ggerganov/ggml, KerfuffleV2/ggml-sys-bleedingedge, hodlen/llama.cpp, and rubra-ai/tools.cpp.