-
Notifications
You must be signed in to change notification settings - Fork 10.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
use _wfopen instead of fopen on Windows #6248
Conversation
A related problem is that this doesn't work correctly on MSVC: Lines 3974 to 3976 in 50ccaf5
This warning is important:
MSVC uses the system ANSI code page to encode char* string literals by default, and Windows-1252 does not contain |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, we could explicitly encode the UTF-8 representation in the source using \x - this would be "\xC4\x8A".
Sounds good
* would throw error on VS2022 on GGML_FREE(wmode) * wchar_t is usually 2 bytes, but malloc wants bytes * therefore `*wmode_p++ = (wchar_t)*mode;` could write off the end of the allocation * Fixes error possibly introduced by ggerganov#6248
* would throw error on VS2022 on GGML_FREE(wmode) * wchar_t is usually 2 bytes, but malloc wants bytes * therefore `*wmode_p++ = (wchar_t)*mode;` could write off the end of the allocation * Fixes error possibly introduced by #6248
== Relevant log messages from source repo: commit a32b77c4b2c1808654d0b952f26c37d73d2e746b Author: Rick G <26732651+TheFlipbook@users.noreply.github.com> Date: Sun Mar 24 14:45:56 2024 -0700 Fix heap corruption from wmode out-of-bound writes on windows (#6272) * would throw error on VS2022 on GGML_FREE(wmode) * wchar_t is usually 2 bytes, but malloc wants bytes * therefore `*wmode_p++ = (wchar_t)*mode;` could write off the end of the allocation * Fixes error possibly introduced by ggerganov/llama.cpp#6248
also fix missing #defines before windows.h, and BPE LF token on MSVC (cherry picked from commit 94d1b3b) Signed-off-by: Jared Van Bortel <jared@nomic.ai>
…nov#6272) * would throw error on VS2022 on GGML_FREE(wmode) * wchar_t is usually 2 bytes, but malloc wants bytes * therefore `*wmode_p++ = (wchar_t)*mode;` could write off the end of the allocation * Fixes error possibly introduced by ggerganov#6248 (cherry picked from commit a32b77c) Signed-off-by: Jared Van Bortel <jared@nomic.ai>
…6272) * would throw error on VS2022 on GGML_FREE(wmode) * wchar_t is usually 2 bytes, but malloc wants bytes * therefore `*wmode_p++ = (wchar_t)*mode;` could write off the end of the allocation * Fixes error possibly introduced by ggerganov/llama.cpp#6248
…6272) * would throw error on VS2022 on GGML_FREE(wmode) * wchar_t is usually 2 bytes, but malloc wants bytes * therefore `*wmode_p++ = (wchar_t)*mode;` could write off the end of the allocation * Fixes error possibly introduced by ggerganov/llama.cpp#6248
also fix missing #defines before windows.h, and BPE LF token on MSVC
…nov#6272) * would throw error on VS2022 on GGML_FREE(wmode) * wchar_t is usually 2 bytes, but malloc wants bytes * therefore `*wmode_p++ = (wchar_t)*mode;` could write off the end of the allocation * Fixes error possibly introduced by ggerganov#6248
also fix missing #defines before windows.h, and BPE LF token on MSVC
…nov#6272) * would throw error on VS2022 on GGML_FREE(wmode) * wchar_t is usually 2 bytes, but malloc wants bytes * therefore `*wmode_p++ = (wchar_t)*mode;` could write off the end of the allocation * Fixes error possibly introduced by ggerganov#6248
also fix missing #defines before windows.h, and BPE LF token on MSVC
…nov#6272) * would throw error on VS2022 on GGML_FREE(wmode) * wchar_t is usually 2 bytes, but malloc wants bytes * therefore `*wmode_p++ = (wchar_t)*mode;` could write off the end of the allocation * Fixes error possibly introduced by ggerganov#6248
This PR uses _wfopen() instead of fopen() on Windows, to make it more convenient for downstream projects to use llama.cpp with UTF-8 paths.
(This doesn't fix llama_model_quantize, which uses std::ofstream, but AFAIK there is no way to open a std::ofstream with a wchar_t path on MinGW, and this should only really be used by the
quantize
example anyway.)Supersedes #5927
Tested with llama-cpp-python. Without this change:
With this change, the model loads successfully.
See the Wikipedia page Unicode in Microsoft Windows for historical context that explains why Unicode is such a headache on Windows.
On recent versions of Windows, one can set the "ANSI" code page used by the *A functions1 to UTF-8 for the current process using setlocale(), for an executable or DLL using a manifest, or system-wide. However, for llama.cpp to be a portable and easy-to-use library, it shouldn't rely on this internally.
The approach used in this PR will work regardless of the above. The standard C wcstombs() depends on the process locale as set by setlocale(), but since this translation is only necessary on Windows, this PR directly uses MultiByteToWideChar with CP_UTF8.
NOTE: When the system code page is not UTF-8, the examples and tests will still fail on Unicode paths. The simplest fix for this that will also work on MinGW is for all of them to call setlocale(), and to use wmain or GetCommandLineW to get argv in UTF-8. If there is interest, this can be implemented in a future PR.
Footnotes
There is also the "OEM" code page used by console functions such as ReadConsoleA/WriteConsoleA - this is what
chcp
changes ↩