Rewrite loading code to try to satisfy everyone #801
Conversation
VirtualLock only locks your memory into the working set.

ntdll version:

```cpp
#pragma comment (lib, "ntdll.lib")

extern "C" __declspec(dllimport) long __stdcall RtlAdjustPrivilege(
    unsigned long dwPrivilege, int bEnablePrivilege, int bIsThreadPrivilege, int * pbPreviousValue);

inline static int privs()
{
    int bPrev;
    unsigned long dwErr;
    dwErr = RtlAdjustPrivilege(/* SE_LOCK_MEMORY_PRIVILEGE */ 4, 1, 0, &bPrev);
    if (dwErr) {
        printf("ntdll:RtlAdjustPrivilege() failed; error code = 0x%08X\n", dwErr);
        return 1;
    }
    dwErr = RtlAdjustPrivilege(/* SE_INCREASE_QUOTA_PRIVILEGE */ 5, 1, 0, &bPrev);
    if (dwErr) {
        printf("ntdll:RtlAdjustPrivilege() failed; error code = 0x%08X\n", dwErr);
        return 1;
    }
    dwErr = RtlAdjustPrivilege(/* SE_INC_WORKING_SET_PRIVILEGE */ 33, 1, 0, &bPrev);
    if (dwErr) {
        printf("ntdll:RtlAdjustPrivilege() failed; error code = 0x%08X\n", dwErr);
        return 1;
    }
    return 0;
}
```

advapi32 version:

```cpp
typedef struct _TOKEN_PRIVILEGES_3 {
    unsigned long PrivilegeCount;
    LUID_AND_ATTRIBUTES Privileges[3];
} TOKEN_PRIVILEGES_3, *PTOKEN_PRIVILEGES_3;

inline static int privs()
{
    TOKEN_PRIVILEGES_3 tkp;
    void * hToken;
    unsigned long dwErr;
    tkp.PrivilegeCount = 3;
    tkp.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;
    tkp.Privileges[1].Attributes = SE_PRIVILEGE_ENABLED;
    tkp.Privileges[2].Attributes = SE_PRIVILEGE_ENABLED;
    if (!OpenProcessToken((void*)-1, TOKEN_QUERY | TOKEN_ADJUST_PRIVILEGES, &hToken)) {
        dwErr = GetLastError();
        return printf("[!] advapi32:OpenProcessToken() failed; error code = 0x%08X\n", dwErr);
    }
    if (!LookupPrivilegeValueA(0, "SeLockMemoryPrivilege",         &tkp.Privileges[0].Luid) ||
        !LookupPrivilegeValueA(0, "SeIncreaseQuotaPrivilege",      &tkp.Privileges[1].Luid) ||
        !LookupPrivilegeValueA(0, "SeIncreaseWorkingSetPrivilege", &tkp.Privileges[2].Luid)) {
        dwErr = GetLastError();
        CloseHandle(hToken);
        return printf("[!] advapi32:LookupPrivilegeValueA() failed; error code = 0x%08X\n", dwErr);
    }
    AdjustTokenPrivileges(hToken, 0, (TOKEN_PRIVILEGES*)&tkp, 0, 0, 0);
    dwErr = GetLastError();
    CloseHandle(hToken);
    if (dwErr) {
        return printf("[!] advapi32:AdjustTokenPrivileges() failed; error code = 0x%08X\n", dwErr);
    }
    return 0;
}
```
```cpp
if (size == 0) {
    return;
}
errno = 0;
```
I don't think this is needed. errno will be reset to zero after a successful IO operation.
In general, successful calls leave `errno` as-is rather than setting it to 0. In most cases there's nevertheless no need to reset `errno` to 0 before calling an API, because you only check `errno` if the call fails, and in that case it must have set `errno`. `fread` and `fwrite` are weird, because the spec doesn't require them to set `errno` if they fail. But in practice they do set `errno` at least on Linux and macOS (not sure about Windows), and there's no better alternative way to get the cause of a read/write error, at least not one that's portable. (You can use `ferror` to portably check whether an error occurred, but not what it was.) Therefore I report `errno`, but reset it to 0 first so that if you're on some system where `fread`/`fwrite` don't set `errno` on failure, you'll at least see something like "Undefined error: 0" or "Success" – which is not too helpful but better than showing a misleading error from some past call. (Even better would be to check for `errno == 0` and print something like "unknown error", but I felt it wasn't worth the lines of code.)

This explanation may become redundant if I have to switch to OS raw read/write functions to avoid unnecessary copies (still need to check this). Edit: Turns out I don't have to switch.
```cpp
void init(void * addr) {
```
Why not put this in the constructor? This is purely initializing a member variable.
Because the `llama_mlock` objects live in `llama_model` and are constructed before the mlock actually happens. An alternative approach that does allow it to be in the constructor would be to store it behind a `unique_ptr`, like I did with `llama_mmap`, but there wasn't any particular need for it here (unlike `llama_mmap`, where the mapping gets transferred out of the `llama_model_loader`).
`llama.cpp` permissions changed from 644 → 755; I guess this is unintended.

Will fix.
Figured out the VirtualLock issue.
Addressed all PR feedback, and checked both items off my todo list. The PR still could use testing.
I've tried the versions/configurations listed below; the output is also added as an attachment. I tried to make some combinations using various models and data formats. All models load and produce output with no errors. Tested on windows/cmake/vs/ps. Let me know if I need to try something specific or run some other test. Output: pizza_version_test.txt
@comex Do you have any particular things in mind which you think need testing? I tested this PR using
and confirmed they all load and give the expected results (at least for the first few chunks), that mmap is only used for the ggjt files, and that I also tested

All tests on macOS arm64.

One note is, I think it would be useful to print out the magic and version (or some other descriptor), since it's not always obvious from the file itself (unless I'm just missing it in the output somewhere). Also it may be worth printing a warning in the original ggml case, since the broken tokenizer really degrades its quality.
Thank you @comex for this hard work!

The PR is OK to be merged as it is, but here are some comments:

- The `mlock` move from `ggml` to `llama` is great.
- I've written elsewhere my opinion on the validation and error checking: Fix memory bugs in loading code #651 (review). Regardless, this is fine.
- Not a big fan of the `llama_util.h` introduction. The way I see it, the `mmap`/`mlock` functionality has to be part of `common` and be provided optionally through the C API. The way of thinking is: "we demonstrate how one can use these extra features via the `examples`". I know it makes life a little bit more difficult for developers who want to use `llama` + `mmap`/`mlock` in their projects, but it's not really that big of an obstacle. And I think we are past the point of "there is no reason not to use `mmap`".
- In the long run, I think `llama_buffer` can become part of `llama.cpp` and `llama_util.h` be merged into `common`.
- We have to add some obsoletion periods for old formats. For example, 2-3 weeks after introducing a new format, the old ones will be deprecated completely. At the moment, there is huge interest in the project and it's great to provide such wide support for all the formats to make life easier for the non-technical people, but in the long run it's better to simplify the code.
There is now a trivial conflict after #812 has been merged.
Thanks.
Hmm, I think you may have misunderstood my intent with

If you did understand already that it's not a public header, then I don't really understand what you're recommending, so please elaborate. :)
I'm not sure whether you mean the argument is resolved in favor of "there is no reason" or "there is a reason". I've seen both of those views expressed very strongly by different people. :) Either way, I'll state my view: Some of the issues people have had with mmap are solvable (alignment is a non-issue if we're going to deprecate old formats; the issue with Windows doing small-sized reads is theoretically solved with

In the future, after getting rid of support for old formats, the non-mmap path can be rewritten to just read the entire file directly into a buffer (as opposed to doing reads tensor-by-tensor), so the amount of extra code needed to support both paths will be very small.
What you tested is great (I did some similar tests but not as thorough). Another kind of testing I'd like is from Windows users. I did some basic testing in a Windows VM, but not much. Specifically:
Good points; I'll think about doing a followup with this.
Regarding the
The idea I have is to extend for example

In any case, this is mostly brainstorming about an alternative design. I don't insist on doing it now, or even later.
My impression, based on the feedback so far, is that there are still cases where
Sounds good. If we manage to solve the reported issues and we don't observe too many OS-specific difficulties in the future, I am totally fine with keeping this tightly integrated with the library.
- Support all three formats (ggml, ggmf, ggjt). (However, I didn't include the hack needed to support GPT4All files without conversion. Those can still be used after converting them with convert.py from my other PR.)
- Support both mmap and read (mmap is used by default, but can be disabled with `--no-mmap`, and is automatically disabled for pre-ggjt files or on platforms where mmap is not supported).
- Support multi-file models like before, but automatically determine the number of parts rather than requiring `--n_parts`.
- Improve validation and error checking.
- Stop using the per-file type field (f16) entirely in favor of just relying on the per-tensor type/size fields. This has no immediate benefit, but makes it easier to experiment with different formats, and should make it easier to support the new GPTQ-for-LLaMa models in the future (I have some work in progress on that front).
- Support VirtualLock on Windows (using the same `--mlock` option as on Unix).
- Indicate loading progress when using mmap + mlock. (Which led me to the interesting observation that on my Linux machine, with a warm file cache, mlock actually takes some time, whereas mmap without mlock starts almost instantly...)
- To help implement this, move mlock support from ggml to the loading code.
- madvise/PrefetchVirtualMemory support (based on ggerganov#740)
- Switch from ifstream to the `fopen` family of functions to avoid unnecessary copying and, when mmap is enabled, allow reusing the same file descriptor for both metadata reads and mmap (whereas the existing implementation opens the file a second time to mmap).
- Quantization now produces a single-file output even with multi-file inputs (not really a feature as much as 'it was easier this way').

Implementation notes: I tried to factor the code into more discrete pieces than before.
Regarding code style: I tried to follow the code style, but I'm naughty and used a few advanced C++ features repeatedly:

- Destructors to make it easier to ensure everything gets cleaned up.
- Exceptions. I don't even usually use exceptions when writing C++, and I can remove them if desired... but here they make the loading code much more succinct while still properly handling a variety of errors, ranging from API calls failing to integer overflow and allocation failure. (The exceptions are converted to error codes at the API boundary.)

Co-authored-by: Pavol Rusnak <pavol@rusnak.io> (for the bit I copied from ggerganov#740)
Also improve model type printing, and fix indentation of an unrelated switch statement.
Anyway, PR has been rebased on the latest master. Significant changes:
Feel free to merge if you want; I apparently can't trigger a merge myself until @slaren retracts the "changes requested" status. Thanks.
I think that's @prusnak, I haven't requested any changes.
Retracted my review since the file permissions issue has been corrected; the merge should now be possible.
Seems strange that you can lock pages in memory without actually having the privilege to do so, but at the same time it doesn't really surprise me, since the Windows privilege system has always been, and still is, all over the place. The MSDN doc could also simply be wrong, which it has been many times in the past, and maybe an unconfigured lock policy ("Not defined") means that everyone actually has that privilege rather than no one. Cases like that can throw you off when doing
Since Georgi himself has approved and all lingering issues appear to have been resolved, I went ahead and merged this.
Big thumbs up for this. It seems like in general the
After:
Successfully compiled the master branch, successfully compiled comex's pizza branch, and successfully ran `./main -m ./models/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -t 8 -n 512`. If confused how exactly I compiled it, read #103 (comment)
Features:

- Support all three formats (ggml, ggmf, ggjt). (However, I didn't include the hack needed to support GPT4All files without conversion. Those can still be used after converting them with convert.py from my other PR.)
- Support both mmap and read (mmap is used by default, but can be disabled with `--no-mmap`, and is automatically disabled for pre-ggjt files or on platforms where mmap is not supported).
- Support multi-file models like before, but automatically determine the number of parts rather than requiring `--n_parts`.
- Improve validation and error checking.
- Stop using the per-file type field (f16) entirely in favor of just relying on the per-tensor type/size fields. This has no immediate benefit, but makes it easier to experiment with different formats, and should make it easier to support the new GPTQ-for-LLaMa models in the future (I have some work in progress on that front).
- Support VirtualLock on Windows (using the same `--mlock` option as on Unix).
- Indicate loading progress when using mmap + mlock. (Which led me to the interesting observation that on my Linux machine, with a warm file cache, mlock actually takes some time, whereas mmap without mlock starts almost instantly...)
- madvise/PrefetchVirtualMemory support (based on Advise the kernel to preload the mapped memory #740)
- Switch from ifstream to the `fopen` family of functions to avoid unnecessary copying and, when mmap is enabled, allow reusing the same file descriptor for both metadata reads and mmap (whereas the existing implementation opens the file a second time to mmap).
- Quantization now produces a single-file output even with multi-file inputs (not really a feature as much as 'it was easier this way').
Todo:

- ~~VirtualLock does not work at all on the one Windows machine I tested it on (it complains about quota). Figure out why.~~ Fixed.
- ~~Verify that using the `fopen` family of functions actually does what I think it does, performance-wise.~~ Verified that when reading a large amount of data with `fread`, it passes the pointer directly to the kernel rather than doing an intermediate copy, on Linux, macOS, and Windows. And `ifstream` does not do so on at least macOS (didn't test the other two). So moving from `ifstream` to the `fopen` family was indeed an improvement, but there's no benefit to going further and using OS APIs directly.
- More testing.
Implementation notes:
I tried to factor the code into more discrete pieces than before.
Regarding code style: I tried to follow the code style, but I'm naughty and used a few advanced C++ features repeatedly:
Destructors to make it easier to ensure everything gets cleaned up.
Exceptions. I don't even usually use exceptions when writing C++, and I can remove them if desired... but here they make the loading code much more succinct while still properly handling a variety of errors, ranging from API calls failing to integer overflow and allocation failure. (Edit: The exceptions are converted to error codes at the API boundary.)
Co-authored-by: Pavol Rusnak pavol@rusnak.io (for the bit I copied from #740)