Rewrite loading code to try to satisfy everyone #801
Conversation
VirtualLock only locks your memory into the working set.

ntdll version:

```cpp
#pragma comment (lib, "ntdll.lib")

extern "C" __declspec(dllimport) long __stdcall RtlAdjustPrivilege(
    unsigned long dwPrivilege, int bEnablePrivilege, int bIsThreadPrivilege, int * pbPreviousValue);

inline static int privs()
{
    int bPrev;
    unsigned long dwErr;
    dwErr = RtlAdjustPrivilege(/* SE_LOCK_MEMORY_PRIVILEGE */ 4, 1, 0, &bPrev);
    if (dwErr) {
        printf("ntdll:RtlAdjustPrivilege() failed; error code = 0x%08X\n", dwErr);
        return 1;
    }
    dwErr = RtlAdjustPrivilege(/* SE_INCREASE_QUOTA_PRIVILEGE */ 5, 1, 0, &bPrev);
    if (dwErr) {
        printf("ntdll:RtlAdjustPrivilege() failed; error code = 0x%08X\n", dwErr);
        return 1;
    }
    dwErr = RtlAdjustPrivilege(/* SE_INC_WORKING_SET_PRIVILEGE */ 33, 1, 0, &bPrev);
    if (dwErr) {
        printf("ntdll:RtlAdjustPrivilege() failed; error code = 0x%08X\n", dwErr);
        return 1;
    }
    return 0;
}
```

advapi32 version:

```cpp
typedef struct _TOKEN_PRIVILEGES_3 {
    unsigned long PrivilegeCount;
    LUID_AND_ATTRIBUTES Privileges[3];
} TOKEN_PRIVILEGES_3, *PTOKEN_PRIVILEGES_3;

inline static int privs()
{
    TOKEN_PRIVILEGES_3 tkp;
    void * hToken;
    unsigned long dwErr;
    tkp.PrivilegeCount = 3;
    tkp.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;
    tkp.Privileges[1].Attributes = SE_PRIVILEGE_ENABLED;
    tkp.Privileges[2].Attributes = SE_PRIVILEGE_ENABLED;
    if (!OpenProcessToken((void*)-1, TOKEN_QUERY | TOKEN_ADJUST_PRIVILEGES, &hToken)) {
        dwErr = GetLastError();
        return printf("[!] advapi32:OpenProcessToken() failed; error code = 0x%08X\n", dwErr);
    }
    if (!LookupPrivilegeValueA(0, "SeLockMemoryPrivilege",         &tkp.Privileges[0].Luid) ||
        !LookupPrivilegeValueA(0, "SeIncreaseQuotaPrivilege",      &tkp.Privileges[1].Luid) ||
        !LookupPrivilegeValueA(0, "SeIncreaseWorkingSetPrivilege", &tkp.Privileges[2].Luid)) {
        dwErr = GetLastError();
        CloseHandle(hToken);
        return printf("[!] advapi32:LookupPrivilegeValueA() failed; error code = 0x%08X\n", dwErr);
    }
    AdjustTokenPrivileges(hToken, 0, (TOKEN_PRIVILEGES*)&tkp, 0, 0, 0);
    dwErr = GetLastError();
    CloseHandle(hToken);
    if (dwErr) {
        return printf("[!] advapi32:AdjustTokenPrivileges() failed; error code = 0x%08X\n", dwErr);
    }
    return 0;
}
```
```cpp
if (size == 0) {
    return;
}
errno = 0;
```
I don't think this is needed. errno will be reset to zero after a successful IO operation.
In general, successful calls leave `errno` as-is rather than setting it to 0. In most cases there's nevertheless no need to reset `errno` to 0 before calling an API, because you only check `errno` if the call fails, and in that case it must have set `errno`. `fread` and `fwrite` are weird, because the spec doesn't require them to set `errno` if they fail. But in practice they do set `errno` at least on Linux and macOS (not sure about Windows), and there's no better alternative way to get the cause of a read/write error, at least not one that's portable. (You can use `ferror` to portably check whether an error occurred, but not what it was.) Therefore I report `errno`, but reset it to 0 first so that if you're on some system where `fread`/`fwrite` don't set `errno` on failure, you'll at least see something like "Undefined error: 0" or "Success" – which is not too helpful but better than showing a misleading error from some past call. (Even better would be to check for `errno == 0` and print something like "unknown error", but I felt it wasn't worth the lines of code.)

This explanation may become redundant if I have to switch to OS raw read/write functions to avoid unnecessary copies (still need to check this). Edit: Turns out I don't have to switch.
```cpp
void init(void * addr) {
```
Why not put this in the constructor? This is purely initializing a member variable.
Because the `llama_mlock` objects live in `llama_model` and are constructed before the mlock actually happens. An alternative approach that does allow it to be in the constructor would be to store it behind a `unique_ptr`, like I did with `llama_mmap`, but there wasn't any particular need for it here (unlike `llama_mmap`, where the mapping gets transferred out of the `llama_model_loader`).
`llama.cpp` permissions changed from 644 → 755; I guess this is unintended.

Will fix.
Figured out the VirtualLock issue.
Addressed all PR feedback, and checked both items off my todo list. The PR still could use testing.
I've tried the versions/configurations listed below; the output is also added as an attachment. I tried to make some combinations using various models and data formats. All models load and produce output with no errors. Tested on windows/cmake/vs/ps. Let me know if I need to try something specific or run some other test. Output: pizza_version_test.txt
@comex Do you have any particular things in mind which you think need testing? I tested this PR using
and confirmed they all load and give the expected results (at least for the first few chunks), that mmap is only used for the ggjt files, and that I also tested

All tests on macOS arm64.

One note is, I think it would be useful to print out the magic and version (or some other descriptor), since it's not always obvious from the file itself (unless I'm just missing it in the output somewhere). Also it may be worth printing a warning in the original ggml case, since the broken tokenizer really degrades its quality.
Thank you @comex for this hard work!

The PR is OK to be merged as it is, but here are some comments:

- The `mlock` move from `ggml` to `llama` is great.
- I've written elsewhere my opinion on the validation and error checking: Fix memory bugs in loading code #651 (review). Regardless, this is fine.
- Not a big fan of the `llama_util.h` introduction. The way I see it, the `mmap`/`mlock` functionality has to be part of `common` and be provided optionally through the C API. The way of thinking is: "we demonstrate how one can use these extra features via the `examples`". I know it makes life a little bit more difficult for developers who want to use `llama` + `mmap`/`mlock` in their projects, but it's not really that big of an obstacle. And I think we are past the point of "there is no reason not to use `mmap`".
- In the long run, I think `llama_buffer` can become part of `llama.cpp` and `llama_util.h` be merged into `common`.
- We have to add some obsoletion periods for old formats. For example, 2-3 weeks after introducing a new format, the old ones will be deprecated completely. At the moment, there is huge interest in the project and it's great to provide such wide support for all the formats to make life easier for the non-technical people, but in the long run it's better to simplify the code.
There is now a trivial conflict after #812 has been merged.
Thanks.
Hmm, I think you may have misunderstood my intent with

If you did understand already that it's not a public header, then I don't really understand what you're recommending, so please elaborate. :)
I'm not sure whether you mean the argument is resolved in favor of "there is no reason" or "there is a reason". I've seen both of those views expressed very strongly by different people. :) Either way, I'll state my view: Some of the issues people have had with mmap are solvable (alignment is a non-issue if we're going to deprecate old formats; the issue with Windows doing small-sized reads is theoretically solved with

In the future, after getting rid of support for old formats, the non-mmap path can be rewritten to just read the entire file directly into a buffer (as opposed to doing reads tensor-by-tensor), so the amount of extra code needed to support both paths will be very small.
What you tested is great (I did some similar tests but not as thorough). Another kind of testing I'd like is from Windows users. I did some basic testing in a Windows VM, but not much. Specifically:
Good points; I'll think about doing a followup with this.
Regarding the
The idea I have is to extend for example

In any case, this is mostly brainstorming about an alternative design. I don't insist on doing it now, or even later.
My impression, based on the feedback so far, is that there are still cases where
Sounds good. If we manage to solve the reported issues and we don't observe too many OS-specific difficulties in the future, I am totally fine with keeping this tightly integrated with the library.
- Support all three formats (ggml, ggmf, ggjt). (However, I didn't include the hack needed to support GPT4All files without conversion. Those can still be used after converting them with convert.py from my other PR.)
- Support both mmap and read (mmap is used by default, but can be disabled with `--no-mmap`, and is automatically disabled for pre-ggjt files or on platforms where mmap is not supported).
- Support multi-file models like before, but automatically determine the number of parts rather than requiring `--n_parts`.
- Improve validation and error checking.
- Stop using the per-file type field (f16) entirely in favor of just relying on the per-tensor type/size fields. This has no immediate benefit, but makes it easier to experiment with different formats, and should make it easier to support the new GPTQ-for-LLaMa models in the future (I have some work in progress on that front).
- Support VirtualLock on Windows (using the same `--mlock` option as on Unix).
- Indicate loading progress when using mmap + mlock. (Which led me to the interesting observation that on my Linux machine, with a warm file cache, mlock actually takes some time, whereas mmap without mlock starts almost instantly...)
- To help implement this, move mlock support from ggml to the loading code.
- madvise/PrefetchVirtualMemory support (based on ggerganov#740)
- Switch from ifstream to the `fopen` family of functions to avoid unnecessary copying and, when mmap is enabled, allow reusing the same file descriptor for both metadata reads and mmap (whereas the existing implementation opens the file a second time to mmap).
- Quantization now produces a single-file output even with multi-file inputs (not really a feature as much as 'it was easier this way').

Implementation notes: I tried to factor the code into more discrete pieces than before.
Regarding code style: I tried to follow the code style, but I'm naughty and used a few advanced C++ features repeatedly:

- Destructors to make it easier to ensure everything gets cleaned up.
- Exceptions. I don't even usually use exceptions when writing C++, and I can remove them if desired... but here they make the loading code much more succinct while still properly handling a variety of errors, ranging from API calls failing to integer overflow and allocation failure. (The exceptions are converted to error codes at the API boundary.)

Co-authored-by: Pavol Rusnak <pavol@rusnak.io> (for the bit I copied from ggerganov#740)
Also improve model type printing, and fix indentation of an unrelated switch statement.
Anyway, PR has been rebased on the latest master. Significant changes:
Feel free to merge if you want; I apparently can't trigger a merge myself until @slaren retracts the "changes requested" status. Thanks.
I think that's @prusnak, I haven't requested any changes.
Retracted my review since the file permissions issue has been corrected; the merge should now be possible.
Seems strange that you can lock pages in memory without actually having the privilege to do so, but at the same time it doesn't really surprise me, since the Windows privilege system has always been, and still is, all over the place. The MSDN doc could also simply be wrong, which it has been many times in the past, and maybe an unconfigured lock policy ("Not defined") means that everyone actually has that privilege rather than no one. Cases like that can throw you off when doing
Since Georgi himself has approved and all lingering issues appear to have been resolved, I went ahead and merged this.
Big thumbs up for this. It seems like in general the
After:
Successfully compiled the master branch, successfully compiled comex's pizza branch, and successfully ran `./main -m ./models/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -t 8 -n 512`. If confused how exactly I compiled it, read #103 (comment)
Features:

- Support all three formats (ggml, ggmf, ggjt). (However, I didn't include the hack needed to support GPT4All files without conversion. Those can still be used after converting them with convert.py from my other PR.)
- Support both mmap and read (mmap is used by default, but can be disabled with `--no-mmap`, and is automatically disabled for pre-ggjt files or on platforms where mmap is not supported).
- Support multi-file models like before, but automatically determine the number of parts rather than requiring `--n_parts`.
- Improve validation and error checking.
- Stop using the per-file type field (f16) entirely in favor of just relying on the per-tensor type/size fields. This has no immediate benefit, but makes it easier to experiment with different formats, and should make it easier to support the new GPTQ-for-LLaMa models in the future (I have some work in progress on that front).
- Support VirtualLock on Windows (using the same `--mlock` option as on Unix).
- Indicate loading progress when using mmap + mlock. (Which led me to the interesting observation that on my Linux machine, with a warm file cache, mlock actually takes some time, whereas mmap without mlock starts almost instantly...)
- madvise/PrefetchVirtualMemory support (based on Advise the kernel to preload the mapped memory #740)
- Switch from ifstream to the `fopen` family of functions to avoid unnecessary copying and, when mmap is enabled, allow reusing the same file descriptor for both metadata reads and mmap (whereas the existing implementation opens the file a second time to mmap).
- Quantization now produces a single-file output even with multi-file inputs (not really a feature as much as 'it was easier this way').
Todo:

- ~~VirtualLock does not work at all on the one Windows machine I tested it on (it complains about quota). Figure out why.~~ Fixed.
- ~~Verify that using the `fopen` family of functions actually does what I think it does, performance-wise.~~ Verified that when reading a large amount of data with `fread`, it passes the pointer directly to the kernel rather than doing an intermediate copy, on Linux, macOS, and Windows. And `ifstream` does not do so on at least macOS (didn't test the other two). So moving from `ifstream` to the `fopen` family was indeed an improvement, but there's no benefit to going further and using OS APIs directly.
- More testing.
Implementation notes:
I tried to factor the code into more discrete pieces than before.
Regarding code style: I tried to follow the code style, but I'm naughty and used a few advanced C++ features repeatedly:
Destructors to make it easier to ensure everything gets cleaned up.
Exceptions. I don't even usually use exceptions when writing C++, and I can remove them if desired... but here they make the loading code much more succinct while still properly handling a variety of errors, ranging from API calls failing to integer overflow and allocation failure. (Edit: The exceptions are converted to error codes at the API boundary.)
Co-authored-by: Pavol Rusnak pavol@rusnak.io (for the bit I copied from #740)