
Rewrite loading code to try to satisfy everyone #801

Merged (2 commits), Apr 9, 2023

Conversation

comex (Contributor) commented Apr 6, 2023

Features:

  • Support all three formats (ggml, ggmf, ggjt). (However, I didn't include the hack needed to support GPT4All files without conversion. Those can still be used after converting them with convert.py from my other PR.)

  • Support both mmap and read (mmap is used by default, but can be disabled with --no-mmap, and is automatically disabled for pre-ggjt files or on platforms where mmap is not supported).

  • Support multi-file models like before, but automatically determine the number of parts rather than requiring --n_parts.

  • Improve validation and error checking.

  • Stop using the per-file type field (f16) entirely in favor of just relying on the per-tensor type/size fields. This has no immediate benefit, but makes it easier to experiment with different formats, and should make it easier to support the new GPTQ-for-LLaMa models in the future (I have some work in progress on that front).

  • Support VirtualLock on Windows (using the same --mlock option as on Unix).

    • Indicate loading progress when using mmap + mlock. (Which led me to the interesting observation that on my Linux machine, with a warm file cache, mlock actually takes some time, whereas mmap without mlock starts almost instantly...)

      • To help implement this, move mlock support from ggml to the loading code.
  • madvise/PrefetchVirtualMemory support (based on #740, "Advise the kernel to preload the mapped memory"). A minimal sketch of the POSIX side of this appears after this list.

  • Switch from ifstream to the fopen family of functions to avoid unnecessary copying and, when mmap is enabled, allow reusing the same file descriptor for both metadata reads and mmap (whereas the existing implementation opens the file a second time to mmap).

  • Quantization now produces a single-file output even with multi-file inputs (not really a feature as much as 'it was easier this way').
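For illustration, here is a minimal POSIX-only sketch of the mmap + madvise path described above (the helper name is made up and error handling is reduced to the bare minimum; the PR itself also has a Windows branch using MapViewOfFile and PrefetchVirtualMemory):

#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map the whole model file read-only and hint the kernel to prefetch it.
static void * map_model_file(const char * path, size_t * out_size) {
    int fd = open(path, O_RDONLY);
    if (fd == -1) { return nullptr; }
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return nullptr; }
    *out_size = (size_t) st.st_size;
    void * addr = mmap(nullptr, *out_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  // the mapping keeps the file referenced
    if (addr == MAP_FAILED) { return nullptr; }
    madvise(addr, *out_size, MADV_WILLNEED);  // start readahead before the first page fault
    return addr;
}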

Todo:

  • VirtualLock does not work at all on the one Windows machine I tested it on (it complains about quota). Figure out why. Fixed.

  • Verify that using the fopen family of functions actually does what I think it does, performance-wise. Verified that when reading a large amount of data with fread, it passes the pointer directly to the kernel rather than doing an intermediate copy, on Linux, macOS, and Windows. And ifstream does not do so on at least macOS (didn't test the other two). So moving from ifstream to the fopen family was indeed an improvement, but there's no benefit to going further and using OS APIs directly.

  • More testing.

Implementation notes:

I tried to factor the code into more discrete pieces than before.

Regarding code style: I tried to follow the code style, but I'm naughty and used a few advanced C++ features repeatedly:

  • Destructors to make it easier to ensure everything gets cleaned up.

  • Exceptions. I don't even usually use exceptions when writing C++, and I can remove them if desired... but here they make the loading code much more succinct while still properly handling a variety of errors, ranging from API calls failing to integer overflow and allocation failure. (Edit: The exceptions are converted to error codes at the API boundary.)

Co-authored-by: Pavol Rusnak <pavol@rusnak.io> (for the bit I copied from #740)

anzz1 (Contributor) commented Apr 6, 2023

VirtualLock only locks your memory into the working set

#pragma comment (lib, "ntdll.lib")
extern "C" __declspec(dllimport) long __stdcall RtlAdjustPrivilege(unsigned long dwPrivilege, int bEnablePrivilege, int bIsThreadPrivilege, int *pbPreviosValue);

{
  int bPrev;
  unsigned long dwErr;

  dwErr = RtlAdjustPrivilege(/* SE_LOCK_MEMORY_PRIVILEGE */ 4, 1, 0, &bPrev);
  if (dwErr) {
    printf("ntdll:RtlAdjustPrivilege() failed; error code = 0x%08X\n", dwErr);
    return 1;
  }

  dwErr = RtlAdjustPrivilege(/* SE_INCREASE_QUOTA_PRIVILEGE */ 5, 1, 0, &bPrev);
  if (dwErr) {
    printf("ntdll:RtlAdjustPrivilege() failed; error code = 0x%08X\n", dwErr);
    return 1;
  }

  dwErr = RtlAdjustPrivilege(/* SE_INC_WORKING_SET_PRIVILEGE */ 33, 1, 0, &bPrev);
  if (dwErr) {
    printf("ntdll:RtlAdjustPrivilege() failed; error code = 0x%08X\n", dwErr);
    return 1;
  }
}

advapi32 version

#include <windows.h>
#include <stdio.h>

typedef struct _TOKEN_PRIVILEGES_3 {
    unsigned long PrivilegeCount;
    LUID_AND_ATTRIBUTES Privileges[3];
} TOKEN_PRIVILEGES_3, *PTOKEN_PRIVILEGES_3;

inline static int privs()
{
  TOKEN_PRIVILEGES_3 tkp;
  void* hToken;
  unsigned long dwErr;
  tkp.PrivilegeCount = 3;
  tkp.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;
  tkp.Privileges[1].Attributes = SE_PRIVILEGE_ENABLED;
  tkp.Privileges[2].Attributes = SE_PRIVILEGE_ENABLED;

  if (!OpenProcessToken((void*)-1, TOKEN_QUERY | TOKEN_ADJUST_PRIVILEGES, &hToken)) {
    dwErr = GetLastError();
    return printf("[!] advapi32:OpenProcessToken() failed; error code = 0x%08X\n", dwErr);
  }

  if (!LookupPrivilegeValueA(0, "SeLockMemoryPrivilege", &tkp.Privileges[0].Luid) || 
      !LookupPrivilegeValueA(0, "SeIncreaseQuotaPrivilege", &tkp.Privileges[1].Luid) || 
      !LookupPrivilegeValueA(0, "SeIncreaseWorkingSetPrivilege", &tkp.Privileges[2].Luid)) {
    dwErr = GetLastError();
    CloseHandle(hToken);
    return printf("[!] advapi32:LookupPrivilegeValueA() failed; error code = 0x%08X\n", dwErr);
  }

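  // Note: AdjustTokenPrivileges can return success even if it could not assign
  // every privilege; GetLastError() then reports ERROR_NOT_ALL_ASSIGNED, which
  // the check below treats as a failure.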
  AdjustTokenPrivileges(hToken, 0, (TOKEN_PRIVILEGES*)&tkp, 0, 0, 0);
  dwErr = GetLastError();
  CloseHandle(hToken);

  if (dwErr) {
    return printf("[!] advapi32:AdjustTokenPrivileges() failed; error code = 0x%08X\n", dwErr);
  }

  return 0;
}

Review thread on llama_util.h:

if (size == 0) {
    return;
}
errno = 0;
A collaborator commented:

I don't think this is needed. errno will be reset to zero after a successful IO operation.

comex (Author) replied Apr 7, 2023:

In general, successful calls leave errno as-is rather than setting it to 0. In most cases there's nevertheless no need to reset errno to 0 before calling an API, because you only check errno if the call fails, and in that case it must have set errno. fread and fwrite are weird, because the spec doesn't require them to set errno if they fail. But in practice they do set errno at least on Linux and macOS (not sure about Windows), and there's no better alternative way to get the cause of a read/write error, at least not one that's portable. (You can use ferror to portably check whether an error occurred, but not what it was.) Therefore I report errno, but reset it to 0 first so that if you're on some system where fread/fwrite don't set errno on failure, you'll at least see something like "Undefined error: 0" or "Success" – which is not too helpful but better than showing a misleading error from some past call. (Even better would be to check for errno == 0 and print something like "unknown error", but I felt it wasn't worth the lines of code.)

This explanation may become redundant if I have to switch to OS raw read/write functions to avoid unnecessary copies (still need to check this). Edit: Turns out I don't have to switch.
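For reference, a minimal sketch of the pattern being described (not the exact PR code; the function name is made up):

#include <cerrno>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <stdexcept>
#include <string>

static void read_raw(std::FILE * f, void * ptr, size_t size) {
    if (size == 0) {
        return;
    }
    // Reset errno so that a failing fread on a platform that never sets errno
    // reports "Success"/"Undefined error: 0" instead of a stale value.
    errno = 0;
    size_t got = std::fread(ptr, 1, size, f);
    if (got != size) {
        throw std::runtime_error(std::string("fread failed: ") + std::strerror(errno));
    }
}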

Another review thread on llama_util.h:

}
}

void init(void * addr) {
A collaborator commented:

Why not put this in the constructor? This is purely initializing a member variable.

comex (Author) replied:

Because the llama_mlock objects live in llama_model and are constructed before the mlock actually happens. An alternative approach that does allow it to be in the constructor would be to store it behind a unique_ptr, like I did with llama_mmap, but there wasn't any particular need for it here (unlike llama_mmap where the mapping gets transferred out of the llama_model_loader).
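For illustration, a hedged sketch of the two-phase pattern being described (simplified, hypothetical member names; not the actual llama_mlock implementation):

#include <cstddef>

struct mlock_region_sketch {
    void * addr = nullptr;
    size_t size = 0;

    // The object itself is default-constructed together with the model;
    // init() is only called later, once the mapping or allocation exists.
    void init(void * address) {
        addr = address;
    }

    // grow_to() would then mlock()/VirtualLock() the range [size, target) ...
    void grow_to(size_t target) {
        size = target;
    }
};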

prusnak (Collaborator) requested changes Apr 6, 2023:

llama.cpp permissions changed from 644 → 755

I guess this is unintended.

slaren mentioned this pull request Apr 6, 2023.
comex (Author) commented Apr 7, 2023

> llama.cpp permissions changed from 644 → 755
>
> I guess this is unintended.

will fix

comex (Author) commented Apr 7, 2023

Figured out the VirtualLock issue. RtlAdjustPrivilege doesn't help, but what does help is increasing both the minimum and maximum working set size instead of only the maximum. As per some documentation I didn't read before:

> The maximum number of pages that a process can lock is equal to the number of pages in its minimum working set minus a small overhead.
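A minimal sketch of the fix being described (not the exact PR code; the helper name is made up):

#include <windows.h>
#include <cstdio>

// Raise both the minimum and the maximum working set size by the amount we
// want to lock, then VirtualLock the region.
static bool lock_region(void * addr, size_t size) {
    HANDLE proc = GetCurrentProcess();
    SIZE_T min_ws = 0, max_ws = 0;
    if (!GetProcessWorkingSetSize(proc, &min_ws, &max_ws)) {
        return false;
    }
    // Per the documentation quoted above, the lockable amount is bounded by
    // the minimum working set size, so both limits have to grow.
    if (!SetProcessWorkingSetSize(proc, min_ws + size, max_ws + size)) {
        std::fprintf(stderr, "SetProcessWorkingSetSize failed: %lu\n", GetLastError());
        return false;
    }
    if (!VirtualLock(addr, size)) {
        std::fprintf(stderr, "VirtualLock failed: %lu\n", GetLastError());
        return false;
    }
    return true;
}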

comex (Author) commented Apr 7, 2023

Addressed all PR feedback, and checked both items off my todo list. The PR still could use testing.

ggerganov added the "high priority" label Apr 7, 2023.
KASR (Contributor) commented Apr 7, 2023

I've tried the versions/configurations listed below; the output is also attached. I tried to make some combinations using various models and data formats. All models load and produce output with no errors. Tested on Windows with CMake, VS, and PowerShell.

Let me know if I need to try something specific or run some other test.

output --> pizza_version_test.txt

./main_pizza -m ./models/65B/ggml-model-f16.bin -p "This is a long story about how programming came to be:" -n 100 -t 32 --temp 0.2 -c 2048 -s 132456 --ignore-eos --no-mmap -b 512

./main_pizza -m ./models/65B/ggml-model-q4_0.bin -p "This is a long story about how programming came to be:" -n 100 -t 24 --temp 0.2 -c 2048 -s 132456 --ignore-eos --no-mmap -b 512

./main_pizza -m ./models/7B/ggml-model-q4_0.bin -p "This is a long story about how programming came to be:" -n 100 -t 24 --temp 0.2 -c 2048 -s 132456 --ignore-eos --no-mmap

./main_pizza -m ./models/7B/ggml-model-q4_0.bin -p "This is a long story about how programming came to be:" -n 100 -t 24 --temp 0.2 -c 2048 -s 132456 --ignore-eos

./main_pizza -m ./models/7B/ggml-model-f16.bin -p "This is a long story about how programming came to be:" -n 100 -t 24 --temp 0.2 -c 2048 -s 132456 --ignore-eos

./main_pizza -m ./models/alpaca-native-enhanced-7B/ggml-model-q4_0.bin -p "This is a long story about how programming came to be:" -n 100 -t 24 --temp 0.2 -c 2048 -s 132456 --ignore-eos --no-mmap

./main_pizza -m ./models/alpaca-native-enhanced-7B/ggml-model-f16.bin -p "This is a long story about how programming came to be:" -n 100 -t 24 --temp 0.2 -c 2048 -s 132456 --ignore-eos

./main_pizza -m ./models/gpt4all-7B/gpt4all-lora-quantized.bin -p "This is a long story about how programming came to be:" -n 100 -t 24 --temp 0.2 -c 2048 -s 132456 --ignore-eos --no-mmap

./main_pizza -m ./models/gpt4all-7B/gpt4all-lora-quantized-ggjt.bin -p "This is a long story about how programming came to be:" -n 100 -t 24 --temp 0.2 -c 2048 -s 132456 --ignore-eos --no-mmap

./main_pizza -m ./models/gpt4all-7B/gpt4all-lora-quantized-ggjt.bin -p "This is a long story about how programming came to be:" -n 100 -t 24 --temp 0.2 -c 2048 -s 132456 --ignore-eos

bakkot (Contributor) commented Apr 7, 2023

@comex Do you have any particular things in mind which you think need testing?

I tested this PR using ./perplexity with

  • 7B ggml
  • 7B ggmf
  • 7B ggjt
  • 13B ggml (two part)
  • 13B ggmf (two part)
  • 13B ggjt (single part)

and confirmed they all load and give the expected results (at least for the first few chunks), that mmap is only used for the ggjt files, and that --n_parts is not necessary.

I also tested ./quantize on the two-part 13B ggmf and confirmed the resulting single file behaves as expected.

All tests on macOS arm64.

One note: I think it would be useful to print out the magic and version (or some other descriptor), since the format isn't always obvious from the file itself (unless I'm just missing it in the output somewhere).

Also it may be worth printing a warning in the original ggml case, since the broken tokenizer really degrades its quality.

ggerganov (Owner) left a review comment:

Thank you @comex for this hard work!
The PR is OK to be merged as it is, but here are some comments:

  • The mlock move from ggml to llama is great

  • I've written my opinion on the validation and error checking elsewhere: see #651 ("Fix memory bugs in loading code") (review). Regardless, this is fine

  • Not a big fan of the llama_util.h introduction. The way I see it, the mmap / mlock functionality has to be part of common and be provided optionally through the C API. The way of thinking is: "we demonstrate how one can use these extra features via the examples". I know it makes life a little bit more difficult for developers who want to use llama + mmap / mlock in their projects, but it's not really that big of an obstacle. And I think we are past the point of "there is no reason not to use mmap"

  • In the long run, I think llama_buffer can become part of llama.cpp and llama_util.h be merged into common

  • We have to add deprecation periods for old formats. For example, 2-3 weeks after introducing a new format, support for the old ones will be dropped completely. At the moment there is huge interest in the project and it's great to provide such wide support for all the formats to make life easier for non-technical people, but in the long run it's better to simplify the code

prusnak (Collaborator) commented Apr 7, 2023

There is now a trivial conflict after #812 has been merged:

(screenshot of the merge conflict)

comex (Author) commented Apr 7, 2023

Thanks.

> Not a big fan of the llama_util.h introduction. The way I see it, the mmap / mlock functionality has to be part of common and be provided optionally through the C API.

Hmm, I think you may have misunderstood my intent with llama_util.h. It's not meant to be public; it's meant to be included by llama.cpp and nothing else. The public-facing interface for mmap / mlock is through the C API, namely the use_mmap and use_mlock fields of llama_context_params. I just separated out llama_util.h in order to differentiate bits of code that are more like general-purpose helpers, compared to llama.cpp which is mostly specific to the task at hand. But I have no strong opinions about this separation; I'm fine with putting the code directly in llama.cpp instead. Alternately, I could keep it separate but add a comment clarifying that it's not a public header.

If you did understand already that it's not a public header, then I don't really understand what you're recommending, so please elaborate. :)

> And I think we are past the point of "there is no reason not to use mmap"

I'm not sure whether you mean the argument is resolved in favor of "there is no reason" or "there is a reason". I've seen both of those views expressed very strongly by different people. :)

Either way, I'll state my view:

Some of the issues people have had with mmap are solvable (alignment is a non-issue if we're going to deprecate old formats; the issue with Windows doing small-sized reads is theoretically solved with PrefetchVirtualMemory, included in this PR, though I need confirmation). But the fact that kernels are more likely to page out mmapped memory will remain a usability thorn, since the main workaround (mlock) requires elevated privileges. So I'd like to see both mmap and non-mmap paths supported indefinitely.

In the future, after getting rid of support for old formats, the non-mmap path can be rewritten to just read the entire file directly into a buffer (as opposed to doing reads tensor-by-tensor), so the amount of extra code needed to support both paths will be very small.
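As a rough illustration of what that future non-mmap path could look like (a sketch, not the PR code):

#include <cstdint>
#include <cstdio>
#include <stdexcept>
#include <vector>

// Read the entire model file into one buffer with a single large fread,
// instead of issuing one read per tensor.
static std::vector<std::uint8_t> read_whole_file(const char * path) {
    std::FILE * f = std::fopen(path, "rb");
    if (!f) { throw std::runtime_error("failed to open file"); }
    std::fseek(f, 0, SEEK_END);
    long size = std::ftell(f);
    std::fseek(f, 0, SEEK_SET);
    std::vector<std::uint8_t> buf((size_t) size);
    if (std::fread(buf.data(), 1, buf.size(), f) != buf.size()) {
        std::fclose(f);
        throw std::runtime_error("failed to read file");
    }
    std::fclose(f);
    return buf;
}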

comex requested a review from prusnak Apr 7, 2023.
comex (Author) commented Apr 7, 2023

@bakkot

> @comex Do you have any particular things in mind which you think need testing?

What you tested is great (I did some similar tests but not as thorough). Another kind of testing I'd like is from Windows users. I did some basic testing in a Windows VM, but not much. Specifically:

  • I'd like to be sure that the use of PrefetchVirtualMemory solves #705 ("Windows page fault disk I/O slow on first load"); I guess I'll leave a comment there. (If it doesn't, the --no-mmap path should still restore behavior to how it was before, but in theory the same speed should be achievable with memory mapping and its associated advantages.)

  • Regarding mlock: I did not include the RtlAdjustPrivilege calls suggested by @anzz1 because in my testing they were not needed; however, I'm not a Windows expert and I'm wondering if there are environments where they are needed.

  • Also regarding mlock: I'm wondering if there are any global settings that limit the amount of locked memory, in which case they should probably be mentioned in the error message if VirtualLock fails, similar to how it works on Linux and macOS.

> One note is, I think it would be useful to print out the magic and version (or some other descriptor), since it's not always obvious from the file itself (unless I'm just missing it in the output somewhere).
>
> Also it may be worth printing a warning in the original ggml case, since the broken tokenizer really degrades its quality.

Good points; I'll think about doing a followup with this.

ggerganov (Owner) commented:

@comex

Regarding the llama_util.h - it's mostly me obsessing over not adding extra files to the core library.
The header does a good job of separating the helpers, and I agree it is better than putting everything inside llama.cpp.

> If you did understand already that it's not a public header, then I don't really understand what you're recommending, so please elaborate. :)

The idea I have is to extend for example llama_context_params with mmap and mlock related callbacks that the user can provide and they will be optionally used in llama.cpp if provided. This way, we give the responsibility to the user of the library to detect if these are supported on the current platform and also to provide the specific implementation. Sample implementations of the callbacks will be provided in the common lib and used by the examples to demonstrate. The plus is that we will not have llama logic so tightly coupled with OS-specific system calls and I think this would be easier to maintain. The obvious drawback is that it becomes a bit of a hurdle for the developers to pass the callbacks through the interface.
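To make the brainstorm concrete, here is a purely hypothetical sketch of what such hooks could look like; none of these fields or names exist in the actual llama.h API:

#include <cstddef>

// Hypothetical callback bundle that llama_context_params could carry
// (illustration only). NULL callbacks would mean "mmap/mlock unsupported".
struct llama_file_callbacks_sketch {
    void * (*map_file)  (const char * path, size_t * size_out, void * user_data);
    void   (*unmap_file)(void * addr, size_t size, void * user_data);
    bool   (*lock_mem)  (void * addr, size_t size, void * user_data);
    void * user_data;
};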

In any case, this is mostly brainstorming about an alternative design. I don't insist on doing it now, or even later.
I think it is better to merge the proposed PR as it is, since it provides a lot of valuable functionality that will be useful atm.

> I'm not sure whether you mean the argument is resolved in favor of "there is no reason" or "there is a reason". I've seen both of those views expressed very strongly by different people. :)

My impression, based on the feedback so far, is that there are still cases where mmap is causing some troubles. So it is not so easy to argue that it has to be an always-on feature.

> So I'd like to see both mmap and non-mmap paths supported indefinitely.
>
> In the future, after getting rid of support for old formats, the non-mmap path can be rewritten to just read the entire file directly into a buffer (as opposed to doing reads tensor-by-tensor), so the amount of extra code needed to support both paths will be very small.

Sounds good. If we manage to solve the reported issues and we don't observe too many OS-specific difficulties in the future, I am totally fine to keep this tightly integrated with the library.

Pushed commits: the PR description above, plus "Also improve model type printing, and fix indentation of an unrelated switch statement."
comex (Author) commented Apr 8, 2023

> The idea I have is to extend for example llama_context_params with mmap and mlock related callbacks that the user can provide and they will be optionally used in llama.cpp if provided.

Ah, now I understand, sorry.

Anyway, the PR has been rebased on the latest master. Significant changes:

  • Updated for compatibility with #728 ("Add quantize-stats command for testing quantization").
    While I was at it, I moved the function added by that PR (llama_internal_get_tensor_map) to be exported from a new header, llama_internal.h, whereas originally it was included in llama.h but guarded under __cplusplus (because it returns a std::unordered_map). By putting it in a separate header, external users of llama.h don't have to pay the cost of including <string> and <unordered_map> just to declare a function they can't even use. (A sketch of this split appears after this list.)

  • stderr output now includes the model type along with the other parameters, as suggested by @bakkot.

  • Some fixes.
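A hedged sketch of what the header split in the first bullet could look like (the exact declaration in the PR may differ; the map's key/value types here are assumptions):

// llama_internal.h -- C++-only helper header, deliberately kept out of the
// public C API in llama.h so that C consumers never include <string> or
// <unordered_map>.
#pragma once

#include <string>
#include <unordered_map>

struct ggml_tensor;
struct llama_context;

std::unordered_map<std::string, struct ggml_tensor *> & llama_internal_get_tensor_map(struct llama_context * ctx);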

Feel free to merge if you want; I apparently can't trigger a merge myself until @slaren retracts the "changes requested" status. Thanks.

slaren (Collaborator) commented Apr 8, 2023

I think that's @prusnak, I haven't requested any changes.

prusnak dismissed their stale review Apr 8, 2023: "Suggestion addressed"

prusnak (Collaborator) commented Apr 8, 2023

> Feel free to merge if you want; I apparently can't trigger a merge myself until ...

Retracted my review since the file permissions issue has been corrected; the merge should now be possible.

anzz1 (Contributor) commented Apr 8, 2023

> Regarding mlock: I did not include the RtlAdjustPrivilege calls suggested by @anzz1 because in my testing they were not needed; however, I'm not a Windows expert and I'm wondering if there are environments where they are needed.

Seems strange that you can lock pages in memory without actually having the privilege to do so, but at the same time it doesn't really surprise me, since the Windows privilege system has always been and still is all over the place. The MSDN doc could also simply be wrong, which it has been many times in the past; and maybe if the lock policy isn't configured ("Not defined"), that actually means everyone has the privilege rather than no one. Cases like that can throw you off when doing whoami /priv, as "Disabled" in the process token can mean "not explicitly enabled" instead of "explicitly disabled". If it works, then it works, I guess.

blackhole89 (Contributor) commented:

Since Georgi himself has approved and all lingering issues appear to have been resolved, I went ahead and merged this.

rabidcopy (Contributor) commented Apr 10, 2023

Big thumbs up for this. It seems like the mmap implementation has been improved upon in general. A problem I noticed previously is that sometimes (if not often) it would "forget" that I had already loaded the model recently and would load it from scratch anyway on subsequent runs. With this merged it seems extremely consistent, and when it does load a pre-cached model, the times seem better now.
Before:

llama_print_timings:        load time =  1483.01 ms

After:

llama_print_timings:        load time =   891.55 ms

slaren added commits to slaren/llama.cpp that referenced this pull request (Apr 10–16, 2023).
CoderRC commented Apr 19, 2023

Successfully compiled the master branch and comex's pizza branch, and successfully ran ./main -m ./models/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -t 8 -n 512.
In MSYS2 with the mingw32 GCC compiler, using:
make LDFLAGS='-D_POSIX_MAPPED_FILES -lmingw32_extended' CFLAGS='-D_POSIX_MAPPED_FILES -I. -O3 -DNDEBUG -std=c11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -mfma -mf16c -mavx -mavx2' CXXFLAGS='-D_POSIX_MAPPED_FILES -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function'

If you're confused about how exactly I compiled it, read #103 (comment).
