Use case for multithreaded hashing in C #184
May I recommend benchmarking BLAKE3 with the assembly implementations? They are several times faster than the portable code on most x86 machines :)

It's unlikely we'll implement our own work-stealing multithreading strategy on top of pthreads. If we were going to add this to the C implementation, we'd probably use OpenMP. I don't have much experience with OpenMP myself, but my understanding is that it's a nontrivial dependency that not everyone would be happy to have in their build. (For example, it seems like the MSVC implementation of OpenMP targets a version of the standard that's almost 20 years old?)

It might make sense to ask, would it be cleaner to just expose C bindings for the Rust crate and link against that? I don't know of anyone doing this in production, but I was able to whip up an example just now: https://github.com/oconnor663/rust_blake3_c_bindings
Thanks for the reply. @assembly: in fact no, I want/need to compile on different platforms, sometimes even Annapurna (QNAP NAS). I am writing my own naive benchmarker (for my use case, of course) to compare crypto and non-cryptographically-strong hashes more easily. @multithread: I do not like OpenMP either (nor pthreads, really), but pthreads is just about widespread enough to work on Windows/Linux/FreeBSD without much hassle. Not perfect, but... it works. It does NOT work well for C++, in fact, but BLAKE3 is C. No ifdefs, no compile-time checks, etc. OpenMP is just like kudzu.
Yeah, that matches my impressions. Another thing to keep an eye out for here: one of the projects near the top of my list is to build an alternative concurrency model for the Rust implementation, alongside the recursive divide-and-conquer approach enabled by Rayon's work-stealing `join`. If we make that work in Rust, we could also consider porting that approach to C. I think it can be done without work-stealing, which might make it less onerous for us to implement and for callers to build? But I'm not totally sure. I might need to get advice here from folks with more C experience than I have.
It is very important to AVOID AT ALL COSTS memory-mapped files (mandatory for wyhash, for example). This is a snippet of BLAKE3 "chunked" hashing of a file, C style:
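(A sketch of that pattern with the official `blake3.h` API; the 64 KB buffer size and the minimal error handling are illustrative.)

```c
#include <stdint.h>
#include <stdio.h>
#include "blake3.h"

// Hash a file in fixed-size chunks with fread(), no memory mapping.
int hash_file(const char *path, uint8_t out[BLAKE3_OUT_LEN]) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    blake3_hasher hasher;
    blake3_hasher_init(&hasher);
    unsigned char buf[65536];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0) {
        blake3_hasher_update(&hasher, buf, n);
    }
    fclose(f);
    blake3_hasher_finalize(&hasher, out, BLAKE3_OUT_LEN);
    return 0;
}
```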
I currently use pthreads to run N parallel threads (one per core), each one processing a different file (obviously no "spinning rust" here :). In the normal world (SSD), even very heavy algorithms (such as SHA-256 and BLAKE3 "straight") are more than enough for the available bandwidth (~550 MB/s) with a couple of cores (~250 MB/s per core for BLAKE3 and ~280 for SHA-256). But I deal with high-performance systems, where the extensive use of large RAMDISKs to hold fileservers and MySQL tablespaces allows minimal latencies and very high bandwidths (20+ GB/s), in which case I am forced to use XXH3, which is good, but maybe not as reliable as BLAKE3. I don't know Rust well; I should study the original source for a possible port to C.
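(A minimal sketch of that setup: one pthread per file, reusing `hash_file()` from the snippet above; the thread cap and error handling are illustrative.)

```c
#include <pthread.h>
#include <stdint.h>
#include "blake3.h"

int hash_file(const char *path, uint8_t out[BLAKE3_OUT_LEN]);  // see above

// One worker per file: each thread hashes its own file independently,
// so no locking is needed.
typedef struct {
    const char *path;
    uint8_t out[BLAKE3_OUT_LEN];
} job_t;

static void *worker(void *arg) {
    job_t *job = (job_t *)arg;
    hash_file(job->path, job->out);
    return NULL;
}

// Launch one thread per file (up to a fixed cap), then join them all.
int hash_files_parallel(const char **paths, int count, job_t *jobs) {
    pthread_t threads[64];
    if (count > 64) count = 64;
    for (int i = 0; i < count; i++) {
        jobs[i].path = paths[i];
        pthread_create(&threads[i], NULL, worker, &jobs[i]);
    }
    for (int i = 0; i < count; i++)
        pthread_join(threads[i], NULL);
    return count;
}
```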
I'd like an improved C version too ---> depending on Rust just for a hash function is a no-go.
To be clear, the BLAKE3 library implementations themselves will never perform memory mapping. They just consume in-memory buffers, and the source and nature of those buffers is up to the caller. The `b3sum` command-line tool does use memory mapping by default, but that's a choice made by the tool, not by the library.
Some back-of-the-envelope thoughts about this: figure 4 from the BLAKE3 paper shows multithreaded BLAKE3 throughput on a beefy AWS server, with the current Rayon-based work-stealing implementation. Peak reliable throughput on that machine is about 80 GB/s using 16 threads. You can go higher with more threads, but you start to exhaust memory bandwidth, and the results start to get weird. However, if you ditched the AVX-512 implementation for the portable one, you could probably keep adding threads without hitting those limits, to make up for the reduced per-core throughput. You'd also avoid AVX-512-related downclocking in that case.
@alterstep @fcorbelli maybe you could say more about what sort of interfaces/architectures you're interested in, since there are a lot of options here:

1. Something based on an off-the-shelf thread pooling library like OpenMP.
2. A bespoke thread pool internal to the BLAKE3 library.
3. A lower-level library API that allows the caller to "bring their own parallelism".
Personally, (2) scares me. (1) is a little less scary, but my main worry is that the set of people who can tolerate that new dependency, but who can't tolerate calling into Rust, is pretty slim. I'm happiest about (3), but of course that's the least work for me, and I don't know whether anyone else would be happy with it :) Incidentally, the Rust implementation already exposes some very minimal APIs in the shape of (3), to support the Bao project. Optimizing those better and stabilizing them might be desirable either way. See the undocumented `guts` module in the Rust crate.
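(To make (3) concrete, here is a hypothetical C-side sketch. These function names are invented for illustration and are NOT part of `blake3.h`; only the Rust crate's `guts` module exposes anything like them today.)

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical low-level API in the spirit of option (3). The caller splits
// the input on chunk/subtree boundaries, hashes subtrees on its own threads,
// and then joins the resulting chaining values up the tree.

// Hash one 1024-byte chunk (the last chunk may be shorter) at the given
// chunk index, producing a 32-byte chaining value.
void blake3_chunk_cv(const void *input, size_t input_len,
                     uint64_t chunk_counter, uint8_t cv_out[32]);

// Combine two child chaining values into a parent chaining value.
void blake3_parent_cv(const uint8_t left_cv[32], const uint8_t right_cv[32],
                      uint8_t cv_out[32]);

// Finalize the root node. Root finalization is special in BLAKE3, so it
// needs its own entry point rather than reusing blake3_parent_cv.
void blake3_root_hash(const uint8_t left_cv[32], const uint8_t right_cv[32],
                      uint8_t *out, size_t out_len);
```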
(3) would be good. If the code is called from Swift, Nim, Zig, or Kotlin, existing native scheduling mechanisms will be compatible.
For me:

1. Something based on an off-the-shelf thread pooling library like OpenMP.
2. A bespoke thread pool internal to the BLAKE3 library.
3. A lower-level library API that allows the caller to "bring their own parallelism".

I am referring to the experience with other hashes ported to various systems and languages: in 99.9% of cases they are included in the projects (merged into the source code), and therefore materially and statically linked. Virtually no one calls a Rust (or whatever) library from anything else to compute a trivial hash that isn't a Rust (or whatever) program. The functions really needed are half a dozen: no API is always better than an API that (almost) no one will use. Sad but true.

Therefore, what I would consider the solution to "cover" the C/C++ world is a version of BLAKE3 in C that relies on something very common (pthreads) and requires nothing more than an #include, maybe a #define MAX_THREADS_COUNT, and a -pthread in the makefile. That's all. Anything more "weird", which would require me as a developer days or weeks of studying and tweaking and tinkering, would simply get discarded in favor of other hashes that I can get working in my program in half an hour.

This opinion of mine is not necessarily perfect; it is a compromise between those who want to use a hash (which is a simple minimal cog in far more complex programs) and those who perhaps have to write university papers and can work on it for months.
I might not be reading you right here, so just in case: the low-level API I'm proposing here for (3) would probably get implemented separately for both Rust and C. C callers who wanted to use it would not need to depend on Rust.
If I understand right, a low-level API that needs to be called by "something" (the C software) to implement the multithreading. Very quick and very dirty benchmark on an AMD 5950X, Windows 10, 400,000-byte buffers, gcc 7.3.0 in "straight" C++ mode.
In fact something like this (just a mockup):
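(A reconstruction of the shape I have in mind; the function name and signature are invented for illustration, not part of `blake3.h`.)

```c
#include "blake3.h"

/* Hypothetical, NOT part of the real blake3.h: same calling convention as
 * blake3_hasher_update(), plus a caller-chosen thread count, capped at a
 * compile-time maximum. */
#define MAX_THREADS_COUNT 32

void blake3_hasher_update_mt(blake3_hasher *self, const void *input,
                             size_t input_len, int num_threads);
```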
That's all.
I want to point out a few issues with the API you've sketched here, not just to nitpick this example, but to try to clarify that doing multithreading well raises important questions that different callers tend to have different opinions about.
A lot of questions come up here. A hybrid possibility is that this project might pursue (3) in its stable API, but we could separately publish one or more higher-level wrappers on top of that API, which could double both as doc examples and as convenience APIs for folks who are happy with common defaults? It might make sense to just see what sticks once all this is halfway working.
@buffer: the results differ between a 64K buffer and a 1MB buffer. The why would take long to explain; essentially it depends on what is being read from the filesystem, and how.
@fread.
@mmap: mmap is "evil" because of the humongous overhead it requires.
How is a hash used in the "real world" (not crypto or message exchange, but data storage)?
This is a "chunked" example
with a "chunked" (like fread) in the first part, and a rolling SHA-1 in the latter In short, the uses are many (and these are just mine) so I think I can summarize like this
Which is exactly how BLAKE3 works in C. It doesn't really matter what the mechanism is inside the two functions (say put() 1 byte and write() a buffer). A plain-old SHA-1 runs like that too (just to have a term of comparison; as mentioned, synthetic benchmarks are not interesting outside academia).
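(In code, the whole interface those use cases need is something like this thin wrapper over `blake3.h`; the wrapper names are mine, just for illustration.)

```c
#include <stddef.h>
#include "blake3.h"

typedef struct {
    blake3_hasher h;
} hash_ctx;

static void hash_init(hash_ctx *c) { blake3_hasher_init(&c->h); }

/* put() one byte */
static void hash_put(hash_ctx *c, unsigned char byte) {
    blake3_hasher_update(&c->h, &byte, 1);
}

/* write() a whole buffer */
static void hash_write(hash_ctx *c, const void *buf, size_t n) {
    blake3_hasher_update(&c->h, buf, n);
}
```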
And this is XXH3 (just as an order of magnitude).
Finally, the short version: how do we get a fast straight-C (no SSE or whatever) implementation of BLAKE3, faster than the current one (~same speed as SHA-256) and faster than SHA-1? Very rough estimate for a single core on a 512K buffer.
As far as I know, BLAKE3 was designed to exploit as much parallelism as possible, either within a single thread by using the processor's SIMD instructions, or with multiple threads (each thread also using the processor's SIMD instructions). By completely avoiding SIMD instructions you are handicapping yourself, and it's already impressive that it's as fast as SHA-256.

One issue is the size of the compression function state. BLAKE3 uses a 4x4 matrix of 32-bit values, and the C compiler will try to allocate one register for each of these 16 values. If the architecture is register-starved (like x86-64, which has only 16 general-purpose registers, some of which already have other uses), this leads to lots of spilling to the stack. A quick look at Wikipedia tells me that SHA-256 uses only eight 32-bit values, while SHA-1 uses only five.

When using SIMD registers, on the other hand, each row of the 4x4 matrix fits in a single register, and a single instruction can operate on the whole row. The BLAKE3 compression function was carefully designed to allow for that; the diagonal operations can be done by rotating each row (through a SIMD shuffle instruction) to align the columns, and later rotating them back. It's more than a 4x speedup in one sense, since there's no longer spilling to the stack, but less than 4x in another, since there are extra instructions for rotating and gathering the input into SIMD registers; I'd expect the final speedup from using SIMD to be around 4 times. And that's just for 128-bit (4x32-bit) SIMD; more gains are possible with larger SIMD registers (by processing more than one block in parallel).

But it might be possible to make the "straight-C" implementation (I presume it's https://github.com/BLAKE3-team/BLAKE3/blob/1.0.0/c/blake3_portable.c) a bit faster. One surprising result I found while playing with SIMD and pseudo-SIMD in the early Rust era (soon after Rust 1.0), in my https://github.com/cesarb/blake2-rfc crate, was that the order of operations matters. Instead of doing each of the G operations completely, they should be interleaved: the first line of G for the four columns, then the second line of G for the four columns, and so on; that is, precisely the order it would be done in with SIMD instructions, even if you're not using them. This allows processors with lots of out-of-order capability (like most x86-64 processors) to do more work in parallel, since there are more independent instructions between dependent instructions. As a bonus, a smart enough compiler can notice what you're doing and actually use a few SIMD instructions under the hood. However, doing that would mean restructuring the round function of the portable implementation.
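(To illustrate the point, a sketch of the idea rather than a drop-in patch for blake3_portable.c: the column step of one round, with each line of G run across all four columns before moving to the next line, matching the SIMD order. The rotation constants 16/12/8/7 and the column layout are from the BLAKE3 spec.)

```c
#include <stdint.h>

static inline uint32_t rotr32(uint32_t w, uint32_t c) {
    return (w >> c) | (w << (32 - c));
}

// Column step of one round with the G operations interleaved: each "line"
// of G runs across all four columns before moving on, exposing four
// independent dependency chains to the out-of-order core.
// v is the 16-word state; m points at the 8 message words for this step.
static void column_step_interleaved(uint32_t v[16], const uint32_t m[8]) {
    for (int i = 0; i < 4; i++) v[i] = v[i] + v[4 + i] + m[2 * i];
    for (int i = 0; i < 4; i++) v[12 + i] = rotr32(v[12 + i] ^ v[i], 16);
    for (int i = 0; i < 4; i++) v[8 + i] = v[8 + i] + v[12 + i];
    for (int i = 0; i < 4; i++) v[4 + i] = rotr32(v[4 + i] ^ v[8 + i], 12);
    for (int i = 0; i < 4; i++) v[i] = v[i] + v[4 + i] + m[2 * i + 1];
    for (int i = 0; i < 4; i++) v[12 + i] = rotr32(v[12 + i] ^ v[i], 8);
    for (int i = 0; i < 4; i++) v[8 + i] = v[8 + i] + v[12 + i];
    for (int i = 0; i < 4; i++) v[4 + i] = rotr32(v[4 + i] ^ v[8 + i], 7);
}
```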
In fact, on Intel CPUs, BLAKE3 is faster than SHA-256 (much faster, maybe 50%) and "on par" on AMD.
It depends heavily on the case. I have already done tests with various hashes (including BLAKE3 "straight") with one or 32 threads, from RAMDISK or cache, on Windows 10. mmap doesn't just "magically" map data: it has to read it from disk (= slower) and has to handle the situation where the file is larger than RAM (or rather than the address space of the process), = more and more overhead. I don't know if the measurements I made could be interesting.
Just for fun:
- BLAKE3 on "normal" files
- BLAKE3 on memory mapped
- SHA1 on file
- SHA1 on memory mapped
- XXH3 on file
- XXH3 on memory mapped
Absolute performance is not important; what matters is the comparison of the same algorithm hashing from a memory-mapped file versus from read(). As you can see, at least for a fairly heterogeneous mix of files, the operating-system overhead is not insignificant, up to 50%, with 32 running threads. BLAKE3 on file, SINGLE thread, versus BLAKE3 on memory mapped, SINGLE thread: same speed.
Any progress / estimated date of implementation 😊 on this matter? Has a decision already been made regarding variants 1/2/3? For me, the order of preference is quite clear.
No progress to report. My current priority (apart from teaching at NYU) is fixing the poor performance of b3sum on spinning disks, by implementing a different multithreading strategy. That might ultimately involve some overlap with this thread's option #3 (a stable low-level API that different threading strategies can sit on top of), but that remains to be seen.
Does anyone have any experience with the following project? https://github.com/Blaze-3/BLAKE3-gpu (the OpenMP version, not the GPU-accelerated one)
Do you have any Rust vs C performance comparisons? I wanted to find out the performance benefits of the multithreaded version, so I compared the C version, Rust single-threaded, and Rust multithreaded (update_rayon) on a series of 4-128 MB chunks (in-memory byte arrays via JNI), Windows 10 64-bit, 6-core Xeon, built with -O3. And the winner is C single-threaded: 100% (processing time as the reference value).
Have you tested the HW-accelerated (SSE) BLAKE3?
@kangert Rust and C single-threaded should be using the same assembly implementations and should give nearly identical performance, so I wonder whether there could be an issue with how your bindings/benchmarks are getting compiled? If you push a version to GitHub, I'd be happy to take a look. For getting a rough idea of how well multithreading works on any particular platform, I often reach for the Python bindings, which are based on the Rust implementation and which expose a threading flag.
@fcorbelli I used the code of this project (default settings, which should use all instruction sets available on a particular CPU).
@oconnor663 My tests are based on https://github.com/sken77/BLAKE3jni (that is, on an existing copy of it, https://github.com/semantalytics/BLAKE3jni). So the Java part is essentially the same; maybe I'm doing something wrong in passing the input to Rust (which, to put it mildly, I'm really not an expert on). It's basically two lines of code, converting the input from a Java byte array into something Rust can read.
@oconnor663 I should also add that I used
I can see the problem here. As I found out when trying to use the GPU to accelerate BLAKE3, this hash is so fast that it's often limited by the speed of the memory. Doing any extra copying of the data in memory can be enough to make the total time a lot higher. You are using a JNI helper that copies the whole input array into a fresh Rust buffer before hashing it. The C code at https://github.com/semantalytics/BLAKE3jni, on the other hand, is using a different JNI API that hashes the Java array in place, without that copy. That should be enough to explain the speed difference. To match the C code, you should either use the same in-place JNI API from Rust, or otherwise avoid the extra copy.
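(For reference, the zero-copy pattern on the C side looks roughly like this. A sketch: the Java class and method names are placeholders, and no other JNI calls are allowed between Get and Release.)

```c
#include <jni.h>
#include "blake3.h"

// Hash a Java byte[] in place, without copying it out of the JVM heap.
// GetPrimitiveArrayCritical usually pins the array instead of copying it.
JNIEXPORT jbyteArray JNICALL
Java_Blake3_hash(JNIEnv *env, jclass cls, jbyteArray input) {
    jsize len = (*env)->GetArrayLength(env, input);
    jbyte *data = (*env)->GetPrimitiveArrayCritical(env, input, NULL);

    uint8_t out[BLAKE3_OUT_LEN];
    blake3_hasher hasher;
    blake3_hasher_init(&hasher);
    blake3_hasher_update(&hasher, data, (size_t)len);
    blake3_hasher_finalize(&hasher, out, BLAKE3_OUT_LEN);

    // JNI_ABORT: we only read the array, so nothing needs copying back.
    (*env)->ReleasePrimitiveArrayCritical(env, input, data, JNI_ABORT);

    jbyteArray result = (*env)->NewByteArray(env, BLAKE3_OUT_LEN);
    (*env)->SetByteArrayRegion(env, result, 0, BLAKE3_OUT_LEN,
                               (const jbyte *)out);
    return result;
}
```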
@cesarb Much better.
From a quick C/C++ implementation of BLAKE3 in my ZPAQ fork archiver, I found modest performance (file hashing), similar to SHA-256 (~300 MB/s) on a single thread (just C code, no ASM).
Indeed good for a cryptographic hash, but the inability to use multithreading (which in my opinion is the real advantage of BLAKE3) leaves me unsatisfied.
The use is twofold: integrity check of archived files and deduplication of documents from filesystems
Are there any plans for this (multithreading in C/C++) via pthreads or whatever?
Thank you for any reply.