Use case for multithreaded hashing in C #184

Open
fcorbelli opened this issue Jul 26, 2021 · 34 comments

@fcorbelli

From a quick C/C++ integration of BLAKE3 in my fork of the ZPAQ archiver, I found modest file-hashing performance, similar to SHA-256 (~300 MB/s) on a single thread (portable C code, no ASM).

Indeed good for a cryptographic hash, but the inability to use multithreading (which in my opinion is the real advantage of BLAKE3) leaves me unsatisfied.

The use case is twofold: integrity checks of archived files and deduplication of documents on filesystems.

Are there any plans for this (multithreading in C/C++) via pthreads or similar?

Thanks in advance for any reply.

@oconnor663
Member

May I recommend benchmarking BLAKE3 with the assembly implementations? They are several times faster than the portable code on most x86 machines :)

It's unlikely we'll implement our own work-stealing multithreading strategy on top of pthreads. If we were going to add this into the C implementation, we'd probably use OpenMP. I don't have much experience with OpenMP myself, but my understanding is that it's a nontrivial dependency that not everyone would be happy to have in their build. (For example, it seems like the MSVC implementation of OpenMP targets a version of the standard that's almost 20 years old?)

It might make sense to ask, would it be cleaner to just expose C bindings for the Rust crate and link against that? I don't know of anyone doing this in production, but I was able to whip up an example just now: https://github.com/oconnor663/rust_blake3_c_bindings

@fcorbelli
Author

Thanks for the reply.

@assembly: In fact no, I want/need to compile on different platforms, sometimes even Annapurna (QNAP NAS). I am writing my own naive benchmarker (for my use case, of course) to compare more easily against other cryptographic and non-cryptographic hashes.

@multithread: I do not like OpenMP either (nor pthreads), but pthreads is just about widespread enough to work on Windows/Linux/FreeBSD without much hassle. Not perfect, but... it works. In fact it does NOT work for C++, but BLAKE3 is C. No ifdefs, no compile-time checks, etc. OpenMP is just like kudzu.
@rust: No, for the very same reason: I want/need a single .cpp file that will compile without extra libraries, linking, strange assembly, makefiles, or any kind of #define or other "tricks".

@oconnor663
Member

oconnor663 commented Jul 27, 2021

Yeah that matches my impressions. Implementing rayon::join ourselves on top of pthreads seems like a ton of complexity, and having a bespoke pool of workers that no other library can share also seems like a drag. But if anyone knows of a "just like rayon::join but in C with pthreads" library project out there, please let me know.

Another thing to keep an eye out for here: One of the projects near the top of my list is to build an alternative concurrency model for the Rust implementation. The recursive divide-and-conquer approach enabled by rayon::join is fabulous for hashing memory, but when your memory is actually an mmapped file backed by a spinning disk, all that seeking thrashes the disk and gives us horrible IO performance (#31). A different approach here would be to have one thread reading large chunks of input "from the front" and farming those out to worker threads over channels. The threading overhead will be a bit higher, but for large enough inputs we could smooth that out, and the result would hopefully be almost as fast but without the thrashing problem.

If we make that work in Rust, we could also consider porting that approach to C. I think it can be done without work-stealing, which might make it less onerous for us to implement and for callers to build? But I'm not totally sure. I might need to get advice here from folks with more C experience than I have.

@fcorbelli
Author

It is very important to AVOID at ALL COSTS memory-mapped files (mandatory for wyhash, for example).

This is a snippet of BLAKE3 "chunked" hashing of a file, C style:

    FILE* inFile = freadopen(i_filename);
    if (inFile == NULL)
        return "";

    blake3_hasher hasher;
    blake3_hasher_init(&hasher);
    unsigned char buffer[65536];
    ssize_t readSize;

    while ((readSize = fread(buffer, 1, sizeof(buffer), inFile)) > 0)
        blake3_hasher_update(&hasher, buffer, readSize);
    fclose(inFile);

    uint8_t output[BLAKE3_OUT_LEN];
    blake3_hasher_finalize(&hasher, output, BLAKE3_OUT_LEN);

I currently use pthreads to run N parallel threads (one per core), each one processing a different file (obviously no "spinning rust" here :)
Very, very quick and dirty: given X files to be worked on, create N different vectors, each with X/N files.
Run the N threads, join, et voilà.
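
A minimal sketch of that pattern (my own illustration, not the zpaqfranz code), using the official BLAKE3 C API plus pthreads, with the naive static split described above; NUM_THREADS and the reduced error handling are simplifications:

    /* Sketch only: one worker per core, each hashing its own slice of the
       file list with a private blake3_hasher. No work-stealing, so a slice
       full of big files can finish long after the others. */
    #include <pthread.h>
    #include <stdio.h>
    #include "blake3.h"

    #define NUM_THREADS 4

    typedef struct {
        const char **files; /* this thread's slice of the file list */
        size_t count;
    } job_t;

    static void hash_one_file(const char *path) {
        FILE *f = fopen(path, "rb");
        if (f == NULL) return;
        blake3_hasher hasher;
        blake3_hasher_init(&hasher);
        unsigned char buffer[65536];
        size_t n;
        while ((n = fread(buffer, 1, sizeof(buffer), f)) > 0)
            blake3_hasher_update(&hasher, buffer, n);
        fclose(f);
        uint8_t output[BLAKE3_OUT_LEN];
        blake3_hasher_finalize(&hasher, output, BLAKE3_OUT_LEN);
        /* store or print 'output' here */
    }

    static void *worker(void *arg) {
        job_t *job = (job_t *)arg;
        for (size_t i = 0; i < job->count; i++)
            hash_one_file(job->files[i]);
        return NULL;
    }

    int main(int argc, char **argv) {
        const char **files = (const char **)(argv + 1);
        size_t total = (size_t)(argc - 1);
        size_t per = (total + NUM_THREADS - 1) / NUM_THREADS; /* ceil(total/N) */
        pthread_t threads[NUM_THREADS];
        job_t jobs[NUM_THREADS];
        for (int k = 0; k < NUM_THREADS; k++) {
            size_t start = (size_t)k * per;
            if (start > total) start = total;
            size_t left = total - start;
            jobs[k].files = files + start;
            jobs[k].count = left < per ? left : per;
            pthread_create(&threads[k], NULL, worker, &jobs[k]);
        }
        for (int k = 0; k < NUM_THREADS; k++)
            pthread_join(threads[k], NULL);
        return 0;
    }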

In the normal world (SSDs), even very heavy algorithms (such as SHA-256 and "straight" BLAKE3) are more than enough for the available bandwidth (~550 MB/s) with a couple of cores (~250 MB/s per core for BLAKE3 and ~280 for SHA-256).

But I deal with high-performance systems, where the extensive use of large RAMDISKs to hold fileservers and MySQL tablespaces allows minimal latencies and very high bandwidths (~20+ GB/s). In that case I am forced to use XXH3, which is good, but maybe not as reliable as BLAKE3.

I don't know Rust well; I would have to study the original source for a possible port to C.
However, that is a bit too time-expensive for my current needs.
Just for fun I am developing a time-limited benchmark of the "straight" C BLAKE3.

@alterstep

I'd like an improved C version too ---> depending on Rust just for a hash function is a no-go.

@oconnor663
Member

It is very important to AVOID at ALL COSTS memory-mapped files

To be clear, the BLAKE3 library implementations themselves will never perform memory mapping. They just consume in-memory buffers, and the source and nature of those buffers is up to the caller. The b3sum command line does mmap by default when it can.

But I deal with high-performance systems, where the extensive use of large RAMDISKs to hold fileservers and MySQL tablespaces allows minimal latencies and very high bandwidths (~20+ GB/s)

Some back-of-the-envelope thoughts about this: Here's figure 4 from the BLAKE3 paper.

[Figure 4 from the BLAKE3 paper: multithreaded BLAKE3 throughput vs. input length, one curve per thread count, measured on an AWS server]

This shows multithreaded BLAKE3 throughput on a beefy AWS server, with the current Rayon-based work-stealing implementation. Peak reliable throughput on that machine is about 80 GB/s using 16 threads. You can go higher with more threads, but you start to exhaust memory bandwidth, and the results start to get weird. However, if you ditched the AVX-512 implementation for the portable one, you could probably keep adding threads without hitting those limits, to make up for the reduced per-core throughput. You'd also avoid AVX-512-related downclocking in that case.

@oconnor663
Member

oconnor663 commented Jul 27, 2021

I'd like an improved C version too ---> depending on Rust just for a hash function is a no-go.

@alterstep @fcorbelli maybe you could say more about what sort of interfaces/architectures you're interested in, since there are a lot of options here:

  1. Something based on an off-the-shelf thread pooling library like OpenMP. From my perspective the main benefit to this approach is less complexity in the BLAKE3 library, and from the caller's perspective the main benefit is sharing resources and configuration with other parallel tasks. Of course the downside we've discussed above is a thorny dependency. This is what the Rust implementation does, with its optional dependency on Rayon (which thankfully doesn't cause any build issues).
  2. A bespoke thread pool internal to the BLAKE3 library. This would be a lot of complexity in the BLAKE3 library, and it would introduce dependencies on at least pthreads and C11 atomics, which might still be problematic for some compilers. It also wouldn't allow for any resource sharing, and configuration/tuning options would be very limited. But it would at least avoid the thorny dependency on OpenMP or similar.
  3. A lower-level library API that allows the caller to "bring their own parallelism". We could provide advanced functions (with appropriately scary names) to let you compute internal subtree hashes, so that you can farm out many subtrees to many threads. This would be lightweight for the BLAKE3 library itself, since it involves no new functionality, just some new APIs and some thinking about how to make them less error-prone. It would be more work for callers (involving some nontrivial understanding of BLAKE3 implementation details), but it would give callers the maximum possible amount of control over their parallelism.

Personally, (2) scares me. (1) is a little less scary, but my main worry is that the set of people who can tolerate that new dependency, but who can't tolerate calling into Rust, is pretty slim. I'm happiest about (3), but of course that's the least work for me, and I don't know whether anyone else would be happy with it :)

Incidentally, the Rust implementation already exposes some very minimal APIs in the shape of (3), to support the Bao project. Optimizing and stabilizing those might be desirable either way. See the guts module, which is hidden from docs by default:

    RUSTDOCFLAGS='-Z unstable-options --document-hidden-items' cargo +nightly doc
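
To make (3) concrete, here is a purely hypothetical sketch of what such a C surface could look like. None of these names exist in the C library today, and a real design would have to enforce BLAKE3's tree rules (subtrees must start on chunk boundaries and cover a power-of-two number of 1024-byte chunks):

    #include <stddef.h>
    #include <stdint.h>

    /* HYPOTHETICAL: hash one subtree of the input. 'chunk_counter' is the
       index of the subtree's first 1024-byte chunk within the whole input;
       the output is a 32-byte chaining value, not a final hash. */
    void blake3_subtree_cv(const void *input, size_t len,
                           uint64_t chunk_counter, uint8_t cv_out[32]);

    /* HYPOTHETICAL: combine two child chaining values into their parent's
       chaining value, finalizing as the root hash when is_root is set. */
    void blake3_parent_cv(const uint8_t left_cv[32], const uint8_t right_cv[32],
                          int is_root, uint8_t cv_out[32]);

A caller would split the input at a legal boundary, compute each half's chaining value on its own thread (recursively, if desired), and merge them, keeping all scheduling decisions on the caller's side.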

@alterstep

(3) would be good. If the code is called from Swift, Nim, Zig, Kotlin, etc., existing native scheduling mechanisms will be compatible.

@fcorbelli
Author

For me

1. Something based on an off-the-shelf thread pooling library like OpenMP.
No. Too complex to make it work

2. A bespoke thread pool internal to the BLAKE3 library
Obviously yes

3. A lower-level library API that allows the caller to "bring their own parallelism".
Obviously not. Almost no one will call the API of such a small open-source project (a hash) and build their own super-complex multithreaded interface on top of it. It is not something like a 3D API for videogames that I can study for months.

I am referring to experience with other hashes ported to various systems and languages: in 99.9% of cases they are included in the projects (merged into the source code) and therefore statically linked.

Virtually no one calls a Rust (or whatever) library from anything else to compute a trivial hash that isn't a Rust (or whatever) program.

The functions really needed are half a dozen: no API is always better than an API that (almost) no one will use. Sad but true.

Therefore, what I would consider the solution to "cover" the C/C++ world is a version of BLAKE3 in C that relies on something very common (pthreads) and requires nothing more than an #include, maybe a #define MAX_THREADS_COUNT, and a -pthread in the makefile. That's all.

Anything more "weird", which would cost me as a developer days or weeks of studying, tweaking, and tinkering, would simply get discarded in favor of other hashes that I can get working in my program in half an hour.

This opinion of mine is not necessarily right; it is a compromise between those who want to use a hash (a simple small cog inside much more complex programs) and those who perhaps write university papers and can work on it for months.

@oconnor663
Member

Virtually no one calls a Rust (or whatever) library from anything else to compute a trivial hash that isn't a Rust (or whatever) program.

I might not be reading you right here, so just in case: the low-level API I'm proposing for (3) would probably get implemented separately for both Rust and C. C callers who wanted to use it would not need to depend on Rust.

@fcorbelli
Author

If I understand right, this is a low-level API that needs to be called by "something" (the C software) to implement the multithreading.
Ahem... no.
No, because if I must read a bunch of documentation for the API and write a (rather) complex piece of multithreaded software that calls the APIs (to get a multithreaded hash)... I will simply stick with XXH3 (~7 GB/s on a single fast core) or SHA-256 (if I need something reliable).
And that would be a pity, because BLAKE3 seems promising.

Very quick and very dirty benchmark on an AMD 5950X, Windows 10, 400.000-byte buffers, gcc 7.3.0 in "straight" C++ mode:

      BLAKE3:   278.62 MB/s (done     1.36 GB)
     SHA-256:   282.75 MB/s (done     1.38 GB)
       SHA-1:   894.24 MB/s (done     4.37 GB)
    XXHASH64:     6.07 GB/s (done    30.25 GB)
        XXH3:     7.00 GB/s (done    34.86 GB)
     CRC-32C:     7.06 GB/s (done    35.19 GB)
      WYHASH:     8.64 GB/s (done    43.08 GB)
      CRC-32:     9.01 GB/s (done    44.88 GB)

@fcorbelli
Author

In fact, something like this (just a mockup):

    blake3_hasher hasher;
    blake3_hasher_init(&hasher, 16);  // <---- 16 threads
    unsigned char buffer[65536];
    ssize_t readSize;

    while ((readSize = fread(buffer, 1, sizeof(buffer), inFile)) > 0)
        blake3_hasher_update(&hasher, buffer, readSize);
    fclose(inFile);

    uint8_t output[BLAKE3_OUT_LEN];
    blake3_hasher_finalize(&hasher, output, BLAKE3_OUT_LEN, 16);  // <<<--- 16 threads

That's all

@oconnor663
Member

I want to point out a few issues with the API you've sketched here, not just to nitpick this example, but to try to clarify that doing multithreading well raises important questions that different callers tend to have different opinions about:

  • The buffer here is too small to make good use of multiple threads. In my experience on x86 machines, 128 KiB is the bare minimum buffer size to benefit at all from multithreading. (At least, the way it's done currently in the Rust implementation.) For 16 threads, you might notice in the graph above that throughput isn't much better than 8 threads until the inputs are well above 1 MiB. These buffers are too large for the stack and need to be allocated on the heap.
  • The fread is a bottleneck here. What you'll see if you try to benchmark code like this is that with a large enough buffer you'll get good parallelism inside of blake3_hasher_update, but all those threads have to go to sleep every time the main thread refills the buffer, and your overall throughput won't be good. Even if fread were zero-cost, putting all your threads to sleep is costly in and of itself, because among other things it means that everything needs to wait for the slowest worker to finish. The more threads you have, the worse this bottleneck gets.
  • As far as I know, there are basically two directions you can go with this. One is to mmap the whole file, and pass it in as a single call to blake3_hasher_update. That avoids problematic bottlenecks in your own code, in exchange for all the caveats that come with mmap. The other direction is to turn blake3_hasher_update into a non-blocking call that fires off "background tasks" of some kind to do the actual work. Since these tasks can't take ownership of buffer, ideally you'd replace the whole fread-update loop with a single call to something like blake3_update_from_file with an internal read loop of its own, to avoid extra copies. (A sketch of this second direction follows below.)

By now a lot of questions are coming up: If we don't want to use mmap, can we assume that everything is a FILE*, or do we need to come up with an abstraction for other types of streams? Is anyone going to need to configure thread niceness? If BLAKE3 took opinionated stances on these questions, the end result might be a lot like "just use mmap internally". (Which is to say, inevitably a bad fit for your use case or someone else's.) On the other hand I'm not an expert in this stuff, and there are probably other strategies here that I'm not thinking of.
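
Here is that sketch: a minimal illustration (mine, not a proposed library API) of the second direction, with a reader thread and two heap-allocated buffers, so the hasher consumes one buffer while the next is being filled. Hashing itself stays single-threaded here; this only hides the read latency:

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include "blake3.h"

    #define BUF_SIZE (1 << 20) /* 1 MiB per buffer, heap-allocated below */

    typedef struct {
        unsigned char data[2][BUF_SIZE];
        size_t len[2];
        int ready[2]; /* slot filled by the reader, waiting to be hashed */
        FILE *file;
        pthread_mutex_t lock;
        pthread_cond_t cond;
    } pipe_t;

    static void *reader(void *arg) {
        pipe_t *p = (pipe_t *)arg;
        for (int slot = 0;; slot ^= 1) {
            pthread_mutex_lock(&p->lock);
            while (p->ready[slot]) /* wait until this slot was hashed */
                pthread_cond_wait(&p->cond, &p->lock);
            pthread_mutex_unlock(&p->lock);
            size_t n = fread(p->data[slot], 1, BUF_SIZE, p->file);
            pthread_mutex_lock(&p->lock);
            p->len[slot] = n;
            p->ready[slot] = 1;
            pthread_cond_broadcast(&p->cond);
            pthread_mutex_unlock(&p->lock);
            if (n == 0) return NULL; /* EOF (or read error) ends the stream */
        }
    }

    int hash_file_overlapped(FILE *in, uint8_t out[BLAKE3_OUT_LEN]) {
        pipe_t *p = calloc(1, sizeof(*p)); /* too big for the stack */
        if (p == NULL) return -1;
        p->file = in;
        pthread_mutex_init(&p->lock, NULL);
        pthread_cond_init(&p->cond, NULL);
        pthread_t t;
        pthread_create(&t, NULL, reader, p);

        blake3_hasher hasher;
        blake3_hasher_init(&hasher);
        for (int slot = 0;; slot ^= 1) {
            pthread_mutex_lock(&p->lock);
            while (!p->ready[slot]) /* wait for the reader to fill this slot */
                pthread_cond_wait(&p->cond, &p->lock);
            pthread_mutex_unlock(&p->lock);
            if (p->len[slot] == 0) break; /* reader hit EOF */
            blake3_hasher_update(&hasher, p->data[slot], p->len[slot]);
            pthread_mutex_lock(&p->lock);
            p->ready[slot] = 0; /* hand the slot back to the reader */
            pthread_cond_broadcast(&p->cond);
            pthread_mutex_unlock(&p->lock);
        }
        pthread_join(t, NULL);
        blake3_hasher_finalize(&hasher, out, BLAKE3_OUT_LEN);
        free(p);
        return 0;
    }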

A hybrid possibility is that this project might pursue (3) in its stable API, but we could separately publish one or more higher-level wrappers on top of that API, which could double as both doc examples and convenience APIs for folks who are happy with common defaults? It might make sense to just see what sticks once all this is halfway working.

@fcorbelli
Author

@buffer.
In fact, no.
Not in the "real world", where a bunch of other things need to run, with OS overhead and so on.

64K buffer


C:\zpaqfranz>zpaqfranz sha1 z:\knb -blake3 -all -summary
zpaqfranz v52.10-experimental snapshot archiver, compiled Jul 29 2021
franz:blake3
franz:summary    1
Getting BLAKE3 ignoring .zfs and :$DATA

Found (8.19 GB) => 8.791.414.749 bytes (8.19 GB) / 21.523 files in 0.172000

Creating 32 hashing thread(s) with BLAKE3
001% 00:00:04 (  83.89 MB) of (   8.19 GB)           87.968.096 /sec
010% 00:00:02 ( 838.45 MB) of (   8.19 GB)          879.176.836 /sec
020% 00:00:01 (   1.64 GB) of (   8.19 GB)        1.758.295.196 /sec
030% 00:00:01 (   2.46 GB) of (   8.19 GB)        2.637.457.284 /sec
040% 00:00:01 (   3.27 GB) of (   8.19 GB)        3.516.583.877 /sec
050% 00:00:01 (   4.09 GB) of (   8.19 GB)        4.395.711.188 /sec
060% 00:00:00 (   4.91 GB) of (   8.19 GB)        5.274.864.279 /sec
070% 00:00:00 (   5.73 GB) of (   8.19 GB)        6.154.047.542 /sec
080% 00:00:00 (   6.55 GB) of (   8.19 GB)        7.033.165.714 /sec
090% 00:00:00 (   7.37 GB) of (   8.19 GB)        3.956.165.365 /sec
Scanning filesystem time            0.172 s
Data transfer+CPU   time            3.642 s
Data output         time            0.016 s
Total size                      8.791.414.749 (   8.19 GB)
Tested size                     8.791.414.749 (   8.19 GB)
Duplicated size                   264.663.198 ( 252.40 MB)
Duplicated files                        4.094
Worked on 8.791.414.749 bytes avg speed (hashtime) 2.413.897.514 B/s
GLOBAL SHA256: 2F83D0EE10E05012578B27258D74418038CDD926F1C8BD0E812C0CA4CA9D2803

1MB buffer


C:\zpaqfranz>zpaqfranz sha1 z:\knb -blake3 -all -summary
zpaqfranz v52.10-experimental snapshot archiver, compiled Jul 29 2021
franz:blake3
franz:summary    1
Getting BLAKE3 ignoring .zfs and :$DATA

Found (8.19 GB) => 8.791.414.749 bytes (8.19 GB) / 21.523 files in 0.187000

Creating 32 hashing thread(s) with BLAKE3
001% 00:00:03 (  83.88 MB) of (   8.19 GB)           87.960.327 /sec
010% 00:00:02 ( 838.51 MB) of (   8.19 GB)          879.242.157 /sec
020% 00:00:01 (   1.64 GB) of (   8.19 GB)        1.758.539.305 /sec
030% 00:00:01 (   2.46 GB) of (   8.19 GB)        2.638.294.226 /sec
040% 00:00:01 (   3.28 GB) of (   8.19 GB)        3.517.141.268 /sec
050% 00:00:01 (   4.09 GB) of (   8.19 GB)        4.396.131.446 /sec
060% 00:00:00 (   4.91 GB) of (   8.19 GB)        5.275.847.802 /sec
070% 00:00:00 (   5.73 GB) of (   8.19 GB)        6.154.319.413 /sec
080% 00:00:00 (   6.55 GB) of (   8.19 GB)        7.033.492.094 /sec
090% 00:00:00 (   7.37 GB) of (   8.19 GB)        3.956.290.170 /sec
Scanning filesystem time            0.187 s
Data transfer+CPU   time            3.626 s
Data output         time            0.016 s
Total size                      8.791.414.749 (   8.19 GB)
Tested size                     8.791.414.749 (   8.19 GB)
Duplicated size                   264.663.198 ( 252.40 MB)
Duplicated files                        4.094
Worked on 8.791.414.749 bytes avg speed (hashtime) 2.424.549.020 B/s
GLOBAL SHA256: 2F83D0EE10E05012578B27258D74418038CDD926F1C8BD0E812C0CA4CA9D2803

The reason why is long to explain; essentially it depends on what is being read from the filesystem, and how.
I don't think that is interesting for this thread, so I won't go deeper (in some cases you simply cannot allocate big buffers, because you have 512 KB of RAM, like on some cheap NAS).

@fcorbelli
Author

@fread.
I have not studied in depth how the BLAKE3 multithreading mechanism works; in fact, I would expect even limited parallelism, for example two/four threads, to be able to exceed the threshold of 550 MB/s effective (SATA SSDs).
As an optimal target there is 2 GB/s, which is the practical empirical limit for NVMe drives and real filesystems (i.e., not nearly-useless synthetic benchmarks).

@fcorbelli
Author

@mmap. mmap is "evil" because of the humongous overhead involved.
If you are curious, on the wyhash GitHub (wyhash really requires mmap) you will find my discussion about it. Short version: mmap sucks.

@fcorbelli
Author

How is a hash used in the "real world" (not crypto or message exchange, but data storage)?
In two main situations:

  1. calc the hash of a file, something on the filesystem. Typically with a loop over a FILE or similar (no C++ streams, no mmap or other high-level machinery, way too slow)
  2. calc the hash of a file by... chunks (deduplication, compression), or even one byte at a time.
    I didn't want to paste overly complex snippets, but they are similar to this one (obviously everyone's is different).
    This is a "byte-by-byte" example:
    unzSHA256 mysha256;
    for (uint32_t i=0; i<f.ptr.size(); ++i)
        if (f.ptr[i]<frag.size())
            for (uint32_t j=24; j<frag[f.ptr[i]].size(); ++j)
                mysha256.put(frag[f.ptr[i]][j]);

This is a "chunked" example

    while (true) {
        if (bufptr >= buflen) bufptr = 0, buflen = fread(buf, 1, BUFSIZE, in);

        // hash the freshly read block, if any
        if (g_franzotype > 0 && bufptr == 0 && buflen > 0) {
            p->second.file_crc32 = crc32_16bytes(buf, buflen, p->second.file_crc32);

            if (g_franzotype == FRANZO_XXHASH64)
                p->second.file_xxhash64.add(buf, buflen);

            if (g_franzotype == FRANZO_SHA_1)
                for (int i = 0; i < buflen; i++)
                    p->second.file_sha1.put(*(buf + i));

            if (g_franzotype == FRANZO_SHA_256)
                for (int i = 0; i < buflen; i++)
                    p->second.file_sha256.put(*(buf + i));

            if (g_franzotype == FRANZO_XXH3)
                if (p->second.pfile_xxh3)
                    (void)XXH3_128bits_update(p->second.pfile_xxh3, buf, buflen);
        }

        if (bufptr >= buflen) c = EOF;
        else c = (unsigned char)buf[bufptr++];
        if (c != EOF) {
            // rolling hash that decides fragment boundaries
            if (c == o1[c1]) h = (h + c + 1) * 314159265u, ++hits;
            else h = (h + c + 1) * 271828182u;
            o1[c1] = c;
            c1 = c;
            sha1.put(c);
            fragbuf[sz++] = c;
        }
        if (c == EOF
            || sz >= MAX_FRAGMENT
            || (fragment <= 22 && h < (1u << (22 - fragment)) && sz >= MIN_FRAGMENT))
            break;
    }

with chunked (fread-like) hashing in the first part, and a rolling SHA-1 in the latter.

In short, the uses are many (and these are just mine), so I think I can summarize like this:

  1. initialize
  2. put one byte at a time / write a block of bytes at a time
  3. finalize
  4. take the result

Which is exactly how BLAKE3 works in C.

It doesn't really matter what the mechanism is inside the two functions (say put() for 1 byte and write() for a buffer).
If a single-threaded straight-C BLAKE3 can achieve 500 MB/s per core (or better, on par with SHA-1 at 1 GB/s), that will be wonderful.
2 GB/s per core for a strong cryptographic hash would be even better.
So a multithreaded C BLAKE3 has only one objective: to be significantly faster.

Because a plain old SHA-1 runs like this (just as a term of comparison; as mentioned, synthetic benchmarks are not interesting outside academia):

C:\zpaqfranz>zpaqfranz sha1 z:\knb -sha1 -all -summary
zpaqfranz v52.10-experimental snapshot archiver, compiled Jul 29 2021
franz:SHA-1 (-sha1)
franz:summary    1
Getting SHA1 ignoring .zfs and :$DATA

Found (8.19 GB) => 8.791.414.749 bytes (8.19 GB) / 21.523 files in 0.203000

Creating 32 hashing thread(s) with SHA1
001% 00:00:01 (  84.00 MB) of (   8.19 GB)           88.080.384 /sec
010% 00:00:00 ( 838.50 MB) of (   8.19 GB)          879.230.976 /sec
020% 00:00:00 (   1.64 GB) of (   8.19 GB)        1.758.461.952 /sec
030% 00:00:00 (   2.46 GB) of (   8.19 GB)        2.637.692.928 /sec
040% 00:00:00 (   3.28 GB) of (   8.19 GB)        3.516.923.904 /sec
050% 00:00:00 (   4.09 GB) of (   8.19 GB)        4.396.154.880 /sec
060% 00:00:00 (   4.91 GB) of (   8.19 GB)        5.274.861.568 /sec
070% 00:00:00 (   5.73 GB) of (   8.19 GB)        6.154.092.544 /sec
080% 00:00:00 (   6.55 GB) of (   8.19 GB)        7.033.323.520 /sec
Scanning filesystem time            0.203 s
Data transfer+CPU   time            1.172 s
Data output         time            0.016 s
Total size                      8.791.414.749 (   8.19 GB)
Tested size                     8.791.414.749 (   8.19 GB)
Duplicated size                   265.085.633 ( 252.80 MB)
Duplicated files                        4.095
Worked on 8.791.414.749 bytes avg speed (hashtime) 7.501.207.123 B/s
GLOBAL SHA256: A500E353987F6827A3FE6CF63C49BB5CC2C87E53F5BF437E9EB46461ED8B45B8

And this is XXH3 (just only as an order of magnitude)

C:\zpaqfranz>zpaqfranz sha1 z:\knb -xxh3 -all -summary
zpaqfranz v52.10-experimental snapshot archiver, compiled Jul 29 2021
franz:XXH3 (-xxh3)
franz:summary    1
Getting XXH3 ignoring .zfs and :$DATA

Found (8.19 GB) => 8.791.414.749 bytes (8.19 GB) / 21.523 files in 0.188000

Creating 32 hashing thread(s) with XXH3
001% 00:00:01 (  83.84 MB) of (   8.19 GB)           79.161.575 /sec
010% 00:00:00 ( 838.43 MB) of (   8.19 GB)          879.161.575 /sec
020% 00:00:00 (   1.64 GB) of (   8.19 GB)        1.758.342.797 /sec
030% 00:00:00 (   2.46 GB) of (   8.19 GB)        2.637.475.960 /sec
040% 00:00:00 (   3.27 GB) of (   8.19 GB)        3.516.600.072 /sec
050% 00:00:00 (   4.09 GB) of (   8.19 GB)        4.395.753.431 /sec
060% 00:00:00 (   4.91 GB) of (   8.19 GB)        5.274.907.772 /sec
070% 00:00:00 (   5.73 GB) of (   8.19 GB)        6.154.055.584 /sec
080% 00:00:00 (   6.55 GB) of (   8.19 GB)        7.033.139.650 /sec
090% 00:00:00 (   7.37 GB) of (   8.19 GB)        7.912.300.605 /sec
Scanning filesystem time            0.188 s
Data transfer+CPU   time            0.501 s
Data output         time            0.015 s
Total size                      8.791.414.749 (   8.19 GB)
Tested size                     8.791.414.749 (   8.19 GB)
Duplicated size                   264.663.198 ( 252.40 MB)
Duplicated files                        4.094
Worked on 8.791.414.749 bytes avg speed (hashtime) 17.547.734.029 B/s
GLOBAL SHA256: 4A9D67B40296A19FA0E860D48C21563FAA1BC6785899A485CAC46BB7AAB1431

@fcorbelli
Author

Finally, the short version:

how to get a fast straight-C (no SSE or whatever) implementation of BLAKE3, faster than the current one (~same speed as SHA-256) and faster than SHA-1?

Very rough estimate for a single core on 512K buffer

     SHA-256:   271.50 MB/s (done     1.33 GB)
      BLAKE3:   275.80 MB/s (done     1.35 GB)
       SHA-1:   903.80 MB/s (done     4.41 GB)
    XXHASH64:     6.05 GB/s (done    30.25 GB)
        XXH3:     7.01 GB/s (done    34.97 GB)
      WYHASH:     8.75 GB/s (done    43.61 GB)

@cesarb
Contributor

cesarb commented Jul 31, 2021

how to get a fast straight-C (no SSE or whatever) implementation of BLAKE3

As far as I know, BLAKE3 was designed to exploit as much parallelism as possible, either within a single thread by using the processor's SIMD instructions, or with multiple threads (each thread also using the processor's SIMD instructions). By completely avoiding SIMD instructions, you are handicapping yourself, and it's already impressive that it's as fast as SHA-256.

One issue is the size of the compression function state. BLAKE3 uses a 4x4 matrix of 32-bit values, and the C compiler will allocate one register for each of these 16 values. If the architecture is register-starved (like x86-64, which only has 16 registers, and some of them will already have other uses), this will lead to lots of spilling to the stack. A quick look at Wikipedia tells me that SHA-256 uses only eight 32-bit values, while SHA-1 uses only five.

When using SIMD registers, on the other hand, each row of the 4x4 state matrix can fit in a single register, and a single instruction can operate on the whole row. The BLAKE3 compression function was carefully designed to allow for that; the diagonal operations can be done by rotating each row (through a SIMD shuffle instruction) to align the columns, and later rotating them back. The gains push above a 4x speedup since there's no longer spilling to the stack, but the costs pull below 4x since there are extra instructions for rotating rows and gathering the input into SIMD registers; I'd expect the final speedup of using SIMD to be around 4 times. And that's just for 128-bit (4x32-bit) SIMD; more gains are possible with larger SIMD registers (by processing more than one block in parallel).

But it might be possible to make the "straight-C" implementation (I presume it's https://github.com/BLAKE3-team/BLAKE3/blob/1.0.0/c/blake3_portable.c) a bit faster. One surprising result I found while playing with SIMD and pseudo-SIMD in the early Rust era (soon after Rust 1.0), in my https://github.com/cesarb/blake2-rfc crate, was that the order of operations matter. Instead of doing each of the G operations completely, they should be interleaved, that is, the first line of G for the four columns, then the second line of G for the four columns, and so on; that is, precisely the order it would be done with SIMD instructions, even if you're not using them. This allows processors with lots of out-of-order capability (like most x86-64 processors) to do more work in parallel, since there are more independent instructions between dependent instructions. As a bonus, a smart enough compiler can notice what you're doing, and actually use a few SIMD instructions under the hood.
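
A sketch of the reordering cesarb describes, using BLAKE3's actual G steps and rotation constants (16, 12, 8, 7): the first function runs one column's G to completion, while the second issues each line of G for all four columns before moving to the next line, creating four independent dependency chains (m[] holds the already-scheduled message words):

    #include <stdint.h>

    static inline uint32_t rotr32(uint32_t w, uint32_t c) {
        return (w >> c) | (w << (32 - c));
    }

    /* Reference order: the whole of G for one column at a time. Every step
       depends on the previous one, so the chain is strictly serial. */
    static void g(uint32_t *s, int a, int b, int c, int d,
                  uint32_t mx, uint32_t my) {
        s[a] = s[a] + s[b] + mx;
        s[d] = rotr32(s[d] ^ s[a], 16);
        s[c] = s[c] + s[d];
        s[b] = rotr32(s[b] ^ s[c], 12);
        s[a] = s[a] + s[b] + my;
        s[d] = rotr32(s[d] ^ s[a], 8);
        s[c] = s[c] + s[d];
        s[b] = rotr32(s[b] ^ s[c], 7);
    }

    /* Interleaved order for the column half of a round: each line of G is
       issued for all four columns before the next line. The columns touch
       disjoint state words, so each group of four statements is mutually
       independent, exactly like a 4x32-bit SIMD row operation. */
    static void column_step_interleaved(uint32_t s[16], const uint32_t m[16]) {
        s[0] += s[4] + m[0]; s[1] += s[5] + m[2];
        s[2] += s[6] + m[4]; s[3] += s[7] + m[6];
        s[12] = rotr32(s[12] ^ s[0], 16); s[13] = rotr32(s[13] ^ s[1], 16);
        s[14] = rotr32(s[14] ^ s[2], 16); s[15] = rotr32(s[15] ^ s[3], 16);
        s[8] += s[12]; s[9] += s[13]; s[10] += s[14]; s[11] += s[15];
        s[4] = rotr32(s[4] ^ s[8], 12); s[5] = rotr32(s[5] ^ s[9], 12);
        s[6] = rotr32(s[6] ^ s[10], 12); s[7] = rotr32(s[7] ^ s[11], 12);
        /* ...and likewise for the second halves of the four G calls (message
           words m[1], m[3], m[5], m[7], rotations 8 and 7), then the same
           pattern again for the diagonals. */
    }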

However, doing that would mean blake3_portable.c would no longer be the simple, easy-to-read reference implementation it currently is, and that for no real gain since all processors which would benefit from that extra complexity (highly out-of-order processors) have SIMD instructions, which could be used instead for even more speed.

@cesarb
Contributor

cesarb commented Jul 31, 2021

And about mmap: BLAKE3 is so fast (once you're using SIMD and multiple threads) that it's actually limited by the memory speed. When you do a read() (either directly or through an abstraction like fread()), you're doing an extra read from memory (from the kernel page cache) to memory (your buffer), before reading it again (for the hash). Using mmap avoids that extra copy, since it directly maps the pages from the kernel page cache into your address space. This can be easily seen by comparing the speed of b3sum and b3sum --no-mmap on a large file (like an ISO) which is already on the page cache.
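
For anyone who wants to reproduce that comparison in their own code, here is a minimal POSIX sketch of the map-then-hash approach (Windows would use CreateFileMapping/MapViewOfFile instead):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include "blake3.h"

    int hash_file_mmap(const char *path, uint8_t out[BLAKE3_OUT_LEN]) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return -1;
        struct stat st;
        if (fstat(fd, &st) != 0) { close(fd); return -1; }

        blake3_hasher hasher;
        blake3_hasher_init(&hasher);
        if (st.st_size > 0) {
            /* the kernel's page-cache pages are mapped straight into the
               process, so the data is read once, with no userspace copy */
            void *map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
            if (map == MAP_FAILED) { close(fd); return -1; }
            blake3_hasher_update(&hasher, map, (size_t)st.st_size);
            munmap(map, (size_t)st.st_size);
        }
        close(fd);
        blake3_hasher_finalize(&hasher, out, BLAKE3_OUT_LEN);
        return 0;
    }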

@fcorbelli
Author

how to get a fast straight-C (no SSE or whatever) implementation of BLAKE3

As far as I know, BLAKE3 was designed to exploit as much parallelism as possible, either within a single thread by using the processor's SIMD instructions, or with multiple threads (each thread also using the processor's SIMD instructions). By completely avoiding SIMD instructions, you are handicapping yourself, and it's already impressive that it's as fast as SHA-256.

In fact, on Intel CPUs BLAKE3 is faster than SHA-256 (much faster, maybe 50%), and "on par" on AMD.
SHA-256 has an excessive number of rounds; they could be significantly reduced, possibly tripling its performance without loss of security, but the standard is now fixed.

@fcorbelli
Author

And about mmap: BLAKE3 is so fast (once you're using SIMD and multiple threads) that it's actually limited by the memory speed. When you do a read() (either directly or through an abstraction like fread()), you're doing an extra read from memory (from the kernel page cache) to memory (your buffer), before reading it again (for the hash). Using mmap avoids that extra copy, since it directly maps the pages from the kernel page cache into your address space. This can be easily seen by comparing the speed of b3sum and b3sum --no-mmap on a large file (like an ISO) which is already on the page cache.

It depends heavily on the case. I have already run tests with various hashes (including straight BLAKE3) with one or 32 threads, from RAMDISK or cache, on Windows 10.
With large and small files.
The overhead of mmap is obviously high, and it grows with the number of threads. The same also happens with a much faster (non-cryptographic) hash than BLAKE3 (wyhash). It is important to note the setup: C, pthreads, Windows, AMD. It is not a given that the same behavior holds for Rust etc.

mmap doesn't just "magically" map data: it has to read it from disk (= slower) and has to handle the situation where the file is larger than RAM (or rather than the address space of the process), = more and more overhead. I don't know if the measurements I made would be interesting.

@fcorbelli
Author

Just for fun:
Straight C (gcc 7.3, only -O3), Windows 10, AMD, from RAMDISK.
If the duplicated sizes differ, then an algorithm implementation is broken.

BLAKE3, on "normal" files

C:\zpaqfranz>zpaqfranz sum z:\knb -blake3 -all -summary
zpaqfranz v52.13-experimental snapshot archiver, compiled Aug  1 2021
(...)
Creating 32 hashing thread(s) with BLAKE3
Scanning filesystem time            0.140 s
Data transfer+CPU   time            3.563 s
Data output         time            0.015 s
Total size                      8.809.470.317 (   8.20 GB)
Tested size                     8.809.470.317 (   8.20 GB)
Duplicated size                   264.682.716 ( 252.42 MB)
Duplicated files                        4.097
Worked on 8.809.470.317 bytes avg speed (hashtime) 2.472.486.757 B/s

On memory mapped

C:\zpaqfranz>zpaqfranz sum z:\knb -blake3 -all -mm -summary
(...)
Creating 32 hashing thread(s) with BLAKE3
Scanning filesystem time            0.156 s
Data transfer+CPU   time            3.845 s
Data output         time            0.015 s
Total size                      8.809.470.317 (   8.20 GB)
Tested size                     8.809.470.317 (   8.20 GB)
Duplicated size                   264.682.716 ( 252.42 MB)
Duplicated files                        4.097
Worked on 8.809.470.317 bytes avg speed (hashtime) 2.291.149.627 B/s

SHA1 on file

C:\zpaqfranz>zpaqfranz sum z:\knb -sha1 -all -summary
(...)
Scanning filesystem time            0.172 s
Data transfer+CPU   time            1.157 s
Data output         time            0.016 s
Total size                      8.809.470.317 (   8.20 GB)
Tested size                     8.809.470.317 (   8.20 GB)
Duplicated size                   264.682.716 ( 252.42 MB)
Duplicated files                        4.097
Worked on 8.809.470.317 bytes avg speed (hashtime) 7.614.062.50

SHA1 on memory mapped

C:\zpaqfranz>zpaqfranz sum z:\knb -sha1 -all -mm -summary
(...)
Scanning filesystem time            0.156 s
Data transfer+CPU   time            2.657 s
Data output         time            0.015 s
Total size                      8.809.470.317 (   8.20 GB)
Tested size                     8.809.470.317 (   8.20 GB)
Duplicated size                   264.682.716 ( 252.42 MB)
Duplicated files                        4.097
Worked on 8.809.470.317 bytes avg speed (hashtime) 3.315.570.311 B/s

XXH3 on file

C:\zpaqfranz>zpaqfranz sum z:\knb -xxh3 -all -summary
zpaqfranz v52.13-experimental snapshot archiver, compiled Aug  1 2021
(...)
Scanning filesystem time            0.172 s
Data transfer+CPU   time            0.564 s
Data output         time            0.016 s
Total size                      8.809.470.317 (   8.20 GB)
Tested size                     8.809.470.317 (   8.20 GB)
Duplicated size                   264.682.716 ( 252.42 MB)
Duplicated files                        4.097
Worked on 8.809.470.317 bytes avg speed (hashtime) 15.619.628.221 B/s

XXH3 on memory mapped

C:\zpaqfranz>zpaqfranz sum z:\knb -xxh3 -all -mm -summary
(...)
Scanning filesystem time            0.156 s
Data transfer+CPU   time            1.454 s
Data output         time            0.015 s
Total size                      8.809.470.317 (   8.20 GB)
Tested size                     8.809.470.317 (   8.20 GB)
Duplicated size                   264.682.716 ( 252.42 MB)
Duplicated files                        4.097
Worked on 8.809.470.317 bytes avg speed (hashtime) 6.058.782.886 B/s

Absolute performance is not what matters here; what matters is the comparison of the same algorithm reading from a memory-mapped file versus read(). As you can see, at least for a fairly heterogeneous mix of files, the operating-system overhead is not insignificant, up to 50%, with 32 running threads.
Now look at one thread at a time.

BLAKE3 on file, SINGLE thread

C:\zpaqfranz>zpaqfranz sum z:\knb -blake3 -summary
zpaqfranz v52.13-experimental snapshot archiver, compiled Aug  1 2021
(...)
Creating 1 hashing thread(s) with BLAKE3
(...)
Scanning filesystem time            0.157 s
Data transfer+CPU   time           30.860 s
Data output         time            0.016 s
Total size                      8.809.470.317 (   8.20 GB)
Tested size                     8.809.470.317 (   8.20 GB)
Duplicated size                   264.682.716 ( 252.42 MB)
Duplicated files                        4.097
Worked on 8.809.470.317 bytes avg speed (hashtime) 285.465.661 B/s

BLAKE3 on memory mapped, SINGLE thread

C:\zpaqfranz>zpaqfranz sum z:\knb -blake3 -mm -summary
zpaqfranz v52.13-experimental snapshot archiver, compiled Aug  1 2021
franz:blake3 (-blake3) memory mapped file (-mm)
(...)
Creating 1 hashing thread(s) with BLAKE3
Scanning filesystem time            0.140 s
Data transfer+CPU   time           31.329 s
Data output         time            0.015 s
Total size                      8.809.470.317 (   8.20 GB)
Tested size                     8.809.470.317 (   8.20 GB)
Duplicated size                   264.682.716 ( 252.42 MB)
Duplicated files                        4.097
Worked on 8.809.470.317 bytes avg speed (hashtime) 281.192.196 B/s

Same speed.
It is not a large test etc., just to show that it is not so obvious which is the best choice.

@kangert

kangert commented Oct 21, 2021

Any progress / estimated implementation date 😊 for this matter? Has a decision already been made regarding variants 1/2/3?

For me, the order is quite clear:
2 - Yes, please.
1 - I can live with that.
3 - By no means.

@oconnor663
Member

No progress to report. My current priority (apart from teaching at NYU) is fixing the poor performance of b3sum on spinning disks, by implementing a different multithreading strategy. That might ultimately involve some overlap with this thread's option #3 (a stable low-level API that different threading strategies can sit on top of), but that remains to be seen.

@kangert

kangert commented Jun 1, 2022

Does anyone have any experience with the following project?

https://github.com/Blaze-3/BLAKE3-gpu (specifically its non-GPU-accelerated OpenMP version)

@kangert

kangert commented Jul 9, 2022

Do you have any Rust vs C version performance comparisons?

I wanted to find out the performance benefits of the multithreaded version, so I compared the C version, Rust single-threaded, and Rust multithreaded (update_rayon) on a series of 4-128 MB chunks (in-memory byte arrays via JNI), Windows 10 64-bit, 6-core Xeon, built with -O3.

And the winner is C single-threaded: 100% (processing time, the reference value)
Rust multithreaded: about 115%
Rust single-threaded: about 160%

@fcorbelli
Author

Have you tested the HW-accelerated (SSE) BLAKE3?
On my AMD 5950X (Win10) it goes from 547 MB/s to 3.4 GB/s, six times faster (390 KB chunks).

@oconnor663
Member

@kangert Rust and C single-threaded should be using the same assembly implementations and should give nearly identical performance, so I wonder whether there could be an issue with how your bindings/benchmarks are getting compiled? If you push a version to GitHub, I'd be happy to take a look.

For getting a rough idea of how well multithreading works on any particular platform, I often reach for the Python bindings, which are based on the Rust implementation and which expose a threading flag.

@kangert

kangert commented Jul 9, 2022

@fcorbelli I used the code from this project (default settings, which should use all the instruction sets available on a particular CPU).
Then I also tried the pure and prefer_intrinsics features.

@kangert

kangert commented Jul 9, 2022

@oconnor663 My tests are based on https://github.com/sken77/BLAKE3jni (that is, from an existing copy of it https://github.com/semantalytics/BLAKE3jni). So the Java part is essentially the same, maybe I'm doing something wrong in passing the input to Rust (which, to put it mildly, I'm really not an expert on).

It's basically two lines of code. Converting from input: jbyteArray and calling update or update_rayon on blake3::Hasher:

    let bytes: &[u8] = &JNIEnv::convert_byte_array(&env, input).unwrap();
    hasher.update_rayon(&bytes);

@kangert

kangert commented Jul 9, 2022

@oconnor663 I should also add that I used the Cargo.toml file from the project https://github.com/BLAKE3-team/BLAKE3/tree/master/b3sum, plus:

    [profile.release]
    opt-level = 3
    lto = true

@cesarb
Contributor

cesarb commented Jul 10, 2022

@oconnor663 My tests are based on https://github.com/sken77/BLAKE3jni (that is, from an existing copy of it https://github.com/semantalytics/BLAKE3jni). So the Java part is essentially the same, maybe I'm doing something wrong in passing the input to Rust (which, to put it mildly, I'm really not an expert on).

It's basically two lines of code. Converting from input: jbyteArray and calling update or update_rayon on blake3::Hasher:

    let bytes: &[u8] = &JNIEnv::convert_byte_array(&env, input).unwrap();
    hasher.update_rayon(&bytes);

I can see the problem here. As I found out when trying to use the GPU to accelerate BLAKE3, this hash is so fast that it's often limited by the speed of the memory. Doing any extra copying of the data in memory can be enough to make the total time a lot higher.

You are using JNIEnv::convert_byte_array(), which I'm guessing is the code found at https://docs.rs/jni/0.19.0/src/jni/wrapper/jnienv.rs.html#1112-1125 which calls the JNI APIs GetArrayLength and GetByteArrayRegion. GetArrayLength should be nearly instant, but GetByteArrayRegion is basically doing a full copy of the data (into a newly allocated and zeroed Vec).

The C code at https://github.com/semantalytics/BLAKE3jni on the other hand is using a different API, GetDirectBufferAddress (and to allow for that, on the Java side it's using ByteBuffer.allocateDirect() to allocate the memory). That API returns a direct pointer to the memory contained within the ByteBuffer, without copying.

That should be enough to explain the speed difference. To match the C code, you should either also use ByteBuffer.allocateDirect() on the Java side (but then be careful of extra copies within the Java code) and then JNIEnv::get_direct_buffer_address() on the Rust side, or if you want to keep using a byte array on the Java side, you could try either JNIEnv::get_byte_array_elements() or JNIEnv::get_primitive_array_critical() which might (or might not; depends on the specific GC implementation used by your JVM) avoid doing the memory copy.
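
For reference, here is the same zero-copy idea on the C side of JNI (a sketch with a hypothetical Blake3.hash binding, not the actual BLAKE3jni code): the Get/ReleasePrimitiveArrayCritical pair pins the Java byte[] and returns a direct pointer, so the input is hashed in place instead of being copied first.

    #include <jni.h>
    #include "blake3.h"

    /* Hypothetical binding: class Blake3, static native byte[] hash(byte[]). */
    JNIEXPORT jbyteArray JNICALL
    Java_Blake3_hash(JNIEnv *env, jclass cls, jbyteArray input) {
        (void)cls;
        jsize len = (*env)->GetArrayLength(env, input);
        jbyte *data = (*env)->GetPrimitiveArrayCritical(env, input, NULL);
        if (data == NULL) return NULL;

        blake3_hasher hasher;
        blake3_hasher_init(&hasher);
        blake3_hasher_update(&hasher, data, (size_t)len);
        /* JNI_ABORT: the input was only read, nothing to copy back. Release
           before any other JNI call, since critical sections forbid them. */
        (*env)->ReleasePrimitiveArrayCritical(env, input, data, JNI_ABORT);

        uint8_t out[BLAKE3_OUT_LEN];
        blake3_hasher_finalize(&hasher, out, BLAKE3_OUT_LEN);

        jbyteArray result = (*env)->NewByteArray(env, BLAKE3_OUT_LEN);
        if (result != NULL)
            (*env)->SetByteArrayRegion(env, result, 0, BLAKE3_OUT_LEN,
                                       (const jbyte *)out);
        return result;
    }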

@kangert
Copy link

kangert commented Jul 11, 2022

@cesarb Much better. JNIEnv::get_primitive_array_critical() works perfectly.
Thank you very much
