[WIP] ICP: AES_GCM: add support for SSE4.1 acceleration #14531
Conversation
While it's still a draft, I took care to squash the commits in a way which should make reviewing this straightforward. I did quite some testing without revealing any problems, but at this point I'd appreciate any help on testing on a wider range of systems. If you need any help on how to test this, please drop a note here and I'll instruct. If you want to test this, in any case please do it on a test VM with a test pool, since this code could permanently corrupt any pools accessed from it. There's still some cleanup and some minor refinements left, so this is a draft for now. I'm also wondering if I should keep the debugging code (see the DEBUG_GCM_ASM #define) in the final version.
The people at iXsystems may be interested in testing this too. Surely they'll have a far better test infrastructure to test this on than I do.
@lundman I put in a major effort to make this clang compatible on Linux. If you have a spare hour you may want to test whether it works on macOS and Windows as well.
One important point regarding testing I missed above: if you are testing this on hardware which has AVX support, you need to activate SSE support by doing # echo sse4_1 >/sys/module/zfs/parameters/icp_gcm_impl or by setting it at module load time.
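To make the setting persist across module loads, a modprobe option should work as well (a minimal sketch; the file name zfs-gcm.conf is an arbitrary choice):
echo "options zfs icp_gcm_impl=sse4_1" | sudo tee /etc/modprobe.d/zfs-gcm.conf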
Excellent, I will give it a go on macOS and Windows.
Thank you a lot for your work! I have a ReadyNAS 426 with an Intel(R) Atom(TM) CPU C3538 @ 2.10GHz, without AVX but with SSE3, SSE4.1 & SSE4.2. I'd be willing to test your implementation. I have 6 × 8TB drives in ZFS for storage (ZFS directly uses the disks and autoformatted them to GPT: 7.3T ZFS + 8M reserved) and 2 × 64GB High Endurance microSD cards in RAID with Debian Bookworm. Could you please provide instructions on how I can best test this PR?
I did a light fio run on a small Celeron J3455 [Goldmont] board with SSE4.2 but no AVX. Using a single-vdev tmpfs-backed pool, no compression:
fio was: … Hardly science, but those initial numbers seem pretty good. Strong work!
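The exact fio invocation above did not survive extraction; a typical sequential-read run of the kind described might look like this (all parameters are illustrative assumptions):
fio --name=seqread --directory=/test --rw=read --bs=1M --size=4g \
    --ioengine=psync --numjobs=4 --group_reporting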
Thank you @AttilaFueloep, I too have very promising results on a C3558 system running Debian Bullseye under the KVM hypervisor, with a WD Black SN750 NVMe disk. In my tests below I just used dd to read a big file with the ARC size limited to 500MB. Unencrypted read speed was 909MB/s, the old …
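A sketch of that kind of test, assuming the encrypted dataset is mounted at /pool/enc (paths and the exact limit are illustrative; zfs_arc_max takes a value in bytes):
echo $((500 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_arc_max
dd if=/pool/enc/bigfile of=/dev/null bs=1M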
@oberien Well, I'd suggest the following: spin up a VM which has all the dependencies for building ZFS installed, but not the ZFS packages themselves. Inside the VM do:
git clone https://github.com/openzfs/zfs/
cd zfs
git pull origin pull/14531/head
./autogen.sh && ./configure --prefix=/ --enable-debuginfo --enable-debug && make
Open a new root terminal inside the VM, since for me the installation step failed with sudo.
cd <dir where you build zfs>
make install
modprobe zfs
echo sse4_1 >/sys/module/zfs/parameters/icp_gcm_impl
zpool create ...
zfs create -o encryption= ...
Run your favorite workloads and benchmark tools.
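To verify which implementation is active, the parameter can be read back; the selected one is shown in brackets (the exact list varies by CPU and ZFS build):
cat /sys/module/zfs/parameters/icp_gcm_impl
# e.g. cycle [sse4_1] generic pclmulqdq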
module/icp/include/modes/modes.h
if (sc->gcm_Htable != NULL) {
    kmem_free(c->gcm_Htable, c->gcm_htab_len);
}
if (sc->gcm_pt_buf != NULL) {
    vmem_free(c->gcm_pt_buf, c->gcm_pt_buf_len);
}
Should it be sc-> instead of c-> in the free arguments?
Yes, nice catch. I have this fix in my tree already, dunno why I missed it here.
Looks like a merge error. Fixed.
I tested using an 8GB ramdisk, both on a Debian host (shipped ZFS) and in a Debian VM (ZFS compiled from this PR). Interestingly, the host and VM results diverge; see the tl;dr below.
Setup VM (on Debian):
sudo apt install --no-install-recommends qemu-system libvirt-clients libvirt-daemon-system virtinst dnsmasq
virsh net-start default
virt-install --virt-type kvm --name bullseye-amd64 \
--location http://deb.debian.org/debian/dists/bullseye/main/installer-amd64/ \
--os-variant debian11 \
--disk size=10 --memory 10000 \
--graphics none \
--console pty,target_type=serial \
--extra-args "console=ttyS0"
Setup ZFS:
sudo apt install build-essential git automake libtool gettext zlib1g-dev uuid-dev libblkid-dev libssl-dev linux-headers-$(uname -r) libaio-dev libudev-dev
git config --global pull.rebase true
git config --global user.email test@example.com
git clone https://github.com/openzfs/zfs/
cd zfs
git pull origin pull/14531/head
./autogen.sh && ./configure --prefix=/ --enable-debuginfo --enable-debug && make
make install
depmod
modprobe zfs
echo sse4_1 >/sys/module/zfs/parameters/icp_gcm_impl
Test code:
zfs --version
mkdir /tmp/tmpfs
mount -t tmpfs none /tmp/tmpfs
truncate -s 8G /tmp/tmpfs/test
zpool create -o ashift=12 -O mountpoint=none test /tmp/tmpfs/test
# without encryption
zfs create -o atime=off -o compression=off -o mountpoint=/test test/test
dd if=/dev/zero of=/test/test bs=1M count=4k
zpool export test && zpool import -d /tmp/tmpfs/test test -l
dd if=/test/test of=/dev/null bs=1M count=4096
zfs destroy test/test
# with encryption
dd if=/dev/urandom of=/tmp/test_key bs=1 count=32
zfs create -o encryption=aes-256-gcm -o keyformat=raw -o keylocation=file:///tmp/test_key -o atime=off -o compression=off -o mountpoint=/test test/test
dd if=/dev/zero of=/test/test bs=1M count=4k
zpool export test && zpool import -d /tmp/tmpfs/test test -l
dd if=/test/test of=/dev/null bs=1M count=4096
zfs destroy test/test
# cleanup
zpool destroy test
rmdir /test
rm /tmp/test_key
rm /tmp/tmpfs/test
umount /tmp/tmpfs
rmdir /tmp/tmpfs
Results:
tldr: Host without encryption is 150% as fast as the VM. Host with encryption is about 50% as fast as the VM. Assuming the unencrypted difference is due to VM overhead, I estimate host-side encryption with this patch to be in the ballpark of ~3× as fast as before.
macOS changes to compile: The …
panic
show registers
@lundman Thanks for testing on macOS. The odd thing I spotted is this:
That should read …
EDIT: Removed a lot of nonsense I wrote sleep-deprived and added what I should've written in the first place.
Oh, I assumed it was because …
No it really is gone:
Could it be …
More on panic
looks like …
D'oh, I was a zombie yesterday, no sleep no brain.
Looks like a compiler bug.
That adds the contents at memory location 0x10 to the %r11 register. That will of course panic with a page fault. Could you try (I'm assuming that on macOS …
OK
Now it crashes further down - lemme walk it until it doesn't crash. On the $1 thing, the closest I found was …
Ah no, that was the only change needed:
Great! I'll add the workaround.
Windows compile:
So now that's a real problem, not easily fixed. Are both the macOS and Windows clangs at the same version?
Not at all; macOS is Apple's clang:
and Windows is … Note that …
OK, replacing …
I'd bet if you comment out the …
The llvm guys said:
I'll see if I can figure out what they mean.
Well, as far as I understand it, just replace …
Such good news to see progress on this issue. Thank you so much for your effort. Hope I can soon use native encryption on my Atom C3338. The Supermicro Inc. A2SDi-2C-HLN4F-B seems so ideal for a DIY NAS, as soon as encryption is working reliably :)
I have it running on my NAS as a replicated store and I didn't find any issues or corruption when scrubbing or verifying backup checksums. I think it works fully reliably, but it would be nice to hear about someone else's experience.
I've been running it on my NAS (C3558) for months without issues.
After the positive comments here, I made the effort to try these patches on my Celeron J4125 based NAS. What a difference! These finally make ZFS usable on these machines. I have yet to find any issues, so I'll keep running these until they get merged.
@AttilaFueloep what is missing to merge this? How can we help you if we are not C++ devs?
I know this is for lower-performance CPUs; I just want to link this recent work in the Linux kernel (https://www.phoronix.com/news/AES-GCM-Intel-AMD-Linux-6.11). I don't know if ZFS on Linux uses these interfaces and might benefit from them (if you have a supported CPU), or if ZFS has its own framework. In the latter case, porting the relevant parts from the kernel might lead to further performance improvements. But anyway, looking forward to getting this integrated first/soon! Big thanks to @AttilaFueloep
Applied the patch today for a Celeron NAS that kept waving …
Thank you for the write-up! As a non-developer running Debian Bookworm, this was very helpful. I've installed the patched Debian ZFS DKMS package on a system with a simple mirrored pool and an Intel Atom C2558 CPU running at 2.40GHz. Without the patch, this system struggled to read from encrypted datasets at faster than ~90MB/s. With the patch, the same hardware handily saturates gigabit Ethernet and the CPU seems noticeably less busy. Excellent work!
cryptsetup / Linux encryption seems to have the very same issue, where only AVX is fully supported. cryptsetup frequently leads to kernel crashes due to hangs while encrypting data before committing it to disk. It makes encryption very disappointing on low-end hardware, but also on mid-tier enterprise gear (older Xeon processors that are in fact very affordable as second-hand equipment for a home lab). Exact same MO as the ZFS issue, but with extra crashes due to the underlying storage system (whereas with ZFS, it's all-in-one) waiting for the queue, not getting the green light, and then emitting a crash mentioning that some operation (a write) hung for a certain number of seconds. So a port of the operations to SSE(+) would solve the issue with cryptsetup as well.
@AttilaFueloep may I be so free as to ask what the status of this PR is? Not trying to push you, I'm just curious whether this is a "feature" you're still pursuing*, or whether it has gone stale. Based on the comments of others it seems to work just fine, and I think (some of them) are running it on "production" systems as well? (Maybe not the systems ZFS was designed for, but for a DIY NAS etc.) So over time (my assumption is that @FeepingCreature has been running it in "production" since November last year, with others chiming in later) I guess that is a good indicator that it's working / "has been tested"? But it obviously has (only) been run by those who are helped by this change, so it might have regressions in other areas which aren't fully tested yet. And even a comment that you won't be pursuing this PR anymore might be helpful, as in that case someone else might feel free to continue your work**.
* In the linked issue I read you earlier wanted to buy a NAS yourself with a CPU not supporting AVX, so in that case you needed this yourself, which I know is a great motivator for contributions :). But if you lost interest in that, you might have lost interest in this PR as well, understandably.
** I would love to do so myself, but I hardly know any C, let alone all the assembly stuff, kernel (module) development, or the ZFS codebase, so I don't have the knowledge to do so. But I'm sure there are far more knowledgeable people than me who might be able to help out or continue your work so far.
Sorry for the loong delay, got other priorities unfortunately. Thanks for the encouraging comments and all the testing. I'll try to get this PR over the line soon.
Unfortunately we can't use Linux crypto due to GPL vs CDDL. That said, the code base I ported the SSE implementation from also has a VAES implementation which should not be hard to port either. The major problem is that I don't own any hardware which supports VAES, so no dice.
That's really great to read. Yesterday I compiled this on a test system (applied the patch to 2.2.5 from Debian's bookworm-backports) and the results indeed are great. The test system does support AVX, so I could also compare with that, and both AVX and this SSE4.1 implementation hit above 500MB/s (single fio run on an 8GB file); the difference was 6MB/s I think (in favor of AVX), which is definitely within my margin of error. With pclmulqdq it only did ~130MB/s, so 25% of AVX/SSE. This is on an older i5 laptop (quad core, 4GB RAM, SATA SSD), so 500MB/s most likely maxes out the drive or SATA interface, I guess. My actual systems are two N5105 systems, which also do ~150MB/s with pclmulqdq, so I'm expecting the same speed (improvement) there. Which would mean I could upgrade networking to 2.5Gbit/s and make it worthwhile. @AttilaFueloep the original description states it might corrupt a pool, but I assume that's just precautionary? And based on, for example, those who've been running this for over half a year, it seems safe-ish to run? So if I put this on a production system there will be zero chance I lose the pool? (Corrupting the encrypted dataset is a risk I'm willing to take, but that is most likely also close to zero.)
Hello,
Thank you for your effort identifying the problem and suggesting a solution. I am trying to figure out if I can trust my data to my current setup using native encryption. Should I hold my breath until the issue is fixed, or did everybody move on to systems with AVX or ditch native encryption long ago? I don't want to write off my current NAS setup, which I thought more than capable. I'm considering returning to LUKS beneath ZFS or gocryptfs on top. I don't know which replacement to pick for a low-power NAS with ECC. All the recommendations for low-power hardware for this use case are dated to the extent that those are hardly available now. I don't have reliable figures on how low a Ryzen 5600 with a B550 mainboard can go at idle, so I don't know how much 24h wattage I would have to add on top of the additional hardware cost... To summarize, I come here very often and hope for progress on this issue. Thank you for your input, and thank you so much if you could push this any further.
@ChrisWitt I am just on the sidelines, but for the sake of discussion...
LUKS / cryptsetup has the very same problem and expects hardware to have AVX extensions. This seems to be a kernel-level omission and not really a package-centric omission. All crypto operations will be slow without AVX on Linux. Ideally, someone would port SSE to LUKS as well. You will get very much the same slowdowns and...
What happens on my systems is that due to the slowness, the queue of pending I/O operations tends to grow, and at some point the kernel gets tired of waiting and bombs out with a kernel panic, namely the … When that happens, the filesystem is not properly closed, such that it requires repairs on remount. So, as a straight-up answer to your question: if you get any of the former, then no. Until a patch like this makes it into ZFS (and then LUKS, etc.), data consistency will be affected when you get so much I/O slowness that the kernel panics and then trashes any file descriptors when the filesystem gets closed.
play x-files theme Good question. SSE performs just as well. Why did the Linux overlords only implement acceleration with AVX and not SSE? It definitely shafts an entire line of older (read: affordable) Xeon systems that could pretty much be effective and cut down on the costs of newer hardware. If you do get any response, please clue me in. The sheer number of hours that I have wasted with commodity / older enterprise hardware, only to figure out that nothing is working as expected just because crypto is "slow" due to missing AVX extensions, is obscene. I have swapped out spinning HDDs for SSDs and NVMe just to make it more absurd, and nope, it still does not work when you have a large storage array, and that's due to this problem regarding crypto.
I'm also rooting for this very useful patch to get included as it would really help my main and backup storage devices.
@AttilaFueloep are you still working on this? I rebased your pull request onto current master (without debug statements) at tssge#1 and I am seeing 6-10x improvements in encryption speed on my device (Intel Celeron 3865U). I did a lot of testing before running this on actual data and have not noticed any issues in testing or production. Such a huge improvement is almost unbelievable and I think it would be great to get this upstreamed. If you're not working on this anymore, do you mind me taking over to get this upstreamed? I was thinking to mainly clean up the debug defines and probably break the SIMD selection/abstraction refactoring into its own commit, for clarity and to ease adding new SIMD ops in the future.
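For anyone wanting to try that rebased branch, something along these lines should work (the local branch name is an arbitrary choice; pull/1/head follows from the tssge#1 reference above):
git clone https://github.com/tssge/zfs && cd zfs
git fetch origin pull/1/head:sse4_1-rebased
git checkout sse4_1-rebased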
Sorry for dropping the ball on this. After working out the Windows quirks with lundman I didn't get any dev feedback, and then life happened... It's encouraging to hear that this code is used by several people without any issues. I'll try to get it over the line. Please note that there are two open PRs in this area (#16601, #17058). I think there's some coordination required to make things smooth.
from #17058:
@AttilaFueloep Once refactored and rebased, I will take a look. I have actually looked in on this a couple of times but had a similar problem as with #17058: the actual assembly & math code I'm woefully unqualified to judge, and ICP touchpoints are a nightmare. These days I just know a little more about what to do about those, which is where the tests came from. FWIW, things I'd be looking for in review are:
I'll leave it with you, but like I say, I'm more than happy to put it on the big test rig, and I actually am re-lifing a small machine right now that could benefit from this (a Celeron J3455 Goldmont), so I can give it a whirl there too.
@robn Thanks, I'll get back to you once I'm ready. The assembly files deter reviewers from looking at the non-assembly parts, that's true. Still, review of the remaining parts would be valuable to me. Also keep in mind that they are taken from reputable sources and have been in use for a long time. The changes to them are massive, I must admit; I had to translate them from NASM to GNU as syntax (for various reasons you don't want to know about). But that should've been mostly syntax, plus some call parameter juggling; no logic touched. What I can say is that there are (or were?) people using this in production(?) without any issues. So this code has some mileage already. I tested it quite extensively as well. What I did was to use the newly added implementation to additionally operate on a copy of the original GCM context (a shadow context). That way I could compare the results of both implementations (SSE vs AVX) and log or panic if they differed. With that in place I created heavy FS activity. That should have given quite good coverage, I hope. TL;DR: Given the mileage and the way I tested this, I feel quite comfortable about the correctness, but of course you can never know. The security implications (if any) of my changes, on the other hand, are unknown.
Regarding the existing cycle implementation, do you know about any issues with it? I took special care to get the AVX part right; any bugs I'll try to fix.
That's for sure. :-)
I've been using it since August, I guess (based on when I last commented here), on a system with non-important data. The system is an N5105-based router (so the system is important, if not critical). It runs ZFS on root, but only one or two datasets actually make use of encryption, and these datasets aren't important / I can recover from data loss. But I haven't noticed any issues with it and it seems to be running just fine. Meaning that neither ZFS / scrubs nor file / "software" based checksums have detected any issues with the files. What I am running, to be specific, is ZFS 2.2.5 of Debian Bookworm with this PR / patch applied, following the procedure given here: https://grimore.org/unix/speeding_up_zfs_encryption_with_sse4.1_acceleration (link posted earlier in this thread). I do have a second N5105 system, but on that one I'm not running this PR, as it does contain important data and I thus don't really want to run "experimental" code on it. (But at the same time, as it has been running rock solid for 6 months on the other system, it obviously gives me confidence that it should work on this system as well.)
@AttilaFueloep yeah, I'm not too worried as long as I can do a rough side-by-side scan and see that there's not obviously anything weird introduced. Not that I'm expecting anything; just trying to be vigilant.
Yeah, I had similar concerns in #14249. The consensus there was that we aren't cryptographers, and using code that is either easy for us to audit or well known and trusted (or both!) will be good enough. Not much else we can do!
I like this technique. I'm doing some rework on the algorithm selection at the moment (mostly just exploratory at this stage); that might be a good thing to include in the toolkit.
I'm inclined to trust the test vectors for general correctness. There are other classes of problem, of course (timing, side channels), but we do the best we can.
There were some questions about it in #17058, but I wasn't paying close attention. Looks like it should be ok now. Good work :)
@RobertMe Thanks for confirming and devoting a box to guinea-pigging my changes, much appreciated!
As a side note: as someone else noted in this thread, Linux crypto does not support SSE (didn't know that), so even switching to dm-crypt won't help; you'll end up with poor performance on LUKS devices as well. Dunno, but with TrueNAS SCALE being based on Linux, iXsystems might be interested in this change as well. It would make encryption on low-power devices viable.
That would be hard in this case due to the NASM -> GNU as conversion.
There's even an acronym for that: IANAC. ;-)
Agreed.
Good to have feedback on this. So I'll add the debugging code to my refactoring. I think I should wire it up by adding a …
BTW, I'm also touching the algorithm selection. We should coordinate to avoid double work.
Yeah, the tests you added are a great improvement, it's really good to have them. Thanks for doing that!
Agreed. Thanks for confirming on the cycle implementation.
Just pushed a raw draft at https://github.com/AttilaFueloep/zfs/tree/zfs-refactor-gcm-simd. It passes minimal testing (creating, writing, and reading files). At this point it needs cleanup and thorough testing. @robn If you're curious you could have a peek.
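A minimal smoke test along the lines described might look like this (pool and device names are placeholders; run it only on throwaway hardware):
zpool create -f testpool /dev/vdb
zfs create -o encryption=aes-256-gcm -o keyformat=passphrase testpool/enc
dd if=/dev/urandom of=/testpool/enc/f bs=1M count=256
sha256sum /testpool/enc/f
zpool export testpool && zpool import testpool -l
sha256sum /testpool/enc/f   # should match the first checksum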
@jamasi Still working on it. For the time being I rebased onto 2.3.4-release and master. Cherry-picking commit 4fdac4a or d248d8a respectively should give a working version. Please note that both are only lightly tested (built, ran fio and a scrub on a raidz pool). You'll want to make sure that you have a backup.
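A sketch of that cherry-pick, assuming the commits live in the fork linked earlier in this thread (the remote name is an arbitrary choice; adjust branch names to your checkout):
git remote add attila https://github.com/AttilaFueloep/zfs && git fetch attila
git checkout 2.3.4-release && git cherry-pick 4fdac4a   # on master: d248d8a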
Motivation and Context
Small NAS solutions are often based on hardware with no AVX support, where encryption performs subpar. Most of them do support SSE4.1, on the other hand. Supporting SSE4.1 gives them a massive boost in AES-GCM crypto performance.
Description
This is still a draft; a detailed description will be added once it is ready for merging.
Closes #10846
How Has This Been Tested?
Inside a VM, two fio invocations on an encrypted dataset and mprime -t were run in parallel. Various other tests were done as well. Finally, this PR contains debugging code which runs the AVX and SSE implementations on the same context and compares the results (see the DEBUG_GCM_ASM #define).
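A rough reconstruction of that stress setup (all fio parameters are illustrative assumptions; /pool/enc stands for the encrypted dataset):
mprime -t &
fio --name=encrw --directory=/pool/enc --rw=randrw --bs=128k --size=2g \
    --numjobs=2 --time_based --runtime=600 --group_reporting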
Types of changes
Checklist:
All commit messages are properly formatted and contain Signed-off-by.