
Conversation

AttilaFueloep
Contributor

Motivation and Context

Small NAS solutions are often based on hardware with no AVX support, where encryption performs subpar. Most of them do support SSE4.1, on the other hand. Supporting SSE4.1 will give them a massive boost in AES-GCM crypto performance.

Description

This is still a draft, a detailed description will be added once this is ready for merging.

Closes #10846

How Has This Been Tested?

Inside a VM, two fio invocations on encrypted datasets and mprime -t were run in parallel. Various other tests were done as well. Finally, this PR contains debugging code which runs the AVX and SSE implementations on the same context and compares the results (see the DEBUG_GCM_ASM #define).

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@AttilaFueloep marked this pull request as draft on February 25, 2023 01:33
@AttilaFueloep
Contributor Author

AttilaFueloep commented Feb 25, 2023

While it's still a draft, I took care to squash the commits in a way which should make reviewing this straightforward.

I did quite some testing without revealing any problems, but at this point I'd appreciate any help with testing on a wider range of systems. If you need any help on how to test this, please drop a note here and I'll give instructions. If you want to test this, please do it in a test VM with a test pool in any case, since this code could permanently corrupt any pools accessed from it.

There's still some cleanup and some minor refinements left, so this is a draft for now. I'm also wondering if I should keep the debugging code in the #ifdef DEBUG_GCM_ASM conditionals in the final PR. I'm leaning towards it, since it would ease development of the planned follow-ups for AVX2 and maybe VAES. It could be removed once the dust settles and the code gets good usage coverage.

@AttilaFueloep
Contributor Author

The people at iXsystems may be interested in testing this too. Surely they'll have far better test infrastructure to test this on than I do.

@AttilaFueloep
Contributor Author

@lundman I made a major effort to make this clang compatible on Linux. If you have a spare hour, you may want to test whether it works on macOS and Windows as well.

@AttilaFueloep
Contributor Author

One important point regarding testing I missed above:

If you are testing this on hardware which has AVX support, you need to activate the SSE implementation by doing

# echo sse4_1 >/sys/module/zfs/parameters/icp_gcm_impl

or by setting it at module load time.
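
Reading the parameter back lists the available implementations with the active one shown in brackets, and a persistent setting can be placed in a modprobe.d file. This is only a sketch; the example output and the file name below are illustrative and will differ per system:

# cat /sys/module/zfs/parameters/icp_gcm_impl
cycle [fastest] sse4_1 pclmulqdq generic    (example output; yours will differ)

# echo "options zfs icp_gcm_impl=sse4_1" >/etc/modprobe.d/zfs.conf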

@lundman
Contributor

lundman commented Feb 25, 2023

Excellent, I will give it a go on macOS and Windows.

@oberien

oberien commented Feb 25, 2023

Thank you a lot for your work!

I have a ReadyNAS 426 with an Intel(R) Atom(TM) CPU C3538 @ 2.10GHz, without AVX but with SSE3, SSE4.1 & SSE4.2. I'd be willing to test your implementation.

I have 6 × 8TB drives in ZFS for storage (ZFS directly uses the disks and autoformatted them to GPT: 7.3T ZFS + 8M reserved) and 2 × 64GB High Endurance microSD cards in RAID with debian bookworm.

Could you please provide instructions on how I can best test this PR?

@robn
Member

robn commented Feb 25, 2023

I did a light fio run on a small Celeron J3455 [Goldmont] board with SSE4.2 but no AVX. Using a single-vdev tmpfs-backed pool, no compression:

  • generic: ~3.7MB/s
  • pclmulqdq: ~6.1MB/s
  • sse4_1: ~13.2MB/s
  • no encryption: ~17MB/s

fio was: fio --name=xxx --size=4K --bs=4K --rw=randrw --ioengine=psync --sync=1 --iodepth=32 --numjobs=1 --direct=1 --end_fsync=1 --gtod_reduce=1 --time_based --runtime=60

Hardly science, but those initial numbers seem pretty good. Strong work!

@AdamLantos

AdamLantos commented Feb 26, 2023

Thank you @AttilaFueloep, I too have very promising results on a C3558 system running Debian Bullseye under a KVM hypervisor with a WD Black SN750 NVMe disk.

In my tests below I just used dd to read a big file, with the ARC size limited to 500MB.

Unencrypted read speed was 909MB/s, the old pclmulqdq implementation did 86MB/s, and the new sse4_1 implementation increased that to 670MB/s! With compression enabled, the speed stayed roughly the same. For reference, FreeBSD achieved roughly similar speed (660MB/s) in my previous tests on a similar setup.

# uname -a
Linux labdebian 5.10.0-21-amd64 #1 SMP Debian 5.10.162-1 (2023-01-21) x86_64 GNU/Linux

# echo 524288000 >/sys/module/zfs/parameters/zfs_arc_max
# zfs create -o compression=off test/noenc
# zfs create -o encryption=on -o keyformat=passphrase -o compression=on test/enc2
# zfs create -o encryption=on -o keyformat=passphrase -o compression=off test/enc3
# cp /mnt/media/test.tar /test/noenc/
# cp /mnt/media/test.tar /test/enc3/

# dd if=/test/noenc/test.tar bs=128k of=/dev/null status=progress
6319505408 bytes (6.3 GB, 5.9 GiB) copied, 7 s, 903 MB/s
52328+1 records in
52328+1 records out
6858762240 bytes (6.9 GB, 6.4 GiB) copied, 7.54886 s, 909 MB/s

# echo generic > /sys/module/zfs/parameters/icp_gcm_impl
# dd if=/test/enc3/test.tar bs=128k of=/dev/null status=progress
6840385536 bytes (6.8 GB, 6.4 GiB) copied, 189 s, 36.2 MB/s
52328+1 records in
52328+1 records out
6858762240 bytes (6.9 GB, 6.4 GiB) copied, 189.506 s, 36.2 MB/s

# echo pclmulqdq > /sys/module/zfs/parameters/icp_gcm_impl
# dd if=/test/enc3/test.tar bs=128k of=/dev/null status=progress
6820200448 bytes (6.8 GB, 6.4 GiB) copied, 79 s, 86.3 MB/s
52328+1 records in
52328+1 records out
6858762240 bytes (6.9 GB, 6.4 GiB) copied, 79.4401 s, 86.3 MB/s

# echo sse4_1 > /sys/module/zfs/parameters/icp_gcm_impl
# dd if=/test/enc3/test.tar bs=128k of=/dev/null status=progress
6715080704 bytes (6.7 GB, 6.3 GiB) copied, 10 s, 672 MB/s
52328+1 records in
52328+1 records out
6858762240 bytes (6.9 GB, 6.4 GiB) copied, 10.1724 s, 674 MB/s

# dd if=/test/enc2/test.tar bs=128k of=/dev/null status=progress
6629883904 bytes (6.6 GB, 6.2 GiB) copied, 10 s, 663 MB/s
52328+1 records in
52328+1 records out
6858762240 bytes (6.9 GB, 6.4 GiB) copied, 10.2971 s, 666 MB/s

@AttilaFueloep
Contributor Author

@oberien Well, I'd suggest the following:

Spin up a VM which has all the dependencies for building ZFS installed, but not the ZFS packages themselves.

Inside the VM do:

git clone http://github.com/openzfs/zfs/
cd zfs
git pull origin pull/14531/head
./autogen.sh && ./configure --prefix=/ --enable-debuginfo --enable-debug && make 

Open a new root terminal inside the VM since for me the installation step failed with sudo.

cd <dir where you build zfs>
make install
modprobe zfs
echo sse4_1 >/sys/module/zfs/parameters/icp_gcm_impl
zpool create ...
zfs create -o encryption= ...

Run your favorite workloads and benchmark tools.
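
As a hedged illustration only (the device, pool/dataset names, key handling and fio parameters below are placeholders, not the exact commands from this PR), the elided create and benchmark steps could look something like:

# illustrative example only - adjust the device, names and key handling
dd if=/dev/urandom of=/tmp/test_key bs=1 count=32
zpool create testpool /dev/vdb
zfs create -o encryption=aes-256-gcm -o keyformat=raw -o keylocation=file:///tmp/test_key testpool/enc
fio --directory=/testpool/enc --name=gcmtest --size=1G --bs=128k --rw=rw --ioengine=psync --numjobs=1 --end_fsync=1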

Comment on lines 330 to 335
if (sc->gcm_Htable != NULL) {
kmem_free(c->gcm_Htable, c->gcm_htab_len);
}
if (sc->gcm_pt_buf != NULL) {
vmem_free(c->gcm_pt_buf, c->gcm_pt_buf_len);
}

Should it be sc-> instead of c-> in the free arguments?

Contributor Author

Yes, nice catch. I have this fix in my tree already, dunno why I missed it here.

Contributor Author

Looks like a merge error. Fixed.

@oberien

oberien commented Feb 26, 2023

I tested using an 8GB ramdisk, both on a Debian host (shipped ZFS) and in a Debian VM (ZFS compiled from this PR).

Interestingly, ./configure states:

checking whether host toolchain supports AVX... yes
checking whether host toolchain supports AVX2... yes
checking whether host toolchain supports AVX512F... yes

Setup VM (on Debian):

sudo apt install --no-install-recommends qemu-system libvirt-clients libvirt-daemon-system virtinst dnsmasq
virsh net-start default
virt-install --virt-type kvm --name bullseye-amd64 \
--location http://deb.debian.org/debian/dists/bullseye/main/installer-amd64/ \
--os-variant debian11 \
--disk size=10 --memory 10000 \
--graphics none \
--console pty,target_type=serial \
--extra-args "console=ttyS0"

Setup ZFS:

sudo apt install build-essential git automake libtool gettext zlib1g-dev uuid-dev libblkid-dev libssl-dev linux-headers-$(uname -r) libaio-dev libudev-dev
git config --global pull.rebase true
git config --global user.email test@example.com
git clone https://github.com/openzfs/zfs/
cd zfs
git pull origin pull/14531/head
./autogen.sh && ./configure --prefix=/ --enable-debuginfo --enable-debug && make 
make install
depmod
modprobe zfs
echo sse4_1 >/sys/module/zfs/parameters/icp_gcm_impl

Test code:

zfs --version
mkdir /tmp/tmpfs
mount -t tmpfs none /tmp/tmpfs
truncate -s 8G /tmp/tmpfs/test
zpool create -o ashift=12 -O mountpoint=none test /tmp/tmpfs/test

# without encryption
zfs create -o atime=off -o compression=off -o mountpoint=/test test/test
dd if=/dev/zero of=/test/test bs=1M count=4k
zpool export test && zpool import -d /tmp/tmpfs/test test -l
dd if=/test/test of=/dev/null bs=1M count=4096
zfs destroy test/test

# with encryption
dd if=/dev/urandom of=/tmp/test_key bs=1 count=32
zfs create -o encryption=aes-256-gcm -o keyformat=raw -o keylocation=file:///tmp/test_key -o atime=off -o compression=off -o mountpoint=/test test/test
dd if=/dev/zero of=/test/test bs=1M count=4k
zpool export test && zpool import -d /tmp/tmpfs/test test -l
dd if=/test/test of=/dev/null bs=1M count=4096
zfs destroy test/test

# cleanup
zpool destroy test
rmdir /test
rm /tmp/test_key
rm /tmp/tmpfs/test
umount /tmp/tmpfs
rmdir /tmp/tmpfs

Results:

For comparison `perf bench mem memcpy`:
Host: 2.64GB/s
Guest: 1.93GB/s

Output host unencrypted:
zfs-2.1.9-1
zfs-kmod-2.1.7-1
write: 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 7.29287 s, 589 MB/s
read:  4294967296 bytes (4.3 GB, 4.0 GiB) copied, 4.66974 s, 920 MB/s

Output host encrypted:
write: 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 27.5948 s, 156 MB/s
read:  4294967296 bytes (4.3 GB, 4.0 GiB) copied, 22.4259 s, 192 MB/s

Output guest unencrypted:
zfs-2.1.99-1750_g57314b5ec
zfs-kmod-2.1.99-1750_g57314b5ec
write: 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 9.31044 s, 461 MB/s
read:  4294967296 bytes (4.3 GB, 4.0 GiB) copied, 7.15297 s, 600 MB/s

Output guest encrypted:
write: 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 14.0712 s, 305 MB/s
read:  4294967296 bytes (4.3 GB, 4.0 GiB) copied, 15.1526 s, 283 MB/s

tl;dr: The host without encryption is about 150% as fast as the VM. The host with encryption is about 50% as fast as the VM. Assuming the unencrypted difference is due to VM overhead, I expect the host with encryption and this patch to be in the ballpark of ~3× as fast as before.

@AttilaFueloep
Contributor Author

@oberien

Interestingly, ./configure states:

checking whether host toolchain supports AVX... yes
checking whether host toolchain supports AVX2... yes
checking whether host toolchain supports AVX512F... yes

configure just checks whether your toolchain can emit the SIMD instructions. Whether the CPU actually supports them is checked at runtime (e.g. zfs_avx_available()).

@lundman
Contributor

lundman commented Feb 28, 2023

macOS changes to compile:

openzfsonosx@df3931c

The FN_NAME fix is a quick hack, we should discuss that sometime.

# sysctl kstat.zfs.darwin.tunable.icp_gcm_impl=sse4_1
kstat.zfs.darwin.tunable.icp_gcm_impl: cycle [fastest] sse4_1 generic pclmulqdq  -> cycle fastest [sse4_1] generic pclmulqdq

# ./zfs create -o encryption=aes-256-gcm -o keyformat=passphrase BOOM/gcm
Enter new passphrase:
Re-enter new passphrase:

panic

0xffffff801e72b0f0 : 0xffffff801678ba40 mach_kernel : trap_from_kernel + 0x26
0xffffff801e72b110 : 0xffffff7f99e92947 org.openzfsonosx.zfs : _initial_num_blocks_is_2_34 + 0x15c
0xffffff801e72b370 : 0xffffff7f99d51d56 org.openzfsonosx.zfs : _aes_encrypt_contiguous_blocks + 0x36
0xffffff801e72b390 : 0xffffff7f99d3a1c1 org.openzfsonosx.zfs : _crypto_update_uio + 0x251
0xffffff801e72b3f0 : 0xffffff7f99d3ac21 org.openzfsonosx.zfs : _aes_encrypt_atomic + 0x181
0xffffff801e72b510 : 0xffffff7f99d387cd org.openzfsonosx.zfs : _crypto_encrypt + 0xdd
0xffffff801e72b590 : 0xffffff7f99d1b880 org.openzfsonosx.zfs : _zio_do_crypt_uio + 0x1c0
0xffffff801e72b6b0 : 0xffffff7f99d1b693 org.openzfsonosx.zfs : _zio_crypt_key_wrap + 0x153
frame #0: 0xffffff7f99e92947 zfs`initial_num_blocks_is_2_34 at isalc_gcm_sse.S:2342
(lldb) disass
zfs`initial_num_blocks_is_2_34:
    0xffffff7f99e927eb <+0>:    movdqu %xmm8, %xmm6
    0xffffff7f99e927f0 <+5>:    movdqu 0x10(%rdi), %xmm9
    0xffffff7f99e927f6 <+11>:   paddd  0xbd231(%rip), %xmm9      ; ZERO + 14
    0xffffff7f99e927ff <+20>:   movdqa %xmm9, %xmm7
    0xffffff7f99e92804 <+25>:   pshufb 0xbd1b3(%rip), %xmm7      ; TWOONE + 62
    0xffffff7f99e9280d <+34>:   paddd  0xbd21a(%rip), %xmm9      ; ZERO + 14
    0xffffff7f99e92816 <+43>:   movdqa %xmm9, %xmm8
    0xffffff7f99e9281b <+48>:   pshufb 0xbd19b(%rip), %xmm8      ; TWOONE + 62
    0xffffff7f99e92825 <+58>:   movdqu (%r8), %xmm0
    0xffffff7f99e9282a <+63>:   pxor   %xmm0, %xmm7
...
    0xffffff7f99e92936 <+331>:  movdqu (%rdx,%r11), %xmm12
    0xffffff7f99e9293c <+337>:  pxor   %xmm12, %xmm7
    0xffffff7f99e92941 <+342>:  movdqu %xmm7, (%rsi,%r11)
->  0xffffff7f99e92947 <+348>:  addq   0x6, %r11
    0xffffff7f99e9294f <+356>:  pshufb 0xbd068(%rip), %xmm7      ; TWOONE + 62
    0xffffff7f99e92958 <+365>:  movdqu (%rdx,%r11), %xmm12
    0xffffff7f99e9295e <+371>:  pxor   %xmm12, %xmm8
show registers

(lldb) register read -all
General Purpose Registers:
       rax = 0xffffff7f99f5adf0  zfs`isalc_ops
       rbx = 0xffffff90c8094ae8
       rcx = 0x0000000000000020
       rdx = 0xffffff801e72b950
       rdi = 0xffffff801e72b408
       rsi = 0xffffff90c8094ae8
       rbp = 0xffffff801e72b370
       rsp = 0xffffff801e72b200
        r8 = 0xffffff90c9c77a00
        r9 = 0xffffff90c96f7840
       r10 = 0x0000000000000020
       r11 = 0x0000000000000000
       r12 = 0x0000000000000002
       r13 = 0x0000000000000020
       r14 = 0xffffff801e72b280
       r15 = 0x0000000000000000
       rip = 0xffffff7f99e92947  zfs`initial_num_blocks_is_2_34 + 348
    rflags = 0x0000000000010246
        cs = 0x0000000000000008
        fs = 0x00000000ffff0000
        gs = 0x000000001e720000

Floating Point Registers:
       fcw = 0x0000
       fsw = 0x0000
       ftw = 0x00
       fop = 0x0000
        ip = 0x00000000
        cs = 0x0000
        dp = 0x00000000
        ds = 0x0000
     mxcsr = 0x00000000
  mxcsrmask = 0x00000000
     stmm0 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
     stmm1 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
     stmm2 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
     stmm3 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
     stmm4 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
     stmm5 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
     stmm6 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
     stmm7 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
      xmm0 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
      xmm1 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
      xmm2 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
      xmm3 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
      xmm4 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
      xmm5 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
      xmm6 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
      xmm7 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
      xmm8 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
      xmm9 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
     xmm10 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
     xmm11 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
     xmm12 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
     xmm13 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
     xmm14 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
     xmm15 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}

Exception State Registers:
3 registers were unavailable.

@AttilaFueloep
Contributor Author

AttilaFueloep commented Feb 28, 2023

@lundman Thanks for testing on macOS.

The odd thing I spotted is this:

0xffffff7f99e92947 <+348>:  addq   0x6, %r11
                                   ^^^

That should read

0xffffff7f99e92947 <+348>:  addq   $0x10, %r11

EDIT:

Removed a lot of nonsense I wrote sleep deprived and added what I should've written in the first place.

@lundman
Contributor

lundman commented Feb 28, 2023

Oh, I assumed it was because r11 is NULL and addq 0x6, %r11 would dereference it, but I admit I didn't stop and think whether it even did dereference.

@lundman
Contributor

lundman commented Mar 1, 2023

That should read 0x10, not 0x06 ($16 in the code, so maybe the 1 got lost?).

No it really is gone:

add $16, \DATA_OFFSET

# llvm-objdump --disassemble ./module/icp/asm-x86_64/modes/macos_zfs-isalc_gcm256_sse.o
    4827: 4c 03 1c 25 06 00 00 00       addq    6, %r11

add 0x10, \DATA_OFFSET

    4827: 4c 03 1c 25 10 00 00 00       addq    16, %r11

Could it be $1 is eaten by the macro definition? Just add 16, %r11 works, so possibly .rept expects $i style notation.

@lundman
Contributor

lundman commented Mar 1, 2023

More on panic

 panic
panic(cpu 1 caller 0xffffff800dd69aef): Kernel trap at 0xffffff7f91292911, type 14=page fault, registers:
CR0: 0x0000000080010033, CR2: 0x0000000000000010, CR3: 0x0000000011a69000, CR4: 0x00000000003606e0
RAX: 0xffffff7f9135adf0, RBX: 0xffffff90b0094ae8, RCX: 0x0000000000000020, RDX: 0xffffff80a72a3950
RSP: 0xffffff80a72a3200, RBP: 0xffffff80a72a3370, RSI: 0xffffff90b0094ae8, RDI: 0xffffff80a72a3408
R8:  0xffffff90b1f26800, R9:  0xffffff90b1957d40, R10: 0x0000000000000020, R11: 0x0000000000000000
R12: 0x0000000000000002, R13: 0x0000000000000020, R14: 0xffffff80a72a3280, R15: 0x0000000000000000
RFL: 0x0000000000010246, RIP: 0xffffff7f91292911, CS:  0x0000000000000008, SS:  0x0000000000000010
Fault CR2: 0x0000000000000010, Error code: 0x0000000000000000, Fault CPU: 0x1 VMM, PL: 0, VF: 1

-> 2156                 (*(isalc_ops.igo_enc_update[impl][keylen]))(
   2157                     ctx, ct_buf, datap, bleft);

(lldb) p/x ct_buf
(uint8_t *) $5 = 0xffffff90b0094ae8 "\xe1q\xbd\xe2\x99\xc4\xcc\U0000001a<f\U0000

       rsi = 0xffffff90b0094ae8

(lldb) x/40 0xffffff90b0094ae8
0xffffff90b0094ae8: 0xe2bd71e1 0x1accc499 0x7c18663c 0x1f2226c6
0xffffff90b0094af8: 0xb0094d49 0xffffff90 0xf310f1e5 0x00a6389e

looks like rsi is correct, and I can read the memory.

@AttilaFueloep
Contributor Author

D'oh, I was a zombie yesterday, no sleep no brain.

No it really is gone:

Looks like a compiler bug.

add 0x10, \DATA_OFFSET

That adds the contents at memory location 0x10 to the %r11 register. That will of course panic with a page fault. Could you try add $0x10, \DATA_OFFSET?

(I'm assuming that on macOS $ also indicates an immediate value.)

@lundman
Contributor

lundman commented Mar 1, 2023

OK
add $(0x10), \DATA_OFFSET
doesn't work, but

                add     $(0x10), \DATA_OFFSET

    47cd: 49 83 c3 10                   addq    $16, %r11

Now it crashes further down - lemme walk it until it doesn't crash.

On the $1 thing, closest I found was

AsmParser.cpp

bool AsmParser::expandMacro()

      if (IsDarwin && !NParameters) {
        // This macro has no parameters, look for $0, $1, etc.

@lundman
Contributor

lundman commented Mar 1, 2023

Ah no, that was the only change needed:

diff --git a/module/icp/asm-x86_64/modes/isalc_gcm_sse.S b/module/icp/asm-x86_64/modes/isalc_gcm_sse.S
index d8bc585fc..c19443687 100644
--- a/module/icp/asm-x86_64/modes/isalc_gcm_sse.S
+++ b/module/icp/asm-x86_64/modes/isalc_gcm_sse.S
@@ -641,7 +641,8 @@ movdqu      16*j(\KEYSCHED), \T_key         // encrypt with last (14th) key round (12 for GC
                XLDR    (\PLAIN_CYPH_IN, \DATA_OFFSET), \T1
                pxor    \T1, xmmi
                XSTR    xmmi, (\CYPH_PLAIN_OUT, \DATA_OFFSET)   // write back ciphertext for \num_initial_blocks blocks
-               add     $16, \DATA_OFFSET
+               add     $(0x10), \DATA_OFFSET
                .ifc \ENC_DEC, DEC
                movdqa  \T1, xmmi
                .endif
(random file copied and verified)
md5 sierra-fix.patch /Volumes/BOOM/gcm/sierra-fix.patch
MD5 (sierra-fix.patch) = cffd2656b57182f0d6d2031e2f545d3f
MD5 (/Volumes/BOOM/gcm/sierra-fix.patch) = cffd2656b57182f0d6d2031e2f545d3f

@AttilaFueloep
Contributor Author

Great! I'll add the workaround.

@lundman
Contributor

lundman commented Mar 1, 2023

Windows compile:

C:\src\openzfs\out\build\x64-Debug\clang -cc1as : fatal error : error in backend: unable to evaluate offset for variable 'breg'

@AttilaFueloep
Contributor Author

So now that's a real problem, not easily fixed. Are both the macOS and Windows clangs at the same version?

@lundman
Contributor

lundman commented Mar 1, 2023

Not at all; macOS is Apple's clang:

Apple clang version 12.0.0 (clang-1200.0.32.29) Target: x86_64-apple-darwin19.6.0

and Windows is
clang version 12.0.0 Target: i686-pc-windows-msvc

Note that clang-cl.exe makes it run in Windows-compatibility mode, so a bunch of things behave more like Windows. I'll
have to check if that includes macros.

@lundman
Contributor

lundman commented Mar 1, 2023

OK replacing clang-cl.exe with just clang.exe did not make any difference.

@AttilaFueloep
Contributor Author

I'd bet that if you comment out the breg macro use, it will trip over the xmmi macro. And the xmmi one is crucial; without it one would need to expand all macros, yielding an unreadable ~10kloc assembly file. As a last resort one could use the original NASM source on Windows, but that would be a bit of a nightmare to support.

@lundman
Contributor

lundman commented Mar 1, 2023

The llvm guys said:

The issue is that the code uses the breg symbol like a macro, but on Windows the assembler also tries to emit the symbol itself, which fails because its value is a register.

I don’t know why this is different from non-Windows.

In any case, you can sneak past this error by using the private prefix “.L” on the symbol, so .Lbreg instead of just breg.

I'll see if I can figure out what they mean

@AttilaFueloep
Contributor Author

AttilaFueloep commented Mar 1, 2023

Well, as far as I understand it, just replace breg with .Lbreg in the macro .macro bytereg reg in isalc_reg_sizes.S (line 88 ...) and do the same in isalc_gcm_sse.S (lines 357 and 370). Sorry, but I currently don't have the time to test it here.

@ChrisWitt

Such good news to see progress on this issue. Thank you so much for your effort. Hope I can soon use native encryption on my Atom C3338. The Supermicro Inc. A2SDi-2C-HLN4F-B seems so ideal for a DIY NAS as soon as encryption is working reliably :)

@segator

segator commented May 5, 2024

Such good news to see progress on this issue. Thank you so much for your effort. Hope I can soon use native encryption on my Atom C3338. The Supermicro Inc. A2SDi-2C-HLN4F-B seems so ideal for a DIY NAS as soon as encryption is working reliably :)

I have it running on my NAS as a replicated store and I didn't find any issues or corruption when scrubbing or checksumming backups. I think it works fully reliably, but it would be nice to hear about someone else's experience.

@FeepingCreature

FeepingCreature commented May 5, 2024

I've been running it on my NAS (C3558) for months without issues.

@Tasqa

Tasqa commented Jun 4, 2024

After the positive comments here, I made the effort to try these patches on my Celeron J4125 based NAS. What a difference! These finally make ZFS usable on these machines. I have yet to find any issues, so I'll keep running these until they get merged.

@segator

segator commented Jun 5, 2024

@AttilaFueloep what is missing to merge this? How can we help you if we are not C++ devs?

@jumbi77
Contributor

jumbi77 commented Jun 7, 2024

I know this is for lower-performance CPUs; I just want to link this recent work in the Linux kernel (https://www.phoronix.com/news/AES-GCM-Intel-AMD-Linux-6.11). I don't know if ZFS on Linux uses these interfaces and might benefit from them (if you have a supported CPU), or if ZFS has its own framework. In the latter case, porting the relevant parts from the kernel might lead to another performance increase.

But anyway, looking forward to getting this integrated first/soon! Big thanks to @AttilaFueloep

@cxtal

cxtal commented Jun 17, 2024

Applied the patch today for a Celeron NAS that kept waving gcm_pclmulqdq_mul at the top of perf top. With the patch, gcm_pclmulqdq_mul is gone and everything is very speedy. I can confirm that the performance improvement is colossal. If it helps anyone using Debian, there is a write-up available. TL;DR: the instructions to pull from git do not work because there will be compilation errors and the module will not be compiled. Fortunately, Debian stable "Bookworm" packages ZFS at 2.2.4, which seems to accept the changes by @AttilaFueloep, and then compiles ZFS in a Debian-compliant way with DKMS such that the ZFS module will be rebuilt when a new kernel is installed. It might keep the ZFS version behind mainstream, but if you're in the scenario that this patch addresses, the sacrifice is well worth it due to the overwhelming speedup.
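
For reference, a generic outline of rebuilding the Debian package with the patch applied might look roughly like the following (a sketch, not the write-up's exact steps; the patch file name is a placeholder and deb-src entries must be enabled):

# sketch only - sse4_1.patch stands for a saved diff of this PR
apt update && apt build-dep zfs-linux && apt source zfs-linux
cd zfs-linux-2.2.4*
patch -p1 < ../sse4_1.patch
dpkg-buildpackage -us -uc -b
apt install ../zfs-dkms_*.deb ../zfsutils-linux_*.deb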

@micredd

micredd commented Jun 23, 2024

If it helps anyone using Debian there is a write-up available.

Thank you for the write-up! As a non-developer running Debian Bookworm, this was very helpful. I've installed the patched Debian ZFS DKMS package on a system with a simple mirrored pool and an Intel Atom C2558 CPU running at 2.40GHz.

Without the patch, this system struggled to read from encrypted datasets faster than ~90MB/s. With the patch, the same hardware handily saturates gigabit Ethernet and the CPU seems noticeably less busy.

Excellent work!

@cxtal

cxtal commented Jun 23, 2024

cryptsetup / Linux encryption seems to have the very same issue where only AVX is fully supported. cryptsetup frequently leads to kernel crashes due to hangs while encrypting data before committing it to disk. It makes encryption on low-end hardware, but also on mid-tier enterprise gear (older Xeon processors that are in fact very affordable as second-hand equipment for a home lab), very disappointing. It's the exact same MO as the ZFS issue, but with extra crashes due to the underlying storage system (whereas with ZFS, it's all-in-one) waiting for the queue, not getting the green light, and then emitting a crash reporting that some (write) operation hung for a certain number of seconds. So a port of the operations to SSE(+) would solve the issue with cryptsetup as well.

@RobertMe

@AttilaFueloep may I be so bold as to ask what the status of this PR is? I'm not trying to push you, just curious whether this is a "feature" you're still pursuing*, or whether it has gone stale.

Based on the comments of others it seems to work just fine, and I think (some of them) are running it on "production" systems as well? (Maybe not the systems ZFS was designed for, but for a DIY NAS etc.) So over time (my assumption is @FeepingCreature has been running it in "production" since November last year, with others chiming in later) I guess that is a good indicator that it's working / "has been tested"? But it obviously has (only) been run by those who are helped by this change, so it might have regressions in other areas which aren't fully tested yet.
Furthermore I believe that, with the help of @lundman, the (compile) issues on macOS and Windows have been resolved as well? Which means another box has been ticked towards getting this PR into a mergeable state.
So what's left, I think, is going over the TODOs in the code and fixing those? After which it might be possible to mark the PR as ready for review?

And even a comment that you won't be pursuing this PR anymore might be helpful, as in that case someone else might feel free to continue your work**.

* In the linked issue I read that you earlier wanted to buy a NAS yourself with a CPU not supporting AVX, so you needed this yourself, which I know is a great motivator for contributions :). But if you lost interest in that, you might understandably have lost interest in this PR as well.

** I would love to do so myself, but I hardly know any C, let alone all the assembly stuff, kernel (module) development or the ZFS codebase, so I don't have the knowledge to do so. But I'm sure there are far more knowledgeable people than me who might be able to help out or continue your work so far.

@AttilaFueloep
Contributor Author

Sorry for the loong delay, other priorities got in the way unfortunately. Thanks for the encouraging comments and all the testing.

I'll try to get this PR over the line soon.

@AttilaFueloep
Contributor Author

@jumbi77

I know this is for lower performance CPUs, just want to link this recent work in the linux kernel (https://www.phoronix.com/news/AES-GCM-Intel-AMD-Linux-6.11).

Unfortunately we can't use Linux crypto due to GPL vs CDDL. That said, the code base I ported the SSE implementation from also has a VAES implementation which should not be hard to port either. The major problem is that I don't own any hardware which supports VAES, so no dice.

@RobertMe

RobertMe commented Aug 23, 2024

Sorry for the loong delay, got other priorities unfortunately. Thanks for the encouraging comments and all the testing.

I'll try to get this PR over the line soon.

That's really great to read.

Yesterday I compiled this on a test system (applied the patch to 2.2.5 from Debian's bookworm-backports) and the results are indeed great. The test system does support AVX so I could also compare with that, and both AVX and this SSE4.1 implementation hit above 500MB/s (single fio run on an 8GB file); the difference was 6MB/s I think (in favor of AVX), which is definitely within my margin of error. With pclmulqdq it only did ~130MB/s, so 25% of AVX/SSE.

This is on an older i5 laptop (quad core, 4GB RAM, SATA SSD, so 500MB/s most likely maxes out the drive or the SATA interface, I guess). My actual systems would be two N5105 machines, which also do about 150MB/s with pclmulqdq, so I'm expecting the same speed improvement there. That would mean I could upgrade networking to 2.5Gbit/s and make it worthwhile.

@AttilaFueloep the original description states it might corrupt a pool, but I assume that's just precautionary? And based on, for example, those who've been running this for over half a year, it also seems safe-ish to run? So if I put this on a production system, is there zero chance I lose the pool? (Corrupting the encrypted dataset is a risk I'm willing to take, but that most likely is also close to zero.)

@ChrisWitt

Hello,
I am still watching this issue, eagerly awaiting the fix.
May I add a few questions to understand the current state and to manage expectations on my end :)

  1. Is the current state without the proposed patch only slow, or do I have to be concerned about data integrity when I keep using OpenZFS on my Atom C3338 without AVX? Is it safe to just ignore the kernel errors and accept that it is very slow (20-30MB/sec)?
  2. What is the stage of the patch? Did it pass some kind of test suite regarding data integrity? Is it safe to use with regard to data integrity? Would it be advisable for a filesystem noob to stick to the stable OpenZFS version until the patch is merged into stable?
    2.1. What steps are ahead for it?
    2.2. Might waiting be in vain because the user base using this feature without an AVX-capable CPU is insignificant? Am I better off replacing my hardware? I have no idea whether this patch is in the pipeline for release and, if so, where it is in the queue.

Thank you for your effort identifying the problem and suggesting a solution. I am trying to figure out if I can trust my data to my current setup using native encryption. Should I hold my breath until the issue is fixed, or did everybody move on to systems with AVX or ditch native encryption long ago?

I don't want to write off my current NAS setup, which I thought more than capable. I am considering returning to LUKS beneath ZFS or gocryptfs on top. I don't know which replacement to pick for a low-power NAS with ECC. All the recommendations for low-power hardware for this use case are dated to the extent that they are hardly available now. I don't have reliable figures on how low a Ryzen 5600 with a B550 mainboard can go at idle, so I don't know how much 24h wattage I would have to add on top of the additional hardware cost...

To summarize, I come here very often hoping to see progress on this issue. Thank you for your input, and thank you so much if you can push this any further.

@cxtal

cxtal commented Oct 20, 2024

@ChrisWitt I am just on the sidelines, but for the sake of discussion...

consider returning to LUKS beneath ZFS or gocryptfs on top.

LUKS / cryptsetup has the very same problem and expects hardware to have AVX extensions. This seems to be a kernel-level omission and not really a package-specific one. All crypto operations will be slow without AVX on Linux; ideally, someone would port SSE there as well. You will get very much the same slowdowns and...

  1. Is it safe to just ignore the kernel errors [...]

What happens on my systems is that, due to the slowness, the queue of pending I/O operations tends to grow, and at some point the kernel gets tired of waiting and bombs out with a kernel panic, namely the blocked for more than 120 seconds message, and kills the task. The error is usually mistaken for bad hardware and appears quite often if you run older enterprise hardware on older Xeons without any AVX support.

When that happens, the filesystem is not properly closed, so it requires repairs on remount. So, as a straight-up answer to your question: if you get any of the former, then no. Until a patch like this makes it into ZFS (and then LUKS, etc.), data consistency will be affected whenever the I/O slowness gets so bad that the kernel panics and then trashes any file descriptors when the filesystem gets closed.

did everybody move on to systems with AVX

play x-files theme Good question. SSE performs just as well, so why did the Linux overlords only implement acceleration with AVX and not SSE? It definitely shafts an entire line of older (read: affordable) Xeon systems that could be quite effective and cut down on the cost of newer hardware. If you do get any response, please clue me in. The sheer number of hours I have wasted with commodity / older enterprise hardware, only to figure out that nothing is working as expected just because crypto is "slow" due to missing AVX extensions, is obscene. I have swapped out spinning HDDs for SSDs and NVMe just to make it more absurd, and nope, it still does not work when you have a large storage array, and that's due to this crypto problem.

@jamasi

jamasi commented Nov 7, 2024

I'm also rooting for this very useful patch to get included as it would really help my main and backup storage devices.

@tssge

tssge commented Feb 23, 2025

@AttilaFueloep are you still working on this? I rebased your pull request onto current master (without debug statements) at tssge#1 and I am seeing 6-10x improvements in encryption speed on my device (Intel Celeron 3865U). I did a lot of testing before running this on actual data and have not noticed any issues in testing or production.

Such a huge improvement is almost unbelievable and I think it would be great to get this upstreamed. If you're not working on this anymore do you mind me taking over to get this upstreamed?

I was thinking of mainly cleaning up the debug defines and probably breaking the SIMD selection/abstraction refactoring out into its own commit, for clarity and to ease adding new SIMD ops in the future.

@AttilaFueloep
Contributor Author

Sorry for dropping the ball on this. After working out the Windows quirks with lundman I didn't get any dev feedback, and then life happened... It's encouraging to hear that this code is used by several people without any issues. I'll try to get it over the line.

Please note that there are two open PRs in this area (#16601 #17058). I think there's some coordination required to make things smooth.

@robn
Member

robn commented Mar 5, 2025

from #17058:

While I have your attention, a review of #14531, once rebased on the refactoring, would be nice as well. To be honest, I was a bit disappointed to see no review feedback on it for years. It has a couple of satisfied users, so I'm trying to get it over the line currently.

@AttilaFueloep Once refactored and rebased, I will take a look. I have actually looked in on this a couple of times but had a similar problem as with #17058: the actual assembly & math code I'm woefully unqualified to judge, and ICP touchpoints are a nightmare. These days I just know a little more about what to do about those, which is where the tests came from.

FWIW, things I'd be looking for in review are:

  • do the correctness tests pass?
  • do the perf tests suggest its actually quicker?
  • does the GCM selector do the right thing?
  • does "cycle" .. sorta work? (I know it's sketchy right now)
  • does it at minimum not make the existing mess worse (ie hook up nicely to a pleasant abstraction (your refactor work) or without that, at least be the least worst thing (like what AVX2 did))

I'll leave it with you, but like I say, more than happy to put it on the big test rig, and I actually am relifing a small machine right now that could benefit from this (a Celeron J3455 Goldmont), so I can give it a whirl there too.

@AttilaFueloep
Contributor Author

@robn Thanks, I'll get back to you once I'm ready.

The assembly files deter reviewers from looking at the non-assembly parts, that's true. Still, a review of the remaining parts would be valuable to me. Also keep in mind that they are taken from reputable sources and have been in use for a long time.

The changes to them are massive, I must admit; I had to translate them from NASM to GNU as syntax (for various reasons you don't want to know about). But that should've been mostly syntax plus some call parameter juggling, with no logic touched.
Are there any attack vectors I opened? I can't say. That would require a qualified cryptographer to review, and I'm afraid that's not going to happen any time soon, so how should we proceed?

What I can say is that there are (or were?) people using this in production(?) without any issues, so this code has some mileage already. I tested it quite extensively as well. What I did was to use the newly added implementation to additionally operate on a copy of the original GCM context (a shadow context). That way I could compare the results of both implementations (SSE vs AVX) and log or panic if they differed. With that in place I created heavy FS activity. That should have given quite good coverage, I hope.

TL;DR

Given the mileage and the way I tested this, I feel quite comfortable about the correctness, but of course you can never know. The security implications (if any) of my changes, on the other hand, are unknown.

does "cycle" .. sorta work? (I know it's sketchy right now)

Regarding the existing cycle implementation, do you know about any issues with it? I took special care to get the AVX part right, any bugs I'll try to fix.

ICP touchpoints are a nightmare.

That's for sure. :-)

@RobertMe

RobertMe commented Mar 6, 2025

What I can say is that there are (or were?) people using this in production(?) without any issues

I've been using it since August, I guess (based on when I last commented here), on a system with non-important data. The system is an N5105-based router (so the system is important, if not critical). It runs ZFS on root, but only one or two datasets actually make use of encryption, and those datasets aren't important / I can recover from data loss. But I haven't noticed any issues with it and it seems to be running just fine, meaning that neither ZFS scrubs nor file / "software" based checksums have detected any issues with the files.

What I am running, to be specific, is ZFS 2.2.5 of Debian Bookworm, with this PR / patch applied, following the procedure giving here: https://grimore.org/unix/speeding_up_zfs_encryption_with_sse4.1_acceleration (link posted earlier in this thread).

I do have a second N5105 system, but on that one I'm not running this PR, as it does contain important data and I thus don't really want to run "experimental" code on it. (But at the same time, since it has been running rock solid for 6 months on the other system, it obviously gives me confidence that it should work on this system as well.)

@robn
Member

robn commented Mar 7, 2025

The assembly files deter reviewers from looking at the non-assembly parts, that's true. Still review of the remaining parts would be valuable for me. Also keep in mind that they are taken from reputed sources and have been in use for a long time.

@AttilaFueloep yeah, I'm not too worried as long as I can do a rough side-by-side scan and see that there's not obviously anything weird introduced. Not that I'm expecting anything; just trying to be vigilant.

Are there any attack vectors I opened? I can't say. That would require a qualified cryptographer to review. I'm afraid that's not going to happen any time soon, so how should we proceed?

Yeah, I had similar concerns in #14249. The consensus there was that we aren't cryptographers, and that using code that is either easy for us to audit or well known and trusted (or both!) will be good enough. Not much else we can do!

What I did was to use the newly added implementation to additionally operate on a copy of the original GCM context (shadow context). That way I could compare the results of both implementations (SSE vs AVX) and log or panic if they did differ. With that in place I created heavy FS activity. That should have given quite a good coverage I hope.

I like this technique. I'm doing some rework on the algorithm selection at the moment (mostly just exploratory at this stage); that might be a good thing to include in the toolkit.

Given the mileage and the way i tested this, I feel quite comfortable about the correctness, but of course you can never know. The security implications (if any) of my changes on the other hand are unknown.

I'm inclined to trust the test vectors for general correctness. There's other classes of problem of course (timing, sidechannel) but we do the best we can.

Regarding the existing cycle implementation, do you know about any issues with it? I took special care to get the AVX part right, any bugs I'll try to fix.

There were some questions about it in #17058, but I wasn't paying close attention. Looks like it should be ok now.

Good work :)

@AttilaFueloep
Contributor Author

@RobertMe Thanks for confirming and devoting a box for guinea-pigging my changes, much appreciated!

@AttilaFueloep
Contributor Author

As a side note, as someone else noted in this thread, Linux crypto does not support SSE (I didn't know that), so even switching to dm-crypt won't help; you'll end up with poor performance on LUKS devices as well. Dunno, but with TrueNAS SCALE being based on Linux, iXsystems might be interested in this change as well. It would make encryption on low-power devices viable.

@AttilaFueloep
Contributor Author

@robn

I'm not too worried as long as I can do a rough side-by-side scan and see that there's not obviously anything weird introduced.

That would be hard in this case due to the NASM -> GNU as conversion.

The consensus there was that we aren't cryptographers

There's even an acronym for that: IANAC. ;-)

and using code that either easy for us to audit or well known and trusted (or both!) will be good enough. Not much else we can do!

Agreed.

I like this technique. I'm doing some rework on the algorithm selection at the moment (mostly just exploratory at this stage); that might be a good thing to include in the toolkit.

Good to have feedback on this. So I'll add the debugging code to my refactoring. I think I should wire it up by adding a --enable-debug-aes-gcm flag to configure and two module parameters, say icp_debug_gcm_known_good_impl and icp_debug_gcm_impl_to_test, indicating which implementation to compare against and which implementation to test.
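
To make that concrete, a sketch of how the proposed knobs might be exercised once implemented (purely illustrative; neither the configure flag nor these module parameters exist yet):

# hypothetical usage of the proposed debug interface - nothing here exists yet
./configure --enable-debug --enable-debug-aes-gcm && make && make install
echo avx >/sys/module/zfs/parameters/icp_debug_gcm_known_good_impl
echo sse4_1 >/sys/module/zfs/parameters/icp_debug_gcm_impl_to_test
# then generate heavy filesystem activity and watch the kernel log for mismatch reports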

BTW I'm also touching the algorithm selection. We should coordinate to avoid double work.

I'm inclined to trust the test vectors for general correctness.

Yeah, the tests you added are a great improvement, it's really good to have them. Thanks for doing that!

There's other classes of problem of course (timing, sidechannel) but we do the best we can.

Agreed.

Thanks for confirming on the cycle implementation.

@AttilaFueloep
Contributor Author

Just pushed a raw draft at https://github.com/AttilaFueloep/zfs/tree/zfs-refactor-gcm-simd. Passes minimal testing (create, write, and read files). At this point it needs cleanup and thorough testing.

@robn If you're curious you could have a peek.

@jamasi

jamasi commented Aug 23, 2025

#17058 seems to be merged now, so what would be the next steps to get this here (and possibly also #10846) going further? Is there already an updated branch/patchset on top of zfs-2.3.4 that could be used to test on older/small NAS hardware?

@AttilaFueloep
Contributor Author

@jamasi Still working on it. For the time being I rebased onto 2.3.4-release and master. Cherry-picking commit 4fdac4a or d248d8a respectively should give a working version. Please note that both are only lightly tested (built, ran fio and a scrub on a raidz pool). You'll want to make sure that you have a backup.
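
For anyone wanting to try this, a hedged sketch of the cherry-pick workflow (the remote and branch names are illustrative; the commits live in my fork linked earlier in this thread):

# sketch only - adjust remote and branch names to your setup
git clone https://github.com/openzfs/zfs && cd zfs
git remote add attila https://github.com/AttilaFueloep/zfs && git fetch attila
git checkout -b sse4_1-test zfs-2.3.4     # the 2.3.4 release tag; or use master
git cherry-pick 4fdac4a                    # use d248d8a when based on master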


Labels

Status: Work in Progress Not yet ready for general review


Development

Successfully merging this pull request may close these issues.

ICP: Implement Larger Encrypt/Decrypt Chunks for Non-AVX Processors