Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

aes: zeroize not fully removing key schedule from memory? #385

Closed
teythoon opened this issue Sep 22, 2023 · 34 comments · Fixed by #386
Closed

aes: zeroize not fully removing key schedule from memory? #385

teythoon opened this issue Sep 22, 2023 · 34 comments · Fixed by #386

Comments

@teythoon
Copy link
Contributor

When instantiating (and using) the AES cipher (it is very likely that others are affected as well, but I haven't checked), the key will be written to the stack across a large region of memory (across 3568 bytes in my test). The fact that that it is written over such a large region makes it likely that it will not be overwritten soon, if at all.

This is the result of a minimal test case that I have broken out of our test suite. I will attach the program to this bug report. The "screenshot" demonstrates the problem, and how to find the culprit of the leaks using rr, Mozillas recording and replaying debugger:

% cargo build --release
% rr record target/release/aes-secret-leak
rr: Saving execution to trace directory `/home/teythoon/.local/share/rr/aes-secret-leak-1'.
[stack]: 139264 bytes
7fff91462e90  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
7fff91462ea0  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
7fff91462ed0  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
7fff91462fb0  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
7fff91462fc0  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
7fff914630a0  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
7fff914630b0  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
7fff91463190  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
7fff914631a0  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
7fff91463280  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
7fff91463290  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
7fff914638b0  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
7fff914638c0  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
7fff91463b80  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
7fff91463b90  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
7fff91463c70  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
7fff91463c80  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
aes encrypt: secret leaked
% rr replay -d rust-gdb
GNU gdb (Debian 13.1-3) 13.1
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /home/teythoon/.local/share/rr/aes-secret-leak-1/mmap_hardlink_4_aes-secret-leak...
Really redefine built-in command "restart"? (y or n) [answered Y; input not from terminal]
Really redefine built-in command "jump"? (y or n) [answered Y; input not from terminal]
Remote debugging using 127.0.0.1:49687
Reading symbols from /lib64/ld-linux-x86-64.so.2...
Reading symbols from /usr/lib/debug/.build-id/a9/9db3715218b641780b04323e4ae5953d68a927.debug...
BFD: warning: system-supplied DSO at 0x6fffd000 has a section extending past end of file
0x00007fd38eba9a30 in _start () from /lib64/ld-linux-x86-64.so.2
(rr) watch *0x7fff91462e90 if *0x7fff91462e90 == 0x40404040
Hardware watchpoint 1: *0x7fff91462e90
(rr) c
Continuing.

Hardware watchpoint 1: *0x7fff91462e90

Old value = 0
New value = 1077952576
aes::ni::{impl#101}::new (key=0x5588ef9c9694 <.Lanon.063adcbe502cc43f97d580b43c4726e5.4>)
    at src/ni.rs:164
164                         round_keys: unsafe { $module::expand_key(key.as_ref()) },
(rr) bt
#0  aes::ni::{impl#101}::new (key=0x5588ef9c9694 <.Lanon.063adcbe502cc43f97d580b43c4726e5.4>) at src/ni.rs:164
#1  0x00005588ef9cbda3 in aes::ni::{impl#89}::new (key=0x5588ef9c9694 <.Lanon.063adcbe502cc43f97d580b43c4726e5.4>) at /home/teythoon/.local/stow/rustup/registry/src/index.crates.io-6f17d22bba15001f/aes-0.8.3/src/ni.rs:85
#2  aes::autodetect::{impl#73}::new (key=0x5588ef9c9694 <.Lanon.063adcbe502cc43f97d580b43c4726e5.4>) at /home/teythoon/.local/stow/rustup/registry/src/index.crates.io-6f17d22bba15001f/aes-0.8.3/src/autodetect.rs:96
#3  crypto_common::KeyInit::new_from_slice<aes::autodetect::Aes256> (key=&[u8](size=0)) at /home/teythoon/.local/stow/rustup/registry/src/index.crates.io-6f17d22bba15001f/crypto-common-0.1.6/src/lib.rs:133
#4  0x00005588ef9cab9a in aes_secret_leak::main () at src/main.rs:10
@teythoon
Copy link
Contributor Author

The reproducer: aes-secret-leak.zip

@newpavlov
Copy link
Member

Stack bleaching is outside of area of responsibility for algorithm implementation crates. Current version of Rust does not provide any good tools for calculating max stack usage for a given function, so we can not reliably do it on our level.

@tarcieri
Copy link
Member

When instantiating (and using) the AES cipher (it is very likely that others are affected as well, but I haven't checked), the key will be written to the stack across a large region of memory (across 3568 bytes in my test).

You can simply ask rustc the size of the type. No need to guess/use heuristics:

println!("Size of Aes256: {}", core::mem::size_of::<aes::Aes256>());

...prints "Size of Aes256: 960" on an x86_64 target.

Note that especially on x86, we optimize for performance over memory footprint. Each AES instance needs to store an entire key schedule with a 128-bit key for each round, expanded separately for both encryption and decryption. So yes, it's large, but that's how AES is commonly implemented on those targets in order to take advantage of hardware acceleration.

The fact that that it is written over such a large region makes it likely that it will not be overwritten soon, if at all.

If you enable the zeroize feature of the aes crate, the state will be zeroed out on drop. You shouldn't be depending on random stack frames to overwrite the state.

@tarcieri tarcieri closed this as not planned Won't fix, can't repro, duplicate, stale Sep 22, 2023
@teythoon
Copy link
Contributor Author

Have you even looked at the reproducer? I have zeroize enabled:

% grep zeroize aes-secret-leak/Cargo.toml
aes = { version = "0.8", features = ["zeroize"] }
cipher = { version = "0.4", features = ["std", "zeroize"] }

And in Sequoia, we heap-allocate the cipher context, and I'll update the reproducer to do the same. What we're seeing here is not covered by zeroize.

Updated reproducer: aes-secret-leak.zip

We test Nettle, OpenSSL, and Botan. None of those leave traces of the key laying around in either heap nor stack.

@tarcieri tarcieri reopened this Sep 22, 2023
@tarcieri
Copy link
Member

I’ll take a look when I have some time. Just to note the reproducer is far from minimal and includes a lot of superfluous, unrelated code which does not make it easy.

@tarcieri tarcieri changed the title AES (possibly other ciphers) spray the key across the stack aes: zeroize not fully removing key schedule from memory? Sep 22, 2023
@tarcieri
Copy link
Member

Also, can you give complete information about the target you were testing on, e.g. the target triple? The aes crate has many backends that vary target-by-target.

@teythoon
Copy link
Contributor Author

I have cut down the reproducer to only include the misbehaving code:

fn main() {
    use aes::cipher::KeyInit;
    let cipher = Box::new(aes::Aes256::new_from_slice(&NEEDLE).unwrap());
    drop(cipher);
}

const NEEDLE: &[u8] = b"@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@";

My triple is x86_64-unknown-linux-gnu.

@newpavlov
Copy link
Member

newpavlov commented Sep 22, 2023

@teythoon
This code first creates the cipher instance on stack and then moves it to heap. zeroize will erase the heap data, but will do nothing about the source data on stack. We can not do anything about it in Rust, since it does not have move constructors. After working with sensitive data you have to bleach the stack data separately.

@tarcieri
Copy link
Member

I'll also note we have fairly comprehensive tests of the zeroize behavior here:

https://github.com/RustCrypto/block-ciphers/blob/8d03900/aes/src/lib.rs#L161-L233

@tarcieri
Copy link
Member

tarcieri commented Sep 22, 2023

I tried running the example on a Debian 12 Docker container on an ARM64 Mac:

aes encrypt: passed

@teythoon I'm really going to need more information about your environment to figure this out Edit: whoops, missed you were testing on x86_64. Let me try a Docker container with x86_64 as the target arch.

@teythoon
Copy link
Contributor Author

We can not do anything about it in Rust, since it does not have move constructors.

Maybe there is a way to work around this limitation in the API by providing a way to (re)key after the object has been moved to the heap.

@tarcieri
Copy link
Member

It would probably make more sense to have an API to initialize a boxed instance of a cipher, which initializes it on the heap and then does the key schedule setup when the value is already living on the heap

@newpavlov
Copy link
Member

newpavlov commented Sep 22, 2023

It's quite hard to guarantee that Rust compiler will not place any sensitive data on stack and solutions probably will be quite fragile. I believe you should properly implement and use stack bleaching. It also will help with potential sensitive temporaries leaked to stack by compiler during encryption/decryption, especially when you will use higher-level algorithms like AEAD.

@tarcieri
Copy link
Member

tarcieri commented Sep 22, 2023

I reproduced the issue on x86. There do seem to be several copies of the key left on the stack.

The issue persists even if the code looks like this, so it's unrelated to boxing:

fn stuff() {
    use aes::cipher::KeyInit;
    aes::Aes256::new_from_slice(NEEDLE).unwrap();
}

fn main() -> Result<()> {
    stuff();
    scan("aes encrypt").unwrap();
    Ok(())
}

If I run with...

$ RUSTFLAGS="--cfg aes_force_soft" cargo run

...the problem does NOT occur. So it seems to possibly be something with the AES-NI backend, or failing that something with autodetection. (I now realize I didn't realize the aes_armv8 flag on ARM, so I should check that as well. Edit: it's also happening on ARM with aes_armv8 enabled, so potentially something in the autodetection code).

All that said, I agree with @newpavlov that clearing all sensitive data from the stack is out-of-scope for this crate. It's something best implemented, or the strategy best decided, by the toplevel binary, not a library crate.

Unless you see a bug in the zeroize tests, it would seem the long-term keys are being cleared, even if it's leaving data on the stack.

My disposition is to close this issue again, although if there's a trivial fix someone can spot, we can potentially implement it.

@teythoon
Copy link
Contributor Author

All that said, I agree with @newpavlov that clearing all sensitive data from the stack is out-of-scope for this crate. It's something best implemented, or the strategy best decided, by the toplevel binary, not a library crate.

We are also a library crate, and we are not going to shift the responsibility to our downstream users.

I do some stack clearing where my measurements have indicated leaks, but that is not trivial either. First, you need to figure out where clearing is necessary, then it is not obvious how much you need to clear. I think that the clearing should happen close to where the leaks happen, hence I turned to you.

Do you know of crates that do clear parts of their stack to look at for inspiration?

@newpavlov
Copy link
Member

newpavlov commented Sep 22, 2023

Proper stack bleaching requires to use an #[inline(never)] function which works with secrets, followed by execution of bleaching function. Amount of necessary bleaching is usually measured experimentally for each supported arch (including all possible backend variations). aes relies heavily on inlining to achieve performance when used as part of higher-level crates, so it's certainly a wrong place to do stack bleaching. We could do it in the AEAD crates, but we have a sizable surface API and bleaching is not exactly a zero-cost thing either. Ideally, it should be done once on a high enough level.

If you library is high-level enough, then it should be feasible for you to add stack bleaching to your functions. I would apply it for every function which works with sensitive data.

Do you know of crates that do clear parts of their stack to look at for inspiration?

I work with such code, but, unfortunately, can not share it and it's not a library, but an application. I think @tarcieri mentioned a while ago some Rust libraries which use stack bleaching, but I don't remember the names.

@tarcieri
Maybe we should add a stack bleaching crate to the utils repo?

@tarcieri
Copy link
Member

The general strategy I'd recommend is to use something like libfringe to allocate a new stack to run sensitive operations on, then upon completion zeroizing the space you allocated for that stack (though I haven't tried to implement this, and there may be newer alternatives to libfringe).

The only out-of-the-box solution I'm aware of is in the clear_on_drop crate, which provides a clear_stack function but leaves the number of pages to zeroize as an exercise to the caller, which isn't great.

Maybe we should add a stack bleaching crate to the utils repo?

That'd be cool

@teythoon
Copy link
Contributor Author

@newpavlov

This code first creates the cipher instance on stack and then moves it to heap.

I'm pretty sure that is not what's happening. Starting with -O1 rustc will optimize that copy away, constructing the object directly on the heap:

https://godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(filename:'1',fontScale:14,fontUsePx:'0',j:1,lang:rust,selection:(endColumn:2,endLineNumber:3,positionColumn:2,positionLineNumber:3,selectionStartColumn:2,selectionStartLineNumber:3,startColumn:2,startLineNumber:3),source:'pub+fn+boxme(num:+usize)+-%3E+Box%3Cusize%3E+%7B%0A++++Box::new(num.overflowing_add(23).0)%0A%7D'),l:'5',n:'0',o:'Rust+source+%231',t:'0')),k:50,l:'4',n:'0',o:'',s:0,t:'0'),(g:!((h:compiler,i:(compiler:r1720,deviceViewOpen:'1',filters:(b:'0',binary:'1',binaryObject:'1',commentOnly:'0',debugCalls:'1',demangle:'0',directives:'0',execute:'1',intel:'0',libraryCode:'0',trim:'1'),flagsViewOpen:'1',fontScale:14,fontUsePx:'0',j:1,lang:rust,libs:!(),options:'-C+opt-level%3D1',overrides:!(),selection:(endColumn:9,endLineNumber:15,positionColumn:9,positionLineNumber:15,selectionStartColumn:9,selectionStartLineNumber:15,startColumn:9,startLineNumber:15),source:1),l:'5',n:'0',o:'+rustc+1.72.0+(Editor+%231)',t:'0')),k:50,l:'4',n:'0',o:'',s:0,t:'0')),l:'2',n:'0',o:'',t:'0')),version:4

FWIW, I managed to get my stack bleaching workaround to work by bluntly zeroing more memory. Previously, I stopped my experiments at 4k which wasn't sufficient, and made me think there was a problem with my code or approach. Now I zero 8k, which gets the job done. So I guess the secret is leaked in a deep call stack.

I feel like this is one more reason to want the clearing to be done closer to where it is leaked, or preferably the buggy code identified and the leak plugged.

@tarcieri
Copy link
Member

tarcieri commented Sep 25, 2023

I confirmed it was unrelated to Box earlier.

The issue does NOT occur when aes_force_soft is enabled, and does not occur on ARMv8 targets when aes_armv8 is not explicitly enabled (which is effectively equivalent to aes_force_soft).

On x86, it occurs with RUSTFLAGS of --cfg target_feature="aes". The reproducer's claimed stack size is 8MB.

On ARMv8, it occurs with --cfg aes_armv8 --cfg target_feature="aes". The reproducer's claimed stack size is 135kB.

Using those flags should bypass CPU feature autodetection. So it would seem to be in the hardware backends. It's possible there's something wrong with the way SIMD registers are being zeroized, or there's just unexpected stack spilling in both backends.

Regardless, "fixing" this problem seems like it needs to happen on a backend-by-backend basis.

@newpavlov
Copy link
Member

@tarcieri
Can you check if this PR (I don't think it was published) has any effect?

@tarcieri
Copy link
Member

Given that the zeroize tests for the aes crate pass and are implemented specifically to check the hardware backends, I don't think it has anything to do with zeroize or long-term secrets. I would guess it has to do with transient secrets on the stack.

But I can give it a try.

@tarcieri
Copy link
Member

tarcieri commented Sep 25, 2023

Just tried it with zeroize v1.6.0 (https://github.com/RustCrypto/utils.git#895b4a51), no effect.

For posterity, here's what I'm getting:

x86_64 + AES-NI: RUSTFLAGS='--cfg target_feature="aes"'

[stack]: 8388608 bytes
4001893a10  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001893a20  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001893b10  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001893b20  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001893dc0  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001893dd0  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001893df0  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001893e20  00 00 00 00 40 40 40 40  40 40 40 40 40 40 40 40   ....!!!!!!!!!!!!
4001893e30  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001893e50  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001893e60  00 00 00 00 40 40 40 40  40 40 40 40 40 40 40 40   ....!!!!!!!!!!!!
4001893e70  00 00 00 00 00 00 00 00  40 40 40 40 40 40 40 40   ........!!!!!!!!
4001893e80  00 00 00 00 40 40 40 40  40 40 40 40 40 40 40 40   ....!!!!!!!!!!!!
4001893e90  40 40 40 40 00 00 00 00  40 40 40 40 40 40 40 40   !!!!....!!!!!!!!
4001893eb0  00 00 00 00 00 00 00 00  40 40 40 40 40 40 40 40   ........!!!!!!!!
4001893ed0  00 00 00 00 00 00 00 00  40 40 40 40 40 40 40 40   ........!!!!!!!!
4001893ee0  40 40 40 40 00 00 00 00  40 40 40 40 00 00 00 00   !!!!....!!!!....
4001893ef0  40 40 40 40 00 00 00 00  40 40 40 40 40 40 40 40   !!!!....!!!!!!!!
4001893f20  40 40 40 40 00 00 00 00  40 40 40 40 00 00 00 00   !!!!....!!!!....
4001893f80  00 00 00 00 40 40 40 40  40 40 40 40 40 40 40 40   ....!!!!!!!!!!!!
4001893f90  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001893fb0  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001893fc0  00 00 00 00 40 40 40 40  40 40 40 40 40 40 40 40   ....!!!!!!!!!!!!
4001893fd0  00 00 00 00 00 00 00 00  40 40 40 40 40 40 40 40   ........!!!!!!!!
4001893fe0  00 00 00 00 40 40 40 40  40 40 40 40 40 40 40 40   ....!!!!!!!!!!!!
4001893ff0  40 40 40 40 00 00 00 00  40 40 40 40 40 40 40 40   !!!!....!!!!!!!!
4001894010  00 00 00 00 00 00 00 00  40 40 40 40 40 40 40 40   ........!!!!!!!!
4001894030  00 00 00 00 00 00 00 00  40 40 40 40 40 40 40 40   ........!!!!!!!!
4001894040  40 40 40 40 00 00 00 00  40 40 40 40 00 00 00 00   !!!!....!!!!....
4001894050  40 40 40 40 00 00 00 00  40 40 40 40 40 40 40 40   !!!!....!!!!!!!!
4001894080  40 40 40 40 00 00 00 00  40 40 40 40 00 00 00 00   !!!!....!!!!....
4001894100  48 49 49 49 40 40 40 40  40 40 40 40 40 40 40 40   ....!!!!!!!!!!!!
4001894160  48 49 49 49 40 40 40 40  40 40 40 40 40 40 40 40   ....!!!!!!!!!!!!
4001894260  70 41 41 41 40 40 40 40  40 40 40 40 40 40 40 40   ....!!!!!!!!!!!!
40018942c0  70 41 41 41 40 40 40 40  40 40 40 40 40 40 40 40   ....!!!!!!!!!!!!
4001894e60  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001894e70  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001894f50  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001894f60  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001895110  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001895120  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001895200  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001895210  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001895320  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001895330  40 40 40 40 40 40 40 40  ed be 98 01 40 00 00 00   !!!!!!!!....!...
4001896e60  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001896e70  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001896f50  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001896f60  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001897230  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001897240  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001897320  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001897330  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
aes encrypt: secret leaked

ARMv8 Cryptography Extensions: RUSTFLAGS='--cfg target_feature="aes" --cfg aes_armv8''

[stack]: 135168 bytes
fffff874e4f0  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
fffff874e500  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
fffff874ea90  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
fffff874eaa0  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
fffff874eb80  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
fffff874eb90  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
fffff87507c0  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
fffff87507d0  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
fffff8750980  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
fffff8750990  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
fffff8750b90  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
fffff8750ba0  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
fffff8750d50  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
fffff8750d60  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
aes encrypt: secret leaked

Portable Backend: RUSTFLAGS='--cfg aes_force_soft'

aes encrypt: passed

@tarcieri
Copy link
Member

The ARMv8 backend appears to leave 14 copies on the stack, which would correspond to the number of rounds in AES-256.

@newpavlov
Copy link
Member

newpavlov commented Sep 25, 2023

@tarcieri

RUSTFLAGS='--cfg target_feature="aes"'

Shouldn't it be -C target-feature=+aes (note the plus sign)? -C target-feature=aes does not enable the target feature.

@tarcieri
Copy link
Member

@newpavlov as far as I can tell the two syntaxes are identical. Regardless, there is no effect on the results regardless of if I use + or not.

@tarcieri
Copy link
Member

tarcieri commented Sep 25, 2023

I tried a bit of an experiment on this branch trying to improve the ARMv8 backend:

https://github.com/RustCrypto/block-ciphers/tree/aes/armv8-keep-key-schedule-off-stack

Commit 99de856 made things slightly better (13 copies rather than 14, and one is only partial):

[stack]: 135168 bytes
ffffd99a1dd0  00 00 00 00 40 40 40 40  00 40 40 40 40 00 00 00   ....!!!!.!!!!...
ffffd99a22d0  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
ffffd99a22e0  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
ffffd99a23c0  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
ffffd99a23d0  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
ffffd99a4000  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
ffffd99a4010  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
ffffd99a41c0  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
ffffd99a41d0  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
ffffd99a43d0  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
ffffd99a43e0  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
ffffd99a4590  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
ffffd99a45a0  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!

Commit b5b6339 tried to sprinkle some inline(always) on things made them worse than they were originally:

[stack]: 135168 bytes
ffffed1ea710  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
ffffed1ea720  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
ffffed1ea8d0  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
ffffed1ea8e0  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
ffffed1ea8f0  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
ffffed1ea900  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
ffffed1eaab0  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
ffffed1eaac0  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
ffffed1eaaf0  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
ffffed1eab00  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
ffffed1eabe0  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
ffffed1eabf0  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
ffffed1ec9c0  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
ffffed1ec9d0  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
ffffed1ecb80  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
ffffed1ecb90  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
ffffed1ecd90  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
ffffed1ecda0  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
ffffed1ecf50  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
ffffed1ecf60  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!

That's 20 copies rather than the original 14, and way up from the original 13 just from adding some #[inline(always)] directives.

@newpavlov
Copy link
Member

newpavlov commented Sep 25, 2023

as far as I can tell the two syntaxes are identical.

No, they are not. As we can see here without + intrinsics do not get inlined. (I assume there is no difference between -C and --cfg)

@tarcieri
Copy link
Member

tarcieri commented Sep 25, 2023

Well, again, regardless, there is no effect on the output so it's a red herring.

x86_64 + AES-NI: RUSTFLAGS='-C target_feature=+aes'

[stack]: 8388608 bytes
4001893a10  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001893a20  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001893b10  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001893b20  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001893dc0  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001893dd0  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001893df0  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001893e20  00 00 00 00 40 40 40 40  40 40 40 40 40 40 40 40   ....!!!!!!!!!!!!
4001893e30  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001893e50  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001893e60  00 00 00 00 40 40 40 40  40 40 40 40 40 40 40 40   ....!!!!!!!!!!!!
4001893e70  00 00 00 00 00 00 00 00  40 40 40 40 40 40 40 40   ........!!!!!!!!
4001893e80  00 00 00 00 40 40 40 40  40 40 40 40 40 40 40 40   ....!!!!!!!!!!!!
4001893e90  40 40 40 40 00 00 00 00  40 40 40 40 40 40 40 40   !!!!....!!!!!!!!
4001893eb0  00 00 00 00 00 00 00 00  40 40 40 40 40 40 40 40   ........!!!!!!!!
4001893ed0  00 00 00 00 00 00 00 00  40 40 40 40 40 40 40 40   ........!!!!!!!!
4001893ee0  40 40 40 40 00 00 00 00  40 40 40 40 00 00 00 00   !!!!....!!!!....
4001893ef0  40 40 40 40 00 00 00 00  40 40 40 40 40 40 40 40   !!!!....!!!!!!!!
4001893f20  40 40 40 40 00 00 00 00  40 40 40 40 00 00 00 00   !!!!....!!!!....
4001893f80  00 00 00 00 40 40 40 40  40 40 40 40 40 40 40 40   ....!!!!!!!!!!!!
4001893f90  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001893fb0  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001893fc0  00 00 00 00 40 40 40 40  40 40 40 40 40 40 40 40   ....!!!!!!!!!!!!
4001893fd0  00 00 00 00 00 00 00 00  40 40 40 40 40 40 40 40   ........!!!!!!!!
4001893fe0  00 00 00 00 40 40 40 40  40 40 40 40 40 40 40 40   ....!!!!!!!!!!!!
4001893ff0  40 40 40 40 00 00 00 00  40 40 40 40 40 40 40 40   !!!!....!!!!!!!!
4001894010  00 00 00 00 00 00 00 00  40 40 40 40 40 40 40 40   ........!!!!!!!!
4001894030  00 00 00 00 00 00 00 00  40 40 40 40 40 40 40 40   ........!!!!!!!!
4001894040  40 40 40 40 00 00 00 00  40 40 40 40 00 00 00 00   !!!!....!!!!....
4001894050  40 40 40 40 00 00 00 00  40 40 40 40 40 40 40 40   !!!!....!!!!!!!!
4001894080  40 40 40 40 00 00 00 00  40 40 40 40 00 00 00 00   !!!!....!!!!....
4001894100  48 49 49 49 40 40 40 40  40 40 40 40 40 40 40 40   ....!!!!!!!!!!!!
4001894160  48 49 49 49 40 40 40 40  40 40 40 40 40 40 40 40   ....!!!!!!!!!!!!
4001894260  70 41 41 41 40 40 40 40  40 40 40 40 40 40 40 40   ....!!!!!!!!!!!!
40018942c0  70 41 41 41 40 40 40 40  40 40 40 40 40 40 40 40   ....!!!!!!!!!!!!
4001894e60  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001894e70  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001894f50  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001894f60  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001895110  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001895120  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001895200  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001895210  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001895320  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001895330  40 40 40 40 40 40 40 40  ed be 98 01 40 00 00 00   !!!!!!!!....!...
4001896e60  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001896e70  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001896f50  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001896f60  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001897230  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001897240  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001897320  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
4001897330  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
aes encrypt: secret leaked

ARMv8 Cryptography Extensions: RUSTFLAGS='-C target_feature=+aes --cfg aes_armv8'

[stack]: 135168 bytes
ffffc263bd90  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
ffffc263bda0  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
ffffc263c330  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
ffffc263c340  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
ffffc263c420  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
ffffc263c430  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
ffffc263e060  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
ffffc263e070  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
ffffc263e220  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
ffffc263e230  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
ffffc263e430  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
ffffc263e440  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
ffffc263e5f0  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
ffffc263e600  40 40 40 40 40 40 40 40  40 40 40 40 40 40 40 40   !!!!!!!!!!!!!!!!
aes encrypt: secret leaked

@newpavlov
Copy link
Member

newpavlov commented Sep 25, 2023

I think I've found why the issue happens on backends dependent on target feature. I started with the following simple code:

#[no_mangle]
pub unsafe fn aes_new(key: &[u8; 16]) -> aes::Aes128Enc {
    use aes::cipher::KeyInit;
    aes::Aes128Enc::new(key.into())
}

I compiled it as a shared library (with target-feature=+aes) and disassembled it. The relevant part looks like this:

0000000000001100 <aes_new>:
    1100:	53                   	push   rbx
    1101:	48 89 fb             	mov    rbx,rdi
    1104:	ff 15 ee 2e 00 00    	call   QWORD PTR [rip+0x2eee]   # 3ff8 <_GLOBAL_OFFSET_TABLE_+0x38>
    110a:	48 89 d8             	mov    rax,rbx
    110d:	5b                   	pop    rbx
    110e:	c3                   	ret    
    110f:	90                   	nop

0000000000001110 <_ZN61_$LT$aes..ni..Aes128Enc$u20$as$u20$crypto_common..KeyInit$GT$3new17h9aa25a0d33d4862bE>:
    1110:	48 89 f8             	mov    rax,rdi
    1113:	f3 0f 6f 06          	movdqu xmm0,XMMWORD PTR [rsi]
    1117:	66 0f 3a df c8 01    	aeskeygenassist xmm1,xmm0,0x1
    // key expansion impl

We can see that KeyInit::new function did not get inlined, despite it being marked with #[inline], LTO being enabled, and compilation being done with +aes. It looks like #[target_feature(enable = "aes")] somehow still prevents inlining of the function.

It does not cause stack spillage in this particular case, but by replacing Aes128Enc with Aes256, we get much worse results with a plenty of stack spilling:

Click to expand
0000000000001100 <aes_new>:
    1100:	41 56                	push   r14
    1102:	53                   	push   rbx
    1103:	48 81 ec e8 01 00 00 	sub    rsp,0x1e8
    110a:	48 89 fb             	mov    rbx,rdi
    110d:	4c 8d b4 24 f0 00 00 	lea    r14,[rsp+0xf0]
    1114:	00 
    1115:	4c 89 f7             	mov    rdi,r14
    1118:	ff 15 d2 2e 00 00    	call   QWORD PTR [rip+0x2ed2]        # 3ff0 <_GLOBAL_OFFSET_TABLE_+0x38>
    111e:	66 0f 38 db 84 24 00 	aesimc xmm0,XMMWORD PTR [rsp+0x100]
    1125:	01 00 00 
    1128:	66 0f 7f 84 24 e0 00 	movdqa XMMWORD PTR [rsp+0xe0],xmm0
    112f:	00 00 
    1131:	66 0f 38 db 84 24 10 	aesimc xmm0,XMMWORD PTR [rsp+0x110]
    1138:	01 00 00 
    113b:	66 0f 7f 84 24 d0 00 	movdqa XMMWORD PTR [rsp+0xd0],xmm0
    1142:	00 00 
    1144:	66 0f 38 db 84 24 20 	aesimc xmm0,XMMWORD PTR [rsp+0x120]
    114b:	01 00 00 
    114e:	66 0f 7f 84 24 c0 00 	movdqa XMMWORD PTR [rsp+0xc0],xmm0
    1155:	00 00 
    1157:	66 0f 38 db 84 24 30 	aesimc xmm0,XMMWORD PTR [rsp+0x130]
    115e:	01 00 00 
    1161:	66 0f 7f 84 24 b0 00 	movdqa XMMWORD PTR [rsp+0xb0],xmm0
    1168:	00 00 
    116a:	66 0f 38 db 84 24 40 	aesimc xmm0,XMMWORD PTR [rsp+0x140]
    1171:	01 00 00 
    1174:	66 0f 7f 84 24 a0 00 	movdqa XMMWORD PTR [rsp+0xa0],xmm0
    117b:	00 00 
    117d:	66 0f 38 db 84 24 50 	aesimc xmm0,XMMWORD PTR [rsp+0x150]
    1184:	01 00 00 
    1187:	66 0f 7f 84 24 90 00 	movdqa XMMWORD PTR [rsp+0x90],xmm0
    118e:	00 00 
    1190:	66 0f 38 db 84 24 60 	aesimc xmm0,XMMWORD PTR [rsp+0x160]
    1197:	01 00 00 
    119a:	66 0f 7f 84 24 80 00 	movdqa XMMWORD PTR [rsp+0x80],xmm0
    11a1:	00 00 
    11a3:	66 0f 38 db 84 24 70 	aesimc xmm0,XMMWORD PTR [rsp+0x170]
    11aa:	01 00 00 
    11ad:	66 0f 7f 44 24 70    	movdqa XMMWORD PTR [rsp+0x70],xmm0
    11b3:	66 0f 38 db 84 24 80 	aesimc xmm0,XMMWORD PTR [rsp+0x180]
    11ba:	01 00 00 
    11bd:	66 0f 7f 44 24 60    	movdqa XMMWORD PTR [rsp+0x60],xmm0
    11c3:	66 0f 38 db 84 24 90 	aesimc xmm0,XMMWORD PTR [rsp+0x190]
    11ca:	01 00 00 
    11cd:	66 0f 7f 44 24 50    	movdqa XMMWORD PTR [rsp+0x50],xmm0
    11d3:	66 0f 38 db 84 24 a0 	aesimc xmm0,XMMWORD PTR [rsp+0x1a0]
    11da:	01 00 00 
    11dd:	66 0f 7f 44 24 40    	movdqa XMMWORD PTR [rsp+0x40],xmm0
    11e3:	66 0f 38 db 84 24 b0 	aesimc xmm0,XMMWORD PTR [rsp+0x1b0]
    11ea:	01 00 00 
    11ed:	66 0f 7f 44 24 30    	movdqa XMMWORD PTR [rsp+0x30],xmm0
    11f3:	66 0f 38 db 84 24 c0 	aesimc xmm0,XMMWORD PTR [rsp+0x1c0]
    11fa:	01 00 00 
    11fd:	66 0f 7f 44 24 20    	movdqa XMMWORD PTR [rsp+0x20],xmm0
    1203:	0f 28 84 24 f0 00 00 	movaps xmm0,XMMWORD PTR [rsp+0xf0]
    120a:	00 
    120b:	0f 29 04 24          	movaps XMMWORD PTR [rsp],xmm0
    120f:	0f 28 84 24 d0 01 00 	movaps xmm0,XMMWORD PTR [rsp+0x1d0]
    1216:	00 
    1217:	0f 29 44 24 10       	movaps XMMWORD PTR [rsp+0x10],xmm0
    121c:	ba f0 00 00 00       	mov    edx,0xf0
    1221:	48 89 df             	mov    rdi,rbx
    1224:	4c 89 f6             	mov    rsi,r14
    1227:	ff 15 b3 2d 00 00    	call   QWORD PTR [rip+0x2db3]        # 3fe0 <memcpy@GLIBC_2.14>
    122d:	0f 28 04 24          	movaps xmm0,XMMWORD PTR [rsp]
    1231:	0f 29 83 f0 00 00 00 	movaps XMMWORD PTR [rbx+0xf0],xmm0
    1238:	0f 28 84 24 e0 00 00 	movaps xmm0,XMMWORD PTR [rsp+0xe0]
    123f:	00 
    1240:	0f 29 83 00 01 00 00 	movaps XMMWORD PTR [rbx+0x100],xmm0
    1247:	0f 28 84 24 d0 00 00 	movaps xmm0,XMMWORD PTR [rsp+0xd0]
    124e:	00 
    124f:	0f 29 83 10 01 00 00 	movaps XMMWORD PTR [rbx+0x110],xmm0
    1256:	0f 28 84 24 c0 00 00 	movaps xmm0,XMMWORD PTR [rsp+0xc0]
    125d:	00 
    125e:	0f 29 83 20 01 00 00 	movaps XMMWORD PTR [rbx+0x120],xmm0
    1265:	0f 28 84 24 b0 00 00 	movaps xmm0,XMMWORD PTR [rsp+0xb0]
    126c:	00 
    126d:	0f 29 83 30 01 00 00 	movaps XMMWORD PTR [rbx+0x130],xmm0
    1274:	0f 28 84 24 a0 00 00 	movaps xmm0,XMMWORD PTR [rsp+0xa0]
    127b:	00 
    127c:	0f 29 83 40 01 00 00 	movaps XMMWORD PTR [rbx+0x140],xmm0
    1283:	0f 28 84 24 90 00 00 	movaps xmm0,XMMWORD PTR [rsp+0x90]
    128a:	00 
    128b:	0f 29 83 50 01 00 00 	movaps XMMWORD PTR [rbx+0x150],xmm0
    1292:	0f 28 84 24 80 00 00 	movaps xmm0,XMMWORD PTR [rsp+0x80]
    1299:	00 
    129a:	0f 29 83 60 01 00 00 	movaps XMMWORD PTR [rbx+0x160],xmm0
    12a1:	0f 28 44 24 70       	movaps xmm0,XMMWORD PTR [rsp+0x70]
    12a6:	0f 29 83 70 01 00 00 	movaps XMMWORD PTR [rbx+0x170],xmm0
    12ad:	0f 28 44 24 60       	movaps xmm0,XMMWORD PTR [rsp+0x60]
    12b2:	0f 29 83 80 01 00 00 	movaps XMMWORD PTR [rbx+0x180],xmm0
    12b9:	0f 28 44 24 50       	movaps xmm0,XMMWORD PTR [rsp+0x50]
    12be:	0f 29 83 90 01 00 00 	movaps XMMWORD PTR [rbx+0x190],xmm0
    12c5:	0f 28 44 24 40       	movaps xmm0,XMMWORD PTR [rsp+0x40]
    12ca:	0f 29 83 a0 01 00 00 	movaps XMMWORD PTR [rbx+0x1a0],xmm0
    12d1:	0f 28 44 24 30       	movaps xmm0,XMMWORD PTR [rsp+0x30]
    12d6:	0f 29 83 b0 01 00 00 	movaps XMMWORD PTR [rbx+0x1b0],xmm0
    12dd:	0f 28 44 24 20       	movaps xmm0,XMMWORD PTR [rsp+0x20]
    12e2:	0f 29 83 c0 01 00 00 	movaps XMMWORD PTR [rbx+0x1c0],xmm0
    12e9:	0f 28 44 24 10       	movaps xmm0,XMMWORD PTR [rsp+0x10]
    12ee:	0f 29 83 d0 01 00 00 	movaps XMMWORD PTR [rbx+0x1d0],xmm0
    12f5:	48 89 d8             	mov    rax,rbx
    12f8:	48 81 c4 e8 01 00 00 	add    rsp,0x1e8
    12ff:	5b                   	pop    rbx
    1300:	41 5e                	pop    r14
    1302:	c3                   	ret    
    1303:	66 2e 0f 1f 84 00 00 	nop    WORD PTR cs:[rax+rax*1+0x0]
    130a:	00 00 00 
    130d:	0f 1f 00             	nop    DWORD PTR [rax]

0000000000001310 <_ZN61_$LT$aes..ni..Aes256Enc$u20$as$u20$crypto_common..KeyInit$GT$3new17h5a792345766b66d5E>:
    1310:	48 89 f8             	mov    rax,rdi
    1313:	f3 0f 6f 06          	movdqu xmm0,XMMWORD PTR [rsi]
    1317:	f3 0f 6f 4e 10       	movdqu xmm1,XMMWORD PTR [rsi+0x10]
    131c:	66 0f 3a df d1 01    	aeskeygenassist xmm2,xmm1,0x1
    // key expansion impl

It's interesting that the keys inversion (aesimc code) does get inlined, but not the expansion.

@newpavlov
Copy link
Member

newpavlov commented Sep 25, 2023

It looks like we have missed a bunch of #[inline]s for backend KeyInit impls and LTO does not help for some reason.

#386 should fix it, at least it works for the Aes256 code above. @tarcieri @teythoon can you verify it with your code?

@tarcieri
Copy link
Member

I noted in #386 it does not fix it, though I'd agree it seems to be related to inlining

@newpavlov
Copy link
Member

newpavlov commented Sep 25, 2023

Yeah, encryption of blocks still unnecessarily copies keys to the stack. We can see it with this function:

#[no_mangle]
pub unsafe fn aes_enc(cipher: &aes::Aes256, blocks: &mut [aes::Block]) {
    use aes::cipher::BlockEncrypt;
    cipher.encrypt_blocks(blocks);
}

It generates the following assembly:

Click to expand
aes_enc:
	sub	rsp, 120
	mov	eax, edx
	and	eax, 7
	cmp	rdx, 8
	jb	.LBB0_3
	mov	rcx, rdx
	shr	rcx, 3
	movaps	xmm0, xmmword ptr [rdi]
	movaps	xmmword ptr [rsp + 96], xmm0
	movaps	xmm0, xmmword ptr [rdi + 16]
	movaps	xmmword ptr [rsp + 80], xmm0
	movaps	xmm0, xmmword ptr [rdi + 32]
	movaps	xmmword ptr [rsp + 64], xmm0
	movaps	xmm0, xmmword ptr [rdi + 48]
	movaps	xmmword ptr [rsp], xmm0
	movaps	xmm0, xmmword ptr [rdi + 64]
	movaps	xmmword ptr [rsp - 16], xmm0
	movaps	xmm0, xmmword ptr [rdi + 80]
	movaps	xmmword ptr [rsp - 64], xmm0
	movaps	xmm0, xmmword ptr [rdi + 96]
	movaps	xmmword ptr [rsp - 32], xmm0
	movaps	xmm0, xmmword ptr [rdi + 112]
	movaps	xmmword ptr [rsp + 48], xmm0
	movaps	xmm0, xmmword ptr [rdi + 128]
	movaps	xmmword ptr [rsp - 80], xmm0
	movaps	xmm0, xmmword ptr [rdi + 144]
	movaps	xmmword ptr [rsp - 48], xmm0
	movaps	xmm0, xmmword ptr [rdi + 160]
	movaps	xmmword ptr [rsp - 96], xmm0
	movaps	xmm0, xmmword ptr [rdi + 176]
	movaps	xmmword ptr [rsp - 112], xmm0
	movaps	xmm0, xmmword ptr [rdi + 192]
	movaps	xmmword ptr [rsp - 128], xmm0
	movaps	xmm0, xmmword ptr [rdi + 208]
	movaps	xmmword ptr [rsp + 32], xmm0
	movdqa	xmm0, xmmword ptr [rdi + 224]
	movdqa	xmmword ptr [rsp + 16], xmm0
	lea	r8, [rsi + 112]
	movdqa	xmm12, xmmword ptr [rsp + 32]
	movdqa	xmm7, xmmword ptr [rsp + 16]
	.p2align	4, 0x90
.LBB0_2:
	movdqu	xmm5, xmmword ptr [r8 - 112]
	movdqu	xmm6, xmmword ptr [r8 - 96]
	movdqu	xmm3, xmmword ptr [r8 - 80]
	movdqu	xmm4, xmmword ptr [r8 - 64]
	movdqu	xmm1, xmmword ptr [r8 - 48]
	movdqu	xmm2, xmmword ptr [r8 - 32]
	movdqu	xmm15, xmmword ptr [r8 - 16]
	movdqu	xmm8, xmmword ptr [r8]
	movdqa	xmm9, xmmword ptr [rsp + 96]
	pxor	xmm5, xmm9
	pxor	xmm6, xmm9
	movdqa	xmm10, xmmword ptr [rsp + 80]
	aesenc	xmm5, xmm10
	aesenc	xmm6, xmm10
	movdqa	xmm0, xmmword ptr [rsp + 64]
	aesenc	xmm5, xmm0
	aesenc	xmm6, xmm0
	movdqa	xmm11, xmm0
	movdqa	xmm0, xmmword ptr [rsp]
	aesenc	xmm5, xmm0
	aesenc	xmm6, xmm0
	movdqa	xmm0, xmmword ptr [rsp - 16]
	aesenc	xmm5, xmm0
	aesenc	xmm6, xmm0
	movdqa	xmm0, xmmword ptr [rsp - 64]
	aesenc	xmm5, xmm0
	aesenc	xmm6, xmm0
	movdqa	xmm0, xmmword ptr [rsp - 32]
	aesenc	xmm5, xmm0
	aesenc	xmm6, xmm0
	movdqa	xmm13, xmm0
	movdqa	xmm14, xmmword ptr [rsp + 48]
	aesenc	xmm5, xmm14
	aesenc	xmm6, xmm14
	movdqa	xmm0, xmmword ptr [rsp - 80]
	aesenc	xmm5, xmm0
	aesenc	xmm6, xmm0
	movdqa	xmm0, xmmword ptr [rsp - 48]
	aesenc	xmm5, xmm0
	aesenc	xmm6, xmm0
	movdqa	xmm0, xmmword ptr [rsp - 96]
	aesenc	xmm5, xmm0
	aesenc	xmm6, xmm0
	movdqa	xmm0, xmmword ptr [rsp - 112]
	aesenc	xmm5, xmm0
	aesenc	xmm6, xmm0
	movdqa	xmm0, xmmword ptr [rsp - 128]
	aesenc	xmm5, xmm0
	aesenc	xmm6, xmm0
	aesenc	xmm5, xmm12
	aesenc	xmm6, xmm12
	aesenclast	xmm5, xmm7
	aesenclast	xmm6, xmm7
	movdqu	xmmword ptr [r8 - 112], xmm5
	movdqu	xmmword ptr [r8 - 96], xmm6
	movdqa	xmm5, xmm9
	pxor	xmm3, xmm9
	pxor	xmm4, xmm9
	movdqa	xmm6, xmm10
	aesenc	xmm3, xmm10
	aesenc	xmm4, xmm10
	movdqa	xmm9, xmm11
	aesenc	xmm3, xmm11
	aesenc	xmm4, xmm11
	movdqa	xmm10, xmmword ptr [rsp]
	aesenc	xmm3, xmm10
	aesenc	xmm4, xmm10
	movdqa	xmm11, xmmword ptr [rsp - 16]
	aesenc	xmm3, xmm11
	aesenc	xmm4, xmm11
	movdqa	xmm0, xmmword ptr [rsp - 64]
	aesenc	xmm3, xmm0
	aesenc	xmm4, xmm0
	aesenc	xmm3, xmm13
	aesenc	xmm4, xmm13
	movdqa	xmm13, xmm14
	aesenc	xmm3, xmm14
	aesenc	xmm4, xmm14
	movdqa	xmm0, xmmword ptr [rsp - 80]
	aesenc	xmm3, xmm0
	aesenc	xmm4, xmm0
	movdqa	xmm14, xmmword ptr [rsp - 48]
	aesenc	xmm3, xmm14
	aesenc	xmm4, xmm14
	movdqa	xmm0, xmmword ptr [rsp - 96]
	aesenc	xmm3, xmm0
	aesenc	xmm4, xmm0
	movdqa	xmm0, xmmword ptr [rsp - 112]
	aesenc	xmm3, xmm0
	aesenc	xmm4, xmm0
	movdqa	xmm0, xmmword ptr [rsp - 128]
	aesenc	xmm3, xmm0
	aesenc	xmm4, xmm0
	aesenc	xmm3, xmm12
	aesenc	xmm4, xmm12
	aesenclast	xmm3, xmm7
	aesenclast	xmm4, xmm7
	movdqu	xmmword ptr [r8 - 80], xmm3
	movdqu	xmmword ptr [r8 - 64], xmm4
	pxor	xmm1, xmm5
	pxor	xmm2, xmm5
	aesenc	xmm1, xmm6
	aesenc	xmm2, xmm6
	aesenc	xmm1, xmm9
	aesenc	xmm2, xmm9
	aesenc	xmm1, xmm10
	aesenc	xmm2, xmm10
	aesenc	xmm1, xmm11
	aesenc	xmm2, xmm11
	movdqa	xmm3, xmmword ptr [rsp - 64]
	aesenc	xmm1, xmm3
	aesenc	xmm2, xmm3
	movdqa	xmm4, xmmword ptr [rsp - 32]
	aesenc	xmm1, xmm4
	aesenc	xmm2, xmm4
	aesenc	xmm1, xmm13
	aesenc	xmm2, xmm13
	movdqa	xmm0, xmmword ptr [rsp - 80]
	aesenc	xmm1, xmm0
	aesenc	xmm2, xmm0
	aesenc	xmm1, xmm14
	aesenc	xmm2, xmm14
	movdqa	xmm0, xmmword ptr [rsp - 96]
	aesenc	xmm1, xmm0
	aesenc	xmm2, xmm0
	movdqa	xmm0, xmmword ptr [rsp - 112]
	aesenc	xmm1, xmm0
	aesenc	xmm2, xmm0
	movdqa	xmm0, xmmword ptr [rsp - 128]
	aesenc	xmm1, xmm0
	aesenc	xmm2, xmm0
	aesenc	xmm1, xmm12
	aesenc	xmm2, xmm12
	aesenclast	xmm1, xmm7
	aesenclast	xmm2, xmm7
	movdqu	xmmword ptr [r8 - 48], xmm1
	movdqu	xmmword ptr [r8 - 32], xmm2
	pxor	xmm15, xmm5
	pxor	xmm8, xmm5
	aesenc	xmm15, xmm6
	aesenc	xmm8, xmm6
	aesenc	xmm15, xmm9
	aesenc	xmm8, xmm9
	aesenc	xmm15, xmm10
	aesenc	xmm8, xmm10
	aesenc	xmm15, xmm11
	aesenc	xmm8, xmm11
	aesenc	xmm15, xmm3
	aesenc	xmm8, xmm3
	movdqa	xmm1, xmm4
	aesenc	xmm15, xmm4
	aesenc	xmm8, xmm4
	aesenc	xmm15, xmm13
	aesenc	xmm8, xmm13
	movdqa	xmm0, xmmword ptr [rsp - 80]
	aesenc	xmm15, xmm0
	aesenc	xmm8, xmm0
	aesenc	xmm15, xmm14
	aesenc	xmm8, xmm14
	movdqa	xmm0, xmmword ptr [rsp - 96]
	aesenc	xmm15, xmm0
	aesenc	xmm8, xmm0
	movdqa	xmm0, xmmword ptr [rsp - 112]
	aesenc	xmm15, xmm0
	aesenc	xmm8, xmm0
	movdqa	xmm0, xmmword ptr [rsp - 128]
	aesenc	xmm15, xmm0
	aesenc	xmm8, xmm0
	aesenc	xmm15, xmm12
	aesenc	xmm8, xmm12
	aesenclast	xmm15, xmm7
	aesenclast	xmm8, xmm7
	movdqu	xmmword ptr [r8 - 16], xmm15
	movdqu	xmmword ptr [r8], xmm8
	sub	r8, -128
	dec	rcx
	jne	.LBB0_2
.LBB0_3:
	test	rax, rax
	je	.LBB0_11
	movabs	rcx, 1152921504606846968
	and	rdx, rcx
	shl	rdx, 4
	add	rsi, rdx
	movdqa	xmm14, xmmword ptr [rdi]
	movdqa	xmm13, xmmword ptr [rdi + 16]
	movdqa	xmm12, xmmword ptr [rdi + 32]
	movdqa	xmm11, xmmword ptr [rdi + 48]
	movdqa	xmm10, xmmword ptr [rdi + 64]
	movdqa	xmm9, xmmword ptr [rdi + 80]
	movdqa	xmm8, xmmword ptr [rdi + 96]
	movdqa	xmm7, xmmword ptr [rdi + 112]
	movdqa	xmm6, xmmword ptr [rdi + 128]
	movdqa	xmm5, xmmword ptr [rdi + 144]
	movdqa	xmm4, xmmword ptr [rdi + 160]
	movdqa	xmm3, xmmword ptr [rdi + 176]
	movdqa	xmm2, xmmword ptr [rdi + 192]
	movdqa	xmm1, xmmword ptr [rdi + 208]
	movdqa	xmm0, xmmword ptr [rdi + 224]
	movdqu	xmm15, xmmword ptr [rsi]
	pxor	xmm15, xmm14
	aesenc	xmm15, xmm13
	aesenc	xmm15, xmm12
	aesenc	xmm15, xmm11
	aesenc	xmm15, xmm10
	aesenc	xmm15, xmm9
	aesenc	xmm15, xmm8
	aesenc	xmm15, xmm7
	aesenc	xmm15, xmm6
	aesenc	xmm15, xmm5
	aesenc	xmm15, xmm4
	aesenc	xmm15, xmm3
	aesenc	xmm15, xmm2
	aesenc	xmm15, xmm1
	aesenclast	xmm15, xmm0
	movdqu	xmmword ptr [rsi], xmm15
	cmp	eax, 1
	je	.LBB0_11
	movdqu	xmm15, xmmword ptr [rsi + 16]
	pxor	xmm15, xmm14
	aesenc	xmm15, xmm13
	aesenc	xmm15, xmm12
	aesenc	xmm15, xmm11
	aesenc	xmm15, xmm10
	aesenc	xmm15, xmm9
	aesenc	xmm15, xmm8
	aesenc	xmm15, xmm7
	aesenc	xmm15, xmm6
	aesenc	xmm15, xmm5
	aesenc	xmm15, xmm4
	aesenc	xmm15, xmm3
	aesenc	xmm15, xmm2
	aesenc	xmm15, xmm1
	aesenclast	xmm15, xmm0
	movdqu	xmmword ptr [rsi + 16], xmm15
	cmp	eax, 2
	je	.LBB0_11
	movdqu	xmm15, xmmword ptr [rsi + 32]
	pxor	xmm15, xmm14
	aesenc	xmm15, xmm13
	aesenc	xmm15, xmm12
	aesenc	xmm15, xmm11
	aesenc	xmm15, xmm10
	aesenc	xmm15, xmm9
	aesenc	xmm15, xmm8
	aesenc	xmm15, xmm7
	aesenc	xmm15, xmm6
	aesenc	xmm15, xmm5
	aesenc	xmm15, xmm4
	aesenc	xmm15, xmm3
	aesenc	xmm15, xmm2
	aesenc	xmm15, xmm1
	aesenclast	xmm15, xmm0
	movdqu	xmmword ptr [rsi + 32], xmm15
	cmp	eax, 3
	je	.LBB0_11
	movdqu	xmm15, xmmword ptr [rsi + 48]
	pxor	xmm15, xmm14
	aesenc	xmm15, xmm13
	aesenc	xmm15, xmm12
	aesenc	xmm15, xmm11
	aesenc	xmm15, xmm10
	aesenc	xmm15, xmm9
	aesenc	xmm15, xmm8
	aesenc	xmm15, xmm7
	aesenc	xmm15, xmm6
	aesenc	xmm15, xmm5
	aesenc	xmm15, xmm4
	aesenc	xmm15, xmm3
	aesenc	xmm15, xmm2
	aesenc	xmm15, xmm1
	aesenclast	xmm15, xmm0
	movdqu	xmmword ptr [rsi + 48], xmm15
	cmp	eax, 4
	je	.LBB0_11
	movdqu	xmm15, xmmword ptr [rsi + 64]
	pxor	xmm15, xmm14
	aesenc	xmm15, xmm13
	aesenc	xmm15, xmm12
	aesenc	xmm15, xmm11
	aesenc	xmm15, xmm10
	aesenc	xmm15, xmm9
	aesenc	xmm15, xmm8
	aesenc	xmm15, xmm7
	aesenc	xmm15, xmm6
	aesenc	xmm15, xmm5
	aesenc	xmm15, xmm4
	aesenc	xmm15, xmm3
	aesenc	xmm15, xmm2
	aesenc	xmm15, xmm1
	aesenclast	xmm15, xmm0
	movdqu	xmmword ptr [rsi + 64], xmm15
	cmp	eax, 5
	je	.LBB0_11
	movdqu	xmm15, xmmword ptr [rsi + 80]
	pxor	xmm15, xmm14
	aesenc	xmm15, xmm13
	aesenc	xmm15, xmm12
	aesenc	xmm15, xmm11
	aesenc	xmm15, xmm10
	aesenc	xmm15, xmm9
	aesenc	xmm15, xmm8
	aesenc	xmm15, xmm7
	aesenc	xmm15, xmm6
	aesenc	xmm15, xmm5
	aesenc	xmm15, xmm4
	aesenc	xmm15, xmm3
	aesenc	xmm15, xmm2
	aesenc	xmm15, xmm1
	aesenclast	xmm15, xmm0
	movdqu	xmmword ptr [rsi + 80], xmm15
	cmp	eax, 6
	je	.LBB0_11
	movdqu	xmm15, xmmword ptr [rsi + 96]
	pxor	xmm15, xmm14
	aesenc	xmm15, xmm13
	aesenc	xmm15, xmm12
	aesenc	xmm15, xmm11
	aesenc	xmm15, xmm10
	aesenc	xmm15, xmm9
	aesenc	xmm15, xmm8
	aesenc	xmm15, xmm7
	aesenc	xmm15, xmm6
	aesenc	xmm15, xmm5
	aesenc	xmm15, xmm4
	aesenc	xmm15, xmm3
	aesenc	xmm15, xmm2
	aesenc	xmm15, xmm1
	aesenclast	xmm15, xmm0
	movdqu	xmmword ptr [rsi + 96], xmm15
.LBB0_11:
	add	rsp, 120
	ret

The function is fully inlined, so the issue may be related to rust-lang/rust#88930.

Since it looks like a compiler quirk and stack bleaching is out of scope for cipher implementation crates (though we should minimize it if possible), I think we can close this issue?

@tarcieri
Copy link
Member

@newpavlov might be worth updating that issue with this case?

Otherwise yeah, I tried to get rid of the only obvious actual stack usage in https://github.com/RustCrypto/block-ciphers/tree/aes/armv8-keep-key-schedule-off-stack (which I might still keep working on) and that didn't seem to help, so I don't think there's anything else we can do short of an upstream fix from rustc.

@newpavlov
Copy link
Member

might be worth updating that issue with this case?

Done.

I tried to get rid of the only obvious actual stack usage in aes/armv8-keep-key-schedule-off-stack

It could be worth to construct similar disassembled examples as above to see whether these changes have effect or not.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants