
Implement sha256 #187

Closed

wants to merge 7 commits into briansmith:master from gsoltis:sha256

Conversation

@gsoltis commented May 25, 2016

block_data_order algorithm ported from https://en.wikipedia.org/wiki/SHA-2

Let me know what you think! Aiming at #62

Is there a benchmark test already? Or does one need to be written?
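For readers following along, the Wikipedia pseudocode referenced above boils down to roughly the following. This is a minimal single-block sketch for illustration (compress and sha256_short are made-up names; this is not the PR's block_data_order):

```rust
// Minimal single-shot SHA-256 for messages under 56 bytes, ported from the
// Wikipedia pseudocode. Illustrative only; not the PR's actual code.
const K: [u32; 64] = [
    0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
    0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
    0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
    0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
    0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
    0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
    0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
    0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2,
];
const H0: [u32; 8] = [
    0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a, 0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19,
];

// Process one 64-byte block (the role block_data_order plays in ring).
fn compress(state: &mut [u32; 8], block: &[u8; 64]) {
    // Message schedule: 16 words from the block, expanded to 64.
    let mut w = [0u32; 64];
    for i in 0..16 {
        w[i] = u32::from_be_bytes([block[4 * i], block[4 * i + 1], block[4 * i + 2], block[4 * i + 3]]);
    }
    for i in 16..64 {
        let s0 = w[i - 15].rotate_right(7) ^ w[i - 15].rotate_right(18) ^ (w[i - 15] >> 3);
        let s1 = w[i - 2].rotate_right(17) ^ w[i - 2].rotate_right(19) ^ (w[i - 2] >> 10);
        w[i] = w[i - 16].wrapping_add(s0).wrapping_add(w[i - 7]).wrapping_add(s1);
    }
    // 64 compression rounds.
    let [mut a, mut b, mut c, mut d, mut e, mut f, mut g, mut h] = *state;
    for i in 0..64 {
        let t1 = h
            .wrapping_add(e.rotate_right(6) ^ e.rotate_right(11) ^ e.rotate_right(25)) // Sigma1(e)
            .wrapping_add((e & f) ^ (!e & g)) // Ch(e, f, g)
            .wrapping_add(K[i])
            .wrapping_add(w[i]);
        let t2 = (a.rotate_right(2) ^ a.rotate_right(13) ^ a.rotate_right(22)) // Sigma0(a)
            .wrapping_add((a & b) ^ (a & c) ^ (b & c)); // Maj(a, b, c)
        h = g; g = f; f = e; e = d.wrapping_add(t1);
        d = c; c = b; b = a; a = t1.wrapping_add(t2);
    }
    for (s, v) in state.iter_mut().zip([a, b, c, d, e, f, g, h]) {
        *s = s.wrapping_add(v);
    }
}

// Pad and hash a short message (fits in one block after padding).
fn sha256_short(msg: &[u8]) -> String {
    assert!(msg.len() < 56);
    let mut block = [0u8; 64];
    block[..msg.len()].copy_from_slice(msg);
    block[msg.len()] = 0x80; // append the '1' bit
    block[56..].copy_from_slice(&((msg.len() as u64) * 8).to_be_bytes()); // bit length
    let mut state = H0;
    compress(&mut state, &block);
    state.iter().map(|v| format!("{v:08x}")).collect()
}
```

sha256_short handles only messages that fit in one padded block; a full implementation iterates compress over every 64-byte chunk.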

@briansmith (Owner)

Thanks!

The benchmarks are in https://github.com/briansmith/crypto-bench. I suggest comparing your implementation to, in particular, the one in rust-crypto. IMO, unless/until we have a clear plan for getting implementations that are faster than rust-crypto (which is pretty fast), we should consider just using rust-crypto for the rare cases where we can't use the asm code.

Let me know the benchmark result. I'll find some time to look at the code soon.

@coveralls commented May 25, 2016

Coverage increased (+0.1%) to 81.007% when pulling 6f64368 on gsoltis:sha256 into 51e6a18 on briansmith:master.

@gsoltis (Author) commented May 25, 2016

It looks like rust-crypto is about 2x faster than my implementation. So are you thinking to just declare a dependency on that library (and I guess remove the existing sha1 implementation)? Or just leave the asm implementation (which, if I understand, is generated by the perl scripts from crypto/sha/asm) ?

@briansmith (Owner) commented May 25, 2016

> It looks like rust-crypto is about 2x faster than my implementation. So are you thinking to just declare a dependency on that library (and I guess remove the existing sha1 implementation)? Or just leave the asm implementation (which, if I understand, is generated by the perl scripts from crypto/sha/asm)?

Our goal is to always use the best implementation for each target.

There are four ASM implementations of each digest function: x86, x86-64, 32-bit ARM, and AArch64. Actually, there are more, because within each implementation there are build-time switches (e.g. x86 has an SSE2 build-time switch) and/or runtime switches (based on CPU feature detection).

It might be the case that, for some target, the rust-crypto implementation is faster than the ASM implementation for that target. In that case, we should just delete the ASM code for that target and use rust-crypto instead. That has to be determined on a target-by-target basis.

There are some targets where OpenSSL has asm implementations but we don't have the asm implementation in ring. The primary case is MIPS. If the OpenSSL MIPS asm code is faster than rust-crypto (on MIPS), then we should import the MIPS asm code from OpenSSL.

There are some targets for which we don't have asm code at all, e.g. WebAssembly. For these, ring currently just doesn't build. These targets are the ones where having a fast Rust implementation is important (OTOH, how important are these targets, really?).

As far as how to integrate a fast Rust implementation goes, for the cases where we don't already have an ASM implementation faster than what we can do in Rust, we have various options:

  1. We can add a new implementation in ring, like you proposed here.
  2. We can copy/paste the block_data_order function (which probably has a different name) from another Rust implementation, respecting its license. (In practice, this would mean choosing an ISC- or MIT- or BSD- licensed library to share the code with.)
  3. We can add a dependency on another crypto library (only for targets where we don't implement everything internally.)

The difficulty with option 3 is that usually crypto libraries don't expose the block_data_order type function that would be ideal for ring. Thus, option 1 seems like the best option to me. However, option 1 is a lot of work, because it is pretty hard to make a new implementation faster than every other Rust implementation. Thus, option 2 might be the most pragmatic choice.

What do you think?

@briansmith (Owner)

BTW, if you're interested in making your SHA-256 implementation 2x as fast, I'd be down to help you with that too. Just know it will probably take some serious effort.

@gsoltis (Author) commented May 26, 2016

Option 2 does sound the most pragmatic.

I should mention, my own goals are to learn Rust, gain experience with native-code toolchains, and get more hands-on experience with the SSL protocol. So, I'm not super-confident that I'll be able to get a 2x speedup out of my implementation, but I'll probably take a stab at at least some optimization, for the experience if nothing else.

If there are other more straightforward issues you think would be good to tackle, I'd be happy to spend time on those as well, especially if the solution here is to use something that exists already. Otherwise, as I said, I'll probably at least take a first pass at trying to optimize my implementation a little.

@gsoltis (Author) commented May 31, 2016

Quick update: I tried some limited manual unrolling in block_data_order_safe, from 64 iterations to 32, and that seemed to make it roughly 10% faster. So, this seems like a worthwhile avenue to pursue, and as you pointed out, seems to be what other implementations do.
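The technique being described, processing two (or more) iterations per loop pass, can be sketched on a toy round function. This is purely illustrative: mix_naive and mix_unrolled are hypothetical names, not code from this PR.

```rust
// Hypothetical toy round function, shown two ways. Not the PR's SHA-256 code.
fn mix_naive(data: &[u32; 64]) -> u32 {
    let mut acc = 0u32;
    for i in 0..64u32 {
        acc = acc.rotate_left(5).wrapping_add(data[i as usize] ^ i);
    }
    acc
}

// Same arithmetic, two rounds per loop pass: half the loop-control overhead,
// and a bigger straight-line body for the compiler to schedule.
fn mix_unrolled(data: &[u32; 64]) -> u32 {
    let mut acc = 0u32;
    let mut i = 0u32;
    for pair in data.chunks_exact(2) {
        acc = acc.rotate_left(5).wrapping_add(pair[0] ^ i);
        acc = acc.rotate_left(5).wrapping_add(pair[1] ^ (i + 1));
        i += 2;
    }
    acc
}
```

Both functions compute the same value; the unrolled one just halves the loop iterations, which is the kind of ~10% win reported above.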

@gsoltis (Author) commented Jun 2, 2016

Unrolled all 64 rounds, in chunks of 4 via macro. A little less readable now, but significantly faster. Still not quite as fast as rust-crypto. I'll probably look at unrolling the w calculation next.

@coveralls commented Jun 2, 2016

Changes unknown when pulling 99ab5d3fc0e6c56397ff028fc3b8deb6c709760e on gsoltis:sha256 into an unknown commit on briansmith:master.

@gsoltis (Author) commented Jun 3, 2016

Looks like the build issues are a problem with nightly? Stable is passing...

@briansmith (Owner) commented Jun 3, 2016

Awesome!

If you have time, could you please split your change into two:

  1. Replacing Wrapping with direct use of the wrapping_xxx() functions.
  2. Your unrolling.

I'd like to keep each set of optimizations in a separate commit so we can measure the impact each optimization has. That will make it easier to guide future work (on other digest algorithms).
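For clarity, the two styles being separated into commits look like this (hypothetical helper names; both compute the same modulo-2^32 sum):

```rust
// Style 1: the std::num::Wrapping newtype, where + wraps by definition.
use std::num::Wrapping;

fn sum_wrapping_type(a: u32, b: u32) -> u32 {
    (Wrapping(a) + Wrapping(b)).0
}

// Style 2: direct use of the wrapping_xxx() methods on the primitive type.
fn sum_wrapping_fn(a: u32, b: u32) -> u32 {
    a.wrapping_add(b)
}
```

The generated code should be identical; separating the commits is about measuring whether the source-level change alone moves the benchmark.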

Now, your w state shouldn't need 64 entries. Almost all high-performance implementations use a 16-element w. This smaller state makes register allocation, and sometimes auto-vectorization, easier.
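A 16-element w is typically kept as a circular buffer indexed with t % 16, since round t only ever reads w[t-2], w[t-7], w[t-15], and w[t-16]. A sketch (illustrative names, not the PR's code), checking the circular form against the straightforward 64-entry schedule:

```rust
// SHA-256 message-schedule sigma functions (FIPS 180-4).
fn small_sigma0(x: u32) -> u32 { x.rotate_right(7) ^ x.rotate_right(18) ^ (x >> 3) }
fn small_sigma1(x: u32) -> u32 { x.rotate_right(17) ^ x.rotate_right(19) ^ (x >> 10) }

// Straightforward version: materialize all 64 expanded words.
fn schedule_full(m: &[u32; 16]) -> [u32; 64] {
    let mut w = [0u32; 64];
    w[..16].copy_from_slice(m);
    for t in 16..64 {
        w[t] = w[t - 16]
            .wrapping_add(small_sigma0(w[t - 15]))
            .wrapping_add(w[t - 7])
            .wrapping_add(small_sigma1(w[t - 2]));
    }
    w
}

// Circular version: 16 live slots; slot t % 16 holds w[t-16] and is
// overwritten with w[t] exactly when w[t-16] is no longer needed.
fn schedule_circular(m: &[u32; 16]) -> [u32; 64] {
    let mut w = *m;
    let mut out = [0u32; 64]; // recorded only so the two versions can be compared
    out[..16].copy_from_slice(m);
    for t in 16..64 {
        let wt = w[t & 15]                                 // w[t-16]
            .wrapping_add(small_sigma0(w[(t + 1) & 15]))   // w[t-15]
            .wrapping_add(w[(t + 9) & 15])                 // w[t-7]
            .wrapping_add(small_sigma1(w[(t + 14) & 15])); // w[t-2]
        w[t & 15] = wt;
        out[t] = wt;
    }
    out
}
```

In a real implementation the out array disappears: each round consumes w[t & 15] immediately, so the whole schedule fits in 16 registers.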

I see that you defined your macro inside the function, but that isn't how macros work in Rust, at least as far as the namespacing goes. I recommend that you put the macro at the top level of the file.

@gsoltis (Author) commented Jun 5, 2016

Reworked the commits a little, and unrolled w, using 16 slots, as suggested. I haven't cleaned up the macro placement yet. But the benchmark is a lot closer now! The exact numbers vary a bit, but on my ~6-year-old MacBook Pro, rust-crypto is somewhere around 10-15% faster. Much better than 2x faster.

@coveralls

Coverage increased (+0.2%) to 81.358% when pulling 2fcbba7 on gsoltis:sha256 into a25f86f on briansmith:master.

@briansmith (Owner)

Awesome. Being within 10-15% is quite reasonable. I'm pretty burned out today after working on a bunch of other stuff, but I'll look at this as soon as I get a chance.

@samscott89 (Contributor)

I've got another optimisation which puts us within about 3% of rust-crypto (on my machine, at least): a simple trick to optimise the Sigma functions. I'll PR it to this branch.

I get 213 MB/s -> 229 MB/s versus 235 MB/s for rust-crypto (on sha256::_1000).

Incidentally, if rust-crypto made the same change they would see a 20MB/s speedup...
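For context, here are the FIPS 180-4 big-Sigma functions, plus one well-known rewrite that chains the rotations. This is only a guess at the kind of trick meant here, since the actual diff isn't shown; the two forms are algebraically identical.

```rust
// Direct FIPS 180-4 definitions: three independent rotations XOR'd together.
fn big_sigma0(x: u32) -> u32 { x.rotate_right(2) ^ x.rotate_right(13) ^ x.rotate_right(22) }
fn big_sigma1(x: u32) -> u32 { x.rotate_right(6) ^ x.rotate_right(11) ^ x.rotate_right(25) }

// Nested-rotation forms. Expanding the rotations shows they are identical:
//   rotr2(x ^ rotr11(x ^ rotr9(x)))  == rotr2(x) ^ rotr13(x) ^ rotr22(x)
//   rotr6(x ^ rotr5(x ^ rotr14(x)))  == rotr6(x) ^ rotr11(x) ^ rotr25(x)
// (e.g. rotr11 applied to x ^ rotr9(x) yields rotr11(x) ^ rotr20(x), and
// the outer rotr2 then shifts all three terms by 2 more.)
fn big_sigma0_nested(x: u32) -> u32 {
    (x ^ (x ^ x.rotate_right(9)).rotate_right(11)).rotate_right(2)
}
fn big_sigma1_nested(x: u32) -> u32 {
    (x ^ (x ^ x.rotate_right(14)).rotate_right(5)).rotate_right(6)
}
```

Which form is faster depends on the target: the nested form serializes the rotations into a dependency chain, while the direct form has more instruction-level parallelism but more live values competing for registers.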

Optimised Sigma functions.
@gsoltis (Author) commented Jun 13, 2016

Ok, I've merged @samscott89's PR, which does make a difference.

I looked at moving the macros out to the top level, but it appears that since they use local variables in the function, I need to modify them to take the identifiers for those variables. I'll add a commit to show you what I mean. If you have a preference on style, we can take that commit or not.

@coveralls commented Jun 13, 2016

Changes unknown when pulling 3d9fba2 on gsoltis:sha256 into an unknown commit on briansmith:master.

@gsoltis (Author) commented Jun 13, 2016

Re: the last commit, note that $block now has to be passed in, whereas before it wasn't required. I don't have a strong opinion on the style either way, but if we go with moving the macros out, I'll have to rework the one remaining one.
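The issue with hoisting the macros can be illustrated like this: once a macro_rules! macro lives at the top level, any local variable it reads or writes must be passed in explicitly as a parameter (round and demo are hypothetical names, not the PR's macros):

```rust
// A top-level macro can't silently capture the caller's locals the way a
// textual macro would; `acc` and `block` must be handed in at each call site.
macro_rules! round {
    ($acc:ident, $block:expr, $i:expr) => {
        $acc = $acc.rotate_left(1).wrapping_add($block[$i]);
    };
}

fn demo(block: &[u32; 4]) -> u32 {
    let mut acc = 0u32;
    round!(acc, block, 0);
    round!(acc, block, 1);
    round!(acc, block, 2);
    round!(acc, block, 3);
    acc
}
```

The $acc:ident parameter is what makes the mutation visible to the caller; an expr fragment would not allow assignment to a caller-owned binding in the same way.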

@coveralls commented Jun 13, 2016

Changes unknown when pulling e3ec283 on gsoltis:sha256 into an unknown commit on briansmith:master.

@briansmith (Owner)

Please see my latest comments in #199. Should I be reviewing the commits in this PR, or the ones in #199?

@briansmith (Owner)

I am actually kind of unsure about what happened with this PR in relation to #199 but it seems like we all decided to work in #199. Sorry about any confusion! I am as confused as anybody, I think.

Regardless, thank you very much for the PR!

@briansmith briansmith closed this Nov 17, 2016
4 participants