Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat/c32 optimizations #3075

Merged
merged 17 commits into from
Apr 20, 2022
Merged

Feat/c32 optimizations #3075

merged 17 commits into from
Apr 20, 2022

Conversation

zone117x
Copy link
Member

@zone117x zone117x commented Mar 15, 2022

Various optimizations in the C32 address decoding paths. Use a decoding table (the inverse of the encoding table/alphabet) in the decoding hot path (previously an O(n) utf8 char scan was used). Reduce some allocs & copies.

@zone117x zone117x requested review from kantai and jcnelson March 15, 2022 22:46
Comment on lines -73 to -77
let norm_str: String = input_str
.to_uppercase()
.replace("O", "0")
.replace("L", "1")
.replace("I", "1");
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reductions of the char scans and allocs here is one of the primary optimizations in this PR

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where did the normalization code go?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This normalization fn was no longer called once the encoding logic was changed to use a faster C32 alphabet lookup using the C32_CHARACTERS_MAP table. If you take a look at the code there, you'll see where the lowercase and special chars are included. I'm happy to add it back (will be dead code though).

let iter_c32_digits_opts: Vec<Option<usize>> = c32_normalize(input_str)
.chars()
.rev()
.map(|x| C32_CHARACTERS.find(x))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactor of this find from O(n) to O(1) in the hot path is one of the primary optimizations in this PR

/// table[pair.0.to_ascii_lowercase() as usize] = i;
/// }
/// ```
const C32_CHARACTERS_MAP: [i8; 128] = [
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I experimented with trying both lazy_static and const fn for this, neither seemed right, so I took the table result and inlined it. The rust code doc has compilation validation support (at least in the tools I'm using). So seems like this is the best approach. But I'm open to alternatives!

@codecov
Copy link

codecov bot commented Mar 15, 2022

Codecov Report

Merging #3075 (b1061ef) into develop (1e0b80b) will increase coverage by 0.03%.
The diff coverage is 91.59%.

@@             Coverage Diff             @@
##           develop    #3075      +/-   ##
===========================================
+ Coverage    83.72%   83.76%   +0.03%     
===========================================
  Files          259      260       +1     
  Lines       200777   200932     +155     
===========================================
+ Hits        168093   168302     +209     
+ Misses       32684    32630      -54     
Impacted Files Coverage Δ
stacks-common/src/address/mod.rs 81.17% <ø> (ø)
stacks-common/src/address/c32_old.rs 89.77% <89.77%> (ø)
stacks-common/src/address/c32.rs 97.55% <97.82%> (-0.34%) ⬇️
stacks-common/src/util/hash.rs 82.47% <100.00%> (-1.11%) ⬇️
clarity/src/vm/ast/traits_resolver/mod.rs 94.44% <0.00%> (-3.97%) ⬇️
src/net/neighbors.rs 36.84% <0.00%> (-1.59%) ⬇️
src/net/poll.rs 60.91% <0.00%> (-1.53%) ⬇️
src/net/dns.rs 90.25% <0.00%> (-0.86%) ⬇️
clarity/src/vm/database/key_value_wrapper.rs 96.51% <0.00%> (-0.32%) ⬇️
... and 21 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1e0b80b...b1061ef. Read the comment docs.

Copy link
Member

@jcnelson jcnelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, this is all consensus critical code you're touching. We use the bare addresses in the chainstate MARF, which affects the block state root hashes. At a minimum, I need to see unit tests demonstrating that this new code behaves exactly the same way as the old code.

@zone117x
Copy link
Member Author

The existing tests cover these paths pretty extensively, e.g. one for the normalization code:
https://github.com/stacks-network/stacks-blockchain/blob/03107740595e5359f2381dfb97b5ee153125c44a/stacks-common/src/address/c32.rs#L383-L404

I'm happy to add more, but not really seeing any gaps in testing, would appreciate some pointers here.

@jcnelson
Copy link
Member

A couple things:

  • Because this PR touches consensus-critical code to implement better performance, I'd like to know what exactly the performance impact is. If it's not very big in practice, then I'd rather not merge this PR. It's better to have slightly-slower but known-good code in the consensus-critical paths than fast code that we're less than 100% certain of. Like, does this PR improve the node bootup speed? Does it improve p2p or RPC responsiveness? Does it improve the miner speed?

  • If this PR yields a significant improvement in performance, then the test I'd like to see added is to do the following:

    • Keep the old code around, but decorated as #[cfg(test)].
    • Generate lots and lots of addresses with the old code, and verify that they decode via the new code exactly the same way as they would in the old code.
    • Generate lots and lots of addresses with the new code, and verify that they decode via the old code exactly the same way as they would in the new code.

    We do something like this with the MARF PR I have open, for example -- it keeps all the old code paths around, and verifies that the new code paths generate exactly the same trie root hashes as the old code paths given the same key/value insert schedule.

@zone117x
Copy link
Member Author

Sounds reasonable, keeping the old code around would also help with generating benchmarks. Btw, this code is in a library "stacks-common" which is used in projects outside the stacks-node, so I wouldn't be surprised if there was a perf gain that doesn't help with the stacks-node operations you've listed, but could be significant for other use cases. Seems like these would still be good things to upstream (assuming the consensus-critical change requirements you've outlined are met).

@zone117x
Copy link
Member Author

@jcnelson how does this look? Added tests to generate addrs with new and old code, decode both with new and old code, verifying they all match. I ran it with 500,000 iterations of random input, and tests pass (takes a few minutes). I committed a version of the test that performs 5000 iterations.

Then added a c32_bench which shows around a 4x perf increase:

C32 Decoding/Legacy/S01JQF9J50WT0EZSN30CH1KQSGR9H0PPHNCHPH0AG                                                                             
                        time:   [1.9297 us 1.9448 us 1.9707 us]
C32 Decoding/Updated/S01JQF9J50WT0EZSN30CH1KQSGR9H0PPHNCHPH0AG                                                                              
                        time:   [499.92 ns 502.18 ns 505.40 ns]
C32 Decoding/Legacy/S62TSHKFJ7BRQFTFJ6BN0645E85NPYZ43G0AD7D71                                                                             
                        time:   [1.9495 us 1.9565 us 1.9633 us]
C32 Decoding/Updated/S62TSHKFJ7BRQFTFJ6BN0645E85NPYZ43G0AD7D71                                                                            
                        time:   [497.96 ns 499.04 ns 500.14 ns]
C32 Decoding/Legacy/S83AHFYJ817AYTZKKVSMR1XTEG31QQDF6BQP5PN70                                                                             
                        time:   [1.9602 us 1.9659 us 1.9712 us]
C32 Decoding/Updated/S83AHFYJ817AYTZKKVSMR1XTEG31QQDF6BQP5PN70                                                                              
                        time:   [498.07 ns 499.26 ns 500.42 ns]
C32 Decoding/Legacy/S1TXJX2Z31R3XQW7VT15XR01JYHAW2P4056GTGSG                                                                             
                        time:   [1.9368 us 1.9417 us 1.9466 us]
C32 Decoding/Updated/S1TXJX2Z31R3XQW7VT15XR01JYHAW2P4056GTGSG                                                                            
                        time:   [496.47 ns 497.27 ns 498.07 ns]
C32 Decoding/Legacy/SG3KAEP0GX797TPWTRX3BB1DKSQ0NPPWVDHY3TW09                                                                             
                        time:   [1.9559 us 1.9619 us 1.9679 us]
C32 Decoding/Updated/SG3KAEP0GX797TPWTRX3BB1DKSQ0NPPWVDHY3TW09                                                                             
                        time:   [498.71 ns 499.79 ns 500.87 ns]

image

.map(|ref_x| *ref_x)
.rev()
.filter_map(|x| C32_CHARACTERS_MAP.get(*x as usize))
.filter_map(|v| u8::try_from(*v).ok())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code seems to depend on u8::try_from(*v) returning None if *v is -1. Can this code be more explicit about this? Such as:

.filter_map(|v| {
   if *v < 0 {
       None
   }
   else {
       // infallible due to the above check
       Some(*v as u8)
    }
})

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

u8::try_from is very explicit about conversions: negative inputs will always error, as will any other input outside the range of u8. x as u8 is a much more convoluted conversion: 1000 as u8 returns 232 (it neither is a saturating conversion, nor a panic).

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=c687a7113c8758f439bfd06a96a4957c

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may actually be better to use an array of Option(u8) rather than -1i8 for the "null" sentinel values (this is what the b58 table does). I'll give that a try when I address the other comments and see if there's any unexpected perf issues with that approach.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the table from [i8; 128] to [Option<u8>; 128] which gets rid of this integer conversion, and the benchmark shows a slight perf increase.

Copy link
Member

@jcnelson jcnelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM, but please address my comments before merging. Thanks!

Copy link
Member

@kantai kantai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, the only comments I think need to be addressed are the cfg(test) decoration for the old code, and including the normalization rules in the docstring.

@zone117x zone117x force-pushed the feat/c32-optimizations branch from ab381fc to 5fcdb07 Compare April 19, 2022 17:31
@zone117x zone117x merged commit d2f31ca into develop Apr 20, 2022
@zone117x zone117x deleted the feat/c32-optimizations branch April 20, 2022 12:41
@blockstack-devops
Copy link
Contributor

This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@stacks-network stacks-network locked as resolved and limited conversation to collaborators Nov 18, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants