Commit 25a0f56

p2qrh: updates stack_element_size_performance_tests.adoc based on
performance and failure tests.
1 parent 0793e2d commit 25a0f56

1 file changed

+219
-38
lines changed

bip-0360/ref-impl/rust/docs/stack_element_size_performance_tests.adoc

Lines changed: 219 additions & 38 deletions
@@ -1,3 +1,9 @@
1+
:scrollbar:
2+
:data-uri:
3+
:toc2:
4+
:linkattrs:
5+
6+
= Stack Element Size Performance Tests
17

28
:numbered:
39

@@ -9,7 +15,7 @@ Subsequently, there is a need to determine the performance and stability related
915

1016
== Regression Tests
1117

12-
The following regression tests failed with `MAX_SCRIPT_ELEMENT_SIZE` set to 8000
18+
The following regression tests failed with `MAX_SCRIPT_ELEMENT_SIZE` set to 8000:
1319

1420
[cols="1,1,2"]
1521
|===
@@ -19,28 +25,63 @@ The following regression tests failed with `MAX_SCRIPT_ELEMENT_SIZE` set to 8000
1925
|rpc_createmultisig.py | lines 75-75 | No exception raised: redeemScript exceeds size limit: 684 > 520"
2026
|===
2127

22-
== Performance Analysis
28+
**Analysis**
29+
30+
These 4 tests explicitly check for a maximum stack element size of 520 bytes and are therefore expected to fail when the limit is raised to 8000 bytes.
31+
Consequently, no further action is needed.
32+
33+
34+
== Performance Tests
35+
36+
=== OP_SHA256
2337

38+
The following Bitcoin script is used to conduct this performance test:
2439

40+
-----
41+
<pre-image array> OP_SHA256 OP_DROP OP_1
42+
-----
43+
44+
When executed, this script adds the pre-image array of arbitrary data to the stack.
45+
Immediately after, `OP_SHA256` pops the pre-image array off the stack, hashes it, and pushes the 32-byte digest onto the stack.
46+
The `OP_DROP` operation then removes the digest from the stack, and `OP_1` pushes the value 1 so that the script evaluates to true.
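
The stack behaviour described above can be sketched with a toy interpreter. This is only an illustration in Python, not the Bitcoin Core script engine: the function name `run_script` is hypothetical, and script limits, flags, and witness rules are ignored.

```python
import hashlib


def run_script(preimage: bytes) -> list[bytes]:
    """Evaluate `<pre-image array> OP_SHA256 OP_DROP OP_1` on a toy stack."""
    stack: list[bytes] = []
    stack.append(preimage)                      # push <pre-image array>
    top = stack.pop()                           # OP_SHA256: pop the pre-image ...
    stack.append(hashlib.sha256(top).digest())  # ... and push its 32-byte digest
    stack.pop()                                 # OP_DROP: discard the digest
    stack.append(b"\x01")                       # OP_1: push the value 1 (true)
    return stack


# An 8000-byte pre-image of zero bytes stands in for the "arbitrary data".
final_stack = run_script(bytes(8000))
print(final_stack)
```

Whatever the pre-image size, the final stack holds only the value pushed by `OP_1`, so the benchmark cost is dominated by the push and the hash.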
2547

2648

27-
=== Results Summary
49+
==== Results Summary
2850

51+
[cols="3,1,1,1,1,1,1,1,1,1", options="header"]
2952
|===
30-
| Preimage Bytes | ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total|
31-
| 1 | 637.28 | 1,569,165.30 | 0.3% | 8,736.00 | 1,338.55 | 6.526 | 832.00 | 0.0% | 5.53 |
32-
| 64 | 794.85 | 1,258,098.46 | 0.4% | 11,107.00 | 1,666.92 | 6.663 | 827.00 | 0.0% | 5.61 |
33-
| 65 | 831.95 | 1,201,996.30 | 0.5% | 11,144.00 | 1,698.26 | 6.562 | 841.00 | 0.0% | 5.53 |
34-
| 7500 | 19,172.67 | 52,157.58 | 0.5% | 285,220.02 | 40,203.63 | 7.094 | 1,636.02 | 0.4% | 5.49 |
53+
| Stack Element Size (Bytes) | ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op |miss% | total
54+
| 1 | 637.28 | 1,569,165.30 | 0.3% | 8,736.00 | 1,338.55 | 6.526 | 832.00 | 0.0% | 5.53
55+
| 64 | 794.85 | 1,258,098.46 | 0.4% | 11,107.00 | 1,666.92 | 6.663 | 827.00 | 0.0% | 5.61
56+
| 65 | 831.95 | 1,201,996.30 | 0.5% | 11,144.00 | 1,698.26 | 6.562 | 841.00 | 0.0% | 5.53
57+
| 100 | 794.82 | 1,258,139.86 | 0.2% | 11,139.00 | 1,673.89 | 6.655 | 837.00 | 0.0% | 5.50
58+
| 520 | 1,946.67 | 513,697.88 | 0.2% | 27,681.00 | 4,095.57 | 6.759 | 885.00 | 0.0% | 5.50
59+
| 8000 | 20,958.63 | 47,713.05 | 2.7% | 304,137.02 | 43,789.86 | 6.945 | 1,689.02 | 0.4% | 5.63
3560
|===
3661

37-
==== key
62+
**Analysis**
63+
64+
The following observations are made from the performance test:
65+
66+
. **Performance Scaling**: The increase from 520 bytes to 8000 bytes (15.4x size increase) results in approximately 10.8x performance degradation (20,959 ns/op vs 1,947 ns/op).
67+
This represents sub-linear scaling, which suggests the implementation handles large data efficiently.
3868

39-
[cols="1,6"]
69+
. **Instruction Count Scaling**: Instructions per operation increase from 27,681 to 304,137 (11.0x increase), closely matching the performance degradation, indicating the bottleneck is primarily computational rather than memory bandwidth.
70+
71+
. **Throughput Impact**: Operations per second decrease from 513,698 op/s to 47,713 op/s, representing a 10.8x reduction in throughput.
72+
73+
. **Pipeline Efficiency**: The IPC (Instructions Per Cycle) remains relatively stable (6.759 to 6.945), suggesting good CPU pipeline utilization despite the increased data size.
74+
75+
. **Branch Prediction**: The branch misprediction rate increases slightly (0.0% to 0.4%), indicating minimal impact on branch prediction accuracy.
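
The scaling ratios can be recomputed directly from the 520-byte and 8000-byte rows of the Results Summary table. The sketch below is only a convenience for checking the arithmetic; the dictionary layout and key names are illustrative.

```python
# Values copied from the 520-byte and 8000-byte rows of the Results Summary table.
results = {
    520: {"ns_per_op": 1946.67, "ops_per_s": 513697.88, "ins_per_op": 27681.00},
    8000: {"ns_per_op": 20958.63, "ops_per_s": 47713.05, "ins_per_op": 304137.02},
}

size_ratio = 8000 / 520
time_ratio = results[8000]["ns_per_op"] / results[520]["ns_per_op"]
ins_ratio = results[8000]["ins_per_op"] / results[520]["ins_per_op"]
throughput_ratio = results[520]["ops_per_s"] / results[8000]["ops_per_s"]

print(f"size:         {size_ratio:.1f}x")
print(f"time (ns/op): {time_ratio:.1f}x")
print(f"instructions: {ins_ratio:.1f}x")
print(f"throughput:   {throughput_ratio:.1f}x lower")
```

Time per operation grows by a smaller factor than the element size, which is what "sub-linear" refers to here.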
76+
77+
78+
**key**
79+
80+
[cols="1,6", options="header"]
4081
|===
4182
| Metric | Description
42-
| ns/op | Nanoseconds per operation - the average time it takes to complete one benchmark iteration, measured in billionths of a second
43-
| op/s | Operations per second - the throughput rate showing how many benchmark iterations can be completed per second
83+
| ns/op | Nanoseconds per operation - average time it takes to complete one benchmark iteration
84+
| op/s | Operations per second - throughput rate showing how many benchmark iterations can be completed per second
4485
| err% | Error percentage - statistical margin of error in the measurement, indicating the reliability of the benchmark results
4586
| ins/op | Instructions per operation - the number of CPU instructions executed for each benchmark iteration
4687
| cyc/op | CPU cycles per operation - the number of CPU clock cycles consumed for each benchmark iteration
@@ -50,25 +91,25 @@ The following regression tests failed with `MAX_SCRIPT_ELEMENT_SIZE` set to 8000
5091
| total | Total benchmark time - the total wall-clock time spent running the entire benchmark in seconds
5192
|===
5293

53-
=== Detailed Results
94+
==== Detailed Results
5495

55-
==== Stack Element Size = 1 Byte
96+
===== Stack Element Size = 1 Byte
5697

57-
|==
58-
| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark
59-
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
60-
| 637.28 | 1,569,165.30 | 0.3% | 8,736.00 | 1,338.55 | 6.526 | 832.00 | 0.0% | 5.53 | `VerifyP2WSHBench`
61-
|==
98+
[cols="2,1,1,1,1,1,1,1,1", options="header"]
99+
|===
100+
|ns/op |op/s |err% |ins/op |cyc/op |IPC |bra/op |miss% |total
101+
|637.28 |1,569,165.30 |0.3% |8,736.00 |1,338.55 |6.526 |832.00 |0.0% |5.53
102+
|===
62103

63-
==== Stack Element Size = 64 Bytes
104+
===== Stack Element Size = 64 Bytes
64105

65-
|==
66-
| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark
67-
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
68-
| 794.85 | 1,258,098.46 | 0.4% | 11,107.00 | 1,666.92 | 6.663 | 827.00 | 0.0% | 5.61 | `VerifyP2WSHBench`
69-
|==
106+
[cols="2,1,1,1,1,1,1,1,1", options="header"]
107+
|===
108+
| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total
109+
| 794.85 | 1,258,098.46 | 0.4% | 11,107.00 | 1,666.92 | 6.663 | 827.00 | 0.0% | 5.61
110+
|===
70111

71-
===== Explanation
112+
====== Explanation
72113

73114
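Although 64 bytes is exactly the SHA256 block size, SHA256 always appends padding (a `0x80` byte, zero fill, and a 64-bit message length), so a 64-byte pre-image needs a second compression block and the ins/op increases from 8,736 to 11,107 instructions. Here's why: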
74115

@@ -111,27 +152,116 @@ Even though 64 bytes doesn't require padding (it's exactly one SHA256 block), th
111152
The increase from 8,736 to 11,107 instructions (~27% increase) is consistent with one additional SHA256 compression round: once the mandatory padding is included, a 1-byte pre-image fits in a single 64-byte block, while a 64-byte pre-image requires two.
112153
This is a good example of how seemingly small changes in input size can affect the underlying implementation's code paths and optimization strategies.
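
The block counts can be checked with a short calculation. Under the standard SHA-256 padding rule (FIPS 180-4), a `0x80` byte and an 8-byte length field are appended and the message is zero-filled to a multiple of 64 bytes. The per-block instruction estimate below is a rough linear model derived from the 1-byte and 64-byte measurements, not a figure reported by the benchmark; the names are illustrative.

```python
def sha256_blocks(message_len: int) -> int:
    """Number of 64-byte compression blocks SHA-256 processes for a message.

    Padding adds one 0x80 byte plus an 8-byte length field, then zero-fills
    up to the next multiple of 64 bytes.
    """
    return (message_len + 1 + 8 + 63) // 64


# Measured ins/op for the 1-byte (one block) and 64-byte (two blocks) runs.
INS_ONE_BLOCK = 8736
INS_TWO_BLOCKS = 11107
INS_PER_EXTRA_BLOCK = INS_TWO_BLOCKS - INS_ONE_BLOCK  # rough model: 2,371


def estimated_ins(message_len: int) -> int:
    """Linear estimate: one-block cost plus a fixed cost per extra block."""
    return INS_ONE_BLOCK + (sha256_blocks(message_len) - 1) * INS_PER_EXTRA_BLOCK


for size in (1, 64, 65, 100, 520, 8000):
    print(size, sha256_blocks(size), estimated_ins(size))
```

Under this model 64, 65, and 100 bytes all take two blocks, which matches their nearly identical ins/op, and the estimates for 520 and 8000 bytes (27,704 and 305,111) land close to the measured 27,681 and 304,137.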
113154

114-
==== Stack Element Size = 65 Bytes
155+
===== Stack Element Size = 65 Bytes
115156

116157
1 byte more than the SHA256 _block_ size
117158

118-
|==
119-
| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark
120-
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
121-
| 831.95 | 1,201,996.30 | 0.5% | 11,144.00 | 1,698.26 | 6.562 | 841.00 | 0.0% | 5.53 | `VerifyP2WSHBench`
122-
|==
159+
[cols="2,1,1,1,1,1,1,1,1", options="header"]
160+
|===
161+
|ns/op |op/s |err% |ins/op |cyc/op |IPC |bra/op | miss% | total
162+
| 831.95 | 1,201,996.30 |0.5% |11,144.00 |1,698.26 | 6.562 |841.00 | 0.0% | 5.53
163+
|===
164+
165+
===== Stack Element Size = 100 Bytes
166+
167+
[cols="2,1,1,1,1,1,1,1,1", options="header"]
168+
|===
169+
|ns/op |op/s |err% |ins/op |cyc/op |IPC |bra/op | miss% | total
170+
| 794.82 | 1,258,139.86 | 0.2% | 11,139.00 | 1,673.89 | 6.655 | 837.00 | 0.0% | 5.50
171+
|===
172+
173+
===== Stack Element Size = 520 Bytes
174+
175+
[cols="2,1,1,1,1,1,1,1,1", options="header"]
176+
|===
177+
|ns/op |op/s |err% |ins/op |cyc/op |IPC |bra/op | miss% | total
178+
| 1,946.67 | 513,697.88 | 0.2% | 27,681.00 | 4,095.57 | 6.759 | 885.00 | 0.0% | 5.50
179+
|===
180+
181+
===== Stack Element Size = 8000 Bytes
182+
183+
[cols="2,1,1,1,1,1,1,1,1", options="header"]
184+
|===
185+
|ns/op |op/s |err% |ins/op |cyc/op |IPC |bra/op | miss% | total
186+
| 20,958.63 | 47,713.05 | 2.7% | 304,137.02 | 43,789.86 | 6.945 | 1,689.02 | 0.4% | 5.63
187+
|===
188+
189+
=== OP_DUP OP_SHA256
190+
191+
NOTE: This test is likely irrelevant as per latest BIP-0360: _To prevent OP_DUP from creating an 8 MB stack by duplicating stack elements larger than 520 bytes we define OP_DUP to fail on stack elements larger than 520 bytes_.
192+
193+
This test builds off the previous (involving the hashing of large stack element data) by duplicating that stack element data.
194+
195+
The following Bitcoin script is used to conduct this performance test:
196+
197+
-----
198+
<pre-image array> OP_DUP OP_SHA256 OP_DROP OP_1
199+
-----
200+
201+
When executed, this script adds the pre-image array of arbitrary data to the stack.
202+
Immediately after, an `OP_DUP` operation duplicates the pre-image array on the stack.
203+
Then, `OP_SHA256` pops the duplicated pre-image array off the stack, hashes it, and pushes the 32-byte digest onto the stack.
204+
The `OP_DROP` operation removes the hash result from the stack.
123205

124-
==== Stack Element Size = 7500 Bytes
206+
==== Results Summary
125207

126-
|==
127-
| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark
128-
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
129-
| 19,172.67 | 52,157.58 | 0.5% | 285,220.02 | 40,203.63 | 7.094 | 1,636.02 | 0.4% | 5.49 | `VerifyP2WSHBench`
130-
|==
208+
[cols="3,1,1,1,1,1,1,1,1,1", options="header"]
209+
|===
210+
| Stack Element Size (Bytes) | ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op |miss% | total
211+
| 1 | 714.83 | 1,398,937.33 | 0.7% | 9,548.00 | 1,488.22 | 6.416 | 1,012.00 | 0.0% | 5.57
212+
| 64 | 858.44 | 1,164,905.19 | 0.4% | 11,911.00 | 1,800.87 | 6.614 | 999.00 | 0.0% | 5.11
213+
| 65 | 868.40 | 1,151,539.31 | 0.8% | 11,968.00 | 1,814.31 | 6.596 | 1,019.00 | 0.0% | 5.56
214+
| 100 | 864.33 | 1,156,966.91 | 0.4% | 11,963.00 | 1,809.16 | 6.612 | 1,015.00 | 0.0% | 5.49
215+
| 520 | 2,036.64 | 491,005.94 | 0.7% | 28,615.00 | 4,266.27 | 6.707 | 1,073.00 | 0.0% | 5.52
216+
| 8000 | 20,883.10 | 47,885.61 | 0.2% | 306,887.04 | 43,782.35 | 7.009 | 2,089.02 | 0.3% | 5.53
217+
|===
218+
219+
==== Analysis
220+
221+
The following observations are made from the performance test (in comparison to the `OP_SHA256` test):
222+
223+
. OP_DUP Overhead: The OP_DUP operation adds overhead by duplicating the stack element, which requires:
224+
* Memory allocation for the duplicate
225+
* Data copying from the original to the duplicate
226+
* Additional stack manipulation
131227

228+
. Size-Dependent Impact on ns/op:
229+
* For small elements (1-100 bytes): Significant overhead (4.4% to 12.2%)
230+
* For medium elements (520 bytes): Moderate overhead (4.6%)
231+
* For large elements (8000 bytes): Negligible difference (-0.4%, which is within the 2.7% measurement error reported for the 8000-byte `OP_SHA256` run)
232+
233+
. Instruction Count Impact:
234+
* 8000 bytes: 304,137 → 306,887 instructions (+2,750 instructions)
235+
* The additional instructions for OP_DUP are relatively small compared to the SHA256 computation
236+
237+
. Memory Operations:
238+
+
239+
The OP_DUP operation primarily affects memory operations rather than computational complexity.
240+
This explains why the impact diminishes with larger data sizes where SHA256 computation dominates the performance.
241+
242+
This analysis shows that the OP_DUP operation has a measurable but manageable performance impact for small stack elements, and that the impact becomes negligible for larger stack elements, where the SHA256 computation dominates the overall execution time.
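
The ns/op overhead percentages can be reproduced from the two Results Summary tables. The sketch below copies the ns/op columns from the `OP_SHA256` and `OP_DUP OP_SHA256` tables; the variable and function names are illustrative.

```python
# ns/op by stack element size, copied from the two Results Summary tables.
sha256_ns = {1: 637.28, 64: 794.85, 65: 831.95, 100: 794.82, 520: 1946.67, 8000: 20958.63}
dup_sha256_ns = {1: 714.83, 64: 858.44, 65: 868.40, 100: 864.33, 520: 2036.64, 8000: 20883.10}


def overhead_pct(size: int) -> float:
    """Relative ns/op change of OP_DUP OP_SHA256 versus OP_SHA256, in percent."""
    return (dup_sha256_ns[size] / sha256_ns[size] - 1.0) * 100.0


for size in sha256_ns:
    print(f"{size:>5} bytes: {overhead_pct(size):+.1f}%")
```

The small negative value at 8000 bytes is best read as measurement noise rather than a real speed-up, since `OP_DUP` cannot make the script do less work.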
132243

133244
=== Procedure
134245

246+
* Testing is done using functionality found in the link:https://github.com/jbride/bitcoin/tree/p2qrh[p2qrh branch] of Bitcoin Core.
247+
248+
* Compilation of Bitcoin Core is done using the following `cmake` flags:
249+
+
250+
-----
251+
$ cmake \
252+
-B build \
253+
-DWITH_ZMQ=ON \
254+
-DCMAKE_EXPORT_COMPILE_COMMANDS=ON \
255+
-DBUILD_BENCH=ON
256+
-----
257+
258+
* Bench tests are run with commands similar to the following:
259+
+
260+
-----
261+
$ export PREIMAGE_SIZE_BYTES=8000
262+
$ ./build/bin/bench_bitcoin --filter=VerifyP2WSHBench -min-time=5000
263+
-----
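
To repeat the bench command for every element size in the results tables, a small driver along these lines could be used. It is a sketch under assumptions: it reuses only the binary path, environment variable, and flags shown above, and the helper names `bench_invocation` and `run_sweep` are hypothetical. `run_sweep` is defined but not called, because it requires a built `bench_bitcoin`.

```python
import os
import subprocess

BENCH = "./build/bin/bench_bitcoin"   # path taken from the command above
SIZES = (1, 64, 65, 100, 520, 8000)   # sizes listed in the results tables


def bench_invocation(size_bytes: int, min_time_ms: int = 5000):
    """Build the command line and environment for one benchmark run."""
    env = dict(os.environ, PREIMAGE_SIZE_BYTES=str(size_bytes))
    cmd = [BENCH, "--filter=VerifyP2WSHBench", f"-min-time={min_time_ms}"]
    return cmd, env


def run_sweep(sizes=SIZES):
    """Run the benchmark once per size; needs a compiled bench_bitcoin."""
    for size in sizes:
        cmd, env = bench_invocation(size)
        subprocess.run(cmd, env=env, check=True)


# Print the planned invocations without executing them.
for size in SIZES:
    cmd, env = bench_invocation(size)
    print(f"PREIMAGE_SIZE_BYTES={env['PREIMAGE_SIZE_BYTES']}", " ".join(cmd))
```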
264+
135265
== Failure Analysis
136266

137267
Goals:
@@ -141,6 +271,57 @@ Goals:
141271
* Detect memory errors (e.g., invalid reads/writes, use-after-free) that might arise from modified stack handling.
142272
* Assess performance impacts (e.g., memory allocation overhead) in critical paths like transaction validation.
143273

274+
=== Memory Errors
275+
276+
AddressSanitizer is a fast, compiler-based tool (available in GCC/Clang) for detecting memory errors with lower overhead than Valgrind.
277+
278+
==== Results
279+
280+
No memory errors or leaks were revealed by AddressSanitizer when running the `OP_SHA256` bench test for 30 minutes.
281+
282+
==== Procedure
283+
284+
AddressSanitizer is included with Clang/LLVM.
285+
286+
. Compilation of Bitcoin Core is done using the following `cmake` flags:
287+
+
288+
-----
289+
$ cmake -B build \
290+
-DWITH_ZMQ=ON \
291+
-DBUILD_BENCH=ON \
292+
-DCMAKE_C_COMPILER=clang \
293+
-DCMAKE_CXX_COMPILER=clang++ \
294+
-DCMAKE_EXPORT_COMPILE_COMMANDS=ON \
295+
-DSANITIZERS=address,undefined
296+
297+
$ cmake --build build -j$(nproc)
298+
-----
299+
300+
. Check that ASan is statically linked to the _bench_bitcoin_ executable:
301+
+
302+
-----
303+
$ nm build/bin/bench_bitcoin | grep asan | more
304+
0000000000148240 T __asan_address_is_poisoned
305+
00000000000a2fe6 t __asan_check_load_add_16_R13
306+
307+
...
308+
309+
000000000316c828 b _ZZN6__asanL18GlobalsByIndicatorEmE20globals_by_indicator
310+
0000000003170ccc b _ZZN6__asanL7AsanDieEvE9num_calls
311+
-----
312+
313+
. Set the following environment variable:
314+
+
315+
-----
316+
$ export ASAN_OPTIONS="halt_on_error=0:detect_leaks=1:log_path=/tmp/asan_logs/asan"
317+
-----
318+
+
319+
Doing so ensures that _AddressSanitizer_:
320+
321+
.. avoids halting on the first error
322+
.. enables memory leak detection
323+
.. writes ASAN related logs to a specified directory
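
After a run, the log directory can be inspected for reports. With `log_path` set, the sanitizer runtime writes each report to a file named `<log_path>.<pid>`, so the files to look for start with `/tmp/asan_logs/asan.`. The helper below is a minimal sketch; the function name `asan_reports` is hypothetical and it makes no attempt to parse the report contents.

```python
import glob


def asan_reports(log_path: str = "/tmp/asan_logs/asan") -> dict[str, list[str]]:
    """Return the non-empty lines of every `<log_path>.<pid>` report file."""
    reports: dict[str, list[str]] = {}
    for path in sorted(glob.glob(log_path + ".*")):
        with open(path, errors="replace") as fh:
            lines = [line.strip() for line in fh if line.strip()]
        if lines:
            reports[path] = lines
    return reports


found = asan_reports()
if not found:
    print("no ASan reports found")
for path, lines in found.items():
    print(f"{path}: {lines[0]}")  # first line of each report
```

An empty result after the benchmark has finished corresponds to the "no memory errors or leaks" outcome reported above.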
324+
144325
== Test Environment
145326

146327
* Fedora 42
