Issue To Be Solved
Cadence storage layer's use of CBOR can be optimized to improve speed, memory use, and stored data size.
Suggestion
Performance gains are possible by tweaking the CBOR representation of Cadence Value types, for example by replacing CBOR maps with CBOR arrays where feasible.
In addition to the performance gains, replacing a CBOR map with a CBOR array preserves order (e.g. DictionaryValue.Entries).
UPDATE: Additional performance gains (especially for decoding) are possible by removing backwards compatible decoding (thanks @turbolent) and not saving key strings twice (thanks @SupunS).
No changes to fxamacker/cbor are required to support these optimizations.
The current CBOR representation in Cadence is very nice — so this optimization is kind of like taking a nicely normalized RDBMS schema and denormalizing some of it to gain speed.
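To illustrate the idea, here is a toy sketch (not Cadence's actual encoding; the field names and values below are made up) of how a map-based layout stores field-name strings and repeats key strings that a positional array-based layout avoids, using fxamacker/cbor/v2:

```go
// Toy comparison of a map-based vs. array-based CBOR layout using
// fxamacker/cbor/v2. The structure is hypothetical, not Cadence's encoding.
package main

import (
	"fmt"

	"github.com/fxamacker/cbor/v2"
)

func main() {
	keys := []string{"a", "b"}

	// Map-based layout: field-name strings ("keys", "entries") are encoded,
	// and each key string appears twice (in keys and as an entries map key).
	asMap := map[string]interface{}{
		"keys": keys,
		"entries": map[string]interface{}{
			"a": "1",
			"b": "2",
		},
	}

	// Array-based layout: fields are positional, so no field-name strings
	// are stored, key strings appear once, and entry order is preserved.
	asArray := []interface{}{
		keys,
		[]string{"1", "2"}, // values in the same order as keys
	}

	m, _ := cbor.Marshal(asMap)
	a, _ := cbor.Marshal(asArray)
	fmt.Printf("map layout: %d bytes, array layout: %d bytes\n", len(m), len(a))
}
```

Besides producing smaller output, arrays are faster to encode and decode than maps (see Proposed Changes below).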
Performance Comparisons
A preliminary comparison (from PR #752) with non-production data showed potential performance improvements. Although faster execution speed was the primary objective, memory use and stored data size also improved.
Comparisons using production data (in PR #761) show that the performance gains are cumulative. On a very large value (776,155 bytes), optimization of composite + dictionary types showed:
- encoding is 44.96% faster, 41.99% less memory, 46.02% fewer allocs
- decoding is 34.48% faster, 29.67% less memory, 18.56% fewer allocs
⚠️ NOTE: "faster" refers to the time delta reported by benchstat, so "50% faster" here means a 2x speedup.
Additional performance gains are possible by optimizing the remaining Value types. Smaller data will probably begin to catch up in percent improved as more of the remaining types are optimized. Larger data with big composites and dictionaries will show more modest gains from the remaining types.
Cumulative performance gains from PR #752, PR #761, PR #778, and PR #788 confirmed expectations about small data catching up in performance gains.
For a 167-byte LinkValue (extracted from a mainnet snapshot):
- encoding is 50.54% faster, with 57.81% less kB allocated per op and 54.46% fewer allocs/op
- decoding is 31.43% faster, with 47.60% less kB allocated per op and the same allocs/op
For a 776,155-byte CompositeValue (extracted from a mainnet snapshot):
- encoding is 45.23% faster, with 48.25% less kB allocated per op and 51.57% fewer allocs/op
- decoding is 35.83% faster, with 31.23% less kB allocated per op and 18.56% fewer allocs/op
Additional performance gains are possible by:
- optimizing code in the Cadence storage layer
- adding higher-performance API to fxamacker/cbor
- using higher-performance API of fxamacker/cbor when available
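As a hedged sketch of the last two items (the options and usage below are illustrative, not necessarily what Cadence would adopt): fxamacker/cbor/v2 already allows building preconfigured encoding/decoding modes once and reusing them, and any future higher-performance API would plug into a similar spot.

```go
// Sketch: build fxamacker/cbor/v2 encoding/decoding modes once and reuse them,
// instead of paying for option processing on every call. Illustrative only.
package main

import (
	"fmt"

	"github.com/fxamacker/cbor/v2"
)

var (
	encMode, _ = cbor.CanonicalEncOptions().EncMode() // configured once
	decMode, _ = cbor.DecOptions{}.DecMode()          // configured once
)

func main() {
	data, err := encMode.Marshal([]interface{}{[]string{"a", "b"}, []string{"1", "2"}})
	if err != nil {
		panic(err)
	}

	var decoded []interface{}
	if err := decMode.Unmarshal(data, &decoded); err != nil {
		panic(err)
	}
	fmt.Println(decoded)
}
```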
Preliminary Comparisons (non-production dictionary data)
Stored CBOR Data Size Comparisons (for dictionaries, non-production data)
Dictionary Data | Old Size (bytes) | New Size (bytes) | Delta |
---|---|---|---|
Small benchmark | 3361 | 1941 | -42.25% |
Large benchmark | 348802 | 199602 | -42.77% |
Size reduction is data dependent. Comparisons using production data are recommended.
Encoding Speed & Memory Comparisons (for dictionaries, non-production data)
name old time/op new time/op delta
EncodingSmallValue-4 199µs ± 0% 137µs ± 0% -31.26% (p=0.000 n=10+9)
EncodingLargeValue-4 19.4ms ± 0% 13.1ms ± 1% -32.25% (p=0.000 n=10+10)
name old alloc/op new alloc/op delta
EncodingSmallValue-4 57.6kB ± 0% 36.2kB ± 0% -37.12% (p=0.000 n=10+10)
EncodingLargeValue-4 4.67MB ± 0% 3.55MB ± 0% -23.93% (p=0.000 n=10+10)
name old allocs/op new allocs/op delta
EncodingSmallValue-4 1.10k ± 0% 0.80k ± 0% -27.46% (p=0.000 n=10+10)
EncodingLargeValue-4 92.1k ± 0% 70.8k ± 0% -23.12% (p=0.000 n=10+10)
Decoding Speed & Memory Comparisons (for dictionaries, non-production data)
name old time/op new time/op delta
DecodingSmallValue-4 192µs ± 0% 145µs ± 0% -24.32% (p=0.000 n=9+10)
DecodingLargeValue-4 19.9ms ± 1% 14.7ms ± 0% -26.26% (p=0.000 n=10+10)
name old alloc/op new alloc/op delta
DecodingSmallValue-4 73.9kB ± 0% 60.7kB ± 0% -17.84% (p=0.000 n=10+10)
DecodingLargeValue-4 7.49MB ± 0% 5.75MB ± 0% -23.23% (p=0.000 n=10+9)
name old allocs/op new allocs/op delta
DecodingSmallValue-4 1.57k ± 0% 1.36k ± 0% -13.36% (p=0.000 n=10+10)
DecodingLargeValue-4 143k ± 0% 122k ± 0% -14.62% (p=0.000 n=10+10)
Benchmarks used linux_amd64 with Go 1.15.10. See Caveats.
Performance of encoding and decoding is data dependent. Comparisons using production data are recommended.
Comparisons using production data (cumulative gains from PR #752, PR #761, PR #778, and PR #788):
LinkValue 🚀
Encoding Comparisons
name old time/op new time/op delta
EncodeCBOR/link_v3_2392f05c3b72f235_167bytes-4 20.2µs ± 1% 10.0µs ± 0% -50.54% (p=0.000 n=10+10)
EncodeCBOR/link_v3_3a791fe1b8243e73_192bytes-4 20.7µs ± 1% 10.0µs ± 0% -51.44% (p=0.000 n=10+10)
name old alloc/op new alloc/op delta
EncodeCBOR/link_v3_2392f05c3b72f235_167bytes-4 5.18kB ± 0% 2.18kB ± 0% -57.81% (p=0.000 n=10+10)
EncodeCBOR/link_v3_3a791fe1b8243e73_192bytes-4 5.50kB ± 0% 2.17kB ± 0% -60.60% (p=0.000 n=10+10)
name old allocs/op new allocs/op delta
EncodeCBOR/link_v3_2392f05c3b72f235_167bytes-4 101 ± 0% 46 ± 0% -54.46% (p=0.000 n=10+10)
EncodeCBOR/link_v3_3a791fe1b8243e73_192bytes-4 101 ± 0% 46 ± 0% -54.46% (p=0.000 n=10+10)
Decoding Comparisons
name old time/op new time/op delta
DecodeCBOR/link_v3_2392f05c3b72f235_167bytes-4 11.9µs ± 0% 8.2µs ± 1% -31.43% (p=0.000 n=8+10)
DecodeCBOR/link_v3_3a791fe1b8243e73_192bytes-4 11.9µs ± 1% 8.1µs ± 1% -31.84% (p=0.000 n=10+9)
name old alloc/op new alloc/op delta
DecodeCBOR/link_v3_2392f05c3b72f235_167bytes-4 4.50kB ± 0% 2.36kB ± 0% -47.60% (p=0.000 n=10+10)
DecodeCBOR/link_v3_3a791fe1b8243e73_192bytes-4 4.50kB ± 0% 2.36kB ± 0% -47.60% (p=0.000 n=10+10)
name old allocs/op new allocs/op delta
DecodeCBOR/link_v3_2392f05c3b72f235_167bytes-4 55.0 ± 0% 55.0 ± 0% ~ (all equal)
DecodeCBOR/link_v3_3a791fe1b8243e73_192bytes-4 55.0 ± 0% 55.0 ± 0% ~ (all equal)
Benchmarks used Go 1.15.10 on linux_amd64 using production data. See Caveats.
CompositeValue without dictionaries 🚀
Encoding Comparisons
name old time/op new time/op delta
EncodeCBOR/comp_v2_96d7e06eaf4b3fcf_14850bytes-4 899µs ± 1% 486µs ± 1% -45.93% (p=0.000 n=10+10)
EncodeCBOR/comp_v3_99dc360eee32dcec_160bytes-4 20.8µs ± 1% 11.8µs ± 1% -43.30% (p=0.000 n=10+10)
EncodeCBOR/comp_v3_b52a33b7e56868f6_139bytes-4 15.8µs ± 1% 8.4µs ± 0% -47.12% (p=0.000 n=10+10)
name old alloc/op new alloc/op delta
EncodeCBOR/comp_v2_96d7e06eaf4b3fcf_14850bytes-4 247kB ± 2% 138kB ± 0% -44.32% (p=0.000 n=10+9)
EncodeCBOR/comp_v3_99dc360eee32dcec_160bytes-4 4.93kB ± 2% 2.40kB ± 0% -51.30% (p=0.000 n=9+10)
EncodeCBOR/comp_v3_b52a33b7e56868f6_139bytes-4 3.77kB ± 0% 1.62kB ± 0% -57.15% (p=0.000 n=9+10)
name old allocs/op new allocs/op delta
EncodeCBOR/comp_v2_96d7e06eaf4b3fcf_14850bytes-4 5.03k ± 1% 2.54k ± 0% -49.50% (p=0.000 n=10+10)
EncodeCBOR/comp_v3_99dc360eee32dcec_160bytes-4 104 ± 1% 55 ± 0% -46.91% (p=0.000 n=10+10)
EncodeCBOR/comp_v3_b52a33b7e56868f6_139bytes-4 76.0 ± 0% 37.0 ± 0% -51.32% (p=0.000 n=10+10)
Decoding Comparisons
name old time/op new time/op delta
DecodeCBOR/comp_v2_96d7e06eaf4b3fcf_14850bytes-4 762µs ± 1% 516µs ± 1% -32.27% (p=0.000 n=10+10)
DecodeCBOR/comp_v3_99dc360eee32dcec_160bytes-4 17.2µs ± 1% 11.6µs ± 1% -32.59% (p=0.000 n=10+10)
DecodeCBOR/comp_v3_b52a33b7e56868f6_139bytes-4 11.9µs ± 1% 8.0µs ± 1% -32.96% (p=0.000 n=10+10)
name old alloc/op new alloc/op delta
DecodeCBOR/comp_v2_96d7e06eaf4b3fcf_14850bytes-4 298kB ± 0% 212kB ± 0% -28.81% (p=0.000 n=10+10)
DecodeCBOR/comp_v3_99dc360eee32dcec_160bytes-4 5.58kB ± 0% 3.87kB ± 0% -30.56% (p=0.000 n=10+10)
DecodeCBOR/comp_v3_b52a33b7e56868f6_139bytes-4 3.91kB ± 0% 2.57kB ± 0% -34.36% (p=0.000 n=10+10)
name old allocs/op new allocs/op delta
DecodeCBOR/comp_v2_96d7e06eaf4b3fcf_14850bytes-4 5.05k ± 0% 4.65k ± 0% -7.98% (p=0.000 n=10+10)
DecodeCBOR/comp_v3_99dc360eee32dcec_160bytes-4 97.0 ± 0% 89.0 ± 0% -8.25% (p=0.000 n=10+10)
DecodeCBOR/comp_v3_b52a33b7e56868f6_139bytes-4 56.0 ± 0% 53.0 ± 0% -5.36% (p=0.000 n=10+10)
Benchmarks used Go 1.15.10 on linux_amd64 using production data. See Caveats.
Large CompositeValue with dictionaries, etc. 🚀
Encoding Comparisons
name old time/op new time/op delta
EncodeCBOR/comp_v3_d99d7e6b4dad41e1_776155bytes-4 35.5ms ± 1% 19.4ms ± 1% -45.23% (p=0.000 n=10+9)
name old alloc/op new alloc/op delta
EncodeCBOR/comp_v3_d99d7e6b4dad41e1_776155bytes-4 7.62MB ± 4% 3.94MB ± 0% -48.25% (p=0.000 n=10+10)
name old allocs/op new allocs/op delta
EncodeCBOR/comp_v3_d99d7e6b4dad41e1_776155bytes-4 141k ± 0% 68k ± 0% -51.57% (p=0.000 n=10+10)
Decoding Comparisons
name old time/op new time/op delta
DecodeCBOR/comp_v3_d99d7e6b4dad41e1_776155bytes-4 40.4ms ± 1% 25.9ms ± 1% -35.83% (p=0.000 n=9+10)
name old alloc/op new alloc/op delta
DecodeCBOR/comp_v3_d99d7e6b4dad41e1_776155bytes-4 15.0MB ± 0% 10.3MB ± 0% -31.23% (p=0.000 n=9+10)
name old allocs/op new allocs/op delta
DecodeCBOR/comp_v3_d99d7e6b4dad41e1_776155bytes-4 265k ± 0% 216k ± 0% -18.56% (p=0.000 n=9+10)
Benchmarks used Go 1.15.10 on linux_amd64 using production data. See Caveats.
Proposed Changes
Arrays are faster to encode and decode than maps, so the proposed change replaces CBOR maps with CBOR arrays wherever feasible.
As mentioned earlier, this change has the added benefit of preserving the order of DictionaryValue.Entries.
The changes are very simple, and the tests pass with minimal modification. Modified files are limited to encoding.go, encoding_test.go, and decoding.go.
For example, changes to func (e *Encoder) prepareDictionaryValue might include:
- declaring entries as an array here (cadence/runtime/interpreter/encode.go, line 686 at commit 3bbc8c8), which currently reads:

      entries := make(map[string]interface{}, v.Entries.Len())

- replacing Content: cborMap{ with Content: cborArray{ here (cadence/runtime/interpreter/encode.go, lines 762 to 768 at commit 3bbc8c8), which currently read:

      return cbor.Tag{
          Number: cborTagDictionaryValue,
          Content: cborMap{
              encodedDictionaryValueKeysFieldKey:    keys,
              encodedDictionaryValueEntriesFieldKey: entries,
          },
      }, nil
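For comparison, a hypothetical sketch of what the array-based replacement could look like (the exact layout, ordering, and any helper code in the merged PRs may differ):

```go
// Hypothetical sketch only; the merged PRs may have chosen a different layout.

// (1) collect entry values into a slice (order preserved) instead of a map:
entries := make([]interface{}, 0, v.Entries.Len())
// ... append each prepared entry value here, in the same order as keys ...

// (2) emit a positional cborArray instead of a keyed cborMap:
return cbor.Tag{
	Number: cborTagDictionaryValue,
	Content: cborArray{
		keys,    // position 0: dictionary keys
		entries, // position 1: entry values, ordered to match keys
	},
}, nil
```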
As mentioned earlier, additional speed gains (especially for decoding) are possible by removing backwards compatible decoding (thanks @turbolent) and not saving key strings twice (thanks @SupunS).
Caveats
- Changing the CBOR representation of data is a breaking change -- hopefully, it's just an internal implementation detail within the storage layer. Guidance from @turbolent would be appreciated to make sure I didn't miss anything.
- I don't know how closely the speed gains will translate to other platforms (other CPU architectures, compiled to wasm, etc.).
- Having this change to the CBOR representation completed before I proceed with API changes to my CBOR library can save time.
- Performance comparisons are data dependent. Comparisons should use realistic Cadence data (the most common mix of data types, values, and sizes encountered in production).
Updates (not exhaustive list)
April 9, 2021 at 2:30 PM PT - Include cumulative benchmark comparisons to master (commit 7c7dd55) with optimizations from PR #752, PR #761, PR #778, and PR #788 using production data (values extracted from mainnet snapshot).
April 5, 2021 at 6:03 AM PT - Include benchmark comparisons using production data. Remove section "Data Used for Benchmark Comparisons" because it described non-production data.