@fxamacker fxamacker commented Mar 1, 2023

Description

Updates #2157

Cadence Compact Format (CCF) is a data format designed for compact, efficient, and deterministic encoding of Cadence external values.

CCF obsoletes JSON-Cadence Data Interchange Format (JSON-CDC) for use cases that do not require JSON.

Unlike JSON-CDC (and similar formats), the CCF Specification explicitly defines requirements for:

  • well-formed encodings
  • valid encodings
  • deterministic encodings

CCF is a hybrid data format. CCF-based messages can be:

  • fully self-describing: ~2x smaller than JSON-CDC for events
  • partially self-describing: ~14x smaller than JSON-CDC for events

Both CCF modes are more compact than JSON-based messages. CCF-based protocols can send Cadence metadata just once for all messages of that type. Malformed data can be detected without Cadence metadata and without creating Cadence objects.

The CCF codec implements CCF Specification (RC1), which is temporarily at github.com/fxamacker/ccf_draft. The CCF specs will be hosted under github.com/onflow after they are updated, cleaned up, and reformatted.

API

CCF codec provides the same API as the existing JSON-Cadence Data Interchange codec.

Given this, integration effort to replace the old codec should be minimal.

PRELIMINARY COMPARISONS

Comparisons used 48,309 events from a single mainnet transaction.

There were 9 event types. To simplify benchmark code, the first event's value in each of the 9 event types was used.

CCF's partially self-describing mode (aka "detached" mode) would be even smaller than this (e.g. maybe less than 1/14 the size of JSON when Flow eventually supports detached mode).

SIZE COMPARISON

Encoding | Num Events | Encoded size | Comments
-------- | ---------- | ------------ | --------
JSON     |     48,309 |   13,858,836 | JSON-Cadence Data Interchange
CCF      |     48,309 |    6,159,931 | CCF in fully self-describing mode
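The "~2x smaller" figure can be checked directly from the two encoded sizes in the table. The sketch below only redoes that arithmetic; the constant and function names are made up for illustration.

```go
package main

import "fmt"

// Encoded sizes in bytes, copied from the size comparison table.
const (
	jsonBytes = 13_858_836
	ccfBytes  = 6_159_931
)

// sizeRatio returns how many times larger a is than b.
func sizeRatio(a, b int) float64 {
	return float64(a) / float64(b)
}

func main() {
	r := sizeRatio(jsonBytes, ccfBytes)
	fmt.Printf("JSON / CCF = %.2fx\n", r)
	fmt.Printf("CCF is %.1f%% of the JSON size\n", 100/r)
}
```

For these inputs the ratio comes out at roughly 2.25x, i.e. the fully self-describing CCF encoding is a little under 45% of the JSON size.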

CCF SPEED AND MEMORY COMPARISONS

This isn't an apples-to-apples comparison: JSON data isn't sorted, etc.

  • CCF encoder sorts data to encode event data deterministically.
  • CCF decoder also verifies event data is sorted, well-formed, and valid.
  • CCF encoding is less than 1/2 the size of JSON-Cadence Data Interchange.
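A minimal sketch of the sort-on-encode and verify-on-decode idea from the bullets above. This is not the CCF codec's code: the `field` type, the function names, and the plain string ordering are assumptions for illustration, and the CCF Specification defines the actual ordering rules.

```go
package main

import (
	"fmt"
	"sort"
)

// field is a hypothetical name/value pair standing in for event data.
type field struct {
	Name  string
	Value string
}

// sortFields returns a sorted copy, so the same set of fields always
// ends up in the same order regardless of input order.
func sortFields(in []field) []field {
	out := append([]field(nil), in...)
	sort.Slice(out, func(i, j int) bool { return out[i].Name < out[j].Name })
	return out
}

// verifySorted returns an error unless names are strictly increasing,
// which also rejects duplicate names.
func verifySorted(fs []field) error {
	for i := 1; i < len(fs); i++ {
		if fs[i-1].Name >= fs[i].Name {
			return fmt.Errorf("field %q is out of order or duplicated", fs[i].Name)
		}
	}
	return nil
}

func main() {
	a := sortFields([]field{{"to", "0x2"}, {"amount", "10"}, {"from", "0x1"}})
	b := sortFields([]field{{"from", "0x1"}, {"to", "0x2"}, {"amount", "10"}})
	fmt.Println(a, b, verifySorted(a))
}
```

The extra sort on the encoder side and the extra check on the decoder side are part of why the timings below are not a like-for-like comparison with JSON.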
$ benchstat bench_json_events_48k.log bench_ccf_events_48k.log 
goos: linux
goarch: amd64
pkg: github.com/onflow/cadence/encoding/ccf
cpu: 13th Gen Intel(R) Core(TM) i5-13600K
                     │ bench_json_events_48k.log │      bench_ccf_events_48k.log       │
                     │          sec/op           │   sec/op     vs base                │
EncodeBatchEvents-20                 96.61m ± 4%   70.73m ± 3%  -26.79% (p=0.000 n=10)
DecodeBatchEvents-20                 647.7m ± 3%   157.5m ± 3%  -75.68% (p=0.000 n=10)
geomean                              250.1m        105.5m       -57.81%

                     │ bench_json_events_48k.log │       bench_ccf_events_48k.log       │
                     │           B/op            │     B/op      vs base                │
EncodeBatchEvents-20                32.45Mi ± 0%   25.82Mi ± 0%  -20.45% (p=0.000 n=10)
DecodeBatchEvents-20               234.97Mi ± 0%   56.16Mi ± 0%  -76.10% (p=0.000 n=10)
geomean                             87.32Mi        38.08Mi       -56.39%

                     │ bench_json_events_48k.log │      bench_ccf_events_48k.log       │
                     │         allocs/op         │  allocs/op   vs base                │
EncodeBatchEvents-20                 756.6k ± 0%   370.4k ± 0%  -51.05% (p=0.000 n=10)
DecodeBatchEvents-20                 4.746M ± 0%   1.288M ± 0%  -72.86% (p=0.000 n=10)
geomean                              1.895M        690.7k       -63.55%

Benchmarked using Go 1.19.6, linux_amd64, i5-13600K. Results are subject to change because the CCF codec reference implementation in Go has not yet been reviewed.
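For readers who want to sanity-check the "vs base" and geomean columns, here is a small sketch that recomputes them from the rounded sec/op values shown above. The helper names are made up, and because the inputs are already rounded, the results only approximately match what benchstat prints.

```go
package main

import (
	"fmt"
	"math"
)

// percentDelta returns the relative change from base to next, in percent.
func percentDelta(base, next float64) float64 {
	return (next - base) / base * 100
}

// geomean returns the geometric mean of xs; it assumes all values are positive.
func geomean(xs ...float64) float64 {
	sumLog := 0.0
	for _, x := range xs {
		sumLog += math.Log(x)
	}
	return math.Exp(sumLog / float64(len(xs)))
}

func main() {
	// sec/op values in milliseconds, copied from the benchstat output.
	jsonEnc, ccfEnc := 96.61, 70.73
	jsonDec, ccfDec := 647.7, 157.5

	fmt.Printf("encode: %+.2f%%\n", percentDelta(jsonEnc, ccfEnc))
	fmt.Printf("decode: %+.2f%%\n", percentDelta(jsonDec, ccfDec))
	fmt.Printf("geomean: %.1fm -> %.1fm\n",
		geomean(jsonEnc, jsonDec), geomean(ccfEnc, ccfDec))
}
```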

NEXT STEPS

👉 After this PR is merged, there's additional work that should be completed before using or deploying this codec.

  • Add more tests to handle edge cases (as time allows). Currently, go test -cover reports 77.3% for CCF codec.

  • Add and run fuzz tests. It would be very unusual for fuzzing to not find any problems with a brand new codec for a new data format. I think Cadence Team is best qualified for this task (i.e. expert in Cadence types & values).

  • Integration with FVM to use CCF codec for events.

  • Integration work related to converting CCF encodings to JSON-Cadence Data Interchange.

  • Add integration tests in relevant projects.

EDIT: updated benchmarks to match commit f911063


  • Targeted PR against master branch
  • Linked to Github issue with discussion and accepted design OR link to spec that describes this work
  • Code follows the standards mentioned here
  • Updated relevant documentation
  • Re-reviewed Files changed in the Github PR explorer
  • Added appropriate labels

This is to allow CCF codec to use new StreamEncoder.Close()
function provided by latest fxamacker/cbor.
- added NewMeteredUFix64FromUint64 to create metered UFix64 from uint64.
- added NewMeteredFix64FromInt64 to create metered Fix64 from int64.
Cadence Compact Format (CCF) is a data format designed for compact,
efficient, and deterministic encoding of Cadence external values.

CCF obsoletes JSON-Cadence Data Interchange Format (JCDIF) for
use cases that do not require JSON.

Unlike JCDIF, the CCF Specification explicitly defines requirements for
- well-formed encodings
- valid encodings
- deterministic encodings

CCF is a hybrid data format. CCF-based messages can be:
- fully self-describing or
- partially self-describing.

Both CCF modes are more compact than JSON-based messages. CCF-based
protocols can send Cadence metadata just once for all messages of
that type. Malformed data can be detected without Cadence metadata
and without creating Cadence objects.

The CCF codec implements CCF Specification (RC1), which is
temporarily at github.com/fxamacker/ccf_draft. The CCF specs
will be hosted under github.com/onflow after they are updated,
cleaned up, and reformatted.
CCF obsoletes JSON-Cadence Data Interchange Format for use cases
that do not require JSON.  Given this, preliminary comparisons are
described here for the CCF codec implementing CCF Specifications (RC1).

PRELIMINARY COMPARISONS

Comparisons used 48,309 events from a single mainnet transaction.

There were 9 event types. To simplify benchmark code, the first event's
value in each of the 9 event types was used.

CCF's partially self-describing mode (aka "detached" mode) would be
even smaller than this (e.g. maybe less than 1/4 the size of JSON when
Flow eventually supports detached mode).

SIZE COMPARISON

Encoding | Num Events | Encoded size | Comments
-------- | ---------- | ------------ | --------
JSON     |     48,309 |   13,858,836 | JSON-Cadence Data Interchange
CCF      |     48,309 |    6,159,931 | CCF in fully self-describing mode

CCF SPEED AND MEMORY COMPARISONS

This isn't an apples-to-apples comparison: JSON data isn't sorted, etc.
- CCF encoder sorts data to encode event data deterministically.
- CCF decoder also verifies event data is sorted, well-formed, and valid.
- CCF encoding is less than 1/2 the size of JSON-Cadence Data Interchange.

ENCODER COMPARISON

48k_events_encode_json.log │  48k_events_encode_ccf.log
          sec/op           │   sec/op       vs base
       89.84m ± 17%            69.28m ± 3%  -22.88%

48k_events_encode_json.log │  48k_events_encode_ccf.log
            B/op           │     B/op        vs base
       32.45Mi ± 0%           25.82Mi ± 0%  -20.45%

48k_events_encode_json.log │  48k_events_encode_ccf.log
         allocs/op         │  allocs/op     vs base
        756.6k ± 0%            370.4k ± 0%  -51.05%

DECODER COMPARISON

48k_events_decode_json.log │   48k_events_decode_ccf.log
          sec/op           │   sec/op       vs base
        646.2m ± 8%            158.3m ± 5%  -75.50%

48k_events_decode_json.log │   48k_events_decode_ccf.log
           B/op            │     B/op        vs base
      234.97Mi ± 0%            56.16Mi ± 0%  -76.10%

48k_events_decode_json.log │   48k_events_decode_ccf.log
        allocs/op          │  allocs/op     vs base
        4.746M ± 0%            1.288M ± 0%  -72.86%

Benchmarked using Go 1.19.6, linux_amd64, i5-13600K.  Results are
subject to change because the CCF codec reference implementation in Go
has not yet been reviewed or merged into onflow/cadence.
Added comment stating that cadence.String and cadence.Character
must be valid UTF-8 and it is the application's responsibility
to provide the CCF encoder with valid UTF-8 strings.

"Valid CCF Encoding Requirements" in CCF Specification states:

    "Encoders are not required to check for invalid input items
    (e.g. invalid UTF-8 strings, duplicate dictionary keys, etc.)
    Applications MUST NOT provide invalid items to encoders."
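Since the encoder is not required to check its input, an application may want to validate strings itself before handing them over. A minimal sketch, assuming a hypothetical `checkString` helper on the application side; it uses only the standard `unicode/utf8` package and is not part of the CCF codec.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

var errInvalidUTF8 = errors.New("string is not valid UTF-8")

// checkString is a hypothetical application-side guard that runs before
// a string is wrapped in a Cadence value and passed to an encoder.
func checkString(s string) error {
	if !utf8.ValidString(s) {
		return errInvalidUTF8
	}
	return nil
}

func main() {
	fmt.Println(checkString("héllo"))    // well-formed UTF-8
	fmt.Println(checkString("\xff\xfe")) // not well-formed UTF-8
}
```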
@fxamacker fxamacker requested a review from turbolent as a code owner March 1, 2023 20:44
@fxamacker fxamacker self-assigned this Mar 1, 2023

github-actions bot commented Mar 1, 2023

Cadence Benchstat comparison

This branch was compared with the base branch onflow:master commit 1541525
The command for i in {1..N}; do go test ./... -run=XXX -bench=. -benchmem -shuffle=on; done was used.
Bench tests were run a total of 7 times on each branch.

Collapsed results for better readability

time/op | old.txt | new.txt | delta
------- | ------- | ------- | -----
CheckContractInterfaceFungibleTokenConformance-2 | 113µs ± 0% | 116µs ± 0% | ~ (p=1.000 n=1+1)
ContractInterfaceFungibleToken-2 | 37.8µs ± 0% | 37.9µs ± 0% | ~ (p=1.000 n=1+1)
ExportType/composite_type-2 | 381ns ± 0% | 350ns ± 0% | ~ (p=1.000 n=1+1)
ExportType/simple_type-2 | 59.1ns ± 0% | 52.8ns ± 0% | ~ (p=1.000 n=1+1)
InterpretRecursionFib-2 | 2.39ms ± 0% | 2.70ms ± 0% | ~ (p=1.000 n=1+1)
NewInterpreter/new_interpreter-2 | 1.11µs ± 0% | 1.10µs ± 0% | ~ (p=1.000 n=1+1)
NewInterpreter/new_sub-interpreter-2 | 584ns ± 0% | 593ns ± 0% | ~ (p=1.000 n=1+1)
ParseArray-2 | 7.90ms ± 0% | 8.18ms ± 0% | ~ (p=1.000 n=1+1)
ParseDeploy/byte_array-2 | 11.8ms ± 0% | 11.9ms ± 0% | ~ (p=1.000 n=1+1)
ParseDeploy/decode_hex-2 | 1.26ms ± 0% | 1.19ms ± 0% | ~ (p=1.000 n=1+1)
ParseFungibleToken/With_memory_metering-2 | 183µs ± 0% | 189µs ± 0% | ~ (p=1.000 n=1+1)
ParseFungibleToken/Without_memory_metering-2 | 145µs ± 0% | 165µs ± 0% | ~ (p=1.000 n=1+1)
ParseInfix-2 | 6.90µs ± 0% | 7.69µs ± 0% | ~ (p=1.000 n=1+1)
QualifiedIdentifierCreation/One_level-2 | 2.35ns ± 0% | 2.35ns ± 0% | ~ (p=1.000 n=1+1)
QualifiedIdentifierCreation/Three_levels-2 | 139ns ± 0% | 137ns ± 0% | ~ (p=1.000 n=1+1)
RuntimeResourceDictionaryValues-2 | 5.02ms ± 0% | 5.01ms ± 0% | ~ (p=1.000 n=1+1)
RuntimeScriptNoop-2 | 8.52µs ± 0% | 4.01µs ± 0% | ~ (p=1.000 n=1+1)
SuperTypeInference/arrays-2 | 314ns ± 0% | 315ns ± 0% | ~ (p=1.000 n=1+1)
SuperTypeInference/composites-2 | 137ns ± 0% | 134ns ± 0% | ~ (p=1.000 n=1+1)
SuperTypeInference/integers-2 | 98.1ns ± 0% | 99.2ns ± 0% | ~ (p=1.000 n=1+1)
ValueIsSubtypeOfSemaType-2 | 97.9ns ± 0% | 92.0ns ± 0% | ~ (p=1.000 n=1+1)

alloc/op | old.txt | new.txt | delta
-------- | ------- | ------- | -----
CheckContractInterfaceFungibleTokenConformance-2 | 49.2kB ± 0% | 49.2kB ± 0% | ~ (p=1.000 n=1+1)
ContractInterfaceFungibleToken-2 | 23.3kB ± 0% | 23.3kB ± 0% | ~ (p=1.000 n=1+1)
ExportType/composite_type-2 | 136B ± 0% | 136B ± 0% | ~ (all equal)
ExportType/simple_type-2 | 0.00B | 0.00B | ~ (all equal)
InterpretRecursionFib-2 | 1.00MB ± 0% | 1.00MB ± 0% | ~ (all equal)
NewInterpreter/new_interpreter-2 | 768B ± 0% | 768B ± 0% | ~ (all equal)
NewInterpreter/new_sub-interpreter-2 | 200B ± 0% | 200B ± 0% | ~ (all equal)
ParseArray-2 | 2.65MB ± 0% | 2.86MB ± 0% | ~ (p=1.000 n=1+1)
ParseDeploy/byte_array-2 | 4.26MB ± 0% | 4.09MB ± 0% | ~ (p=1.000 n=1+1)
ParseDeploy/decode_hex-2 | 214kB ± 0% | 214kB ± 0% | ~ (p=1.000 n=1+1)
ParseFungibleToken/With_memory_metering-2 | 28.9kB ± 0% | 28.9kB ± 0% | ~ (all equal)
ParseFungibleToken/Without_memory_metering-2 | 28.9kB ± 0% | 28.9kB ± 0% | ~ (p=1.000 n=1+1)
ParseInfix-2 | 1.91kB ± 0% | 1.91kB ± 0% | ~ (p=1.000 n=1+1)
QualifiedIdentifierCreation/One_level-2 | 0.00B | 0.00B | ~ (all equal)
QualifiedIdentifierCreation/Three_levels-2 | 64.0B ± 0% | 64.0B ± 0% | ~ (all equal)
RuntimeResourceDictionaryValues-2 | 2.28MB ± 0% | 2.28MB ± 0% | ~ (p=1.000 n=1+1)
RuntimeScriptNoop-2 | 2.70kB ± 0% | 2.70kB ± 0% | ~ (all equal)
SuperTypeInference/arrays-2 | 96.0B ± 0% | 96.0B ± 0% | ~ (all equal)
SuperTypeInference/composites-2 | 0.00B | 0.00B | ~ (all equal)
SuperTypeInference/integers-2 | 0.00B | 0.00B | ~ (all equal)
ValueIsSubtypeOfSemaType-2 | 48.0B ± 0% | 48.0B ± 0% | ~ (all equal)

allocs/op | old.txt | new.txt | delta
--------- | ------- | ------- | -----
CheckContractInterfaceFungibleTokenConformance-2 | 806 ± 0% | 806 ± 0% | ~ (all equal)
ContractInterfaceFungibleToken-2 | 370 ± 0% | 370 ± 0% | ~ (all equal)
ExportType/composite_type-2 | 3.00 ± 0% | 3.00 ± 0% | ~ (all equal)
ExportType/simple_type-2 | 0.00 | 0.00 | ~ (all equal)
InterpretRecursionFib-2 | 18.9k ± 0% | 18.9k ± 0% | ~ (all equal)
NewInterpreter/new_interpreter-2 | 13.0 ± 0% | 13.0 ± 0% | ~ (all equal)
NewInterpreter/new_sub-interpreter-2 | 4.00 ± 0% | 4.00 ± 0% | ~ (all equal)
ParseArray-2 | 59.6k ± 0% | 59.6k ± 0% | ~ (all equal)
ParseDeploy/byte_array-2 | 89.4k ± 0% | 89.4k ± 0% | ~ (all equal)
ParseDeploy/decode_hex-2 | 63.0 ± 0% | 63.0 ± 0% | ~ (all equal)
ParseFungibleToken/With_memory_metering-2 | 768 ± 0% | 768 ± 0% | ~ (all equal)
ParseFungibleToken/Without_memory_metering-2 | 768 ± 0% | 768 ± 0% | ~ (all equal)
ParseInfix-2 | 48.0 ± 0% | 48.0 ± 0% | ~ (all equal)
QualifiedIdentifierCreation/One_level-2 | 0.00 | 0.00 | ~ (all equal)
QualifiedIdentifierCreation/Three_levels-2 | 2.00 ± 0% | 2.00 ± 0% | ~ (all equal)
RuntimeResourceDictionaryValues-2 | 36.9k ± 0% | 36.9k ± 0% | ~ (p=1.000 n=1+1)
RuntimeScriptNoop-2 | 43.0 ± 0% | 43.0 ± 0% | ~ (all equal)
SuperTypeInference/arrays-2 | 3.00 ± 0% | 3.00 ± 0% | ~ (all equal)
SuperTypeInference/composites-2 | 0.00 | 0.00 | ~ (all equal)
SuperTypeInference/integers-2 | 0.00 | 0.00 | ~ (all equal)
ValueIsSubtypeOfSemaType-2 | 1.00 ± 0% | 1.00 ± 0% | ~ (all equal)
 


codecov bot commented Mar 1, 2023

Codecov Report

Merging #2364 (6e5cb15) into master (1541525) will decrease coverage by 0.31%.
The diff coverage is 72.36%.

@@            Coverage Diff             @@
##           master    #2364      +/-   ##
==========================================
- Coverage   78.58%   78.28%   -0.31%     
==========================================
  Files         315      325      +10     
  Lines       68892    72307    +3415     
==========================================
+ Hits        54137    56603    +2466     
- Misses      12946    13627     +681     
- Partials     1809     2077     +268     
Flag      | Coverage Δ
--------- | ----------
unittests | 78.28% <72.36%> (-0.31%) ⬇️


Impacted Files | Coverage Δ
-------------- | ----------
encoding/ccf/encode_typedef.go | 60.19% <60.19%> (ø)
encoding/ccf/decode.go | 61.75% <61.75%> (ø)
types.go | 85.79% <64.70%> (+2.00%) ⬆️
encoding/ccf/encode_type.go | 71.80% <71.80%> (ø)
encoding/ccf/decode_typedef.go | 72.37% <72.37%> (ø)
encoding/ccf/decode_type.go | 74.76% <74.76%> (ø)
encoding/ccf/encode.go | 77.11% <77.11%> (ø)
encoding/ccf/traverse_value.go | 98.30% <98.30%> (ø)
encoding/ccf/ccf_type_id.go | 100.00% <100.00%> (ø)
encoding/ccf/simple_type_utils.go | 100.00% <100.00%> (ø)
... and 2 more

... and 3 files with indirect coverage changes


@j1010001 j1010001 added the E&V Team Execution / Verification / Edge Team label Apr 3, 2023
Composite initializer parameters have a natural order,
which shouldn't be changed.

Thanks Bastian for spotting this during discussion!
Prior to this change, more initializers were allowed.

Currently, Cadence doesn't support more than one initializer and
adding this limit to CCF removes the need to sort initializers.

Thanks Bastian for suggesting this!
Cadence recently added FunctionType.TypeParameters, so
add support for it in CCF codec.
Add support for nullable types:
- type bound in FunctionType.TypeParameters
- type of RestrictedType
- borrow type of CapabilityType
- add more tests

@turbolent turbolent left a comment

Fantastic work! 👏

I reviewed everything except the test cases again and it's looking great!

Except for some minor suggestions, the only work remaining I can see is adding more tests for the unhappy paths of the decoder, but maybe it's better to get this PR in and add those in a follow-up PR, because this one is already very large.

One part I'm still trying to wrap my head around is the special handling of optional and reference types (needToEncodeRuntimeType, getTypeToEncodeAsCCFInlineType, getOptionalInnerTypeToEncodeAsCCFInlineType). Maybe we can have a follow-up sync around this next week.

fxamacker and others added 10 commits April 11, 2023 14:27
Co-authored-by: Bastian Müller <bastian@axiomzen.co>
Cadence recently improved PathValue in PR 2427, so the CCF codec
was modified to use the updated PathValue.

Also added more tests to encode different types of PathValue.
Co-authored-by: Bastian Müller <bastian@axiomzen.co>
Co-authored-by: Bastian Müller <bastian@axiomzen.co>
Thanks @turbolent for identifying potential coding changes in the future
that could require determinism inside a function when metering is added.

  "Even though I can't currently see a problem, i.e. the function only has
  local side-effects, this might not stay true forever
  (e.g. we are going to add metering)."
Reduce the default max elements limit to 20_000_000 for
arrays and maps.

These limits are large (and can be reduced further if needed):
- The current gRPC limit is 20 MB, and these limits are large enough
to support an unrealistic CCF message with zero overhead and
1-byte elements that uses up the entire 20 MB gRPC limit.
- It would typically take many thousands of "normal" CCF-encoded events
to get near the 20 MB gRPC limit with one transaction.

Also added comments explaining the security considerations for
having limits.

Thanks @turbolent for reminder to document limits!
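The general idea behind such a limit can be sketched as a check on the declared element count before anything is allocated. This is only an illustration: `checkCount` and `errTooManyElements` are hypothetical names, not the codec's API, and only the 20_000_000 figure comes from the commit message above.

```go
package main

import (
	"errors"
	"fmt"
)

// maxElements mirrors the 20_000_000 default named in the commit message.
const maxElements = 20_000_000

var errTooManyElements = errors.New("declared element count exceeds limit")

// checkCount rejects a declared array or map length before any memory
// is allocated for its elements.
func checkCount(declared, limit uint64) error {
	if declared > limit {
		return errTooManyElements
	}
	return nil
}

func main() {
	fmt.Println(checkCount(3, maxElements))             // within the limit
	fmt.Println(checkCount(maxElements+1, maxElements)) // rejected
}
```

Checking the declared count up front means a tiny malicious message cannot make the decoder reserve memory for billions of elements it never actually contains.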
As a CBOR security consideration, make the CCF decoder reject messages
containing indefinite-length CBOR byte strings, text strings, arrays,
and maps.
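For context, CBOR (RFC 8949) marks indefinite-length items in the initial byte: the top 3 bits are the major type and the low 5 bits are the additional information, where the value 31 on major types 2 to 5 means "indefinite length". The sketch below shows that check in isolation; the function and constant names are made up, and it is not how the CCF codec itself is wired up.

```go
package main

import "fmt"

// CBOR major types that can be indefinite-length (RFC 8949).
const (
	majorByteString = 2
	majorTextString = 3
	majorArray      = 4
	majorMap        = 5

	// additionalInfoIndefinite is the additional-information value 31.
	additionalInfoIndefinite = 31
)

// isIndefiniteLength reports whether a CBOR initial byte starts an
// indefinite-length byte string, text string, array, or map.
func isIndefiniteLength(initial byte) bool {
	major := initial >> 5
	info := initial & 0x1f
	if info != additionalInfoIndefinite {
		return false
	}
	switch major {
	case majorByteString, majorTextString, majorArray, majorMap:
		return true
	}
	return false
}

func main() {
	fmt.Println(isIndefiniteLength(0x9f)) // array, indefinite length
	fmt.Println(isIndefiniteLength(0x83)) // array of 3 items, definite length
}
```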

@turbolent turbolent left a comment

Recent changes look great! 👌


@SupunS SupunS left a comment

🎉


@turbolent turbolent left a comment

Fantastic work @fxamacker! 👏👏

This prevents an edge case from encoding more data than necessary.

getTypeToEncodeAsCCFInlineType() returns runtime type to be encoded
after removing redundant type info that is present in staticType
because staticType is already encoded at higher level.

This applies to optional type container that is present in both static
type and runtime type.

This commit unwraps reference type from static type if present and
tries again because reference type is only present in static type.

fxamacker commented Apr 13, 2023

@turbolent @SupunS Thanks for reviews, fuzzing, and meetings!

After merging this, I'll open followup issue & PR to add API to allow some default CBOR limits to be overridden (e.g. NewDecoderWithOptions(), etc.) and address related topics mentioned by @turbolent.

I spotted an edge case where it was possible for encoder to needlessly encode redundant data, related to reference type. Can you take a look at commit 51552f1 to confirm it makes sense before I merge this PR? 🙏

Thanks @turbolent for spotting this!
Replaced recursive call with for-loop.

Thanks @turbolent for suggesting this!
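As a generic illustration of replacing a recursive call with a for-loop when unwrapping nested wrapper types, here is a sketch using hypothetical stand-in types. It does not reproduce the codec's getTypeToEncodeAsCCFInlineType logic; `referenceType`, `simpleType`, and `unwrapReferences` are invented for this example.

```go
package main

import "fmt"

// typ is a hypothetical stand-in for a type description.
type typ interface{}

// referenceType wraps another type; simpleType is a leaf.
type referenceType struct{ inner typ }
type simpleType struct{ name string }

// unwrapReferences strips any number of nested reference wrappers.
// A recursive version would call itself on r.inner; the loop does the
// same work without growing the call stack.
func unwrapReferences(t typ) typ {
	for {
		r, ok := t.(referenceType)
		if !ok {
			return t
		}
		t = r.inner
	}
}

func main() {
	nested := referenceType{referenceType{simpleType{"Int"}}}
	fmt.Println(unwrapReferences(nested))
}
```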

Labels

E&V Team Execution / Verification / Edge Team Feature
