feat(orm)!: ordered variable length encoding for uint32 and uint64 types #11090

aaronc · 2022-02-02T01:38:05Z

Description

uint64 values are used in the ORM as auto-incrementing primary keys. Always using 8 bytes for these values is a bit of a waste of space. Unfortunately, varint encoding does not support ordered prefix iteration.

This PR introduces a compact, well-ordered variable length encoding for uint32 and uint64 types. fixed32 and fixed64 integers are still encoded as 4 and 8 byte fixed-length big-endian arrays. With this, users have a choice of encoding based on what type of data they are storing. An auto-incrementing primary key should prefer the variable length uint64 whereas a fixed precision decimal might want to use fixed64.

See the golden test updates to see how this reduces key lengths.

This encoding works by using the first two bits to encode the buffer length (4 possible lengths). I'm not sure if my choice of 2,4,6 and 9 bytes is the right choice of 4 lenths for uint64 - there are many alternate choices. I could have also chosen 3 bits and allowed for 8 possible lengths, but way waste an extra bit? Input on the right design parameters would be appreciated.

Author Checklist

All items are required. Please add a note to the item if the item is not applicable and
please add links to any relevant follow up issues.

I have...

Reviewers Checklist

All items are required. Please add a note if the item is not applicable and please add
your handle next to the items reviewed if you only reviewed selected items.

I have...

confirmed the correct type prefix in the PR title
confirmed ! in the type prefix if API or client breaking change
confirmed all author checklist items have been addressed
reviewed state machine logic
reviewed API design and naming
reviewed documentation is accurate
reviewed tests and test coverage
manually tested (if applicable)

codecov · 2022-02-02T01:47:48Z

Codecov Report

Merging #11090 (db3a82d) into master (8110576) will decrease coverage by 0.17%.
The diff coverage is 75.40%.

@@            Coverage Diff             @@
##           master   #11090      +/-   ##
==========================================
- Coverage   65.96%   65.79%   -0.18%     
==========================================
  Files         698      684      -14     
  Lines       70810    69487    -1323     
==========================================
- Hits        46709    45718     -991     
+ Misses      21235    20973     -262     
+ Partials     2866     2796      -70

Impacted Files	Coverage Δ
x/authz/keeper/msg_server.go	`0.00% <ø> (ø)`
x/authz/keeper/keeper.go	`67.25% <33.33%> (-0.21%)`	⬇️
x/authz/simulation/operations.go	`67.29% <64.28%> (-5.17%)`	⬇️
orm/encoding/ormfield/uint32.go	`73.77% <71.92%> (-26.23%)`	⬇️
orm/encoding/ormfield/uint64.go	`77.93% <74.80%> (-22.07%)`	⬇️
orm/encoding/ormfield/codec.go	`96.00% <100.00%> (+0.34%)`	⬆️
x/authz/authorization_grant.go	`59.45% <100.00%> (+24.32%)`	⬆️
x/authz/client/testutil/grpc.go	`98.26% <100.00%> (ø)`
x/authz/client/testutil/query.go	`100.00% <100.00%> (ø)`
x/authz/client/testutil/test_helpers.go	`100.00% <100.00%> (ø)`
... and 20 more

amaury1093

Reminds of me https://docs.substrate.io/v3/advanced/scale-codec/#compactgeneral-integers :) Too bad that one uses LE

amaury1093 · 2022-02-02T14:49:12Z

orm/encoding/ormfield/uint64.go

+	return c.FixedBufferSize(), nil
+}
+
+// EncodeCompactUint64 encodes uint64 values in 2,4,6 or 9 bytes.


As you wrote, compact encoding is more suitable for small numbers (e.g. auto-incrementing ids). In which case, we can maybe save the maximum space for smaller ints, namely, for the 2 MSB:

00 -> 1 byte, for n<64

01 -> 2 bytes, for n<16384

10 -> 3 bytes, for n<2^22

11 -> 9 bytes for uint64 or 5 bytes for uint32

We'll have a the best space saving for integers up until 2^22 (4million), then it'll be worst than fixed length. And I guess most ids will stay <4M.

I was thinking the 64 values in byte 1 is too small of an ID space to really worry about. But if we did take this approach I would push the third case to 4 or 5 bytes maybe

My thinking is that numbers <64 will still be the most used overall, so we should save max space on those.

I would push the third case to 4 or 5 bytes maybe

Could you explain why?

See this spreadsheet: https://docs.google.com/spreadsheets/d/1PkIzCVkFwTD-POyAGk7idpOuAbK20cS0aopNUL6SW1c/

When we have just a few values it's kilobytes of storage difference betwen 1 and 2 bytes.

In what cases do you think numbers <64 would be most used? In the SDK we are using uint64's for proposals and votes in the x/group module and those will likely number in the thousands and millions respectively over time.

The cases where I think small numbers will be used quite often are maybe the Market or Class id's in the Regen credits module. But still in that module, I expect things like credit batches and orders to number in the thousands if not millions.

Thanks for the estimations! I see your point of KBs of diff for small IDs vs GBs of diff for lots of them.

In what cases do you think numbers <64 would be most used? ... those will likely number in the thousands and millions respectively over time.

Actually million is the order of magnitude I had in mind too for IDs on chain, which is why I proposed 1,2,3,9 (optimized for <4M).

But the million-to-billion range is where 1,2,3,9 really underperforms over other ones. And I think we should maybe consider that range to be plausible too. I'm good with 2,4,6,9 👍, seems like the best compromise overall.

orm/encoding/ormfield/codec_test.go

aaronc · 2022-02-04T15:54:33Z

I actually think the values in this PR (2,4,6,9) scale the best for uint64, even though if there are not a lot of entities something like 1,2,3,9 is a bit better for that case. But it is megabytes of storage difference when we have a few entities vs gigabytes of difference when there are a lot. I think more than 1 billion entities is not unrealistic nowadays so that storage difference may someday be meaningful

I did a quick comparison of the different choices here: https://docs.google.com/spreadsheets/d/1PkIzCVkFwTD-POyAGk7idpOuAbK20cS0aopNUL6SW1c

…-varint

amaury1093

Approving, okay for 2,4,6,9 for uint64. Left 2 questions

amaury1093 · 2022-02-04T16:31:39Z

orm/encoding/ormfield/codec.go

-	case protoreflect.Uint64Kind, protoreflect.Fixed64Kind:
-		return Uint64Codec{}, nil
+	case protoreflect.Uint32Kind:
+		return CompactUint32Codec{}, nil


is it easy to make the encoding of N<2**32-1 numbers the same in both CompactUint{32,64}Codec? I feel like people will be confused if the same number is encoded in 2 different compact ways depending if it's uint32 or uint64.

if we keep 2,4,6,9 for uint64, then only 2,4 modes will be valid for CompactUint32Codec. It will be less space-performant for sure, but maybe worth the mental overhead?

hmm... these aren't really ever intended to be used by people. it's sort of like a database internal detail which you should only know if you're making a custom implementation. users should just need to know that the database made a performant and correct choice and have some guidance on when to use which data type to get which performance characteristics

I don't think it's only database-internal. Storage keys relate to proofs, so IBC, UI client libs, light clients etc. Those developers will be re-implementing this, at least once per language.

true... but not sure it's worth the compromise. it will mean large uin32's will need 6 bytes...

orm/encoding/ormfield/codec.go

aaronc · 2022-02-07T15:36:37Z

Going to go ahead and merge this. We can tweak parameters in a follow-up if we find better ones up until when this goes into production

aaronc added 3 commits February 1, 2022 15:21

WIP on compact ordered varints

7432f77

fixes and tests

495df9c

docs

1c32408

aaronc requested a review from alexanderbez as a code owner February 2, 2022 01:38

github-actions bot added the C:orm label Feb 2, 2022

aaronc mentioned this pull request Feb 2, 2022

ORM v1.0 Readiness Checklist #11088

Closed

29 tasks

aaronc requested review from technicallyty, amaury1093, fdymylja and robert-zaremba February 2, 2022 01:54

amaury1093 reviewed Feb 2, 2022

View reviewed changes

amaury1093 assigned amaury1093 and likhita-809 Feb 3, 2022

aaronc added 3 commits February 4, 2022 11:00

add manual test cases

a4a6535

Merge branch 'master' of github.com:cosmos/cosmos-sdk into aaronc/orm…

1800669

…-varint

update golden file

81f9c41

github-actions bot added C:CLI C:x/gov labels Feb 4, 2022

amaury1093 approved these changes Feb 4, 2022

View reviewed changes

fdymylja approved these changes Feb 7, 2022

View reviewed changes

aaronc and others added 2 commits February 7, 2022 10:32

Merge branch 'master' into aaronc/orm-varint

e0b816e

revert

a76995d

aaronc removed C:CLI C:x/gov labels Feb 7, 2022

aaronc added the A:automerge Automatically merge PR once all prerequisites pass. label Feb 7, 2022

Merge branch 'master' into aaronc/orm-varint

db3a82d

mergify bot merged commit 1944a08 into master Feb 7, 2022

mergify bot deleted the aaronc/orm-varint branch February 7, 2022 17:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(orm)!: ordered variable length encoding for uint32 and uint64 types #11090

feat(orm)!: ordered variable length encoding for uint32 and uint64 types #11090

aaronc commented Feb 2, 2022

codecov bot commented Feb 2, 2022 •

edited

Loading

amaury1093 left a comment

amaury1093 Feb 2, 2022 •

edited

Loading

aaronc Feb 2, 2022

amaury1093 Feb 3, 2022

aaronc Feb 4, 2022

amaury1093 Feb 4, 2022

aaronc commented Feb 4, 2022

amaury1093 left a comment

amaury1093 Feb 4, 2022

aaronc Feb 4, 2022

amaury1093 Feb 7, 2022

aaronc Feb 7, 2022

aaronc commented Feb 7, 2022

feat(orm)!: ordered variable length encoding for uint32 and uint64 types #11090

feat(orm)!: ordered variable length encoding for uint32 and uint64 types #11090

Conversation

aaronc commented Feb 2, 2022

Description

Author Checklist

Reviewers Checklist

codecov bot commented Feb 2, 2022 • edited Loading

Codecov Report

amaury1093 left a comment

Choose a reason for hiding this comment

amaury1093 Feb 2, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

aaronc commented Feb 4, 2022

amaury1093 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

aaronc commented Feb 7, 2022

codecov bot commented Feb 2, 2022 •

edited

Loading

amaury1093 Feb 2, 2022 •

edited

Loading