Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(orm)!: ordered variable length encoding for uint32 and uint64 types #11090

Merged
merged 9 commits into from
Feb 7, 2022

Conversation

aaronc
Copy link
Member

@aaronc aaronc commented Feb 2, 2022

Description

uint64 values are used in the ORM as auto-incrementing primary keys. Always using 8 bytes for these values is a bit of a waste of space. Unfortunately, varint encoding does not support ordered prefix iteration.

This PR introduces a compact, well-ordered variable length encoding for uint32 and uint64 types. fixed32 and fixed64 integers are still encoded as 4 and 8 byte fixed-length big-endian arrays. With this, users have a choice of encoding based on what type of data they are storing. An auto-incrementing primary key should prefer the variable length uint64 whereas a fixed precision decimal might want to use fixed64.

See the golden test updates to see how this reduces key lengths.

This encoding works by using the first two bits to encode the buffer length (4 possible lengths). I'm not sure if my choice of 2,4,6 and 9 bytes is the right choice of 4 lenths for uint64 - there are many alternate choices. I could have also chosen 3 bits and allowed for 8 possible lengths, but way waste an extra bit? Input on the right design parameters would be appreciated.


Author Checklist

All items are required. Please add a note to the item if the item is not applicable and
please add links to any relevant follow up issues.

I have...

  • included the correct type prefix in the PR title
  • added ! to the type prefix if API or client breaking change
  • targeted the correct branch (see PR Targeting)
  • provided a link to the relevant issue or specification
  • followed the guidelines for building modules
  • included the necessary unit and integration tests
  • added a changelog entry to CHANGELOG.md
  • included comments for documenting Go code
  • updated the relevant documentation or specification
  • reviewed "Files changed" and left comments if necessary
  • confirmed all CI checks have passed

Reviewers Checklist

All items are required. Please add a note if the item is not applicable and please add
your handle next to the items reviewed if you only reviewed selected items.

I have...

  • confirmed the correct type prefix in the PR title
  • confirmed ! in the type prefix if API or client breaking change
  • confirmed all author checklist items have been addressed
  • reviewed state machine logic
  • reviewed API design and naming
  • reviewed documentation is accurate
  • reviewed tests and test coverage
  • manually tested (if applicable)

@github-actions github-actions bot added the C:orm label Feb 2, 2022
@aaronc aaronc mentioned this pull request Feb 2, 2022
29 tasks
@codecov
Copy link

codecov bot commented Feb 2, 2022

Codecov Report

Merging #11090 (db3a82d) into master (8110576) will decrease coverage by 0.17%.
The diff coverage is 75.40%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #11090      +/-   ##
==========================================
- Coverage   65.96%   65.79%   -0.18%     
==========================================
  Files         698      684      -14     
  Lines       70810    69487    -1323     
==========================================
- Hits        46709    45718     -991     
+ Misses      21235    20973     -262     
+ Partials     2866     2796      -70     
Impacted Files Coverage Δ
x/authz/keeper/msg_server.go 0.00% <ø> (ø)
x/authz/keeper/keeper.go 67.25% <33.33%> (-0.21%) ⬇️
x/authz/simulation/operations.go 67.29% <64.28%> (-5.17%) ⬇️
orm/encoding/ormfield/uint32.go 73.77% <71.92%> (-26.23%) ⬇️
orm/encoding/ormfield/uint64.go 77.93% <74.80%> (-22.07%) ⬇️
orm/encoding/ormfield/codec.go 96.00% <100.00%> (+0.34%) ⬆️
x/authz/authorization_grant.go 59.45% <100.00%> (+24.32%) ⬆️
x/authz/client/testutil/grpc.go 98.26% <100.00%> (ø)
x/authz/client/testutil/query.go 100.00% <100.00%> (ø)
x/authz/client/testutil/test_helpers.go 100.00% <100.00%> (ø)
... and 20 more

Copy link
Contributor

@amaury1093 amaury1093 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return c.FixedBufferSize(), nil
}

// EncodeCompactUint64 encodes uint64 values in 2,4,6 or 9 bytes.
Copy link
Contributor

@amaury1093 amaury1093 Feb 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you wrote, compact encoding is more suitable for small numbers (e.g. auto-incrementing ids). In which case, we can maybe save the maximum space for smaller ints, namely, for the 2 MSB:

  • 00 -> 1 byte, for n<64
  • 01 -> 2 bytes, for n<16384
  • 10 -> 3 bytes, for n<2^22
  • 11 -> 9 bytes for uint64 or 5 bytes for uint32

We'll have a the best space saving for integers up until 2^22 (4million), then it'll be worst than fixed length. And I guess most ids will stay <4M.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking the 64 values in byte 1 is too small of an ID space to really worry about. But if we did take this approach I would push the third case to 4 or 5 bytes maybe

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thinking is that numbers <64 will still be the most used overall, so we should save max space on those.

I would push the third case to 4 or 5 bytes maybe

Could you explain why?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See this spreadsheet: https://docs.google.com/spreadsheets/d/1PkIzCVkFwTD-POyAGk7idpOuAbK20cS0aopNUL6SW1c/

When we have just a few values it's kilobytes of storage difference betwen 1 and 2 bytes.

In what cases do you think numbers <64 would be most used? In the SDK we are using uint64's for proposals and votes in the x/group module and those will likely number in the thousands and millions respectively over time.

The cases where I think small numbers will be used quite often are maybe the Market or Class id's in the Regen credits module. But still in that module, I expect things like credit batches and orders to number in the thousands if not millions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the estimations! I see your point of KBs of diff for small IDs vs GBs of diff for lots of them.

In what cases do you think numbers <64 would be most used? ... those will likely number in the thousands and millions respectively over time.

Actually million is the order of magnitude I had in mind too for IDs on chain, which is why I proposed 1,2,3,9 (optimized for <4M).

But the million-to-billion range is where 1,2,3,9 really underperforms over other ones. And I think we should maybe consider that range to be plausible too. I'm good with 2,4,6,9 👍, seems like the best compromise overall.

orm/encoding/ormfield/codec_test.go Show resolved Hide resolved
@aaronc
Copy link
Member Author

aaronc commented Feb 4, 2022

I actually think the values in this PR (2,4,6,9) scale the best for uint64, even though if there are not a lot of entities something like 1,2,3,9 is a bit better for that case. But it is megabytes of storage difference when we have a few entities vs gigabytes of difference when there are a lot. I think more than 1 billion entities is not unrealistic nowadays so that storage difference may someday be meaningful

I did a quick comparison of the different choices here: https://docs.google.com/spreadsheets/d/1PkIzCVkFwTD-POyAGk7idpOuAbK20cS0aopNUL6SW1c

Copy link
Contributor

@amaury1093 amaury1093 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving, okay for 2,4,6,9 for uint64. Left 2 questions

case protoreflect.Uint64Kind, protoreflect.Fixed64Kind:
return Uint64Codec{}, nil
case protoreflect.Uint32Kind:
return CompactUint32Codec{}, nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it easy to make the encoding of N<2**32-1 numbers the same in both CompactUint{32,64}Codec? I feel like people will be confused if the same number is encoded in 2 different compact ways depending if it's uint32 or uint64.

if we keep 2,4,6,9 for uint64, then only 2,4 modes will be valid for CompactUint32Codec. It will be less space-performant for sure, but maybe worth the mental overhead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm... these aren't really ever intended to be used by people. it's sort of like a database internal detail which you should only know if you're making a custom implementation. users should just need to know that the database made a performant and correct choice and have some guidance on when to use which data type to get which performance characteristics

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's only database-internal. Storage keys relate to proofs, so IBC, UI client libs, light clients etc. Those developers will be re-implementing this, at least once per language.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true... but not sure it's worth the compromise. it will mean large uin32's will need 6 bytes...

orm/encoding/ormfield/codec.go Show resolved Hide resolved
@aaronc
Copy link
Member Author

aaronc commented Feb 7, 2022

Going to go ahead and merge this. We can tweak parameters in a follow-up if we find better ones up until when this goes into production

@aaronc aaronc added the A:automerge Automatically merge PR once all prerequisites pass. label Feb 7, 2022
@mergify mergify bot merged commit 1944a08 into master Feb 7, 2022
@mergify mergify bot deleted the aaronc/orm-varint branch February 7, 2022 17:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A:automerge Automatically merge PR once all prerequisites pass. C:orm
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants