Efficiency Heuristics #6630
gatesn started this conversation in Feature Requests · Replies: 0 comments
Vortex arrays support fully lazy compute, even with lazy slicing, filtering and taking.
This is nice in that it allows us to orchestrate decompression plans on the CPU, without ever touching the underlying buffers, enabling us to plumb these arrays through to the GPU for execution. However, it also means that we can end up with very "inefficient" array representations, for example a length=1_000_000 array sliced down to 10 elements.
This is similar(-ish) to the problem of deciding when a query engine should re-partition batches after a sparse filter operator.
I'd like to propose adding two heuristics to arrays: storage efficiency and compute efficiency. Applications can consult these heuristics at various points in their pipeline.
For example, if I'm about to send an array over the wire, I probably want to ensure I'm not serializing 1m rows to logically send 1k rows (storage efficiency == 1000.0). But maybe I'm OK with ~3x storage efficiency if it means I can avoid full decompression costs.
If I'm about to cache an array to be re-used over and over again, I may want to execute it until its compute efficiency is <= 1.0. That is, I do not want to decompress more rows than I get back. So either make my array canonical (compute efficiency == 1.0), or keep more efficient arrays such as Dictionary, where efficiency <1.0.
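To make the two decision points above concrete, here is a minimal sketch. Note that `ArrayStats`, `should_repack_before_send`, and `should_execute_before_cache` are hypothetical names invented for illustration, not part of any real Vortex API; the thresholds are the ones suggested above.

```python
# Hypothetical sketch: using the two proposed heuristics as decision
# thresholds. None of these names exist in Vortex; they are assumptions.
from dataclasses import dataclass


@dataclass
class ArrayStats:
    storage_efficiency: float  # bytes held in memory / bytes logically represented
    compute_efficiency: float  # rows decompressed / rows returned


def should_repack_before_send(stats: ArrayStats, max_overhead: float = 3.0) -> bool:
    """Repack (e.g. canonicalize the slice) before serializing if we would
    ship far more bytes than we logically send."""
    return stats.storage_efficiency > max_overhead


def should_execute_before_cache(stats: ArrayStats) -> bool:
    """Execute deferred compute before caching unless reads are already no
    more expensive than reads of a canonical array."""
    return stats.compute_efficiency > 1.0


# A len=1_000_000 array sliced to 10 rows: heavily over-stored, so both
# checks fire.
sliced = ArrayStats(storage_efficiency=100_000.0, compute_efficiency=100_000.0)
assert should_repack_before_send(sliced)
assert should_execute_before_cache(sliced)

# A dictionary array: cheaper than canonical on both axes, so keep it as-is.
dictionary = ArrayStats(storage_efficiency=0.2, compute_efficiency=0.5)
assert not should_repack_before_send(dictionary)
assert not should_execute_before_cache(dictionary)
```

The point of the sketch is that the application, not the array library, picks the thresholds: a network boundary tolerates some storage overhead, while a cache wants reads no worse than canonical.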
These are crude measures, but they get us a bit closer to allowing users to declare when and why arrays should be decompressed now that all compute is lazy. In the previous world (before the operators change), compute was semi-lazy, but it wasn't obvious when compute would be run, or when it would be deferred. Now that it's always deferred, we need a way to express when it should in fact be run.
Storage Efficiency
This is a measure roughly describing how much data is held in-memory vs how much data is logically represented.
- A canonical primitive array has efficiency == 1.0.
- A dictionary array can be very efficient, <<1.0.
- A Zstd array may also be very efficient, e.g. ~0.3 with a 30% compression ratio.
- A sliced Zstd array (since Zstd cannot push down slicing, the full compressed buffer is retained) will have efficiency >1.0 if, at a 30% compression ratio, we slice down to fewer than 30% of the rows.
I don't know quite how to measure this, or combine efficiencies of array children.
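One candidate measure, sketched below purely as an assumption: storage efficiency as total in-memory bytes (summed across all child buffers) divided by the bytes a canonical encoding of the logical array would occupy. All row widths and compression ratios here are illustrative, not measured.

```python
# One possible measure (an assumption, not the Vortex definition):
# storage_efficiency = in-memory bytes / canonical bytes of the logical array.
def storage_efficiency(in_memory_bytes: int, logical_rows: int, row_width: int) -> float:
    canonical_bytes = logical_rows * row_width
    return in_memory_bytes / canonical_bytes


ROW = 8  # canonical u64, for illustration

# Canonical primitive array: 1.0 by definition.
assert storage_efficiency(1_000 * ROW, 1_000, ROW) == 1.0

# Zstd at a 30% compression ratio: ~0.3.
assert storage_efficiency(int(1_000 * ROW * 0.3), 1_000, ROW) == 0.3

# The same Zstd buffer sliced to 100 of 1_000 rows: the full compressed
# buffer is still held, so efficiency = 0.3 * (1_000 / 100) = 3.0.
assert storage_efficiency(int(1_000 * ROW * 0.3), 100, ROW) == 3.0

# Combining children: a dictionary array's footprint is the sum of its
# codes and values buffers, e.g. 1_000 u8 codes over 10 u64 values.
dict_bytes = 1_000 * 1 + 10 * ROW
assert storage_efficiency(dict_bytes, 1_000, ROW) == 0.135
```

Summing child buffer sizes at least gives a composable rule for nested encodings, though it leaves open how to attribute shared children.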
Compute Efficiency
This is a measure of how much data needs to be decompressed in order to access the logical result.
- A canonical primitive array has efficiency == 1.0 (decompress 100 rows to get back 100 rows).
- A dictionary array can have efficiency <1.0 (if we somehow adjust for the size of the dictionary values vs. the codes?).
- A Zstd array with length=100 would have efficiency == 1.0 (decompress 100 rows to get back 100 rows).
- A Zstd array (len=100) sliced to len=30 would have efficiency ≈ 3.3 (decompress 100 rows to get back 30 rows).
Again, I don't quite know how to measure this or combine efficiencies of children.
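As with storage efficiency, here is one candidate measure, offered only as an assumption: the number of physical rows that must be decoded to materialize the result, divided by the number of logical rows returned. The dictionary case shows one way (of several) to get a value below 1.0.

```python
# One possible measure (an assumption, not the Vortex definition):
# compute_efficiency = rows that must be decoded / rows returned.
def compute_efficiency(rows_decoded: int, rows_returned: int) -> float:
    return rows_decoded / rows_returned


# Canonical primitive array: decode 100 rows to return 100 rows.
assert compute_efficiency(100, 100) == 1.0

# Zstd array (len=100) sliced to 30 rows: the whole block still decodes.
assert round(compute_efficiency(100, 30), 2) == 3.33

# Dictionary array: one way to land below 1.0 is to count unique values
# decoded relative to logical rows returned (10 values fan out to 1_000 rows).
assert compute_efficiency(10, 1_000) == 0.01
```

A row-count ratio ignores per-row width, so a real measure might want to weight by bytes rather than rows; the sketch only illustrates the shape of the heuristic.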