-
Notifications
You must be signed in to change notification settings - Fork 130
feat: write nested struct layouts #4942
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| vortex_array::compute::mask( | ||
| projection.as_ref(), | ||
| &Mask::from_buffer(!validity.to_bool().boolean_buffer()), | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is only safe when we know that the projection has the same length as the struct itself. If we're projecting an UNNEST(a) that will not hold. We don't have UNNEST but i suspect we'll want to add it so we can pushdown into list elements
robert3005
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this makes sense. Some of the code comments are not correct though
| chunks.push(chunk); | ||
| } | ||
|
|
||
| let collected = ChunkedArray::try_new(chunks, _dtype)?.to_array(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we canonicalize here instead of writing a flat chunked array?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In practice that ends up happening b/c the child here is the CompressStrategy, so it will canonicalize + compress internally.
In general I tried to mirror this on the RepartitionStrategy which yields ChunkedArray as well. I think the LayoutStrategies are a bit weird tbh and there are a few unexpected things like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought repartition strategy will have a chunked strategy as a child so it will flatten the chunked array, in this case I think we go straight to flat.
GithubArchive is an archival dataset of Github event logs, hosted at https://www.gharchive.org/. It's also distributed as part of RealNest from CWI. This is our first benchmark to test nested data access with Vortex. I want to get this in before we merge #4942 or any other follow on work for nested data support. TODOs: - [x] Add more data files. Currently queries finish way too fast - [x] Add a few more queries --------- Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Benchmarks: FineWeb S3Summary
Detailed Results Table
|
Benchmarks: FineWeb NVMeSummary
Detailed Results Table
|
Benchmarks: TPC-H SF=1 on S3Summary
Detailed Results Table
|
Benchmarks: TPC-H SF=10 on NVMESummary
Detailed Results Table
|
Benchmarks: Statistical and Population GeneticsSummary
Detailed Results Table
|
Benchmarks: TPC-H SF=10 on S3Summary
Detailed Results Table
|
Benchmarks: Clickbench on NVMESummary
Detailed Results Table
|
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
|
I suppose this is something that Operator work is supposed to help with? Though that really seems more focused on memoizing decodes in a single split task, rather than across split tasks? |
|
If we only get 8192 codes we should only decode those codes. I know duckdb have a decompression cache on the fsst array to avoid any double decompression which is similar to us caching canonicalisations. I would still try first to avoid canonicalising values unconditionally and performing the take first |
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
|
This is about values not codes though. The issue is that every chunk goes through ToArrow, which canonicalizes. The entire values array gets decompressed for each chunk in that case. As a test, when I switch the DictReader to instead eagerly canonicalize the values, Q3 goes from being 40% slower to nearly 2x faster than baseline. I'm going to revert that commit for now, but I think this is a problem that is independent of nested struct splitting and is something we'll have to handle more fundamentally with DictLayouts. |
This reverts commit 570d987. Signed-off-by: Andrew Duffy <andrew@a10y.dev>




Part of #4889
To better support writing nested data, here we update our builtin StructLayout in a backwards-compatible fashion.
StructStrategywill shred struct fields into their own StructLayout recurseivelyI use the
RealNestdataset to evaluate, which contains a copy of ~200k github pull request webhook events. Nested struct layout reduces file size over the previous strategy by about ~10%, and also makes pushdown into the nested columns possible.Some open questions
UNNESTor something else that increases the result size, it is hard to map the validity buffer onto the projection_eval result