feat: ArrayOperations infallible, eager validation + new_unchecked #4177
Conversation
The only thing to note is where we might perform the validation. Can someone craft a bad file, send it to a server running vortex, and crash it?
I guess this is really two questions: (1) can a crafted bad file crash a server running vortex, and (2) does this mean we're doing less validation than before?

For (1), this will panic instead of returning an error, but most server frameworks already catch_unwind to prevent panics inside handlers from crashing the whole server, so I'm not too concerned about this. FWIW, Arrow does this as well.

So then the question is (2): does doing this mean we're doing less validation than before? I think the answer is no. We currently trust the buffers coming back out of a file, just like our in-memory Array constructors only do constant-time validation of their inputs. For example, if someone tries to build a VarBinViewArray of type Utf8, we never validate that the bytes actually hold UTF-8. This, I think, is probably bad, and we should probably be eagerly checking preconditions, and then have a set of unchecked constructors to call internally when we're doing operations on trusted data where we know the operation maintains the invariant.

I don't think returning Result is actually doing anything currently to prevent writing corrupted files, though, so I see it as orthogonal.
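To make the checked/unchecked constructor split concrete, here is a minimal sketch in Rust. The `Utf8Array` type, its fields, and the error variants are hypothetical, invented only for illustration; this is not Vortex's actual VarBinViewArray API.

```rust
use std::str;

/// Hypothetical string array used only to illustrate the pattern; not Vortex's API.
pub struct Utf8Array {
    bytes: Vec<u8>,
    offsets: Vec<usize>,
}

#[derive(Debug)]
pub enum ValidationError {
    NonMonotonicOffsets,
    OffsetOutOfBounds,
    InvalidUtf8,
}

impl Utf8Array {
    /// Checked constructor: eagerly validates preconditions so that a
    /// successfully built array is known to uphold its invariants.
    pub fn try_new(bytes: Vec<u8>, offsets: Vec<usize>) -> Result<Self, ValidationError> {
        for pair in offsets.windows(2) {
            let (start, end) = (pair[0], pair[1]);
            if start > end {
                return Err(ValidationError::NonMonotonicOffsets);
            }
            if end > bytes.len() {
                return Err(ValidationError::OffsetOutOfBounds);
            }
            // Every value's bytes must actually be UTF-8.
            if str::from_utf8(&bytes[start..end]).is_err() {
                return Err(ValidationError::InvalidUtf8);
            }
        }
        Ok(Self { bytes, offsets })
    }

    /// Unchecked constructor for trusted internal call sites, e.g. the result
    /// of an operation that is known to preserve the invariants.
    ///
    /// # Safety
    /// The caller must guarantee monotonic, in-bounds offsets and valid UTF-8.
    pub unsafe fn new_unchecked(bytes: Vec<u8>, offsets: Vec<usize>) -> Self {
        Self { bytes, offsets }
    }
}
```

Untrusted inputs (e.g. buffers read from a file) would go through `try_new` once, while internal operations that provably preserve the invariants could use `new_unchecked` and skip the linear scan.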
Force-pushed 4fcdc72 to 3ac4ea0
Force-pushed 3ac4ea0 to e64694c
Force-pushed dab0372 to d774e31
Pushed a commit to fix the last doc failure
Force-pushed 3315924 to 6fca302
@gatesn I think you're right and we probably need better validation in our
Even before this change, there are a ton of places where we check these things lazily and would panic rather than return an error. I think it'd be worth a modest penalty to check some of these things eagerly. The alternative is that you can potentially read and propagate corrupt data and won't know unless you run a compute function that triggers it somewhere. That does bring up the question of how to handle unparseable things/messages that your client doesn't know about but wants to propagate unmodified...
Force-pushed c148959 to be9bafd
@gatesn I'm working over in #4216 to improve validation. I think the ideal guarantee to provide clients is
But things get a bit weird when we factor extensions into the mix, because in theory someone can add an encoding that doesn't semantically meet the full contract, but compiles fine and may cause a panic at runtime. For example with ListArray, I'd like to verify that offsets children are non-negative and monotonic, and don't point past the end of
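As an illustration of the offsets checks described above, here is a minimal sketch of validating a list array's offsets child: non-negative, monotonically non-decreasing, and never pointing past the end of the elements child. The function and its signature are hypothetical, not Vortex's real ListArray validation.

```rust
/// Validates a list array's offsets buffer against the length of its elements
/// child. Hypothetical helper for illustration only.
fn validate_list_offsets(offsets: &[i64], elements_len: usize) -> Result<(), String> {
    let mut prev = 0i64;
    for (i, &off) in offsets.iter().enumerate() {
        if off < 0 {
            return Err(format!("offset {i} is negative ({off})"));
        }
        if off < prev {
            return Err(format!("offset {i} is not monotonic ({off} < {prev})"));
        }
        if off as usize > elements_len {
            return Err(format!(
                "offset {i} points past the elements child ({off} > {elements_len})"
            ));
        }
        prev = off;
    }
    Ok(())
}
```

Note that this is a full O(n) pass over the offsets buffer, which is exactly the cost concern raised in the next comment.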
Force-pushed be9bafd to d3b7e7d
If you're loading extensions from native-code shared libraries... I think all bets are off anyway!

The bigger concern I have is that I simply don't understand how you can perform validation during deserialization without effectively decompressing everything. How can you demonstrate that offsets are unique and monotonic? Even asking for min(X) defers to stats, which we load from the file and never verify. Asserting monotonicity is essentially iterating over the array.

I think you're holding the validation to be a little too strong. Why does monotonicity matter? We care about it to perform e.g. search_sorted, but if it's not sorted, then the results are garbage. Maybe that's fine? It's not a panic, it's just nonsense.

Basically, what are the terms of this validation contract? Is it just panic prevention?
I think the terms of the validation contract I'm proposing are: when I build an XYZArray, either from components or from deserialization, and it succeeds, the thing I am holding is a valid XYZEncoding and satisfies its constraints. I have two concerns with the status quo

This is hard, because I agree with you that in the limit you can't really do better than decoding to a primitive array, or iterating the VarBinView views buffer. But FWIW, this is exactly what our ToArrowCanonical kernel already does: we use the arrow-rs safe constructors, which internally validate all offsets buffers, BinaryView pointers, UTF-8 codepoints, etc. But we don't do any of this when scanning into Vortex.

I want to run full SQL benchmarks, but what I'm actually wondering is: if we interleave the IO and CPU operations properly, can we do this without destroying scan performance? It'd impose a cost on time-to-first-batch, but if you can amortize it with the IO then maybe it doesn't actually impose a big penalty?
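To show what per-view validation involves, here is a minimal sketch under the assumption of a simplified 16-byte view layout (the real BinaryView format also inlines short strings, which is omitted here). It checks that every view references an in-bounds range of its data buffer and that the referenced bytes are valid UTF-8, in the spirit of what safe constructors do; the `View` struct and function name are hypothetical, not arrow-rs or Vortex APIs.

```rust
/// Simplified stand-in for a BinaryView-style view; the inline-string case
/// (lengths <= 12 bytes) is omitted for brevity.
struct View {
    len: u32,
    buffer_index: u32,
    offset: u32,
}

/// Checks that every view points at an in-bounds slice of its buffer and that
/// the slice is valid UTF-8. Cost is proportional to the total referenced
/// bytes, which is why the question above is whether the work can be
/// overlapped with IO rather than avoided.
fn validate_utf8_views(views: &[View], buffers: &[Vec<u8>]) -> Result<(), String> {
    for (i, v) in views.iter().enumerate() {
        let buf = buffers
            .get(v.buffer_index as usize)
            .ok_or_else(|| format!("view {i}: buffer index {} out of range", v.buffer_index))?;
        let start = v.offset as usize;
        let end = start + v.len as usize;
        let bytes = buf
            .get(start..end)
            .ok_or_else(|| format!("view {i}: byte range {start}..{end} out of bounds"))?;
        std::str::from_utf8(bytes).map_err(|e| format!("view {i}: invalid UTF-8: {e}"))?;
    }
    Ok(())
}
```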
Force-pushed d3b7e7d to df58778
Force-pushed df58778 to 74cbbea
Benchmarks: TPC-H SF=1 on NVMe (Table of Results)
Benchmarks: TPC-H SF=1 on S3 (Table of Results)
CodSpeed Performance Report: merging #4177 will improve performance by 37.72% (Summary and Benchmarks breakdown)
Benchmarks: ClickBench on NVMe (Table of Results)
Benchmarks: TPC-DS SF=1 on NVMe (Table of Results)
ArrayOperations currently return VortexResult<>, but they really should just be infallible. A failed array op is generally indicative of a programmer or encoding error; there's really nothing interesting we can do to handle an out-of-bounds slice() or scalar_at().

There's a lot that falls out of this, like fixing a bunch of tests, tweaking our scalar value casting to return Option instead of Result, etc.

Signed-off-by: Andrew Duffy <andrew@a10y.dev>
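A minimal sketch of the API shape described above, with illustrative names rather than the actual ArrayOperations trait: operations are infallible, and an out-of-bounds slice is treated as a bug at the call site rather than a recoverable error.

```rust
/// Illustrative trait, not the real ArrayOperations definition.
trait ArrayOps: Sized {
    fn len(&self) -> usize;

    /// Infallible: panics if the bounds are invalid, because an out-of-bounds
    /// slice indicates a programmer or encoding bug, not a condition the
    /// caller can meaningfully handle.
    fn slice(&self, start: usize, stop: usize) -> Self;
}

struct PrimitiveArray {
    values: Vec<i64>,
}

impl ArrayOps for PrimitiveArray {
    fn len(&self) -> usize {
        self.values.len()
    }

    fn slice(&self, start: usize, stop: usize) -> Self {
        assert!(
            start <= stop && stop <= self.len(),
            "slice {start}..{stop} out of bounds for array of length {}",
            self.len()
        );
        Self {
            values: self.values[start..stop].to_vec(),
        }
    }
}
```

Validation then happens once, at construction or deserialization time, instead of being re-checked and surfaced as a Result on every operation.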
Force-pushed 74cbbea to c0b668f
FLUP to #4177

This removes **all** possible sources of panic inside the deserialization pathway, returning errors instead. This also adds more explicit and complete validation of all encodings within their `try_new` operation. Currently, we do "best effort" on these validations, including things like:

- Checking Validity is the appropriate length for the array
- Checking min/max values on offsets
- UTF-8 validation on VarBinView and validating that all views reference valid ranges of the buffers

A lot of this was new validation; some of it was moving validation from just SerdeVTable into the array type's `try_new` constructor. Tried to add sufficient doc comments.

Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Force-pushed 5f6d775 to cdc94ae
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Force-pushed b42e575 to 590c65d