Remove dictionary type from Arrow logical type #4729
I'm really not sure about this for a couple of reasons:
Perhaps you could expand upon what it is you are trying to achieve here
Thanks @tustvold for the quick reply! I'm mainly throwing out ideas for discussion here, since we've been struggling with dictionary types for a while now in our internal integration with Arrow & DF.
Hmm, why would it add API complexity? I'm thinking
I'm hoping the compiler is smart enough to optimize this away. I briefly tried some similar ideas here in Godbolt and it looks like LLVM can indeed do that. I also ran some benchmarks with basic
Yes, this is true, but I think it can be avoided, perhaps with some trait-level default methods.
TBH I don't have good ideas on this without breaking type safety. One thing we can do is to follow Arrow C++ and let the compiler optimize away the branching.
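For concreteness, here is a minimal Rust sketch of the kind of unified array being discussed; the names (`Int32Array`, `Encoding`) are hypothetical stand-ins, not arrow-rs types. The encoding branch inside `value` is on loop-invariant data, which is what would give LLVM a chance to unswitch it out of hot loops, as the Godbolt experiment suggests:

```rust
// Hypothetical sketch (not arrow-rs API): a unified Int32Array whose
// encoding is an internal detail.
enum Encoding {
    /// Values stored directly in a contiguous buffer.
    Plain(Vec<i32>),
    /// Dictionary encoding: `keys[i]` indexes into `values`.
    Dict { keys: Vec<u32>, values: Vec<i32> },
}

struct Int32Array {
    encoding: Encoding,
}

impl Int32Array {
    fn len(&self) -> usize {
        match &self.encoding {
            Encoding::Plain(v) => v.len(),
            Encoding::Dict { keys, .. } => keys.len(),
        }
    }

    /// The match here is on loop-invariant data, so in a loop like `sum`
    /// below LLVM can (in principle) split it into two branch-free loops,
    /// one per encoding.
    fn value(&self, i: usize) -> i32 {
        match &self.encoding {
            Encoding::Plain(v) => v[i],
            Encoding::Dict { keys, values } => values[keys[i] as usize],
        }
    }
}

fn sum(a: &Int32Array) -> i64 {
    (0..a.len()).map(|i| a.value(i) as i64).sum()
}

fn main() {
    let plain = Int32Array { encoding: Encoding::Plain(vec![1, 2, 3]) };
    let dict = Int32Array {
        encoding: Encoding::Dict { keys: vec![0, 0, 1], values: vec![10, 20] },
    };
    assert_eq!(sum(&plain), 6);
    assert_eq!(sum(&dict), 40);
}
```

Whether LLVM actually performs this loop unswitching for real kernels is exactly the point of contention below.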
Right, but I don't think this can be used as a strong reason not to do it. The idea is partially inspired by Velox, which is compatible with Arrow.
Yes, in certain cases it will be more beneficial to unpack primitive dictionary arrays first, as you've shown in #4407 :). This is orthogonal to this issue though, since the caller can decide whether to encode dictionary primitives or not. I think primitive dictionaries may still be more beneficial in some other cases (e.g., filter)
That is true. In this specific case I'm talking about DataFusion though, which uses DataType as its logical type, I think?
TLDR I would strongly resist changes to this design without extremely compelling empirical data to support it. Supporting dictionaries in the limited fashion we do has already caused more pain and misery than any other single piece of functionality, and I'm frankly somewhat terrified of broadening the scope of dictionary support further, especially in a manner that would implicate downstreams that may not even care about dictionaries...
Have you tried just not using dictionaries? I'm being somewhat glib, but do you have empirical data to back up their usage? The reason I ask is this is now the direction that we're taking with IOx. We've found that even for strings, unless the data is extremely low cardinality, the overheads of handling dictionaries massively outweigh any gains (by orders of magnitude). Once StringView lands we intend to replace almost all of our use of string dictionaries with them.
Because currently if I have a PrimitiveArray, I know it has a contiguous data buffer, with efficient per-element access, defined values for null slots, etc... All of this is part of the current API contract. If instead the encoding is indeterminate, how to efficiently process the data also becomes indeterminate 😅

On a more holistic level, pushing dictionaries into PrimitiveArray and friends would implicitly force all kernels to support dictionaries. This would either need custom logic, something we have found to be extremely painful to maintain and result in huge amounts of codegen, or would silently materialize the dictionary, potentially blowing the memory budget and having poor performance. The current approach, where type coercion machinery explicitly materializes dictionaries when necessary, with this explicitly called out in the plan and optimized to happen only once, seems a better approach imo...
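The contract being described can be made concrete with a small sketch (the types are illustrative, not the actual arrow-rs definitions): a kernel written against a plain primitive array can treat the data as one contiguous slice, with no encoding dispatch anywhere, which is what makes it trivially auto-vectorisable.

```rust
// Illustrative stand-in for a primitive array with a contiguous buffer.
struct PrimitiveArray {
    values: Vec<i64>, // contiguous, with defined values even for null slots
    // null buffer elided for brevity
}

/// `add_scalar` never needs to ask "how is this encoded?": the contiguous
/// buffer *is* the API contract, and the loop body is branch-free.
fn add_scalar(a: &PrimitiveArray, rhs: i64) -> PrimitiveArray {
    PrimitiveArray {
        values: a.values.iter().map(|v| v + rhs).collect(),
    }
}

fn main() {
    let a = PrimitiveArray { values: vec![1, 2, 3] };
    let b = add_scalar(&a, 10);
    assert_eq!(b.values, vec![11, 12, 13]);
}
```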
Why would filtering a primitive dictionary be faster than filtering primitives, they're the same thing?
This has not been my experience; LLVM is extremely sensitive to any branches within the loops of kernels. Even the branching of
You need to specify
I would be very surprised if that is getting optimised correctly, I also can't seem to find any uses of it within the C++ kernels - https://github.com/search?q=repo%3Aapache%2Farrow%20GetValueIndex&type=code.
Perhaps not a strong reason, but diverging from standard practice normally warrants at least some additional care 😄
DataFusion doesn't really have an explicit notion of logical types, but the various frontends such as SQL do have an implicit notion based around SQL datatypes.

Edit: you may notice I have a bit of a chip on my shoulder when it comes to dictionaries, for which I apologise, I've just found them to be a perennial source of frustration...
Understood. Like I said, I mainly want to invite a discussion on this topic and see if there is anything we can do to improve dictionary support in Arrow :)
To better integrate with DF, we are about to do early unpacking of dictionary arrays where the value type is a primitive type, and see how it goes. We still plan to keep the dictionary arrays with string & binary types as they are, though. I'm surprised to hear that even string dictionaries do not perform well in your case. I can collect more data points on this.
Assume we want to treat dictionary arrays as first-class citizens. In the current model, some kernels may support them while others don't, which causes the inconsistency that we've often seen in the past. Like you said, the new model forces all the kernels to support dictionaries, but wouldn't this be better than having the inconsistencies? It may not necessarily mean more complexity, since some kernels like filter can directly operate on the unified API without needing to be aware of the underlying encoding.
Because the filter predicate only needs to be evaluated on the dictionary values, which could have much lower cardinality, especially if the predicate is a complex one like a UDF.
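The cardinality argument can be sketched as follows (hypothetical types, not arrow-rs): the predicate runs once per distinct dictionary value, and the per-row boolean mask is then gathered through the keys, so with N rows and K distinct values there are K predicate calls instead of N.

```rust
// Illustrative dictionary array: one key per row, far fewer distinct values.
struct DictArray {
    keys: Vec<usize>,
    values: Vec<String>,
}

/// Evaluate `pred` once per distinct value, then fan the results out to
/// the rows via the keys.
fn filter_mask<F: Fn(&str) -> bool>(arr: &DictArray, pred: F) -> Vec<bool> {
    let per_value: Vec<bool> = arr.values.iter().map(|v| pred(v)).collect();
    arr.keys.iter().map(|&k| per_value[k]).collect()
}

fn main() {
    let arr = DictArray {
        keys: vec![0, 1, 0, 2, 1],
        values: vec!["apple".into(), "banana".into(), "cherry".into()],
    };
    // stand-in for an "expensive" predicate: starts with a vowel
    let mask = filter_mask(&arr, |s| {
        matches!(s.chars().next(), Some('a' | 'e' | 'i' | 'o' | 'u'))
    });
    assert_eq!(mask, vec![true, false, true, false, false]);
}
```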
Hmm I tried it except instead of
Yea, it would be too bad if it isn't optimized 😂. This function is used in
Totally understand. We too share the frustration on this topic.
I think kernels materializing dictionaries implicitly is a worse UX than DF explicitly doing this with type coercion. In both cases the performance will be poor; at least currently it is explicit and DF can avoid doing it more than once. Broadly speaking, I think all the kernels where it makes sense to accommodate dictionaries now support dictionaries in some form?
Aah sorry I thought you meant filter in arrow parlance, i.e. a selection kernel. Yes there are broadly speaking two types of kernels where dictionaries work well:
However, this requires explicit handling of the "dictionary" case. The proposed new model, as far as I understand it, would not achieve this, and would be no better than DF coercing both inputs to non-dictionary types?
Oh it is getting triggered, it is just generating very sub-optimal code 😄 There's a veritable wall of memory shuffle operations; I honestly have a hard time following what LLVM is doing...
I would not expect array kernels to be calling such a method, rather I'd expect them to be vectorised and operating on the underlying buffers, I could be wrong though...
I have thoughts about this topic which I will write up coherently tomorrow.
We ran into some issues with dictionary arrays of primitive value types when migrating to DF's new aggregation implementation. Unpacking the dictionaries early helped us bypass the error. We are still in the process of migrating, so we'll see 😂
I think this part can be hidden from the kernels. Perhaps we can implement a

Slightly off topic :) I'm hoping that we can also define some function API that makes it easier for people to implement kernels and UDFs. Something like:

```rust
/// A function that only takes a single argument
pub trait Function<T: TypeTrait, R: TypeTrait> {
    fn call(&self, arg: &T::Native, result: &mut R::Native) -> bool;

    fn call_batch(&self, arg: &PlainVector, result: &mut MutableVector) {
        // default implementation here
    }
}
```

Unifying the dictionary array may help a bit towards that direction.
Yea, I have no idea why LLVM does that. But it generates the same code even without the
You will probably need to also pull out the null check, i.e. remove all branches from the loop, then you'll get something sensible
I like this idea, I think @alamb also has some thoughts on this. This might be a less intrusive way to get better coverage of dictionaries. Ideally this might integrate with Datum, but I've so far found handling this well to vary depending on the kernel. Certainly for unary kernels something based off AnyDictionaryArray should work well - #4707
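A rough sketch of what such a unary kernel could look like under this approach (all types here are invented stand-ins for illustration; #4707 tracks the real AnyDictionaryArray work): the operation is applied to the dictionary's values and the keys are reused unchanged, so the work is proportional to the number of distinct values rather than the number of rows.

```rust
// Stand-in for a dictionary array; not the arrow-rs definition.
#[derive(Debug, PartialEq)]
struct Dict {
    keys: Vec<u32>,
    values: Vec<i32>,
}

/// Unary kernel over a dictionary: transform the values, pass the keys
/// through untouched. `op` runs once per distinct value.
fn unary_dict<F: Fn(i32) -> i32>(d: &Dict, op: F) -> Dict {
    Dict {
        keys: d.keys.clone(),
        values: d.values.iter().map(|&v| op(v)).collect(),
    }
}

fn main() {
    let d = Dict { keys: vec![2, 0, 2, 1], values: vec![1, 2, 3] };
    let out = unary_dict(&d, |v| v * 10);
    assert_eq!(out, Dict { keys: vec![2, 0, 2, 1], values: vec![10, 20, 30] });
}
```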
Opinion

Putting aside all practical implementation considerations initially, I think that removing
For example, it likely makes sense for the parquet reader to provide dictionary-encoded strings (to match what came out of parquet), and then unpack this data once it hits some kernel that doesn't support dictionary encoding, or once the data has been filtered down to where the dictionary encoding overhead outweighs its benefits.

As pointed out above, the major implication is that all the kernels would have to "support" dictionary-encoded data. I don't think this is as bad as it may initially seem: kernels without support for specific encodings could unpack the dictionaries (aka cast to the value type) and proceed. This is my understanding of how DuckDB works. The primary benefit of the current situation is that it is backwards compatible.

Steps forward

So in my mind, if there was some way to get from today to an API where the encoding wasn't part of the types, that would be great. I was thinking last night maybe we could somehow use the new
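The unpack-and-proceed fallback described above might look roughly like this (illustrative types, not arrow-rs API): a kernel that doesn't support dictionary encoding first materialises the dictionary, which is equivalent to casting to the value type, and then runs its plain-array path.

```rust
// Stand-in array with two possible encodings.
enum Array {
    Plain(Vec<i32>),
    Dict { keys: Vec<u32>, values: Vec<i32> },
}

/// Materialise a dictionary into plain values; a copy for plain input.
fn unpack(a: &Array) -> Vec<i32> {
    match a {
        Array::Plain(v) => v.clone(),
        Array::Dict { keys, values } => {
            keys.iter().map(|&k| values[k as usize]).collect()
        }
    }
}

/// A kernel that only knows about plain data: it unpacks first, then proceeds.
fn negate(a: &Array) -> Vec<i32> {
    unpack(a).into_iter().map(|v| -v).collect()
}

fn main() {
    let d = Array::Dict { keys: vec![1, 0, 1], values: vec![5, 7] };
    assert_eq!(negate(&d), vec![-7, -5, -7]);
    assert_eq!(negate(&Array::Plain(vec![1, 2])), vec![-1, -2]);
}
```

The trade-off, as noted elsewhere in the thread, is that this materialisation is implicit and can blow the memory budget if it happens silently and repeatedly.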
And compatible with the rest of the arrow ecosystem...
IMO the issue here is that DF is using the physical arrow type as its logical type system, what if we were to introduce an explicit logical type abstraction to DataFusion? This would not break arrow compatibility whilst also achieving these benefits?
This is an intriguing idea. DataFusion already has the
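As a hypothetical illustration of such a logical type abstraction (the enum and mapping below are invented for this sketch, not actual DataFusion code): the dictionary's key type is treated as an encoding detail, so a dictionary of strings and a plain string column share the same logical type at planning time.

```rust
// Simplified stand-in for the physical Arrow type.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Utf8,
    Dictionary(Box<DataType>, Box<DataType>), // (key type, value type)
}

// A logical type that deliberately erases encoding details.
#[derive(Debug, Clone, PartialEq)]
enum LogicalType {
    Int32,
    Utf8,
}

fn logical_type(dt: &DataType) -> LogicalType {
    match dt {
        DataType::Int32 => LogicalType::Int32,
        DataType::Utf8 => LogicalType::Utf8,
        // The key type is an encoding detail: recurse into the value type.
        DataType::Dictionary(_, value) => logical_type(value),
    }
}

fn main() {
    let dict = DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8));
    assert_eq!(logical_type(&dict), LogicalType::Utf8);
    assert_eq!(logical_type(&dict), logical_type(&DataType::Utf8));
}
```

Planning and coercion rules could then compare `LogicalType`s, leaving the physical encoding to execution.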
Could probably hide more than just dictionaries: offset sizes, and potentially things like decimal precision, might also be candidates.
Instead of adding the new logical type in DF, what if we introduce it in
An alternative might be to implement
Yes, I like this idea. I agree we can start with DF and eventually move this to
FWIW I plan to start writing up some of this context as a DataFusion issue hopefully later today
I filed apache/datafusion#7421 with a discussion in DataFusion
FYI @yukkit has created a PR showing how
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
We've seen a lot of issues related to dictionary array support in both `arrow-rs` and DataFusion. One of the main reasons, I think, is that `arrow-rs` treats dictionary type as part of its logical type.

A better approach IMO is to consider dictionary type as a physical type property and hide it from the logical `DataType`. Correspondingly, `arrow-rs` shouldn't maintain separate `Int32Array` and `Int32DictionaryArray`, etc., but rather unify the two and hide the encoding details inside the array implementation.

Describe the solution you'd like

Remove `Dictionary` from Arrow `DataType`.
.Describe alternatives you've considered
Not doing it, and living with the complexities.
Additional context
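A minimal sketch of the unified API the proposal implies (hypothetical, not arrow-rs): a single `Int32Array` type with constructors for both representations, where the encoding is a private detail that callers can never observe.

```rust
// Hypothetical unified array: the encoding is private.
pub struct Int32Array {
    repr: Repr,
}

enum Repr {
    Plain(Vec<i32>),
    Dict { keys: Vec<u32>, values: Vec<i32> },
}

impl Int32Array {
    pub fn from_values(values: Vec<i32>) -> Self {
        Self { repr: Repr::Plain(values) }
    }

    pub fn from_dictionary(keys: Vec<u32>, values: Vec<i32>) -> Self {
        Self { repr: Repr::Dict { keys, values } }
    }

    /// Iterate logical values; identical output for either encoding.
    pub fn iter(&self) -> Box<dyn Iterator<Item = i32> + '_> {
        match &self.repr {
            Repr::Plain(v) => Box::new(v.iter().copied()),
            Repr::Dict { keys, values } => {
                Box::new(keys.iter().map(move |&k| values[k as usize]))
            }
        }
    }
}

fn main() {
    let plain = Int32Array::from_values(vec![7, 7, 9]);
    let dict = Int32Array::from_dictionary(vec![0, 0, 1], vec![7, 9]);
    // Observationally identical, even though the encodings differ.
    assert!(plain.iter().eq(dict.iter()));
}
```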