Faster joins meta issue #2340
Here is a basic benchmark I would start with (it can be extended easily):
It already shows one thing we know of - the joins are sensitive to the order of arguments. @shashi - I think this is a good benchmark to start from for a reference. In R, data.table has the following example timings (using 1 thread) for sorted data:
unsorted data:
So clearly it takes advantage of the fact that the data is sorted. But for unsorted data it would seem that we should be almost as good. And in the H2O benchmarks we still have a problem with memory management on the 50GB data.
If we were to go for a shared package that would perform joins, then the API that would be nice from the DataFrames.jl perspective would look like:
where:
The return value should be a tuple of two vectors:
so that, after applying these vectors to the rows of the original tables, non-equi joins could also be supported in the future (and because of such extensions I believe it would be better to have a separate package doing this, as all of it is non-DataFrames.jl specific).
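As an illustration, here is a Base-only sketch of what such an index-returning API could look like. The name `innerjoin_indices` and the exact signature are hypothetical, not an existing DataFrames.jl or SplitApplyCombine.jl function:

```julia
# Hypothetical sketch: return two index vectors such that
# left[left_ixs[k]] == right[right_ixs[k]] for every k.
function innerjoin_indices(left::AbstractVector, right::AbstractVector)
    # hash table over the right table: key => positions where it occurs
    positions = Dict{eltype(right), Vector{Int}}()
    for (j, v) in enumerate(right)
        push!(get!(Vector{Int}, positions, v), j)
    end
    left_ixs, right_ixs = Int[], Int[]
    for (i, v) in enumerate(left)
        # (allocating the default Int[] on every miss is wasteful;
        # a real implementation would avoid it)
        for j in get(positions, v, Int[])
            push!(left_ixs, i)
            push!(right_ixs, j)
        end
    end
    return left_ixs, right_ixs
end

innerjoin_indices([1, 2, 2, 5], [2, 3, 5])  # → ([2, 3, 4], [1, 1, 3])
```

Applying `left_ixs`/`right_ixs` to the rows of the original tables materializes the join, and the same two-vector contract extends naturally to non-equi joins.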
It was mentioned to check SplitApplyCombine.jl
My recent testing (I think with TypedTables) showed the same thing, so probably not a mistake. (I did think it was relatively performant in some tests done in the past, though, so I was a bit confused, but I haven't dug into it yet.)
If you used it naively with a DataFrame, though, it would perform extremely poorly!
No - I used what you show in the manual:
OK - try the upcoming SplitApplyCombine 1.1.2. I fixed some type instabilities, which speeds up the first example by 3x.
OK here are some benchmarks I just ran in Julia 1.5:
julia> using SplitApplyCombine, DataFrames, BenchmarkTools
julia> @btime SplitApplyCombine.innerjoin(identity, identity, tuple, 1:10^7, 1:10^6);
466.977 ms (2000074 allocations: 189.24 MiB)
julia> @btime SplitApplyCombine.innerjoin(identity, identity, tuple, 1:10^6, 1:10^7); # This is the slow way for SplitApplyCombine (like you say, easy to fix)
4.172 s (20000092 allocations: 1.74 GiB)
julia> @btime SplitApplyCombine.innerjoin(identity, identity, tuple, 1:10^6, 1:10^6);
187.345 ms (2000074 allocations: 189.24 MiB)
julia> df1 = DataFrame(x=1:10^7); df2 = DataFrame(x=1:10^6);
julia> @btime DataFrames.innerjoin(df1, df2, on=:x);
9.057 s (19999687 allocations: 854.87 MiB)
julia> @btime DataFrames.innerjoin(df2, df1, on=:x);
724.453 ms (1999687 allocations: 768.87 MiB)
julia> @btime DataFrames.innerjoin(df2, df2, on=:x);
324.604 ms (1999687 allocations: 168.22 MiB)

ProfileView showed that most of the SplitApplyCombine time was spent dealing with hashes, hash tables, and pushing the data to the output array. Altering the approach may lead to some performance benefits. I didn't gain any particular insights from doing the same for DataFrames, but @bkamins you might understand the bottlenecks better than me.
In DataFrames.jl this should be similar - hashing is the biggest cost. The timings you show look "good" in comparison to data.table. So my conclusion is that using SplitApplyCombine.jl could be a good backend for joining in DataFrames.jl, provided that you would be OK to (of course I can join the efforts, but I have learnt that it is best to have a commitment from the core developer for large changes in a package):
Sorry for the long list of requirements here, but I want the Julia data ecosystem to be best in class 😄 and I know how hard such things can turn out to be to maintain in the long run. And thank you for your fast response. I believe that with points 1-4 (which seem relatively easy to achieve with the infrastructure you have now) we can easily compete with data.table in terms of join performance. If we agree to this, the first thing I will do on my side is add SplitApplyCombine.jl as a dependency of DataFrames.jl and integrate your engine. If anyone has some comment on this plan please chime in. And thank you @andyferris for your quick response - this is much appreciated, and I believe that if you have some time to work on it we can relatively quickly have something that makes a new release (probably in 1-2 months) look much better in the H2O benchmarks.
@andyferris - as a note to remember to check this - we should benchmark the performance for column/columns of: |
OK great - happy to help out, and contributions are more than welcome. I should point out that I've found it difficult to put significant time into open source work over the last 18 months or so. I'm still active, but not at the level I was. Just as a preamble - originally everything that went into SplitApplyCombine was designed so that it could potentially be ported to
Yes, this would be nice. The package generally follows the kinds of patterns I saw in Base in Julia 1.0 (see preamble). My understanding of the community's direction is that we'd be just as happy to have multithreading by default for
Yes I think this is easier with
Sure, I was just looking at that tonight. (I will need my beauty sleep though... 😴)
Yes - for sure. The algorithms can get cleverer and I think there are lots of good approaches here. I had good success in
I agree that we'll have to do something to make this wrangling easier. Like I said above, it would be good if the semantics of the generic function remained intact for all the methods... and I always intended to use something like
For sure the Pooled and Categorical stuff will be slower than necessary, until they get specialized methods. I think I did one benchmark for grouping with strings that didn't seem spectacular. Similarly we need benchmarks for grouping by multiple columns - I think this was slower than I expected but that might be magically fixed in Julia 1.5? The other thing to be wary of is performance with |
This is problematic because it adds another pass (less of a problem, as this is cheap) and eats up memory (this is a real issue - we currently fail the 50GB H2O tests because our current join algorithm is too memory greedy). Some first steps could be:
Sorry I probably wasn't clear - by "something like" I meant a lazy version of
Yes you nerd sniped me, I'm working on it now, nearly done :) (I'll sleep another day...)
I imagine it should go very similarly to how AcceleratedArrays handles joins with |
WIP:
julia> @btime SplitApplyCombine.innerjoin(identity, identity, tuple, 1:10^7, 1:10^6);
521.482 ms (2000074 allocations: 189.24 MiB)
julia> @btime SplitApplyCombine.innerjoin(identity, identity, tuple, 1:10^6, 1:10^7);
514.161 ms (2000074 allocations: 189.24 MiB)
OK that's published. I wish I understood how to profile allocations. This (and also |
@vtjnash - is there some best practice for tracking down the source of the allocations that @andyferris observes? Any hints would be helpful. Thank you!
Still - I would check that this would not be slower than two vectors, as it seems the two-vector approach should be more cache friendly for later processing.
I have no special insights that I think Andy doesn't already have. Huda talked about some of the tools recently (https://www.youtube.com/watch?v=S5R8zXJOsUQ) and others have been working on a memory profiler (JuliaLang/julia#33467).
I did some benchmarks with IndexedTables; while the order of the tables there does not affect performance, it has similar or worse performance in the good cases. Is the plan to use SplitApplyCombine.jl to do this, with SAC itself becoming aware of AcceleratedArrays (or just using the right Base API that AcceleratedArrays magically speeds up), so that Tables implementations like CSV can use that to attach indexing metadata? @andyferris Benchmarks are looking pretty good! But in this case wouldn't an algorithm similar to merge sort be faster than hashing?
So my plan (and question to the community) is that we decide on SplitApplyCombine.jl as the place to develop these algorithms, as I believe we should bet on one package and jointly contribute to improving it. Then all the things you ask for (like being indexing-aware or using different algorithms in different cases) can just be gradually added to it by many contributors. In short, the question is: do we agree to choose SplitApplyCombine.jl as the place for such functionality? I vote yes!
FYI AcceleratedArrays already accelerates SplitApplyCombine (
Definitely - |
I have worked with SplitApplyCombine.jl code, DataFrames.jl code and checked the H2O benchmarks (if someone wants - I have generated the 0.5GB test set they use). The conclusions are:
The obvious algorithm would be: combine the unique pools of both, sort them (optional, depends on whether you need |
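For intuition, here is a rough Base-only sketch of the pooled idea under simple assumptions (plain `Vector{Int}` refs into a pool, mimicking the PooledArrays.jl layout; this is an illustration, not the actual package code):

```julia
# Sketch: an inner join over "pooled" data. Instead of hashing every element,
# match the (small) pools once, then scan the integer ref vectors.
# `refs` are 1-based indices into `pool`.
function pooled_innerjoin_indices(lrefs::Vector{Int}, lpool::Vector,
                                  rrefs::Vector{Int}, rpool::Vector)
    # map each left pool value to its position in the right pool (0 if absent)
    rindex = Dict(v => j for (j, v) in enumerate(rpool))
    l2r = [get(rindex, v, 0) for v in lpool]
    # group the right rows by their ref, once
    rgroups = [Int[] for _ in rpool]
    for (j, ref) in enumerate(rrefs)
        push!(rgroups[ref], j)
    end
    left_ixs, right_ixs = Int[], Int[]
    for (i, ref) in enumerate(lrefs)
        rref = l2r[ref]
        rref == 0 && continue
        for j in rgroups[rref]
            push!(left_ixs, i); push!(right_ixs, j)
        end
    end
    return left_ixs, right_ixs
end
```

The per-element work is integer indexing only, which is why this pays off when the pools are small relative to the data.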
Two quick thoughts from experiments I have been doing today:
This is the expensive part. Combining the pools like this can be done efficiently only if one of pools has low cardinality.
This is also out of the question in practice. In the H2O benchmarks data.table does a join of a 10^7 vs. 10^7 table in 2.3 seconds, and just doing the sorting would take 1.5 seconds - and this is the best case. E.g. for a join of 10^7 vs. 10^4 it takes 0.3s, so there sorting the 10^7 table is prohibitive. In general, have a look at the joins/0.5GB section of https://h2oai.github.io/db-benchmark/ for the timings (as I have said, I have these data sets so I can share them, but they are too big to attach here).
Here is a quick test (probably not optimal but showing the issue):
Yeah agree. My attempt so far has been to stay within the collection-of-rows abstraction and use dispatch patterns to (a) fetch the one-or-few joining columns from each table and (b) choose the best algorithm. This is currently possible using TypedTables, SplitApplyCombine and AcceleratedArrays but that approach I believe could extend to DataFrames and PooledArrays without much trouble. This is the
Also agree. I was looking at PooledArrays.jl, I see it's generally incremental and inserting the elements one-by-one (short glance indicated that |
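A batch build of a pool (one pass, hashing each element exactly once) could look roughly like this; `batch_pool` is a hypothetical helper for illustration, not the actual PooledArrays.jl API:

```julia
# Sketch: build pool + refs for a whole column in one pass, instead of
# inserting elements one by one through a public per-element API.
function batch_pool(values::AbstractVector{T}) where {T}
    pool = T[]
    lookup = Dict{T, Int}()
    refs = Vector{Int}(undef, length(values))
    for (i, v) in enumerate(values)
        # get! hashes v once: reuse an existing ref or append to the pool
        refs[i] = get!(lookup, v) do
            push!(pool, v)
            length(pool)
        end
    end
    return refs, pool
end

batch_pool(["a", "b", "a"])  # → ([1, 2, 1], ["a", "b"])
```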
As for integration with SplitApplyCombine.jl here is a first pass #2359. I just used broadcasted (in general |
Cool. I'm supposing that the high-cardinality arena is where you would see the row-column-mangling slowest anyway, so that's not too discouraging right? Did you do any profiling to determine where the time is spent?
Oh that's good to hear - I wasn't sure if people would appreciate the |
Yeah true - I need to use that in SplitApplyCombine for |
You reminded me, I should finish implementing that for |
So - how far away from "good" (e.g. data.table) performance are we? I'm looking at grouping (it's a bit simpler), and using the general approach you outline I get roughly a 2x improvement:
julia> a = rand(1:10^6, 10^7);
julia> @btime SplitApplyCombine.groupfind2(a);
1.535 s (83 allocations: 219.31 MiB)
julia> @btime SplitApplyCombine.groupfind(a);
2.819 s (3664416 allocations: 338.50 MiB)

(If anyone is curious, the newer algorithm is here.)
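For readers curious what a low-allocation grouping pass can look like, here is a hedged sketch in the same spirit: a counting-sort over small integer keys that allocates only a handful of arrays regardless of the number of groups. This is an illustration of the general technique, not the linked SplitApplyCombine.jl code:

```julia
# Sketch: group row indices of `a` (values assumed to lie in 1:nkeys)
# with two passes and O(nkeys + length(a)) memory, no per-group vectors.
function groupfind_counting(a::Vector{Int}, nkeys::Int)
    starts = zeros(Int, nkeys + 1)
    for v in a
        starts[v + 1] += 1          # pass 1: tally group sizes, shifted by one
    end
    starts[1] = 1
    for k in 2:nkeys + 1
        starts[k] += starts[k - 1]  # prefix sums: group v begins at starts[v]
    end
    perm = Vector{Int}(undef, length(a))
    offsets = copy(starts)
    for (i, v) in enumerate(a)      # pass 2: place each row into its group's slice
        perm[offsets[v]] = i
        offsets[v] += 1
    end
    # group v's row indices are perm[starts[v]:starts[v+1]-1]
    return perm, starts
end
```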
Nice generic code - you are a master :). Now, regarding performance vs. data.table: I have distilled out the key part. Here are the current (i.e. v1.1.3) timings:
Comments:
Now the same in R reconstructed:
So the gap is really wide - data.table is very good here. (and we are still working with |
What performs almost as well as data.table is:
Now, how do I get it? The strategy is:
Of course, in point 3, if we can be faster with the single-vector strategy giving a permutation for groups, then we can use it as well. The point is that uniqueness can be cheaply detected (i.e. if it happens that we have a non-unique collection, the cost of conversion is quite low compared to the other costs we have to pay). I hope this helps 😄
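The cheap uniqueness detection can be sketched like this (a hypothetical helper; a real implementation would fold the check into the hash-table build it performs anyway):

```julia
# Sketch: build a key => first-row-index table, detecting duplicates on the fly.
# If the keys are unique we get a direct-lookup table with one Int per key and
# no per-key vectors; on the first duplicate we bail out cheaply and the caller
# falls back to a grouped (key => many rows) structure.
function unique_index(keys::AbstractVector)
    table = Dict{eltype(keys), Int}()
    sizehint!(table, length(keys))
    for (i, k) in enumerate(keys)
        if haskey(table, k)
            return nothing          # duplicate found: not unique
        end
        table[k] = i
    end
    return table                    # unique: direct lookups suffice
end
```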
Yes, that helps a lot. I was already thinking that I can port that |
Sharing common join code in SplitApplyCombine seems like a good idea. My two cents:
I just wanted to mention that StructArrays implements the sorting approach, computing a
I would actually be pretty happy to get rid of that code, so I'd be very much in favor of having
I have checked it earlier:
So for two structures it is 5 seconds plus the cost of merging. My current implementation above does everything in 3.4 seconds, so the cost is a bit above the cost of a single

EDIT: just note that this 3.4-second timing is not deployed anywhere yet - I put it up as a benchmark for @andyferris, as he is currently working on the SplitApplyCombine.jl joins update (of course I can make a PR with this change, but I feel it is cleaner if @andyferris makes a change that is consistent with his approach to the exact design in SplitApplyCombine.jl).
I think you meant |
Sure - copy/paste issue. I fixed it above.
Also - after discussion on Discourse it would be good to allow non-equi joins (even if they would not be super fast - in many cases people have relatively small tables so even O( |
@andyferris - what do you think the plan for the next steps should be here? We clearly have a path to go in the long term, but in short term do you think that I should do some temporary solution in DataFrames.jl or updates to SplitApplyCombine.jl can be expected soon enough so that we should wait for them (no rush of course - I am asking to plan the work in DataFrames.jl). |
@bkamins that’s a good question. Where is the “3.4 seconds” code, by the way? It would be good if I could compare it to my prototype for SplitApplyCombine (and how all the column-row-column handling stuff goes). I suppose such an integration can be implemented slowly - there’s a few basic functions like |
I have not pushed it anywhere as it was dirty. Here is the code:
We can check it in the future, but I believe it is less of a priority as we already have this done quite efficiently (of course maybe things can be improved also here) |
It seems the compiler does not fully inline
You mean constant propagate? It is interesting to me that this stuff doesn't work well with default and keyword arguments (I'd expect the equivalent of In general... I'm curious how important the order out of |
I've been on vacation for the past week or so, but I'm trying to catch back up on things. Two of the basic questions I have are where the core operations/functions are going to live, and what exactly will they be? I ask because I wonder if we could have some generic function definitions in DataAPI.jl that would allow experimental overload by custom array types to try and optimize certain operations (grouping, joining, etc.). I guess I'm also wondering if that would even make sense, or do we think we can make a set of optimized implementations that could operate on any array type? I've started playing around with making the |
Hey Jacob - from my point of view, SplitApplyCombine was created to be a home for generic operations like group and join on generic collections. Of course, faster (you said multithreaded?) implementations for certain array types are always good, so adding methods to those generic functions makes sense to me (and is easy to prototype if you're willing to take SplitApplyCombine as a dependency - AcceleratedArrays is the current "poster child" for this - I think it makes sense to have some packages supply the verbs/functions and others supply the nouns/datatypes). There may also be deficiencies in terms of what generic functions are needed for tabular manipulation; it's likely we'd want some more (e.g. the only joins are
I published a few performance improvements based on earlier parts of this discussion but the more recent stuff is still WIP (code pasted and/or linked above). |
Yes - I mean constant propagation and pruning branches in the code (in my first implementation
I assume that @andyferris will update SplitApplyCombine.jl based on the discussion (@andyferris - thank you for working on this). To my understanding in the short term the changes that show promise of significant improvement are:
For me, point 1 above would be the priority, as it is pretty clear what needs to be done and it gives a big performance boost (so if SplitApplyCombine.jl were released with the changes described in 1, I would then update DataFrames.jl to use it instead of what we have in

Now, what I believe we need in the long term is described in #2340 (comment), but this is a lot of work so I assume we are going to handle it incrementally.
This avoids allocating a Vector for the case where l does not have multiple indices with the same value. For the smoke-test benchmark in <JuliaData/DataFrames.jl#2340 (comment)>, this reduces allocations by half and overall runtime by 10%.
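The idea reads like the classic small-key optimization: store a bare `Int` per key and promote to a `Vector{Int}` only on the first duplicate, so unique keys never pay for a per-key vector. A hedged Base-only sketch of that shape (assumed for illustration, not the actual patch):

```julia
# Sketch: hash index where each slot is either a single row index (Int)
# or, only after a duplicate appears, a Vector{Int} of row indices.
function build_index(keys::AbstractVector{K}) where {K}
    index = Dict{K, Union{Int, Vector{Int}}}()
    for (i, k) in enumerate(keys)
        slot = get(index, k, nothing)
        if slot === nothing
            index[k] = i                 # common case: first (maybe only) occurrence
        elseif slot isa Int
            index[k] = [slot, i]         # first duplicate: promote to a vector
        else
            push!(slot, i)               # further duplicates: append in place
        end
    end
    return index
end
```

When `keys` happens to be duplicate-free, this allocates no vectors at all, which matches the "half the allocations" effect described above.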
Here are the things I think joins can take advantage of to improve performance:
- `join`s on `GroupedDataFrame`
- the `on` variable is sorted
- the `on` variable defines unique rows in one or both of the joined tables
- `innerjoin` (which is probably most common)

Please comment on what you think or have implemented to move forward with this issue (also keeping in mind that in the long term we might want multi-threading here). Also, we should think about the memory footprint of the algorithms we use.
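For the sorted, unique-key special case listed above, an innerjoin degenerates to a single merge pass with no hashing and no extra memory beyond the output. A minimal sketch (illustrative, not code from any of the packages discussed):

```julia
# Sketch: innerjoin of two sorted, duplicate-free key vectors in one pass.
function sorted_unique_innerjoin(left::AbstractVector, right::AbstractVector)
    left_ixs, right_ixs = Int[], Int[]
    i = j = 1
    while i <= length(left) && j <= length(right)
        if left[i] == right[j]
            push!(left_ixs, i); push!(right_ixs, j)
            i += 1; j += 1          # unique keys: each side advances past a match
        elseif left[i] < right[j]
            i += 1
        else
            j += 1
        end
    end
    return left_ixs, right_ixs
end

sorted_unique_innerjoin([1, 3, 5, 7], [2, 3, 7, 8])  # → ([2, 4], [2, 3])
```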