# Performance with sparse arrays #6
Thanks. It's returning a sparse array, where the simple version without dims just converts to a dense array. It turns out SparseArrays defines its own version of the reduction initialisation, but we are calling the generic Base method on the wrapper. I'm not sure what the cleanest solution is; we might need to add a forwarding method. I'm also not sure the approach I've taken here is right.
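The shape of the problem, sketched with a hypothetical `Wrapped` type standing in for `DimensionalArray` (a minimal illustration, not the package's internals):

```julia
using SparseArrays

# Hypothetical stand-in for a dim-labelled wrapper like DimensionalArray:
struct Wrapped{T,N,A<:AbstractArray{T,N}} <: AbstractArray{T,N}
    data::A
end
Base.size(w::Wrapped) = size(w.data)
Base.getindex(w::Wrapped, i::Int...) = w.data[i...]

sp = sprand(10_000, 10_000, 0.1)
w = Wrapped(sp)

sum(sp; dims=1)  # fast: dispatches to SparseArrays' specialised reduction
sum(w; dims=1)   # slow: falls back to the generic AbstractArray code path
```

---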
Thanks for the quick response, and for the package! It's really nice to see progress being made on labelled arrays in Julia. I don't think I've been using Julia heavily enough recently to provide much insight for the implementation. That said, I was expecting the wrapped array to behave much more like its parent.

Are you sure the reduction is causing the performance issue? I agree it explains the difference in output type, but there are similar performance issues with other operations like `copy`:

```julia
julia> @benchmark copy($sparsear)
BenchmarkTools.Trial:
memory estimate: 152.64 MiB
allocs estimate: 7
--------------
minimum time: 29.745 ms (2.43% GC)
median time: 40.765 ms (30.34% GC)
mean time: 41.141 ms (31.30% GC)
maximum time: 91.379 ms (68.68% GC)
--------------
samples: 122
evals/sample: 1
julia> @benchmark copy($sparsed)
BenchmarkTools.Trial:
memory estimate: 152.64 MiB
allocs estimate: 8
--------------
minimum time: 3.843 s (0.02% GC)
median time: 3.849 s (0.20% GC)
mean time: 3.849 s (0.20% GC)
maximum time: 3.855 s (0.38% GC)
--------------
samples: 2
evals/sample: 1
```

And for `abs.`:

```julia
julia> @benchmark abs.(sparsear)
BenchmarkTools.Trial:
memory estimate: 152.64 MiB
allocs estimate: 9
--------------
minimum time: 41.412 ms (1.47% GC)
median time: 52.996 ms (25.29% GC)
mean time: 53.401 ms (25.88% GC)
maximum time: 110.823 ms (65.40% GC)
--------------
samples: 94
evals/sample: 1
julia> @benchmark abs.(sparsed)
BenchmarkTools.Trial:
memory estimate: 763.09 MiB
allocs estimate: 13
--------------
minimum time: 1.969 s (0.49% GC)
median time: 2.025 s (3.01% GC)
mean time: 2.028 s (2.98% GC)
maximum time: 2.090 s (5.29% GC)
--------------
samples: 3
evals/sample: 1
```

To me, the memory usage suggests an intermediate dense array is being created.
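That reading checks out arithmetically, assuming the benchmark arrays are 10_000×10_000 Float64 at roughly 10% density (an assumption; the construction code isn't shown here):

```julia
# Dense 10_000×10_000 Float64, in MiB: a full dense copy, close to the
# ~763 MiB estimate for abs.(sparsed).
10_000 * 10_000 * 8 / 2^20            # ≈ 762.9 MiB

# Sparse CSC at ~10% density: 8-byte values + 8-byte row indices (+ colptr),
# close to the ~152.6 MiB estimates on the sparse results.
(10^7 * (8 + 8) + 10_001 * 8) / 2^20  # ≈ 152.7 MiB
```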
For my use case, the observations are barcodes for cells and the variables are gene identifiers. Access via string names is important for the user interface, though any heavy computation should access the values by position. I think I was expecting xarray/pandas-like behavior, where an index (probably a hash table) would be used for string-labelled dimensions. A minimal sketch of the lookup pattern I mean is below.
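(Plain Julia, with illustrative names rather than DimensionalData's API:)

```julia
# Labels are hashed into a Dict once; lookups then resolve to positions.
genes = ["BRCA1", "TP53", "MYC"]
index = Dict(g => i for (i, g) in enumerate(genes))  # O(1) average-case lookup

data = rand(length(genes), 5)   # genes × cells
row = data[index["TP53"], :]    # string label -> position -> values
```

---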
The sparse array created in `similar()` is very likely to be causing the performance issue for `mean`. That's why Base avoids it. But there may be more issues with `copy` etc. For sure, I haven't even thought about sparse arrays before this conversation because I don't use them. There might need to be some architectural changes.

The dims do actually change with reducing operations: the plan is for the reduced dimension to stay, holding a single label.

The lookup is very rudimentary at the moment. It will be fast enough with numbers. If you want fast hashes, AcceleratedArrays.jl should work inside dims, and trying it would be interesting, but I haven't worked on that yet. Xarray and pandas have developer teams, and this was just me hacking away for a week lol! I just need GeoData.jl for the modelling packages that I'm actually paid to write and others I do research with, and DimensionalData.jl is a bonus. It won't be practically comparable for a few years, and only then if enough people jump in and help (or the same happens with an alternative package).
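For reference, standalone AcceleratedArrays usage looks roughly like this (`accelerate` and `UniqueHashIndex` are its exported API; whether this slots into dims here is untested):

```julia
using AcceleratedArrays

labels = ["cell_$i" for i in 1:1_000_000]
fast = accelerate(labels, UniqueHashIndex)  # build a hash index once, up front

findall(isequal("cell_999999"), fast)       # answered from the index, not a linear scan
```

---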
Thanks for recommending AcceleratedArrays.jl! I've been wondering for a while now whether a package like that existed.
For sure! I'm definitely not expecting a single developer to match projects with whole teams behind them.
Definitely agree that the dimension should stay. My main concern was that a single label on that dimension stays, which you've got a plan for.
I did a little more digging into what's happening in `mean`, and I think there is only a small effect from which initial array `Base.reducedim_init` creates. Benchmarks:

```julia
julia> ddinit = Base.reducedim_init(t -> t/2, +, sparsed, 1)
julia> sainit = Base.reducedim_init(t -> t/2, +, sparsear, 1);
julia> @benchmark Statistics.mean!($ddinit, $sparsed)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.666 s (0.00% GC)
median time: 1.673 s (0.00% GC)
mean time: 1.674 s (0.00% GC)
maximum time: 1.685 s (0.00% GC)
--------------
samples: 3
evals/sample: 1
julia> @benchmark Statistics.mean!($sainit, $sparsed)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.677 s (0.00% GC)
median time: 1.677 s (0.00% GC)
mean time: 1.696 s (0.00% GC)
maximum time: 1.734 s (0.00% GC)
--------------
samples: 3
evals/sample: 1
```

It looks like the big change in time is actually coming from the `sum!` call itself. Benchmarks:

```julia
julia> @benchmark sum!($sainit, $sparsed)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.672 s (0.00% GC)
median time: 1.675 s (0.00% GC)
mean time: 1.680 s (0.00% GC)
maximum time: 1.692 s (0.00% GC)
--------------
samples: 3
evals/sample: 1
julia> @benchmark sum!($ddinit, $sparsear)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 4.222 ms (0.00% GC)
median time: 4.337 ms (0.00% GC)
mean time: 4.365 ms (0.00% GC)
maximum time: 7.741 ms (0.00% GC)
--------------
samples: 1144
evals/sample: 1
```

The main time discrepancy ends up coming from different methods being called by `sum!`: methods specialized on the wrapped sparse array aren't being called, which is why I say this is a more general issue of dispatch. I imagine similar issues will crop up with other array types, since there isn't really a step that allows dispatch based on the array holding the data. This does seem tough to solve in an elegant way. Btw, AxisArrays seems to have the same performance issue 😉.

---
Thanks for digging into that. That makes sense: it's specialised methods being dispatched on the SparseArray vs. general methods on AbstractDimensionalArray. Fixing this will be a design goal when I rewrite the math/stats methods. But I feel better knowing AxisArrays does it too lol. Let me know how you go using AcceleratedArrays, and do open an issue if any changes need to happen here for it to work.

---
This should be fixed by fda792b.

---
Thanks for that! It fixes some cases, but others, like `copy`, are still slow. Without using something like Cassette, can you think of any way this could be done less manually? Maybe a public interface for "registering" a function with DimensionalData could make this easier? Something like the sketch below, perhaps.

BTW, I gave AcceleratedArrays a shot, but didn't see a speedup. I think this is due to how the lookup is implemented at the moment.
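A registration interface could be as small as a forwarding macro; a hypothetical sketch (`@forward_to_parent` is not an actual DimensionalData API):

```julia
# Define a method that unwraps to the parent array for a given function.
# A real version would also need to rewrap results that keep their dims.
macro forward_to_parent(f)
    quote
        $(esc(f))(A::AbstractDimensionalArray, args...; kw...) =
            $(esc(f))(parent(A), args...; kw...)
    end
end

# Usage would look like:
#   @forward_to_parent Base.copy
#   @forward_to_parent Statistics.mean
```

---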
Really? It should be even closer after those fixes. Also thanks for checking out AcceleratedArrays. Yeah, the lookup probably isn't set up to take advantage of it yet.

---
Ok.

Edit: `copy` should also be fixed in #13. Let me know if any other methods give you trouble. I guess this process has to be piecemeal, as it's hard to know where generic methods aren't good enough, and I don't want bloat for no reason, or have time to test everything.

---
Oh, I mean the single-argument version of `mean`:

```julia
julia> @benchmark mean($sparsed)
BenchmarkTools.Trial:
memory estimate: 128 bytes
allocs estimate: 6
--------------
minimum time: 1.614 s (0.00% GC)
median time: 1.623 s (0.00% GC)
mean time: 1.623 s (0.00% GC)
maximum time: 1.633 s (0.00% GC)
--------------
samples: 4
evals/sample: 1
julia> @benchmark mean($sparsear)
BenchmarkTools.Trial:
memory estimate: 96 bytes
allocs estimate: 4
--------------
minimum time: 4.018 ms (0.00% GC)
median time: 4.112 ms (0.00% GC)
mean time: 4.185 ms (0.00% GC)
maximum time: 9.738 ms (0.00% GC)
--------------
samples: 1193
evals/sample: 1
```

I think it just needs something like a single-argument method that forwards to the parent array.
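Say, something along these lines (a sketch, assuming `parent` returns the wrapped array; not necessarily the exact method the package ends up adding):

```julia
import Statistics

# A full reduction produces a scalar, so nothing needs rewrapping, and the
# sparse-specialised mean fires on the parent array.
Statistics.mean(A::AbstractDimensionalArray) = Statistics.mean(parent(A))
```

---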
Huh. Why isn't `mean` hitting the new methods?

---
Oh right, without dims. Those are just missing methods. There are probably 50 of them; let's make a list... Obviously everything I already have dims methods for.

---
Ok, I've added single-argument methods for everything that already had dims methods to that pull request. Thanks for the input, keep them coming :)

---
I'm seeing performance drops of about three orders of magnitude when using sparse data in a `DimensionalArray` vs. just performing operations on the sparse array directly. This doesn't happen with dense arrays. I've included an example below for `mean`, but I see the same thing for other reductions like `sum` or `maximum`, and even methods like `copy`.

This was run with DimensionalData v0.1.1 and Julia v1.2.0.
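The original example did not survive the formatting here; a hedged reconstruction that matches the memory figures in the thread might look like:

```julia
using DimensionalData, SparseArrays, Statistics, BenchmarkTools

# Assumed setup: the memory estimates above are consistent with 10_000×10_000
# Float64 at ~10% density. The exact v0.1 constructor call may differ.
sparsear = sprand(10_000, 10_000, 0.1)
sparsed = DimensionalArray(sparsear, (X(1:10_000), Y(1:10_000)))

@benchmark mean($sparsear; dims=1)  # milliseconds on the raw sparse matrix
@benchmark mean($sparsed; dims=X)   # seconds: roughly three orders slower
```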