-
Notifications
You must be signed in to change notification settings - Fork 58
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Argsort on int32_t severely underperforms #109
Comments
HI @KungFuJesus. We haven't compared performance of Your benchmark numbers look right, and I don't believe we claimed |
I'm sure this project never made the claim, I just recall Phoronix touting something like a 10x gain when numpy used it in place of std::sort. I'm sure there'll be nuance there. The scalar replacement does explain why me disabling GDS mitigations didn't seem to make a difference, though. The additional memory requirement is acceptable in my case. I do wonder if the sorting network approach could be applied to a radix sort for the moves. E.g. instead of comparisons for the sorting network, the bucket positions get computed. I'm like 90% sure IPP's using purely scalar operations for everything but a linear sequence generation. |
I believe that was for quicksort, not argsort.
If additional space is acceptable, then you could try using key-value sort |
I had tried that initially to try to get 32 bit values for indices and it wasn't working for me correctly for whatever reason. I didn't try it with size_t indices, though. |
32-bit indices weren't supported until recently: #108. Did you try it after this was merged? |
Yes, last commit I was trying against was: And it was wrong, and not wrong because it was unstable, but something very wonky. I'm computing a voxelization index and scanning through these voxel indices post sort with std::upper_bound to get their groupings. I went from something like 40k unique indices down to 2 or 3. Let me reconstruct the code I used to test, it should be a simple thing to verify correctness/incorrectness for if there was a mistake on my end. edit: this is what I used std::vector<uint32_t> argSort32(int32_t *vals, size_t numVals)
{
std::vector<uint32_t> idxs(numVals);
std::iota(idxs.begin(), idxs.end(), 0);
x86simdsort::keyvalue_qsort<uint32_t, int32_t>(idxs.data(), vals, numVals);
return idxs;
} |
Yeah, even with signed 32 bit types for the index I'm getting 4 unique voxel nodes instead of 47198 like there should be for this particular set of arguments. Also flipping the keys and argument values around results 3 unique nodes, so, I don't think that's it, either. |
If you can't reproduce this issue with random data I can gladly provide you this data in ascii since it's just indices, but it's going to be about 123 million integers. |
your argsort implementation looks incorrect. You want to sort
Couple of things to note: |
|
numVals == ~103 million, which is certainly smaller than the 2.5ish billion value of UINT32_MAX. Like I said before, I switched around vals and idxs in case I was confusing keys and values and still ended up with only 3 unique values. I haven't printed even a small slice of the output yet, I'm mostly just looking at what my std::upper_bound scan is concluding. Did you want a gzip compressed text document of the vals array? |
In this circumstance it's ok to be destructive to vals as this is a temporary voxel node index calculation that doesn't survive past this. It's merely there to get row indices for a bunch of 3d points to arrange spatially in a grid. Err wait...no that's not strictly true. OK, so I may need to make an explicit copy. Let me try that. |
Ok, yeah that was it. Thanks for the tip. Here's the timing: So radix sort is still winning there by a fair margin :(. Now, I did revert to the commit just prior to your PR to see what the performance was like with actual gathers (and GDS mitigations disabled, of course), and at least with the benchmarks it looks ~1.8x faster, so maybe it might have been beating IPP's index radix sort if we could rely on the microcode mitigations not being loaded. |
Another thing going for this is that the reason it wasn't working is I was doing what the code was already doing, deriving the post sorted values. So, there's some savings there too but maybe not enough to beat radix index sort alone. |
Yeah, that is unfortunate. Hopefully future hardware won't need these mitigations and we can have a version of argsort with the gather instruction. For there is scope of performance improvement for the [32-bit, 32-bit] key value sort, which we are working on. Will keep this open till then. |
@KungFuJesus PR #120 improves 32-bit key value sorting by a fair bit. Do you mind timing your benchmarks with this patch to see if it improves anything at your end? |
Sure, I'll check it out today and see the improvements. The mention of zmm registers I assume means everything is still AVX512 only - is this sort implementation possible on AVX2 era instructions or is that forever stuck on scalar implementation? |
Hmm, I'm not seeing any significant improvements from this. I am using the key-value sort explicitly rather than argsort in my code, does that matter? edit: Ok, I am seeing a marginal improvement for a much much larger set: Before:
After:
Is this the expected gain or should we have seen something more significant? |
#119 adds AVX2 versions but the performance gains won't match AVX-512.
On random data, I am seeing a 2x improvement in my benchmarking. What does your data distribution look like? Are you still benchmarking this piece of code where both the key and value vectors are 32-bit?
|
Also, what is the array size? |
It's 3d point cloud data that's likely to have a partial amount of order to it. The point locations themselves are used to compute an integer voxel node index by dividing by a uniform grid length on all 3 dimensions. I have a variant that produces this from the GPU and the data gets returned back as the rays intersect, with an atomic value on the gpu determining insertion index (making it slightly more random). The CPU version comes back in the order that the rays are cast, but the rays are formed in a 2d grid on a plane from some distance away, no guarantee for the data to be in voxel node order at all.
Yes, I made sure of that, we're operating on unsigned 32 bit indices and the values themselves (the voxel nodes) are signed. Not sure if that matters at all, though.
This particular example is 891416312 nodes. Though I tried it on a significantly smaller dataset (41445847 node indices) so that it wasn't going to be strictly a memory bandwidth benchmark and it looked to be roughly the same time before and after (maybe 100ms worse). |
Hmm not sure if I am measuring it wrong but I am seeing results differently. PR #121 compares copying the original array and sort indices using
EDIT: these are measured for various array sizes: [128, 256, 512, 1024, 5k, 100k, 1million, 10m and 100m] |
I can provide a dataset, if that would help. I'm fairly confident sharing it would be acceptable given that they are just arbitrary voxel indices. |
sure. I can take a look. |
So here's a smaller example. A quick note, this does happen to be faster overall because it applies the sorting to the keys, which is why I'm using it over IPP's radix sort while on AVX512. However, the IPP IndexRadixSortAscend had been faster than argsort by itself, which should be a more fair comparison. I've not tested the argsort function since then, but with this PR, the invocation of keyvalue_qsort is not any faster. The data itself is a bunch of little endian 32 bit integers serialized to a file and then that is gzipped. So you should be able to use something like fread on little endian to get all of the integers (calculating the size by dividing the number of bytes by 4). This is from voxels coming out of the Embree code, so they are less scrambled than the GPU counterpart but not guaranteed to be in order. |
Any progress on testing this data set? |
apologies, I haven't had time yet. I will try to take a look at it this week. |
HI @KungFuJesus yeah looks like for the data distribution you have, I am also measuring ippindexsort to be faster than avx-512 argsort. I wonder if the new pivot selection might help with it. Could you try benchmarking with #134? |
Hello, I just want to make sure I'm not doing something wrong. I stumbled upon this project when I found that IPP's radix sort had buffer size calculations that were overflowing well before the limits of an INT32_MAX and all of the "Index" variants also only used 32 bit types. To be clear, a 32 bit index variant still would be beneficial and I see there are plans to eventually implement that. It's just I need a conditional code path for when there are too many values (I have a dataset generated that has length in the billions). I noticed that when the code makes use of the AVX512 instructions for argsort, it severely under performs against the radix sort (which as far as I can tell from the assembly, is completely scalar except for an iota like scan to generate the sequential index).
The benchmarks distributed with this project do show it winning by some margins against the scalar variants in here but not by the typical order of magnitudes that were touted when this project was first introduced. Is this maybe a result of the gather data sampling mitigations in microcode? I know this heavily uses compressed store and gather, and those functions were severely limited by this.
Here's my benchmark on my hardware:
The text was updated successfully, but these errors were encountered: