[WIP] Add Cooperative Group variant of GPU-STUMP #266
Conversation
This variant uses a single kernel invocation to compute the matrix profile and matrix profile indices. The loop over the range is pushed into the cooperative kernel, and threads synchronize on iterations of the loop using a grid sync. The cooperative kernel can be faster than using multiple non-cooperative launches, because the overhead of launching many individual kernels is removed. The choice of whether to use the cooperative kernel is left up to the user, as not all GPUs / systems support cooperative launches.
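For illustration, here is a minimal sketch of the pattern in Numba CUDA Python (a toy kernel, not STUMPY's actual one): the loop over iterations lives inside the kernel, and a grid sync acts as the barrier between iterations.

```python
from numba import cuda
import numpy as np

@cuda.jit
def iterative_kernel(data, n_iters):
    # One cooperative launch replaces n_iters separate kernel launches.
    g = cuda.cg.this_grid()
    i = cuda.grid(1)
    for _ in range(n_iters):
        if i < data.size:
            data[i] += 1.0  # stand-in for the real per-iteration work
        # Every block in the grid waits here before the next iteration.
        g.sync()

# Numba performs a cooperative launch automatically when a kernel uses
# this_grid(); the grid must be small enough for all blocks to be resident.
arr = cuda.to_device(np.zeros(1024, dtype=np.float64))
iterative_kernel[4, 256](arr, 10)
```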
Thanks, Mark! I will take some time to review it. For now, it looks like the CI is failing due to invalid black formatting.
I've just tried reformatting with black 19.3b0 - I had 19.10b0 installed before, which seems to do things a little differently.
@gmarkall Are you able to run the coverage tests? This ensures that all lines of code are fully covered (it uses the simulated CUDA on CPU threads). Would you mind running that?
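For context, a minimal sketch of enabling Numba's CUDA simulator (the mechanism behind the simulated-CUDA coverage runs); the environment variable must be set before numba.cuda is first imported:

```python
import os

# Must be set before numba.cuda is first imported.
os.environ["NUMBA_ENABLE_CUDASIM"] = "1"

from numba import cuda

# Kernels now execute on CPU threads, so coverage tools can trace every line.
print(cuda.is_available())  # reports True under the simulator, even without a GPU
```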
@gmarkall I hope this is not a big issue but I spoke with our manager of "Office of the Open Source" at TD Ameritrade and we are not able to change the header:
Could we create, say, an AUTHORS.md file instead?
I've just enquired as to the possibilities here - will follow up once I have more information.
I'm able to run it without issue with the Numba branch. Would CG support in the simulator be required for moving forward with this PR?
We are in no rush so I would prefer to get it "right" before merging. And I completely understand and respect that you may have other priorities and that this may take a while before CG support in the simulator is pushed to the Numba release and then for it to be installable via our CI. This is all reasonable and I am willing to wait for it. As time moves on, I am considering completely removing the non-CG code in the future once CG is standard and "stable". P.S. I really appreciate all of your help, patience, and willingness to work with us!
It turns out I'm able to remove this, so I've pushed changes that do so. I'm happy not to create an AUTHORS.md file, unless you'd like me to create one?
I think STUMPY is a nice application for CGs, so I'm keen to progress here as much as I can. Once I've got the simulator working with CGs, I'm hoping to build test packages and upload them to my Anaconda channel to make it possible to test this PR out with STUMPY's CI prior to a Numba release supporting CGs.
Thank you! It's great to be able to contribute a little!
Great! Since you are okay with it, let's skip the AUTHORS.md file.
Awesome! Do you think it makes sense to keep this PR in the queue (as "work-in-progress") until the CG support is available in a Numba release? Another thing is that we actually have a very similar/analogous function in gpu_aamp.
Keeping it here as a work-in-progress sounds good, and I've edited the PR title to reflect this. Whether to replace the non-CG parts depends on whether you would be comfortable with STUMPY requiring CC 6.0 or greater - if you are, then keeping the CG versions only should be fine - if not, then keeping both and selecting an appropriate implementation based on the device's compute capability is an option (see the sketch below). Unfortunately I now realise that the CI setup uses pip to install STUMPY and its dependencies, rather than Conda - I don't think I'm able to easily build a wheel for Numba with CGs, so I'll probably have to park this work until a released version of Numba supports CGs.
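A rough sketch of what that selection could look like (the helper name here is hypothetical; the compute_capability attribute is Numba's):

```python
from numba import cuda
from numba.cuda.cudadrv.error import CudaSupportError

def _use_cg_kernels():
    # Hypothetical helper: grid sync (and hence the CG kernels) requires
    # a device with Compute Capability 6.0 or greater.
    try:
        cc = cuda.get_current_device().compute_capability  # e.g. (6, 1) on a GTX 1080 Ti
    except CudaSupportError:
        return False
    return cc >= (6, 0)
```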
I'm happy to update this as part of this PR - I'll make the change when I pick up this work again.
Please excuse my ignorance but when you say "CC 6.0 or greater", is this referring to the CUDA version?
That sounds perfectly reasonable!
Sounds great! Thank you!
Apologies for my lack of clarity - CC is the Compute Capability, a property of the GPU hardware that identifies the CUDA features it supports. Compute Capability 6.0 and above includes, for example, the 10 series (GTX 1060, 1070, etc.), the Quadro GP100, etc. - there is a partial table on Wikipedia of GPUs and their respective compute capabilities. Of the current lineup of NVIDIA hardware, I think requiring CC 6.0 would prevent GPU-STUMP and GPU-AAMP being run on the Jetson Nano and Jetson TX1, but all current / recent desktop / server GPUs would be OK. Note that Numba now requires CUDA toolkit 9.0 (released Sept 2017), so this will also be a minimum requirement for STUMPY using the latest version of Numba.
Thank you for the clarification. Based on what I'm seeing, I think CC 6.0 should be fine (I only have access to a 1080 Ti for testing GPU code, which seems to be right above the CC 6.0 cutoff). I have no intention of supporting Jetson hardware. So, knowing this, I think we can just remove/replace the non-CG parts, and I think the code will look a lot cleaner.
Good to know! Thank you!
@gmarkall FYI, we renamed our "master" branch to "main", so you may need to rename your local branch and update its upstream (e.g., git branch -m master main, then git fetch origin and git branch -u origin/main main).
The simulator in Numba 0.53.1 only supports definition, not overloads.
Codecov Report
@@            Coverage Diff             @@
##              main     #266      +/-   ##
===========================================
- Coverage   100.00%   98.45%    -1.55%
===========================================
  Files           33       33
  Lines         2628     2723       +95
===========================================
+ Hits          2628     2681       +53
- Misses           0       42       +42
Continue to review full report at Codecov.
@gmarkall I wanted to check in and see if there is anything I can do or help with?
@seanlaw Many thanks - I was just working through things on this branch, but I ran out of time yesterday. I'll continue working on this when I get a chance over the next few days and let you know if I get stuck.
@seanlaw I realised my recent failures were only down to the minimum Numba version being 0.48.0 in the branch - I've updated this now. I re-ran the benchmark from the PR description above with Numba 0.53.1, and I get (on Xeon Gold 6128 vs RTX 8000):
So CGs are now only faster for smaller inputs - this may be related to recent efforts to reduce the overhead of launching kernels in Numba. Given this, is the CG variant still of interest?
@gmarkall Thanks for coming back to this. I'd hate to waste your efforts but, based on the numbers, it may not be worth the added code complexity/maintenance, especially if the goal is to try to make things faster for much larger datasets (above 1 million data points). What are your thoughts?
I don't think the efforts are wasted:
So if the overhead isn't worth it for only benefitting small data, I'm happy with closing it. Before we do that, can we check that you see similar (i.e. better performance without CGs) on your setup too, either with the benchmark above, or the one you used in #245 (comment)? |
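For anyone re-running the comparison, a minimal timing harness along these lines should suffice (the input sizes and window length here are illustrative, not the exact benchmark parameters):

```python
import time

import numpy as np
import stumpy

for n in (2**10, 2**14, 2**20):
    T = np.random.rand(n)
    stumpy.gpu_stump(T, m=50)  # warm-up call so JIT compilation is excluded
    start = time.time()
    stumpy.gpu_stump(T, m=50)
    print(f"n={n}: {time.time() - start:.3f} s")
```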
So, I can certainly run the timing tests but it's been a while, so could you please remind me what the setup is? It seems like I no longer have to install a custom version of Numba, correct?
Okay, so using:
I used
Unfortunately, my remote machine went down and I could not check the final run.
Thanks for the update and all your support through this work - I think it seems appropriate to close the PR now... I'll be back if I have any more ideas on how to improve performance :-)
Thank you @gmarkall! It was a pleasure to work with you!
Following on from the discussion in #245, this PR sketches out how _compute_and_update_PI_kernel could be re-written to use Cooperative Groups and grid synchronization to compute the matrix profile and indices within a single kernel launch. Note that this PR requires numba/numba#6245 to be merged before cooperative groups can be used. My intention is that this PR provides a starting point for working out whether it makes sense to use CGs in STUMPY kernels, and how best to go about it; I am happy to update the design / implementation in response to any guidance.

Some points on the implementation: max_cooperative_grid_blocks is used to determine the grid dimension. This needs to be done for each GPU, in case there are different GPUs in the system with different resources.
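As a sketch of that mechanism (adapted from the pattern in Numba's Cooperative Groups documentation; the kernel here is a toy one, compiled eagerly so its overload can be looked up):

```python
from numba import cuda, float64, int64

sig = (float64[::1], int64)

@cuda.jit(sig)
def iterative_kernel(data, n_iters):
    g = cuda.cg.this_grid()
    i = cuda.grid(1)
    for _ in range(n_iters):
        if i < data.size:
            data[i] += 1.0
        g.sync()

blockdim = 128
# Largest cooperative grid the *current* device can launch; this must be
# queried per GPU, since devices in the same system may differ.
max_blocks = iterative_kernel.overloads[sig].max_cooperative_grid_blocks(blockdim)
```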
Running the following benchmark code gives the following outputs with the grm-cg-2 branch of Numba (along with numba/numba#6339 applied to fix a performance regression):

Results obtained with a Quadro RTX 8000 and Xeon Gold 6128. For small inputs, using the CG version is many times faster, because the execution time is nearly all kernel launch overhead. For the larger input, it appears to reduce execution time by around 12% on top of the improvement brought by numba/numba#6042.
Pull Request Checklist
Below is a simple checklist but please do not hesitate to ask for assistance!
- Install black (i.e., python -m pip install black or conda install black)
- Install flake8 (i.e., python -m pip install flake8 or conda install flake8)
- Install pytest-cov (i.e., python -m pip install pytest-cov or conda install pytest-cov)
- Run black . in the root stumpy directory
- Run flake8 . in the root stumpy directory
- Run ./setup.sh && ./test.sh in the root stumpy directory