[WIP] Add Cooperative Group variant of GPU-STUMP #266


Closed
gmarkall wants to merge 11 commits

Conversation


@gmarkall gmarkall commented Oct 9, 2020

Following on from the discussion in #245, this PR sketches out how _compute_and_update_PI_kernel could be rewritten to use Cooperative Groups and grid synchronization to compute the matrix profile and indices within a single kernel launch. Note that this PR requires numba/numba#6245 to be merged before Cooperative Groups can be used. My intention is that this PR provides a starting point for working out whether it makes sense to use CGs in STUMPY kernels, and how best to go about it; I am happy to update the design / implementation in response to any guidance.

Some points on the implementation:

  • In the CG variant, the loop over the range is pushed into the cooperative kernel and the entire grid synchronizes at the end of each loop iteration (see the sketch after this list).
  • The choice of whether to use the cooperative kernel is left up to the user, as not all GPUs / systems support cooperative launches. It would be possible to check whether CGs are supported on a device and select the CG / non-CG variant automatically - would that be a better way to implement the decision than leaving it to the user?
  • To ensure successful cooperative launches, the new Numba function max_cooperative_grid_blocks is used to determine the grid dimension. This needs to be done for each GPU, in case the system contains different GPUs with different resources.
  • The test additions attempt to test the CG variant of the kernel in every case in which the original non-CG variant is tested. These tests pass for me locally on a system with a Quadro RTX 8000 and a Quadro P2200.
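
To make the pattern concrete, here is a minimal, self-contained sketch of the Numba API involved (illustrative names, not the actual STUMPY kernel; it assumes a Numba build with CG support, i.e. numba/numba#6245):

from numba import cuda, float64, int64
import numpy as np

sig = (float64[::1], int64)

@cuda.jit(sig)
def iterate_in_one_launch(x, n_iters):
    # Handle to the whole grid, used for grid-wide synchronization
    g = cuda.cg.this_grid()
    i = cuda.grid(1)
    for _ in range(n_iters):
        if i < x.size:
            x[i] += 1.0  # stand-in for one iteration of real work
        g.sync()  # every block waits here; this replaces a per-iteration kernel launch

threads_per_block = 512
# Largest grid that can be co-resident for a cooperative launch; with
# multiple GPUs this must be queried per device.
blocks = iterate_in_one_launch.overloads[sig].max_cooperative_grid_blocks(threads_per_block)

x = np.zeros(10000)
iterate_in_one_launch[blocks, threads_per_block](x, 100)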

Running the following benchmark code:

import stumpy
import numpy as np
import numpy.testing as npt
import sys
import time


def test(N, validate):
    T = np.random.rand(N)
    m = 50

    gpu_start = time.perf_counter()
    gpu_mp = stumpy.gpu_stump(T, m)
    gpu_end = time.perf_counter()

    gpu_cg_start = time.perf_counter()
    gpu_cg_mp = stumpy.gpu_stump(T, m, cooperative=True)
    gpu_cg_end = time.perf_counter()

    if validate:
        cpu_start = time.perf_counter()
        cpu_mp = stumpy.stump(T, m)
        cpu_end = time.perf_counter()

        npt.assert_almost_equal(gpu_mp, cpu_mp)
        npt.assert_almost_equal(gpu_cg_mp, cpu_mp)

        print(f"CPU: {cpu_end - cpu_start}")

    print(f"GPU: {gpu_end - gpu_start}")
    print(f"GPU (CG): {gpu_cg_end - gpu_cg_start}")


if __name__ == '__main__':
    if len(sys.argv) < 2:
        print("Usage: validate.py <N> [validate]")
        sys.exit(1)

    if len(sys.argv) > 2:
        # bool("0") would be True, so compare the argument explicitly
        validate = sys.argv[2] == "1"
    else:
        validate = False

    test(int(sys.argv[1]), validate)

Gives the following outputs with the grm-cg-2 branch of Numba (along with numba/numba#6339 applied to fix a performance regression):

$ python validate.py 10000 1
OMP: Info #274: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
CPU: 9.607369032000861
GPU: 1.263875148999432
GPU (CG): 0.0930071049988328

$ python validate.py 100000 1
OMP: Info #274: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
CPU: 16.42363548299909
GPU: 12.566485560000729
GPU (CG): 2.1037257659991155

$ python validate.py 1000000 1
OMP: Info #274: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
CPU: 725.9084194620009
GPU: 167.21874795799886
GPU (CG): 147.12839111400172

Results obtained with a Quadro RTX 8000 and Xeon Gold 6128. For small inputs, using the CG version is many times faster, because the execution time is nearly all kernel launch overhead. For the larger input, it appears to reduce execution time by around 12% on top of the improvement brought by numba/numba#6042.

Pull Request Checklist

Below is a simple checklist but please do not hesitate to ask for assistance!

  • Fork, clone, and checkout the newest version of the code
  • Create a new branch
  • Make necessary code changes
  • Install black (i.e., python -m pip install black or conda install black)
  • Install flake8 (i.e., python -m pip install flake8 or conda install flake8)
  • Install pytest-cov (i.e., python -m pip install pytest-cov or conda install pytest-cov)
  • Run black . in the root stumpy directory
  • Run flake8 . in the root stumpy directory
  • Run ./setup.sh && ./test.sh in the root stumpy directory
  • Reference a GitHub issue (and create one if one doesn't already exist)

This variant uses a single kernel invocation to compute the matrix
profile and matrix profile indices. The loop over the range is pushed
into the cooperative kernel, and threads synchronize on iterations of
the loop using a grid sync.

The cooperative kernel can be faster than using multiple non-cooperative
launches, because the overhead of launching many individual kernels is
removed.

The choice of whether to use the cooperative kernel is left up to the
user, as not all GPUs / systems support cooperative launches.
@seanlaw
Contributor

seanlaw commented Oct 9, 2020

Thanks, Mark! I will take some time to review it

For now, it looks like the CI is failing due to invalid black formatting

@gmarkall
Author

gmarkall commented Oct 9, 2020

For now, it looks like the CI is failing due to invalid black formatting

I've just tried reformatting with 19.3b0 - I had 19.10b0 installed before, which seems to do things a little differently.

@seanlaw
Contributor

seanlaw commented Oct 9, 2020

@gmarkall Are you able to run the test.sh script locally from the root directory? In addition to unit testing, for test coverage, we set:

export NUMBA_DISABLE_JIT=1
export NUMBA_ENABLE_CUDASIM=1

This ensures that all lines of code are fully covered (it uses the simulated CUDA on CPU threads). Would you mind running that test.sh script?
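
(One note: these need to be in the environment before numba is first imported - exported in the shell before running test.sh, for example - or set at the top of a script, as in this minimal sketch:)

import os

# Must be set before numba (or stumpy, which imports numba) is imported
os.environ["NUMBA_DISABLE_JIT"] = "1"
os.environ["NUMBA_ENABLE_CUDASIM"] = "1"

import stumpy  # picks up the simulator settings above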

@seanlaw
Contributor

seanlaw commented Oct 9, 2020

@gmarkall I hope this is not a big issue but I spoke with our manager of "Office of the Open Source" at TD Ameritrade and we are not able to change the header:

# Copyright 2019 TD Ameritrade, 2020 NVIDIA CORPORATION.

Could we create, say, an AUTHORS.md file or something to acknowledge your contribution instead? I just want to be respectful and make sure that you get what you need out of this as well.

@gmarkall
Author

gmarkall commented Oct 9, 2020

@gmarkall I hope this is not a big issue but I spoke with our manager of "Office of the Open Source" at TD Ameritrade and we are not able to change the header:

# Copyright 2019 TD Ameritrade, 2020 NVIDIA CORPORATION.

Could we create, say, an AUTHORS.md file or something to acknowledge your contribution instead? I just want to be respectful and make sure that you get what you need out of this as well.

I've just enquired as to the possibilities here - will follow-up once I have more information.

@gmarkall
Author

@gmarkall Are you able to run the test.sh script locally from the root directory? In addition to unit testing, for test coverage, we set:

export NUMBA_DISABLE_JIT=1
export NUMBA_ENABLE_CUDASIM=1

This ensures that all lines of code are fully covered (it uses the simulated CUDA on CPU threads). Would you mind running that test.sh script?

I'm able to run it without issue with the Numba branch grm-cg-2 and JIT enabled. However, I can't run it with JIT disabled because the CUDA simulator doesn't support Cooperative Groups - it executes one block to completion before launching another block. This meant it wasn't easy to just add in support for grid sync, and I was originally thinking that it would require quite some reworking to support it. I've an idea for how to get it working, but it might take a little time to try out.

Would CG support in the simulator be required for moving forward with this PR?

@seanlaw
Contributor

seanlaw commented Oct 10, 2020

I'm able to run it without issue with the Numba branch grm-cg-2 and JIT enabled. However, I can't run it with JIT disabled because the CUDA simulator doesn't support Cooperative Groups - it executes one block to completion before launching another block. This meant it wasn't easy to just add in support for grid sync, and I was originally thinking that it would require quite some reworking to support it. I've an idea for how to get it working, but it might take a little time to try out.

Would CG support in the simulator be required for moving forward with this PR?

We are in no rush so I would prefer to get it "right" before merging. And I completely understand and respect that you may have other priorities and that this may take a while before CG support in the simulator is pushed to the Numba release and then for it to be installable via our CI. This is all reasonable and I am willing to wait for it. As time moves on, I am considering completely removing the non-CG code in the future once CG is standard and "stable".

P.S. I really appreciate all of your help, patience, and willingness to work with us!

@gmarkall
Author

@gmarkall I hope this is not a big issue but I spoke with our manager of "Office of the Open Source" at TD Ameritrade and we are not able to change the header:

# Copyright 2019 TD Ameritrade, 2020 NVIDIA CORPORATION.

Could we create, say, an AUTHORS.md file or something to acknowledge your contribution instead? I just want to be respectful and make sure that you get what you need out of this as well.

I've just enquired as to the possibilities here - will follow-up once I have more information.

It turns out I'm able to remove this, so I've pushed changes that do so. I'm happy not to create an AUTHORS.md file, unless you'd like me to create one?

@gmarkall
Author

We are in no rush so I would prefer to get it "right" before merging. And I completely understand and respect that you may have other priorities and that this may take a while before CG support in the simulator is pushed to the Numba release and then for it to be installable via our CI. This is all reasonable and I am willing to wait for it. As time moves on, I am considering completely removing the non-CG code in the future once CG is standard and "stable".

I think STUMPY is a nice application for CGs, so I'm keen to progress here as much as I can. Once I've got the simulator working with CGs, I'm hoping to build test packages and upload them to my Anaconda channel to make it possible to test this PR out with STUMPY's CI prior to a Numba release supporting CGs.

P.S. I really appreciate all of your help, patience, and willingness to work with us!

Thank you! It's great to be able to contribute a little!

@seanlaw
Contributor

seanlaw commented Oct 12, 2020

It turns out I'm able to remove this, so I've pushed changes that do so. I'm happy not to create an AUTHORS.md file, unless you'd like me to create one?

Great! Since you are okay with it, let's skip the AUTHORS.md file

I think STUMPY is a nice application for CGs, so I'm keen to progress here as much as I can. Once I've got the simulator working with CGs, I'm hoping to build test packages and upload them to my Anaconda channel to make it possible to test this PR out with STUMPY's CI prior to a Numba release supporting CGs.

Awesome! Do you think it makes sense to keep this PR in the queue (as "work-in-progress") until the Numba bits are all done and merged and then we can remove/replace all the existing non-CG parts? Also, once it's stable in Numba then maybe we can/should bump the minimum Numba version in our setup.py

Another thing is that we actually have a very similar/analogous function in gpu_aamp.py that could be updated with CG. Would you mind updating that as well? If not, then we can create a new issue to track it and then move that over in the future once this is done.

@gmarkall gmarkall changed the title Add Cooperative Group variant of GPU-STUMP [WIP] Add Cooperative Group variant of GPU-STUMP Oct 13, 2020
@gmarkall
Author

I think STUMPY is a nice application for CGs, so I'm keen to progress here as much as I can. Once I've got the simulator working with CGs, I'm hoping to build test packages and upload them to my Anaconda channel to make it possible to test this PR out with STUMPY's CI prior to a Numba release supporting CGs.

Awesome! Do you think it makes sense to keep this PR in the queue (as "work-in-progress") until the Numba bits are all done and merged and then we can remove/replace all the existing non-CG parts? Also, once it's stable in Numba then maybe we can/should bump the minimum Numba version in our setup.py

Keeping it here as a work-in-progress sounds good, and I've edited the PR title to reflect this. Whether to replace the non-CG parts depends on whether you would be comfortable with STUMPY requiring CC 6.0 or greater - if you are, then keeping the CG versions only should be fine - if not, then keeping both and selecting an appropriate implementation based on the device's compute capability is an option.
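
For illustration, that selection could look something like the following sketch (the CG kernel name here is hypothetical, and checking the compute capability alone is a simplification - driver and platform support also matter):

from numba import cuda

def _cg_supported():
    # Cooperative launches need Compute Capability 6.0+ (Pascal or newer)
    return cuda.is_available() and cuda.get_current_device().compute_capability >= (6, 0)

kernel = _compute_and_update_PI_kernel_cg if _cg_supported() else _compute_and_update_PI_kernel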

Unfortunately I now realise that the CI setup uses pip to install STUMPY and its dependencies, rather than Conda - I don't think I'm able to easily build a wheel for Numba with CGs, so I'll probably have to park this work until a released version of Numba supports CGs.

Another thing is that we actually have a very similar/analogous function in gpu_aamp.py that could be updated with CG. Would you mind updating that as well? If not, then we can create a new issue to track it and then move that over in the future once this is done.

I'm happy to update this as part of this PR - I'll make the change when I pick up this work again.

@gmarkall gmarkall marked this pull request as draft October 19, 2020 09:43
@seanlaw
Contributor

seanlaw commented Oct 19, 2020

Keeping it here as a work-in-progress sounds good, and I've edited the PR title to reflect this. Whether to replace the non-CG parts depends on whether you would be comfortable with STUMPY requiring CC 6.0 or greater - if you are, then keeping the CG versions only should be fine - if not, then keeping both and selecting an appropriate implementation based on the device's compute capability is an option.

Please excuse my ignorance but when you say "CC 6.0 or greater", is this referring to nvcc or the cuda-toolkit version? If it's the cuda-toolkit then I'd say supporting, maybe, 8.0 is as far back as I'd want to go. Ultimately, I want to reduce the code redundancy and branching logic.

Unfortunately I now realise that the CI setup is uses pip to install STUMPY and its dependencies, rather than Conda - I don't think I'm able to easily build a wheel for Numba with CGs, so I'll probably have to park this work until a released version of Numba supports CGs.

That sounds perfectly reasonable!

I'm happy to update this as part of this PR - I'll make the change when I pick up this work again.

Sounds great! Thank you

@gmarkall
Author

Please excuse my ignorance but when you say "CC 6.0 or greater", is this referring to nvcc or the cuda-toolkit version? If it's the cuda-toolkit then I'd say supporting, maybe, 8.0 is as far back as I'd want to go. Ultimately, I want to reduce the code redundancy and branching logic.

Apologies for my lack of clarity - CC is the Compute Capability, a property of the GPU hardware that identifies the CUDA features it supports. Compute Capability 6.0 and above includes, for example, the 10 series (GTX 1060, 1070, etc.), the Quadro GP100, and so on - there is a partial table on Wikipedia of GPUs and their respective compute capabilities.

Of the current lineup of NVIDIA hardware, I think requiring CC 6.0 would prevent GPU-STUMP and GPU-AAMP being run on the Jetson Nano and Jetson TX1, but all current / recent desktop / server GPUs would be OK.

Note that Numba now requires CUDA toolkit 9.0 (released Sept 2017), so this will also be a minimum requirement for STUMPY using the latest version of Numba.
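
For reference, the compute capability of the current device can be queried from Numba (a minimal sketch, assuming a CUDA-capable GPU is present):

from numba import cuda

cc = cuda.get_current_device().compute_capability
print(cc)  # e.g. (6, 1) for a GTX 1080 Ti, (7, 5) for a Quadro RTX 8000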

@seanlaw
Contributor

seanlaw commented Oct 19, 2020

Apologies for my lack of clarity - CC is the Compute Capability, a property of the GPU hardware that identifies the CUDA features it supports. Compute Capability 6.0 and above includes, for example, the 10 series (GTX 1060, 1070, etc.), the Quadro GP100, and so on - there is a partial table on Wikipedia of GPUs and their respective compute capabilities.

Of the current lineup of NVIDIA hardware, I think requiring CC 6.0 would prevent GPU-STUMP and GPU-AAMP being run on the Jetson Nano and Jetson TX1, but all current / recent desktop / server GPUs would be OK.

Thank you for the clarification. Based on what I'm seeing, I think CC 6.0 should be fine (I only have access to a 1080Ti for testing GPU code, which seems to be right above the CC 6.0 cutoff). I have no intentions to support Jetson hardware. So, knowing this, I think we can just remove/replace the non-CG parts and I think the code will look a lot cleaner

Note that Numba now requires CUDA toolkit 9.0 (released Sept 2017), so this will also be a minimum requirement for STUMPY using the latest version of Numba.

Good to know! Thank you

Base automatically changed from master to main January 26, 2021 14:34
@seanlaw
Contributor

seanlaw commented Jan 30, 2021

@gmarkall FYI, we renamed our "master" branch to "main" so you may need to:

git branch -m master main

@codecov-io

codecov-io commented Apr 15, 2021

Codecov Report

Merging #266 (8fc3ae0) into main (b23a984) will decrease coverage by 1.54%.
The diff coverage is 60.00%.


@@             Coverage Diff             @@
##              main     #266      +/-   ##
===========================================
- Coverage   100.00%   98.45%   -1.55%     
===========================================
  Files           33       33              
  Lines         2628     2723      +95     
===========================================
+ Hits          2628     2681      +53     
- Misses           0       42      +42     
Impacted Files         Coverage Δ
stumpy/gpu_aamp.py     77.59% <18.00%> (-22.41%) ⬇️
stumpy/gpu_stump.py    99.47% <98.18%> (-0.53%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update b23a984...8fc3ae0.

@seanlaw
Contributor

seanlaw commented Apr 16, 2021

@gmarkall I wanted to check in and see if there is anything I can do or help with?

@gmarkall
Author

@seanlaw Many thanks - I was just working through things on this branch, but I ran out of time yesterday. I'll continue working on this when I get a chance over the next few days and let you know if I get stuck.

@gmarkall
Author

@seanlaw I realised my recent failures were only down to the minimum Numba version being 0.48.0 in the branch - I've updated this now.

I re-ran the benchmark from the PR description above with Numba 0.53.1, and I get the following (Xeon Gold 6128 and RTX 8000):

$ python benchmark.py 10000 1
CPU: 7.613144517999899
GPU: 1.8043515179997485
GPU (CG): 0.08748542399916914

$ python benchmark.py 100000 1
CPU: 15.842042092001066
GPU: 14.513052711001365
GPU (CG): 2.011685032000969

$ python benchmark.py 1000000 1
CPU: 835.6128378890025
GPU: 138.4262796160001
GPU (CG): 147.96828188799918

So CGs are now faster only for smaller inputs - this may be related to efforts put into reducing the overhead of launching kernels in recent Numba releases. Given this, is the CG variant still of interest?

@seanlaw
Contributor

seanlaw commented May 21, 2021

Given this, is the CG variant still of interest?

@gmarkall Thanks for coming back to this. I'd hate to waste your efforts but, based on the numbers, it may not be worth the added code complexity/maintenance especially if the goal is to try and make things faster for much larger datasets (above 1 million data points).

What are your thoughts?

@gmarkall
Author

What are your thoughts?

I don't think the efforts are wasted:

  • This work motivated the inclusion of support for CGs in Numba, so Numba has gained a new feature even if it's not used here,
  • The fact that launches are a little faster is still useful to STUMPY - it's just that no code changes are needed to take advantage of that,
  • I've learned a lot about STUMP, the matrix profile, and time series data!

So if the overhead isn't worth it for only benefitting small data, I'm happy with closing it.

Before we do that, can we check that you see similar (i.e. better performance without CGs) on your setup too, either with the benchmark above, or the one you used in #245 (comment)?

@seanlaw
Contributor

seanlaw commented May 21, 2021

Before we do that, can we check that you see similar (i.e. better performance without CGs) on your setup too, either with the benchmark above, or the one you used in #245 (comment)?

So, I can certainly run the timing tests but it's been a while so could you please remind me what the setup for GPU (no CG) and GPU (with CG) should be?

It seems like I no longer have to install a custom version of Numba and, in both (no CG and with CG) cases, I can just use numba=0.53.1. The only difference is that the no-CG timing will use the TDAmeritrade/stumpy repo and the with-CG timing will use the https://github.com/gmarkall/stumpy repo (switched over to the grm-cg-stump branch). Does that sound right?

git clone --single-branch --branch grm-cg-stump https://github.com/gmarkall/stumpy.git

@seanlaw
Contributor

seanlaw commented May 21, 2021

Okay, so using:

#!/usr/bin/env python

import stumpy
import numpy as np
import time
from stumpy import config

# Rebinding an imported name would have no effect; set it on the module
config.STUMPY_THREADS_PER_BLOCK = 512

if __name__ == '__main__':
    np.random.seed(0)
    x = np.random.rand(64)
    m = 50
    # Warm-up calls to trigger JIT compilation before timing
    mp = stumpy.gpu_stump(x, m, device_id=[0, 1])
    mp = stumpy.gpu_stump(x, m, device_id=1)
    for n in [10000, 100000, 1000000]:
        x = np.random.rand(n)
        m = 50
        start_time = time.time()
        # mp = stumpy.gpu_stump(x, m, device_id=1)
        mp = stumpy.gpu_stump(x, m, device_id=[0, 1])
        print(n, time.time() - start_time, flush=True)

I used numba=0.53.1 in all cases (installed via conda-forge).

# 2x 1080Ti, TDAmeritrade/stumpy
10000 3.2
100000 12.3
1000000 124.5

# 2x 1080Ti, gmarkall/stumpy
10000 4.1
100000 12.9

Unfortunately, my remote machine went down and I could not check the final run for gmarkall/stumpy at 1000000, but hopefully this is sufficient. Both sets of timings are very similar to the CG numbers from last fall.

@gmarkall
Author

Thanks for the update and all your support through this work - I think it's appropriate to close the PR now... I'll be back if I have any more ideas on how to improve performance :-)

@gmarkall gmarkall closed this May 21, 2021
@seanlaw
Contributor

seanlaw commented May 21, 2021

Thank you @gmarkall! It was a pleasure to work with you
