[WIP] Add Cooperative Group variant of GPU-STUMP #266


Closed
gmarkall wants to merge 11 commits

Conversation


@gmarkall gmarkall commented Oct 9, 2020

Following on from the discussion in #245, this PR sketches out how _compute_and_update_PI_kernel could be rewritten to use Cooperative Groups and grid synchronization to compute the matrix profile and indices within a single kernel launch. Note that this PR requires numba/numba#6245 to be merged before Cooperative Groups can be used. My intention is that this PR provides a starting point for working out whether it makes sense to use CGs in STUMPY kernels, and how best to go about it; I am happy to update the design / implementation in response to any guidance.

Some points on the implementation:

  • In the CG variant, the loop over the range is pushed into the cooperative kernel and the entire grid synchronizes at the end of each loop iteration (see the sketch after this list).
  • The choice of whether to use the cooperative kernel is left up to the user, as not all GPUs / systems support cooperative launches. It would be possible to check whether CGs are supported on a device and select the CG / non-CG variant automatically - would that be a better way to implement the decision than leaving it to the user?
  • To ensure successful cooperative launches, the new Numba function max_cooperative_grid_blocks is used to determine the grid dimension. This needs to be done for each GPU, in case the system contains different GPUs with different resources.
  • The test additions attempt to test the CG variant of the kernel in every case in which the original non-CG variant is tested. These tests pass for me locally on a system with a Quadro RTX 8000 and a Quadro P2200.
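
To make the pattern concrete, here is a minimal, self-contained sketch of the Numba API involved (illustrative names, not the actual STUMPY kernel; it assumes a Numba build with CG support, i.e. numba/numba#6245):

from numba import cuda, float64, int64
import numpy as np

sig = (float64[::1], int64)

@cuda.jit(sig)
def iterate_in_one_launch(x, n_iters):
    # Handle to the whole grid, used for grid-wide synchronization
    g = cuda.cg.this_grid()
    i = cuda.grid(1)
    for _ in range(n_iters):
        if i < x.size:
            x[i] += 1.0  # stand-in for one iteration of real work
        g.sync()  # every block waits here; this replaces a per-iteration kernel launch

threads_per_block = 512
# Largest grid that can be co-resident for a cooperative launch; with
# multiple GPUs this must be queried per device.
blocks = iterate_in_one_launch.overloads[sig].max_cooperative_grid_blocks(threads_per_block)

x = np.zeros(10000)
iterate_in_one_launch[blocks, threads_per_block](x, 100)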

Running the following benchmark code:

import stumpy
import numpy as np
import numpy.testing as npt
import sys
import time


def test(N, validate):
    T = np.random.rand(N)
    m = 50

    gpu_start = time.perf_counter()
    gpu_mp = stumpy.gpu_stump(T, m)
    gpu_end = time.perf_counter()

    gpu_cg_start = time.perf_counter()
    gpu_cg_mp = stumpy.gpu_stump(T, m, cooperative=True)
    gpu_cg_end = time.perf_counter()

    if validate:
        cpu_start = time.perf_counter()
        cpu_mp = stumpy.stump(T, m)
        cpu_end = time.perf_counter()

        npt.assert_almost_equal(gpu_mp, cpu_mp)
        npt.assert_almost_equal(gpu_cg_mp, cpu_mp)

        print(f"CPU: {cpu_end - cpu_start}")

    print(f"GPU: {gpu_end - gpu_start}")
    print(f"GPU (CG): {gpu_cg_end - gpu_cg_start}")


if __name__ == '__main__':
    if len(sys.argv) < 2:
        print("Usage: validate.py <N> [validate]")
        sys.exit(1)

    if len(sys.argv) > 2:
        # bool("0") would be True, so compare the argument explicitly
        validate = sys.argv[2] == "1"
    else:
        validate = False

    test(int(sys.argv[1]), validate)

Gives the following outputs with the grm-cg-2 branch of Numba (along with numba/numba#6339 applied to fix a performance regression):

$ python validate.py 10000 1
OMP: Info #274: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
CPU: 9.607369032000861
GPU: 1.263875148999432
GPU (CG): 0.0930071049988328

$ python validate.py 100000 1
OMP: Info #274: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
CPU: 16.42363548299909
GPU: 12.566485560000729
GPU (CG): 2.1037257659991155

$ python validate.py 1000000 1
OMP: Info #274: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
CPU: 725.9084194620009
GPU: 167.21874795799886
GPU (CG): 147.12839111400172

Results obtained with a Quadro RTX 8000 and Xeon Gold 6128. For small inputs, using the CG version is many times faster, because the execution time is nearly all kernel launch overhead. For the larger input, it appears to reduce execution time by around 12% on top of the improvement brought by numba/numba#6042.

Pull Request Checklist

Below is a simple checklist but please do not hesitate to ask for assistance!

  • Fork, clone, and checkout the newest version of the code
  • Create a new branch
  • Make necessary code changes
  • Install black (i.e., python -m pip install black or conda install black)
  • Install flake8 (i.e., python -m pip install flake8 or conda install flake8)
  • Install pytest-cov (i.e., python -m pip install pytest-cov or conda install pytest-cov)
  • Run black . in the root stumpy directory
  • Run flake8 . in the root stumpy directory
  • Run ./setup.sh && ./test.sh in the root stumpy directory
  • Reference a GitHub issue (and create one if one doesn't already exist)

This variant uses a single kernel invocation to compute the matrix
profile and matrix profile indices. The loop over the range is pushed
into the cooperative kernel, and threads synchronize on iterations of
the loop using a grid sync.

The cooperative kernel can be faster than using multiple non-cooperative
launches, because the overhead of launching many individual kernels is
removed.

The choice of whether to use the cooperative kernel is left up to the
user, as not all GPUs / systems support cooperative launches.
@seanlaw
Contributor

seanlaw commented Oct 9, 2020

Thanks, Mark! I will take some time to review it

For now, it looks like the CI is failing due to invalid black formatting

@gmarkall
Author

gmarkall commented Oct 9, 2020

For now, it looks like the CI is failing due to invalid black formatting

I've just tried reformatting with 19.3b0 - I had 19.10b0 installed before, which seems to do things a little differently.

@seanlaw
Contributor

seanlaw commented Oct 9, 2020

@gmarkall Are you able to run the test.sh script locally from the root directory? In addition to unit testing, for test coverage, we set:

export NUMBA_DISABLE_JIT=1
export NUMBA_ENABLE_CUDASIM=1

This ensures that all lines of code are fully covered (it uses the simulated CUDA on CPU threads). Would you mind running that test.sh script?
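
(One note: these need to be in the environment before numba is first imported - exported in the shell before running test.sh, for example - or set at the top of a script, as in this minimal sketch:)

import os

# Must be set before numba (or stumpy, which imports numba) is imported
os.environ["NUMBA_DISABLE_JIT"] = "1"
os.environ["NUMBA_ENABLE_CUDASIM"] = "1"

import stumpy  # picks up the simulator settings above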

@seanlaw
Contributor

seanlaw commented Oct 9, 2020

@gmarkall I hope this is not a big issue but I spoke with our manager of "Office of the Open Source" at TD Ameritrade and we are not able to change the header:

# Copyright 2019 TD Ameritrade, 2020 NVIDIA CORPORATION.

Could we create, say, an AUTHORS.md file or something to acknowledge your contribution instead? I just want to be respectful and make sure that you get what you need out of this as well.

@gmarkall
Author

gmarkall commented Oct 9, 2020

@gmarkall I hope this is not a big issue but I spoke with our manager of "Office of the Open Source" at TD Ameritrade and we are not able to change the header:

# Copyright 2019 TD Ameritrade, 2020 NVIDIA CORPORATION.

Could we create, say, an AUTHORS.md file or something to acknowledge your contribution instead? I just want to be respectful and make sure that you get what you need out of this as well.

I've just enquired as to the possibilities here - will follow-up once I have more information.

@gmarkall
Author

@gmarkall Are you able to run the test.sh script locally from the root directory? In addition to unit testing, for test coverage, we set:

export NUMBA_DISABLE_JIT=1
export NUMBA_ENABLE_CUDASIM=1

This ensures that all lines of code are fully covered (it uses the simulated CUDA on CPU threads). Would you mind running that test.sh script?

I'm able to run it without issue with the Numba branch grm-cg-2 and JIT enabled. However, I can't run it with JIT disabled because the CUDA simulator doesn't support Cooperative Groups - it executes one block to completion before launching another block. This meant it wasn't easy to just add in support for grid sync, and I was originally thinking that it would require quite some reworking to support it. I've an idea for how to get it working, but it might take a little time to try out.

Would CG support in the simulator be required for moving forward with this PR?

@seanlaw
Contributor

seanlaw commented Oct 10, 2020

I'm able to run it without issue with the Numba branch grm-cg-2 and JIT enabled. However, I can't run it with JIT disabled because the CUDA simulator doesn't support Cooperative Groups - it executes one block to completion before launching another block. This meant it wasn't easy to just add in support for grid sync, and I was originally thinking that it would require quite some reworking to support it. I've an idea for how to get it working, but it might take a little time to try out.

Would CG support in the simulator be required for moving forward with this PR?

We are in no rush so I would prefer to get it "right" before merging. And I completely understand and respect that you may have other priorities and that this may take a while before CG support in the simulator is pushed to the Numba release and then for it to be installable via our CI. This is all reasonable and I am willing to wait for it. As time moves on, I am considering completely removing the non-CG code in the future once CG is standard and "stable".

P.S. I really appreciate all of your help, patience, and willingness to work with us!

@gmarkall
Author

@gmarkall I hope this is not a big issue but I spoke with our manager of "Office of the Open Source" at TD Ameritrade and we are not able to change the header:

# Copyright 2019 TD Ameritrade, 2020 NVIDIA CORPORATION.

Could we create, say, an AUTHORS.md file or something to acknowledge your contribution instead? I just want to be respectful and make sure that you get what you need out of this as well.

I've just enquired as to the possibilities here - will follow-up once I have more information.

It turns out I'm able to remove this, so I've pushed changes that do so. I'm happy not to create an AUTHORS.md file, unless you'd like me to create one?

@gmarkall
Author

We are in no rush so I would prefer to get it "right" before merging. And I completely understand and respect that you may have other priorities and that this may take a while before CG support in the simulator is pushed to the Numba release and then for it to be installable via our CI. This is all reasonable and I am willing to wait for it. As time moves on, I am considering completely removing the non-CG code in the future once CG is standard and "stable".

I think STUMPY is a nice application for CGs, so I'm keen to progress here as much as I can. Once I've got the simulator working with CGs, I'm hoping to build test packages and upload them to my Anaconda channel to make it possible to test this PR out with STUMPY's CI prior to a Numba release supporting CGs.

P.S. I really appreciate all of your help, patience, and willingness to work with us!

Thank you! It's great to be able to contribute a little!

@seanlaw
Contributor

seanlaw commented Oct 12, 2020

It turns out I'm able to remove this, so I've pushed changes that do so. I'm happy not to create an AUTHORS.md file, unless you'd like me to create one?

Great! Since you are okay with it, let's skip the AUTHORS.md file

I think STUMPY is a nice application for CGs, so I'm keen to progress here as much as I can. Once I've got the simulator working with CGs, I'm hoping to build test packages and upload them to my Anaconda channel to make it possible to test this PR out with STUMPY's CI prior to a Numba release supporting CGs.

Awesome! Do you think it makes sense to keep this PR in the queue (as "work-in-progress") until the Numba bits are all done and merged and then we can remove/replace all the existing non-CG parts? Also, once it's stable in Numba then maybe we can/should bump the minimum Numba version in our setup.py

Another thing is that we actually have a very similar/analogous function in gpu_aamp.py that could be updated with CG. Would you mind updating that as well? If not, then we can create a new issue to track it and then move that over in the future once this is done.

@gmarkall gmarkall changed the title Add Cooperative Group variant of GPU-STUMP [WIP] Add Cooperative Group variant of GPU-STUMP Oct 13, 2020
@gmarkall
Author

I think STUMPY is a nice application for CGs, so I'm keen to progress here as much as I can. Once I've got the simulator working with CGs, I'm hoping to build test packages and upload them to my Anaconda channel to make it possible to test this PR out with STUMPY's CI prior to a Numba release supporting CGs.

Awesome! Do you think it makes sense to keep this PR in the queue (as "work-in-progress") until the Numba bits are all done and merged and then we can remove/replace all the existing non-CG parts? Also, once it's stable in Numba then maybe we can/should bump the minimum Numba version in our setup.py

Keeping it here as a work-in-progress sounds good, and I've edited the PR title to reflect this. Whether to replace the non-CG parts depends on whether you would be comfortable with STUMPY requiring CC 6.0 or greater - if you are, then keeping the CG versions only should be fine - if not, then keeping both and selecting an appropriate implementation based on the device's compute capability is an option.
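
For illustration, that selection could look something like the following sketch (the CG kernel name here is hypothetical, and checking the compute capability alone is a simplification - driver and platform support also matter):

from numba import cuda

def _cg_supported():
    # Cooperative launches need Compute Capability 6.0+ (Pascal or newer)
    return cuda.is_available() and cuda.get_current_device().compute_capability >= (6, 0)

kernel = _compute_and_update_PI_kernel_cg if _cg_supported() else _compute_and_update_PI_kernel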

Unfortunately I now realise that the CI setup uses pip to install STUMPY and its dependencies, rather than Conda - I don't think I'm able to easily build a wheel for Numba with CGs, so I'll probably have to park this work until a released version of Numba supports CGs.

Another thing is that we actually have a very similar/analogous function in gpu_aamp.py that could be updated with CG. Would you mind updating that as well? If not, then we can create a new issue to track it and then move that over in the future once this is done.

I'm happy to update this as part of this PR - I'll make the change when I pick up this work again.

@gmarkall gmarkall marked this pull request as draft October 19, 2020 09:43
@seanlaw
Contributor

seanlaw commented Oct 19, 2020

Keeping it here as a work-in-progress sounds good, and I've edited the PR title to reflect this. Whether to replace the non-CG parts depends on whether you would be comfortable with STUMPY requiring CC 6.0 or greater - if you are, then keeping the CG versions only should be fine - if not, then keeping both and selecting an appropriate implementation based on the device's compute capability is an option.

Please excuse my ignorance but when you say "CC 6.0 or greater", is this referring to nvcc or the cuda-toolkit version? If it's the cuda-toolkit then I'd say supporting, maybe, 8.0 is as far back as I'd want to go. Ultimately, I want to reduce the code redundancy and branching logic.

Unfortunately I now realise that the CI setup is uses pip to install STUMPY and its dependencies, rather than Conda - I don't think I'm able to easily build a wheel for Numba with CGs, so I'll probably have to park this work until a released version of Numba supports CGs.

That sounds perfectly reasonable!

I'm happy to update this as part of this PR - I'll make the change when I pick up this work again.

Sounds great! Thank you

@gmarkall
Author

Please excuse my ignorance but when you say "CC 6.0 or greater", is this referring to nvcc or the cuda-toolkit version? If it's the cuda-toolkit then I'd say supporting, maybe, 8.0 is as far back as I'd want to go. Ultimately, I want to reduce the code redundancy and branching logic.

Apologies for my lack of clarity - CC is the Compute Capability, a property of the GPU hardware that identifies the CUDA features it supports. Compute Capability 6.0 and above includes, for example, the 10 series (GTX 1060, 1070, etc.), the Quadro GP100, and so on - there is a partial table on Wikipedia of GPUs and their respective compute capabilities.

Of the current lineup of NVIDIA hardware, I think requiring CC 6.0 would prevent GPU-STUMP and GPU-AAMP being run on the Jetson Nano and Jetson TX1, but all current / recent desktop / server GPUs would be OK.

Note that Numba now requires CUDA toolkit 9.0 (released Sept 2017), so this will also be a minimum requirement for STUMPY using the latest version of Numba.
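
For reference, the compute capability of the current device can be queried from Numba (a minimal sketch, assuming a CUDA-capable GPU is present):

from numba import cuda

cc = cuda.get_current_device().compute_capability
print(cc)  # e.g. (6, 1) for a GTX 1080 Ti, (7, 5) for a Quadro RTX 8000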

@seanlaw
Contributor

seanlaw commented Oct 19, 2020

Apologies for my lack of clarity - CC is the Compute Capability, a property of the GPU hardware that identifies the CUDA features it supports. Compute Capability 6.0 and above includes, for example, the 10 series (GTX 1060, 1070, etc.), the Quadro GP100, and so on - there is a partial table on Wikipedia of GPUs and their respective compute capabilities.

Of the current lineup of NVIDIA hardware, I think requiring CC 6.0 would prevent GPU-STUMP and GPU-AAMP being run on the Jetson Nano and Jetson TX1, but all current / recent desktop / server GPUs would be OK.

Thank you for the clarification. Based on what I'm seeing, I think CC 6.0 should be fine (I only have access to a 1080Ti for testing GPU code, which seems to be right above the CC 6.0 cutoff). I have no intentions to support Jetson hardware. So, knowing this, I think we can just remove/replace the non-CG parts and I think the code will look a lot cleaner

Note that Numba now requires CUDA toolkit 9.0 (released Sept 2017), so this will also be a minimum requirement for STUMPY using the latest version of Numba.

Good to know! Thank you

Base automatically changed from master to main January 26, 2021 14:34
@seanlaw
Contributor

seanlaw commented Jan 30, 2021

@gmarkall FYI, we renamed our "master" branch to "main" so you may need to:

git branch -m master main

@codecov-io

codecov-io commented Apr 15, 2021

Codecov Report

Merging #266 (8fc3ae0) into main (b23a984) will decrease coverage by 1.54%.
The diff coverage is 60.00%.


@@             Coverage Diff             @@
##              main     #266      +/-   ##
===========================================
- Coverage   100.00%   98.45%   -1.55%     
===========================================
  Files           33       33              
  Lines         2628     2723      +95     
===========================================
+ Hits          2628     2681      +53     
- Misses           0       42      +42     
Impacted Files         Coverage Δ
stumpy/gpu_aamp.py     77.59% <18.00%> (-22.41%) ⬇️
stumpy/gpu_stump.py    99.47% <98.18%> (-0.53%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update b23a984...8fc3ae0.

@seanlaw
Contributor

seanlaw commented Apr 16, 2021

@gmarkall I wanted to check in and see if there is anything I can do or help with?

@gmarkall
Author

@seanlaw Many thanks - I was just working through things on this branch, but I ran out of time yesterday. I'll continue working on this when I get a chance over the next few days and let you know if I get stuck.

@gmarkall
Author

@seanlaw I realised my recent failures were only down to the minimum Numba version being 0.48.0 in the branch - I've updated this now.

I re-ran the benchmark from the PR description above with Numba 0.53.1, and I get the following (Xeon Gold 6128 and RTX 8000):

$ python benchmark.py 10000 1
CPU: 7.613144517999899
GPU: 1.8043515179997485
GPU (CG): 0.08748542399916914

$ python benchmark.py 100000 1
CPU: 15.842042092001066
GPU: 14.513052711001365
GPU (CG): 2.011685032000969

$ python benchmark.py 1000000 1
CPU: 835.6128378890025
GPU: 138.4262796160001
GPU (CG): 147.96828188799918

So CGs are now faster only for smaller inputs - this may be related to efforts put into reducing the overhead of launching kernels in recent Numba releases. Given this, is the CG variant still of interest?

@seanlaw
Contributor

seanlaw commented May 21, 2021

Given this, is the CG variant still of interest?

@gmarkall Thanks for coming back to this. I'd hate to waste your efforts but, based on the numbers, it may not be worth the added code complexity/maintenance especially if the goal is to try and make things faster for much larger datasets (above 1 million data points).

What are your thoughts?

@gmarkall
Author

What are your thoughts?

I don't think the efforts are wasted:

  • This work motivated the inclusion of support for CGs in Numba, so Numba has gained a new feature even if it's not used here,
  • The fact that launches are a little faster is still useful to STUMPY - it's just that no code changes are needed to take advantage of that,
  • I've learned a lot about STUMP, the matrix profile, and time series data!

So if the overhead isn't worth it for only benefitting small data, I'm happy with closing it.

Before we do that, can we check that you see similar (i.e. better performance without CGs) on your setup too, either with the benchmark above, or the one you used in #245 (comment)?

@seanlaw
Contributor

seanlaw commented May 21, 2021

Before we do that, can we check that you see similar (i.e. better performance without CGs) on your setup too, either with the benchmark above, or the one you used in #245 (comment)?

So, I can certainly run the timing tests but it's been a while so could you please remind me what the setup for GPU (no CG) and GPU (with CG) should be?

It seems like I no longer have to install a custom version of Numba and, in both (no CG and with CG) cases, I can just use numba=0.53.1. The only difference is that the no-CG timing will use the TDAmeritrade/stumpy repo and the with-CG timing will use the https://github.com/gmarkall/stumpy repo (switched over to the grm-cg-stump branch). Does that sound right?

git clone --single-branch --branch grm-cg-stump https://github.com/gmarkall/stumpy.git

@seanlaw
Contributor

seanlaw commented May 21, 2021

Okay, so using:

#!/usr/bin/env python

import stumpy
import numpy as np
import time
from stumpy import config

# Rebinding an imported name would have no effect; set it on the module
config.STUMPY_THREADS_PER_BLOCK = 512

if __name__ == '__main__':
    np.random.seed(0)
    x = np.random.rand(64)
    m = 50
    # Warm-up calls to trigger JIT compilation before timing
    mp = stumpy.gpu_stump(x, m, device_id=[0, 1])
    mp = stumpy.gpu_stump(x, m, device_id=1)
    for n in [10000, 100000, 1000000]:
        x = np.random.rand(n)
        m = 50
        start_time = time.time()
        # mp = stumpy.gpu_stump(x, m, device_id=1)
        mp = stumpy.gpu_stump(x, m, device_id=[0, 1])
        print(n, time.time() - start_time, flush=True)

I used numba=0.53.1 in all cases (installed via conda-forge).

# 2x 1080Ti, TDAmeritrade/stumpy
10000 3.2
100000 12.3
1000000 124.5

# 2x 1080Ti, gmarkall/stumpy
10000 4.1
100000 12.9

Unfortunately, my remote machine went down and I could not check the final run for gmarkall/stumpy at 1000000, but hopefully this is sufficient. Both sets of timings are very similar to the CG numbers from last fall.

@gmarkall
Author

Thanks for the update and all your support through this work - I think it's appropriate to close the PR now... I'll be back if I have any more ideas on how to improve performance :-)

@gmarkall gmarkall closed this May 21, 2021
@seanlaw
Contributor

seanlaw commented May 21, 2021

Thank you @gmarkall! It was a pleasure to work with you
