
Adaptive stress tests when GPU memory capacity is insufficient #3916

Merged: 7 commits, Jul 1, 2021

Conversation

@lowener (Contributor) commented May 31, 2021

Closes #3903.

@github-actions github-actions bot added the Cython / Python Cython or Python issue label May 31, 2021
@dantegd (Member) left a comment

Thanks for the PR, really cool!

I was wondering what you think of an option where, instead of being skipped, the tests get reduced in size to fit on the GPU (probably in a simple manner, like always reducing the number of rows)? So something like:

~ pytest cuml/test --run_stress # skips stress tests that don't fit in the card
~ CUML_ADAPT_STRESS_TESTS=true pytest cuml/test --run_stress # reduces size of stress tests

That could be pretty useful for testing deployment in different environments. What do you think?
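The suggestion above can be sketched as a small decision helper. This is a minimal illustration, not the code merged in this PR: the `adapt_or_skip` name and the `required_gb`/`available_gb` parameters are hypothetical, and only the `CUML_ADAPT_STRESS_TESTS` environment variable comes from the comment itself.

```python
import os


def adapt_or_skip(nrows, required_gb, available_gb):
    """Hypothetical helper: decide how a stress test handles limited GPU memory.

    Returns the (possibly reduced) row count, or None to signal that the
    test should be skipped because the full-size data does not fit.
    """
    if available_gb >= required_gb:
        # Enough memory: run the stress test at full size.
        return nrows
    if os.environ.get("CUML_ADAPT_STRESS_TESTS", "").lower() == "true":
        # Shrink the problem proportionally to the available memory.
        return nrows * available_gb // required_gb
    # Default behavior: skip the test on this card.
    return None
```

With this shape, `pytest cuml/test --run_stress` would skip oversized tests, while setting `CUML_ADAPT_STRESS_TESTS=true` would scale them down instead.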

@@ -30,6 +32,9 @@
import cuml
import pytest

# max_gpu_memory: Capacity of the GPU memory in GB
max_gpu_memory = 0
Member

I think using pytest_configure might be a better choice than the global variable: https://docs.pytest.org/en/6.2.x/reference.html#pytest.hookspec.pytest_configure

@dantegd dantegd added the 2 - In Progress Currently a work in progress label Jun 1, 2021
@lowener lowener changed the title [WIP] Skip stress test when GPU memory capacity is insufficient Skip stress test when GPU memory capacity is insufficient Jun 8, 2021
@lowener lowener marked this pull request as ready for review June 9, 2021 10:26
@lowener lowener requested a review from a team as a code owner June 9, 2021 10:26
@lowener lowener changed the base branch from branch-21.06 to branch-21.08 June 16, 2021 14:52
@dantegd dantegd added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Jun 17, 2021
@lowener (Contributor, Author) commented Jun 21, 2021

Two stress tests are having trouble with the parameter changes introduced in this PR. I'm copying the errors here and will open an issue if needed.

  • test_tsvd.py::test_pca_fit:
_____________________ test_pca_fit[dataframe-data_info1] ______________________

data_info = [9000000, 5000, 30], input_type = 'dataframe'
client = <Client: 'tcp://127.0.0.1:46875' processes=8 threads=8, memory=503.79 GiB>

    @pytest.mark.mg
    @pytest.mark.parametrize("data_info", [unit_param([1000, 20, 30]),
                             stress_param([int(9e6), 5000, 30])])
    @pytest.mark.parametrize("input_type", ["dataframe", "array"])
    def test_pca_fit(data_info, input_type, client):
    
        nrows, ncols, n_parts = data_info
        if nrows == int(9e6) and pytest.max_gpu_memory < 48:
            if pytest.adapt_stress_test:
                nrows = nrows * pytest.max_gpu_memory // 480
                ncols = ncols * pytest.max_gpu_memory // 480
            else:
                pytest.skip("Insufficient GPU memory for this test."
                            "Re-run with 'CUML_ADAPT_STRESS_TESTS=True'")
    
        from cuml.dask.decomposition import TruncatedSVD as daskTPCA
        from sklearn.decomposition import TruncatedSVD
    
        from cuml.dask.datasets import make_blobs
    
        X, _ = make_blobs(n_samples=nrows,
                          n_features=ncols,
                          centers=1,
                          n_parts=n_parts,
                          cluster_std=0.5,
                          random_state=10, dtype=np.float32)
    
        if input_type == "dataframe":
            X_train = to_dask_cudf(X)
            X_cpu = X_train.compute().to_pandas().values
        elif input_type == "array":
            X_train = X
            X_cpu = cp.asnumpy(X_train.compute())
    
        cutsvd = daskTPCA(n_components=5)
        cutsvd.fit(X_train)
    
        sktsvd = TruncatedSVD(n_components=5, algorithm="arpack")
        sktsvd.fit(X_cpu)
    
        all_attr = ['singular_values_', 'components_',
                    'explained_variance_', 'explained_variance_ratio_']
    
        for attr in all_attr:
            with_sign = False if attr in ['components_'] else True
            cuml_res = (getattr(cutsvd, attr))
            if type(cuml_res) == np.ndarray:
                cuml_res = cuml_res.as_matrix()
            skl_res = getattr(sktsvd, attr)
            if attr == 'singular_values_':
>               assert array_equal(cuml_res, skl_res, 1, with_sign=with_sign)
E               assert False
E                +  where False = array_equal(0    13238.003906\n1       92.839333\n2       92.441063\n3       92.416443\n4       92.234329\ndtype: float32, array([13238.004,     0.   ,     0.   ,     0.   ,     0.   ],\n      dtype=float32), 1, with_sign=True)

python/cuml/test/dask/test_tsvd.py:76: AssertionError
________________________ test_pca_fit[array-data_info1] ________________________

data_info = [9000000, 5000, 30], input_type = 'array'
client = <Client: 'tcp://127.0.0.1:46875' processes=8 threads=8, memory=503.79 GiB>

    @pytest.mark.mg
    @pytest.mark.parametrize("data_info", [unit_param([1000, 20, 30]),
                             stress_param([int(9e6), 5000, 30])])
    @pytest.mark.parametrize("input_type", ["dataframe", "array"])
    def test_pca_fit(data_info, input_type, client):
    
        nrows, ncols, n_parts = data_info
        if nrows == int(9e6) and pytest.max_gpu_memory < 48:
            if pytest.adapt_stress_test:
                nrows = nrows * pytest.max_gpu_memory // 4800
                ncols = ncols * pytest.max_gpu_memory // 480
            else:
                pytest.skip("Insufficient GPU memory for this test."
                            "Re-run with 'CUML_ADAPT_STRESS_TESTS=True'")
    
        from cuml.dask.decomposition import TruncatedSVD as daskTPCA
        from sklearn.decomposition import TruncatedSVD
    
        from cuml.dask.datasets import make_blobs
    
        X, _ = make_blobs(n_samples=nrows,
                          n_features=ncols,
                          centers=1,
                          n_parts=n_parts,
                          cluster_std=0.5,
                          random_state=10, dtype=np.float32)
    
        if input_type == "dataframe":
            X_train = to_dask_cudf(X)
            X_cpu = X_train.compute().to_pandas().values
        elif input_type == "array":
            X_train = X
            X_cpu = cp.asnumpy(X_train.compute())
    
        cutsvd = daskTPCA(n_components=5)
        cutsvd.fit(X_train)
    
        sktsvd = TruncatedSVD(n_components=5, algorithm="arpack")
        sktsvd.fit(X_cpu)
    
        all_attr = ['singular_values_', 'components_',
                    'explained_variance_', 'explained_variance_ratio_']
    
        for attr in all_attr:
            with_sign = False if attr in ['components_'] else True
            cuml_res = (getattr(cutsvd, attr))
            if type(cuml_res) == np.ndarray:
                cuml_res = cuml_res.as_matrix()
            skl_res = getattr(sktsvd, attr)
            if attr == 'singular_values_':
>               assert array_equal(cuml_res, skl_res, 1, with_sign=with_sign)
E               assert False
E                +  where False = array_equal(array([13238.004  ,    92.83933,    92.44106,    92.41644,    92.23433],\n      dtype=float32), array([13238.005,     0.   ,     0.   ,     0.   ,     0.   ],\n      dtype=float32), 1, with_sign=True)

python/cuml/test/dask/test_tsvd.py:76: AssertionError
=========================== short test summary info ============================
FAILED python/cuml/test/dask/test_tsvd.py::test_pca_fit[dataframe-data_info1]
FAILED python/cuml/test/dask/test_tsvd.py::test_pca_fit[array-data_info1] - a...
================== 2 failed, 2 skipped, 43 warnings in 16.77s ==================
  • test_mbsgd_regressor.py::test_mbsgd_regressor_vs_skl
=================================== FAILURES ===================================
________ test_mbsgd_regressor_vs_skl[500000-1000-500-f64-constant-none] ________

lrate = 'constant', penalty = 'none'
make_dataset = (250000, <class 'numpy.float64'>, array([[ 0.27115661,  0.28503481, -0.61464179, ..., -0.67867309,
         1.16837549...055],
       [-1060.92529297],
       ...,
       [ -918.22424316],
       [-1065.03662109],
       [ -296.8571167 ]]))

    @pytest.mark.parametrize(
        # Grouped those tests to reduce the total number of individual tests
        # while still keeping good coverage of the different features of MBSGD
        ('lrate', 'penalty'), [
            ('constant', 'none'),
            ('invscaling', 'l1'),
            ('adaptive', 'l2'),
            ('constant', 'elasticnet'),
        ]
    )
    def test_mbsgd_regressor_vs_skl(lrate, penalty, make_dataset):
        nrows, datatype, X_train, X_test, y_train, y_test = make_dataset
    
        if nrows < 500000:
    
            cu_mbsgd_regressor = cumlMBSGRegressor(learning_rate=lrate,
                                                   eta0=0.005, epochs=100,
                                                   fit_intercept=True,
                                                   batch_size=2, tol=0.0,
                                                   penalty=penalty)
    
            cu_mbsgd_regressor.fit(X_train, y_train)
            cu_pred = cu_mbsgd_regressor.predict(X_test)
            cu_r2 = r2_score(cp.asnumpy(cu_pred), cp.asnumpy(y_test),
                             convert_dtype=datatype)
    
            skl_sgd_regressor = SGDRegressor(learning_rate=lrate, eta0=0.005,
                                             max_iter=100, fit_intercept=True,
                                             tol=0.0, penalty=penalty,
                                             random_state=0)
    
            skl_sgd_regressor.fit(cp.asnumpy(X_train), cp.asnumpy(y_train))
            skl_pred = skl_sgd_regressor.predict(cp.asnumpy(X_test))
            skl_r2 = r2_score(skl_pred, cp.asnumpy(y_test),
                              convert_dtype=datatype)
>           assert abs(cu_r2 - skl_r2) <= 0.02
E           assert nan <= 0.02
E            +  where nan = abs((nan - -0.00015029520967990706))

python/cuml/test/test_mbsgd_regressor.py:92: AssertionError
_________ test_mbsgd_regressor_vs_skl[500000-1000-500-f64-adaptive-l2] _________

lrate = 'adaptive', penalty = 'l2'
make_dataset = (250000, <class 'numpy.float64'>, array([[ 0.27115661,  0.28503481, -0.61464179, ..., -0.67867309,
         1.16837549...055],
       [-1060.92529297],
       ...,
       [ -918.22424316],
       [-1065.03662109],
       [ -296.8571167 ]]))

    @pytest.mark.parametrize(
        # Grouped those tests to reduce the total number of individual tests
        # while still keeping good coverage of the different features of MBSGD
        ('lrate', 'penalty'), [
            ('constant', 'none'),
            ('invscaling', 'l1'),
            ('adaptive', 'l2'),
            ('constant', 'elasticnet'),
        ]
    )
    def test_mbsgd_regressor_vs_skl(lrate, penalty, make_dataset):
        nrows, datatype, X_train, X_test, y_train, y_test = make_dataset
    
        if nrows < 500000:
    
            cu_mbsgd_regressor = cumlMBSGRegressor(learning_rate=lrate,
                                                   eta0=0.005, epochs=100,
                                                   fit_intercept=True,
                                                   batch_size=2, tol=0.0,
                                                   penalty=penalty)
    
            cu_mbsgd_regressor.fit(X_train, y_train)
            cu_pred = cu_mbsgd_regressor.predict(X_test)
            cu_r2 = r2_score(cp.asnumpy(cu_pred), cp.asnumpy(y_test),
                             convert_dtype=datatype)
    
            skl_sgd_regressor = SGDRegressor(learning_rate=lrate, eta0=0.005,
                                             max_iter=100, fit_intercept=True,
                                             tol=0.0, penalty=penalty,
                                             random_state=0)
    
            skl_sgd_regressor.fit(cp.asnumpy(X_train), cp.asnumpy(y_train))
            skl_pred = skl_sgd_regressor.predict(cp.asnumpy(X_test))
            skl_r2 = r2_score(skl_pred, cp.asnumpy(y_test),
                              convert_dtype=datatype)
>           assert abs(cu_r2 - skl_r2) <= 0.02
E           assert nan <= 0.02
E            +  where nan = abs((nan - 0.9999999897844838))

python/cuml/test/test_mbsgd_regressor.py:92: AssertionError
_____ test_mbsgd_regressor_vs_skl[500000-1000-500-f64-constant-elasticnet] _____

lrate = 'constant', penalty = 'elasticnet'
make_dataset = (250000, <class 'numpy.float64'>, array([[ 0.27115661,  0.28503481, -0.61464179, ..., -0.67867309,
         1.16837549...055],
       [-1060.92529297],
       ...,
       [ -918.22424316],
       [-1065.03662109],
       [ -296.8571167 ]]))

    @pytest.mark.parametrize(
        # Grouped those tests to reduce the total number of individual tests
        # while still keeping good coverage of the different features of MBSGD
        ('lrate', 'penalty'), [
            ('constant', 'none'),
            ('invscaling', 'l1'),
            ('adaptive', 'l2'),
            ('constant', 'elasticnet'),
        ]
    )
    def test_mbsgd_regressor_vs_skl(lrate, penalty, make_dataset):
        nrows, datatype, X_train, X_test, y_train, y_test = make_dataset
    
        if nrows < 500000:
    
            cu_mbsgd_regressor = cumlMBSGRegressor(learning_rate=lrate,
                                                   eta0=0.005, epochs=100,
                                                   fit_intercept=True,
                                                   batch_size=2, tol=0.0,
                                                   penalty=penalty)
    
            cu_mbsgd_regressor.fit(X_train, y_train)
            cu_pred = cu_mbsgd_regressor.predict(X_test)
            cu_r2 = r2_score(cp.asnumpy(cu_pred), cp.asnumpy(y_test),
                             convert_dtype=datatype)
    
            skl_sgd_regressor = SGDRegressor(learning_rate=lrate, eta0=0.005,
                                             max_iter=100, fit_intercept=True,
                                             tol=0.0, penalty=penalty,
                                             random_state=0)
    
            skl_sgd_regressor.fit(cp.asnumpy(X_train), cp.asnumpy(y_train))
            skl_pred = skl_sgd_regressor.predict(cp.asnumpy(X_test))
            skl_r2 = r2_score(skl_pred, cp.asnumpy(y_test),
                              convert_dtype=datatype)
>           assert abs(cu_r2 - skl_r2) <= 0.02
E           assert nan <= 0.02
E            +  where nan = abs((nan - -4.379805773724321e-05))

python/cuml/test/test_mbsgd_regressor.py:92: AssertionError

@codecov-commenter commented
Codecov Report

❗ No coverage uploaded for pull request base (branch-21.08@c05d7a2).
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-21.08    #3916   +/-   ##
===============================================
  Coverage                ?   85.44%           
===============================================
  Files                   ?      230           
  Lines                   ?    18088           
  Branches                ?        0           
===============================================
  Hits                    ?    15455           
  Misses                  ?     2633           
  Partials                ?        0           
Flag        Coverage Δ
dask        48.04% <0.00%> (?)
non-dask    77.79% <0.00%> (?)

Flags with carried forward coverage won't be shown.
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c05d7a2...35f7ca5.

@dantegd dantegd changed the title Skip stress test when GPU memory capacity is insufficient Adaptive stress tests when GPU memory capacity is insufficient Jul 1, 2021
@dantegd (Member) commented Jul 1, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 73d946d into rapidsai:branch-21.08 Jul 1, 2021
@lowener lowener deleted the 020-stress-gpu-memory branch July 1, 2021 15:29
vimarsh6739 pushed a commit to vimarsh6739/cuml that referenced this pull request Oct 9, 2023
Labels
2 - In Progress: Currently a work in progress
Cython / Python: Cython or Python issue
improvement: Improvement / enhancement to an existing function
non-breaking: Non-breaking change
Development

Successfully merging this pull request may close these issues.

[FEA] Improve stress test support of different GPU memory sizes
3 participants