
Adaptive stress tests when GPU memory capacity is insufficient #3916

Merged: 7 commits, Jul 1, 2021

Conversation

@lowener (Contributor) commented May 31, 2021

Closes #3903.

@github-actions github-actions bot added the Cython / Python Cython or Python issue label May 31, 2021
@dantegd (Member) left a comment

Thanks for the PR, really cool!

I was wondering what you think of an option where, instead of being skipped, the tests get reduced in size to fit on the GPU (probably in a simple manner, like always reducing the number of rows)? So something like:

~ pytest cuml/test --run_stress # skips stress tests that don't fit in the card
~ CUML_ADAPT_STRESS_TESTS=true pytest cuml/test --run_stress # reduces size of stress tests

That could be pretty useful for testing deployment in different environments. What do you think?
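The suggestion above can be sketched as a small decision helper. This is a minimal illustration, not the code merged in this PR: the `adapt_or_skip` name and the `required_gb`/`available_gb` parameters are hypothetical, and only the `CUML_ADAPT_STRESS_TESTS` environment variable comes from the comment itself.

```python
import os


def adapt_or_skip(nrows, required_gb, available_gb):
    """Hypothetical helper: decide how a stress test handles limited GPU memory.

    Returns the (possibly reduced) row count, or None to signal that the
    test should be skipped because the full-size data does not fit.
    """
    if available_gb >= required_gb:
        # Enough memory: run the stress test at full size.
        return nrows
    if os.environ.get("CUML_ADAPT_STRESS_TESTS", "").lower() == "true":
        # Shrink the problem proportionally to the available memory.
        return nrows * available_gb // required_gb
    # Default behavior: skip the test on this card.
    return None
```

With this shape, `pytest cuml/test --run_stress` would skip oversized tests, while setting `CUML_ADAPT_STRESS_TESTS=true` would scale them down instead.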

@@ -30,6 +32,9 @@
import cuml
import pytest

# max_gpu_memory: Capacity of the GPU memory in GB
max_gpu_memory = 0
Member

I think using pytest_configure might be a better choice than the global variable: https://docs.pytest.org/en/6.2.x/reference.html#pytest.hookspec.pytest_configure

@dantegd dantegd added the 2 - In Progress Currently a work in progress label Jun 1, 2021
@lowener lowener changed the title [WIP] Skip stress test when GPU memory capacity is insufficient Skip stress test when GPU memory capacity is insufficient Jun 8, 2021
@lowener lowener marked this pull request as ready for review June 9, 2021 10:26
@lowener lowener requested a review from a team as a code owner June 9, 2021 10:26
@lowener lowener changed the base branch from branch-21.06 to branch-21.08 June 16, 2021 14:52
@dantegd dantegd added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Jun 17, 2021
@lowener (Contributor, Author) commented Jun 21, 2021

Two stress tests are having trouble with the parameter changes introduced in this PR. I'm copying the errors here and will open an issue if needed.

  • test_tsvd.py::test_pca_fit:
_____________________ test_pca_fit[dataframe-data_info1] ______________________

data_info = [9000000, 5000, 30], input_type = 'dataframe'
client = <Client: 'tcp://127.0.0.1:46875' processes=8 threads=8, memory=503.79 GiB>

    @pytest.mark.mg
    @pytest.mark.parametrize("data_info", [unit_param([1000, 20, 30]),
                             stress_param([int(9e6), 5000, 30])])
    @pytest.mark.parametrize("input_type", ["dataframe", "array"])
    def test_pca_fit(data_info, input_type, client):
    
        nrows, ncols, n_parts = data_info
        if nrows == int(9e6) and pytest.max_gpu_memory < 48:
            if pytest.adapt_stress_test:
                nrows = nrows * pytest.max_gpu_memory // 480
                ncols = ncols * pytest.max_gpu_memory // 480
            else:
                pytest.skip("Insufficient GPU memory for this test."
                            "Re-run with 'CUML_ADAPT_STRESS_TESTS=True'")
    
        from cuml.dask.decomposition import TruncatedSVD as daskTPCA
        from sklearn.decomposition import TruncatedSVD
    
        from cuml.dask.datasets import make_blobs
    
        X, _ = make_blobs(n_samples=nrows,
                          n_features=ncols,
                          centers=1,
                          n_parts=n_parts,
                          cluster_std=0.5,
                          random_state=10, dtype=np.float32)
    
        if input_type == "dataframe":
            X_train = to_dask_cudf(X)
            X_cpu = X_train.compute().to_pandas().values
        elif input_type == "array":
            X_train = X
            X_cpu = cp.asnumpy(X_train.compute())
    
        cutsvd = daskTPCA(n_components=5)
        cutsvd.fit(X_train)
    
        sktsvd = TruncatedSVD(n_components=5, algorithm="arpack")
        sktsvd.fit(X_cpu)
    
        all_attr = ['singular_values_', 'components_',
                    'explained_variance_', 'explained_variance_ratio_']
    
        for attr in all_attr:
            with_sign = False if attr in ['components_'] else True
            cuml_res = (getattr(cutsvd, attr))
            if type(cuml_res) == np.ndarray:
                cuml_res = cuml_res.as_matrix()
            skl_res = getattr(sktsvd, attr)
            if attr == 'singular_values_':
>               assert array_equal(cuml_res, skl_res, 1, with_sign=with_sign)
E               assert False
E                +  where False = array_equal(0    13238.003906\n1       92.839333\n2       92.441063\n3       92.416443\n4       92.234329\ndtype: float32, array([13238.004,     0.   ,     0.   ,     0.   ,     0.   ],\n      dtype=float32), 1, with_sign=True)

python/cuml/test/dask/test_tsvd.py:76: AssertionError
________________________ test_pca_fit[array-data_info1] ________________________

data_info = [9000000, 5000, 30], input_type = 'array'
client = <Client: 'tcp://127.0.0.1:46875' processes=8 threads=8, memory=503.79 GiB>

    @pytest.mark.mg
    @pytest.mark.parametrize("data_info", [unit_param([1000, 20, 30]),
                             stress_param([int(9e6), 5000, 30])])
    @pytest.mark.parametrize("input_type", ["dataframe", "array"])
    def test_pca_fit(data_info, input_type, client):
    
        nrows, ncols, n_parts = data_info
        if nrows == int(9e6) and pytest.max_gpu_memory < 48:
            if pytest.adapt_stress_test:
                nrows = nrows * pytest.max_gpu_memory // 4800
                ncols = ncols * pytest.max_gpu_memory // 480
            else:
                pytest.skip("Insufficient GPU memory for this test."
                            "Re-run with 'CUML_ADAPT_STRESS_TESTS=True'")
    
        from cuml.dask.decomposition import TruncatedSVD as daskTPCA
        from sklearn.decomposition import TruncatedSVD
    
        from cuml.dask.datasets import make_blobs
    
        X, _ = make_blobs(n_samples=nrows,
                          n_features=ncols,
                          centers=1,
                          n_parts=n_parts,
                          cluster_std=0.5,
                          random_state=10, dtype=np.float32)
    
        if input_type == "dataframe":
            X_train = to_dask_cudf(X)
            X_cpu = X_train.compute().to_pandas().values
        elif input_type == "array":
            X_train = X
            X_cpu = cp.asnumpy(X_train.compute())
    
        cutsvd = daskTPCA(n_components=5)
        cutsvd.fit(X_train)
    
        sktsvd = TruncatedSVD(n_components=5, algorithm="arpack")
        sktsvd.fit(X_cpu)
    
        all_attr = ['singular_values_', 'components_',
                    'explained_variance_', 'explained_variance_ratio_']
    
        for attr in all_attr:
            with_sign = False if attr in ['components_'] else True
            cuml_res = (getattr(cutsvd, attr))
            if type(cuml_res) == np.ndarray:
                cuml_res = cuml_res.as_matrix()
            skl_res = getattr(sktsvd, attr)
            if attr == 'singular_values_':
>               assert array_equal(cuml_res, skl_res, 1, with_sign=with_sign)
E               assert False
E                +  where False = array_equal(array([13238.004  ,    92.83933,    92.44106,    92.41644,    92.23433],\n      dtype=float32), array([13238.005,     0.   ,     0.   ,     0.   ,     0.   ],\n      dtype=float32), 1, with_sign=True)

python/cuml/test/dask/test_tsvd.py:76: AssertionError
=========================== short test summary info ============================
FAILED python/cuml/test/dask/test_tsvd.py::test_pca_fit[dataframe-data_info1]
FAILED python/cuml/test/dask/test_tsvd.py::test_pca_fit[array-data_info1] - a...
================== 2 failed, 2 skipped, 43 warnings in 16.77s ==================
  • test_mbsgd_regressor.py::test_mbsgd_regressor_vs_skl
=================================== FAILURES ===================================
________ test_mbsgd_regressor_vs_skl[500000-1000-500-f64-constant-none] ________

lrate = 'constant', penalty = 'none'
make_dataset = (250000, <class 'numpy.float64'>, array([[ 0.27115661,  0.28503481, -0.61464179, ..., -0.67867309,
         1.16837549...055],
       [-1060.92529297],
       ...,
       [ -918.22424316],
       [-1065.03662109],
       [ -296.8571167 ]]))

    @pytest.mark.parametrize(
        # Grouped those tests to reduce the total number of individual tests
        # while still keeping good coverage of the different features of MBSGD
        ('lrate', 'penalty'), [
            ('constant', 'none'),
            ('invscaling', 'l1'),
            ('adaptive', 'l2'),
            ('constant', 'elasticnet'),
        ]
    )
    def test_mbsgd_regressor_vs_skl(lrate, penalty, make_dataset):
        nrows, datatype, X_train, X_test, y_train, y_test = make_dataset
    
        if nrows < 500000:
    
            cu_mbsgd_regressor = cumlMBSGRegressor(learning_rate=lrate,
                                                   eta0=0.005, epochs=100,
                                                   fit_intercept=True,
                                                   batch_size=2, tol=0.0,
                                                   penalty=penalty)
    
            cu_mbsgd_regressor.fit(X_train, y_train)
            cu_pred = cu_mbsgd_regressor.predict(X_test)
            cu_r2 = r2_score(cp.asnumpy(cu_pred), cp.asnumpy(y_test),
                             convert_dtype=datatype)
    
            skl_sgd_regressor = SGDRegressor(learning_rate=lrate, eta0=0.005,
                                             max_iter=100, fit_intercept=True,
                                             tol=0.0, penalty=penalty,
                                             random_state=0)
    
            skl_sgd_regressor.fit(cp.asnumpy(X_train), cp.asnumpy(y_train))
            skl_pred = skl_sgd_regressor.predict(cp.asnumpy(X_test))
            skl_r2 = r2_score(skl_pred, cp.asnumpy(y_test),
                              convert_dtype=datatype)
>           assert abs(cu_r2 - skl_r2) <= 0.02
E           assert nan <= 0.02
E            +  where nan = abs((nan - -0.00015029520967990706))

python/cuml/test/test_mbsgd_regressor.py:92: AssertionError
_________ test_mbsgd_regressor_vs_skl[500000-1000-500-f64-adaptive-l2] _________

lrate = 'adaptive', penalty = 'l2'
make_dataset = (250000, <class 'numpy.float64'>, array([[ 0.27115661,  0.28503481, -0.61464179, ..., -0.67867309,
         1.16837549...055],
       [-1060.92529297],
       ...,
       [ -918.22424316],
       [-1065.03662109],
       [ -296.8571167 ]]))

    @pytest.mark.parametrize(
        # Grouped those tests to reduce the total number of individual tests
        # while still keeping good coverage of the different features of MBSGD
        ('lrate', 'penalty'), [
            ('constant', 'none'),
            ('invscaling', 'l1'),
            ('adaptive', 'l2'),
            ('constant', 'elasticnet'),
        ]
    )
    def test_mbsgd_regressor_vs_skl(lrate, penalty, make_dataset):
        nrows, datatype, X_train, X_test, y_train, y_test = make_dataset
    
        if nrows < 500000:
    
            cu_mbsgd_regressor = cumlMBSGRegressor(learning_rate=lrate,
                                                   eta0=0.005, epochs=100,
                                                   fit_intercept=True,
                                                   batch_size=2, tol=0.0,
                                                   penalty=penalty)
    
            cu_mbsgd_regressor.fit(X_train, y_train)
            cu_pred = cu_mbsgd_regressor.predict(X_test)
            cu_r2 = r2_score(cp.asnumpy(cu_pred), cp.asnumpy(y_test),
                             convert_dtype=datatype)
    
            skl_sgd_regressor = SGDRegressor(learning_rate=lrate, eta0=0.005,
                                             max_iter=100, fit_intercept=True,
                                             tol=0.0, penalty=penalty,
                                             random_state=0)
    
            skl_sgd_regressor.fit(cp.asnumpy(X_train), cp.asnumpy(y_train))
            skl_pred = skl_sgd_regressor.predict(cp.asnumpy(X_test))
            skl_r2 = r2_score(skl_pred, cp.asnumpy(y_test),
                              convert_dtype=datatype)
>           assert abs(cu_r2 - skl_r2) <= 0.02
E           assert nan <= 0.02
E            +  where nan = abs((nan - 0.9999999897844838))

python/cuml/test/test_mbsgd_regressor.py:92: AssertionError
_____ test_mbsgd_regressor_vs_skl[500000-1000-500-f64-constant-elasticnet] _____

lrate = 'constant', penalty = 'elasticnet'
make_dataset = (250000, <class 'numpy.float64'>, array([[ 0.27115661,  0.28503481, -0.61464179, ..., -0.67867309,
         1.16837549...055],
       [-1060.92529297],
       ...,
       [ -918.22424316],
       [-1065.03662109],
       [ -296.8571167 ]]))

    @pytest.mark.parametrize(
        # Grouped those tests to reduce the total number of individual tests
        # while still keeping good coverage of the different features of MBSGD
        ('lrate', 'penalty'), [
            ('constant', 'none'),
            ('invscaling', 'l1'),
            ('adaptive', 'l2'),
            ('constant', 'elasticnet'),
        ]
    )
    def test_mbsgd_regressor_vs_skl(lrate, penalty, make_dataset):
        nrows, datatype, X_train, X_test, y_train, y_test = make_dataset
    
        if nrows < 500000:
    
            cu_mbsgd_regressor = cumlMBSGRegressor(learning_rate=lrate,
                                                   eta0=0.005, epochs=100,
                                                   fit_intercept=True,
                                                   batch_size=2, tol=0.0,
                                                   penalty=penalty)
    
            cu_mbsgd_regressor.fit(X_train, y_train)
            cu_pred = cu_mbsgd_regressor.predict(X_test)
            cu_r2 = r2_score(cp.asnumpy(cu_pred), cp.asnumpy(y_test),
                             convert_dtype=datatype)
    
            skl_sgd_regressor = SGDRegressor(learning_rate=lrate, eta0=0.005,
                                             max_iter=100, fit_intercept=True,
                                             tol=0.0, penalty=penalty,
                                             random_state=0)
    
            skl_sgd_regressor.fit(cp.asnumpy(X_train), cp.asnumpy(y_train))
            skl_pred = skl_sgd_regressor.predict(cp.asnumpy(X_test))
            skl_r2 = r2_score(skl_pred, cp.asnumpy(y_test),
                              convert_dtype=datatype)
>           assert abs(cu_r2 - skl_r2) <= 0.02
E           assert nan <= 0.02
E            +  where nan = abs((nan - -4.379805773724321e-05))

python/cuml/test/test_mbsgd_regressor.py:92: AssertionError

@codecov-commenter commented
Codecov Report

❗ No coverage uploaded for pull request base (branch-21.08@c05d7a2).
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-21.08    #3916   +/-   ##
===============================================
  Coverage                ?   85.44%           
===============================================
  Files                   ?      230           
  Lines                   ?    18088           
  Branches                ?        0           
===============================================
  Hits                    ?    15455           
  Misses                  ?     2633           
  Partials                ?        0           
Flag        Coverage Δ
dask        48.04% <0.00%> (?)
non-dask    77.79% <0.00%> (?)

Flags with carried forward coverage won't be shown.
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c05d7a2...35f7ca5.

@dantegd dantegd changed the title Skip stress test when GPU memory capacity is insufficient Adaptive stress tests when GPU memory capacity is insufficient Jul 1, 2021
@dantegd (Member) commented Jul 1, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 73d946d into rapidsai:branch-21.08 Jul 1, 2021
@lowener lowener deleted the 020-stress-gpu-memory branch July 1, 2021 15:29
vimarsh6739 pushed a commit to vimarsh6739/cuml that referenced this pull request Oct 9, 2023
Labels
2 - In Progress: Currently a work in progress
Cython / Python: Cython or Python issue
improvement: Improvement / enhancement to an existing function
non-breaking: Non-breaking change
Development

Successfully merging this pull request may close these issues.

[FEA] Improve stress test support of different GPU memory sizes
3 participants