
KNN Imputer class and dependency functionalities #4820

Open

wants to merge 10,000 commits into base: branch-23.02
Conversation

SreekiranprasadV
Contributor

@SreekiranprasadV SreekiranprasadV commented Jul 18, 2022

Merge PR #4797 before merging this one; the functionalities required here are in #4797.

Created a draft PR and added the KNN Imputer class and dependency functionalities for imputation of missing values.

Supported inputs: NumPy arrays, pandas DataFrames, CuPy arrays, cuDF DataFrames

Tested on: Tesla T4 Single GPU
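For reference, the idea behind KNN imputation can be sketched in plain NumPy. This is a naive O(n²) illustration, not the cuML implementation; the `knn_impute` helper is hypothetical, and its NaN-aware distance scaling is modeled on scikit-learn's `nan_euclidean` metric:

```python
import numpy as np

def knn_impute(X, n_neighbors=2):
    """Fill NaNs with the mean of the n_neighbors nearest rows.

    Distances use only coordinates observed in both rows, scaled the
    way scikit-learn's nan_euclidean metric scales them.
    """
    X = X.astype(float).copy()
    out = X.copy()
    n, d = X.shape
    for i in range(n):
        miss = np.isnan(X[i])
        if not miss.any():
            continue
        dists = np.full(n, np.inf)
        for j in range(n):
            if j == i:
                continue
            both = ~np.isnan(X[i]) & ~np.isnan(X[j])
            if not both.any():
                continue
            diff = X[i, both] - X[j, both]
            # nan_euclidean scaling: sqrt(d / #observed * sum of sq. diffs)
            dists[j] = np.sqrt(d / both.sum() * (diff ** 2).sum())
        for col in np.where(miss)[0]:
            donors = np.where(~np.isnan(X[:, col]) & np.isfinite(dists))[0]
            nearest = donors[np.argsort(dists[donors])][:n_neighbors]
            if nearest.size:
                out[i, col] = X[nearest, col].mean()
    return out

X = np.array([[1.0, 2.0], [3.0, 4.0], [np.nan, 6.0]])
# row 2, column 0 becomes 2.0, the mean of the two nearest rows
print(knn_impute(X, n_neighbors=2))
```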

Time latency:

Tested on NumPy arrays with 25% of the data masked, the distance metric averaged over neighbors, and the number of columns set to 100.

| Data Points | cuML | scikit-learn |
|-------------|--------|--------------|
| 100,000     | 0.513 s | 0.383 s |
| 1M          | 10.5 s  | 36.1 s  |
| 10M         | 105 s   | 373 s   |

Tested on NumPy arrays with 1% of the data masked, the distance metric averaged over neighbors, and the number of columns set to 100.

| Data Points | cuML | scikit-learn |
|-------------|--------|--------------|
| 100,000     | 0.217 s | 0.208 s |
| 1M          | 2.86 s  | 7.73 s  |
| 10M         | 10.2 s  | 122 s   |
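Timings of this shape can be reproduced with a small harness along these lines. The `make_masked_data` and `time_imputer` helpers are hypothetical, and the imputer object is whatever class is being measured (e.g. scikit-learn's `KNNImputer` for the CPU column):

```python
import time
import numpy as np

def make_masked_data(n_rows, n_cols=100, frac_missing=0.25, seed=0):
    """Random matrix with roughly frac_missing of its entries set to NaN."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_rows, n_cols))
    mask = rng.random((n_rows, n_cols)) < frac_missing
    X[mask] = np.nan
    return X

def time_imputer(imputer, X):
    """Wall-clock a single fit_transform call."""
    start = time.perf_counter()
    imputer.fit_transform(X)
    return time.perf_counter() - start

X = make_masked_data(1000, frac_missing=0.25)
assert abs(np.isnan(X).mean() - 0.25) < 0.02
```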

Profiling on 1 million records:

```
    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      100    0.491    0.005    0.570    0.006 {method 'argpartition' of 'cupy._core.core.ndarray' objects}
     3561    0.197    0.000    0.213    0.000 /nvme/1/svadaga/miniconda3/envs/cuml_dev/lib/python3.9/site-packages/rmm/rmm.py:212(rmm_cupy_allocator)
        1    0.149    0.149    1.078    1.078 /nvme/1/svadaga/miniconda3/envs/cuml_dev/lib/python3.9/site-packages/cuml/_thirdparty/sklearn/preprocessing/_imputation.py:951(transform)
      2/1    0.087    0.044    0.161    0.161 /nvme/1/svadaga/miniconda3/envs/cuml_dev/lib/python3.9/site-packages/cuml/internals/api_decorators.py:453(inner_with_getters)
        3    0.056    0.019    0.064    0.021 {method 'dot' of 'cupy._core.core.ndarray' objects}
      201    0.024    0.000    0.039    0.000 {method 'nonzero' of 'cupy._core.core.ndarray' objects}
      100    0.014    0.000    0.621    0.006 /nvme/1/svadaga/miniconda3/envs/cuml_dev/lib/python3.9/site-packages/cuml/_thirdparty/sklearn/preprocessing/_imputation.py:863(_calc_impute)
      200    0.005    0.000    0.009    0.000 {built-in method cupy._core._routines_math._nansum}
     3562    0.005    0.000    0.010    0.000 cuda/cudart.pyx:10521(cudaGetDevice)
      101    0.004    0.000    0.009    0.000 /nvme/1/svadaga/miniconda3/envs/cuml_dev/lib/python3.9/site-packages/cupy/_creation/ranges.py:9(arange)
     3562    0.004    0.000    0.014    0.000 /nvme/1/svadaga/miniconda3/envs/cuml_dev/lib/python3.9/site-packages/rmm/_cuda/gpu.py:53(getDevice)
      200    0.004    0.000    0.008    0.000 {method 'take' of 'cupy._core.core.ndarray' objects}
      100    0.003    0.000    0.006    0.000 {method 'all' of 'cupy._core.core.ndarray' objects}
     3562    0.003    0.000    0.005    0.000 /nvme/1/svadaga/miniconda3/envs/cuml_dev/lib/python3.9/enum.py:358(__call__)
      103    0.003    0.000    0.005    0.000 {method 'any' of 'cupy._core.core.ndarray' objects}
      616    0.002    0.000    0.014    0.000 /nvme/1/svadaga/miniconda3/envs/cuml_dev/lib/python3.9/site-packages/cupy/_creation/basic.py:7(empty)
     3561    0.002    0.000    0.002    0.000 {built-in method cupy.cuda.stream.get_current_stream}
     3562    0.002    0.000    0.002    0.000 /nvme/1/svadaga/miniconda3/envs/cuml_dev/lib/python3.9/enum.py:670(__new__)
     2107    0.002    0.000    0.004    0.000 /nvme/1/svadaga/miniconda3/envs/cuml_dev/lib/python3.9/site-packages/numpy/core/numeric.py:1858(isscalar)
      7/1    0.002    0.000    1.080    1.080 /nvme/1/svadaga/miniconda3/envs/cuml_dev/lib/python3.9/site-packages/cuml/internals/api_decorators.py:357(inner)
```

CuPy's built-in functionalities account for most of the time.
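A listing like the one above comes from Python's built-in `cProfile`/`pstats`; a minimal sketch, with a stand-in workload in place of the actual `transform` call:

```python
import cProfile
import io
import pstats

def work():
    # stand-in for imputer.transform(X) in the numbers above
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

buf = io.StringIO()
# same columns as the listing above: ncalls, tottime, percall, cumtime
pstats.Stats(profiler, stream=buf).sort_stats("tottime").print_stats(5)
report = buf.getvalue()
print(report)
```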

ajschmidt8 and others added 30 commits November 17, 2021 13:41
Implementing LinearSVM using the existing QN solvers.

Authors:
  - Artem M. Chirkin (https://github.com/achirkin)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)
  - Robert Maynard (https://github.com/robertmaynard)
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: rapidsai#4268
[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]
[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]
Closes rapidsai#3846 

Adds support for exogenous variables to ARIMA.
All series in the batch must have the same number of exogenous variables, and exogenous variables are not shared across the batch (`exog` therefore has `n_exog * batch_size` columns).

Example:
```python
model = ARIMA(endog=df_endog, exog=df_exog_past, order=(1,0,1),
              seasonal_order=(1,1,1,12), fit_intercept=True,
              simple_differencing=False)
model.fit()
fc, lower, upper = model.forecast(40, exog=df_exog_future, level=0.95)
```

![2021-09-22_exog_fc](https://user-images.githubusercontent.com/17441062/134339807-f815a7a3-98dc-49e5-8599-9607e660597a.png)

Authors:
  - Louis Sugy (https://github.com/Nyrio)
  - Tamas Bela Feher (https://github.com/tfeher)

Approvers:
  - AJ Schmidt (https://github.com/ajschmidt8)
  - Tamas Bela Feher (https://github.com/tfeher)
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: rapidsai#4221
[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]
Addresses rapidsai#4110

This is an experimental prototype. For now, it supports:
* XGBoost models with numerical splits
* cuML RF regressors with numerical splits

cuML RF classifiers are not supported.

Authors:
  - Philip Hyunsu Cho (https://github.com/hcho3)

Approvers:
  - Rory Mitchell (https://github.com/RAMitchell)
  - William Hicks (https://github.com/wphicks)
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: rapidsai#4351
[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]
[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]
This upgrade is required to be in-line with: rapidsai/cudf#9716

Depends on: rapidsai/integration#390

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Dante Gama Dessavre (https://github.com/dantegd)
  - Ray Douglass (https://github.com/raydouglass)

URL: rapidsai#4372
[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]
Fix Changelog Merge Conflicts for `branch-21.12`
[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]
Changes to be in-line with: rapidsai/cudf#9734

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Dante Gama Dessavre (https://github.com/dantegd)
  - AJ Schmidt (https://github.com/ajschmidt8)

URL: rapidsai#4390
[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]
[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]
[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]
…idsai#4400)

PR uses project flash to build the cuML Python package mirroring what the C++ flow looks like.

Note: currently this is only changed for the CUDA 11.0 GPU test, since that one uses Python 3.7; to cover the other jobs we would need to build the Python package twice in the CPU job.
[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]
Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - AJ Schmidt (https://github.com/ajschmidt8)
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: rapidsai#4396
…#4382)

Suggest using LinearSVM when the user chooses to use the linear kernel in SVM. The reason is that LinearSVM uses a specialized faster solver.

Closes rapidsai#1664
Also partially addresses rapidsai#2857

Authors:
  - Artem M. Chirkin (https://github.com/achirkin)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: rapidsai#4382
…ai#4405)

There were actually two minor issues that prevented `UMAPAlgo::Optimize::find_params_ab()` from being ASAN-clean at the moment:

- One is the memory leaks, of course
- The other is the `malloc()`/`delete` mismatch -- only memory allocated using `new` or equivalent should be freed with operator `delete` or `delete[]`

Another issue that was also addressed here: exception safety (i.e., by using `make_unique` from C++14)

Signed-off-by: Yitao Li <yitao@rstudio.com>

Authors:
  - Yitao Li (https://github.com/yitao-li)

Approvers:
  - Zach Bjornson (https://github.com/zbjornson)
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#4405
P_sum is equal to n. See rapidsai#2622 where I made this change once before. rapidsai#4208 changed it back while consolidating code.

Authors:
  - Zach Bjornson (https://github.com/zbjornson)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#4425
Member

@ajschmidt8 ajschmidt8 left a comment


Approving ops-codeowner file changes

@SreekiranprasadV
Contributor Author

rerun tests

Fixes issue rapidsai#2387.

For large data sizes, the batch size of the DBSCAN algorithm is small in order to fit the distance matrix in memory.

This results in a matrix that has dimensions num_points x batch_size, both for the distance and adjacency matrix.

The conversion of the boolean adjacency matrix to CSR format is performed in the 'adjgraph' step. This step was slow when the batch size was small, as described in issue rapidsai#2387.

In this commit, the adjgraph step is sped up. This is done in two ways:

1. The adjacency matrix is now stored in row-major batch_size x num_points format --- it was transposed before. This required changes in the vertexdeg step.

2. The csr_row_op kernel has been replaced by the adj_to_csr kernel. This kernel can divide the work over multiple blocks even when the number of rows (batch size) is small. It makes optimal use of memory bandwidth because rows of the matrix are laid out contiguously in memory.
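The conversion itself can be sketched in NumPy. This is an illustration of the CSR layout the kernel produces, not the CUDA code; `adj_to_csr` here is a single-threaded stand-in for the kernel, which splits each row's scan across many thread blocks:

```python
import numpy as np

def adj_to_csr(adj):
    """Convert a row-major boolean adjacency matrix (batch_size x
    num_points) to CSR form: row offsets plus column indices.

    Because rows are contiguous in memory, each row scan is a
    sequential read, which is what makes the GPU version coalesced.
    """
    degrees = adj.sum(axis=1)                     # the vertexdeg step
    indptr = np.concatenate(([0], np.cumsum(degrees)))
    indices = np.flatnonzero(adj) % adj.shape[1]  # column index per True
    return indptr, indices

adj = np.array([[True, False, True],
                [False, True, False]])
indptr, indices = adj_to_csr(adj)
print(indptr.tolist(), indices.tolist())  # → [0, 2, 3] [0, 2, 1]
```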

Authors:
  - Allard Hendriksen (https://github.com/ahendriksen)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)
  - Tamas Bela Feher (https://github.com/tfeher)

URL: rapidsai#4803
@SreekiranprasadV
Contributor Author

rerun tests

Allard Hendriksen added 2 commits July 25, 2022 19:31
This functionality has been moved to RAFT.

Authors:
  - Allard Hendriksen (https://github.com/ahendriksen)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#4829
…4804)

This PR removes the naive versions of the DBSCAN algorithms. They were not used anymore and were largely incorrect, as described in rapidsai#3414. 

This fixes issue rapidsai#3414.

Authors:
  - Allard Hendriksen (https://github.com/ahendriksen)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#4804
@beckernick beckernick added feature request New feature or request non-breaking Non-breaking change labels Jul 26, 2022
@SreekiranprasadV
Contributor Author

rerun tests

SreekiranprasadV and others added 8 commits July 26, 2022 11:25
Pass the `NVTX` option to raft in a way more similar to the other arguments, and make sure the `RAFT_NVTX` option is set in the installed `raft-config.cmake`.

Authors:
  - Artem M. Chirkin (https://github.com/achirkin)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)
  - Robert Maynard (https://github.com/robertmaynard)

URL: rapidsai#4825
The conda recipe was updated to UCX 1.13.0 in rapidsai#4809, but the conda environment files were not updated there.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Jordan Jacobelli (https://github.com/Ethyling)

URL: rapidsai#4813
Allows cuML to be installed with CuPy 11.

xref: rapidsai/integration#508

Authors:
  - https://github.com/jakirkham

Approvers:
  - Sevag H (https://github.com/sevagh)
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: rapidsai#4837
@SreekiranprasadV
Contributor Author

rerun tests

1 similar comment
@SreekiranprasadV
Contributor Author

rerun tests

@dantegd dantegd changed the base branch from branch-22.08 to branch-22.10 August 31, 2022 17:59
@codecov-commenter

Codecov Report

Base: 77.62% // Head: 78.24% // Increases project coverage by +0.61% 🎉

Coverage data is based on head (e629e77) compared to base (dc77d6b).
Patch coverage: 81.81% of modified lines in pull request are covered.

Additional details and impacted files
```
@@               Coverage Diff                @@
##           branch-22.10    #4820      +/-   ##
================================================
+ Coverage         77.62%   78.24%   +0.61%
================================================
  Files               180      181       +1
  Lines             11384    11610     +226
================================================
+ Hits               8837     9084     +247
+ Misses             2547     2526      -21
```
| Flag | Coverage Δ |
|------|------------|
| dask | 46.27% <14.39%> (+0.75%) ⬆️ |
| non-dask | 67.70% <81.81%> (+0.43%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
|----------------|------------|
| python/cuml/_thirdparty/sklearn/neighbors/_base.py | 66.66% <66.66%> (ø) |
| ...l/_thirdparty/sklearn/preprocessing/_imputation.py | 85.71% <84.90%> (-0.62%) ⬇️ |
| ...cuml/_thirdparty/sklearn/preprocessing/__init__.py | 100.00% <100.00%> (ø) |
| python/cuml/metrics/__init__.py | 100.00% <100.00%> (ø) |
| python/cuml/common/array.py | 97.21% <0.00%> (-0.78%) ⬇️ |
| python/cuml/cluster/__init__.py | 100.00% <0.00%> (ø) |
| python/cuml/feature_extraction/_vectorizers.py | 89.93% <0.00%> (+0.37%) ⬆️ |
| python/cuml/common/import_utils.py | 59.82% <0.00%> (+0.85%) ⬆️ |
| python/cuml/thirdparty_adapters/adapters.py | 92.99% <0.00%> (+1.50%) ⬆️ |
| .../dask/extended/linear_model/logistic_regression.py | 92.00% <0.00%> (+57.33%) ⬆️ |


☔ View full report at Codecov.

@github-actions

This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.

@dantegd dantegd changed the base branch from branch-22.10 to branch-23.02 December 8, 2022 11:35
@ajschmidt8 ajschmidt8 requested review from a team as code owners February 13, 2023 18:56
Labels

- Cython / Python (Cython or Python issue)
- feature request (New feature or request)
- gpuCI (gpuCI issue)
- inactive-30d
- non-breaking (Non-breaking change)