Delay setting MultiIndex.level/codes until needed #17728

Merged
merged 7 commits into from
Jan 15, 2025


mroeschke
Contributor

Description

Follow up to #17644

This PR changes MultiIndex to delay computing self._level and self._codes (via factorization of self._column) until a method actually needs them. Previously, those attributes were always computed eagerly to keep the object's state consistent. As discussed offline, the performance benefit of skipping the eager computation was judged more desirable.
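The idea of deferring factorization can be sketched in plain Python. This is a toy illustration only, not cuDF's implementation: `LazyMultiIndex`, its `factorize` helper, the `_levels_and_codes` property, and the `factorize_calls` counter are all hypothetical names, and the sorted-uniques behavior of the helper is an assumption made for the demo.

```python
from functools import cached_property


def factorize(values):
    """Return (codes, uniques): uniques are the sorted distinct values and
    codes[i] is the position of values[i] within uniques."""
    uniques = sorted(set(values))
    position = {value: i for i, value in enumerate(uniques)}
    codes = [position[value] for value in values]
    return codes, uniques


class LazyMultiIndex:
    """Toy stand-in for a MultiIndex (hypothetical, not cuDF's class)."""

    def __init__(self, columns):
        # Construction only stores the raw key columns; nothing is factorized.
        self._columns = [list(column) for column in columns]
        self.factorize_calls = 0  # instrumentation for the demo only

    @cached_property
    def _levels_and_codes(self):
        # Runs on first access; the result is then cached on the instance.
        self.factorize_calls += 1
        levels, codes = [], []
        for column in self._columns:
            column_codes, column_uniques = factorize(column)
            levels.append(column_uniques)
            codes.append(column_codes)
        return levels, codes

    @property
    def levels(self):
        return self._levels_and_codes[0]

    @property
    def codes(self):
        return self._levels_and_codes[1]


index = LazyMultiIndex([[3, 1, 3, 2], ["b", "a", "b", "b"]])
assert index.factorize_calls == 0  # nothing computed at construction
print(index.codes)   # first access triggers factorization
print(index.levels)  # served from the cache
assert index.factorize_calls == 1
```

Code paths that never read `levels` or `codes` skip the factorization cost entirely in this sketch. The trade-off is that the object's derived state does not exist until first access.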

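import cudf

# Build a 50M-row frame with three integer grouping keys and an integer target.
df_train = cudf.datasets.randomdata(nrows=50_000_000, dtypes={"label": int, "weekday": int, "cat_2": int, "brand": int})
target = "label"
col = ["weekday", "cat_2", "brand"]
df_gb = df_train[col + [target]].groupby(col)

# The two lines below go in their own notebook cell, with %%time as its first line.
%%time
df_gb.agg(["mean", "count"])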


# PR
CPU times: user 144 ms, sys: 23.9 ms, total: 168 ms
Wall time: 166 ms

  _     ._   __/__   _ _  _  _ _/_   Recorded: 12:12:46  Samples:  4
 /_//_/// /_\ / //_// / //_'/ //     Duration: 0.159     CPU time: 0.160
/   _/                      v5.0.0

Cell [3]

0.158 <module>  <ipython-input-3-ee51998a643a>:1
`- 0.158 wrapper  cudf/utils/performance_tracking.py:30
   `- 0.158 DataFrameGroupBy.agg  cudf/core/groupby/groupby.py:879
      |- 0.129 DataFrameGroupBy._aggregate  cudf/core/groupby/groupby.py:789
      `- 0.029 NumericalColumn.astype  cudf/core/column/column.py:1126
         `- 0.029 NumericalColumn.as_numerical_column  cudf/core/column/numerical.py:428
            `- 0.029 inner  contextlib.py:78
               `- 0.029 cast  cudf/core/_internals/unary.py:40


# branch 25.02 head
CPU times: user 369 ms, sys: 105 ms, total: 474 ms
Wall time: 478 ms


  _     ._   __/__   _ _  _  _ _/_   Recorded: 12:11:20  Samples:  79
 /_//_/// /_\ / //_// / //_'/ //     Duration: 0.480     CPU time: 0.478
/   _/                      v5.0.0

Cell [3]

0.479 <module>  <ipython-input-3-ee51998a643a>:1
`- 0.478 wrapper  cudf/utils/performance_tracking.py:30
   `- 0.478 DataFrameGroupBy.agg  cudf/core/groupby/groupby.py:879
      |- 0.267 cached_property.__get__  functools.py:979
      |  `- 0.267 _Grouping.keys  cudf/core/groupby/groupby.py:3534
      |     `- 0.267 wrapper  cudf/utils/performance_tracking.py:30
      |        `- 0.267 MultiIndex._from_data  cudf/core/multiindex.py:344
      |           `- 0.265 _compute_levels_and_codes  cudf/core/multiindex.py:67
      |              |- 0.230 factorize  cudf/core/algorithms.py:22
      |              |  |- 0.168 NumericalColumn._label_encoding  cudf/core/column/column.py:1516
      |              |  |  |- 0.097 [self]  cudf/core/column/column.py
      |              |  |  |- 0.031 as_column  cudf/core/column/column.py:1948
      |              |  |  |  `- 0.028 NumericalColumn.astype  cudf/core/column/column.py:1126
      |              |  |  |     `- 0.028 NumericalColumn.as_numerical_column  cudf/core/column/numerical.py:428
      |              |  |  |        `- 0.028 inner  contextlib.py:78
      |              |  |  |           `- 0.028 cast  cudf/core/_internals/unary.py:40
      |              |  |  |- 0.030 inner  contextlib.py:78
      |              |  |  |  `- 0.030 sort_by_key  cudf/core/_internals/sorting.py:160
      |              |  |  `- 0.007 NumericalColumn.take  cudf/core/column/column.py:943
      |              |  |     `- 0.007 inner  contextlib.py:78
      |              |  |        `- 0.007 gather  cudf/core/_internals/copying.py:18
      |              |  |- 0.051 NumericalColumn.unique  cudf/core/column/column.py:1342
      |              |  |  |- 0.036 inner  contextlib.py:78
      |              |  |  |  `- 0.036 drop_duplicates  cudf/core/_internals/stream_compaction.py:82
      |              |  |  `- 0.015 NumericalColumn.is_unique  cudf/core/column/column.py:1056
      |              |  |     `- 0.015 NumericalColumn.distinct_count  cudf/core/column/column.py:1108
      |              |  `- 0.006 NumericalColumn.dropna  cudf/core/column/column.py:294
      |              |     `- 0.006 NumericalColumn.copy  cudf/core/column/column.py:481
      |              `- 0.028 _compile_module_with_cache  cupy/cuda/compiler.py:473
      |                    [4 frames hidden]  cupy, <built-in>
      |- 0.133 DataFrameGroupBy._aggregate  cudf/core/groupby/groupby.py:789
      `- 0.079 wrapper  cudf/utils/performance_tracking.py:30
         `- 0.079 MultiIndex._from_columns_like_self  cudf/core/frame.py:194
            `- 0.079 wrapper  cudf/utils/performance_tracking.py:30
               |- 0.040 MultiIndex._copy_type_metadata  cudf/core/multiindex.py:2101
               |  `- 0.038 _compute_levels_and_codes  cudf/core/multiindex.py:67
               |     `- 0.038 factorize  cudf/core/algorithms.py:22
               |        |- 0.021 NumericalColumn._label_encoding  cudf/core/column/column.py:1516
               |        |  |- 0.010 [self]  cudf/core/column/column.py
               |        |  `- 0.007 inner  contextlib.py:78
               |        |     `- 0.007 sort_by_key  cudf/core/_internals/sorting.py:160
               |        `- 0.011 NumericalColumn.unique  cudf/core/column/column.py:1342
               |           `- 0.010 inner  contextlib.py:78
               |              `- 0.010 drop_duplicates  cudf/core/_internals/stream_compaction.py:82
               `- 0.039 MultiIndex._from_data  cudf/core/multiindex.py:344
                  `- 0.038 _compute_levels_and_codes  cudf/core/multiindex.py:67
                     `- 0.038 factorize  cudf/core/algorithms.py:22
                        |- 0.021 NumericalColumn._label_encoding  cudf/core/column/column.py:1516
                        |  |- 0.010 [self]  cudf/core/column/column.py
                        |  `- 0.007 inner  contextlib.py:78
                        |     `- 0.007 sort_by_key  cudf/core/_internals/sorting.py:160
                        `- 0.010 NumericalColumn.unique  cudf/core/column/column.py:1342
                           `- 0.010 inner  contextlib.py:78
                              `- 0.010 drop_duplicates  cudf/core/_internals/stream_compaction.py:82
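As a rough reading of the numbers above, the following sketch computes the wall-time speedup and the share of the baseline profile spent in `_compute_levels_and_codes`. The inputs are copied from the timings and profile trees in this description. The variable names are illustrative, and because the profiler samples, the share is only an approximation.

```python
# Wall times reported above, in milliseconds.
pr_wall_ms = 166
base_wall_ms = 478

speedup = base_wall_ms / pr_wall_ms
saved_ms = base_wall_ms - pr_wall_ms

# Seconds attributed to _compute_levels_and_codes in the baseline profile:
# one call under _Grouping.keys and two under _from_columns_like_self.
factorization_s = [0.265, 0.038, 0.038]
baseline_total_s = 0.479

eager_s = sum(factorization_s)
eager_share = eager_s / baseline_total_s

print(f"speedup: {speedup:.2f}x, saved: {saved_ms} ms")
print(f"eager levels/codes: {eager_s:.3f} s ({eager_share:.0%} of baseline)")
```

On these figures the PR is about 2.9x faster for this aggregation, and the eager levels/codes computation accounts for roughly 70% of the baseline profile.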

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@mroeschke added the labels Python (Affects Python cuDF API), improvement (Improvement / enhancement to an existing function), and non-breaking (Non-breaking change) on Jan 13, 2025
@mroeschke self-assigned this on Jan 13, 2025
@mroeschke requested a review from a team as a code owner on January 13, 2025 22:13
@mroeschke requested review from bdice and Matt711 on January 13, 2025 22:13
Contributor

@vyasr left a comment

This is great, thanks Matt!

@mroeschke
Contributor Author

/merge

@rapids-bot (bot) merged commit 960c723 into rapidsai:branch-25.02 on Jan 15, 2025
105 of 106 checks passed
@mroeschke deleted the perf/mi/lazy branch on January 15, 2025 02:10