
[PERF/ENH] Index.intersection does more hashing work than necessary #14487

Open
@wence-

Description

Index.intersection currently performs an inner merge of the unique values of the left and right indices (the deduplication is there so that indices with repeated values don't blow up the memory footprint of the merge). That means a full hash of both indices for the unique step, followed by the merge, which hashes again. Finally, if requested, the result is sorted.
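For reference, the shape of the current approach can be sketched roughly as follows. This uses pandas as a CPU stand-in for cuDF, and `intersection_via_merge` is a hypothetical helper name for illustration, not the actual cuDF implementation.

```python
import pandas as pd


def intersection_via_merge(left: pd.Index, right: pd.Index, sort: bool = False) -> pd.Index:
    """Hypothetical sketch: unique both sides, inner merge, optional sort."""
    # Deduplicate first so repeated keys do not multiply rows in the merge.
    lhs = pd.DataFrame({"key": left.unique()})
    rhs = pd.DataFrame({"key": right.unique()})
    # The inner merge keeps only keys present on both sides.
    merged = lhs.merge(rhs, on="key", how="inner")
    result = pd.Index(merged["key"].to_numpy())
    return result.sort_values() if sort else result


left = pd.Index([3, 1, 3, 2, 5])
right = pd.Index([5, 3, 3, 7])
merged_result = intersection_via_merge(left, right, sort=True)
```

Each of the three steps (two uniques and one merge) builds its own hash structure, which is the redundant work being pointed at here.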

This could be replaced, I think with a positive performance effect, by either:

  • leftsemi join + drop_duplicates
  • libcudf.search.contains + apply_boolean_mask + drop_duplicates
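A minimal sketch of the second option, again with pandas standing in for cuDF: `Index.isin` plays the role of libcudf.search.contains, boolean indexing plays the role of apply_boolean_mask, and `intersection_via_contains` is a hypothetical name. The first option (leftsemi join + drop_duplicates) has the same overall shape, since a left semi join also keeps the left rows whose key appears on the right.

```python
import pandas as pd


def intersection_via_contains(left: pd.Index, right: pd.Index, sort: bool = False) -> pd.Index:
    """Hypothetical sketch: membership mask, filter, then deduplicate."""
    # One membership test of left's values against right (stand-in for "contains").
    mask = left.isin(right)
    # Keep the matching rows of left (stand-in for "apply_boolean_mask"),
    # then drop repeated values from the already-filtered result.
    result = left[mask].drop_duplicates()
    return result.sort_values() if sort else result


left = pd.Index([3, 1, 3, 2, 5])
right = pd.Index([5, 3, 3, 7])
contains_result = intersection_via_contains(left, right)
```

Here only the right-hand side is hashed for the membership test, and the deduplication runs on the filtered values rather than on both full inputs.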

One would have to think through the consequences of either of these wrt any ordering guarantees we might want when sort=False (possibly gated behind pandas-compat mode).
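To make the ordering question concrete, here is a small pandas-based illustration (not cuDF, and `masked_intersection` is a hypothetical helper): with a mask-based approach and no sort, the result follows the order of first occurrence in the left operand, so swapping the operands gives the same set of values in a different order.

```python
import pandas as pd


def masked_intersection(left: pd.Index, right: pd.Index) -> pd.Index:
    """Hypothetical helper: filter left by membership in right, then deduplicate."""
    return left[left.isin(right)].drop_duplicates()


a = pd.Index([5, 3, 1])
b = pd.Index([1, 4, 5])

a_then_b = masked_intersection(a, b)  # values appear in a's order
b_then_a = masked_intersection(b, a)  # values appear in b's order
```

Whether that left-order behaviour matches what users expect under sort=False is exactly the kind of guarantee that would need to be decided.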

This applies mutatis mutandis to MultiIndex.intersection too.
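The same pattern carries over to the multi-level case if membership is tested on whole rows. A pandas-based sketch (a stand-in only; `multiindex_intersection_via_contains` is a hypothetical name, and a cuDF version would presumably test row membership across all key columns at once):

```python
import pandas as pd


def multiindex_intersection_via_contains(left: pd.MultiIndex, right: pd.MultiIndex) -> pd.MultiIndex:
    """Hypothetical sketch: row-wise membership mask, filter, deduplicate."""
    # Each entry is treated as one tuple, so a row matches only if all levels match.
    mask = left.isin(right)
    return left[mask].drop_duplicates()


left_mi = pd.MultiIndex.from_tuples([(1, "a"), (2, "b"), (1, "a"), (3, "c")])
right_mi = pd.MultiIndex.from_tuples([(3, "c"), (1, "a")])
mi_result = multiindex_intersection_via_contains(left_mi, right_mi)
```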

Metadata

Labels: Performance (Performance related issue), Python (Affects Python cuDF API)
