Potential performance issue: Unreliable performance of .loc in pandas 2.0.3 #923

@TendouArisu

Issue Description:

Hello.
I have found a performance degradation in .loc in pandas 2.0.3 when it is used on a large DataFrame with a non-unique index. When .loc is given more than 4 indexes, its running time increases drastically, by around 1000 times. I noticed that hi-ml-cpath/environment.yml depends on pandas 2.0.3, and I am not sure whether this pandas performance problem affects this repository. There are related discussions on GitHub, including #54550 and #54746.
I also found that hi-ml-cpath/other/slide_image_loading/src/Histopathology/datasets/panda_dataset.py and hi-ml-cpath/src/health_cpath/datasets/panda_tiles_dataset.py use the affected API. Other files may use it as well.
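
To check whether a given environment is affected, a small timing script along the following lines may help. It is only a sketch: the frame size, the label counts, and the helper names `build_frame` and `time_loc` are my own assumptions for illustration, not taken from this repository or from the linked pandas discussions, and the timings will vary by machine and pandas version.

```python
import time

import numpy as np
import pandas as pd


def build_frame(n_rows: int = 100_000, n_labels: int = 1_000) -> pd.DataFrame:
    """Build a DataFrame whose integer index repeats each label (non-unique)."""
    # Hypothetical sizes; every label in range(n_labels) occurs n_rows / n_labels times.
    index = np.arange(n_rows) % n_labels
    return pd.DataFrame({"value": np.arange(n_rows)}, index=index)


def time_loc(df: pd.DataFrame, labels: list) -> float:
    """Return the wall-clock seconds taken by a single df.loc[labels] lookup."""
    start = time.perf_counter()
    df.loc[labels]
    return time.perf_counter() - start


df = build_frame()
print("pandas", pd.__version__, "| index unique:", df.index.is_unique)

# Compare lookups with up to 4 labels against lookups with more than 4 labels.
for n in (1, 4, 5, 8):
    labels = list(range(n))
    print(f"{n} labels: {time_loc(df, labels):.4f} s")
```

If the lookups with 5 or more labels are far slower than those with up to 4, the installed pandas version is probably affected.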

Suggestion

I would recommend upgrading to pandas >= 2.1, or exploring other ways to optimize the performance of .loc.
Any other workarounds or solutions would be greatly appreciated.
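
One possible workaround, if upgrading is not an option, might be to select rows with a boolean mask instead of passing a list of labels to .loc. The sketch below is an assumption on my side: `select_by_mask` is a hypothetical helper, and I have not measured whether it avoids the slowdown on pandas 2.0.3. It also behaves differently from .loc: rows come back in DataFrame order rather than in the order of the labels, missing labels are ignored instead of raising a KeyError, and repeated labels do not repeat rows.

```python
import numpy as np
import pandas as pd


def select_by_mask(df: pd.DataFrame, labels: list) -> pd.DataFrame:
    """Select every row whose index label is in `labels` via a boolean mask."""
    return df[df.index.isin(labels)]


# Small frame with a non-unique index: each label 0..5 appears twice.
df = pd.DataFrame({"value": np.arange(12)}, index=np.arange(12) % 6)
labels = [0, 2, 4, 5, 1]

via_mask = select_by_mask(df, labels)
via_loc = df.loc[labels]

# Both selections contain the same rows, although the row order may differ.
same_rows = sorted(via_mask["value"]) == sorted(via_loc["value"])
print("same rows selected:", same_rows)
```

Whether this is suitable depends on whether the calling code relies on the row order or on the KeyError that .loc raises for missing labels.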
Thank you!
