FIX-#5650: Restore the right dtype for applying Series.cat #5651

Merged
merged 2 commits into from
Feb 14, 2023

YarShev
Collaborator

@YarShev YarShev commented Feb 13, 2023

Signed-off-by: Igoshev, Iaroslav iaroslav.igoshev@intel.com

What do these changes do?

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves BUG: AttributeError: Can only use .cat accessor with a 'category' dtype #5650
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date

Signed-off-by: Igoshev, Iaroslav <iaroslav.igoshev@intel.com>
@YarShev YarShev marked this pull request as ready for review February 13, 2023 22:18
@YarShev YarShev requested a review from a team as a code owner February 13, 2023 22:18
Collaborator

@dchigarev dchigarev left a comment

Looks good as a temporary solution; I left a couple of comments.

# `pd.concat` doesn't preserve categorical dtype
# if the dfs have categorical columns
# so we intentionally restore the right dtype.
# TODO: revert the change when https://github.com/pandas-dev/pandas/issues/51362 is fixed.
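
For readers unfamiliar with the behavior this code comment refers to, here is a minimal sketch in plain pandas. It assumes one well-known case in which concatenation drops the categorical dtype, namely pieces whose categories differ; this may not be the exact scenario tracked in pandas-dev/pandas#51362, and it is not the code changed in this PR. The frames `left` and `right` are hypothetical stand-ins for per-partition results.

```python
import pandas as pd

# Two pieces of one logical column, each with its own set of categories
# (hypothetical data, standing in for per-partition results).
left = pd.DataFrame({"col": pd.Series(["a", "b"], dtype="category")})
right = pd.DataFrame({"col": pd.Series(["c", "a"], dtype="category")})

combined = pd.concat([left, right], ignore_index=True)

# The categories differ between the pieces, so the concatenated column is
# no longer categorical and `combined["col"].cat` would raise AttributeError.
is_categorical_after_concat = isinstance(combined["col"].dtype, pd.CategoricalDtype)

# Restore the categorical dtype explicitly before using the .cat accessor.
restored = combined["col"].astype("category")
codes = restored.cat.codes.tolist()
```

Casting back with `astype("category")` rebuilds the categories from the concatenated values, which is why it works only as a workaround: the intermediate non-categorical column still has to be materialized first.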
Collaborator

We've already raised a similar issue in Modin; we should probably link it here as well: #2513

Collaborator Author

This PR doesn't fix #2513; it only fixes the series.cat.* methods. So I think we can close #2513 once pandas has a fix on their side, unless we get a P0 request to fix #2513 before then.

modin/core/storage_formats/pandas/query_compiler.py Outdated
Co-authored-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
Collaborator

@dchigarev dchigarev left a comment

Looks good for now, but we should watch pandas-dev/pandas#51362 closely or, if it gets stuck, consider the other possible solutions in #2513. For now, the categorical dtype does not seem to serve its purpose of saving memory when faced with a column-wise operation, which may result in excessive memory consumption and out-of-memory issues.
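
To illustrate the memory concern raised above, here is a rough sketch that compares the footprint of the same values stored as object and as categorical in plain pandas. The data is hypothetical and the exact byte counts depend on the pandas version and platform, so only the relative comparison is meaningful.

```python
import pandas as pd

# A low-cardinality column: three distinct strings repeated many times
# (hypothetical data chosen to make the difference visible).
values = ["red", "green", "blue"] * 10_000

as_object = pd.Series(values, dtype=object)
as_category = as_object.astype("category")

# deep=True counts the Python string objects, not just the pointers.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
```

If a column-wise operation falls back to the object representation before the dtype is restored, the larger of the two footprints is what has to fit in memory at that point.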

@anmyachev anmyachev merged commit ab33797 into modin-project:master Feb 14, 2023
@YarShev YarShev deleted the dev/yigoshev-issue5650 branch November 2, 2023 15:35
Successfully merging this pull request may close these issues.

BUG: AttributeError: Can only use .cat accessor with a 'category' dtype
3 participants