-
Notifications
You must be signed in to change notification settings - Fork 655
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FIX-#7039: Pass scalar dtype as is to astype query compiler #7152
Conversation
Performance comparison for astype with scalar dtype
For Master 4.155833005905151 seconds |
if isinstance(new_dtype, pandas.CategoricalDtype): | ||
new_dtypes[column] = LazyProxyCategoricalDtype._build_proxy( | ||
new_dtypes[:] = LazyProxyCategoricalDtype._build_proxy( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it safe to use the same object in multiple elements of a series?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As it would be the same value for dtype for all elements of the series, What would be an issue using same object?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe we store a reference to the same object in several elements of the series, and if so, then changing one element will entail changing all elements, which can easily lead to errors. However, if this is not the case, then there is no problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we only store dtype values to new_dtypes[:] I think it should be okay as the cases where values of dtypes itself being mutated inplace are very unlikely. If we reassign the value for dtypes eg with new_dtypes[0]="str"
this would also work fine so I think we can keep this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Considering also that this class does not have inplace operations other than materialization, then ok.
# Actual parent will substitute `None` at `.set_dtypes_cache` | ||
parent=None, | ||
column_name=column, | ||
column_name=new_dtypes.index, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This argument can only be a string, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this should take only string have made, have made changes to use apply() function and pass index of the column as a string.
7d8a42e
to
bc7ce11
Compare
modin/experimental/core/execution/native/implementations/hdk_on_native/dataframe/dataframe.py
Show resolved
Hide resolved
CI failed, please recheck. |
b52daf1
to
9621963
Compare
CI is failing because of #7167 so let's wait for it to be merged. |
@anmyachev, any comments? |
@arunjose696, please rebase on current master to make CI pass. |
…iler Signed-off-by: arunjose696 <arunjose696@gmail.com>
Co-authored-by: Iaroslav Igoshev <Poolliver868@mail.ru> Co-authored-by: Anatoly Myachev <anatoliimyachev@mail.com>
handling categorical type Signed-off-by: arunjose696 <arunjose696@gmail.com>
rebased |
@anmyachev, any comments? |
Co-authored-by: Anatoly Myachev <anatoliimyachev@mail.com>
Co-authored-by: Anatoly Myachev <anatoliimyachev@mail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
What do these changes do?
Currently in master, In astype method if the datatype passed is a scalar, the scalar is converted to a dictionary with the keys as column names and values the scalar dtype value for all keys. This causes the exception message to be different in pandas vs modin as the function passed to remote function applies astype with the dictionary instead of scalar.
This PR passes the scalar dtypes as is to QC layer and the scalar is used in remote function for astype. This would also avoid unnecessarily iterating over the column names if dtype is a scalar value.
flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
git commit -s
test_series.py::test_astype
failed due to different exception messages #7039?docs/development/architecture.rst
is up-to-date