PERF-#4804: Preserve lengths/widths caches in broadcast_apply_full_axis#6760
PERF-#4804: Preserve lengths/widths caches in broadcast_apply_full_axis#6760dchigarev merged 6 commits intomodin-project:masterfrom
broadcast_apply_full_axis#6760Conversation
broadcast_apply_full_axis
| kw["column_widths"] = [1] | ||
| else: | ||
| if ( | ||
| kw["row_lengths"] is None |
There was a problem hiding this comment.
| kw["row_lengths"] is None | |
| axis == 0 and | |
| kw["row_lengths"] is None |
It seems that we can preserve lengths only for one axis (for the one, that we're applying a full-axis function). Consider this case, where kernel modifies the shapes of partitions without modifying the total length:
from modin.test.storage_formats.pandas.test_internals import construct_modin_df_by_scheme
import pandas
md_df = construct_modin_df_by_scheme(pandas.DataFrame({"a": [1, 2, 3, 4], "b": [3, 4, 5, 6], "c": [6, 7, 8, 9], "d": [0, 1, 2, 3]}), {"row_lengths": [2, 2], "column_widths": [2, 2]})._query_compiler._modin_frame
def func(df):
if df.iloc[0, 0] == 1:
return pandas.DataFrame({"a": [1, 2, 3], "b": [2, 3, 4], "c": [4, 2, 1], "d": [3, 2, 1]})
else:
return pandas.DataFrame({"a": [3], "b": [4], "c": [1], "d": [1]})
res = md_df.apply_full_axis(func=func, axis=1, new_index=[0, 1, 2, 3], new_columns=["a", "b", "c", "d"], keep_partitioning=True)
print(res._row_lengths_cache) # [2, 2]
print(res._column_widths_cache) # [2, 2]
actual_row_lengths = [part.length() for part in res._partitions[:, 0]]
print(actual_row_lengths) # [3, 1]bef002d to
5777085
Compare
| and len(broadcastable_by) == 1 | ||
| and broadcastable_by[0].has_materialized_dtypes | ||
| ): | ||
| new_index = ModinIndex( |
There was a problem hiding this comment.
@dchigarev How is it supposed to take the length from this object in broadcast_apply_full_axis function if it refers to the old modin frame?
There was a problem hiding this comment.
oh, right, I didn't think that the new_index might be used before being passed to the new frame's constructor where the value will be updated...
We can add a logic, that won't use new_index/new_columns for lengths/widths inferring if they return False for .is_materialized. From my side, I'll make a PR passing here None as a value, so in case an index is being used preliminary, it would raise an error, rather than giving an incorrect result
…_apply_full_axis' Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
This reverts commit 5777085.
d287c10 to
1f01445
Compare
| # to avoid problems that may arise when filtering empty dataframes | ||
| and all(r != 0 for r in self._row_lengths_cache) |
There was a problem hiding this comment.
what kind of problems do you mean here?
There was a problem hiding this comment.
The logic for filtering empty partitions is too unpredictable at first glance, so I don’t want to enable length calculations in this mode for now.
Co-authored-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
What do these changes do?
flake8 modin/ asv_bench/benchmarks scripts/doc_checker.pyblack --check modin/ asv_bench/benchmarks scripts/doc_checker.pygit commit -sbroadcast_apply_full_axiscan be implemented to compute caches in some cases #4804docs/development/architecture.rstis up-to-date