[DataFrame] Update architecture to be more flexible and performant #1821
Setting up pandas concordant constructor
adding column partitions to Dataframe architecture
using copy to create temp ray df
resolve merge issues
renaming rebased variables to new names
rename cols, rows to _col_partitions and _row_partitions
Adding in the col_partitions and row_partitions properties
Resolving some flake8 issues
Shuffle on either axis
added WIP comments to shuffle.py
Modifications to shuffle actor
drop on shuffle axis
implemented rows_to_cols
WIP implement transpose using col and row partitions
Rebuild columns, implement index calculation
* zipped index calculations
* rename _index and _length to _row*
* add _col_index and _col_length (currently broken)
* pass index & columns in transpose
resolving rebase from ray/master
rearranging utils to match ray/master
reimplement groupby and update_inplace
implemented _rebuild_rows
fix import issue
update map functions to new architecture
cast to df in sum and fix empty partition index
adhoc fix for sum
any/all impl
Resolving additional functions to architecture change
any/all index
join impl
Implement __delitem__ WIP (untested)
fix for when columns are non-duplicate
replace columns with _col_index
implemented inplace
implemented insert, handtested
uncomment tests
* Adding fixes
* Adding iterrows update

* Fix items
* Removing debug code

* changes to min/max dataframes functions
* max/min now return a Series - fixed tests to check equality in pandas series objects
* added error checking for axis in min and max
* updated error checking for axis in min/max

* Update __delitem__ with row/col_partitions
* Fix __neg__

…l(), and count() (ray-project#8)
* cleanup and fixing eval
* fixed eval and ffill dataframe functions
* changing _col_index entries during eval
* small change to documentation for applymap()
Test PASSed.
python/ray/dataframe/dataframe.py
Outdated
_row_partitions = property(_get_row_partitions, _set_row_partitions)

def _get_col_partitions(self):
    @ray.remote
avoid defining remote function here
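The reviewer does not spell out the reason, but one plausible concern is that a decorator applied inside a method runs again on every call, whereas a module-level definition is decorated once. The sketch below illustrates only that general Python behavior. `fake_remote` is a hypothetical stand-in, not Ray's API, and the class and function names are invented for illustration.

```python
# Sketch only: `fake_remote` is a hypothetical stand-in for a decorator such as
# ray.remote that does some registration work each time it wraps a function.
registrations = []


def fake_remote(func):
    registrations.append(func.__name__)  # pretend this step is expensive
    return func


class NestedDefinition:
    def col_lengths(self, partitions):
        @fake_remote  # the decorator re-runs on every call to col_lengths
        def length(part):
            return len(part)

        return [length(p) for p in partitions]


@fake_remote  # the decorator runs once, when the module is loaded
def partition_length(part):
    return len(part)


class ModuleLevelDefinition:
    def col_lengths(self, partitions):
        return [partition_length(p) for p in partitions]


parts = [[1, 2], [3]]
for _ in range(3):
    NestedDefinition().col_lengths(parts)
    ModuleLevelDefinition().col_lengths(parts)

nested_count = registrations.count("length")
module_count = registrations.count("partition_length")
print(nested_count, module_count)
```

Both classes return the same lengths; the difference is only in how often the decorator's setup work is repeated.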
python/ray/dataframe/dataframe.py
Outdated
@@ -401,7 +661,12 @@ def isnull(self):
True: cell contains null.
False: otherwise.
"""
return self._map_partitions(lambda df: df.isnull)
new_blk_partitions = np.array([_map_partitions(
blk -> block everywhere would be preferable
@@ -1143,6 +1161,8 @@ def test_fillna_sanity(num_partitions=2):

zero_filled = test_data.tsframe.fillna(0)
ray_df = from_pandas(test_data.tsframe, num_partitions).fillna(0)
print(ray_df)
print(zero_filled)
remove prints
python/ray/dataframe/dataframe.py
Outdated
if index is not None:
    self.index = index
dtype : Data type to force. Only a single dtype is allowed.
dtype:
python/ray/dataframe/dataframe.py
Outdated
If axis=None or axis=0, this call applies df.all(axis=1)
to the transpose of df.
If axis=None or axis=0, this call applies on the column partitions,
otherwise operates on row partitions
indent by 4 spaces
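The axis rule in the docstring under review (column partitions for `axis=None` or `axis=0`, row partitions otherwise) could be sketched roughly as follows. This is not the PR's code: plain lists stand in for the Ray-held pandas partitions, and `all_by_axis`, `row_parts`, and `col_parts` are hypothetical names.

```python
# Sketch only: plain lists stand in for Ray-held pandas partitions.
def all_by_axis(row_parts, col_parts, axis=None):
    """Reduce with all() along `axis`, using the layout that keeps each reduction local."""
    if axis is None or axis == 0:
        # One result per column: each column partition holds whole columns.
        return [all(col) for part in col_parts for col in part]
    # One result per row: each row partition holds whole rows.
    return [all(row) for part in row_parts for row in part]


# The same 2x3 frame stored both ways:
#   row 0: True, True,  False
#   row 1: True, False, False
row_parts = [[[True, True, False]], [[True, False, False]]]    # lists of full rows
col_parts = [[[True, True], [True, False]], [[False, False]]]  # lists of full columns

per_column = all_by_axis(row_parts, col_parts, axis=0)
per_row = all_by_axis(row_parts, col_parts, axis=1)
print(per_column, per_row)
```

Choosing the partition layout by axis means each partition can be reduced on its own, with no data exchanged between partitions.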
python/ray/dataframe/dataframe.py
Outdated
"""Updates the current DataFrame inplace | ||
Note:
If `columns` or `index` are not supplied, they will revert to
default columns or index respectively, as this function does not
indent by 4 spaces
Test PASSed.
Thanks for addressing the comments. Currently failing linting, but looks good to me.
Test PASSed.
Test FAILed.
Test FAILed.
Test FAILed.
Test PASSed.
Test PASSed.
Test PASSed.
What do these changes do?
Changed the underlying architecture to partition the data into blocks, so that it can be accessed by either row or column. Also updated many method implementations that needed minor changes to remain correct under the new architecture.
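As a rough illustration of the block layout described above, the sketch below derives row and column partitions from a 2D grid of blocks. It is an assumption-laden toy, not the PR's implementation: nested Python lists stand in for the pandas blocks, and `block_grid`, `row_partitions`, and `col_partitions` are illustrative names.

```python
# Sketch only: nested lists stand in for the pandas blocks held by Ray.
def row_partitions(block_grid):
    """Join the blocks in each grid row side by side into full-width row chunks."""
    partitions = []
    for grid_row in block_grid:
        # zip(*grid_row) pairs up the i-th row of every block in this grid row.
        partitions.append([sum(pieces, []) for pieces in zip(*grid_row)])
    return partitions


def col_partitions(block_grid):
    """Stack the blocks in each grid column into full-height column chunks."""
    partitions = []
    for grid_col in zip(*block_grid):
        # Each block is a list of rows, so stacking is list concatenation.
        partitions.append([row for block in grid_col for row in block])
    return partitions


# A 4x4 frame holding 0..15, split into a 2x2 grid of 2x2 blocks.
block_grid = [
    [[[0, 1], [4, 5]], [[2, 3], [6, 7]]],
    [[[8, 9], [12, 13]], [[10, 11], [14, 15]]],
]

rows = row_partitions(block_grid)
cols = col_partitions(block_grid)
print(rows[0])  # first two full rows of the frame
print(cols[0])  # first two full columns of the frame, all four rows
```

Because both views are built from the same blocks, a row-wise or column-wise operation can pick whichever view keeps its work within a single partition.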
Related issue number