[DataFrame] Update architecture to be more flexible and performant #1821

Merged
robertnishihara merged 63 commits into ray-project:master on Apr 5, 2018

Conversation

devin-petersohn
Member

What do these changes do?

Changed the underlying architecture so that a DataFrame is partitioned into blocks, which allows data to be accessed by either row or column. Also updated many method implementations that needed minor changes to remain correct under the new architecture.
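
For readers skimming the thread, a minimal sketch of the block layout described above. This is not the code in this PR: it uses plain pandas objects instead of Ray object IDs, and the helper names `to_block_grid`, `row_partitions` and `col_partitions` are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd


def to_block_grid(df, row_splits, col_splits):
    """Split df into a 2-D object array of pandas blocks (hypothetical helper)."""
    row_chunks = np.array_split(np.arange(len(df.index)), row_splits)
    col_chunks = np.array_split(np.arange(len(df.columns)), col_splits)
    grid = np.empty((len(row_chunks), len(col_chunks)), dtype=object)
    for i, rows in enumerate(row_chunks):
        for j, cols in enumerate(col_chunks):
            # Each cell holds one rectangular block of the original frame.
            grid[i, j] = df.iloc[rows, cols]
    return grid


def row_partitions(grid):
    """Glue the blocks of each grid row side by side: one full-width frame per row."""
    return [pd.concat(list(grid[i, :]), axis=1) for i in range(grid.shape[0])]


def col_partitions(grid):
    """Stack the blocks of each grid column: one full-height frame per column."""
    return [pd.concat(list(grid[:, j]), axis=0) for j in range(grid.shape[1])]
```

The point of the grid is that the same blocks can be viewed either way: a row-wise operation consumes `row_partitions(grid)`, a column-wise one consumes `col_partitions(grid)`, and neither requires reshuffling the underlying data.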

Related issue number

devin-petersohn and others added 30 commits March 27, 2018 12:45
Setting up pandas concordant constructor

adding column partitions to Dataframe architecture

using copy to create temp ray df

resolve merge issues

renaming rebased variables to new names

rename cols, rows to _col_partitions and _row_partitions

Adding in the col_partitions and row_partitions properties

Resolving some flake8 issues

Shuffle on either axis

added WIP comments to shuffle.py

Modifications to shuffle actor

drop on shuffle axis

implemented rows_to_cols WIP

implement transpose using col and row partitions

Rebuild columns, implement index calculation

* zipped index calculations

* rename _index and _length to _row*

* add _col_index and _col_length (currently broken)

* pass index & columns in transpose

resolving rebase from ray/master

rearranging utils to match ray/master

reimplement groupby and update_inplace

implemented _rebuild_rows

fix import issue

update map functions to new architecture

cast to df in sum and fix empty partition index

adhoc fix for sum

any/all impl

Resolving additional functions to architecture change

any/all index join impl

Implement __delitem__ WIP (untested)

fix for when columns are non-duplicate

replace columns with _col_index

implemented inplace

implemented insert, handtested

uncomment tests
* Adding fixes

* Adding iterrows update
* Fix items

* Removing debug code
* changes to min/max dataframes functions

* max/min now return a Series - fixed tests to check equality in pandas series objects

* added error checking for axis in min and max

* updated error checking for axis in min/max
* Update __delitem__ with row/col_partitions

* Fix __neg__
…l(), and count() (ray-project#8)

* cleanup and fixing eval

* fixed eval and ffill dataframe functions

* changing _col_index entries during eval

* small change to documentation for applymap()
@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4650/

_row_partitions = property(_get_row_partitions, _set_row_partitions)

def _get_col_partitions(self):
@ray.remote
Collaborator

avoid defining remote function here

@@ -401,7 +661,12 @@ def isnull(self):
True: cell contains null.
False: otherwise.
"""
return self._map_partitions(lambda df: df.isnull)
new_blk_partitions = np.array([_map_partitions(
Collaborator

blk -> block everywhere would be preferable

@@ -1143,6 +1161,8 @@ def test_fillna_sanity(num_partitions=2):

zero_filled = test_data.tsframe.fillna(0)
ray_df = from_pandas(test_data.tsframe, num_partitions).fillna(0)
print(ray_df)
print(zero_filled)
Collaborator

remove prints


if index is not None:
self.index = index
dtype : Data type to force. Only a single dtype is allowed.
Collaborator

dtype:

If axis=None or axis=0, this call applies df.all(axis=1)
to the transpose of df.
If axis=None or axis=0, this call applies on the column partitions,
otherwise operates on row partitions
Collaborator

indent by 4 spaces
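
As context for the docstring under review, a minimal sketch of what "applies on the column partitions, otherwise operates on row partitions" could look like. It assumes the partitions are plain pandas DataFrames rather than Ray object IDs, and `all_over_partitions` is a hypothetical name, not a function from this PR.

```python
import pandas as pd


def all_over_partitions(col_parts, row_parts, axis=None):
    """Reduce with all() on whichever partitioning matches the axis."""
    if axis is None or axis == 0:
        # One result per column: each column partition holds whole columns,
        # so it can be reduced independently of the others.
        pieces = [part.all(axis=0) for part in col_parts]
    else:
        # One result per row: each row partition holds whole rows.
        pieces = [part.all(axis=1) for part in row_parts]
    return pd.concat(pieces)
```

Because every partition contains complete columns (or complete rows), the per-partition results only need to be concatenated; no transpose of the frame is involved.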

"""Updates the current DataFrame inplace
Note:
If `columns` or `index` are not supplied, they will revert to
default columns or index respectively, as this function does not
Collaborator

indent by 4 spaces

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4662/

Collaborator

@robertnishihara robertnishihara left a comment

Thanks for addressing the comments. Currently failing linting, but looks good to me.

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4663/

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4664/

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4665/

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4666/

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4667/

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4668/

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4669/

@robertnishihara robertnishihara merged commit 0d9a7a3 into ray-project:master Apr 5, 2018
8 participants