[DataFrame] Update architecture to be more flexible and performant #1821

devin-petersohn · 2018-04-03T19:16:30Z

What do these changes do?

Changed the underlying architecture to partition data into blocks, so that data can be accessed by either row or column. Also updated many method implementations that needed minor changes to be correct in the new architecture.
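As a rough illustration of the block idea described above, the following pure-Python sketch splits a small table into a 2D grid of blocks and then reassembles it as either full-width row partitions or full-height column partitions. The helper names (`split_into_blocks`, `row_partitions`, `col_partitions`) are hypothetical and chosen for this example only; the PR itself works on Ray object IDs and pandas DataFrames, not nested lists, and the sketch assumes a non-empty rectangular table.

```python
def split_into_blocks(table, block_rows, block_cols):
    """Split a list-of-rows table into a 2D grid of smaller blocks."""
    grid = []
    for r in range(0, len(table), block_rows):
        row_of_blocks = []
        for c in range(0, len(table[0]), block_cols):
            # Each block keeps a slice of rows and a slice of columns.
            block = [row[c:c + block_cols] for row in table[r:r + block_rows]]
            row_of_blocks.append(block)
        grid.append(row_of_blocks)
    return grid


def row_partitions(grid):
    """Join each row of blocks side by side into full-width row groups."""
    parts = []
    for row_of_blocks in grid:
        # zip pairs up the i-th row of every block in this horizontal band.
        parts.append([sum(pieces, []) for pieces in zip(*row_of_blocks)])
    return parts


def col_partitions(grid):
    """Stack each column of blocks vertically into full-height column groups."""
    parts = []
    for col_of_blocks in zip(*grid):
        parts.append([row for block in col_of_blocks for row in block])
    return parts


table = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
grid = split_into_blocks(table, block_rows=2, block_cols=2)
```

With the blocks stored once, a row-wise operation can read `row_partitions(grid)` and a column-wise operation can read `col_partitions(grid)` without transposing the whole table.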

Related issue number

Setting up pandas concordant constructor
adding column partitions to Dataframe architecture
using copy to create temp ray df
resolve merge issues
renaming rebased variables to new names
rename cols, rows to _col_partitions and _row_partitions
Adding in the col_partitions and row_partitions properties
Resolving some flake8 issues
Shuffle on either axis
added WIP comments to shuffle.py
Modifications to shuffle actor
drop on shuffle axis
implemented rows_to_cols
WIP implement transpose using col and row partitions
Rebuild columns, implement index calculation
* zipped index calculations
* rename _index and _length to _row*
* add _col_index and _col_length (currently broken)
* pass index & columns in transpose
resolving rebase from ray/master
rearranging utils to match ray/master
reimplement groupby and update_inplace
implemented _rebuild_rows
fix import issue
update map functions to new architecture
cast to df in sum and fix empty partition index
adhoc fix for sum

any/all impl
Resolving additional functions to architecture change
any/all index
join impl
Implement __delitem__ WIP (untested)
fix for when columns are non-duplicate
replace columns with _col_index
implemented inplace
implemented insert, handtested
uncomment tests

* Adding fixes
* Adding iterrows update

* Fix items
* Removing debug code

* changes to min/max dataframes functions
* max/min now return a Series - fixed tests to check equality in pandas series objects
* added error checking for axis in min and max
* updated error checking for axis in min/max

* Update __delitem__ with row/col_partitions
* Fix __neg__

…l(), and count() (ray-project#8)
* cleanup and fixing eval
* fixed eval and ffill dataframe functions
* changing _col_index entries during eval
* small change to documentation for applymap()

AmplabJenkins · 2018-04-04T08:44:01Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4650/
Test PASSed.

robertnishihara · 2018-04-04T20:29:57Z

python/ray/dataframe/dataframe.py

+    _row_partitions = property(_get_row_partitions, _set_row_partitions)
+
+    def _get_col_partitions(self):
+        @ray.remote


avoid defining remote function here
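The comment above does not give a reason. One plausible reading, stated here as an assumption, is that a decorator placed inside a method wraps the inner function again on every call, which for a remote-function decorator can mean repeated registration work. The sketch below uses a hypothetical stand-in decorator named `remote` that only records each wrap; it is not Ray's API, and the class and function names are invented for illustration.

```python
# Hypothetical stand-in for a decorator such as ray.remote; NOT Ray's API.
registrations = []


def remote(func):
    """Record every time a function is wrapped, then return it unchanged."""
    registrations.append(func.__name__)
    return func


class InlineFrame:
    def col_partitions(self, blocks):
        @remote  # wrapped again on every call to col_partitions
        def combine(column):
            return [value for block in column for value in block]

        # zip(*blocks) yields one tuple of blocks per block column.
        return [combine(list(column)) for column in zip(*blocks)]


@remote  # wrapped once, when the module is imported
def combine_blocks(column):
    return [value for block in column for value in block]


class ModuleLevelFrame:
    def col_partitions(self, blocks):
        return [combine_blocks(list(column)) for column in zip(*blocks)]
```

Both classes return the same result, but calling `InlineFrame.col_partitions` repeatedly grows `registrations` each time, while `ModuleLevelFrame` adds a single entry in total.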

robertnishihara · 2018-04-04T20:32:02Z

python/ray/dataframe/dataframe.py

@@ -401,7 +661,12 @@ def isnull(self):
            True: cell contains null.
            False: otherwise.
        """
-        return self._map_partitions(lambda df: df.isnull)
+        new_blk_partitions = np.array([_map_partitions(


blk -> block everywhere would be preferable
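For readers unfamiliar with the pattern in the hunk above, the sketch below shows the general shape of applying an element-wise function to every block of a 2D grid while keeping the grid layout. It is a simplified stand-in: `map_blocks` and `block_isnull` are hypothetical names, blocks are nested lists rather than remote pandas DataFrames, and `None` plays the role of a null value. The naming follows the reviewer's preference for `block` over `blk`.

```python
def map_blocks(func, block_partitions):
    """Apply func to every block, preserving the 2D grid layout."""
    return [[func(block) for block in row_of_blocks]
            for row_of_blocks in block_partitions]


def block_isnull(block):
    """Return a same-shaped block of booleans marking None entries."""
    return [[value is None for value in row] for row in block]


block_partitions = [
    [[[1, None]], [[3]]],
    [[[None, 5]], [[None]]],
]
new_block_partitions = map_blocks(block_isnull, block_partitions)
```

Because the function is applied per block and the result has the same grid shape, the row and column layout of the original frame carries over to the output unchanged.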

robertnishihara · 2018-04-04T20:33:37Z

python/ray/dataframe/test/test_dataframe.py

@@ -1143,6 +1161,8 @@ def test_fillna_sanity(num_partitions=2):

    zero_filled = test_data.tsframe.fillna(0)
    ray_df = from_pandas(test_data.tsframe, num_partitions).fillna(0)
+    print(ray_df)
+    print(zero_filled)


remove prints

robertnishihara · 2018-04-04T20:42:00Z

python/ray/dataframe/dataframe.py

-
-        if index is not None:
-            self.index = index
+            dtype : Data type to force. Only a single dtype is allowed.


robertnishihara · 2018-04-04T20:46:06Z

python/ray/dataframe/dataframe.py

-            If axis=None or axis=0, this call applies df.all(axis=1)
-            to the transpose of df.
+            If axis=None or axis=0, this call applies on the column partitions,
+            otherwise operates on row partitions


indent by 4 spaces
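The docstring in the hunk above says the reduction reads column partitions for `axis=None` or `axis=0` and row partitions otherwise. A minimal sketch of that dispatch, assuming partitions are plain nested lists and using the hypothetical name `all_along_axis`, might look like this:

```python
def all_along_axis(col_parts, row_parts, axis=None):
    """Reduce with all(), choosing the partitioning that matches the axis."""
    if axis is None or axis == 0:
        # One result per column: each column partition holds full-height rows
        # for a subset of columns, so zip(*part) yields its columns.
        return [all(column) for part in col_parts for column in zip(*part)]
    # One result per row: each row partition holds full-width rows.
    return [all(row) for part in row_parts for row in part]


# The same 2x3 table, stored both ways:
#   [[True, True,  True],
#    [True, False, True]]
col_parts = [[[True, True], [True, False]], [[True], [True]]]
row_parts = [[[True, True, True]], [[True, False, True]]]

per_column = all_along_axis(col_parts, row_parts, axis=0)
per_row = all_along_axis(col_parts, row_parts, axis=1)
```

Each partition can be reduced independently of the others, which is the property that makes choosing the matching partitioning worthwhile.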

robertnishihara · 2018-04-04T20:57:16Z

python/ray/dataframe/dataframe.py

-        """Updates the current DataFrame inplace
+        Note:
+            If `columns` or `index` are not supplied, they will revert to
+            default columns or index respectively, as this function does not


indent by 4 spaces

AmplabJenkins · 2018-04-05T06:42:27Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4662/
Test PASSed.

robertnishihara

Thanks for addressing the comments. Currently failing linting, but looks good to me.

AmplabJenkins · 2018-04-05T16:02:56Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4663/
Test PASSed.

AmplabJenkins · 2018-04-05T17:37:31Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4664/
Test FAILed.

AmplabJenkins · 2018-04-05T17:42:36Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4665/
Test FAILed.

AmplabJenkins · 2018-04-05T18:22:44Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4666/
Test FAILed.

AmplabJenkins · 2018-04-05T18:56:28Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4667/
Test PASSed.

AmplabJenkins · 2018-04-05T19:06:41Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4668/
Test PASSed.

AmplabJenkins · 2018-04-05T19:22:58Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4669/
Test PASSed.

devin-petersohn and others added 30 commits March 27, 2018 12:45

Modify groupby for efficiency

149890a

Updating implementation

bd7ebcb

Correcting shuffle index behavior

f652a80

Updating implementations

97f4f17

Adding better implementation

ce6b00e

Updating implementation

dc4839f

Finalizing groupby

8a16d35

Adding note

ef1ce76

Removing merge artifact

1e52092

Fixing the scheduling for ShuffleActors

3a704e1

Addressing comments

cc4f08e

Adding docs and minor bugfix

233cd21

Removed df.empty code, replace with setting n_cols/n_rows

8d3b11d

Map takes in a list of partitions

46cd07d

Updating implementation (#2)

9e9b097

Fix sum

86ed1de

Fix lint

04278f2

minor fixes

3e24e6a

fixing __delitem and initial pass at drop WIP (#3)

d3076dc

Adding fixes (#5)

4739a43

* Adding fixes
* Adding iterrows update

Fix items() (#6)

69bd4b3

* Fix items
* Removing debug code

Implemented max and min dataframes functions (#4)

027c15b

* changes to min/max dataframes functions
* max/min now return a Series - fixed tests to check equality in pandas series objects
* added error checking for axis in min and max
* updated error checking for axis in min/max

Column rewrite (#7)

1cea5d4

* Update __delitem__ with row/col_partitions
* Fix __neg__

helper methods to access index

249a1b2

Cleanups for dataframes functions max(), min(), eval(), bfill(), ffil…

6d17bdc

…l(), and count() (ray-project#8)
* cleanup and fixing eval
* fixed eval and ffill dataframe functions
* changing _col_index entries during eval
* small change to documentation for applymap()

Fixing __getitem__ (ray-project#9)

89a3a20

fixes for dtypes, get, idxmax, idxmin, and pop dataframes functions

bf806bd

fixes for idxmax and idxmin

95bd4b7

devin-petersohn mentioned this pull request Apr 4, 2018

[DataFrame] Update groupby and add shuffle with Actors #1694

Closed

Updating lint, passing tests

a7099da

devin-petersohn mentioned this pull request Apr 4, 2018

[DataFrame] ValueError while indexing a dataframe #1826

Closed

robertnishihara reviewed Apr 4, 2018

devin-petersohn added 2 commits April 4, 2018 22:38

Addressing comments

a266ffd

Fix test

3838ead

robertnishihara approved these changes Apr 5, 2018

Fix lint

e91a7fa

devin-petersohn added 2 commits April 5, 2018 10:06

Fixing backticks to single-quotes

d2fbd3f

Fixing Python2 compat

f956d21

devin-petersohn added 4 commits April 5, 2018 10:53

Fixing indexing

36460fb

Fix lint

a5756cd

Removing debug code

c76597d

Minor cleanup

61b4de9

devin-petersohn mentioned this pull request Apr 5, 2018

[DataFrame] Slice indexing #1832

Merged

robertnishihara merged commit 0d9a7a3 into ray-project:master Apr 5, 2018
