[DataFrame] Fix transpose with nan values and add functionality needed for Index #1545

devin-petersohn · 2018-02-15T00:17:18Z

Resolves #1525 and adds new functionality for the ray.Index object. Same as #1542.

AmplabJenkins · 2018-02-15T01:14:16Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3734/

AmplabJenkins · 2018-02-20T20:37:13Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3870/

devin-petersohn · 2018-02-21T04:40:57Z

@robertnishihara This is ready for a review pass. It passes Travis except for the valgrind job, which this PR should not have affected.

robertnishihara · 2018-02-21T04:55:05Z

python/ray/dataframe/dataframe.py

+        # TODO: Clean up later.
+        # We will call get only when we access the object (and only once).
+        self._lengths = \
+            ray.get([_deploy_func.remote(_get_lengths, d) for d in self._df])


Another option (perhaps for later) that might make sense is to always pass the length along with the object IDs, so df would be a list of (Object_ID, length) pairs. This assumes that any time you create one of these partition object IDs, you can easily compute the length.

I think we will probably need a way to either compute or set it. We will eventually have the exact same constructor as Pandas, so passing it won't be an option in the future.

I do think that most of the time we can easily compute the length on creation. I will think more about the optimization for this and we can review it later.
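
As a rough illustration of the (object ID, length) idea discussed above, the sketch below uses plain pandas frames in place of Ray object IDs. `Partition` and `make_partitions` are hypothetical names, not part of ray.dataframe; the only point is that the length is recorded when the partition is created, so no later lookup is needed.

```python
from typing import List, NamedTuple

import pandas as pd


class Partition(NamedTuple):
    """Hypothetical pairing of a partition with its row count."""
    data: pd.DataFrame  # stands in for a Ray object ID
    length: int


def make_partitions(df: pd.DataFrame, rows_per_partition: int) -> List[Partition]:
    """Split df into row blocks, recording each block's length at creation."""
    parts = []
    for start in range(0, len(df), rows_per_partition):
        block = df.iloc[start:start + rows_per_partition]
        parts.append(Partition(block, len(block)))
    return parts


partitions = make_partitions(pd.DataFrame({"a": range(5)}), rows_per_partition=2)
lengths = [p.length for p in partitions]
```

This only works when the creator of a partition already knows its size, which is the assumption stated in the review comment; a constructor that mirrors the pandas one would still need a way to compute the lengths itself.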

robertnishihara · 2018-02-21T04:57:21Z

python/ray/dataframe/dataframe.py

@@ -311,8 +333,11 @@ def transpose(self, *args, **kwargs):
        """
        local_transpose = self._map_partitions(
            lambda df: df.transpose(*args, **kwargs))
+
+        # print(ray.get(local_transpose._df))


Remove the print.

robertnishihara · 2018-02-21T04:58:59Z

python/ray/dataframe/dataframe.py

+    # Because we sometimes have cases where we have summary statistics in our
+    # DataFrames
+    except TypeError:
+        return 0


Should we be returning 0 or letting the exception happen?

Since the length computation happens automatically, we need to suppress the error. There are cases where the wrapped object has no __len__ or size, because we currently wrap anything that gets returned (e.g. the result of sum) in a DataFrame. In the future, we will change this so there is no ambiguity about what a DataFrame can return.
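
A minimal sketch of the behaviour described in this thread, using plain pandas rather than Ray. `get_length` is a hypothetical helper written for illustration; the PR's own function is only partly visible in the diff above.

```python
import pandas as pd


def get_length(obj):
    """Return len(obj), or 0 when obj has no length (e.g. a scalar)."""
    try:
        return len(obj)
    except TypeError:
        return 0


frame = pd.DataFrame({"a": [1, 2, 3]})
scalar = frame["a"].sum()  # a NumPy scalar, which has no __len__

frame_length = get_length(frame)
scalar_length = get_length(scalar)
```

Returning 0 keeps the automatic length computation from failing on summary results, at the cost of hiding genuine type errors, which is the trade-off the reviewer is asking about.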

robertnishihara · 2018-02-21T04:59:44Z

python/ray/dataframe/index.py

+        """
+        k = index.idx.keys()
+        if index.pandas_type is pd.RangeIndex:
+            return pd.RangeIndex(min(k), max(k)+1)


I think the linter will ask for max(k) + 1
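
For context, the RangeIndex reconstruction in the diff above can be tried with plain pandas. The `idx` dictionary below is a hypothetical stand-in for index.idx, and the sketch assumes the keys are contiguous integers.

```python
import pandas as pd

# Hypothetical stand-in for index.idx: labels mapped to (partition, offset).
idx = {0: (0, 0), 1: (0, 1), 2: (1, 0), 3: (1, 1)}

k = idx.keys()
# RangeIndex excludes the stop value, hence max(k) + 1.
rebuilt = pd.RangeIndex(min(k), max(k) + 1)
labels = list(rebuilt)
```

If the keys had gaps, the rebuilt RangeIndex would contain labels that are not in the mapping, so this shortcut is only safe under the contiguity assumption.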

robertnishihara · 2018-02-21T05:00:18Z

python/ray/dataframe/index.py

+              "Length of index given does not match current dataframe")
+
+        return Index(
+          {pd_index[i]: dest_indices[i] for i in range(len(dest_indices))},


add two more spaces in lines 55 and 56

robertnishihara · 2018-02-21T05:00:57Z

python/ray/dataframe/test/test_dataframe.py

@@ -108,6 +111,8 @@ def test_keys(ray_df, pandas_df):

 @pytest.fixture
 def test_transpose(ray_df, pandas_df):
+    print("rd: ", rdf.to_pandas(ray_df.T))
+    print("pd: ", pandas_df.T)


probably remove the two print statements

robertnishihara · 2018-02-21T05:01:23Z

python/ray/dataframe/dataframe.py

        # Sum will collapse the NAs from the groupby
-        return local_transpose.reduce_by_index(lambda df: df.sum(), axis=1)
+        return local_transpose.reduce_by_index(
+          lambda df: df.apply(lambda x: x), axis=1)


two more spaces here
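
One plausible reading of why the sum-based reduction was replaced, assuming the default skipna behaviour of current pandas: summing skips NaN, so a group that holds only NaN comes back as 0.0 rather than NaN. The sketch below shows this with plain pandas; it does not reproduce ray.dataframe's reduce_by_index, and the frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Two row blocks sharing index labels, as they might after a partitioned
# transpose: each label has a real value in at most one block.
stacked = pd.DataFrame(
    {"x": [1.0, np.nan, np.nan, np.nan]},
    index=["r0", "r1", "r0", "r1"],
)

summed = stacked.groupby(level=0).sum()

r0_value = summed.loc["r0", "x"]  # the real value survives
r1_value = summed.loc["r1", "x"]  # all-NaN group: NaN is replaced by 0.0
```

Under that reading, a frame that genuinely contains NaN would not round-trip through a sum-based merge, which matches the "transpose with nan values" bug named in the PR title.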

AmplabJenkins · 2018-02-21T05:40:53Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3877/

AmplabJenkins · 2018-02-21T06:21:02Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3878/

devin-petersohn · 2018-02-21T15:31:06Z

@robertnishihara I think this is ready to merge if you approve. It is blocking most of the other DataFrame methods.

devin-petersohn added 7 commits February 20, 2018 11:35

Index update

41b871c

Fixed transpose bug with nan values

f47e618

Fix lint

a9e2951

Addressing reviewer comments

3ba2a75

Update PR

f776716

Test code modification

a0169a8

Fixing minor error with tranpose

d371726

devin-petersohn force-pushed the df_patch02 branch from c6d20fd to d371726 Compare February 20, 2018 19:36

Fixing problem of failed tests due to sort issue in Python2

a0359be

robertnishihara reviewed Feb 21, 2018

Addressing reviwer comments

304ecaf

SaladRaider mentioned this pull request Feb 21, 2018

[DataFrame] Implements DataFrame.rename, DataFrame.rename_axis, and Index.set_names #1573

Merged

devin-petersohn mentioned this pull request Feb 21, 2018

[DataFrame] df.T.T is leaving out partitions #1548

Closed

robertnishihara approved these changes Feb 21, 2018

robertnishihara merged commit de6fa02 into ray-project:master Feb 21, 2018
