[DataFrame] Adding insert, set_axis, set_index, reset_index and tests #1603

devin-petersohn · 2018-02-25T23:03:51Z

Mostly used Pandas implementations for Index related things. Also was able to pull in some of their error checking.

AmplabJenkins · 2018-02-25T23:58:50Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3945/
Test PASSed.

robertnishihara · 2018-02-26T07:36:18Z

python/ray/dataframe/dataframe.py

+from pandas.core.dtypes.cast import maybe_upcast_putmask
+from pandas.compat import lzip
+
+import warnings


Should we be using the warnings module everywhere else in the codebase? Right now we're just printing warnings to stdout, but that's almost certainly the wrong approach.

Probably. We are using warnings here because that's what Pandas does. Warnings are typically sent to stderr. They also show up differently in a Jupyter Notebook.

robertnishihara · 2018-02-26T07:37:39Z

python/ray/dataframe/dataframe.py

+        cumulative = np.cumsum(self._lengths)
+        partitions = [value[cumulative[i-1]:cumulative[i]]
+                      for i in range(len(cumulative))
+                      if i != 0]


not really a big deal, but you could do range(1, len(cumulative)) and get rid of the if statement

Yes, I agree. @simon-mo can update this in his loc and iloc PR as an optimization.

robertnishihara · 2018-02-26T07:40:33Z

python/ray/dataframe/dataframe.py

@@ -792,7 +800,52 @@ def info(self, verbose=None, buf=None, max_cols=None, memory_usage=None,
        raise NotImplementedError("Not Yet implemented.")

    def insert(self, loc, column, value, allow_duplicates=False):
-        raise NotImplementedError("Not Yet implemented.")
+        """Insert column into DataFrame at specified location.


With documentation, it just occurred to me that while it is important to document arguments and return values, the documentation here is largely the same as pandas. The most important thing to draw attention to is any deviations from pandas. That doesn't apply to this function specifically, but just in general we should make sure to highlight differences where they occur.

The public API should be 100% the same, excluding possibly the constructor (in the future). Most people create DataFrames from files, so I doubt the constructor will be used outside of examples and small test cases. The return types and errors raised should all be the same for everything implemented so far. I think putting together some documentation for this in the readthedocs is probably going to happen after our blog post this week.

robertnishihara · 2018-02-26T07:45:21Z

python/ray/dataframe/dataframe.py

-    return pd.concat(ray.get(df._df))
+    pd_df = pd.concat(ray.get(df._df))
+    pd_df.index = df.index
+    pd_df.columns = df.columns


In what situations does to_pandas behave differently now? That is, when are the index and columns fields different from what you would get from pd.concat?

Do the tests compare equality of these fields? And are there tests where the old version of to_pandas would fail?

This is an internal change to make it faster to lookup and retrieve things in the index (i.e. don't require a remote task to fetch this information). It doesn't change the tests at all, but the old version of to_pandas will not have the correct index and columns. @simon-mo has an update to this to start making the index on the inner frame more like the design.

devin-petersohn added 3 commits February 25, 2018 14:59

Adding insert, set_axis, set_index, reset_index and tests

0c2ad34

Fix lint

e8e0373

Fix lint

c437a14

robertnishihara approved these changes Feb 26, 2018

View reviewed changes

robertnishihara merged commit 1fa59f1 into ray-project:master Feb 26, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[DataFrame] Adding insert, set_axis, set_index, reset_index and tests #1603

[DataFrame] Adding insert, set_axis, set_index, reset_index and tests #1603

devin-petersohn commented Feb 25, 2018

AmplabJenkins commented Feb 25, 2018

robertnishihara Feb 26, 2018

devin-petersohn Feb 26, 2018

robertnishihara Feb 26, 2018

devin-petersohn Feb 26, 2018

robertnishihara Feb 26, 2018

devin-petersohn Feb 26, 2018

robertnishihara Feb 26, 2018

devin-petersohn Feb 26, 2018

[DataFrame] Adding insert, set_axis, set_index, reset_index and tests #1603

[DataFrame] Adding insert, set_axis, set_index, reset_index and tests #1603

Conversation

devin-petersohn commented Feb 25, 2018

AmplabJenkins commented Feb 25, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment