Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DataFrame] Adding insert, set_axis, set_index, reset_index and tests #1603

Merged
merged 3 commits into from
Feb 26, 2018

Conversation

devin-petersohn
Copy link
Member

Mostly used Pandas implementations for Index related things. Also was able to pull in some of their error checking.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3945/
Test PASSed.

from pandas.core.dtypes.cast import maybe_upcast_putmask
from pandas.compat import lzip

import warnings
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be using the warnings module everywhere else in the codebase? Right now we're just printing warnings to stdout, but that's almost certainly the wrong approach.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably. We are using warnings here because that's what Pandas does. Warnings are typically sent to stderr. They also show up differently in a Jupyter Notebook.

cumulative = np.cumsum(self._lengths)
partitions = [value[cumulative[i-1]:cumulative[i]]
for i in range(len(cumulative))
if i != 0]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not really a big deal, but you could do range(1, len(cumulative)) and get rid of the if statement

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree. @simon-mo can update this in his loc and iloc PR as an optimization.

@@ -792,7 +800,52 @@ def info(self, verbose=None, buf=None, max_cols=None, memory_usage=None,
raise NotImplementedError("Not Yet implemented.")

def insert(self, loc, column, value, allow_duplicates=False):
raise NotImplementedError("Not Yet implemented.")
"""Insert column into DataFrame at specified location.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With documentation, it just occurred to me that while it is important to document arguments and return values, the documentation here is largely the same as pandas. The most important thing to draw attention to is any deviations from pandas. That doesn't apply to this function specifically, but just in general we should make sure to highlight differences where they occur.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The public API should be 100% the same, excluding possibly the constructor (in the future). Most people create DataFrames from files, so I doubt the constructor will be used outside of examples and small test cases. The return types and errors raised should all be the same for everything implemented so far. I think putting together some documentation for this in the readthedocs is probably going to happen after our blog post this week.

return pd.concat(ray.get(df._df))
pd_df = pd.concat(ray.get(df._df))
pd_df.index = df.index
pd_df.columns = df.columns
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In what situations does to_pandas behave differently now? That is, when are the index and columns fields different from what you would get from pd.concat?

Do the tests compare equality of these fields? And are there tests where the old version of to_pandas would fail?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an internal change to make it faster to lookup and retrieve things in the index (i.e. don't require a remote task to fetch this information). It doesn't change the tests at all, but the old version of to_pandas will not have the correct index and columns. @simon-mo has an update to this to start making the index on the inner frame more like the design.

@robertnishihara robertnishihara merged commit 1fa59f1 into ray-project:master Feb 26, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants