-
Notifications
You must be signed in to change notification settings - Fork 5.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DataFrame] Adding insert, set_axis, set_index, reset_index and tests #1603
[DataFrame] Adding insert, set_axis, set_index, reset_index and tests #1603
Conversation
Test PASSed. |
from pandas.core.dtypes.cast import maybe_upcast_putmask | ||
from pandas.compat import lzip | ||
|
||
import warnings |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we be using the warnings module everywhere else in the codebase? Right now we're just printing warnings to stdout, but that's almost certainly the wrong approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably. We are using warnings
here because that's what Pandas does. Warnings are typically sent to stderr. They also show up differently in a Jupyter Notebook.
cumulative = np.cumsum(self._lengths) | ||
partitions = [value[cumulative[i-1]:cumulative[i]] | ||
for i in range(len(cumulative)) | ||
if i != 0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not really a big deal, but you could do range(1, len(cumulative))
and get rid of the if statement
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree. @simon-mo can update this in his loc
and iloc
PR as an optimization.
@@ -792,7 +800,52 @@ def info(self, verbose=None, buf=None, max_cols=None, memory_usage=None, | |||
raise NotImplementedError("Not Yet implemented.") | |||
|
|||
def insert(self, loc, column, value, allow_duplicates=False): | |||
raise NotImplementedError("Not Yet implemented.") | |||
"""Insert column into DataFrame at specified location. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With documentation, it just occurred to me that while it is important to document arguments and return values, the documentation here is largely the same as pandas. The most important thing to draw attention to is any deviations from pandas. That doesn't apply to this function specifically, but just in general we should make sure to highlight differences where they occur.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The public API should be 100% the same, excluding possibly the constructor (in the future). Most people create DataFrames from files, so I doubt the constructor will be used outside of examples and small test cases. The return types and errors raised should all be the same for everything implemented so far. I think putting together some documentation for this in the readthedocs is probably going to happen after our blog post this week.
return pd.concat(ray.get(df._df)) | ||
pd_df = pd.concat(ray.get(df._df)) | ||
pd_df.index = df.index | ||
pd_df.columns = df.columns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In what situations does to_pandas
behave differently now? That is, when are the index
and columns
fields different from what you would get from pd.concat
?
Do the tests compare equality of these fields? And are there tests where the old version of to_pandas
would fail?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an internal change to make it faster to lookup and retrieve things in the index (i.e. don't require a remote task to fetch this information). It doesn't change the tests at all, but the old version of to_pandas
will not have the correct index
and columns
. @simon-mo has an update to this to start making the index on the inner frame more like the design.
Mostly used Pandas implementations for Index related things. Also was able to pull in some of their error checking.