-
Notifications
You must be signed in to change notification settings - Fork 5.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DataFrame] Implementation for head, idxmax, idxmin, pop, tail, and Ray Index #1520
[DataFrame] Implementation for head, idxmax, idxmin, pop, tail, and Ray Index #1520
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few comments. Looks good to me.
python/ray/dataframe/dataframe.py
Outdated
|
||
def hist(self, data, column=None, by=None, grid=True, xlabelsize=None, | ||
xrot=None, ylabelsize=None, yrot=None, ax=None, sharex=False, | ||
sharey=False, figsize=None, layout=None, bins=10, **kwds): | ||
raise NotImplementedError("Not Yet implemented.") | ||
|
||
def idxmax(self, axis=0, skipna=True): | ||
raise NotImplementedError("Not Yet implemented.") | ||
"""Get the index of the first occurrence of the maximum value of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's try to keep the first line of the docstrings in a single line (e.g., replae maximum -> max).
Same comment for the next function.
python/ray/dataframe/dataframe.py
Outdated
|
||
Returns: | ||
A Series with the index for each maximum value for the axis | ||
specified. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a big deal, but can we indent "specified" by four spaces? Same comment for a couple other functions in this PR.
i = 0 | ||
while n > 0 and i < len(self._df): | ||
if (n - sizes[i]) < 0: | ||
new_dfs.append(_deploy_func.remote(lambda df: df.head(n), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At some point we may want to consider defining the remote functions up front because deploying them every time could be slow (or maybe just deploying them once and then caching the result), but this is an optimization for later.
Test PASSed. |
Test PASSed. |
Test PASSed. |
Continuing to add functionality.
Ray Index will be needed to implement methods such as
__getitem__()
and will be extremely useful going forward. We cannot rely on the Pandas Index globally because we have a different structure overall. We can, however use the Pandas Index locally to access the item we need.Index structure:
key -> (partition, key)
. The secondkey
is the Pandas Index key.Edit: Additional work is needed to integrate the new Index structure, but we at least have the skeleton in this PR.