-
Notifications
You must be signed in to change notification settings - Fork 6.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DataFrame] Implemented __setitem__, select_dtypes, and astype #1941
Conversation
Test PASSed. |
Test PASSed. |
Test PASSed. |
Test PASSed. |
python/ray/dataframe/dataframe.py
Outdated
"github.com/ray-project/ray.") | ||
if not columns: | ||
return DataFrame() | ||
col_idx = [self.columns.get_loc(columns[i]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to handle cases where some of the passed in values are not in self.columns. Pandas fills this case with NaNs.
python/ray/dataframe/dataframe.py
Outdated
@@ -3260,7 +3332,7 @@ def _getitem_array(self, key): | |||
new_parts = _map_partitions(lambda df: df[key], | |||
self._col_partitions) | |||
columns = self.columns | |||
index = self.index[key] | |||
index = self._col_metadata[key].index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did this change?
python/ray/dataframe/dataframe.py
Outdated
raise NotImplementedError( | ||
"To contribute to Pandas on Ray, please visit " | ||
"github.com/ray-project/ray.") | ||
if errors == 'raise': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should not ignore errors when errors == 'raise'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kunalgosar What do you mean by your comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On errors == 'raise'
, these exceptions should be raised. On errors == 'ignore'
, they should be suppressed and self
should be returned as per the pandas documentation. Currently errors are suppressed in this if case.
python/ray/dataframe/dataframe.py
Outdated
"github.com/ray-project/ray.") | ||
if errors == 'raise': | ||
try: | ||
pd.DataFrame().astype(dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check doesn't work on empty dataframes
python/ray/dataframe/dataframe.py
Outdated
raise NotImplementedError( | ||
"To contribute to Pandas on Ray, please visit " | ||
"github.com/ray-project/ray.") | ||
if not columns: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anytime that columns == None
this will return an empty dataframe
elif not exclude: | ||
exclude = [] | ||
|
||
sel = tuple(map(set, (include, exclude))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code (and below) is somewhat hard to read, and can be wrapped up into the if-else cases above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code using map in particular?
python/ray/dataframe/dataframe.py
Outdated
|
||
axis = pd.DataFrame()._get_axis_number(axis) if (axis) else 0 | ||
if axis == 1 or columns: | ||
def row_helper(df, col_idx): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the partitions currently have RangeIndex
on both rows and columns, this function won't work. Refer to utils._reindex_helper
and ask @devin-petersohn for details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I tried to get around that by translating the columns passed in to the helper by using the loc index of the columns desired (col_idx)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed this implementation of reindex for a future PR
Test FAILed. |
Test PASSed. |
Test PASSed. |
@@ -2150,7 +2168,7 @@ def test_reindex(): | |||
ray_df = create_test_dataframe() | |||
|
|||
with pytest.raises(NotImplementedError): | |||
ray_df.reindex() | |||
ray_df.reindex(None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a sanity check test here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking through the code, was reindex dropped from this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, reindex did not have full functionality (axes, errors) so I took it out for the PR against master
@@ -2150,7 +2168,7 @@ def test_reindex(): | |||
ray_df = create_test_dataframe() | |||
|
|||
with pytest.raises(NotImplementedError): | |||
ray_df.reindex() | |||
ray_df.reindex(None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revert this line.
Test PASSed. |
Test PASSed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great!
Merged, thanks @jaeminkim1324! |
* master: (21 commits) Expand local_dir in Trial init (ray-project#2013) Fixing ascii error for Python2 (ray-project#2009) [DataFrame] Implements df.update (ray-project#1997) [DataFrame] Implements df.as_matrix (ray-project#2001) [DataFrame] Implement quantile (ray-project#1992) [DataFrame] Impement sort_values and sort_index (ray-project#1977) [DataFrame] Implement rank (ray-project#1991) [DataFrame] Implemented prod, product, added test suite (ray-project#1994) [DataFrame] Implemented __setitem__, select_dtypes, and astype (ray-project#1941) [DataFrame] Implement diff (ray-project#1996) [DataFrame] Implemented nunique, skew (ray-project#1995) [DataFrame] Implements filter and dropna (ray-project#1959) [DataFrame] Implements df.pipe (ray-project#1999) [DataFrame] Apply() for Lists and Dicts (ray-project#1973) Clean up syntax for supported Python versions. (ray-project#1963) [DataFrame] Implements mode, to_datetime, and get_dummies (ray-project#1956) [DataFrame] Fix dtypes (ray-project#1930) keep_dims -> keepdims (ray-project#1980) add pthread linking (ray-project#1986) [DataFrame] Add layer of abstraction to allow OID instantiation (ray-project#1984) ...
* master: (25 commits) [DataFrame] Add direct pandas imports for MVP (ray-project#1960) Make ActorHandles pickleable, also make proper ActorHandle and ActorC… (ray-project#2007) Expand local_dir in Trial init (ray-project#2013) Fixing ascii error for Python2 (ray-project#2009) [DataFrame] Implements df.update (ray-project#1997) [DataFrame] Implements df.as_matrix (ray-project#2001) [DataFrame] Implement quantile (ray-project#1992) [DataFrame] Impement sort_values and sort_index (ray-project#1977) [DataFrame] Implement rank (ray-project#1991) [DataFrame] Implemented prod, product, added test suite (ray-project#1994) [DataFrame] Implemented __setitem__, select_dtypes, and astype (ray-project#1941) [DataFrame] Implement diff (ray-project#1996) [DataFrame] Implemented nunique, skew (ray-project#1995) [DataFrame] Implements filter and dropna (ray-project#1959) [DataFrame] Implements df.pipe (ray-project#1999) [DataFrame] Apply() for Lists and Dicts (ray-project#1973) Clean up syntax for supported Python versions. (ray-project#1963) [DataFrame] Implements mode, to_datetime, and get_dummies (ray-project#1956) [DataFrame] Fix dtypes (ray-project#1930) keep_dims -> keepdims (ray-project#1980) ...
* master: [DataFrame] Add direct pandas imports for MVP (ray-project#1960) Make ActorHandles pickleable, also make proper ActorHandle and ActorC… (ray-project#2007) Expand local_dir in Trial init (ray-project#2013) Fixing ascii error for Python2 (ray-project#2009) [DataFrame] Implements df.update (ray-project#1997) [DataFrame] Implements df.as_matrix (ray-project#2001) [DataFrame] Implement quantile (ray-project#1992) [DataFrame] Impement sort_values and sort_index (ray-project#1977) [DataFrame] Implement rank (ray-project#1991) [DataFrame] Implemented prod, product, added test suite (ray-project#1994) [DataFrame] Implemented __setitem__, select_dtypes, and astype (ray-project#1941) [DataFrame] Implement diff (ray-project#1996) [DataFrame] Implemented nunique, skew (ray-project#1995) [DataFrame] Implements filter and dropna (ray-project#1959) [DataFrame] Implements df.pipe (ray-project#1999) [DataFrame] Apply() for Lists and Dicts (ray-project#1973)
What do these changes do?
Adds functionality for astype and select_dtypes methods for Dataframes