Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DataFrame] Implemented __setitem__, select_dtypes, and astype #1941

Merged
merged 9 commits into from
May 6, 2018

Conversation

jaeminkim1324
Copy link
Contributor

What do these changes do?

Adds functionality for astype and select_dtypes methods for Dataframes

@jaeminkim1324 jaeminkim1324 changed the title added reindex, __setitem__, select_dtypes, and astype functionality [DataFrame] added reindex, __setitem__, select_dtypes, and astype functionality Apr 23, 2018
@jaeminkim1324 jaeminkim1324 changed the title [DataFrame] added reindex, __setitem__, select_dtypes, and astype functionality [DataFrame] Added reindex, __setitem__, select_dtypes, and astype functionality Apr 23, 2018
@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5041/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5052/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5053/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5054/
Test PASSed.

"github.com/ray-project/ray.")
if not columns:
return DataFrame()
col_idx = [self.columns.get_loc(columns[i])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to handle cases where some of the passed in values are not in self.columns. Pandas fills this case with NaNs.

@@ -3260,7 +3332,7 @@ def _getitem_array(self, key):
new_parts = _map_partitions(lambda df: df[key],
self._col_partitions)
columns = self.columns
index = self.index[key]
index = self._col_metadata[key].index
Copy link
Contributor

@kunalgosar kunalgosar Apr 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did this change?

raise NotImplementedError(
"To contribute to Pandas on Ray, please visit "
"github.com/ray-project/ray.")
if errors == 'raise':
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should not ignore errors when errors == 'raise'

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kunalgosar What do you mean by your comment?

Copy link
Contributor

@kunalgosar kunalgosar Apr 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On errors == 'raise', these exceptions should be raised. On errors == 'ignore', they should be suppressed and self should be returned as per the pandas documentation. Currently errors are suppressed in this if case.

"github.com/ray-project/ray.")
if errors == 'raise':
try:
pd.DataFrame().astype(dtype)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check doesn't work on empty dataframes

raise NotImplementedError(
"To contribute to Pandas on Ray, please visit "
"github.com/ray-project/ray.")
if not columns:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anytime that columns == None this will return an empty dataframe

elif not exclude:
exclude = []

sel = tuple(map(set, (include, exclude)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code (and below) is somewhat hard to read, and can be wrapped up into the if-else cases above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code using map in particular?


axis = pd.DataFrame()._get_axis_number(axis) if (axis) else 0
if axis == 1 or columns:
def row_helper(df, col_idx):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the partitions currently have RangeIndex on both rows and columns, this function won't work. Refer to utils._reindex_helper and ask @devin-petersohn for details.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I tried to get around that by translating the columns passed in to the helper by using the loc index of the columns desired (col_idx)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed this implementation of reindex for a future PR

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5062/
Test FAILed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5061/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5071/
Test PASSed.

@@ -2150,7 +2168,7 @@ def test_reindex():
ray_df = create_test_dataframe()

with pytest.raises(NotImplementedError):
ray_df.reindex()
ray_df.reindex(None)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a sanity check test here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking through the code, was reindex dropped from this PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, reindex did not have full functionality (axes, errors) so I took it out for the PR against master

@jaeminkim1324 jaeminkim1324 changed the title [DataFrame] Added reindex, __setitem__, select_dtypes, and astype functionality [DataFrame] Added __setitem__, select_dtypes, and astype functionality May 4, 2018
@jaeminkim1324 jaeminkim1324 changed the title [DataFrame] Added __setitem__, select_dtypes, and astype functionality [DataFrame] Implemented __setitem__, select_dtypes, and astype May 4, 2018
@@ -2150,7 +2168,7 @@ def test_reindex():
ray_df = create_test_dataframe()

with pytest.raises(NotImplementedError):
ray_df.reindex()
ray_df.reindex(None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revert this line.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5229/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5230/
Test PASSed.

Copy link
Contributor

@kunalgosar kunalgosar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

@devin-petersohn devin-petersohn merged commit 36fd648 into ray-project:master May 6, 2018
@devin-petersohn
Copy link
Member

Merged, thanks @jaeminkim1324!

alok added a commit to alok/ray that referenced this pull request May 8, 2018
* master: (21 commits)
  Expand local_dir in Trial init (ray-project#2013)
  Fixing ascii error for Python2 (ray-project#2009)
  [DataFrame] Implements df.update (ray-project#1997)
  [DataFrame] Implements df.as_matrix (ray-project#2001)
  [DataFrame] Implement quantile (ray-project#1992)
  [DataFrame] Impement sort_values and sort_index (ray-project#1977)
  [DataFrame] Implement rank (ray-project#1991)
  [DataFrame] Implemented prod, product, added test suite (ray-project#1994)
  [DataFrame] Implemented __setitem__, select_dtypes, and astype (ray-project#1941)
  [DataFrame] Implement diff (ray-project#1996)
  [DataFrame] Implemented nunique, skew (ray-project#1995)
  [DataFrame] Implements filter and dropna (ray-project#1959)
  [DataFrame] Implements df.pipe (ray-project#1999)
  [DataFrame] Apply() for Lists and Dicts (ray-project#1973)
  Clean up syntax for supported Python versions. (ray-project#1963)
  [DataFrame] Implements mode, to_datetime, and get_dummies (ray-project#1956)
  [DataFrame] Fix dtypes (ray-project#1930)
  keep_dims -> keepdims (ray-project#1980)
  add pthread linking (ray-project#1986)
  [DataFrame] Add layer of abstraction to allow OID instantiation (ray-project#1984)
  ...
alok added a commit to alok/ray that referenced this pull request May 9, 2018
* master: (25 commits)
  [DataFrame] Add direct pandas imports for MVP (ray-project#1960)
  Make ActorHandles pickleable, also make proper ActorHandle and ActorC… (ray-project#2007)
  Expand local_dir in Trial init (ray-project#2013)
  Fixing ascii error for Python2 (ray-project#2009)
  [DataFrame] Implements df.update (ray-project#1997)
  [DataFrame] Implements df.as_matrix (ray-project#2001)
  [DataFrame] Implement quantile (ray-project#1992)
  [DataFrame] Impement sort_values and sort_index (ray-project#1977)
  [DataFrame] Implement rank (ray-project#1991)
  [DataFrame] Implemented prod, product, added test suite (ray-project#1994)
  [DataFrame] Implemented __setitem__, select_dtypes, and astype (ray-project#1941)
  [DataFrame] Implement diff (ray-project#1996)
  [DataFrame] Implemented nunique, skew (ray-project#1995)
  [DataFrame] Implements filter and dropna (ray-project#1959)
  [DataFrame] Implements df.pipe (ray-project#1999)
  [DataFrame] Apply() for Lists and Dicts (ray-project#1973)
  Clean up syntax for supported Python versions. (ray-project#1963)
  [DataFrame] Implements mode, to_datetime, and get_dummies (ray-project#1956)
  [DataFrame] Fix dtypes (ray-project#1930)
  keep_dims -> keepdims (ray-project#1980)
  ...
alok added a commit to alok/ray that referenced this pull request May 9, 2018
* master:
  [DataFrame] Add direct pandas imports for MVP (ray-project#1960)
  Make ActorHandles pickleable, also make proper ActorHandle and ActorC… (ray-project#2007)
  Expand local_dir in Trial init (ray-project#2013)
  Fixing ascii error for Python2 (ray-project#2009)
  [DataFrame] Implements df.update (ray-project#1997)
  [DataFrame] Implements df.as_matrix (ray-project#2001)
  [DataFrame] Implement quantile (ray-project#1992)
  [DataFrame] Impement sort_values and sort_index (ray-project#1977)
  [DataFrame] Implement rank (ray-project#1991)
  [DataFrame] Implemented prod, product, added test suite (ray-project#1994)
  [DataFrame] Implemented __setitem__, select_dtypes, and astype (ray-project#1941)
  [DataFrame] Implement diff (ray-project#1996)
  [DataFrame] Implemented nunique, skew (ray-project#1995)
  [DataFrame] Implements filter and dropna (ray-project#1959)
  [DataFrame] Implements df.pipe (ray-project#1999)
  [DataFrame] Apply() for Lists and Dicts (ray-project#1973)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants