Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DataFrame] Implement where #1989

Merged
merged 9 commits into from
May 9, 2018

Conversation

devin-petersohn
Copy link
Member

@devin-petersohn devin-petersohn commented May 3, 2018

Implement DataFrame.where.

Still needs:

  • Fix where when Series objects are passed in as other.
  • Add sanity tests
  • Better Error checking

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5156/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5160/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5265/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5266/
Test PASSed.

Copy link
Contributor

@kunalgosar kunalgosar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! A few comments.

@@ -108,10 +108,11 @@ def to_pandas(df):
A new pandas DataFrame.
"""
if df._row_partitions is not None:
pd_df = pd.concat(ray.get(df._row_partitions))
print("Yes")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this

other_zipped = (v for k, v in self._copartition(other,
self.index))

new_partitions = [where_helper.remote(k, v, next(other_zipped),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can k, v be converted to lists and passed in by reference? Ray will automatically deserialize then.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not without merging them together, then also passing the length of the left. Performance-wise it's not much different.

# from blocks and the axes are set according to the blocks. We have
# already correctly copartitioned everything, so there's no
# correctness problems with doing this.
left.reset_index(inplace=True, drop=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since everything is concatenated into row partitions, can you only reset the column index?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to reset the index here because that's what other is relying on.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5270/
Test FAILed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5273/
Test PASSed.

args = (False, axis, level, errors, try_cast, raise_on_error)

@ray.remote
def where_helper(left, cond, other, left_columns, cond_columns,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's dangerous defining a remote function inside of a method call, because this will define a new remote function every time the method is called. Currently this is a bit heavyweight. We probably want to move this outside.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah, thanks. I had this in during development and forgot to move it. Sorry about that!

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5281/
Test PASSed.

@robertnishihara robertnishihara merged commit 72a3a6c into ray-project:master May 9, 2018
alok added a commit to alok/ray that referenced this pull request May 11, 2018
* master:
  [DataFrame] Implement where (ray-project#1989)
  [DataFrame] Add direct pandas imports for MVP (ray-project#1960)
  Make ActorHandles pickleable, also make proper ActorHandle and ActorC… (ray-project#2007)
alok added a commit to alok/ray that referenced this pull request May 11, 2018
* master:
  [xray] Fix UniqueID hashing for object and task IDs. (ray-project#2017)
  [DataFrame] Fixing bugs in groupby (ray-project#2031)
  [DataFrame] Fixes dropna subset bug (ray-project#2018)
  [DataFrame] Implement where (ray-project#1989)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants