[DataFrame] Implementing API correct groupby with aggregation methods #1914

Merged (38 commits) on Apr 22, 2018

devin-petersohn
Member

What do these changes do?

Adds groupby and allows users to interact with the GroupBy object the same way they would in Pandas.

  • groupby implementation
  • DataFrameGroupBy object
  • agg / aggregate / apply for non-dictionary arguments
  • Some performance improvements overall.

@devin-petersohn changed the title from "Implementing API correct groupby with aggregation methods" to "[DataFrame] Implementing API correct groupby with aggregation methods" on Apr 17, 2018
@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4968/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4969/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4970/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4985/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4989/

@robertnishihara
Collaborator

There seem to be some test failures, e.g.,

____________ ERROR collecting ray/dataframe/test/test_dataframe.py _____________
../../../.local/lib/python2.7/site-packages/pytest-3.5.0-py2.7.egg/_pytest/python.py:411: in _importtestmodule
    mod = self.fspath.pyimport(ensuresyspath=importmode)
../../../.local/lib/python2.7/site-packages/py-1.5.3-py2.7.egg/py/_path/local.py:668: in pyimport
    __import__(modname)
../../../.local/lib/python2.7/site-packages/pytest-3.5.0-py2.7.egg/_pytest/assertion/rewrite.py:213: in load_module
    py.builtin.exec_(co, mod.__dict__)
python/ray/dataframe/test/test_dataframe.py:9: in <module>
    import ray.dataframe as rdf
../../../.local/lib/python2.7/site-packages/ray-0.4.0-py2.7-linux-x86_64.egg/ray/dataframe/__init__.py:31: in <module>
    from .dataframe import DataFrame  # noqa: 402
../../../.local/lib/python2.7/site-packages/ray-0.4.0-py2.7-linux-x86_64.egg/ray/dataframe/dataframe.py:29: in <module>
    from .groupby import DataFrameGroupBy
E     File "/home/travis/.local/lib/python2.7/site-packages/ray-0.4.0-py2.7-linux-x86_64.egg/ray/dataframe/groupby.py", line 48
E       *part),
E       ^
E   SyntaxError: invalid syntax

sort,
group_keys,
squeeze,
*part),

Collaborator

Looks like a Python 2 issue. You may need to do

                                       args=(by,
                                             axis,
                                             level,
                                             as_index,
                                             sort,
                                             group_keys,
                                             squeeze) + part

or something like that.
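
For context on the suggestion above: star-unpacking inside a tuple display, as in `(a, b, *part)`, is only valid syntax from Python 3.5 onward (PEP 448), while tuple concatenation parses on both Python 2 and Python 3. The following is a minimal sketch, not the code from this PR; `build_args` and the sample values are hypothetical, and it assumes `part` may arrive as either a list or a tuple.

```python
def build_args(by, axis, level, as_index, sort, group_keys, squeeze, part):
    """Return the fixed groupby arguments followed by the partition entries."""
    fixed = (by, axis, level, as_index, sort, group_keys, squeeze)
    # tuple(part) guards against `part` being a list, since
    # tuple + list raises TypeError.
    return fixed + tuple(part)


# Hypothetical values standing in for the real groupby arguments.
args = build_args("col", 0, None, True, True, True, False,
                  ["block_0", "block_1"])
```

If `part` is already guaranteed to be a tuple, the plain `(...) + part` form from the comment is enough.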

Collaborator

I can give this a try.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5014/

"To contribute to Pandas on Ray, please visit "
"github.com/ray-project/ray.")
elif is_list_like(arg):
    from .concat import concat

Contributor

Put this import with the other imports.

Member Author

Cyclical import won't allow this.
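
The function-level import being discussed is a common way to break an import cycle: the import runs only when the function is called, by which time both modules have finished loading. The sketch below is an illustration under assumptions, not the PR's code. The module names `demo_frame` and `demo_concat` are hypothetical stand-ins, and it assumes the cycle arises because the concat module imports the DataFrame class at its top level, which this thread does not show.

```python
import importlib
import os
import sys
import tempfile

# Hypothetical stand-in for dataframe.py.
FRAME_SRC = '''
class Frame(object):
    def __init__(self, rows):
        self.rows = list(rows)

    def combine(self, other):
        # Deferred import: demo_concat imports this module at its top level,
        # so importing it at the top of this file would create a cycle.
        from demo_concat import concat
        return concat([self, other])
'''

# Hypothetical stand-in for concat.py.
CONCAT_SRC = '''
from demo_frame import Frame


def concat(frames):
    rows = []
    for frame in frames:
        rows.extend(frame.rows)
    return Frame(rows)
'''

# Write both modules to a temporary directory and make it importable.
module_dir = tempfile.mkdtemp()
for name, source in (("demo_frame.py", FRAME_SRC),
                     ("demo_concat.py", CONCAT_SRC)):
    with open(os.path.join(module_dir, name), "w") as handle:
        handle.write(source)

sys.path.insert(0, module_dir)
demo_frame = importlib.import_module("demo_frame")

# demo_concat is imported only here, when combine() first runs.
combined = demo_frame.Frame([1, 2]).combine(demo_frame.Frame([3]))
print(combined.rows)
```

The trade-off is the one the reviewer points at: the import is harder to find than one at the top of the file, so a short comment explaining the cycle helps.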

else:
    kwargs['temp_index'] = self.index

def remote_helper(df, arg, *args, **kwargs):

Contributor

More descriptive names would be helpful here for readability.


# This magic unzips the list comprehension returned from remote
is_series, new_parts, index, columns = \
    [list(t) for t in zip(*remote_result)]

Contributor

Do you need each variable in list form? zip should allow auto-unboxing.

Member Author

We do need most of them in list form. I'll go ahead and change it.

Member Author

We actually do need them all in lists because of the ray.get() call.
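
To make the exchange above concrete: `zip(*rows)` transposes a list of per-partition tuples into one tuple per field, and the list comprehension converts each of those tuples into a list. This is a minimal sketch with made-up data; the strings stand in for whatever the remote calls actually return, and it does not call Ray.

```python
# Hypothetical per-partition results: (is_series, part, index, columns).
remote_result = [
    (True, "part-0", "index-0", "columns-0"),
    (True, "part-1", "index-1", "columns-1"),
    (False, "part-2", "index-2", "columns-2"),
]

# Plain unpacking works, but every field comes back as a tuple.
as_tuples = list(zip(*remote_result))

# Wrapping each field in list() gives lists, the form the author says
# is needed before passing the values to ray.get().
is_series, new_parts, index, columns = [list(t) for t in zip(*remote_result)]
```

Here `as_tuples[0]` is a tuple while `is_series` is a list holding the same three booleans.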

# DataFrame, and we have to determine which here. Shouldn't add
# too much to latency in either case because the booleans can
# be returned immediately
is_series = ray.get(is_series)

Contributor

Wouldn't getting the booleans require a block on the rest of the parameters being calculated anyways?

Member Author

The (de)serialization should be faster, so it should be ready sooner (given a large enough Series).

# return DataFrames
elif any(is_series):
    raise ValueError("no results.")
elif axis == 0:

Contributor

What's different between the last two cases in this if statement?

Member Author

I will add better comments.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5021/

@robertnishihara robertnishihara merged commit 8f59546 into ray-project:master Apr 22, 2018
alok added a commit to alok/ray that referenced this pull request Apr 28, 2018
royf added a commit to royf/ray that referenced this pull request Jun 22, 2018
5 participants