Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a distributed Dataframe API to Ray #1330

Merged
merged 17 commits into from
Dec 20, 2017

Conversation

devin-petersohn
Copy link
Member

@devin-petersohn devin-petersohn commented Dec 16, 2017

What do these changes do?

Adds a Pandas DataFrame wrapper for use in Ray:

  • Distributes the DataFrame.
  • Operates on the DataFrame partitions in remote tasks.
  • Currently provides some simple functionality.

TODO

  • Add/Format unit tests in test/dataframe.py
  • Clean up a couple of quick hacks
  • Add skeleton for full API with NotImplementedErrors
  • Docs in doc/source
  • Benchmark

Push the last 3 to next PR.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@devin-petersohn
Copy link
Member Author

I'll ping when the tests are in. Jenkins doesn't need to test until then.

@pcmoritz
Copy link
Contributor

ok to test

@robertnishihara
Copy link
Collaborator

add to whitelist

@robertnishihara
Copy link
Collaborator

Not a big deal, but the "add to whitelist" command doesn't seem to work for me. I had to go and add @devin-petersohn manually to the jenkins config.

cc @shaneknapp

Copy link
Collaborator

@robertnishihara robertnishihara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Want to add some simple tests for this? E.g., make a file test/dataframe_test.py and call it from .travis.yml.

Note that most of our tests use unittest but I think we'd like to switch to pytest, so if you make a new test file, let's use pytest.

ray.register_custom_serializer(pd.core.indexes.base.Index, use_pickle=True)


class DataFrame:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's do DataFrame(object) to make sure things work well in Python 2.

@@ -0,0 +1,467 @@
import pandas as pd
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's include

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

at the top of every Python file (to make things consistent across Python 2 and 3).

import numpy as np
import ray

ray.register_custom_serializer(pd.DataFrame, use_pickle=True)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is to work around #1305?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it turns out the pd.core.indexes.base.Index was also giving problems.

@@ -0,0 +1,467 @@
import pandas as pd
import numpy as np
import ray
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense to move the file to python/ray/dataframe/dataframe.py. And to add a file python/ray/dataframe/__init__.py that has some lines like

from .dataframe import DataFrame

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2822/
Test PASSed.

@@ -122,6 +122,7 @@ script:
- python test/trial_runner_test.py
- python test/trial_scheduler_test.py
- python test/cython_test.py
- python -m pytest test/dataframe.py
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing through pytest.

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2824/
Test PASSed.

@robertnishihara
Copy link
Collaborator

robertnishihara commented Dec 19, 2017

@devin-petersohn Can you take a look at the Travis test failures at https://travis-ci.org/ray-project/ray/builds/317618386?utm_source=github_status&utm_medium=notification?

Looks like we need to install pandas on Travis. The files to modify are doc/requirements-doc.txt and .travis/install-dependencies.sh.

@pcmoritz
Copy link
Contributor

You will also need to solve the rebase conflict, otherwise Travis doesn't run.

@AmplabJenkins
Copy link

Build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2838/
Test PASSed.

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2840/
Test PASSed.

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2844/
Test PASSed.

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2846/
Test PASSed.

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2847/
Test PASSed.

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2855/
Test PASSed.

import ray

ray.register_custom_serializer(pd.DataFrame, use_pickle=True)
ray.register_custom_serializer(pd.core.indexes.base.Index, use_pickle=True)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now that this is in __init__.py, we can probably remove it from here

@@ -0,0 +1,8 @@
from .dataframe import DataFrame
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about that! Please see above.

@@ -0,0 +1,177 @@
import pytest
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also add the "from future" stuff here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added above.

@pcmoritz
Copy link
Contributor

There are a few listing errors: https://travis-ci.org/ray-project/ray/jobs/318924011

Do you want to fix the linting and then we get the PR merged and work on the TODOs you mention above in the next PR?

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2856/
Test PASSed.

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2857/
Test PASSed.

@pcmoritz pcmoritz changed the title [WIP] Add a distributed Dataframe API to Ray Add a distributed Dataframe API to Ray Dec 20, 2017
@pcmoritz pcmoritz self-requested a review December 20, 2017 17:30
@pcmoritz pcmoritz merged commit a75a473 into ray-project:master Dec 20, 2017
@simon-mo simon-mo mentioned this pull request Jan 31, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants