-
Notifications
You must be signed in to change notification settings - Fork 5.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a distributed Dataframe API to Ray #1330
Conversation
Can one of the admins verify this patch? |
I'll ping when the tests are in. Jenkins doesn't need to test until then. |
ok to test |
add to whitelist |
Not a big deal, but the "add to whitelist" command doesn't seem to work for me. I had to go and add @devin-petersohn manually to the jenkins config. cc @shaneknapp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Want to add some simple tests for this? E.g., make a file test/dataframe_test.py
and call it from .travis.yml
.
Note that most of our tests use unittest
but I think we'd like to switch to pytest
, so if you make a new test file, let's use pytest
.
python/ray/dataframe.py
Outdated
ray.register_custom_serializer(pd.core.indexes.base.Index, use_pickle=True) | ||
|
||
|
||
class DataFrame: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's do DataFrame(object)
to make sure things work well in Python 2.
python/ray/dataframe.py
Outdated
@@ -0,0 +1,467 @@ | |||
import pandas as pd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's include
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
at the top of every Python file (to make things consistent across Python 2 and 3).
python/ray/dataframe.py
Outdated
import numpy as np | ||
import ray | ||
|
||
ray.register_custom_serializer(pd.DataFrame, use_pickle=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is to work around #1305?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it turns out the pd.core.indexes.base.Index
was also giving problems.
python/ray/dataframe.py
Outdated
@@ -0,0 +1,467 @@ | |||
import pandas as pd | |||
import numpy as np | |||
import ray |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes sense to move the file to python/ray/dataframe/dataframe.py
. And to add a file python/ray/dataframe/__init__.py
that has some lines like
from .dataframe import DataFrame
Merged build finished. Test PASSed. |
Test PASSed. |
@@ -122,6 +122,7 @@ script: | |||
- python test/trial_runner_test.py | |||
- python test/trial_scheduler_test.py | |||
- python test/cython_test.py | |||
- python -m pytest test/dataframe.py |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Testing through pytest
.
Merged build finished. Test PASSed. |
Test PASSed. |
@devin-petersohn Can you take a look at the Travis test failures at https://travis-ci.org/ray-project/ray/builds/317618386?utm_source=github_status&utm_medium=notification? Looks like we need to install pandas on Travis. The files to modify are |
You will also need to solve the rebase conflict, otherwise Travis doesn't run. |
29c40b2
to
8f3f054
Compare
Build finished. Test PASSed. |
Test PASSed. |
Merged build finished. Test PASSed. |
Test PASSed. |
Merged build finished. Test PASSed. |
Test PASSed. |
Merged build finished. Test PASSed. |
Test PASSed. |
Merged build finished. Test PASSed. |
Test PASSed. |
Merged build finished. Test PASSed. |
Test PASSed. |
python/ray/dataframe/dataframe.py
Outdated
import ray | ||
|
||
ray.register_custom_serializer(pd.DataFrame, use_pickle=True) | ||
ray.register_custom_serializer(pd.core.indexes.base.Index, use_pickle=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
now that this is in __init__.py
, we can probably remove it from here
@@ -0,0 +1,8 @@ | |||
from .dataframe import DataFrame |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry about that! Please see above.
@@ -0,0 +1,177 @@ | |||
import pytest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also add the "from future" stuff here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added above.
There are a few listing errors: https://travis-ci.org/ray-project/ray/jobs/318924011 Do you want to fix the linting and then we get the PR merged and work on the TODOs you mention above in the next PR? |
Merged build finished. Test PASSed. |
Test PASSed. |
Merged build finished. Test PASSed. |
Test PASSed. |
What do these changes do?
Adds a Pandas DataFrame wrapper for use in Ray:
TODO
Add skeleton for full API withNotImplementedErrors
Docs in doc/sourceBenchmarkPush the last 3 to next PR.