Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DataFrame] Fix transpose with nan values and add functionality needed for Index #1545

Merged
merged 9 commits into from
Feb 21, 2018

Conversation

devin-petersohn
Copy link
Member

Resolves #1525 and adds new functionality for ray.Index object. Same as #1542.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3734/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3870/
Test PASSed.

@devin-petersohn
Copy link
Member Author

@robertnishihara This is ready for a review pass. It passes travis except for the valgrind, which shouldn't have been affected by this PR.

# TODO: Clean up later.
# We will call get only when we access the object (and only once).
self._lengths = \
ray.get([_deploy_func.remote(_get_lengths, d) for d in self._df])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option (perhaps for later) that might make sense is to always pass along the length along with the object IDs, so df would be a list of (Object_ID, length) pairs. This assumes that anytime you create one of these partition object IDs, you can easily compute the length.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we will probably need a way to either compute or set it. We will eventually have the exact same constructor as Pandas, so passing it won't be an option in the future.

I do think that most of the time we can easily compute the length on creation. I will think more about the optimization for this and we can review it later.

@@ -311,8 +333,11 @@ def transpose(self, *args, **kwargs):
"""
local_transpose = self._map_partitions(
lambda df: df.transpose(*args, **kwargs))

# print(ray.get(local_transpose._df))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove the print

# Because we sometimes have cases where we have summary statistics in our
# DataFrames
except TypeError:
return 0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be returning 0 or letting the exception happen?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it happens automatically, we need to suppress the error. There are cases where we don't have a __len__ or size in our DataFrame because we are wrapping anything that gets returned (e.g. sum). In the future, we will change this so there's no ambiguity with what a DataFrame can return.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it

"""
k = index.idx.keys()
if index.pandas_type is pd.RangeIndex:
return pd.RangeIndex(min(k), max(k)+1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think the linting will ask for max(k) + 1

"Length of index given does not match current dataframe")

return Index(
{pd_index[i]: dest_indices[i] for i in range(len(dest_indices))},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add two more spaces in lines 55 and 56

@@ -108,6 +111,8 @@ def test_keys(ray_df, pandas_df):

@pytest.fixture
def test_transpose(ray_df, pandas_df):
print("rd: ", rdf.to_pandas(ray_df.T))
print("pd: ", pandas_df.T)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably remove the two print statements

# Sum will collapse the NAs from the groupby
return local_transpose.reduce_by_index(lambda df: df.sum(), axis=1)
return local_transpose.reduce_by_index(
lambda df: df.apply(lambda x: x), axis=1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

two more spaces here

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3877/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3878/
Test PASSed.

@devin-petersohn
Copy link
Member Author

@robertnishihara I think this is good to get merged if you approve. This is blocking most of the other DataFrame methods.

@robertnishihara robertnishihara merged commit de6fa02 into ray-project:master Feb 21, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants