-
Notifications
You must be signed in to change notification settings - Fork 5.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DataFrame] Fix transpose with nan values and add functionality needed for Index #1545
[DataFrame] Fix transpose with nan values and add functionality needed for Index #1545
Conversation
Test PASSed. |
c6d20fd
to
d371726
Compare
Test PASSed. |
@robertnishihara This is ready for a review pass. It passes travis except for the valgrind, which shouldn't have been affected by this PR. |
# TODO: Clean up later. | ||
# We will call get only when we access the object (and only once). | ||
self._lengths = \ | ||
ray.get([_deploy_func.remote(_get_lengths, d) for d in self._df]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another option (perhaps for later) that might make sense is to always pass along the length along with the object IDs, so df
would be a list of (Object_ID, length)
pairs. This assumes that anytime you create one of these partition object IDs, you can easily compute the length.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we will probably need a way to either compute or set it. We will eventually have the exact same constructor as Pandas, so passing it won't be an option in the future.
I do think that most of the time we can easily compute the length on creation. I will think more about the optimization for this and we can review it later.
python/ray/dataframe/dataframe.py
Outdated
@@ -311,8 +333,11 @@ def transpose(self, *args, **kwargs): | |||
""" | |||
local_transpose = self._map_partitions( | |||
lambda df: df.transpose(*args, **kwargs)) | |||
|
|||
# print(ray.get(local_transpose._df)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove the print
# Because we sometimes have cases where we have summary statistics in our | ||
# DataFrames | ||
except TypeError: | ||
return 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we be returning 0 or letting the exception happen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since it happens automatically, we need to suppress the error. There are cases where we don't have a __len__
or size
in our DataFrame
because we are wrapping anything that gets returned (e.g. sum
). In the future, we will change this so there's no ambiguity with what a DataFrame
can return.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it
python/ray/dataframe/index.py
Outdated
""" | ||
k = index.idx.keys() | ||
if index.pandas_type is pd.RangeIndex: | ||
return pd.RangeIndex(min(k), max(k)+1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think the linting will ask for max(k) + 1
python/ray/dataframe/index.py
Outdated
"Length of index given does not match current dataframe") | ||
|
||
return Index( | ||
{pd_index[i]: dest_indices[i] for i in range(len(dest_indices))}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add two more spaces in lines 55 and 56
@@ -108,6 +111,8 @@ def test_keys(ray_df, pandas_df): | |||
|
|||
@pytest.fixture | |||
def test_transpose(ray_df, pandas_df): | |||
print("rd: ", rdf.to_pandas(ray_df.T)) | |||
print("pd: ", pandas_df.T) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably remove the two print statements
python/ray/dataframe/dataframe.py
Outdated
# Sum will collapse the NAs from the groupby | ||
return local_transpose.reduce_by_index(lambda df: df.sum(), axis=1) | ||
return local_transpose.reduce_by_index( | ||
lambda df: df.apply(lambda x: x), axis=1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
two more spaces here
Test PASSed. |
Test PASSed. |
@robertnishihara I think this is good to get merged if you approve. This is blocking most of the other DataFrame methods. |
Resolves #1525 and adds new functionality for
ray.Index
object. Same as #1542.