-
Notifications
You must be signed in to change notification settings - Fork 6.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DataFrame] Fix blocking issue on _IndexMetadata passing #1965
Conversation
Test PASSed. |
Test FAILed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be great to get numbers on the performance difference here.
python/ray/dataframe/dataframe.py
Outdated
|
||
# TODO: write explicit tests for "short and wide" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you planning to do this on this PR?
python/ray/dataframe/dataframe.py
Outdated
@@ -109,18 +108,18 @@ def __init__(self, data=None, index=None, columns=None, dtype=None, | |||
if row_partitions is not None: | |||
axis = 0 | |||
partitions = row_partitions | |||
bp_length = len(columns) if columns is not None else \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the bp
in bp_length
?
@@ -506,7 +514,8 @@ def add_prefix(self, prefix): | |||
new_cols = self.columns.map(lambda x: str(prefix) + str(x)) | |||
return DataFrame(block_partitions=self._block_partitions, | |||
columns=new_cols, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we get rid of this line since we only need the metadata now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't in this case, as discussed today. If we don't pass the columns then the metadata of the new dataframe won't have the column changes reflected. Therefore, we need to either copy the metadata and modify the copy and push the copy, or pass the new columns such that the constructor modifies the metadata object copy on its end.
@@ -517,7 +526,8 @@ def add_suffix(self, suffix): | |||
new_cols = self.columns.map(lambda x: str(x) + str(suffix)) | |||
return DataFrame(block_partitions=self._block_partitions, | |||
columns=new_cols, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above, can we get rid of this line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above.
Follow-up question: Can we make it so that |
Reference on performance numbers (these are off the top of my head, and testing on On
|
Test PASSed. |
Test PASSed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes look really great! Just a couple of minor quick comments.
python/ray/dataframe/utils.py
Outdated
d) | ||
@ray.remote | ||
def _build_col_widths(df_col): | ||
widths = np.array(ray.get([_deploy_func.remote(_get_widths, d) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add doc """Compute widths for each partition."""
def groupby(self, by=None, axis=0, level=None, as_index=True, sort=True, | ||
group_keys=True, squeeze=False, **kwargs): | ||
raise NotImplementedError() | ||
# TODO(patyang): This produces inconsistent indexes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you planning to add this fix in this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was actually silently fixed by the other changes, setting a new DF will implicitly change the index on the metadata object as well. Comment removed.
|
||
Args: | ||
index (pd.Index): Index to wrap. | ||
def __getitem__(self, key): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you write a comment about how to use __getitem__
? We have had people using it in incorrect ways and trying to make it work for them, so this way hopefully we can avoid that.
Jenkins, retest this please |
Test PASSed. |
Merged, thanks @Veryku |
* magic-methods: fmt Fix IndentationError Write magic methods for SampleBatch/PartialRollout Clean up syntax for supported Python versions. (ray-project#1963) [DataFrame] Implements mode, to_datetime, and get_dummies (ray-project#1956) [DataFrame] Fix dtypes (ray-project#1930) keep_dims -> keepdims (ray-project#1980) add pthread linking (ray-project#1986) [DataFrame] Add layer of abstraction to allow OID instantiation (ray-project#1984) [DataFrame] Fix blocking issue on _IndexMetadata passing (ray-project#1965)
What do these changes do?
Fixes a performance issue on passing
_IndexMetadata
objects for applymap and related functions that don't mutate indexes.