Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DataFrame] Fix blocking issue on _IndexMetadata passing #1965

Merged
merged 8 commits into from
May 2, 2018

Conversation

p-yang
Copy link
Contributor

@p-yang p-yang commented Apr 28, 2018

What do these changes do?

Fixes a performance issue on passing _IndexMetadata objects for applymap and related functions that don't mutate indexes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5108/
Test PASSed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5112/
Test FAILed.

Copy link
Member

@devin-petersohn devin-petersohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great to get numbers on the performance difference here.


# TODO: write explicit tests for "short and wide"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you planning to do this on this PR?

@@ -109,18 +108,18 @@ def __init__(self, data=None, index=None, columns=None, dtype=None,
if row_partitions is not None:
axis = 0
partitions = row_partitions
bp_length = len(columns) if columns is not None else \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the bp in bp_length?

@@ -506,7 +514,8 @@ def add_prefix(self, prefix):
new_cols = self.columns.map(lambda x: str(prefix) + str(x))
return DataFrame(block_partitions=self._block_partitions,
columns=new_cols,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we get rid of this line since we only need the metadata now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't in this case, as discussed today. If we don't pass the columns then the metadata of the new dataframe won't have the column changes reflected. Therefore, we need to either copy the metadata and modify the copy and push the copy, or pass the new columns such that the constructor modifies the metadata object copy on its end.

@@ -517,7 +526,8 @@ def add_suffix(self, suffix):
new_cols = self.columns.map(lambda x: str(x) + str(suffix))
return DataFrame(block_partitions=self._block_partitions,
columns=new_cols,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, can we get rid of this line?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

@devin-petersohn
Copy link
Member

Follow-up question: Can we make it so that index is cached in this PR?

@p-yang
Copy link
Contributor Author

p-yang commented Apr 29, 2018

Reference on performance numbers (these are off the top of my head, and testing on c69 yielded results with high variance):

On c69 limited to 8 ray workers, 5GB int64 CSV df.isna:

  1. With 1x6 partitions: pre-change 300ms, post-change 182ms
  2. With 2x6 partitions: pre-change 600ms, post-change 186ms

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5118/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5119/
Test PASSed.

Copy link
Member

@devin-petersohn devin-petersohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes look really great! Just a couple of minor quick comments.

d)
@ray.remote
def _build_col_widths(df_col):
widths = np.array(ray.get([_deploy_func.remote(_get_widths, d)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add doc """Compute widths for each partition."""

def groupby(self, by=None, axis=0, level=None, as_index=True, sort=True,
group_keys=True, squeeze=False, **kwargs):
raise NotImplementedError()
# TODO(patyang): This produces inconsistent indexes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you planning to add this fix in this PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was actually silently fixed by the other changes, setting a new DF will implicitly change the index on the metadata object as well. Comment removed.


Args:
index (pd.Index): Index to wrap.
def __getitem__(self, key):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you write a comment about how to use __getitem__? We have had people using it in incorrect ways and trying to make it work for them, so this way hopefully we can avoid that.

@devin-petersohn
Copy link
Member

Jenkins, retest this please

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5141/
Test PASSed.

@devin-petersohn devin-petersohn merged commit 5589426 into ray-project:master May 2, 2018
@devin-petersohn
Copy link
Member

Merged, thanks @Veryku

alok added a commit to alok/ray that referenced this pull request May 6, 2018
* magic-methods:
  fmt
  Fix IndentationError
  Write magic methods for SampleBatch/PartialRollout
  Clean up syntax for supported Python versions. (ray-project#1963)
  [DataFrame] Implements mode, to_datetime, and get_dummies (ray-project#1956)
  [DataFrame] Fix dtypes (ray-project#1930)
  keep_dims -> keepdims (ray-project#1980)
  add pthread linking (ray-project#1986)
  [DataFrame] Add layer of abstraction to allow OID instantiation (ray-project#1984)
  [DataFrame] Fix blocking issue on _IndexMetadata passing (ray-project#1965)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants