Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DataFrame] Implement Indexer getitem #1955

Closed
wants to merge 2 commits into from

Conversation

simon-mo
Copy link
Contributor

What do these changes do?

  • Implement loc and iloc's getitem methods

Note

  • Tests passed in private repo.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5078/
Test PASSed.

Copy link
Member

@devin-petersohn devin-petersohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really great! I left some comments and questions.

@@ -69,7 +72,11 @@ def __init__(self, data=None, index=None, columns=None, dtype=None,
Metadata for the new dataframe's rows
col_metadata (_IndexMetadata):
Metadata for the new dataframe's columns
partial (boolean):
Internal: row_metadata and col_metadata only covers part of the
block partitions. (Used in index 'vew' accessor)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vew -> view.



@_inherit_docstrings(pd.DataFrame)
class DataFrame(object):

def __init__(self, data=None, index=None, columns=None, dtype=None,
copy=False, col_partitions=None, row_partitions=None,
block_partitions=None, row_metadata=None, col_metadata=None):
block_partitions=None, row_metadata=None, col_metadata=None,
partial=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make more sense to have a DataFrameView object as a subclass of this one?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I'm reading through this, I think it might make more sense to have it as a subclass. Changing the way that _block_partitions is handled in the case that it's a view makes complicated code more complicated.

Alternatively, if we decide otherwise, I would still like to see more comments about the _block_partitions changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree we should have it as a subclass. DataFrameView object make sense.

I'm refactoring parts of my indexing.py to make enlargement inside the parent class of the _LocIndexer and _iLocIndexer. So that we have cleaner code in indexing.py

@@ -103,25 +111,28 @@ def __init__(self, data=None, index=None, columns=None, dtype=None,
if block_partitions is not None:
# put in numpy array here to make accesses easier since it's 2D
self._block_partitions = np.array(block_partitions)
if row_metadata is not None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this in two places?

@@ -408,6 +434,10 @@ def dtypes(self):
Returns:
The dtypes for this DataFrame.
"""
# Deal with empty column case
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer comment to say Deal with a DataFrame with no columns or something along those lines. Empty columns feels a bit ambiguous (could mean all NaN in some cases).

"""
self.partial = partial
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

partial -> _partial

if is_2d(row_loc) and is_2d(col_loc):
return self._get_dataframe_view(row_loc, col_loc)

def _get_scaler(self, row_loc, col_loc):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you planning to implement these? It would be better to have a NotImplementedError than just pass. That way we don't have some silent internal error or something.

@@ -93,6 +100,7 @@ def __init__(self, data=None, index=None, columns=None, dtype=None,
axis = 0
columns = pd_df.columns
index = pd_df.index
self._row_metadata = self._col_metadata = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicated (Remove this in favor of Line 73)

retrieved_rows_remote = self._map_partition(
lookup_dict, col_label, indexer='loc')
joined_df = pd.concat(ray.get(retrieved_rows_remote))
def _get_scaler(self, row_loc, col_loc):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you be able to add some comments about what these methods do (this and methods below)? Just so that we can quickly look at the code and maintain it better.

# The returned result need to be indexed series/df
# Re-index is needed.
joined_df.index = index_loc.index
def _get_scaler(self, row_loc, col_loc):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same on this file about method level documentation.

@simon-mo
Copy link
Contributor Author

simon-mo commented May 5, 2018

@devin-petersohn I'll make the following change in the indexing.py, removing the triage redundancy:
iloc_refactor

@simon-mo
Copy link
Contributor Author

simon-mo commented May 9, 2018

Closed via #2020

@simon-mo simon-mo closed this May 9, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants