-
Notifications
You must be signed in to change notification settings - Fork 6.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DataFrame] Implement Indexer getitem #1955
Conversation
Test PASSed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really great! I left some comments and questions.
@@ -69,7 +72,11 @@ def __init__(self, data=None, index=None, columns=None, dtype=None, | |||
Metadata for the new dataframe's rows | |||
col_metadata (_IndexMetadata): | |||
Metadata for the new dataframe's columns | |||
partial (boolean): | |||
Internal: row_metadata and col_metadata only covers part of the | |||
block partitions. (Used in index 'vew' accessor) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
vew
-> view
.
|
||
|
||
@_inherit_docstrings(pd.DataFrame) | ||
class DataFrame(object): | ||
|
||
def __init__(self, data=None, index=None, columns=None, dtype=None, | ||
copy=False, col_partitions=None, row_partitions=None, | ||
block_partitions=None, row_metadata=None, col_metadata=None): | ||
block_partitions=None, row_metadata=None, col_metadata=None, | ||
partial=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make more sense to have a DataFrameView
object as a subclass of this one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I'm reading through this, I think it might make more sense to have it as a subclass. Changing the way that _block_partitions
is handled in the case that it's a view makes complicated code more complicated.
Alternatively, if we decide otherwise, I would still like to see more comments about the _block_partitions
changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree we should have it as a subclass. DataFrameView
object make sense.
I'm refactoring parts of my indexing.py
to make enlargement inside the parent class of the _LocIndexer
and _iLocIndexer
. So that we have cleaner code in indexing.py
@@ -103,25 +111,28 @@ def __init__(self, data=None, index=None, columns=None, dtype=None, | |||
if block_partitions is not None: | |||
# put in numpy array here to make accesses easier since it's 2D | |||
self._block_partitions = np.array(block_partitions) | |||
if row_metadata is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this in two places?
@@ -408,6 +434,10 @@ def dtypes(self): | |||
Returns: | |||
The dtypes for this DataFrame. | |||
""" | |||
# Deal with empty column case |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prefer comment to say Deal with a DataFrame with no columns
or something along those lines. Empty columns feels a bit ambiguous (could mean all NaN in some cases).
""" | ||
self.partial = partial |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
partial
-> _partial
if is_2d(row_loc) and is_2d(col_loc): | ||
return self._get_dataframe_view(row_loc, col_loc) | ||
|
||
def _get_scaler(self, row_loc, col_loc): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you planning to implement these? It would be better to have a NotImplementedError
than just pass
. That way we don't have some silent internal error or something.
@@ -93,6 +100,7 @@ def __init__(self, data=None, index=None, columns=None, dtype=None, | |||
axis = 0 | |||
columns = pd_df.columns | |||
index = pd_df.index | |||
self._row_metadata = self._col_metadata = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duplicated (Remove this in favor of Line 73)
retrieved_rows_remote = self._map_partition( | ||
lookup_dict, col_label, indexer='loc') | ||
joined_df = pd.concat(ray.get(retrieved_rows_remote)) | ||
def _get_scaler(self, row_loc, col_loc): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you be able to add some comments about what these methods do (this and methods below)? Just so that we can quickly look at the code and maintain it better.
# The returned result need to be indexed series/df | ||
# Re-index is needed. | ||
joined_df.index = index_loc.index | ||
def _get_scaler(self, row_loc, col_loc): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same on this file about method level documentation.
@devin-petersohn I'll make the following change in the |
Closed via #2020 |
What do these changes do?
loc
andiloc
's getitem methodsNote