-
Notifications
You must be signed in to change notification settings - Fork 6.9k
[Data] Simplify ArrowBlock and PandasBlock #58883
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request simplifies ArrowBlock and PandasBlock by removing the TableRow base class and moving its functionality directly into ArrowRow and PandasRow, or into helper functions. This is a good refactoring that reduces complexity and improves code clarity. The changes are consistent across the modified files. I've identified a couple of minor issues with incorrect type hints that should be addressed to enhance maintainability.
| @staticmethod | ||
| def _build_tensor_row( | ||
| row: ArrowRow, col_name: str = TENSOR_COLUMN_NAME | ||
| row: ArrowRow, row_idx: int, col_name: str = TENSOR_COLUMN_NAME |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The type hint for the row parameter is ArrowRow, but it appears to be incorrect. The only call site in this file, ArrowRow.__getitem__, passes self._row, which is a pyarrow.Table. If row were an ArrowRow instance, the expression row[col_name] within this method would trigger a recursive call to ArrowRow.__getitem__. To improve type safety and clarity, please update the type hint to row: "pyarrow.Table".
| row: ArrowRow, row_idx: int, col_name: str = TENSOR_COLUMN_NAME | |
| row: "pyarrow.Table", row_idx: int, col_name: str = TENSOR_COLUMN_NAME |
|
|
||
| @staticmethod | ||
| def _build_tensor_row(row: PandasRow) -> np.ndarray: | ||
| def _build_tensor_row(row: PandasRow, row_idx: int) -> np.ndarray: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The type hint for the row parameter is PandasRow, which seems incorrect. If row were a PandasRow instance, row[TENSOR_COLUMN_NAME] would call PandasRow.__getitem__, which returns a tuple. Calling .iloc[row_idx] on a tuple would raise an error. The row parameter is likely expected to be a pandas.DataFrame. Please update the type hint to row: "pandas.DataFrame" for correctness and clarity.
| def _build_tensor_row(row: PandasRow, row_idx: int) -> np.ndarray: | |
| def _build_tensor_row(row: "pandas.DataFrame", row_idx: int) -> np.ndarray: |
bveeramani
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Description
[Data] Simplify ArrowBlock and PandasBlock
Simplify inheritance hierarchy for
ArrowBlockandPandasBlockby removingTableRowto improve code maintainability.Related issues
Additional information