Skip to content

Conversation

@srinathk10
Copy link
Contributor

@srinathk10 srinathk10 commented Nov 21, 2025

Thank you for contributing to Ray! 🚀
Please review the Ray Contribution Guide before opening a pull request.

⚠️ Remove these instructions before submitting your PR.

💡 Tip: Mark as draft if you want early feedback, or ready for review when it's complete.

Description

Briefly describe what this PR accomplishes and why it's needed.

[Data] Simplify ArrowBlock and PandasBlock

Simplify inheritance hierarchy for ArrowBlock and PandasBlock by removing TableRow to improve code maintainability.

Related issues

Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234".

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
@srinathk10 srinathk10 requested a review from a team as a code owner November 21, 2025 08:27
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request simplifies ArrowBlock and PandasBlock by removing the TableRow base class and moving its functionality directly into ArrowRow and PandasRow, or into helper functions. This is a good refactoring that reduces complexity and improves code clarity. The changes are consistent across the modified files. I've identified a couple of minor issues with incorrect type hints that should be addressed to enhance maintainability.

@staticmethod
def _build_tensor_row(
row: ArrowRow, col_name: str = TENSOR_COLUMN_NAME
row: ArrowRow, row_idx: int, col_name: str = TENSOR_COLUMN_NAME
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The type hint for the row parameter is ArrowRow, but it appears to be incorrect. The only call site in this file, ArrowRow.__getitem__, passes self._row, which is a pyarrow.Table. If row were an ArrowRow instance, the expression row[col_name] within this method would trigger a recursive call to ArrowRow.__getitem__. To improve type safety and clarity, please update the type hint to row: "pyarrow.Table".

Suggested change
row: ArrowRow, row_idx: int, col_name: str = TENSOR_COLUMN_NAME
row: "pyarrow.Table", row_idx: int, col_name: str = TENSOR_COLUMN_NAME


@staticmethod
def _build_tensor_row(row: PandasRow) -> np.ndarray:
def _build_tensor_row(row: PandasRow, row_idx: int) -> np.ndarray:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The type hint for the row parameter is PandasRow, which seems incorrect. If row were a PandasRow instance, row[TENSOR_COLUMN_NAME] would call PandasRow.__getitem__, which returns a tuple. Calling .iloc[row_idx] on a tuple would raise an error. The row parameter is likely expected to be a pandas.DataFrame. Please update the type hint to row: "pandas.DataFrame" for correctness and clarity.

Suggested change
def _build_tensor_row(row: PandasRow, row_idx: int) -> np.ndarray:
def _build_tensor_row(row: "pandas.DataFrame", row_idx: int) -> np.ndarray:

@srinathk10 srinathk10 added the go add ONLY when ready to merge, run all tests label Nov 21, 2025
@ray-gardener ray-gardener bot added the data Ray Data-related issues label Nov 21, 2025
Copy link
Member

@bveeramani bveeramani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@bveeramani bveeramani merged commit 9b39965 into master Nov 24, 2025
6 checks passed
@bveeramani bveeramani deleted the srinathk10/optimize_arrow_iter_rows branch November 24, 2025 17:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

data Ray Data-related issues go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Ray fails to serialize self-reference objects

3 participants