Skip to content

Conversation

@myandpr
Copy link
Member

@myandpr myandpr commented Jan 21, 2026

Description

PR2: Make logical operators frozen dataclasses

Issue goal:

Make LogicalOperator immutable and comparable to avoid cross-dataset contamination.

The issue is split into two PRs for easier review.

This PR focuses on:

converting the LogicalOperator hierarchy to frozen dataclasses and making updates immutable via dataclasses.replace.

Related issues

Link related issues: "Fixes #60312 ", "Closes #60312 ", or "Related to #60312 ".

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

Implementation details

  • all logical operators are now @dataclass(frozen=True);
  • setters/in-place updates become “return-new” methods (e.g., with_per_block_limit, with_detected_parallelism);
  • DAG transforms use dataclasses.replace;
  • __repr__/dag_str are preserved to keep explain output stable;
  • inherit_batch_format only applies replace on ops that actually define _batch_format; Read.detected_parallelism is propagated immutably and the physical planning stage sets the input factory without mutating the logical op.

API changes

logical operators are now immutable; external behavior unchanged; logical layer no longer maintains _output_dependencies.

Tests:

logical optimization, planning, and operator-fusion related tests were exercised (see test notes).

@myandpr myandpr requested a review from a team as a code owner January 21, 2026 11:17
@myandpr myandpr changed the title [Data] [1/n]Make LogicalPlan stateless and comparable [Data] [2/n]Make LogicalPlan stateless and comparable Jan 21, 2026
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request is a large-scale refactoring to make logical operators stateless and comparable by converting them to frozen dataclasses. The changes are consistent and well-executed, replacing mutable patterns with immutable ones using dataclasses.replace. This is a great improvement for the robustness of the logical plan representation. I've found one critical issue with the initialization pattern used for the new dataclasses that could lead to partially initialized objects.

@myandpr myandpr force-pushed the pr2-logical-operator-dataclass branch from e8b4a51 to d453410 Compare January 21, 2026 12:11
@ray-gardener ray-gardener bot added data Ray Data-related issues community-contribution Contributed by the community labels Jan 21, 2026
@myandpr
Copy link
Member Author

myandpr commented Jan 21, 2026

re-trigger ci

just note: local ut success, and CI failure might be due to environment flakiness

@myandpr
Copy link
Member Author

myandpr commented Jan 21, 2026

ut python/ray/train/v2/tests/test_placement_group_handle.py::test_slice_handle_shutdown still flake about //python/ray/train/v2:test_placement_group_handle because of timeout.

re-trigger ci.

Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Comment on lines 21 to 30
_name: str
_input_dependencies: Tuple["LogicalOperator", ...]
_num_outputs: Optional[int]

def __init__(
self,
name: str,
input_dependencies: List["LogicalOperator"],
name: Optional[str] = None,
input_dependencies: Optional[List["LogicalOperator"]] = None,
num_outputs: Optional[int] = None,
**kwargs: Any,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than introducing this boilerplate where the dataclass fields are prefixed with an underscore and the constructor parameters aren't, can we (in maybe an isolated PR) rename the attributes to match the parameters (i.e., remove the underscores)?

Signed-off-by: yaommen <myanstu@163.com>
Signed-off-by: yaommen <myanstu@163.com>
@myandpr myandpr force-pushed the pr2-logical-operator-dataclass branch from 0970212 to 0e4a56f Compare January 24, 2026 20:51
@myandpr
Copy link
Member Author

myandpr commented Jan 24, 2026

Change Summary — for easier review

  • Goal: per Issue [Data] Make LogicalPlan stateless and comparable #60312, make LogicalOperator and implementations frozen dataclasses, and use dataclasses.replace for immutable updates to avoid in‑place mutation
    and shared side effects.

  • Main changes (by direction + file scope):

    1. Stop using _input_dependencies; use LogicalOperator.input_dependencies everywhere
      • python/ray/data/_internal/logical/interfaces/logical_operator.py
      • python/ray/data/_internal/logical/operators/*_operator.py
      • python/ray/data/_internal/logical/rules/*.py
      • python/ray/data/_internal/planner/*.py
    2. Make logical operator fields public (e.g., limit / batch_format / ray_remote_args)
      • python/ray/data/_internal/logical/interfaces/logical_operator.py
      • python/ray/data/_internal/logical/operators/*_operator.py
      • Call‑site updates: python/ray/data/dataset.py, python/ray/data/grouped_data.py, python/ray/data/read_api.py
    3. Remove output_dependencies from the logical layer (no reverse-edge maintenance)
      • python/ray/data/_internal/logical/interfaces/logical_operator.py
      • python/ray/data/_internal/logical/rules/*, python/ray/data/_internal/logical/util.py
    4. Other updates: switch construction/updates to keyword args + replace, reduce handwritten init
      • logical operators: python/ray/data/_internal/logical/operators/*_operator.py
      • rules/planner: python/ray/data/_internal/logical/rules/, python/ray/data/_internal/planner/
      • tests: python/ray/data/tests/*
  • Why the change is large:
    This is an API‑level migration. Once LogicalOperator becomes a frozen dataclass with public attrs, all construction and access sites must be updated, otherwise they break.
    I kept the changes scoped to “field names / construction / replace” and avoided any new functionality or extra refactors.

  • Additional Notes (clarifying a few non‑obvious choices)

    • Why not use dataclasses(kw_only=True): Ray still supports Python 3.9, so kw_only isn’t available. Instead, I kept auto init and standardized all logical operator construction to keyword args to avoid positional arguments being captured by name/input_dependencies/num_outputs.
    • Why force input_op=... / fn=... keyword construction: it’s the safest pattern with public attrs + frozen dataclass, and prevents silent mis‑binding if field order changes.
    • ......

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution Contributed by the community data Ray Data-related issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Data] Make LogicalPlan stateless and comparable

2 participants