[FLINK-39176][runtime] Introduce NodeHealthManager abstraction #27701

Open
featzhang wants to merge 1 commit into apache:master from
featzhang:feature/FLINK-39176-node-health-manager-abstraction

@featzhang

What is the purpose of the change

This PR introduces the NodeHealthManager abstraction layer as the first phase of implementing a generic blacklist mechanism for compute nodes in Flink. The abstraction allows pluggable node health management strategies while maintaining backward compatibility by defaulting to a No-Op implementation.

Brief change log

  • Introduced NodeHealthManager interface to define the contract for node health management
  • Added NodeHealthStatus data class to represent node health information
  • Implemented NoOpNodeHealthManager that treats all nodes as healthy (default behavior)
  • Implemented DefaultNodeHealthManager that manages node health states using ConcurrentHashMap
  • Integrated NodeHealthManager into ResourceManager as a member variable
  • Added comprehensive unit tests in NodeHealthManagerTest

Verifying this change

  • Added NodeHealthManagerTest with 9 test cases covering all core functionalities

  • Tests verify:

    • No-Op implementation always returns healthy status
    • Default implementation correctly manages node health states
    • Concurrent access scenarios
    • Health status retrieval and updates
  • All unit tests pass successfully (13 tests)
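
For reviewers who want a feel for the concurrent-access scenario listed above, here is a minimal sketch of that kind of check. It is not the code in NodeHealthManagerTest: the class `ConcurrentQuarantineCheck`, its method, and the node ID format are hypothetical, and a concurrent key set stands in for the manager's ConcurrentHashMap-backed state. Several threads start together, each quarantines its own nodes, and the final count should equal the total number of distinct nodes.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of the PR: exercises concurrent writes to a
// ConcurrentHashMap-backed set that stands in for the manager's state.
final class ConcurrentQuarantineCheck {

    static int quarantineConcurrently(int threads, int nodesPerThread) {
        Set<String> quarantined = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch start = new CountDownLatch(1);
        CountDownLatch done = new CountDownLatch(threads);

        for (int t = 0; t < threads; t++) {
            final int threadIndex = t;
            pool.execute(() -> {
                try {
                    // Hold every worker until all are ready, to maximize contention.
                    start.await();
                    for (int n = 0; n < nodesPerThread; n++) {
                        quarantined.add("node-" + threadIndex + "-" + n);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    done.countDown();
                }
            });
        }

        start.countDown();
        try {
            if (!done.await(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } finally {
            pool.shutdownNow();
        }
        // With a thread-safe set, no insert is lost.
        return quarantined.size();
    }
}
```

Calling `ConcurrentQuarantineCheck.quarantineConcurrently(8, 100)` should yield 800, since every node ID is distinct and the set is thread-safe.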

Does this pull request potentially affect one of the following parts

  • Dependencies: No
  • The public API: No (internal API)
  • The runtime per-record code paths: No
  • The state snapshotting: No
  • The state backends: No
  • The message passing between components: No
  • The allocation of resources: No
  • The logging behavior: No
  • The connector ecosystem: No

Documentation

This is an internal refactoring change. No documentation updates are required as the feature is not exposed to users yet. Documentation will be added in subsequent PRs when the blacklist mechanism is fully implemented and exposed to users.

This PR introduces the NodeHealthManager abstraction layer for the
upcoming generic blacklist feature.

Changes:
- Add NodeHealthManager interface with methods for checking node health,
  marking nodes as quarantined, removing quarantine, listing all statuses,
  and cleaning up expired entries
- Add NodeHealthStatus data class to hold node health information
- Add NoOpNodeHealthManager implementation that always considers nodes
  healthy (no-op implementation for backward compatibility)
- Add DefaultNodeHealthManager implementation using ConcurrentHashMap
  to manage node health states
- Integrate NodeHealthManager into ResourceManager with NoOpNodeHealthManager
  as the default implementation (no behavior change in this PR)
- Add comprehensive unit tests for all implementations
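
The shape of the abstraction listed above can be sketched roughly as follows. This is an illustration only and not the code in this PR: every type and method name here carries a `Sketch` suffix and is hypothetical, node IDs are plain strings, and the injected clock is an assumption made so that expiry is easy to demonstrate.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// Hypothetical contract mirroring the five operations named in the change list.
interface NodeHealthManagerSketch {
    boolean isHealthy(String nodeId);
    void markQuarantined(String nodeId, String reason, Duration duration);
    void removeQuarantine(String nodeId);
    Collection<NodeHealthStatusSketch> listAll();
    void cleanupExpired();
}

// Hypothetical immutable data class holding one node's quarantine record.
final class NodeHealthStatusSketch {
    final String nodeId;
    final String reason;
    final long expirationMillis;

    NodeHealthStatusSketch(String nodeId, String reason, long expirationMillis) {
        this.nodeId = nodeId;
        this.reason = reason;
        this.expirationMillis = expirationMillis;
    }
}

// No-op variant: every node is healthy and all mutations are ignored.
final class NoOpNodeHealthManagerSketch implements NodeHealthManagerSketch {
    public boolean isHealthy(String nodeId) { return true; }
    public void markQuarantined(String nodeId, String reason, Duration duration) {}
    public void removeQuarantine(String nodeId) {}
    public Collection<NodeHealthStatusSketch> listAll() { return Collections.emptyList(); }
    public void cleanupExpired() {}
}

// Map-backed variant: quarantine records live in a ConcurrentHashMap.
final class MapBackedNodeHealthManagerSketch implements NodeHealthManagerSketch {
    private final Map<String, NodeHealthStatusSketch> quarantined = new ConcurrentHashMap<>();
    private final LongSupplier clock; // assumption: injected clock, in milliseconds

    MapBackedNodeHealthManagerSketch(LongSupplier clock) {
        this.clock = clock;
    }

    public boolean isHealthy(String nodeId) {
        NodeHealthStatusSketch status = quarantined.get(nodeId);
        // A node is healthy if it was never quarantined or its quarantine has expired.
        return status == null || status.expirationMillis <= clock.getAsLong();
    }

    public void markQuarantined(String nodeId, String reason, Duration duration) {
        long expiresAt = clock.getAsLong() + duration.toMillis();
        quarantined.put(nodeId, new NodeHealthStatusSketch(nodeId, reason, expiresAt));
    }

    public void removeQuarantine(String nodeId) {
        quarantined.remove(nodeId);
    }

    public Collection<NodeHealthStatusSketch> listAll() {
        // Return a snapshot so callers never observe later mutations.
        return new ArrayList<>(quarantined.values());
    }

    public void cleanupExpired() {
        long now = clock.getAsLong();
        quarantined.values().removeIf(status -> status.expirationMillis <= now);
    }
}
```

In this sketch a caller that holds only the `NodeHealthManagerSketch` type cannot tell which variant it has, which is how wiring in the no-op variant by default leaves existing behavior unchanged.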

This is the first phase of the generic blacklist feature and does not
change any existing behavior.
@flinkbot

flinkbot commented Feb 27, 2026

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build
