[FLINK-39176][runtime] Introduce NodeHealthManager abstraction by featzhang · Pull Request #27701 · apache/flink

featzhang · 2026-02-27T11:39:25Z

What is the purpose of the change

This PR introduces the NodeHealthManager abstraction layer as the first phase of implementing a generic blacklist mechanism for compute nodes in Flink. The abstraction allows pluggable node health management strategies while maintaining backward compatibility by defaulting to a No-Op implementation.

Brief change log

Introduced NodeHealthManager interface to define the contract for node health management
Added NodeHealthStatus data class to represent node health information
Implemented NoOpNodeHealthManager that treats all nodes as healthy (default behavior)
Implemented DefaultNodeHealthManager that manages node health states using ConcurrentHashMap
Integrated NodeHealthManager into ResourceManager as a member variable
Added comprehensive unit tests in NodeHealthManagerTest

Verifying this change

Added NodeHealthManagerTest with 9 test cases covering all core functionalities
Tests verify:
- No-Op implementation always returns healthy status
- Default implementation correctly manages node health states
- Concurrent access scenarios
- Health status retrieval and updates
All unit tests pass successfully (13 tests)

Does this pull request potentially affect one of the following parts

Dependencies: No
The public API: No (internal API)
The runtime per-record code paths: No
The state snapshotting: No
The state backends: No
The message passing between components: No
The allocation of resources: No
The logging behavior: No
The connector ecosystem: No

Documentation

This is an internal refactoring change. No documentation updates are required as the feature is not exposed to users yet. Documentation will be added in subsequent PRs when the blacklist mechanism is fully implemented and exposed to users.

This PR introduces the NodeHealthManager abstraction layer for the upcoming generic blacklist feature.

Changes:
- Add NodeHealthManager interface with methods for checking node health, marking nodes as quarantined, removing quarantine, listing all statuses, and cleaning up expired entries
- Add NodeHealthStatus data class to hold node health information
- Add NoOpNodeHealthManager implementation that always considers nodes healthy (no-op implementation for backward compatibility)
- Add DefaultNodeHealthManager implementation using ConcurrentHashMap to manage node health states
- Integrate NodeHealthManager into ResourceManager with NoOpNodeHealthManager as the default implementation (no behavior change in this PR)
- Add comprehensive unit tests for all implementations

This is the first phase of the generic blacklist feature and does not change any existing behavior.
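To make the described contract concrete, here is a minimal sketch of what such an abstraction could look like. It is not the code from this PR: every type name, method name, and signature below (NodeHealthManagerSketch, QuarantineEntry, NoOpSketch, MapBackedSketch, the injected millisecond clock) is a hypothetical stand-in chosen for illustration. Only the five operations and the use of ConcurrentHashMap come from the description above.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical contract; names and signatures are illustrative, not the PR's. */
interface NodeHealthManagerSketch {
    boolean isHealthy(String nodeId);
    void markQuarantined(String nodeId, String reason, Duration duration);
    void removeQuarantine(String nodeId);
    List<QuarantineEntry> listAll();
    void cleanupExpired();
}

/** Immutable stand-in for the NodeHealthStatus data class. */
final class QuarantineEntry {
    final String nodeId;
    final String reason;
    final long expiresAtMillis;

    QuarantineEntry(String nodeId, String reason, long expiresAtMillis) {
        this.nodeId = nodeId;
        this.reason = reason;
        this.expiresAtMillis = expiresAtMillis;
    }
}

/** Treats every node as healthy and records nothing. */
final class NoOpSketch implements NodeHealthManagerSketch {
    public boolean isHealthy(String nodeId) { return true; }
    public void markQuarantined(String nodeId, String reason, Duration duration) {}
    public void removeQuarantine(String nodeId) {}
    public List<QuarantineEntry> listAll() { return new ArrayList<>(); }
    public void cleanupExpired() {}
}

/** Keeps quarantine entries in a ConcurrentHashMap keyed by node id. */
final class MapBackedSketch implements NodeHealthManagerSketch {
    private final Map<String, QuarantineEntry> quarantined = new ConcurrentHashMap<>();
    private final LongSupplier clock; // assumption: injected so expiry is testable

    MapBackedSketch(LongSupplier clock) {
        this.clock = clock;
    }

    public boolean isHealthy(String nodeId) {
        QuarantineEntry entry = quarantined.get(nodeId);
        // A node is healthy if it was never quarantined or its quarantine has expired.
        return entry == null || entry.expiresAtMillis <= clock.getAsLong();
    }

    public void markQuarantined(String nodeId, String reason, Duration duration) {
        long expiresAt = clock.getAsLong() + duration.toMillis();
        quarantined.put(nodeId, new QuarantineEntry(nodeId, reason, expiresAt));
    }

    public void removeQuarantine(String nodeId) {
        quarantined.remove(nodeId);
    }

    public List<QuarantineEntry> listAll() {
        // Snapshot copy; callers cannot mutate the internal map through it.
        return new ArrayList<>(quarantined.values());
    }

    public void cleanupExpired() {
        long now = clock.getAsLong();
        quarantined.values().removeIf(entry -> entry.expiresAtMillis <= now);
    }
}
```

In this sketch a caller would pass `System::currentTimeMillis` as the clock outside of tests. Whether the actual implementation injects a clock, or treats an expired but not yet cleaned-up entry as healthy, is not stated in the description and is an assumption here.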

flinkbot · 2026-02-27T11:44:27Z

CI report:

fab4143 Azure: SUCCESS

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run azure: re-run the last Azure build

featzhang wants to merge 1 commit into apache:master from featzhang:feature/FLINK-39176-node-health-manager-abstraction
