[FLINK-39025] Add InstanceID index to ResourceManager for O(1) lookups #27516

Open
nateab wants to merge 1 commit into apache:master from nateab:add-instanceid-index-to-resourcemanager

nateab commented Feb 4, 2026

What is the purpose of the change

This pull request adds a secondary HashMap index (taskExecutorsByInstanceId) to the ResourceManager for O(1) lookups of WorkerRegistration by InstanceID. Previously, getWorkerByInstanceId() performed an O(n)
linear scan through all registered TaskExecutors, which could become a performance bottleneck in clusters with many TaskExecutors. This addresses the TODO comment in the existing code: "Improve performance by
having an index on the instanceId".
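
To make the difference concrete, here is a minimal sketch of the two lookup strategies. It is not the Flink code: `LookupSketch`, `Worker`, `findByScan`, and `findByIndex` are hypothetical names, and plain strings stand in for ResourceID and InstanceID.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical, simplified stand-ins; these are not Flink's actual classes.
final class LookupSketch {

    static final class Worker {
        final String resourceId;
        final String instanceId;

        Worker(String resourceId, String instanceId) {
            this.resourceId = resourceId;
            this.instanceId = instanceId;
        }
    }

    // Before: walk every registered worker until the instance ID matches, O(n).
    static Optional<Worker> findByScan(Map<String, Worker> byResourceId, String instanceId) {
        for (Worker worker : byResourceId.values()) {
            if (worker.instanceId.equals(instanceId)) {
                return Optional.of(worker);
            }
        }
        return Optional.empty();
    }

    // After: a single hash lookup in a secondary map keyed by instance ID.
    static Optional<Worker> findByIndex(Map<String, Worker> byInstanceId, String instanceId) {
        return Optional.ofNullable(byInstanceId.get(instanceId));
    }

    public static void main(String[] args) {
        Map<String, Worker> byResourceId = new HashMap<>();
        Map<String, Worker> byInstanceId = new HashMap<>();
        Worker worker = new Worker("tm-1", "inst-a");
        byResourceId.put(worker.resourceId, worker);
        byInstanceId.put(worker.instanceId, worker);

        // Both strategies find the same registration; only the cost differs.
        System.out.println(findByScan(byResourceId, "inst-a").isPresent());
        System.out.println(findByIndex(byInstanceId, "inst-a").isPresent());
    }
}
```

Both methods return the same result as long as the secondary map is kept in sync with the primary one, which is what the rest of this change is about.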

Brief change log

  • Added taskExecutorsByInstanceId HashMap field to ResourceManager for fast InstanceID lookups
  • Initialize the index in the ResourceManager constructor
  • Maintain index consistency by updating it alongside the primary taskExecutors map:
    • Add entry when TaskExecutor registers
    • Remove old entry when TaskExecutor re-registers (replacement)
    • Remove entry when TaskExecutor connection is closed
    • Clear index when ResourceManager state is cleared
  • Replaced O(n) loop in getWorkerByInstanceId() with O(1) HashMap lookup
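
The bookkeeping listed above can be sketched as follows. This is a simplified illustration, not the actual patch: only the names `taskExecutors`, `taskExecutorsByInstanceId`, and `getWorkerByInstanceId` come from this PR, while `IndexedRegistrySketch`, `Registration`, `register`, `close`, and `clear` are hypothetical, and strings stand in for ResourceID and InstanceID. The sketch assumes a re-registering TaskExecutor arrives with a fresh instance ID.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical registry that keeps a secondary index in step with the primary map.
final class IndexedRegistrySketch {

    static final class Registration {
        final String resourceId;
        final String instanceId;

        Registration(String resourceId, String instanceId) {
            this.resourceId = resourceId;
            this.instanceId = instanceId;
        }
    }

    // Primary map, keyed by resource ID.
    private final Map<String, Registration> taskExecutors = new HashMap<>();
    // Secondary index, keyed by instance ID.
    private final Map<String, Registration> taskExecutorsByInstanceId = new HashMap<>();

    // Registration and re-registration: drop the stale index entry, then add the new one.
    void register(String resourceId, String instanceId) {
        Registration fresh = new Registration(resourceId, instanceId);
        Registration old = taskExecutors.put(resourceId, fresh);
        if (old != null) {
            taskExecutorsByInstanceId.remove(old.instanceId);
        }
        taskExecutorsByInstanceId.put(instanceId, fresh);
    }

    // Connection closed: remove the worker from both maps.
    void close(String resourceId) {
        Registration removed = taskExecutors.remove(resourceId);
        if (removed != null) {
            taskExecutorsByInstanceId.remove(removed.instanceId);
        }
    }

    // State cleared: both maps are emptied together.
    void clear() {
        taskExecutors.clear();
        taskExecutorsByInstanceId.clear();
    }

    // Hash lookup instead of a scan; returns null when the instance ID is unknown.
    Registration getWorkerByInstanceId(String instanceId) {
        return taskExecutorsByInstanceId.get(instanceId);
    }

    int size() {
        return taskExecutors.size();
    }

    int indexSize() {
        return taskExecutorsByInstanceId.size();
    }
}
```

The invariant to preserve is that every mutation of the primary map has a matching mutation of the index, so the two maps always hold the same set of registrations.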

Verifying this change

This change is already covered by existing tests, such as:

  • ResourceManagerTaskExecutorTest (6 tests) - covers TaskExecutor registration, re-registration, and disconnection scenarios
  • ResourceManagerTest (14 tests) - covers general ResourceManager functionality
  • ActiveResourceManagerTest (18 tests) - covers the releaseResource() path which is the primary caller of getWorkerByInstanceId()

All 75 ResourceManager-related tests pass with the changes.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: yes - ResourceManager is involved in TaskExecutor registration and resource management,
    but this is an internal optimization that does not change behavior
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

flinkbot commented Feb 4, 2026

CI report:

Bot commands

The @flinkbot bot supports the following commands:

  • @flinkbot run azure: re-run the last Azure build
