Add a Redis-based InstanceStepConcurrencyHandler implementation.#125

Merged
jun-he merged 2 commits into main from jun/redis-concurrency
Jul 7, 2025

Conversation

jun-he (Contributor) commented Jul 2, 2025

Pull Request type

  • Bugfix
  • Feature
  • Refactoring (no functional changes, no api changes)
  • Build related changes (Please run ./gradlew build --write-locks to refresh dependencies)
  • Other (please describe):

NOTE: Please remember to run ./gradlew spotlessApply to fix any format violations.

Changes in this PR

Add a Redis-based InstanceStepConcurrencyHandler implementation.

@jun-he jun-he requested a review from Copilot July 2, 2025 22:32
Copilot AI left a comment

Pull Request Overview

This PR introduces a Redis-based implementation for handling instance-step concurrency limits and wires it into the Spring Boot configuration, along with the necessary configuration properties, tests, and infrastructure changes.

  • Added RedisInstanceStepConcurrencyHandler and unit tests covering success, failure, and exception paths
  • Introduced RedisProperties and extended AwsProperties/AwsConfiguration to configure and create a Redisson client bean
  • Updated Docker Compose, Gradle build, and lockfiles to add Redis and related dependencies
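
To illustrate the general idea of an instance-step concurrency limit, here is a minimal in-memory sketch. It is not the PR's RedisInstanceStepConcurrencyHandler: the class and method names (`InMemoryConcurrencyLimiter`, `tryAcquire`, `release`) are hypothetical, and a plain `HashSet` stands in for whatever Redis structure the real handler uses.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory stand-in; the handler in this PR keeps its state in Redis. */
class InMemoryConcurrencyLimiter {
  // concurrency id -> uuids currently holding a slot under that id
  private final Map<String, Set<String>> holders = new HashMap<>();

  /** Returns true if uuid holds a slot afterwards, false if the limit is already reached. */
  synchronized boolean tryAcquire(String concurrencyId, String uuid, int limit) {
    Set<String> current = holders.computeIfAbsent(concurrencyId, k -> new HashSet<>());
    if (current.contains(uuid)) {
      return true; // idempotent: a retry by the same uuid does not take a second slot
    }
    if (current.size() >= limit) {
      return false; // limit reached, caller should try again later
    }
    current.add(uuid);
    return true;
  }

  /** Frees the slot held by uuid, if any. */
  synchronized void release(String concurrencyId, String uuid) {
    Set<String> current = holders.get(concurrencyId);
    if (current != null) {
      current.remove(uuid);
      if (current.isEmpty()) {
        holders.remove(concurrencyId);
      }
    }
  }
}
```

A shared store such as Redis is needed once several engine nodes must agree on the same count; the in-memory map above only works inside a single JVM.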

Reviewed Changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| maestro-aws/src/main/java/com/netflix/maestro/engine/concurrency/RedisInstanceStepConcurrencyHandler.java | New Redis-backed concurrency handler implementation |
| maestro-aws/src/test/java/com/netflix/maestro/engine/concurrency/RedisInstanceStepConcurrencyHandlerTest.java | Tests for the new handler |
| maestro-aws/src/main/java/com/netflix/maestro/engine/properties/RedisProperties.java | New configuration properties for Redis |
| maestro-aws/src/main/java/com/netflix/maestro/engine/properties/AwsProperties.java | Added redis field for binding Redis properties |
| maestro-aws/src/main/java/com/netflix/maestro/engine/autoconfigure/AwsConfiguration.java | Beans to create Redisson and RedisInstanceStepConcurrencyHandler |
| maestro-aws/docker-compose.yml | Added a Redis service for local development |
| maestro-aws/build.gradle | Added Redisson dependency with a dynamic version specifier |

Comments suppressed due to low confidence (2)

maestro-aws/docker-compose.yml:17

  • Pin the Redis image to a specific version (e.g., redis:7.0.11) instead of latest to ensure repeatable CI and local environments.
    image: redis:latest # Uses the latest redis image

maestro-aws/src/test/java/com/netflix/maestro/engine/concurrency/RedisInstanceStepConcurrencyHandlerTest.java:124

  • Consider adding a test case where instanceStepConcurrency is unset (null) to verify that the handler correctly applies the default behavior.
    RunRequest runRequest =

} catch (Exception e) {
LOG.warn(
"Cannot add instance for concurrency id [{}], depth [{}], uuid [{}] due to ",
idWithHash,

Copilot AI Jul 2, 2025

The log in the catch block uses idWithHash (depth=0) instead of the actual depth-specific key or raw concurrencyId. Consider using idWithDepth or concurrencyId for clearer and accurate logging.

Suggested change
- idWithHash,
+ idWithDepth,

@@ -23,4 +23,5 @@
public class AwsProperties {
private String snsTopic;
private SqsProperties sqs;

Copilot AI Jul 2, 2025

You’ve added a redis field but the class has no getters/setters or Lombok annotations. Spring’s @ConfigurationProperties binding may not work without them. Please add accessors or annotate with Lombok.

Suggested change
- private SqsProperties sqs;
+ private SqsProperties sqs;
+ @Getter
+ @Setter
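
For context on the comment above: Spring Boot's JavaBean-style property binding populates fields through setters, so a nested `redis` property needs accessors unless constructor binding is used. The sketch below shows hand-written accessors equivalent to what Lombok's `@Getter`/`@Setter` would generate. The class names (`AwsSettings`, `RedisSettings`) and the `address` field are hypothetical stand-ins, and the excerpt does not show whether the real AwsProperties already carries class-level Lombok annotations.

```java
/** Hypothetical stand-in for RedisProperties; the `address` field is an assumption. */
class RedisSettings {
  private String address;

  public String getAddress() {
    return address;
  }

  public void setAddress(String address) {
    this.address = address;
  }
}

/** Hypothetical stand-in for AwsProperties, reduced to the nested redis property. */
class AwsSettings {
  private RedisSettings redis;

  // A JavaBean binder reads the nested object through this getter...
  public RedisSettings getRedis() {
    return redis;
  }

  // ...and assigns it through this setter; without it the field would stay null.
  public void setRedis(RedisSettings redis) {
    this.redis = redis;
  }
}
```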

break;
}

return (Redisson) Redisson.create(redisConfig);

Copilot AI Jul 2, 2025

Casting to the concrete Redisson type can break if the implementation changes. Consider returning RedissonClient instead of Redisson and avoid the cast.

Suggested change
- return (Redisson) Redisson.create(redisConfig);
+ return Redisson.create(redisConfig);

jun-he (Contributor, Author) replied:

Will update it.
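
The risk raised in this thread can be shown with a small self-contained sketch. The types here (`KeyValueClient`, `DefaultKeyValueClient`, `InstrumentedKeyValueClient`, `ClientFactory`) are hypothetical stand-ins for an interface/implementation pair such as RedissonClient and Redisson; nothing below uses the Redisson API itself.

```java
/** Hypothetical interface, playing the role of RedissonClient. */
interface KeyValueClient {
  String name();
}

/** Hypothetical default implementation, playing the role of Redisson. */
class DefaultKeyValueClient implements KeyValueClient {
  public String name() {
    return "default";
  }
}

/** A second implementation the factory might start returning in a later version. */
class InstrumentedKeyValueClient implements KeyValueClient {
  private final KeyValueClient delegate;

  InstrumentedKeyValueClient(KeyValueClient delegate) {
    this.delegate = delegate;
  }

  public String name() {
    return "instrumented(" + delegate.name() + ")";
  }
}

class ClientFactory {
  /** Declared to return the interface, so the concrete type is free to change. */
  static KeyValueClient create(boolean instrumented) {
    KeyValueClient base = new DefaultKeyValueClient();
    return instrumented ? new InstrumentedKeyValueClient(base) : base;
  }
}
```

A caller that writes `(DefaultKeyValueClient) ClientFactory.create(true)` compiles but fails with a ClassCastException at runtime, whereas a caller that keeps the `KeyValueClient` type works with either implementation.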

implementation('io.awspring.cloud:spring-cloud-aws-starter-sqs:3.+') {
exclude group: 'ch.qos.logback'
}
implementation('org.redisson:redisson:3.+')

Copilot AI Jul 2, 2025

Using a dynamic version (3.+) can lead to unexpected upgrades. Specify an exact version that matches your lockfile for reproducible builds.

@jun-he jun-he merged commit 2c3fcf7 into main Jul 7, 2025
1 of 2 checks passed
@jun-he jun-he deleted the jun/redis-concurrency branch July 7, 2025 18:07
derek-miller pushed a commit to derek-miller/maestro that referenced this pull request Mar 4, 2026
* add a SEL function to check if a param exists and improve the SEL doc (Netflix#118)

* fix some warnings from the build (Netflix#120)

* add get workflow instances batch endpoint (Netflix#121)

* Add a SEL function to get the value of error retries. Also add one example of k8s step to set env. (Netflix#123)

* Add a Redis-based InstanceStepConcurrencyHandler implementation. (Netflix#125)

* Add a Redis-based InstanceStepConcurrencyHandler implementation.

* address the comment

* Add postgres based Maestro persistence layer to support postgressql db. (Netflix#119)

* Add postgres based Maestro persistence layer to support postgressql db.

* Update indexed columns to use the COLLATE "C".

* Make this minor update to use JSON instead of JSONB to address some tricky cases.
Note that Postgres JSONB does not preserve the original order of map fields.
This might cause trouble in certain cases, e.g. param map order.

* update the dependency lock

* address the comments

* Improve Maestro to retry certain db errors, e.g. connection is closed. (Netflix#127)

* Improve Maestro to retry certain db errors, e.g. connection is closed.

* fix null pointer issue in the set contains call

* Update delete queue config to match the deletion processor timeout. (Netflix#130)

* Update delete queue config to match the deletion processor timeout.

* address the comment

* Explicitly clear the pending action after passing it to the step runtime. (Netflix#131)

* Explicitly clear the pending action after passing it to the step runtime.

* Add tests to verify that the pending action is reset.

* Add comments for the config.

* Improve the test a bit.

* Fix edge case where unblocking workflow doesn't enqueue job event (Netflix#132)

* Fix edge case where unblocking workflow doesn't enqueue job event

There is an edge case when unblocking all failed instances in batches (default batch size = 100): if the only failed instance is unblocked in the first iteration of the while loop, no job event is enqueued, because the condition requires the unblocked count to be > 0 and that value has not yet been updated in the first iteration (it is still 0). This PR updates the if condition to account for this case.

* improve comment

* Check if string map param is literal before returning the value. (Netflix#133)

* Check if string map param is literal before returning the value.

* Add additional endpoints to workflow controller. (Netflix#136)

* Check if map param is literal before returning the value. (Netflix#140)

* Check if map param is literal before returning the value.

* address the comment

* Maestro workers should not retry in certain cases (Netflix#141)

* While a worker processes a job, the business logic might throw MaestroNotFoundException, e.g. the workflow is deleted. The worker should not retry in this case.

* address the comment

* fix a corner case during launching a subworkflow if the subworkflow instance already exists (Netflix#142)

* getAction logic should not pick up the upstream actions in async execution mode (Netflix#143)

* Subworkflow should wake up its child workflow rather than itself. (Netflix#144)

* Improve step runtime to support setting the next polling delay. (Netflix#145)

* Invalid job event should be removed from the queue. (Netflix#146)

* Invalid job event should be removed from the queue.

* address the comment

* Improve the queue performance by adding an explicit lock mechanism (Netflix#129)

* Improve the queue performance by adding an explicit lock mechanism because SKIP LOCKED does not perform well if there are a large number of rows.

* Use a dedicated queue_lock table for locking queue id.

* update dependencies field to JSONB so to support first_value() (Netflix#148)

* Add while step runtime support (Netflix#147)

* add while step runtime to support while loop

* address the comments

* fix batch loading rollup

* Support waking up flow engine with an action code. (Netflix#149)

* Support waking up flow engine with an action code.
The meaning of action codes is determined by the Task implementation.

* address the comment

* deduplicate actions to avoid unnecessary executions in the flow engine (Netflix#150)

* Implement tag permit support for Maestro. (Netflix#151)

* Implement tag permit support for Maestro.

* add extra metrics and small improvements related to tag permit feature (Netflix#152)

* Jun/unblock event (Netflix#153)

* fix missing action and message fields in the unblock event

* upgrade dependency locks

* Fix an issue for workflow timeout action: when the workflow is timed out, both the workflow instance and its steps should be marked as timeout.

* Slightly improve the code readability with additional comments. (Netflix#155)

* Fix the http status code in MaestroRuntimeException. (Netflix#156)

* Improve the code and a unit test (Netflix#157)

* calling Array.free() to release resources in loop cases (Netflix#159)

* Improve actors to cancel the scheduled task ping action during activate/wakeup (Netflix#160)

* Actor improvement: during activate call, cancel any existing scheduled task ping as activate will schedule a new ping.

* Improve the code a bit and add extra unit tests.

* fix tag permit release for queued cases (Netflix#161)

* Support extension function in SEL Util class. (Netflix#162)

* Support extension function in SEL Util class.
* Also add a toJson extension function to Maestro SEL expression evaluator.
* Address the comments.

* Rename testcontainerPostgresDep to testcontainerDep in all build.gradle files

To match the renamed dependency in dependencies.gradle after OSS merge

* Regenerate all gradle lockfiles after OSS merge

Generated by running: ./gradlew build --write-locks -x test -x integrationTest

* Add db-type and ownership-timeout properties to production config

- Add db-type: postgres to engine.configs (required by OSS)
- Add ownership-timeout: 125000 to queue 5 for deletion processor
- Tag permit features left disabled (not needed for Airbnb deployment)

* Add OSS sync script for merging upstream changes

- Prompts for branch name to avoid hardcoding usernames
- Fetches latest from origin and oss remotes
- Creates branch from origin/main and merges oss/master
- Lists conflicts if any, with helpful next steps
- Auto-commits if no conflicts detected

---------

Co-authored-by: jun-he <jun-he@users.noreply.github.com>
Co-authored-by: yingyiz-netflix <yingyiz@netflix.com>