Skip to content

Conversation

@kiryazovi-redis
Copy link
Collaborator

Summary:
This PR implements functional test infrastructure for validating relaxed timeout
configuration during Redis Enterprise maintenance operations. The changes ensure
that command timeouts are properly relaxed during maintenance events and return
to normal behavior afterward.

Key Features:

  • Relaxed Timeout Configuration Tests: New comprehensive test suite
    (RelaxedTimeoutConfigurationTest) that validates timeout behavior during
    Redis Enterprise maintenance events
  • Maintenance Push Notification Tests: Enhanced test infrastructure
    (MaintenanceNotificationTest) for validating client reception and processing
    of maintenance notifications during operations like migration, failover, and
    endpoint rebinding
  • Improved Test Reliability: Added proper test cleanup, enhanced endpoint
    tracking, and improved logging for better test stability

Technical Improvements:

  • Code Organization: Refactored cluster state management methods from
    MaintenanceNotificationTest to RedisEnterpriseConfig for better separation
    of concerns
  • Test Infrastructure: Added MaintenancePushNotificationMonitor and
    MaintenanceNotificationCapture utilities for robust maintenance event testing
  • Cleanup: Removed debug code and verbose logging while maintaining all test
    functionality for improved code readability

Files Changed:

  • src/test/java/io/lettuce/scenario/RelaxedTimeoutConfigurationTest.java - New comprehensive timeout configuration tests
  • src/test/java/io/lettuce/scenario/MaintenanceNotificationTest.java - Enhanced maintenance notification testing
  • src/test/java/io/lettuce/scenario/MaintenancePushNotificationMonitor.java - New monitoring utility
  • src/test/java/io/lettuce/scenario/MaintenanceNotificationCapture.java - New notification capture utility
  • src/test/java/io/lettuce/scenario/RedisEnterpriseConfig.java - Refactored cluster state management

Testing:
All tests validate proper timeout relaxation and notification handling during
Redis Enterprise maintenance scenarios including migrations, failovers, and
endpoint rebinding operations.

This PR addresses CAE-1130 requirements for timeout configuration testing while
also incorporating related maintenance notification test improvements from CAE-633.

…icationTest to RedisEnterpriseConfig

- Moved refreshClusterConfig, recordOriginalClusterState, and restoreOriginalClusterState methods
- Updated call sites in MaintenanceNotificationTest to use RedisEnterpriseConfig methods
- Added required imports and static variables to RedisEnterpriseConfig
- Maintained existing functionality while improving code organization

Improvements to RelaxedTimeoutConfigurationTest:
- Simplified traffic generation logic by removing complex multi-phase testing
- Streamlined BLPOP command execution with better timeout detection
- Added relaxed timeout detection and recording during maintenance events
- Improved logging and error handling for timeout analysis
- Enhanced test assertions to focus on relaxed timeout detection rather than success counts
- Added MOVING operation duration tracking for better test analysis
- Remove large debug block with reflection-based debugging in RelaxedTimeoutConfigurationTest
- Simplify excessive debug logging and verbose markers (*** and ===)
- Clean up maintenance notification test logging
- Improve push notification monitor message formatting
- Maintain all test functionality while improving code readability

This comment was marked as outdated.

Copy link
Collaborator

@tishun tishun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good progress, thanks!

- Move all inline comments to be above the code they reference in:
  * RelaxedTimeoutConfigurationTest.java
  * RedisEnterpriseConfig.java
  * MaintenanceNotificationTest.java
- Replace string != comparisons with .equals() for proper string comparison
- Apply code formatting via Maven formatter

This improves code readability and follows Java best practices for string comparison.
@kiryazovi-redis
Copy link
Collaborator Author

I will merge in order to speed up development but any comments will be taken care of afterward :), same as the previous PR.

@kiryazovi-redis kiryazovi-redis requested a review from Copilot August 4, 2025 12:21
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements comprehensive functional test infrastructure for validating relaxed timeout configuration during Redis Enterprise maintenance operations, addressing requirement CAE-1130. It introduces new timeout configuration tests alongside enhanced maintenance notification test infrastructure for testing timeout behavior during events like migrations, failovers, and endpoint rebinding.

Key Changes:

  • Added comprehensive relaxed timeout configuration tests that validate timeout behavior during maintenance events
  • Enhanced maintenance notification testing with improved utilities and cleanup mechanisms
  • Refactored cluster state management for better separation of concerns and test reliability

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
RelaxedTimeoutConfigurationTest.java New comprehensive test suite validating timeout relaxation during Redis Enterprise maintenance events
RedisEnterpriseConfig.java Enhanced with cluster state restoration methods and improved endpoint tracking
MaintenancePushNotificationMonitor.java New utility for monitoring RESP3 push notifications across different test types
MaintenanceNotificationTest.java Refactored to use new monitoring utilities and improved cleanup mechanisms
MaintenanceNotificationCapture.java New interface for standardizing notification capture across test implementations
Comments suppressed due to low confidence (1)

src/test/java/io/lettuce/scenario/RelaxedTimeoutConfigurationTest.java:736

  • The timeoutUnrelaxedOnMovingTest is disabled due to flakiness. This represents a gap in test coverage for an important timeout un-relaxation scenario that should be addressed.
    @Disabled("This test is flaky and needs to be fixed")

import org.junit.jupiter.api.Tag;
import org.junit.jupiter.api.Test;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
Copy link

Copilot AI Aug 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are unused imports that should be removed to improve code maintainability. Several imports like AtomicLong, AtomicReference, CompletableFuture, ExecutionException, and TimeoutException appear to be unused in the actual code.

Copilot uses AI. Check for mistakes.
long startTime = System.currentTimeMillis();
try {
// Use the normal timeout duration for BLPOP to test timeout behavior
RedisFuture<KeyValue<String, String>> future = mainConnection.async().blpop(10, "traffic-key-" + commandCount);
Copy link

Copilot AI Aug 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The BLPOP timeout value (10 seconds) is hardcoded. Consider using a named constant or making it configurable to improve maintainability and consistency with other timeout values.

Copilot uses AI. Check for mistakes.
long startTime = System.currentTimeMillis();
try {
// Use the normal timeout duration for BLPOP to test if timeouts are back to normal
RedisFuture<KeyValue<String, String>> future = context.connection.async().blpop(10, "normal-test-key-" + i);
Copy link

Copilot AI Aug 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another hardcoded BLPOP timeout value (10 seconds). This should be consistent with the other BLPOP timeouts and use a named constant.

Suggested change
RedisFuture<KeyValue<String, String>> future = context.connection.async().blpop(10, "normal-test-key-" + i);
RedisFuture<KeyValue<String, String>> future = context.connection.async().blpop(NORMAL_COMMAND_TIMEOUT.getSeconds(), "normal-test-key-" + i);

Copilot uses AI. Check for mistakes.
try {
// Use the normal timeout duration for BLPOP to test if timeouts are back to normal
// CRITICAL: Use mainConnection like traffic generation does, not context.connection
RedisFuture<KeyValue<String, String>> future = context.capture.getMainConnection().async().blpop(10,
Copy link

Copilot AI Aug 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Third instance of hardcoded BLPOP timeout value (10 seconds). All BLPOP timeout values should use a consistent named constant.

Suggested change
RedisFuture<KeyValue<String, String>> future = context.capture.getMainConnection().async().blpop(10,
RedisFuture<KeyValue<String, String>> future = context.capture.getMainConnection().async().blpop(NORMAL_BLPOP_TIMEOUT_SECONDS,

Copilot uses AI. Check for mistakes.
try {
// Wait for lettuce's built-in auto-reconnect to work
// This is the same pattern used in ClientIntegrationTests
int maxWaitTime = 10000; // 10 seconds max wait
Copy link

Copilot AI Aug 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reconnection timeout (10 seconds) is hardcoded. Consider using a named constant or Duration for consistency with other timeout values in the class.

Copilot uses AI. Check for mistakes.
// Wait for lettuce's built-in auto-reconnect to work
// This is the same pattern used in ClientIntegrationTests
int maxWaitTime = 10000; // 10 seconds max wait
int waitInterval = 100; // Check every 100ms
Copy link

Copilot AI Aug 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The wait interval (100ms) for reconnection polling is hardcoded. Consider using a named constant for better maintainability.

Suggested change
int waitInterval = 100; // Check every 100ms
int waitInterval = RECONNECTION_POLL_INTERVAL_MS; // Check every 100ms

Copilot uses AI. Check for mistakes.

// Wait for all traffic threads to complete
try {
CompletableFuture.allOf(trafficThreads.toArray(new CompletableFuture[0])).get(5, TimeUnit.SECONDS);
Copy link

Copilot AI Aug 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The traffic thread shutdown timeout (5 seconds) is hardcoded. Consider using a named constant for consistency with other timeout values.

Copilot uses AI. Check for mistakes.
@kiryazovi-redis kiryazovi-redis merged commit f35a488 into redis:feature/maintenance-events Aug 4, 2025
2 checks passed
ggivo pushed a commit that referenced this pull request Aug 29, 2025
* initial WIP, with lots of debugging, and some non-working tests

* debug

* more attemtps at debugging

* Refactor: Move cluster state management methods from MaintenanceNotificationTest to RedisEnterpriseConfig

- Moved refreshClusterConfig, recordOriginalClusterState, and restoreOriginalClusterState methods
- Updated call sites in MaintenanceNotificationTest to use RedisEnterpriseConfig methods
- Added required imports and static variables to RedisEnterpriseConfig
- Maintained existing functionality while improving code organization

Improvements to RelaxedTimeoutConfigurationTest:
- Simplified traffic generation logic by removing complex multi-phase testing
- Streamlined BLPOP command execution with better timeout detection
- Added relaxed timeout detection and recording during maintenance events
- Improved logging and error handling for timeout analysis
- Enhanced test assertions to focus on relaxed timeout detection rather than success counts
- Added MOVING operation duration tracking for better test analysis

* Improve test reliability and cleanup: Add @AfterEach cleanup, enhance endpoint tracking, and improve logging

* add un-relaxed tests. will investigate further why they got broken at some point via diff

* CAE-1130: Update timeout configuration test and watchdog implementation

* Reset MaintenanceAwareConnectionWatchdog.java and log4j2-test.xml to upstream versions

* Clean up debug info and outdated comments from timeout tests

- Remove large debug block with reflection-based debugging in RelaxedTimeoutConfigurationTest
- Simplify excessive debug logging and verbose markers (*** and ===)
- Clean up maintenance notification test logging
- Improve push notification monitor message formatting
- Maintain all test functionality while improving code readability

* Refactor: Move inline comments above code and fix string comparisons

- Move all inline comments to be above the code they reference in:
  * RelaxedTimeoutConfigurationTest.java
  * RedisEnterpriseConfig.java
  * MaintenanceNotificationTest.java
- Replace string != comparisons with .equals() for proper string comparison
- Apply code formatting via Maven formatter

This improves code readability and follows Java best practices for string comparison.
tishun added a commit that referenced this pull request Sep 1, 2025
* [Hitless Upgrades] React to maintenance events (#3345)

* v0.1

* Simple reconnect now working

* Bind address from message is now considered

* Self-register the handler

* Format code

* Filter push messages in a more stable way

* (very hacky) Relax comand expire timers globbaly

* Configure if timeout relaxing should be applied

* Proper way to close channel

* Configure the timneout relaxing

* Sequential handover implemented

* Did not address formatting

* Prolong the rebind windwow for relaxed tiemouts

* PubSub no longer required; CommandExpiryWriter is now channel aware; Polishing

* Use the new MOVING push message from the RE server

* Unit test was not chaining delgates in the same way that the RedisClient/RedisClusterClient was

* Fix REBIND message validation

* Fixed the expiry mechanism

* Polishing

* Fix NPE.

Seems like AttributeMap.attr is not accurate and actually return's  null causing some unit test failures.

* Add support for MIGRATING/MIGRATED message handling in command expiry

This commit adds the ability to listen for MIGRATING and MIGRATED messages
and trigger extended command expiry timeouts during Redis shard migration.

Key changes:
- Enhanced RebindAwareConnectionWatchdog to detect MIGRATING/MIGRATED messages
- RebindAwareExpiryWriter to trigger timeout relaxation whenever MIGRATING message is received

This feature allows commands to have relaxed timeouts during shard migration
operations, preventing unnecessary timeouts when Redis is temporarily busy
with migration tasks.

* formating

* Fix Disabling relaxTimeouts after upgrade can interfere with an ongoing one from re-bind

* Additional fix for timeout relaxing disabled

* Fix push message listener registered multiple times after rebind.

* Fix: Report correct command timeout when relaxTimeout is configured

* Disable relaxedTimeout after configured grace period

- Introduce a delay before disabling relaxedTimeout
- Grace period duration is provided via push notification

* Code clean up
  - Remove reading from pub/sub chanel and relay only on push notifications

* Add FAILING_OVER/FAILED_OVER

* Polishing : Rename components to use the word 'maintenace'

---------

Co-authored-by: Igor Malinovskiy <u.glide@gmail.com>
Co-authored-by: ggivo <ivo.gaydazhiev@redis.com>
# Conflicts:
#	src/main/java/io/lettuce/core/ClientOptions.java

* [Hitless upgrades] Add unit tests for the newly introduced classes #3355 (#3356)

* Unit tests for the maintanence aware classes

* Did not format properly

* Proper license

* Cae 633 add functional tests notifications (#3357) - excluding JsonExample.java

* Cae 1130 timeout tests (#3377)

* initial WIP, with lots of debugging, and some non-working tests

* debug

* more attemtps at debugging

* Refactor: Move cluster state management methods from MaintenanceNotificationTest to RedisEnterpriseConfig

- Moved refreshClusterConfig, recordOriginalClusterState, and restoreOriginalClusterState methods
- Updated call sites in MaintenanceNotificationTest to use RedisEnterpriseConfig methods
- Added required imports and static variables to RedisEnterpriseConfig
- Maintained existing functionality while improving code organization

Improvements to RelaxedTimeoutConfigurationTest:
- Simplified traffic generation logic by removing complex multi-phase testing
- Streamlined BLPOP command execution with better timeout detection
- Added relaxed timeout detection and recording during maintenance events
- Improved logging and error handling for timeout analysis
- Enhanced test assertions to focus on relaxed timeout detection rather than success counts
- Added MOVING operation duration tracking for better test analysis

* Improve test reliability and cleanup: Add @AfterEach cleanup, enhance endpoint tracking, and improve logging

* add un-relaxed tests. will investigate further why they got broken at some point via diff

* CAE-1130: Update timeout configuration test and watchdog implementation

* Reset MaintenanceAwareConnectionWatchdog.java and log4j2-test.xml to upstream versions

* Clean up debug info and outdated comments from timeout tests

- Remove large debug block with reflection-based debugging in RelaxedTimeoutConfigurationTest
- Simplify excessive debug logging and verbose markers (*** and ===)
- Clean up maintenance notification test logging
- Improve push notification monitor message formatting
- Maintain all test functionality while improving code readability

* Refactor: Move inline comments above code and fix string comparisons

- Move all inline comments to be above the code they reference in:
  * RelaxedTimeoutConfigurationTest.java
  * RedisEnterpriseConfig.java
  * MaintenanceNotificationTest.java
- Replace string != comparisons with .equals() for proper string comparison
- Apply code formatting via Maven formatter

This improves code readability and follows Java best practices for string comparison.

* [Hitless Upgrades] Zero-server-side configuration with client-side opt-in (#3380)

* Support for Client-side opt-in

 A client can tell the server if it wants to receive maintenance push notifications via the following command:

 CLIENT MAINT_NOTIFICATIONS <ON | OFF> [parameter value parameter value ...]

* update maintenance events to latest format

 - MIGRATING <seq_number> <time> <shard_id-s>: A shard migration is going to start within <time> seconds.
 - MIGRATED <seq_number> <shard_id-s>: A shard migration ended.
 - FAILING_OVER <seq_number> <time> <shard_id-s>: A shard failover of a healthy shard started.
 - FAILED_OVER <seq_number> <shard_id-s>: A shard failover of a healthy shard ended.
 - MOVING <seq_number> <time> <endpoint>: A specific endpoint is going to move to another node within <time> seconds

* clean up

* Update FAILED_OVER & MIGRATED to include additional time field

* update is private reserver check & add unit tests

update is private reserver check

* add unit tests for handshake with enabled maintenance events

* add missing copyrights/docs

* format

* address review comments

* Revert address after rebind operation expires

* Update event's validation spec

  - MIGRATING <seq_number> <time> <shard_id-s>: A shard migration is going to start within <time> seconds.
  - MIGRATED <seq_number> <shard_id-s>: A shard migration ended.
  - FAILING_OVER <seq_number> <time> <shard_id-s>: A shard failover of a healthy shard started.
  - FAILED_OVER <seq_number> <shard_id-s>: A shard failover of a healthy shard ended.
  - MOVING <seq_number> <time> <endpoint>: A specific endpoint is going to move to another node within <time> seconds

* rebase

* format after rebase

* Apply suggestions from code review

Co-authored-by: Tihomir Krasimirov Mateev <tihomir.mateev@redis.com>

* javadoc updated

* Update src/main/java/io/lettuce/core/internal/NetUtils.java

Co-authored-by: Tihomir Krasimirov Mateev <tihomir.mateev@redis.com>

---------

Co-authored-by: Tihomir Krasimirov Mateev <tihomir.mateev@redis.com>

* [hitless upgrade] Support for none moving-endpoint-type in maintenance event notifications (CAE-1285) (#3396)

* support MOVING with none

none indicates that the MOVING message doesn’t need to contain an endpoint. In such a case, the client is expected to schedule a graceful reconnect to its currently configured endpoint after half of the grace period that was communicated by the server is over.

* formatting

* Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Fix NPE

* Add test to cover null rebindAddress

null for rebind adress can be returned as part of MOVING notification if client is connected using 'moving-endpoint-type=none'

* Add java docs to RebindAwareAddressSupplier

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* [Hitless Upgrades]  Enable maintenance events support by default (CAE-1303) (#3415)

* set default relaxed timeout to 10s

* Enable maintenance events support by default

* Enable maintenance events support by default

* fix tests
   - ensure MaintenanceAwareExpiryWriter is registered for events when wathcdog is created
   - command timeout was not applied

* fix tests

   - sporadic failure because of timeout of 50ms
   RedisHandshakeUnitTests.handshakeDelayedCredentialProvider:153 » ConditionTimeout
   - new command introduced during handshake, increase the timeout to 100ms

* resolve errors after rebase on main

  - reset() - removed with
    issue#3328 - remove deprecated code from issue#907 (#3395)

* Remove LettuceMaintenanceEventsDemo.java

    - no longer needed.

---------

Co-authored-by: Igor Malinovskiy <u.glide@gmail.com>
Co-authored-by: ggivo <ivo.gaydazhiev@redis.com>
Co-authored-by: kiryazovi-redis <ivaylo.kiryazov@redis.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants