LettuceConnectionFactory.SharedConnection#resetConnection hangs forever and causes a deadlock #1861

Closed
@coney

Bug Report

LettuceConnectionFactory.SharedConnection#resetConnection hangs forever and causes a deadlock

Current Behavior

I have enabled validateConnection for the Lettuce connection factory, and occasionally my service stops serving incoming requests. The thread dump shows that all the HTTP threads are waiting for the connection:

Stack trace
// http threads, take one for example
"reactor-http-epoll-6" #126 daemon prio=5 os_prio=0 cpu=16164.68ms elapsed=26788.53s allocated=1510M defined_classes=693 tid=0x0000560e1cfc1000 nid=0x168b waiting for monitor entry  [0x00007fdb977c2000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory$SharedConnection.getConnection(LettuceConnectionFactory.java:1295)
	- waiting to lock <0x000000070a63d728> (a java.lang.Object)
	at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory.getSharedReactiveConnection(LettuceConnectionFactory.java:1049)
	at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory.getReactiveClusterConnection(LettuceConnectionFactory.java:481)
	at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory.getReactiveConnection(LettuceConnectionFactory.java:457)
	at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory.getReactiveConnection(LettuceConnectionFactory.java:101)
	at org.springframework.data.redis.core.ReactiveRedisTemplate.lambda$doInConnection$0(ReactiveRedisTemplate.java:198)
	at org.springframework.data.redis.core.ReactiveRedisTemplate$$Lambda$773/0x00000008007edc40.get(Unknown Source)
	at reactor.core.publisher.MonoSupplier.call(MonoSupplier.java:85)
	at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.subscribeNext(MonoIgnoreThen.java:224)
	at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.onComplete(MonoIgnoreThen.java:203)

All the HTTP threads are waiting for a lock that is held by the following thread:

"lettuce-epollEventLoop-5-1" #31 daemon prio=5 os_prio=0 cpu=7049.44ms elapsed=26823.40s allocated=1441M defined_classes=171 tid=0x0000560e1dd67000 nid=0x13de waiting on condition  [0x00007fdbb8753000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@11.0.8/Native Method)
	- parking to wait for  <0x00000007197dec70> (a java.util.concurrent.CompletableFuture$Signaller)
	at java.util.concurrent.locks.LockSupport.park(java.base@11.0.8/Unknown Source)
	at java.util.concurrent.CompletableFuture$Signaller.block(java.base@11.0.8/Unknown Source)
	at java.util.concurrent.ForkJoinPool.managedBlock(java.base@11.0.8/Unknown Source)
	at java.util.concurrent.CompletableFuture.waitingGet(java.base@11.0.8/Unknown Source)
	at java.util.concurrent.CompletableFuture.join(java.base@11.0.8/Unknown Source)
	at org.springframework.data.redis.connection.lettuce.LettuceFutureUtils.join(LettuceFutureUtils.java:68)
	at org.springframework.data.redis.connection.lettuce.LettuceConnectionProvider.release(LettuceConnectionProvider.java:74)
	at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory$ExceptionTranslatingConnectionProvider.release(LettuceConnectionFactory.java:1596)
	at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory$SharedConnection.resetConnection(LettuceConnectionFactory.java:1360)
	- locked <0x000000070a63d728> (a java.lang.Object)
	at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory$SharedConnection.validateConnection(LettuceConnectionFactory.java:1346)
	- locked <0x000000070a63d728> (a java.lang.Object)
	at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory$SharedConnection.getConnection(LettuceConnectionFactory.java:1302)
	- locked <0x000000070a63d728> (a java.lang.Object)
	at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory.getSharedReactiveConnection(LettuceConnectionFactory.java:1049)
	at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory.getReactiveClusterConnection(LettuceConnectionFactory.java:481)
	at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory.getReactiveConnection(LettuceConnectionFactory.java:457)
	at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory.getReactiveConnection(LettuceConnectionFactory.java:101)
	at org.springframework.data.redis.core.ReactiveRedisTemplate.lambda$doInConnection$0(ReactiveRedisTemplate.java:198)
	at org.springframework.data.redis.core.ReactiveRedisTemplate$$Lambda$773/0x00000008007edc40.get(Unknown Source)

Input Code

Our application uses WebFlux to handle API requests, but I found that LettuceConnectionFactory uses `synchronized` to guard getConnection:
// org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory.SharedConnection#getConnection
@Nullable
StatefulConnection<E, E> getConnection() {

	synchronized (this.connectionMonitor) {

		if (this.connection == null) {
			this.connection = getNativeConnection();
		}

		if (getValidateConnection()) {
			validateConnection();
		}

		return this.connection;
	}
}

Inside validateConnection, the call to resetConnection hangs:

// org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory.SharedConnection#validateConnection
		void validateConnection() {

			synchronized (this.connectionMonitor) {

				boolean valid = false;

				if (connection != null && connection.isOpen()) {
					try {

						if (connection instanceof StatefulRedisConnection) {
							((StatefulRedisConnection) connection).sync().ping();
						}

						if (connection instanceof StatefulRedisClusterConnection) {
							((StatefulRedisClusterConnection) connection).sync().ping();
						}
						valid = true;
					} catch (Exception e) {
						log.debug("Validation failed", e);
					}
				}

				if (!valid) {

					log.info("Validation of shared connection failed. Creating a new connection.");
					// the line below hangs
					resetConnection();
					this.connection = getNativeConnection();
				}
			}
		}
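
To illustrate the pattern in isolation, here is a minimal sketch using only the JDK. It does not reproduce the actual Lettuce failure; it only shows why other callers show up as BLOCKED in a thread dump while one thread waits on a future inside the `synchronized` block. The class `MonitorStallDemo` and its methods are hypothetical names and are not part of Spring Data Redis or Lettuce.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;

// Hypothetical demo class, not part of Spring Data Redis or Lettuce.
public final class MonitorStallDemo {

	private final Object monitor = new Object();

	// Mimics the shape of getConnection(): waits on a future while holding the monitor.
	void blockWhileHoldingMonitor(CompletableFuture<Void> release, CountDownLatch entered) {
		synchronized (monitor) {
			entered.countDown();
			release.join(); // parks without giving up the monitor
		}
	}

	// Returns the observed state of a second thread that tries to enter the same monitor.
	public static Thread.State observeSecondCaller() {
		MonitorStallDemo demo = new MonitorStallDemo();
		CompletableFuture<Void> release = new CompletableFuture<>();
		CountDownLatch entered = new CountDownLatch(1);
		try {
			Thread holder = new Thread(() -> demo.blockWhileHoldingMonitor(release, entered), "holder");
			holder.start();
			entered.await(); // the holder now owns the monitor

			Thread second = new Thread(
					() -> demo.blockWhileHoldingMonitor(release, new CountDownLatch(1)), "second");
			second.start();

			// Poll (for at most two seconds) until the second thread is stuck on the monitor.
			Thread.State state = second.getState();
			long deadline = System.nanoTime() + 2_000_000_000L;
			while (state != Thread.State.BLOCKED && System.nanoTime() < deadline) {
				Thread.sleep(10);
				state = second.getState();
			}

			release.complete(null); // unblock the holder so both threads can finish
			holder.join();
			second.join();
			return state;
		} catch (InterruptedException e) {
			Thread.currentThread().interrupt();
			throw new IllegalStateException("Interrupted while running the demo", e);
		}
	}
}
```

If the future in this sketch were never completed, the second thread would stay BLOCKED indefinitely, which matches the state reported for the `reactor-http-epoll` threads in the dump above.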

Expected behavior/code

resetConnection should complete within a bounded time and should not cause a deadlock.

Environment

  • Lettuce version(s): 6.1.2.RELEASE
  • Redis version: 5.0.9
  • SpringDataRedis: 2.5.1

redis relevant configuration:

spring.redis.cluster.nodes={{spring_redis_cluster_nodes}} // we have 6 nodes
spring.redis.password={{spring_redis_password}}
spring.redis.cluster.max-redirects=5
spring.redis.cluster.topology-refresh-interval=10
spring.redis.lettuce.pool.min-idle=500
spring.redis.lettuce.pool.max-active=5000
spring.redis.lettuce.pool.max-wait=-1
spring.redis.lettuce.pool.max-idle=1000
spring.redis.timeout=10000
spring.redis.database=0 

Possible Solution

org.springframework.data.redis.connection.lettuce.LettuceConnectionProvider#release appears to wait for the future indefinitely. A timeout might partially avoid this situation. I still don't know why release hangs.

	default void release(StatefulConnection<?, ?> connection) {
		LettuceFutureUtils.join(releaseAsync(connection));
	}
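
A minimal sketch of the timeout idea, using only the JDK: the unbounded `join()` is replaced by a timed `get`. The names `BoundedJoin` and `joinWithTimeout`, and the choice of `IllegalStateException`, are assumptions for illustration and are not part of Spring Data Redis or Lettuce.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Spring Data Redis or Lettuce.
public final class BoundedJoin {

	private BoundedJoin() {}

	// Waits for the future for at most `timeout`, then gives up instead of parking forever.
	public static <T> T joinWithTimeout(CompletableFuture<T> future, Duration timeout) {
		try {
			return future.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
		} catch (TimeoutException e) {
			throw new IllegalStateException("Operation did not complete within " + timeout, e);
		} catch (ExecutionException e) {
			throw new IllegalStateException("Operation failed", e.getCause());
		} catch (InterruptedException e) {
			Thread.currentThread().interrupt(); // preserve the interrupt flag
			throw new IllegalStateException("Interrupted while waiting", e);
		}
	}
}
```

A timeout like this only bounds how long the caller holds the monitor; it does not explain why the release future never completes, so the underlying cause would still need to be investigated.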

Additional context

stacktrace.zip

Labels: for: external-project (For an external project and not something we can fix)
