Skip to content

NullPointerException in AbstractMetricsCollector basicCancel method due to race condition in ChannelN basicCancel method #1592

Closed
@KirillMakrutin

Description

@KirillMakrutin

Describe the bug

When sending RPC-style messages using 'spring-rabbit' RabbitTemplate.sendAndReceive(...) with useDirectReplyToContainer = true, a NullPointerException sometimes occurs on closing idle 'replay-to' channel in AbstractMetricsCollector.basicCancel(...) due to a race condition between :

  • the thread executing ChannelN.basicCancel(...)
  • the async thread handling the Basic.CancelOk RPC, which removes ChannelState in AbstractMetricsCollector.closeChannel() before AbstractMetricsCollector.basicCancel(...) is executed.

Details
When new 'reply-to' com.rabbitmq.client.Channel is created, that is happening when we use RabbitTemplate.sendAndReceive(...), it is usually paired with MetricsCollector.newChannel() call (which is an instance of MicrometerMetricsCollector) and metrics collector register a shutdown listener that will call MetricsCollector.closeChannel().

When it is time to close idle consumer with underlying channel, in its turn ChannelN.basicCancel() do RPC with Basic.Cancel that is asynchronously handled by listener container and underlaying ChannelN.close() executes shutdown listeners and by that running MetricsCollector.closeChannel() that clears ChannelState.
In parallel with RPC MetricsCollector.basicCancel() can be called that trying to access ChannelState, but at this point channel state can already been removed, that ends with logger message Error while computing metrics in basicCancel: Cannot read field "lock" because "channelState" is null

Reproduction steps

  1. Configure Spring Boot to use RabbitTemplate with useDirectReplyToContainer=true
  2. Send RPC-style messages using rabbitTemplate.sendAndReceive(...)
  3. Let the temp channel remain idle for channelRpcTimeout (e.g., 60s)
  4. Observe DirectReplyToMessageListenerContainer.reduceConsumersIfIdle() triggering a call to:
  5. RabbitUtils.cancel(...) → ChannelN.basicCancel(...)
  6. Occasionally, this will lead to: Error while computing metrics in basicCancel: Cannot read field "lock" because "channelState" is null

Expected behavior

The AbstractMetricsCollector.basicCancel(...) should not throw an NPE — ideally, the metrics collector should be resilient to a missing ChannelState, or the cancel flow should be ordered to avoid state removal before metrics are recorded.

Additional context

Here’s a detailed method chain, with emphasis on the race:

DirectReplyToMessageListenerContainer
 └── reduceConsumersIfIdle()
     └── cancelConsumer()
         └── RabbitUtils.cancel()
             └── ChannelN.basicCancel()
                 ├── rpc(Basic.Cancel) 
                 │
                 │  [ASYNC: RPC is handled here]
                 │   └── DirectReplyToMessageListenerContainer.handleCancelOk()
                 │       └── finalizeConsumer()
                 │           └── closeChannel()
                 │               └── RabbitUtils.closeChannel()
                 │                   └── ChannelN.close()
                 │                       └── notifyListeners()
                 │                           └── AbstractMetricsCollector.closeChannel()
                 │                               └── Clears ChannelState
                 │
                 └── AbstractMetricsCollector.basicCancel() ← NPE here
				     └── Clears ChannelState

Metadata

Metadata

Assignees

Labels

Type

No type

Projects

No projects

Relationships

None yet

Development

No branches or pull requests

Issue actions