Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid double attempt at reconnecting #310

Merged

Conversation

merlimat
Copy link
Contributor

@merlimat merlimat commented Sep 6, 2023

Motivation

There is a sequence of conditions that can trigger a client to schedule multiple reconnections to the broker. This is due to fact that we're not checking whether such an attempt is already in progress.

example:

  1. Receive CloseConsumer command from broker
    1a. Schedule for reconnection in 100ms
  2. Connection is closed (eg: broker shutdown has initiated)
    2a. Schedule for reconnection in 200ms (since the backoff was already incremented)

Result is that we're going to call grabCnx() twice

Modifications

Use atomic flag to ignore the 2nd attempt, just waiting for the 1st attempt to finish.

@merlimat merlimat added the bug Something isn't working label Sep 6, 2023
@merlimat merlimat added this to the 3.4.0 milestone Sep 6, 2023
@merlimat merlimat self-assigned this Sep 6, 2023
@BewareMyPower BewareMyPower merged commit 2e2f90b into apache:main Sep 7, 2023
@merlimat merlimat deleted the fix-double-reconnection-attempt branch September 7, 2023 15:51
BewareMyPower pushed a commit to streamnative/pulsar-client-cpp that referenced this pull request Sep 14, 2023
### Motivation

There is a sequence of conditions that can trigger a client to schedule multiple reconnections to the broker. This is due to fact that we're not checking whether such an attempt is already in progress.

example:
 1. Receive `CloseConsumer` command from broker
    1a.  Schedule for reconnection in 100ms
 2. Connection is closed (eg: broker shutdown has initiated)
   2a. Schedule for reconnection in 200ms (since the backoff was already incremented)

Result is that we're going to call `grabCnx()` twice

### Modifications

Use atomic flag to ignore the 2nd attempt, just waiting for the 1st attempt to finish.

(cherry picked from commit 2e2f90b)
BewareMyPower pushed a commit that referenced this pull request Oct 16, 2023
…ble attempt at reconnecting (#328)

Related Issue: #235

### Motivation
A potential double scheduling of reconnection due to a broker shutdown was observed.

The reconnect can be scheduled with either of the following codes
[ https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L1209](https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L1209)
or
[ https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ClientConnection.cc#L1350](https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ClientConnection.cc#L1350)
-> [ https://github.com/apache/pulsar-client-cpp/blob/af45a54c10ec5b06e80b683010afd3531457ac64/lib/HandlerBase.cc#L121](https://github.com/apache/pulsar-client-cpp/blob/af45a54c10ec5b06e80b683010afd3531457ac64/lib/HandlerBase.cc#L121)

If a second reconnection request is received during the first reconnection attempt, it triggers additional reconnection attempts. If the second reconnection is successful, the consumer is removed from `cnx`:
[ https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L285](https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L285)
-> [ https://github.com/apache/pulsar-client-cpp/blob/af45a54c10ec5b06e80b683010afd3531457ac64/lib/HandlerBase.cc#L63](https://github.com/apache/pulsar-client-cpp/blob/af45a54c10ec5b06e80b683010afd3531457ac64/lib/HandlerBase.cc#L63)
--> [ https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L217](https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L217)

The problem is that the consumer will no longer be able to manage events coming from the broker.
To cope with this issue, a new flag `reconnectionPending_` has been introduced via #310 .

However, while the above change reduces the likelihood of the problem occurring, it doesn't eliminate the problem entirely.
In fact, the double reconnects have been observed even after #310(I tried with b35ae1a):
```
# Consumer is connected to broker1, but broker1 shutdown closes Consumer and reconnection is scheduled.
...
2023-09-26 15:42:05.736 INFO  [140591970158336] ConsumerImpl:1207 | Broker notification of Closed consumer: 5
2023-09-26 15:42:05.736 INFO  [140591970158336] HandlerBase:147 | [0x7fde18046510, dummy_24, 5] Schedule reconnection in 0.1 s
...

# Consumer attempts to connect to broker1, but fails, and a reconnection is scheduled again.
...
2023-09-26 15:42:05.836 INFO  [140591970158336] HandlerBase:80 | [0x7fde18046510, dummy_24, 5] Getting connection from pool
2023-09-26 15:42:05.837 WARN  [140591970158336] ClientConnection:1741 | [<host(client)>:55304 -> <host(broker1)>:<prot>] Received error response from server: Retryable (Namespace is being unloaded, cannot add topic persistent://shustsud-test2/test/partitioned-topic-partition-5) -- req_id: 16
2023-09-26 15:42:05.837 WARN  [140591970158336] ConsumerImpl:317 | [0x7fde18046510, dummy_24, 5] Failed to reconnect consumer: Retryable
2023-09-26 15:42:05.837 INFO  [140591970158336] HandlerBase:147 | [0x7fde18046510, dummy_24, 5] Schedule reconnection in 0.194 s
...

# During the connection attempt, the connection to broker1 is closed and further reconnection is scheduled.
# After that, two subscribe requests are sent to broker2.
2023-09-26 15:42:06.034 INFO  [140591970158336] HandlerBase:80 | [0x7fde18046510, dummy_24, 5] Getting connection from pool
...
2023-09-26 15:42:06.515 ERROR [140591970158336] ClientConnection:1330 | [<host(client)>:55304 -> <host(broker1)>:<prot>] Connection closed with ConnectError
2023-09-26 15:42:06.515 INFO  [140591970158336] ConnectionPool:122 | Remove connection for pulsar+ssl://<host(broker1)>:<prot>
2023-09-26 15:42:06.515 INFO  [140591970158336] HandlerBase:147 | [0x7fde18046510, dummy_24, 5] Schedule reconnection in 0.392 s
...
2023-09-26 15:42:06.907 INFO  [140591970158336] HandlerBase:80 | [0x7fde18046510, dummy_24, 5] Getting connection from pool
...
2023-09-26 15:42:06.912 INFO  [140591970158336] ConsumerImpl:282 | [0x7fde18046510, dummy_24, 5] Created consumer on broker [<host(client)>:54582 -> <host(broker2)>:<prot>] 
...
2023-09-26 15:42:07.103 INFO  [140591970158336] ConsumerImpl:282 | [0x7fde18046510, dummy_24, 5] Created consumer on broker [<host(client)>:54582 -> <host(broker2)>:<prot>] 
...
```

To completely eliminate the possibility of the double reconnects, I suggest adjusting the timing of when reconnectionPending_ is set to false. Ideally, this should be done after the handleCreateConsumer method or the handleCreateProducer method has been completed.

### Modifications
The timing for setting `reconnectionPending_` to false has been changed.
BewareMyPower pushed a commit to streamnative/pulsar-client-cpp that referenced this pull request Oct 20, 2023
…ble attempt at reconnecting (apache#328)

Related Issue: apache#235

### Motivation
A potential double scheduling of reconnection due to a broker shutdown was observed.

The reconnect can be scheduled with either of the following codes
[ https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L1209](https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L1209)
or
[ https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ClientConnection.cc#L1350](https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ClientConnection.cc#L1350)
-> [ https://github.com/apache/pulsar-client-cpp/blob/af45a54c10ec5b06e80b683010afd3531457ac64/lib/HandlerBase.cc#L121](https://github.com/apache/pulsar-client-cpp/blob/af45a54c10ec5b06e80b683010afd3531457ac64/lib/HandlerBase.cc#L121)

If a second reconnection request is received during the first reconnection attempt, it triggers additional reconnection attempts. If the second reconnection is successful, the consumer is removed from `cnx`:
[ https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L285](https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L285)
-> [ https://github.com/apache/pulsar-client-cpp/blob/af45a54c10ec5b06e80b683010afd3531457ac64/lib/HandlerBase.cc#L63](https://github.com/apache/pulsar-client-cpp/blob/af45a54c10ec5b06e80b683010afd3531457ac64/lib/HandlerBase.cc#L63)
--> [ https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L217](https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L217)

The problem is that the consumer will no longer be able to manage events coming from the broker.
To cope with this issue, a new flag `reconnectionPending_` has been introduced via apache#310 .

However, while the above change reduces the likelihood of the problem occurring, it doesn't eliminate the problem entirely.
In fact, the double reconnects have been observed even after apache#310(I tried with apache@b35ae1a):
```
# Consumer is connected to broker1, but broker1 shutdown closes Consumer and reconnection is scheduled.
...
2023-09-26 15:42:05.736 INFO  [140591970158336] ConsumerImpl:1207 | Broker notification of Closed consumer: 5
2023-09-26 15:42:05.736 INFO  [140591970158336] HandlerBase:147 | [0x7fde18046510, dummy_24, 5] Schedule reconnection in 0.1 s
...

# Consumer attempts to connect to broker1, but fails, and a reconnection is scheduled again.
...
2023-09-26 15:42:05.836 INFO  [140591970158336] HandlerBase:80 | [0x7fde18046510, dummy_24, 5] Getting connection from pool
2023-09-26 15:42:05.837 WARN  [140591970158336] ClientConnection:1741 | [<host(client)>:55304 -> <host(broker1)>:<prot>] Received error response from server: Retryable (Namespace is being unloaded, cannot add topic persistent://shustsud-test2/test/partitioned-topic-partition-5) -- req_id: 16
2023-09-26 15:42:05.837 WARN  [140591970158336] ConsumerImpl:317 | [0x7fde18046510, dummy_24, 5] Failed to reconnect consumer: Retryable
2023-09-26 15:42:05.837 INFO  [140591970158336] HandlerBase:147 | [0x7fde18046510, dummy_24, 5] Schedule reconnection in 0.194 s
...

# During the connection attempt, the connection to broker1 is closed and further reconnection is scheduled.
# After that, two subscribe requests are sent to broker2.
2023-09-26 15:42:06.034 INFO  [140591970158336] HandlerBase:80 | [0x7fde18046510, dummy_24, 5] Getting connection from pool
...
2023-09-26 15:42:06.515 ERROR [140591970158336] ClientConnection:1330 | [<host(client)>:55304 -> <host(broker1)>:<prot>] Connection closed with ConnectError
2023-09-26 15:42:06.515 INFO  [140591970158336] ConnectionPool:122 | Remove connection for pulsar+ssl://<host(broker1)>:<prot>
2023-09-26 15:42:06.515 INFO  [140591970158336] HandlerBase:147 | [0x7fde18046510, dummy_24, 5] Schedule reconnection in 0.392 s
...
2023-09-26 15:42:06.907 INFO  [140591970158336] HandlerBase:80 | [0x7fde18046510, dummy_24, 5] Getting connection from pool
...
2023-09-26 15:42:06.912 INFO  [140591970158336] ConsumerImpl:282 | [0x7fde18046510, dummy_24, 5] Created consumer on broker [<host(client)>:54582 -> <host(broker2)>:<prot>]
...
2023-09-26 15:42:07.103 INFO  [140591970158336] ConsumerImpl:282 | [0x7fde18046510, dummy_24, 5] Created consumer on broker [<host(client)>:54582 -> <host(broker2)>:<prot>]
...
```

To completely eliminate the possibility of the double reconnects, I suggest adjusting the timing of when reconnectionPending_ is set to false. Ideally, this should be done after the handleCreateConsumer method or the handleCreateProducer method has been completed.

### Modifications
The timing for setting `reconnectionPending_` to false has been changed.

(cherry picked from commit ebae92e)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants