-
Notifications
You must be signed in to change notification settings - Fork 67
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid double attempt at reconnecting #310
Merged
BewareMyPower
merged 1 commit into
apache:main
from
merlimat:fix-double-reconnection-attempt
Sep 7, 2023
Merged
Avoid double attempt at reconnecting #310
BewareMyPower
merged 1 commit into
apache:main
from
merlimat:fix-double-reconnection-attempt
Sep 7, 2023
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
BewareMyPower
approved these changes
Sep 7, 2023
BewareMyPower
pushed a commit
to streamnative/pulsar-client-cpp
that referenced
this pull request
Sep 14, 2023
### Motivation There is a sequence of conditions that can trigger a client to schedule multiple reconnections to the broker. This is due to fact that we're not checking whether such an attempt is already in progress. example: 1. Receive `CloseConsumer` command from broker 1a. Schedule for reconnection in 100ms 2. Connection is closed (eg: broker shutdown has initiated) 2a. Schedule for reconnection in 200ms (since the backoff was already incremented) Result is that we're going to call `grabCnx()` twice ### Modifications Use atomic flag to ignore the 2nd attempt, just waiting for the 1st attempt to finish. (cherry picked from commit 2e2f90b)
BewareMyPower
pushed a commit
that referenced
this pull request
Oct 16, 2023
…ble attempt at reconnecting (#328) Related Issue: #235 ### Motivation A potential double scheduling of reconnection due to a broker shutdown was observed. The reconnect can be scheduled with either of the following codes [ https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L1209](https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L1209) or [ https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ClientConnection.cc#L1350](https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ClientConnection.cc#L1350) -> [ https://github.com/apache/pulsar-client-cpp/blob/af45a54c10ec5b06e80b683010afd3531457ac64/lib/HandlerBase.cc#L121](https://github.com/apache/pulsar-client-cpp/blob/af45a54c10ec5b06e80b683010afd3531457ac64/lib/HandlerBase.cc#L121) If a second reconnection request is received during the first reconnection attempt, it triggers additional reconnection attempts. If the second reconnection is successful, the consumer is removed from `cnx`: [ https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L285](https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L285) -> [ https://github.com/apache/pulsar-client-cpp/blob/af45a54c10ec5b06e80b683010afd3531457ac64/lib/HandlerBase.cc#L63](https://github.com/apache/pulsar-client-cpp/blob/af45a54c10ec5b06e80b683010afd3531457ac64/lib/HandlerBase.cc#L63) --> [ https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L217](https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L217) The problem is that the consumer will no longer be able to manage events coming from the broker. To cope with this issue, a new flag `reconnectionPending_` has been introduced via #310 . However, while the above change reduces the likelihood of the problem occurring, it doesn't eliminate the problem entirely. In fact, the double reconnects have been observed even after #310(I tried with b35ae1a): ``` # Consumer is connected to broker1, but broker1 shutdown closes Consumer and reconnection is scheduled. ... 2023-09-26 15:42:05.736 INFO [140591970158336] ConsumerImpl:1207 | Broker notification of Closed consumer: 5 2023-09-26 15:42:05.736 INFO [140591970158336] HandlerBase:147 | [0x7fde18046510, dummy_24, 5] Schedule reconnection in 0.1 s ... # Consumer attempts to connect to broker1, but fails, and a reconnection is scheduled again. ... 2023-09-26 15:42:05.836 INFO [140591970158336] HandlerBase:80 | [0x7fde18046510, dummy_24, 5] Getting connection from pool 2023-09-26 15:42:05.837 WARN [140591970158336] ClientConnection:1741 | [<host(client)>:55304 -> <host(broker1)>:<prot>] Received error response from server: Retryable (Namespace is being unloaded, cannot add topic persistent://shustsud-test2/test/partitioned-topic-partition-5) -- req_id: 16 2023-09-26 15:42:05.837 WARN [140591970158336] ConsumerImpl:317 | [0x7fde18046510, dummy_24, 5] Failed to reconnect consumer: Retryable 2023-09-26 15:42:05.837 INFO [140591970158336] HandlerBase:147 | [0x7fde18046510, dummy_24, 5] Schedule reconnection in 0.194 s ... # During the connection attempt, the connection to broker1 is closed and further reconnection is scheduled. # After that, two subscribe requests are sent to broker2. 2023-09-26 15:42:06.034 INFO [140591970158336] HandlerBase:80 | [0x7fde18046510, dummy_24, 5] Getting connection from pool ... 2023-09-26 15:42:06.515 ERROR [140591970158336] ClientConnection:1330 | [<host(client)>:55304 -> <host(broker1)>:<prot>] Connection closed with ConnectError 2023-09-26 15:42:06.515 INFO [140591970158336] ConnectionPool:122 | Remove connection for pulsar+ssl://<host(broker1)>:<prot> 2023-09-26 15:42:06.515 INFO [140591970158336] HandlerBase:147 | [0x7fde18046510, dummy_24, 5] Schedule reconnection in 0.392 s ... 2023-09-26 15:42:06.907 INFO [140591970158336] HandlerBase:80 | [0x7fde18046510, dummy_24, 5] Getting connection from pool ... 2023-09-26 15:42:06.912 INFO [140591970158336] ConsumerImpl:282 | [0x7fde18046510, dummy_24, 5] Created consumer on broker [<host(client)>:54582 -> <host(broker2)>:<prot>] ... 2023-09-26 15:42:07.103 INFO [140591970158336] ConsumerImpl:282 | [0x7fde18046510, dummy_24, 5] Created consumer on broker [<host(client)>:54582 -> <host(broker2)>:<prot>] ... ``` To completely eliminate the possibility of the double reconnects, I suggest adjusting the timing of when reconnectionPending_ is set to false. Ideally, this should be done after the handleCreateConsumer method or the handleCreateProducer method has been completed. ### Modifications The timing for setting `reconnectionPending_` to false has been changed.
BewareMyPower
pushed a commit
to streamnative/pulsar-client-cpp
that referenced
this pull request
Oct 20, 2023
…ble attempt at reconnecting (apache#328) Related Issue: apache#235 ### Motivation A potential double scheduling of reconnection due to a broker shutdown was observed. The reconnect can be scheduled with either of the following codes [ https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L1209](https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L1209) or [ https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ClientConnection.cc#L1350](https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ClientConnection.cc#L1350) -> [ https://github.com/apache/pulsar-client-cpp/blob/af45a54c10ec5b06e80b683010afd3531457ac64/lib/HandlerBase.cc#L121](https://github.com/apache/pulsar-client-cpp/blob/af45a54c10ec5b06e80b683010afd3531457ac64/lib/HandlerBase.cc#L121) If a second reconnection request is received during the first reconnection attempt, it triggers additional reconnection attempts. If the second reconnection is successful, the consumer is removed from `cnx`: [ https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L285](https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L285) -> [ https://github.com/apache/pulsar-client-cpp/blob/af45a54c10ec5b06e80b683010afd3531457ac64/lib/HandlerBase.cc#L63](https://github.com/apache/pulsar-client-cpp/blob/af45a54c10ec5b06e80b683010afd3531457ac64/lib/HandlerBase.cc#L63) --> [ https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L217](https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L217) The problem is that the consumer will no longer be able to manage events coming from the broker. To cope with this issue, a new flag `reconnectionPending_` has been introduced via apache#310 . However, while the above change reduces the likelihood of the problem occurring, it doesn't eliminate the problem entirely. In fact, the double reconnects have been observed even after apache#310(I tried with apache@b35ae1a): ``` # Consumer is connected to broker1, but broker1 shutdown closes Consumer and reconnection is scheduled. ... 2023-09-26 15:42:05.736 INFO [140591970158336] ConsumerImpl:1207 | Broker notification of Closed consumer: 5 2023-09-26 15:42:05.736 INFO [140591970158336] HandlerBase:147 | [0x7fde18046510, dummy_24, 5] Schedule reconnection in 0.1 s ... # Consumer attempts to connect to broker1, but fails, and a reconnection is scheduled again. ... 2023-09-26 15:42:05.836 INFO [140591970158336] HandlerBase:80 | [0x7fde18046510, dummy_24, 5] Getting connection from pool 2023-09-26 15:42:05.837 WARN [140591970158336] ClientConnection:1741 | [<host(client)>:55304 -> <host(broker1)>:<prot>] Received error response from server: Retryable (Namespace is being unloaded, cannot add topic persistent://shustsud-test2/test/partitioned-topic-partition-5) -- req_id: 16 2023-09-26 15:42:05.837 WARN [140591970158336] ConsumerImpl:317 | [0x7fde18046510, dummy_24, 5] Failed to reconnect consumer: Retryable 2023-09-26 15:42:05.837 INFO [140591970158336] HandlerBase:147 | [0x7fde18046510, dummy_24, 5] Schedule reconnection in 0.194 s ... # During the connection attempt, the connection to broker1 is closed and further reconnection is scheduled. # After that, two subscribe requests are sent to broker2. 2023-09-26 15:42:06.034 INFO [140591970158336] HandlerBase:80 | [0x7fde18046510, dummy_24, 5] Getting connection from pool ... 2023-09-26 15:42:06.515 ERROR [140591970158336] ClientConnection:1330 | [<host(client)>:55304 -> <host(broker1)>:<prot>] Connection closed with ConnectError 2023-09-26 15:42:06.515 INFO [140591970158336] ConnectionPool:122 | Remove connection for pulsar+ssl://<host(broker1)>:<prot> 2023-09-26 15:42:06.515 INFO [140591970158336] HandlerBase:147 | [0x7fde18046510, dummy_24, 5] Schedule reconnection in 0.392 s ... 2023-09-26 15:42:06.907 INFO [140591970158336] HandlerBase:80 | [0x7fde18046510, dummy_24, 5] Getting connection from pool ... 2023-09-26 15:42:06.912 INFO [140591970158336] ConsumerImpl:282 | [0x7fde18046510, dummy_24, 5] Created consumer on broker [<host(client)>:54582 -> <host(broker2)>:<prot>] ... 2023-09-26 15:42:07.103 INFO [140591970158336] ConsumerImpl:282 | [0x7fde18046510, dummy_24, 5] Created consumer on broker [<host(client)>:54582 -> <host(broker2)>:<prot>] ... ``` To completely eliminate the possibility of the double reconnects, I suggest adjusting the timing of when reconnectionPending_ is set to false. Ideally, this should be done after the handleCreateConsumer method or the handleCreateProducer method has been completed. ### Modifications The timing for setting `reconnectionPending_` to false has been changed. (cherry picked from commit ebae92e)
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Motivation
There is a sequence of conditions that can trigger a client to schedule multiple reconnections to the broker. This is due to fact that we're not checking whether such an attempt is already in progress.
example:
CloseConsumer
command from broker1a. Schedule for reconnection in 100ms
2a. Schedule for reconnection in 200ms (since the backoff was already incremented)
Result is that we're going to call
grabCnx()
twiceModifications
Use atomic flag to ignore the 2nd attempt, just waiting for the 1st attempt to finish.