-
Notifications
You must be signed in to change notification settings - Fork 834
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[azeventhubs] Fixing issues with connection and link robustness for Event Hubs #19683
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
If I understand correctly, we add a timeout via context to the close operation, and if close times out (or is cancelled), then we consider the |
richardpark-msft
force-pushed
the
eh-robust-close
branch
2 times, most recently
from
December 17, 2022 03:59
fde0a13
to
cb411d9
Compare
richardpark-msft
force-pushed
the
eh-robust-close
branch
from
January 5, 2023 03:08
cb411d9
to
a1851dc
Compare
There were some issues with how we do recovery, which could make things hang or even leave us in an inconsistent state. 1. We didn't have any bound on how long a Close() could take, but since that requires a network hop it's possible for it to hang infinitely if the service doesn't respond. This now has a max bound of 60 seconds. 2. If a Close() of a link is cancelled we signal that the connection needs to be reset, since not doing that will leave us in an inconsistent state where we believe a link is active, but the service does not. 3. We now ignore errors that come back from trying to close a link (except for the previously mentioned cancellation). The error that comes back will be redundant (ie, it'll be the same error that initiated the recovery in the first place) or it'll be something we can't immediately troubleshoot and will be dealt with on the next recovery round. There are also a number of testing quality fixes that I made to validate this change. I've got gomock now for mocks in newer tests (and will start to phase out the previous uses of my custom mocks). I've also added some more tests around recovery and improved logging to log more context per line so it's easier to correlate activity between links.
richardpark-msft
force-pushed
the
eh-robust-close
branch
from
January 5, 2023 19:54
0a981eb
to
7dbd43f
Compare
richardpark-msft
changed the title
Adding in a close timeout in areas where we can get stuck in a not-co…
[azeventhubs] Fixing issues with connection and link robustness for Event Hubs
Jan 5, 2023
richardpark-msft
added
Event Hubs
Client
This issue points to a problem in the data-plane of the library.
labels
Jan 5, 2023
/azp run go - azeventhubs |
Azure Pipelines successfully started running 1 pipeline(s). |
jhendrixMSFT
reviewed
Jan 5, 2023
jhendrixMSFT
reviewed
Jan 5, 2023
jhendrixMSFT
reviewed
Jan 5, 2023
jhendrixMSFT
reviewed
Jan 5, 2023
…'s there to enable unit testing.
jhendrixMSFT
reviewed
Jan 5, 2023
jhendrixMSFT
reviewed
Jan 5, 2023
jhendrixMSFT
reviewed
Jan 5, 2023
jhendrixMSFT
approved these changes
Jan 5, 2023
/azp run go - azeventhubs |
Azure Pipelines successfully started running 1 pipeline(s). |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Fixing issues with connection and link robustness for Event Hubs.
There were some issues with how we do recovery, which could make things
hang or even leave us in an inconsistent state.
We didn't have any bound on how long a Close() could take, but since
that requires a network hop it's possible for it to hang infinitely if
the service doesn't respond. This now has a max bound of 60 seconds.
If a Close() of a link is cancelled we signal that the connection needs
to be reset, since not doing that will leave us in an inconsistent state
where we believe a link is active, but the service does not.
We now ignore errors that come back from trying to close a link (except
for the previously mentioned cancellation). The error that comes back will
be redundant (ie, it'll be the same error that initiated the recovery in
the first place) or it'll be something we can't immediately troubleshoot
and will be dealt with on the next recovery round.
There are also a number of testing quality fixes that I made to validate
this change. I've got gomock now for mocks in newer tests (and will start
to phase out the previous uses of my custom mocks). I've also added some
more tests around recovery and improved logging to log more context per
line so it's easier to correlate activity between links.