Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[azeventhubs] Fixing issues with connection and link robustness for Event Hubs #19683

Merged
merged 8 commits into from
Jan 5, 2023

Conversation

richardpark-msft
Copy link
Member

@richardpark-msft richardpark-msft commented Dec 9, 2022

Fixing issues with connection and link robustness for Event Hubs.

There were some issues with how we do recovery, which could make things
hang or even leave us in an inconsistent state.

  1. We didn't have any bound on how long a Close() could take, but since
    that requires a network hop it's possible for it to hang infinitely if
    the service doesn't respond. This now has a max bound of 60 seconds.

  2. If a Close() of a link is cancelled we signal that the connection needs
    to be reset, since not doing that will leave us in an inconsistent state
    where we believe a link is active, but the service does not.

  3. We now ignore errors that come back from trying to close a link (except
    for the previously mentioned cancellation). The error that comes back will
    be redundant (ie, it'll be the same error that initiated the recovery in
    the first place) or it'll be something we can't immediately troubleshoot
    and will be dealt with on the next recovery round.

There are also a number of testing quality fixes that I made to validate
this change. I've got gomock now for mocks in newer tests (and will start
to phase out the previous uses of my custom mocks). I've also added some
more tests around recovery and improved logging to log more context per
line so it's easier to correlate activity between links.

@jhendrixMSFT
Copy link
Member

If I understand correctly, we add a timeout via context to the close operation, and if close times out (or is cancelled), then we consider the amqp.Conn to be in a bad state and recreate it. Did I get that right?

There were some issues with how we do recovery, which could make things
hang or even leave us in an inconsistent state.

1. We didn't have any bound on how long a Close() could take, but since
that requires a network hop it's possible for it to hang infinitely if
the service doesn't respond. This now has a max bound of 60 seconds.

2. If a Close() of a link is cancelled we signal that the connection needs
to be reset, since not doing that will leave us in an inconsistent state
where we believe a link is active, but the service does not.

3. We now ignore errors that come back from trying to close a link (except
for the previously mentioned cancellation). The error that comes back will
be redundant (ie, it'll be the same error that initiated the recovery in
the first place) or it'll be something we can't immediately troubleshoot
and will be dealt with on the next recovery round.

There are also a number of testing quality fixes that I made to validate
this change. I've got gomock now for mocks in newer tests (and will start
to phase out the previous uses of my custom mocks). I've also added some
more tests around recovery and improved logging to log more context per
line so it's easier to correlate activity between links.
@richardpark-msft richardpark-msft changed the title Adding in a close timeout in areas where we can get stuck in a not-co… [azeventhubs] Fixing issues with connection and link robustness for Event Hubs Jan 5, 2023
@richardpark-msft richardpark-msft added Event Hubs Client This issue points to a problem in the data-plane of the library. labels Jan 5, 2023
@richardpark-msft
Copy link
Member Author

/azp run go - azeventhubs

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@richardpark-msft richardpark-msft marked this pull request as ready for review January 5, 2023 21:55
@richardpark-msft
Copy link
Member Author

/azp run go - azeventhubs

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@richardpark-msft richardpark-msft merged commit 00a8837 into Azure:main Jan 5, 2023
@richardpark-msft richardpark-msft deleted the eh-robust-close branch January 5, 2023 22:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Client This issue points to a problem in the data-plane of the library. Event Hubs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants