-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bgpd: Fix BGP session stuck in OpenConfirm state #6959
Conversation
Outdated results 💚Basic BGPD CI results: SUCCESS, 0 tests failed
For details, please contact louberger |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that sounds valid for me, I'm just curious how did you manage to replicate those cases?
Continuous Integration Result: FAILEDContinuous Integration Result: FAILEDSee below for issues. This is a comment from an automated CI system. Get source / Pull Request: SuccessfulBuilding Stage: SuccessfulBasic Tests: FailedTopo tests part 0 on Ubuntu 16.04 i386: Failed (click for details)Topology Test Results are at https://ci1.netdef.org/browse/FRR-FRRPULLREQ-TOPOI386-13741/test Topology Tests failed for Topo tests part 0 on Ubuntu 16.04 i386 Successful on other platforms/tests
Warnings Generated during build:Debian 10 amd64 build: Successful with additional warningsDebian Package lintian failed for Debian 10 amd64 build:
|
Is it possible that your branch is behind master? I believe this CI failure has been fixed (or at least worked-around) in more recent master code... |
please break this up into 2 commits. There are 2 issues here and each one is really subtle. I would prefer 2 commits. Additionally mark is correct your branch is 300+ commits behind |
48c6d8d
to
dbaa4d4
Compare
Issue: 1. Initially BGP start listening to socket. 2. Start timer expires and BGP tries to connect to peer and moved to Idle->connect (lets say peer datastructre X) 3. Peer datastrcture Y FD-X receives OPEN and moved from Opensent-> Openconfirm state and start the hold timer. 4. In the OpenConfirm state, the hold timer is stopped. So peer X waits for Keepalive message from peer. If the Keepalive message is not received, then it will be in OpenConfirm state for indefinite time. 5. Due to this it neither close the existing connection nor it will accept any connection from peer. Fix: In the OpenConfirm state, don't stop the hold timer. 1. Upon receipt of a neighbor’s Keepalive, the state is moved to Established. 2. But If the hold timer expires, a stop event occurs, the state is moved to Idle. This is as per RFC. Signed-off-by: Sarita Patra <saritap@vmware.com>
Issue: 1. Initially BGP start listening to socket. 2. Start timer expires and BGP tries to connect to peer and moved to Idle->connect (lets say peer datastructre X) 3. Connect for X succeeds and hence moved from idle ->connect with FD-x. 4. A incoming connection is accepted and a new peer datastructure Y is created with FD-y moves from idle->Active state. 5. Peer datastercture Y FD-y sends out OPEN and moves to Active->Opensent state. 6. Peer datastrcture Y FD-y receives OPEN and moved from Opensent-> Openconfirm state. 7. Meanwhile on peer datastrcture X FD-x sends out a OPEN message and moved from connect->Opensent. 8. For peer datastrcture Y FD-y keep alive is received and it is moved from OpenConfirm->Established. 9. In this case peer datastructure Y FD-y is a accepted connection so we try to copy all its parameter to peer datastructure X and delete Y. 10. During this process TCP connection for the accepted connection (FD-y) goes down and hence get remote address and port fails. 11. With this failure bgp_stop function for both peer datastrure X and peer datastructure Y is called. 12. By this time all the parameters include state for datastrcture for X and Y are exchanged. Peer Y FD-y when it entered this function had state OpenConfirm still which has been moved to peer datastrcture X. 13. In bgp_stop it will stop all the timers and take action only if peer is in established state. Now that peer datastrcture X and Y are not in established state (in this function) it will simply close all timers and close the socket and assigns socket for both the peer datastrcture to -1. 14. Peer datastrcture Y will be deleted as it is a datastrcture created due to accept of connection where as peer datastrcture X will be held as it is created with configuration. 15. Now peer datastrcture X now holds a state of OpenConfirm without any timers running. 16. With this any new incoming connection will never be able to establish as there is config connection X which is stuck in OpenConfirm. Fix: While transferring the peer datastructure Y FD-y (accepted connection) to the peer datastructure X, if TCP connection for FD-y goes down, then 1. Call fsm event bgp_stop for X (do cleanup with bgp_stop and move the state to Idle) and 2. Call fsm event bgp_stop for Y (do cleanup with bgp_stop and gets deleted since it is an accept connection). Signed-off-by: Sarita Patra <saritap@vmware.com>
dbaa4d4
to
6c4d873
Compare
The below issues we got due to some misconfiguration of mtu in the setup Manually also we tried commenting out the code and the issue seen.
|
Addressed. Thanks |
Addressed. Thanks |
There is an issue also opened. This fix will address this. |
Outdated results 💚Basic BGPD CI results: SUCCESS, 0 tests failed
For details, please contact louberger |
💚 Basic BGPD CI results: SUCCESS, 0 tests failedResults table
For details, please contact louberger |
Continuous Integration Result: SUCCESSFULContinuous Integration Result: SUCCESSFULCongratulations, this patch passed basic tests Tested-by: NetDEF / OpenSourceRouting.org CI System CI System Testrun URL: https://ci1.netdef.org/browse/FRR-FRRPULLREQ-13756/ This is a comment from an automated CI system. Warnings Generated during build:Debian 10 amd64 build: Successful with additional warningsDebian Package lintian failed for Debian 10 amd64 build:
CLANG Static Analyzer Summary
No Changes in Static Analysis warnings compared to base1 Static Analyzer issues remaining.See details at |
Continuous Integration Result: SUCCESSFULCongratulations, this patch passed basic tests Tested-by: NetDEF / OpenSourceRouting.org CI System CI System Testrun URL: https://ci1.netdef.org/browse/FRR-FRRPULLREQ-13757/ This is a comment from an automated CI system. Warnings Generated during build:Debian 10 amd64 build: Successful with additional warningsDebian Package lintian failed for Debian 10 amd64 build:
CLANG Static Analyzer Summary
No Changes in Static Analysis warnings compared to base1 Static Analyzer issues remaining.See details at |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good
Issue 1:
to Idle->connect (lets say peer datastructre X)
Openconfirm state and start the hold timer.
waits for Keepalive message from peer. If the Keepalive message
is not received, then it will be in OpenConfirm state for
indefinite time.
accept any connection from peer.
Fix:
In the OpenConfirm state, don't stop the hold timer.
Established.
is moved to Idle.
This is as per RFC.
Issue 2:
to Idle->connect (lets say peer datastructre X)
FD-x.
is created with FD-y moves from idle->Active state.
Active->Opensent state.
Openconfirm state.
and moved from connect->Opensent.
moved from OpenConfirm->Established.
so we try to copy all its parameter to peer datastructure X and
delete Y.
(FD-y) goes down and hence get remote address and port fails.
and peer datastructure Y is called.
for X and Y are exchanged. Peer Y FD-y when it entered this
function had state OpenConfirm still which has been moved to peer
datastrcture X.
peer is in established state. Now that peer datastrcture X and Y
are not in established state (in this function) it will simply
close all timers and close the socket and assigns socket for both
the peer datastrcture to -1.
due to accept of connection where as peer datastrcture X will be held
as it is created with configuration.
timers running.
as there is config connection X which is stuck in OpenConfirm.
Fix:
While transferring the peer datastructure Y FD-y (accepted connection)
to the peer datastructure X, if TCP connection for FD-y goes down, then
state to Idle) and
since it is an accept connection).
Signed-off-by: Sarita Patra saritap@vmware.com