Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bgpd: Fix BGP session stuck in OpenConfirm state #6959

Merged
merged 2 commits into from
Aug 25, 2020

Conversation

patrasar
Copy link
Contributor

Issue 1:

  1. Initially BGP start listening to socket.
  2. Start timer expires and BGP tries to connect to peer and moved
    to Idle->connect (lets say peer datastructre X)
  3. Peer datastrcture Y FD-X receives OPEN and moved from Opensent->
    Openconfirm state and start the hold timer.
  4. In the OpenConfirm state, the hold timer is stopped. So peer X
    waits for Keepalive message from peer. If the Keepalive message
    is not received, then it will be in OpenConfirm state for
    indefinite time.
  5. Due to this it neither close the existing connection nor it will
    accept any connection from peer.

Fix:
In the OpenConfirm state, don't stop the hold timer.

  1. Upon receipt of a neighbor’s Keepalive, the state is moved to
    Established.
  2. But If the hold timer expires, a stop event occurs, the state
    is moved to Idle.
    This is as per RFC.

Issue 2:

  1. Initially BGP start listening to socket.
  2. Start timer expires and BGP tries to connect to peer and moved
    to Idle->connect (lets say peer datastructre X)
  3. Connect for X succeeds and hence moved from idle ->connect with
    FD-x.
  4. A incoming connection is accepted and a new peer datastructure Y
    is created with FD-y moves from idle->Active state.
  5. Peer datastercture Y FD-y sends out OPEN and moves to
    Active->Opensent state.
  6. Peer datastrcture Y FD-y receives OPEN and moved from Opensent->
    Openconfirm state.
  7. Meanwhile on peer datastrcture X FD-x sends out a OPEN message
    and moved from connect->Opensent.
  8. For peer datastrcture Y FD-y keep alive is received and it is
    moved from OpenConfirm->Established.
  9. In this case peer datastructure Y FD-y is a accepted connection
    so we try to copy all its parameter to peer datastructure X and
    delete Y.
  10. During this process TCP connection for the accepted connection
    (FD-y) goes down and hence get remote address and port fails.
  11. With this failure bgp_stop function for both peer datastrure X
    and peer datastructure Y is called.
  12. By this time all the parameters include state for datastrcture
    for X and Y are exchanged. Peer Y FD-y when it entered this
    function had state OpenConfirm still which has been moved to peer
    datastrcture X.
  13. In bgp_stop it will stop all the timers and take action only if
    peer is in established state. Now that peer datastrcture X and Y
    are not in established state (in this function) it will simply
    close all timers and close the socket and assigns socket for both
    the peer datastrcture to -1.
  14. Peer datastrcture Y will be deleted as it is a datastrcture created
    due to accept of connection where as peer datastrcture X will be held
    as it is created with configuration.
  15. Now peer datastrcture X now holds a state of OpenConfirm without any
    timers running.
  16. With this any new incoming connection will never be able to establish
    as there is config connection X which is stuck in OpenConfirm.

Fix:
While transferring the peer datastructure Y FD-y (accepted connection)
to the peer datastructure X, if TCP connection for FD-y goes down, then

  1. Call fsm event bgp_stop for X (do cleanup with bgp_stop and move the
    state to Idle) and
  2. Call fsm event bgp_stop for Y (do cleanup with bgp_stop and gets deleted
    since it is an accept connection).

Signed-off-by: Sarita Patra saritap@vmware.com

@LabN-CI
Copy link
Collaborator

LabN-CI commented Aug 20, 2020

Outdated results 💚

Basic BGPD CI results: SUCCESS, 0 tests failed

_ _
Result SUCCESS git merge/6959 48c6d8d
Date 08/20/2020
Start 10:48:25
Finish 11:14:23
Run-Time 25:58
Total 1818
Pass 1818
Fail 0
Valgrind-Errors 0
Valgrind-Loss 0
Details vncregress-2020-08-20-10:48:25.txt
Log autoscript-2020-08-20-10:49:31.log.bz2
Memory 472 498 430

For details, please contact louberger

Copy link
Member

@ton31337 ton31337 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that sounds valid for me, I'm just curious how did you manage to replicate those cases?

@NetDEF-CI
Copy link
Collaborator

NetDEF-CI commented Aug 20, 2020

Continuous Integration Result: FAILED

Continuous Integration Result: FAILED

See below for issues.
CI System Testrun URL: https://ci1.netdef.org/browse/FRR-FRRPULLREQ-13741/

This is a comment from an automated CI system.
For questions and feedback in regards to this CI system, please feel free to email
Martin Winter - mwinter (at) opensourcerouting.org.

Get source / Pull Request: Successful

Building Stage: Successful

Basic Tests: Failed

Topo tests part 0 on Ubuntu 16.04 i386: Failed (click for details)

Topology Test Results are at https://ci1.netdef.org/browse/FRR-FRRPULLREQ-TOPOI386-13741/test

Topology Tests failed for Topo tests part 0 on Ubuntu 16.04 i386
see full log at https://ci1.netdef.org/browse/FRR-FRRPULLREQ-13741/artifact/TOPOI386/ErrorLog/log_topotests.txt

Successful on other platforms/tests
  • Addresssanitizer topotests part 2
  • Topo tests part 1 on Ubuntu 18.04 amd64
  • Debian 8 deb pkg check
  • Topo tests part 0 on Ubuntu 16.04 amd64
  • CentOS 7 rpm pkg check
  • Addresssanitizer topotests part 0
  • IPv4 ldp protocol on Ubuntu 18.04
  • Debian 9 deb pkg check
  • Topo tests part 1 on Ubuntu 18.04 arm8
  • Topo tests part 1 on Ubuntu 16.04 amd64
  • Addresssanitizer topotests part 1
  • IPv6 protocols on Ubuntu 18.04
  • Topo tests part 2 on Ubuntu 18.04 arm8
  • Topo tests part 0 on Ubuntu 18.04 amd64
  • Topo tests part 1 on Ubuntu 16.04 i386
  • Topo tests part 2 on Ubuntu 16.04 amd64
  • Debian 10 deb pkg check
  • Topo tests part 2 on Ubuntu 18.04 amd64
  • IPv4 protocols on Ubuntu 18.04
  • Static analyzer (clang)
  • Fedora 29 rpm pkg check
  • Topo tests part 0 on Ubuntu 18.04 arm8
  • Topo tests part 2 on Ubuntu 16.04 i386
  • Ubuntu 18.04 deb pkg check
  • Ubuntu 20.04 deb pkg check
  • Ubuntu 16.04 deb pkg check

Warnings Generated during build:

Debian 10 amd64 build: Successful with additional warnings

Debian Package lintian failed for Debian 10 amd64 build:
(see full package build log at https://ci1.netdef.org/browse/FRR-FRRPULLREQ-13741/artifact/DEB10BUILD/ErrorLog/log_lintian.txt)

W: frr source: pkg-js-tools-test-is-missing
W: frr source: newer-standards-version 4.4.1 (current is 4.3.0)
W: frr source: pkg-js-tools-test-is-missing
W: frr source: newer-standards-version 4.4.1 (current is 4.3.0)
W: frr-rpki-rtrlib: changelog-file-missing-explicit-entry 6.0-2 -> 7.5-dev-20200820-00-g48c6d8d2a-0 (missing) -> 7.5-dev-20200820-00-g48c6d8d2a-0~deb10u1
W: frr-snmp: changelog-file-missing-explicit-entry 6.0-2 -> 7.5-dev-20200820-00-g48c6d8d2a-0 (missing) -> 7.5-dev-20200820-00-g48c6d8d2a-0~deb10u1
W: frr-pythontools: changelog-file-missing-explicit-entry 6.0-2 -> 7.5-dev-20200820-00-g48c6d8d2a-0 (missing) -> 7.5-dev-20200820-00-g48c6d8d2a-0~deb10u1
W: frr: changelog-file-missing-explicit-entry 6.0-2 -> 7.5-dev-20200820-00-g48c6d8d2a-0 (missing) -> 7.5-dev-20200820-00-g48c6d8d2a-0~deb10u1
W: frr-doc: changelog-file-missing-explicit-entry 6.0-2 -> 7.5-dev-20200820-00-g48c6d8d2a-0 (missing) -> 7.5-dev-20200820-00-g48c6d8d2a-0~deb10u1

@mjstapp
Copy link
Contributor

mjstapp commented Aug 20, 2020

Is it possible that your branch is behind master? I believe this CI failure has been fixed (or at least worked-around) in more recent master code...

@donaldsharp
Copy link
Member

please break this up into 2 commits. There are 2 issues here and each one is really subtle. I would prefer 2 commits. Additionally mark is correct your branch is 300+ commits behind

@patrasar patrasar force-pushed the bgp_collision_issue branch from 48c6d8d to dbaa4d4 Compare August 21, 2020 06:34
Issue:

1. Initially BGP start listening to socket.
2. Start timer expires and BGP tries to connect to peer and moved
   to Idle->connect (lets say peer datastructre X)
3. Peer datastrcture Y FD-X receives OPEN and moved from Opensent->
   Openconfirm state and start the hold timer.
4. In the OpenConfirm state, the hold timer is stopped. So peer X
   waits for Keepalive message from peer. If the Keepalive message
   is not received, then it will be in OpenConfirm state for
   indefinite time.
5. Due to this it neither close the existing connection nor it will
   accept any connection from peer.

Fix:
In the OpenConfirm state, don't stop the hold timer.
 1. Upon receipt of a neighbor’s Keepalive, the state is moved to
    Established.
 2. But If the hold timer expires, a stop event occurs, the state
    is moved to Idle.
This is as per RFC.

Signed-off-by: Sarita Patra <saritap@vmware.com>
Issue:
1. Initially BGP start listening to socket.
2. Start timer expires and BGP tries to connect to peer and moved
   to Idle->connect (lets say peer datastructre X)
3. Connect for X succeeds and hence moved from idle ->connect with
   FD-x.
4. A incoming connection is accepted and a new peer datastructure Y
   is created with FD-y moves from idle->Active state.
5. Peer datastercture Y FD-y sends out OPEN and moves to
   Active->Opensent state.
6. Peer datastrcture Y FD-y receives OPEN and moved from Opensent->
   Openconfirm state.
7. Meanwhile on peer datastrcture X FD-x sends out a OPEN message
   and moved from connect->Opensent.
8. For peer datastrcture Y FD-y keep alive is received and it is
   moved from OpenConfirm->Established.
9. In this case peer datastructure Y FD-y is a accepted connection
   so we try to copy all its parameter to peer datastructure X and
   delete Y.
10. During this process TCP connection for the accepted connection
    (FD-y) goes down and hence get remote address and port fails.
11. With this failure bgp_stop function for both peer datastrure X
    and peer datastructure Y is called.
12. By this time all the parameters include state for datastrcture
    for X and Y are exchanged. Peer Y FD-y when it entered this
    function had state OpenConfirm still which has been moved to peer
    datastrcture X.
13. In bgp_stop it will stop all the timers and take action only if
    peer is in established state. Now that peer datastrcture X and Y
    are not in established state (in this function) it will simply
    close all timers and close the socket and assigns socket for both
    the peer datastrcture to -1.
14. Peer datastrcture Y will be deleted as it is a datastrcture created
    due to accept of connection where as peer datastrcture X will be held
    as it is created with configuration.
15. Now peer datastrcture X now holds a state of OpenConfirm without any
    timers running.
16. With this any new incoming connection will never be able to establish
    as there is config connection X which is stuck in OpenConfirm.

Fix:
 While transferring the peer datastructure Y FD-y (accepted connection)
 to the peer datastructure X, if TCP connection for FD-y goes down, then
 1. Call fsm event bgp_stop for X (do cleanup with bgp_stop and move the
    state to Idle) and
 2. Call fsm event bgp_stop for Y (do cleanup with bgp_stop and gets deleted
    since it is an accept connection).

Signed-off-by: Sarita Patra <saritap@vmware.com>
@patrasar patrasar force-pushed the bgp_collision_issue branch from dbaa4d4 to 6c4d873 Compare August 21, 2020 06:36
@patrasar
Copy link
Contributor Author

that sounds valid for me, I'm just curious how did you manage to replicate those cases?

The below issues we got due to some misconfiguration of mtu in the setup
Issue 1 : TCP goes down for the accept socket.
Issue 2 : Keep alive got missed after the OpenConfirm state.

Manually also we tried commenting out the code and the issue seen.

  1. static int bgp_fsm_open(struct peer peer)
    {
    /
    Send keepalive and make keepalive timer */
    // bgp_keepalive_send(peer); >>>>>>>>>>>>> Comment this code in one peer.
    }

  2. Fail bgp_getsockname in the function peer_xfer_stats() which does the peer transfer.
    // if (bgp_getsockname(peer) < 0) { >>>>>>>>>>>>> Comment this code in one peer.
    if (-1 < 0) { >>>>>>>>>>>>> fail the bgp_getsockname()
    flog_err(
    EC_LIB_SOCKET,
    "%%bgp_getsockname() failed for %s peer %s fd %d (from_peer fd %d)",
    (CHECK_FLAG(peer->sflags, PEER_STATUS_ACCEPT_PEER)
    ? "accept"
    : ""),
    peer->host, peer->fd, from_peer->fd);

@patrasar
Copy link
Contributor Author

patrasar commented Aug 21, 2020

Is it possible that your branch is behind master? I believe this CI failure has been fixed (or at least worked-around) in more recent master code...

Addressed. Thanks

@patrasar
Copy link
Contributor Author

please break this up into 2 commits. There are 2 issues here and each one is really subtle. I would prefer 2 commits. Additionally mark is correct your branch is 300+ commits behind

Addressed. Thanks

@patrasar
Copy link
Contributor Author

There is an issue also opened. This fix will address this.
#6090

@LabN-CI
Copy link
Collaborator

LabN-CI commented Aug 21, 2020

Outdated results 💚

Basic BGPD CI results: SUCCESS, 0 tests failed

_ _
Result SUCCESS git merge/6959 dbaa4d4
Date 08/21/2020
Start 02:35:34
Finish 03:01:29
Run-Time 25:55
Total 1815
Pass 1815
Fail 0
Valgrind-Errors 0
Valgrind-Loss 0
Details vncregress-2020-08-21-02:35:34.txt
Log autoscript-2020-08-21-02:36:32.log.bz2
Memory 499 499 428

For details, please contact louberger

@LabN-CI
Copy link
Collaborator

LabN-CI commented Aug 21, 2020

💚 Basic BGPD CI results: SUCCESS, 0 tests failed

Results table
_ _
Result SUCCESS git merge/6959 6c4d873
Date 08/21/2020
Start 03:59:47
Finish 04:25:50
Run-Time 26:03
Total 1815
Pass 1815
Fail 0
Valgrind-Errors 0
Valgrind-Loss 0
Details vncregress-2020-08-21-03:59:47.txt
Log autoscript-2020-08-21-04:00:49.log.bz2
Memory 493 500 426

For details, please contact louberger

@NetDEF-CI
Copy link
Collaborator

NetDEF-CI commented Aug 21, 2020

Continuous Integration Result: SUCCESSFUL

Continuous Integration Result: SUCCESSFUL

Congratulations, this patch passed basic tests

Tested-by: NetDEF / OpenSourceRouting.org CI System

CI System Testrun URL: https://ci1.netdef.org/browse/FRR-FRRPULLREQ-13756/

This is a comment from an automated CI system.
For questions and feedback in regards to this CI system, please feel free to email
Martin Winter - mwinter (at) opensourcerouting.org.

Warnings Generated during build:

Debian 10 amd64 build: Successful with additional warnings

Debian Package lintian failed for Debian 10 amd64 build:
(see full package build log at https://ci1.netdef.org/browse/FRR-FRRPULLREQ-13756/artifact/DEB10BUILD/ErrorLog/log_lintian.txt)

W: frr source: pkg-js-tools-test-is-missing
W: frr source: newer-standards-version 4.4.1 (current is 4.3.0)
W: frr source: pkg-js-tools-test-is-missing
W: frr source: newer-standards-version 4.4.1 (current is 4.3.0)
W: frr-doc: changelog-file-missing-explicit-entry 6.0-2 -> 7.5-dev-20200821-03-gdbaa4d475-0 (missing) -> 7.5-dev-20200821-03-gdbaa4d475-0~deb10u1
W: frr-pythontools: changelog-file-missing-explicit-entry 6.0-2 -> 7.5-dev-20200821-03-gdbaa4d475-0 (missing) -> 7.5-dev-20200821-03-gdbaa4d475-0~deb10u1
W: frr-rpki-rtrlib: changelog-file-missing-explicit-entry 6.0-2 -> 7.5-dev-20200821-03-gdbaa4d475-0 (missing) -> 7.5-dev-20200821-03-gdbaa4d475-0~deb10u1
W: frr-snmp: changelog-file-missing-explicit-entry 6.0-2 -> 7.5-dev-20200821-03-gdbaa4d475-0 (missing) -> 7.5-dev-20200821-03-gdbaa4d475-0~deb10u1
W: frr: changelog-file-missing-explicit-entry 6.0-2 -> 7.5-dev-20200821-03-gdbaa4d475-0 (missing) -> 7.5-dev-20200821-03-gdbaa4d475-0~deb10u1

CLANG Static Analyzer Summary

  • Github Pull Request 6959, comparing to Git base SHA 92b4f62

No Changes in Static Analysis warnings compared to base

1 Static Analyzer issues remaining.

See details at
https://ci1.netdef.org/browse/FRR-FRRPULLREQ-13756/artifact/shared/static_analysis/index.html

@NetDEF-CI
Copy link
Collaborator

Continuous Integration Result: SUCCESSFUL

Congratulations, this patch passed basic tests

Tested-by: NetDEF / OpenSourceRouting.org CI System

CI System Testrun URL: https://ci1.netdef.org/browse/FRR-FRRPULLREQ-13757/

This is a comment from an automated CI system.
For questions and feedback in regards to this CI system, please feel free to email
Martin Winter - mwinter (at) opensourcerouting.org.

Warnings Generated during build:

Debian 10 amd64 build: Successful with additional warnings

Debian Package lintian failed for Debian 10 amd64 build:
(see full package build log at https://ci1.netdef.org/browse/FRR-FRRPULLREQ-13757/artifact/DEB10BUILD/ErrorLog/log_lintian.txt)

W: frr source: pkg-js-tools-test-is-missing
W: frr source: newer-standards-version 4.4.1 (current is 4.3.0)
W: frr source: pkg-js-tools-test-is-missing
W: frr source: newer-standards-version 4.4.1 (current is 4.3.0)
W: frr-pythontools: changelog-file-missing-explicit-entry 6.0-2 -> 7.5-dev-20200821-03-g6c4d8732e-0 (missing) -> 7.5-dev-20200821-03-g6c4d8732e-0~deb10u1
W: frr: changelog-file-missing-explicit-entry 6.0-2 -> 7.5-dev-20200821-03-g6c4d8732e-0 (missing) -> 7.5-dev-20200821-03-g6c4d8732e-0~deb10u1
W: frr-doc: changelog-file-missing-explicit-entry 6.0-2 -> 7.5-dev-20200821-03-g6c4d8732e-0 (missing) -> 7.5-dev-20200821-03-g6c4d8732e-0~deb10u1
W: frr-snmp: changelog-file-missing-explicit-entry 6.0-2 -> 7.5-dev-20200821-03-g6c4d8732e-0 (missing) -> 7.5-dev-20200821-03-g6c4d8732e-0~deb10u1
W: frr-rpki-rtrlib: changelog-file-missing-explicit-entry 6.0-2 -> 7.5-dev-20200821-03-g6c4d8732e-0 (missing) -> 7.5-dev-20200821-03-g6c4d8732e-0~deb10u1

CLANG Static Analyzer Summary

  • Github Pull Request 6959, comparing to Git base SHA 92b4f62

No Changes in Static Analysis warnings compared to base

1 Static Analyzer issues remaining.

See details at
https://ci1.netdef.org/browse/FRR-FRRPULLREQ-13757/artifact/shared/static_analysis/index.html

Copy link
Member

@riw777 riw777 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good

@riw777 riw777 merged commit de4fa7e into FRRouting:master Aug 25, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants