Description
Problem: Sessions stuck in OpenConfirm for extended periods, never proceeding to Established or going back to Idle.
Neighbor        V    AS  MsgRcvd  MsgSent  TblVer  InQ  OutQ  Up/Down   State/PfxRcd
10.1.1.3        4 65290     1931     1143       0    0     0  13:42:29  OpenConfirm
Our customers have seen this problem only rarely (perhaps 2-3 times in the wild over the last couple of years). One of our testers can now reproduce it within a couple of hours by doing very unnatural things to the router (flooding it with TCP opens, clearing neighbors, etc.). The code path that creates the issue is now fairly well understood, but we still need to find the right solution.
While moving from OpenConfirm to Established, bgp_establish calls peer_xfer_conn to handle the doppelganger. If the transfer fails, bgp_establish returns -1 to its caller.
	other = peer->doppelganger;
	peer = peer_xfer_conn(peer);
	if (!peer) {
		flog_err(EC_BGP_CONNECT, "%%Neighbor failed in xfer_conn");
		return -1;
	}
Now that we can reproduce the problem, the error message below in the log tells us that whenever a session has gotten stuck in OpenConfirm, this error path in peer_xfer_conn was hit:
	if (bgp_getsockname(peer) < 0) {
		flog_err(
			EC_LIB_SOCKET,
			"%%bgp_getsockname() failed for %s peer %s fd %d (from_peer fd %d)",
			(CHECK_FLAG(peer->sflags, PEER_STATUS_ACCEPT_PEER)
				 ? "accept"
				 : ""),
			peer->host, peer->fd, from_peer->fd);
		bgp_stop(peer);
		bgp_stop(from_peer);
		return NULL;
	}
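The failure propagation described so far can be sketched in isolation. This is a toy model, not bgpd code: `struct sketch_conn`, `sketch_xfer_conn` and `sketch_establish` are hypothetical stand-ins for the peer, `peer_xfer_conn` and `bgp_establish`, and the `sockname_ok` flag is an assumed substitute for the result of `bgp_getsockname()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, stripped-down connection; not the real struct peer. */
struct sketch_conn {
	int fd;
	bool sockname_ok; /* assumed stand-in for bgp_getsockname() >= 0 */
};

/* Stand-in for peer_xfer_conn(): NULL when the socket name lookup fails. */
static struct sketch_conn *sketch_xfer_conn(struct sketch_conn *conn)
{
	assert(conn != NULL);
	if (!conn->sockname_ok)
		return NULL;
	return conn;
}

/* Stand-in for bgp_establish(): 0 on success, -1 when the transfer fails. */
static int sketch_establish(struct sketch_conn *conn)
{
	conn = sketch_xfer_conn(conn);
	if (!conn)
		return -1;
	return 0;
}
```

In this model, a connection whose `sockname_ok` is false makes `sketch_establish` return -1, which corresponds to the return value the caller of bgp_establish has to handle.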
When the NULL return from peer_xfer_conn leads to the -1 return from bgp_establish, bgp_event_update takes this action:
	if (ret >= 0) {
		....snip...
	} else {
		/*
		 * If we got a return value of -1, that means there was an
		 * error, restart the FSM. Since bgp_stop() was called on the
		 * peer. only a few fields are safe to access here. In any case
		 * we need to indicate that the peer was stopped in the return
		 * code.
		 */
		if (!dyn_nbr && !passive_conn && peer->bgp) {
			flog_err(
				EC_BGP_FSM,
				"%s [FSM] Failure handling event %s in state %s, "
				"prior events %s, %s, fd %d",
				peer->host, bgp_event_str[peer->cur_event],
				lookup_msg(bgp_status_msg, peer->status, NULL),
				bgp_event_str[peer->last_event],
				bgp_event_str[peer->last_major_event],
				peer->fd);
			bgp_stop(peer);
			bgp_fsm_change_status(peer, Idle);
			bgp_timer_set(peer);
		}
		ret = FSM_PEER_STOPPED;
	}
In every connection, one side is the passive side until the state changes to Established. If the error above hits the passive peer, the !passive_conn check fails, so the call to bgp_fsm_change_status(peer, Idle) never happens and the session stays in OpenConfirm indefinitely. If anyone has ideas for dealing with this, please share them (or diffs!), or take the issue and solve it. We would also be glad to test any potential fix in the testbed that can recreate it.
This is very difficult to reproduce. If details from our tester on how he was able to recreate it would be useful, I'll be glad to provide them.
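To make the suspected gap concrete, here is a minimal model of the guard in the error branch. It is only a sketch of the analysis above, not bgpd code: `struct sketch_peer`, the `SK_*` states and both functions are hypothetical, and it assumes that nothing else moves the peer out of OpenConfirm once the branch is skipped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the FSM states; not the real bgpd enum. */
enum sketch_status { SK_IDLE, SK_OPENCONFIRM, SK_ESTABLISHED };

/* Hypothetical peer holding only the fields the guard inspects. */
struct sketch_peer {
	enum sketch_status status;
	bool dyn_nbr;
	bool passive_conn;
	bool has_bgp; /* stand-in for peer->bgp != NULL */
};

/* Mirrors the quoted else-branch: reset to Idle only if the guard passes. */
static void sketch_handle_failure(struct sketch_peer *peer)
{
	assert(peer != NULL);
	if (!peer->dyn_nbr && !peer->passive_conn && peer->has_bgp)
		peer->status = SK_IDLE;
	/* Otherwise the status is left exactly as it was. */
}

/* State of a non-dynamic peer after a failed OpenConfirm transition. */
static enum sketch_status sketch_after_failure(bool passive_conn)
{
	struct sketch_peer peer = { SK_OPENCONFIRM, false, passive_conn, true };

	sketch_handle_failure(&peer);
	return peer.status;
}
```

Under these assumptions, `sketch_after_failure(false)` yields `SK_IDLE`, while `sketch_after_failure(true)` leaves the peer in `SK_OPENCONFIRM`, which matches the stuck sessions described in this report. Whether the right fix is to relax the guard or to reset the state elsewhere is left open.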