Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve and fix zebra GR #13145

Merged
merged 10 commits into from
Apr 5, 2023
Merged

Improve and fix zebra GR #13145

merged 10 commits into from
Apr 5, 2023

Conversation

donaldsharp
Copy link
Member

See individual commits but the high points:

a) Zebra was storing afi/safi data for GR but only acting on AFI/SAFI_UNICAST. Remove the redundancy
b) Fix GR code to not accidently stop iterating when a node goes away
c) Fix GR code to process the AFI when it receives the notification that the afi is done
d) Fix GR code to run after anything in Meta-Q since we have a race condition

The safi has no 0 value which it is set to as part of the
initialization.  Let's just set it to a sensible value.

Signed-off-by: Donald Sharp <sharpd@nvidia.com>
The info->do_delete variable was being set to true only when
u.val was 1.  The problem with this is that u.val is a union
and the various ways that we can call this event causes
different values to be written to the union value on the thread.

This makes no sense.  Just set the variable to what we want it to
be when we need it to be true.  Since it was only ever set during
a thread_execute section.

Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Indentation was deep and hard to understand in
zebra_gr_delete_stale_route

Signed-off-by: Donald Sharp <sharpd@nvidia.com>
When GR is running and attempting to clear up a node
if the node that is currently saved and we are coming
back to happens to be deleted during the time zebra
suspends the GR code due to hitting the node limit
then zebra GR code will just completely stop processing
and potentially leave stale nodes around forever.

Let's just remove this hole and process what we can.
Can you imagine trying to debug this after the fact?

If we remove a node then that counts toward the maximum
to process of ZEBRA_MAX_STALE_ROUTE_COUNT.  This should
prevent any non-processing with a slightly larger cost
of having to look at a few nodes repeatedly

Signed-off-by: Donald Sharp <sharpd@nvidia.com>
By the time this function is called we have already
ensured that the pointers are good several times.
I like consistency but this is a bit much

Signed-off-by: Donald Sharp <sharpd@nvidia.com>
We have code that tracks both afi and safi's,
but we only ever operate on the afi's.  So lets
limit our work being done to something more sensible.

I'm leaving the safi being broadcast through the zapi
message, as that I am not sure what else should be ripped
out at this point in time.

Finally re-arrange the zread_client_capabilites function
to stop the multiple levels of function calling that really
serve no purpose.

Signed-off-by: Donald Sharp <sharpd@nvidia.com>
The zebra_gr code had 3 functions when effectively only
1 was needed.  Cleans up some code weirdness around
multiple switch statements for the same api->cap
as well as consolidating down to only caring about
SAFI_UNICAST, since that is all we care about at the
moment.

Signed-off-by: Donald Sharp <sharpd@nvidia.com>
The GR code in FRR used to wait till all AFI's were complete
before cleaning up the routes from the upper level protocol.
This of course can lead to some weird situations where say
ipv4 finishes and then v6 is stuck waiting for a peer to come
up and never finishes.  v4 when it finishes signals zebra that
it is done but no action is taken at that moment.

Modify the code to allow the zebra_gr.c code to handle a per
afi removal, instead of doing it all at the end.

Signed-off-by: Donald Sharp <sharpd@nvidia.com>
After the restructure of the gr code to allow zebra_gr
to have individual cleanups of afi, this is no longer necessary.

Signed-off-by: Donald Sharp <sharpd@nvidia.com>
@@ -3727,6 +3755,12 @@ static void early_route_meta_queue_free(struct meta_queue *mq, struct list *l,
}
}

static void rib_meta_queue_gr_run_free(struct meta_queue *mq, struct list *l,
struct zebra_vrf *zvrf)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we intend to free anything here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's overrated if you ask me. I've fixed it though

@NetDEF-CI
Copy link
Collaborator

NetDEF-CI commented Mar 29, 2023

Continuous Integration Result: SUCCESSFUL

Continuous Integration Result: SUCCESSFUL

Congratulations, this patch passed basic tests

Tested-by: NetDEF / OpenSourceRouting.org CI System

CI System Testrun URL: https://ci1.netdef.org/browse/FRR-PULLREQ2-10450/

This is a comment from an automated CI system.
For questions and feedback in regards to this CI system, please feel free to email
Martin Winter - mwinter (at) opensourcerouting.org.

Warnings Generated during build:

Checkout code: Successful with additional warnings
Report for zebra_gr.c | 4 issues
===============================================
< WARNING: else is not generally useful after a break or return
< #403: FILE: /tmp/f1-996779/zebra_gr.c:403:
< WARNING: void function return statements are not generally useful
< #549: FILE: /tmp/f1-996779/zebra_gr.c:549:
Report for zebra_rib.c | 4 issues
===============================================
< WARNING: void function return statements are not generally useful
< #3116: FILE: /tmp/f1-996779/zebra_rib.c:3116:
< WARNING: void function return statements are not generally useful
< #3762: FILE: /tmp/f1-996779/zebra_rib.c:3762:

BGP signals to zebra that a afi has converged immediately
after it has finished processing all routes for a given
afi/safi.  This generates events in zebra in this order

a) Routes received from BGP, placed on early-rib Meta-Q
b) Signal GR for the afi.

Now imagine that zebra reads GR code and immediately
processes routes that are in the actual rib and
removes some routes.  This generates a

c) route deletion to the kernel for some number of
routes that may be in the the early-rib Meta-Q
d) Process the Meta-Q, and re-install the routes

This is undesirable behavior in zebra.  In that
while we may end up in a correct state, there
will be a blip for some number of routes that
happen to be in the early rib Meta-Q.

Modify the GR code to have it's own processing
entry at the end of the Meta-Q.  This will
allow all routes to be processed and ready
for handling by the Graceful Restart code.

Signed-off-by: Donald Sharp <sharpd@nvidia.com>
@NetDEF-CI
Copy link
Collaborator

Continuous Integration Result: SUCCESSFUL

Congratulations, this patch passed basic tests

Tested-by: NetDEF / OpenSourceRouting.org CI System

CI System Testrun URL: https://ci1.netdef.org/browse/FRR-PULLREQ2-10454/

This is a comment from an automated CI system.
For questions and feedback in regards to this CI system, please feel free to email
Martin Winter - mwinter (at) opensourcerouting.org.

@Jafaral Jafaral self-requested a review April 4, 2023 15:52
@Jafaral Jafaral merged commit 92c4494 into FRRouting:master Apr 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants