Automated cherry pick of #4469: Bugfix: Resolve a deadlock in cluster memberlist maintenance #4472

Commits on Dec 13, 2022

  1. Bugfix: Resolve a deadlock in cluster memberlist maintenance

    The issue: several Antrea Agents run out of memory in a large-scale cluster, and
    we observe that the memory of a failed Antrea Agent grows continuously,
    from 400MB to 1.8GB in less than 24 hours.
    
    After profiling the Agent's memory and call stacks, we find that most of the
    memory is taken by Node resources received by the Node informer's watch function.
    From the goroutines, we identify a deadlock:
    1. Function "Cluster.Run()" is stuck calling "Memberlist.Join()", which is blocked
       waiting to acquire "Memberlist.nodeLock".
    2. Memberlist has received a Node Join/Leave message sent by another Agent and
       holds "Memberlist.nodeLock". It is blocked sending a message to
       "Cluster.nodeEventsCh", whose consumer is also blocked.
    
    The issue is likely to occur in a large-scale setup. Although Antrea buffers 1024
    messages in "Cluster.nodeEventsCh", a cluster with many Nodes can fill the channel
    before the Agent finishes sending out the Member join messages for the existing
    Nodes.
    
    To resolve the issue, this patch removes the unnecessary call to Memberlist.Join()
    in Cluster.Run(), since it is also invoked by the "NodeAdd" event triggered by the
    NodeInformer.
    
    Signed-off-by: wenyingd <wenyingd@vmware.com>
    wenyingd committed Dec 13, 2022
    Commit: 93645a1