Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce "Failed" phase in ANP status #4608

Merged
merged 1 commit into from
May 5, 2023

Conversation

wenyingd
Copy link
Contributor

@wenyingd wenyingd commented Feb 7, 2023

Modify the ANP status phase as "Failed" when all Agents have reported the status and at least one failure is received.

Signed-off-by: wenyingd wenyingd@vmware.com

Modify the ANP status phase as "Failed" when all Agents have reported
the status and at least one Agent reports failure.

Signed-off-by: wenyingd <wenyingd@vmware.com>
@codecov
Copy link

codecov bot commented Feb 7, 2023

Codecov Report

Merging #4608 (faf986d) into main (bc74667) will increase coverage by 4.09%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #4608      +/-   ##
==========================================
+ Coverage   65.52%   69.61%   +4.09%     
==========================================
  Files         390      400      +10     
  Lines       58147    58236      +89     
==========================================
+ Hits        38098    40539    +2441     
+ Misses      17361    14913    -2448     
- Partials     2688     2784      +96     
Flag Coverage Δ
e2e-tests 38.29% <0.00%> (-0.05%) ⬇️
integration-tests 34.63% <ø> (+0.03%) ⬆️
kind-e2e-tests 46.66% <0.00%> (-1.35%) ⬇️
unit-tests 59.62% <100.00%> (+1.95%) ⬆️
Impacted Files Coverage Δ
pkg/controller/networkpolicy/status_controller.go 72.02% <100.00%> (+1.25%) ⬆️

... and 85 files with indirect coverage changes

@wenyingd
Copy link
Contributor Author

wenyingd commented Feb 7, 2023

/test-all

Comment on lines +339 to 341
} else if currentNodes+len(failedNodes) == desiredNodes {
phase = crdv1alpha1.NetworkPolicyFailed
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this calculation is needed? we may just check failedNodes only like below:

if len(failedNodes) > 0 {
  phase = crdv1alpha1.NetworkPolicyFailed
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Guess this is one of the reasons why K8s API doc recommends to use conditions over phase.

Rather than encouraging clients to infer implicit properties from phases, we prefer to explicitly expose the individual conditions that clients need to monitor.

When there are both failed Nodes and realizing Nodes, it's hard to say which phase name is more proper, "Failed" or "Realizing", user may infer different properties regardless of which one is used. But since there is already an individual condition "RealizationFailure" to indicate failed Node exist, I think the phase should describe the overall progress of realization instead of whether there is any failure, so if currentNodes+len(failedNodes) == desiredNodes makes more sense to me.

cc @jianjuns for opinions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question - in K8s Node case, do we have permanent realization failure - that is agent already gave up retrying realizing?

I agree "failed" and "realized" seems not match, but on the other hand people also wonder why we just have "realized" which seems a condition too (rather a phase), but no phase to express failure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For Pod case:

  • If the Pod's restart policy is never
    • If any of containers is still running, the phase is Running
    • If none of containers is running, and there is any failure, the phase is Failed
    • If none of contaienrs is running, and there is no failure, the phase is Succeeded
  • If the Pod's restart policy is always
    • The phase is running regardless whethere there is running Pod or failed Pod

Its principle seems: if the system keeps trying to reach desired state, the phase will always be "XXXing"; if the system is asked to not retry, the phase will be "XXXed".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In our case, does agent always keep realizing?

Copy link
Contributor Author

@wenyingd wenyingd Feb 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Antrea Agent always try to realize the ANP rules. Nephe may set "RealizationFailure: true" when the ANP rules are failed to realize on cloud.

@tnqn tnqn added api-review Categorizes an issue or PR as actively needing an API review. area/network-policy/api Issues or PRs related to the network policy API. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API. labels Mar 2, 2023
@wenyingd
Copy link
Contributor Author

Any other comments on this patch @tnqn @jianjuns @luolanzone ?

@wenyingd wenyingd added this to the Antrea v1.12 release milestone Mar 27, 2023
@wenyingd
Copy link
Contributor Author

/test-all
/test-windows-all

@wenyingd
Copy link
Contributor Author

/test-e2e

@wenyingd
Copy link
Contributor Author

/test-e2e
/test-windows-all

@jianjuns jianjuns merged commit c9c9559 into antrea-io:main May 5, 2023
@wenyingd wenyingd deleted the anp_failed_phase branch May 30, 2023 06:49
ceclinux pushed a commit to ceclinux/antrea that referenced this pull request Jun 5, 2023
Set the ANP status phase as "Failed" when all Agents have reported
the status and at least one Agent reports failure.

Signed-off-by: wenyingd <wenyingd@vmware.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-review Categorizes an issue or PR as actively needing an API review. area/network-policy/api Issues or PRs related to the network policy API. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants