Add namespace name to antrea network policy logs #3794

Closed
jsalatiel opened this issue May 16, 2022 · 16 comments · Fixed by #5101
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor.

Comments

@jsalatiel

It would be really nice if we could get the namespace name in the Antrea network policy logs. I don't even know if this is possible, but if so, it would make debugging much friendlier: when you see drops, you can clearly tell whether those namespaces should be communicating or not.
Of course, if the traffic comes from or goes to some external entity, the log should show a "null" namespace as the source or destination.

@jsalatiel jsalatiel added the kind/feature Categorizes issue or PR as related to a new feature. label May 16, 2022
@antoninbas
Contributor

It's easy enough to add the Namespace / Name of the local Pod (the Pod to which the network policy is applied) to the logs; the Antrea Agent already has this information. However, getting that information for the other endpoint - if it is a Pod - is not really straightforward. The Pod will typically be on a different Node, and the Agent generating the log entry has no knowledge of that Pod: it only has the IP address and doesn't know how to map it to a Pod Namespace & Name. And we want to keep it this way: an Antrea Agent doesn't need to know about Pods running on other Nodes. If all Agents needed to know about all Pods, that would put unnecessary load on both the K8s API and the Agents, which would have to keep a lot of state up-to-date.
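
To make the trade-off concrete, here is a minimal Go sketch (illustrative only, not Antrea's actual code) of the kind of lookup an Agent can do cheaply: IPs of local Pods resolve to a Namespace/Name from a per-Node cache, while anything else can only be logged as a bare IP.

```go
// Conceptual sketch only (not Antrea's actual code): a per-Node cache that can
// resolve local Pod IPs to a Namespace/Name, and falls back to the raw IP for
// anything the Agent does not know about (e.g. Pods on other Nodes).
package main

import "fmt"

type podRef struct {
	Namespace, Name string
}

type localPodCache struct {
	// Populated from the local Pod watch; only contains Pods on this Node.
	byIP map[string]podRef
}

// resolve returns "<namespace>/<name>" for a local Pod, or the bare IP otherwise.
func (c *localPodCache) resolve(ip string) string {
	if ref, ok := c.byIP[ip]; ok {
		return fmt.Sprintf("%s/%s", ref.Namespace, ref.Name)
	}
	return ip // remote Pod or external endpoint: the Agent only knows the IP
}

func main() {
	cache := &localPodCache{byIP: map[string]podRef{
		"10.10.1.5": {Namespace: "default", Name: "web-0"},
	}}
	fmt.Println(cache.resolve("10.10.1.5")) // default/web-0 (local Pod)
	fmt.Println(cache.resolve("10.10.2.7")) // 10.10.2.7 (unknown to this Agent)
}
```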

I think we are trying to keep audit logs lightweight: self-contained and easily generated locally by just the Antrea Agent. For more advanced use cases, we are investing in the Flow Aggregator (and the surrounding Flow Visibility solution). The Aggregator aggregates flow information from all the different Agents, and can report all blocked connections, with full information for each connection (K8s Namespaces, Names, ...), even for Service traffic. See https://github.com/antrea-io/antrea/blob/main/docs/network-flow-visibility.md for an overview. Think of the Flow Aggregator as the Antrea Controller equivalent for network flows. We are very receptive to feedback when it comes to the Flow Aggregator and what helpful features we can build with it.

Coming back to your original question: if you are happy with the Namespace & Name of the local endpoint (the Pod to which the policy is applied), we can add that to the policy audit logs. However, it will only give you half the information you need to answer your original question:

you can clearly see if those namespaces should be communicating or not.

@antoninbas antoninbas added the triage/needs-information Indicates an issue needs more information in order to work on it. label May 16, 2022
@jsalatiel
Author

jsalatiel commented May 17, 2022

It would be great to have the namespace/pod name for the local endpoint in the logs. Even if it is only half of the information, it would be really helpful if it could be implemented. Also, I suppose that if both Pods are on the same Node, I will get the full information.

@projx

projx commented May 18, 2022

I've just started to evaluate the usage of ACNPs and ACPs, and this was the first major issue that stood out to me. I understand you're looking to focus more on the flow data, but I think prioritising that is a mistake, because:

  1. A lot of teams are not equipped with the infrastructure to consume and analyse flow data, and it's a huge ask to get up and running, especially for small projects or teams who are just evaluating Antrea - they are more likely to just uninstall it and move on to the next CNI.

  2. NetFlow and related technologies (sFlow, etc.) are intended to be based on samples, to provide a rough picture, and the associated storage/analytics infrastructure will be sized around that. A 30-60 second sample rate would not be adequate for low-level troubleshooting; that would require reducing the sampling interval to near real-time, which would significantly increase the infrastructure requirements.

  3. Audit logs are meant to be comprehensive, especially when they are kept for forensics or compliance requirements, so sampled flows are not suitable. And given the transient nature of containers, the policy logs quickly become obsolete, as there is no easy way to know which Pod an IP was assigned to a few hours earlier. Hence, adding the Namespace/Pod name will make them much more usable in the long term.

@antoninbas
Contributor

Thanks for the feedback. I'd like to make the following points though:

  • we don't intend for the flow visibility tools to miss any connection; we want to have at least one flow record for every connection. We listen to conntrack for events (at the moment we poll, but I'd like us to change that; even when polling, connections stay in conntrack long enough for us to catch everything).
  • we are trying to simplify the infrastructure required to run these tools, which is why we moved away from the ELK stack.

That being said, I can understand this request. I am just trying to avoid too much redundancy, and also too much resource consumption by Antrea components.

Ideally, what would be the best consumption model for audit logs for you? If we need to add the Pod namespace/name, we are likely to do it in a central location (it could be the Flow Aggregator, or something else). Should we then expose an API, write the logs to a file, or send them to a configurable webhook (like K8s audit events)?
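
For illustration, a minimal Go sketch of what the webhook consumption model could look like on the receiving end; the JSON payload and field names below are assumptions for the sake of the example, not an existing Antrea API.

```go
// Minimal sketch of a webhook receiver for audit events, assuming a
// hypothetical JSON payload; the field names are illustrative only.
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

type auditEvent struct {
	Policy    string `json:"policy"`
	Action    string `json:"action"` // e.g. "Drop" or "Allow"
	SrcIP     string `json:"srcIP"`
	DstIP     string `json:"dstIP"`
	Namespace string `json:"namespace"` // Namespace of the local (appliedTo) Pod
	PodName   string `json:"podName"`
}

func main() {
	http.HandleFunc("/audit", func(w http.ResponseWriter, r *http.Request) {
		var ev auditEvent
		if err := json.NewDecoder(r.Body).Decode(&ev); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		log.Printf("policy=%s action=%s pod=%s/%s src=%s dst=%s",
			ev.Policy, ev.Action, ev.Namespace, ev.PodName, ev.SrcIP, ev.DstIP)
		w.WriteHeader(http.StatusNoContent)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```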

@jianjuns
Contributor

Yes, it seems hard to add remote Pod information for every logged packet, and probably also hard to stream all logs to a central service to add Pod information there.

Maybe we can have an async service (in antrea-agent or a separate one) to add information for dropped traffic (assuming dropped traffic is less common), and try our best to add information for other traffic, with some rate control?
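
A minimal Go sketch of that rate-control idea, assuming a hypothetical lookupPodRef enrichment function: dropped traffic is always enriched, while other traffic is enriched only while a token bucket allows it.

```go
// Sketch of the rate-control idea: always enrich dropped connections,
// enrich other traffic only while a token-bucket limiter allows it.
package main

import (
	"fmt"

	"golang.org/x/time/rate"
)

// lookupPodRef is a placeholder (hypothetical) for whatever async service or
// central cache would map an IP to a Pod reference.
func lookupPodRef(ip string) string {
	return "some-namespace/some-pod"
}

func enrich(ip, action string, limiter *rate.Limiter) string {
	if action == "Drop" || limiter.Allow() {
		return lookupPodRef(ip)
	}
	return ip // over the rate limit for non-drop traffic: keep the bare IP
}

func main() {
	// At most 10 best-effort lookups per second, with a burst of 20.
	limiter := rate.NewLimiter(rate.Limit(10), 20)
	fmt.Println(enrich("10.10.2.7", "Drop", limiter))
	fmt.Println(enrich("10.10.2.8", "Allow", limiter))
}
```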

@jsalatiel
Author

jsalatiel commented May 21, 2022

> Ideally, what would be the best consumption model for audit logs for you? If we need to add the Pod namespace/name, we are likely to do it in a central location (it could be the Flow Aggregator, or something else). Should we then expose an API, write the logs to a file, or send them to a configurable webhook (like K8s audit events)?

I like the approach of just writing to a file. If it's going to be in a central location, add it to a log file on the node running the controller.

@github-actions
Contributor

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 20, 2022
@jsalatiel
Author

/remove-stale

@github-actions github-actions bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 23, 2022
@github-actions
Contributor

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 22, 2022
@jsalatiel
Author

/remove-stale

@tnqn tnqn removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 22, 2022
@github-actions
Contributor

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 21, 2023
@jsalatiel
Author

/remove-stale

@antoninbas antoninbas removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 21, 2023
antoninbas added a commit to antoninbas/antrea that referenced this issue Apr 13, 2023
Flows are written to a local "log" file, in CSV format, with support for
log rotation. Not all the fields from the flow records are
included. Configuration options include the ability to filter flows
based on ingress / egress network policy rule actions (in the future,
additional filtering capabilities could be introduced).

For antrea-io#3794

Signed-off-by: Antonin Bas <abas@vmware.com>
@antoninbas antoninbas self-assigned this Apr 13, 2023
@antoninbas antoninbas added lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. and removed triage/needs-information Indicates an issue needs more information in order to work on it. labels Apr 13, 2023
@antoninbas
Contributor

Now that we have added support for file logging to the FlowAggregator, I think the only remaining ask is to include the appliedTo (local) Pod namespace/name in the audit logs. As a reminder, we do not believe that including the peer Pod namespace/name is a good idea, for scalability reasons.
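
For example, a small Go sketch of consuming the CSV flow log described in the commit above; the column layout and the "Drop" value below are assumptions for illustration, not the documented format.

```go
// Sketch of reading the FlowAggregator's CSV flow log and printing dropped
// connections. The file path, column indices, and "Drop" value are assumed
// for illustration; consult the Antrea docs for the actual layout.
package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"os"
)

func main() {
	f, err := os.Open("flows.csv") // illustrative path
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	r := csv.NewReader(f)
	r.FieldsPerRecord = -1 // tolerate rows with varying numbers of fields
	records, err := r.ReadAll()
	if err != nil {
		log.Fatal(err)
	}
	for _, rec := range records {
		// Assumed columns: 0=timestamp, 1=srcPod, 2=dstPod, 3=ruleAction
		if len(rec) > 3 && rec[3] == "Drop" {
			fmt.Printf("%s: %s -> %s dropped\n", rec[0], rec[1], rec[2])
		}
	}
}
```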

@jsalatiel
Author

Great. In which version will this be available?

@antoninbas
Contributor

> Great. In which version will this be available?

File logging is available in the FlowAggregator in Antrea v1.12.
I expect that the local Pod names will be added to the audit logs starting with Antrea v1.13.

antoninbas added a commit to antoninbas/antrea that referenced this issue Jun 8, 2023
We include 2 new fields in the audit logs:

* the "direction" of the NP rule (`Ingress` or `Egress`)
* the reference of the Pod to which the NP rule is applied (as
  `<namespace>/<name>`).

These new fields are *NOT* added at the end of the log entries, which means this change could break existing consumers of the logs.

We also refactor the e2e tests for AuditLogging to improve correctness
and readability. Some logs were not validated properly because of an
early "break" statement, and some log fields (e.g., logLabel) were not
validated.

Fixes antrea-io#3794

Signed-off-by: Antonin Bas <abas@vmware.com>
@antoninbas
Contributor

@jsalatiel we have closed this issue, as what we discussed above has been implemented

as a reminder, we have the following capabilities:

  • the FlowAggregator can now export flow records as a log file: this is similar to Audit Logging in the Agent, but 1) it happens in a centralized location, and 2) each log entry includes more information, including namespace / name for both sides of the connection. Of course, this requires deploying the Flow Aggregator. This is available in Antrea v1.12.
  • the namespace / name of the local Pod has been added to the Audit Logs generated by the Agent. This will be available in Antrea v1.13. A small parsing sketch follows below.
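
Illustration only: a small Go sketch that pulls a `<namespace>/<name>` token out of an audit log line without assuming exact field positions; the sample line is hypothetical, not the actual Antrea log format.

```go
// Heuristic illustration: scan whitespace-separated fields of an audit log
// line for a token that looks like "<namespace>/<name>". The sample line is
// hypothetical; the real Antrea audit log layout may differ.
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var podRefRe = regexp.MustCompile(`^[a-z0-9-]+/[a-z0-9.-]+$`)

func localPodRef(logLine string) (string, bool) {
	for _, field := range strings.Fields(logLine) {
		if podRefRe.MatchString(field) {
			return field, true
		}
	}
	return "", false
}

func main() {
	// Hypothetical log line for illustration only.
	line := "2023-06-13T10:00:00Z AntreaPolicyIngressRule Drop Ingress default/web-0 10.10.2.7 34512 10.10.1.5 80 TCP"
	if ref, ok := localPodRef(line); ok {
		fmt.Println("appliedTo Pod:", ref) // default/web-0
	}
}
```

In practice, consumers should rely on the documented field order for their Antrea version rather than on pattern matching like this.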
