Add namespace name to antrea network policy logs #3794

Closed
jsalatiel opened this issue May 16, 2022 · 16 comments · Fixed by #5101
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor.

Comments

@jsalatiel

It would be really nice if we could get the namespace name in the Antrea network policy logs. I don't even know if this is possible, but if so, it would make debugging much friendlier: when you see drops, you can clearly tell whether those namespaces should be communicating or not.
Of course, if the traffic comes from or goes to some external entity, the log should show a "null" namespace as the source or destination.

@jsalatiel jsalatiel added the kind/feature Categorizes issue or PR as related to a new feature. label May 16, 2022
@antoninbas
Contributor

It's easy enough to add the Namespace / Name of the local Pod (the Pod to which the network policy is applied) to the logs; the Antrea Agent already has this information. However, getting that information for the other endpoint - if it is a Pod - is not really straightforward. The Pod will typically be on a different Node, and the Agent generating the log entry has no knowledge of that Pod: it only has the IP address and doesn't know how to map it to a Pod Namespace & Name. And we want to keep it this way: an Antrea Agent doesn't need to know about Pods running on other Nodes. If all Agents needed to know about all Pods, that would put unnecessary load on both the K8s API and the Agents, which would have to keep a lot of state up-to-date.
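
To make the trade-off concrete, here is a minimal Go sketch (illustrative only, not Antrea's actual code) of the kind of lookup an Agent can do cheaply: IPs of local Pods resolve to a Namespace/Name from a per-Node cache, while anything else can only be logged as a bare IP.

```go
// Conceptual sketch only (not Antrea's actual code): a per-Node cache that can
// resolve local Pod IPs to a Namespace/Name, and falls back to the raw IP for
// anything the Agent does not know about (e.g. Pods on other Nodes).
package main

import "fmt"

type podRef struct {
	Namespace, Name string
}

type localPodCache struct {
	// Populated from the local Pod watch; only contains Pods on this Node.
	byIP map[string]podRef
}

// resolve returns "<namespace>/<name>" for a local Pod, or the bare IP otherwise.
func (c *localPodCache) resolve(ip string) string {
	if ref, ok := c.byIP[ip]; ok {
		return fmt.Sprintf("%s/%s", ref.Namespace, ref.Name)
	}
	return ip // remote Pod or external endpoint: the Agent only knows the IP
}

func main() {
	cache := &localPodCache{byIP: map[string]podRef{
		"10.10.1.5": {Namespace: "default", Name: "web-0"},
	}}
	fmt.Println(cache.resolve("10.10.1.5")) // default/web-0 (local Pod)
	fmt.Println(cache.resolve("10.10.2.7")) // 10.10.2.7 (unknown to this Agent)
}
```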

I think we are trying to keep audit logs lightweight: self-contained and easily generated locally by just the Antrea Agent. For more advanced use cases, we are investing in the Flow Aggregator (and the surrounding Flow Visibility solution). The Aggregator aggregates flow information from all the different Agents, and can report all blocked connections, with full information for each connection (K8s Namespaces, Names, ...), even for Service traffic. See https://github.com/antrea-io/antrea/blob/main/docs/network-flow-visibility.md for an overview. Think of the Flow Aggregator as the Antrea Controller equivalent for network flows. We are very receptive to feedback when it comes to the Flow Aggregator and what helpful features we can build with it.

Coming back to your original question: if you are happy with the Namespace & Name of the local endpoint (the Pod to which the policy is applied), we can add that to the policy audit logs. However, it will only give you half the information you need to answer your original question:

you can clearly see if those namespaces should be communicating or not.

@antoninbas antoninbas added the triage/needs-information Indicates an issue needs more information in order to work on it. label May 16, 2022
@jsalatiel
Author

jsalatiel commented May 17, 2022

It would be great to have the namespace/pod name for the local endpoint in the logs. Even if it is only half of the information, it would be really helpful if it could be implemented. Also, I suppose that if both Pods are on the same Node, I will get the full information.

@projx

projx commented May 18, 2022

I've just started to evaluate the usage of ACNPs and ACPs, and this was the first major issue that stood out to me. I understand you're looking to focus more on the flow data, but I think prioritising that is a mistake, because:

  1. A lot of teams are not equipped with the infrastructure to consume and analyse flow data, and it's a huge ask to get up and running, especially for small projects or teams who are just evaluating Antrea - they are more likely to just uninstall it and move on to the next CNI.

  2. NetFlow and related technologies (sFlow, etc.) are intended to be based on samples, to provide a rough picture, and the associated storage/analytics infrastructure will be sized around that. A 30-60 second sample rate would not be adequate for low-level troubleshooting; that would require reducing the sampling interval to near real-time, which would significantly increase the infrastructure requirements.

  3. Audit logs are meant to be comprehensive, especially when they are kept for forensics or compliance requirements, so sampled flows are not suitable. And given the transient nature of containers, the policy logs quickly become obsolete, as there is no easy way to know which Pod an IP was assigned to a few hours earlier. Hence, adding the Namespace/Pod name will make them much more usable in the long term.

@antoninbas
Contributor

Thanks for the feedback. I'd like to make the following points though:

  • we don't intend for the flow visibility tools to miss any connection; we want to have at least one flow record for every connection. We listen to conntrack for events (at the moment we poll, but I'd like us to change that; even when polling, connections stay in conntrack long enough for us to catch everything).
  • we are trying to simplify the infrastructure required to run these tools, which is why we moved away from the ELK stack.

That being said, I can understand this request. I am just trying to avoid too much redundancy, and also too much resource consumption by Antrea components.

Ideally, what would be the best consumption model for audit logs for you? If we need to add the Pod namespace/name, we are likely to do it in a central location (it could be the Flow Aggregator, or something else). Should we then expose an API, write the logs to a file, or send them to a configurable webhook (like K8s audit events)?
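
For illustration, a minimal Go sketch of what the webhook consumption model could look like on the receiving end; the JSON payload and field names below are assumptions for the sake of the example, not an existing Antrea API.

```go
// Minimal sketch of a webhook receiver for audit events, assuming a
// hypothetical JSON payload; the field names are illustrative only.
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

type auditEvent struct {
	Policy    string `json:"policy"`
	Action    string `json:"action"` // e.g. "Drop" or "Allow"
	SrcIP     string `json:"srcIP"`
	DstIP     string `json:"dstIP"`
	Namespace string `json:"namespace"` // Namespace of the local (appliedTo) Pod
	PodName   string `json:"podName"`
}

func main() {
	http.HandleFunc("/audit", func(w http.ResponseWriter, r *http.Request) {
		var ev auditEvent
		if err := json.NewDecoder(r.Body).Decode(&ev); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		log.Printf("policy=%s action=%s pod=%s/%s src=%s dst=%s",
			ev.Policy, ev.Action, ev.Namespace, ev.PodName, ev.SrcIP, ev.DstIP)
		w.WriteHeader(http.StatusNoContent)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```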

@jianjuns
Contributor

Yes, it seems hard to add remote Pod information for every logged packet, and probably also hard to stream all logs to a central service to add Pod information there.

Maybe we can have an async service (in antrea-agent or a separate one) to add information for dropped traffic (assuming dropped traffic is less common), and try our best to add information for other traffic, with some rate control?
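
A minimal Go sketch of that rate-control idea, assuming a hypothetical lookupPodRef enrichment function: dropped traffic is always enriched, while other traffic is enriched only while a token bucket allows it.

```go
// Sketch of the rate-control idea: always enrich dropped connections,
// enrich other traffic only while a token-bucket limiter allows it.
package main

import (
	"fmt"

	"golang.org/x/time/rate"
)

// lookupPodRef is a placeholder (hypothetical) for whatever async service or
// central cache would map an IP to a Pod reference.
func lookupPodRef(ip string) string {
	return "some-namespace/some-pod"
}

func enrich(ip, action string, limiter *rate.Limiter) string {
	if action == "Drop" || limiter.Allow() {
		return lookupPodRef(ip)
	}
	return ip // over the rate limit for non-drop traffic: keep the bare IP
}

func main() {
	// At most 10 best-effort lookups per second, with a burst of 20.
	limiter := rate.NewLimiter(rate.Limit(10), 20)
	fmt.Println(enrich("10.10.2.7", "Drop", limiter))
	fmt.Println(enrich("10.10.2.8", "Allow", limiter))
}
```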

@jsalatiel
Author

jsalatiel commented May 21, 2022

> Ideally, what would be the best consumption model for audit logs for you? If we need to add the Pod namespace/name, we are likely to do it in a central location (it could be the Flow Aggregator, or something else). Should we then expose an API, write the logs to a file, or send them to a configurable webhook (like K8s audit events)?

I like the approach of just writing to a file. If it's going to be in a central location, add it to a log file on the node running the controller.

@github-actions
Contributor

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 20, 2022
@jsalatiel
Author

/remove-stale

@github-actions github-actions bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 23, 2022
@github-actions
Contributor

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 22, 2022
@jsalatiel
Author

/remove-stale

@tnqn tnqn removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 22, 2022
@github-actions
Contributor

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 21, 2023
@jsalatiel
Author

/remove-stale

@antoninbas antoninbas removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 21, 2023
antoninbas added a commit to antoninbas/antrea that referenced this issue Apr 13, 2023
Flows are written to a local "log" file, in CSV format, with support for
log rotation. Not all the fields from the flow records are
included. Configuration options include the ability to filter flows
based on ingress / egress network policy rule actions (in the future,
additional filtering capabilities could be introduced).

For antrea-io#3794

Signed-off-by: Antonin Bas <abas@vmware.com>
@antoninbas antoninbas self-assigned this Apr 13, 2023
@antoninbas antoninbas added lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. and removed triage/needs-information Indicates an issue needs more information in order to work on it. labels Apr 13, 2023
@antoninbas
Contributor

Now that we have added support for file logging to the FlowAggregator, I think the only remaining ask is to include the appliedTo (local) Pod namespace/name in the audit logs. As a reminder, we do not believe that including the peer Pod namespace/name is a good idea, for scalability reasons.
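
For example, a small Go sketch of consuming the CSV flow log described in the commit above; the column layout and the "Drop" value below are assumptions for illustration, not the documented format.

```go
// Sketch of reading the FlowAggregator's CSV flow log and printing dropped
// connections. The file path, column indices, and "Drop" value are assumed
// for illustration; consult the Antrea docs for the actual layout.
package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"os"
)

func main() {
	f, err := os.Open("flows.csv") // illustrative path
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	r := csv.NewReader(f)
	r.FieldsPerRecord = -1 // tolerate rows with varying numbers of fields
	records, err := r.ReadAll()
	if err != nil {
		log.Fatal(err)
	}
	for _, rec := range records {
		// Assumed columns: 0=timestamp, 1=srcPod, 2=dstPod, 3=ruleAction
		if len(rec) > 3 && rec[3] == "Drop" {
			fmt.Printf("%s: %s -> %s dropped\n", rec[0], rec[1], rec[2])
		}
	}
}
```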

@jsalatiel
Author

Great. In which version will this be available?

@antoninbas
Contributor

> Great. In which version will this be available?

File logging is available in the FlowAggregator in Antrea v1.12.
I expect that the local Pod names will be added to the audit logs starting with Antrea v1.13.

antoninbas added a commit to antoninbas/antrea that referenced this issue Jun 8, 2023
We include 2 new fields in the audit logs:

* the "direction" of the NP rule (`Ingress` or `Egress`)
* the reference of the Pod to which the NP rule is applied (as
  `<namespace>/<name>`).

These new fields are *NOT* added at the end of the log entries, which means this change could break existing consumers of the logs.

We also refactor the e2e tests for AuditLogging to improve correctness
and readability. Some logs were not validated properly because of an
early "break" statement, and some log fields (e.g., logLabel) were not
validated.

Fixes antrea-io#3794

Signed-off-by: Antonin Bas <abas@vmware.com>
@antoninbas
Contributor

@jsalatiel we have closed this issue, as what we discussed above has been implemented

as a reminder, we have the following capabilities:

  • the FlowAggregator can now export flow records as a log file: this is similar to Audit Logging in the Agent, but 1) it happens in a centralized location, and 2) each log entry includes more information, including namespace / name for both sides of the connection. Of course, this requires deploying the Flow Aggregator. This is available in Antrea v1.12.
  • the namespace / name of the local Pod has been added to the Audit Logs generated by the Agent. This will be available in Antrea v1.13. A small parsing sketch follows below.
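
Illustration only: a small Go sketch that pulls a `<namespace>/<name>` token out of an audit log line without assuming exact field positions; the sample line is hypothetical, not the actual Antrea log format.

```go
// Heuristic illustration: scan whitespace-separated fields of an audit log
// line for a token that looks like "<namespace>/<name>". The sample line is
// hypothetical; the real Antrea audit log layout may differ.
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var podRefRe = regexp.MustCompile(`^[a-z0-9-]+/[a-z0-9.-]+$`)

func localPodRef(logLine string) (string, bool) {
	for _, field := range strings.Fields(logLine) {
		if podRefRe.MatchString(field) {
			return field, true
		}
	}
	return "", false
}

func main() {
	// Hypothetical log line for illustration only.
	line := "2023-06-13T10:00:00Z AntreaPolicyIngressRule Drop Ingress default/web-0 10.10.2.7 34512 10.10.1.5 80 TCP"
	if ref, ok := localPodRef(line); ok {
		fmt.Println("appliedTo Pod:", ref) // default/web-0
	}
}
```

In practice, consumers should rely on the documented field order for their Antrea version rather than on pattern matching like this.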
