IPFIX start and stop timestamps are wrong/too large #1417
@dreamtalen do you think you could take a look?
Sure, I will try. @robcowart Could you please share the steps you used to capture this packet with the wrong start and stop times?
The environment is a new lab, consisting of...
Antrea was deployed by...
That's it. Just a really simple starting point. When set up in this manner, the dates I am receiving are as shown above. Interestingly...
It may be completely unrelated to this issue, but I have also noticed that when flow export is enabled, the antrea-agent will consume 100% of a CPU core. It returns to normal after disabling flow export.
I deployed a cluster with Antrea and the same config as you (except for...
I generated a lot of traffic (but not that many connections). I left it running for a while, but things were stable. Is there anything useful in the antrea-agent logs? Anything specific I should be doing to reproduce?

On a side note, we should probably have a mechanism to enable Go profiling and the pprof HTTP server in the Antrea components, so that when a user runs into an issue like this one it is easy to collect information (goroutine backtrace, CPU profile, etc.). I'll open an issue to this effect.
@antoninbas I am going to walk through the whole process again and will document each step for repeatability. If the CPU utilization issue remains, I feel I should open a separate issue, unless you think it should stay here.
Hi Rob, I failed to reproduce the wrong start and stop timestamps on my side. I'm not sure if it's related to the different IPFIX collectors we are using. Would you please try our ELK Flow Collector at https://github.com/vmware-tanzu/antrea/blob/master/docs/network-flow-visibility.md? It provides a Kibana dashboard, which is easy to use and recommended for Antrea flow visualization. Also, please tell me which tool you used to capture this IPFIX record so I can try it too. Thanks!
@robcowart Thanks! Yes, please open a separate issue when you get the steps documented. @dreamtalen @zyiou could we take a deeper look at the timestamp thing? I know that the timestamps look correct (at least...
Whereas in our case we are using a signed 64-bit integer:
The RFC says that the encoding should be the same as the ExportTime field, for which we use a...
The collector is irrelevant in this case, as I was seeing this on the wire in a PCAP before it reached the collector. Seeing it also in the collector just confirms what was seen on the wire. Can you share anything else about your environment? As I mentioned above, I am on Ubuntu 20.04, K8S 1.19.3 via...

I will walk through my environment setup from scratch again, but it will take me a few days.
@antoninbas I remembered that...
Thanks @robcowart for raising this issue. Currently, we use an unsigned 64-bit datatype for dateTimeSeconds, which is a bug. As @antoninbas correctly pointed out, according to the RFC, dateTimeSeconds should be unsigned 32-bit, while dateTimeMilliseconds should still be unsigned 64-bit. In addition, I concur with Rob's suggestion that users of the go-ipfix lib should pass a uint32 directly for dateTimeSeconds and a uint64 for dateTimeMilliseconds, rather than the library expecting the time.Unix() format provided by the Go time package.
An orthogonal point: right now, flowEndTime is not properly encoded in the flow record. Conntrack flows do not get a stopTime until the conntrack flow is deleted. Currently, we use that stopTime as flowEndTime for every flow record of a conntrack flow, so we see some arbitrary values. @antoninbas @robcowart Any suggestions?
I would suggest using 0 to make it clear that the value is not valid yet. I couldn't find any guidance for...

BTW, a related question: when a flow ends, are we guaranteed to export correct final packet / byte count values? Or does this guarantee only hold if the polling interval is less than the time-wait delay (120s by default IIRC)? Have we considered listening to conntrack events as an alternative / complementary implementation to polling, at least for Linux?
Most network devices send flow records based on one or more configurable timeout settings. While they will track a flow for a longer period of time, they will send flows more frequently. This is necessary for a number of reasons.
If I explained those well, you should see that it is very desirable to send information about a long-lived flow over the life of the flow. Most network devices will provide various options to control this. Some are simple:
Others provide more granular control:
I explain this difference just to show the options that are common. The expectation is that it should be possible to at least provide the former. The only detail I would add is that the inactive and active periods should be tracked per flow, not globally. For example, if the 60-second export period were global, i.e. every 60 seconds all active flows are exported, you could end up with 59 seconds of no data followed by a flood of records from across the whole infrastructure simultaneously. This can easily overwhelm some collection systems, especially Linux systems with default networking kernel parameters. Tracking the inactive and active timeouts per flow will better distribute the load on the collecting system and provide more timely and accurate reporting. In 2020 I would say it is generally accepted that 1 minute is a good compromise between data granularity and the overall volume of records generated, with short-lived flows being exported quickly due to the inactive timeout.

Now that we have established that records should be sent periodically over the lifetime of a network flow, the next question is which fields, or "information elements" (IEs), should be included. While there are both delta and total IEs for things like bytes and packets, almost every vendor sends only the delta values, where the delta is the quantity since the previous record for this flow was exported (or since the flow started, if sending the initial record). Sending only delta values is usually not an issue, as it is easy enough at query time to sum the deltas when a total is needed. I have seen a few examples of vendors sending the total values as well, but it isn't really important. What should be avoided is sending only total values. That makes it very challenging to work with the data later, especially in combination with records from other sources that provide only delta values.
There is nothing wrong with including the total values, but I would simply remove them and gain a bit of efficiency.

Regarding the start and end timestamps: I have to admit that the RFCs aren't really clear here. My feeling is that for active flows...

The last thing I will mention is UDP vs. TCP. I noticed in the Antrea repo examples that the flowexporter was configured to use TCP. However, I set mine up to use UDP. Like many other network-related data sources (SNMP, syslog, etc.), netflow was designed to be sent via UDP. The logic is that establishing a TCP session has much more overhead and requires much more network traffic (especially in the reverse direction, for ACKs and such) than UDP. During an outage, it is generally undesirable for the overhead of management traffic attempting to re-establish sessions to compete with applications trying to do the same. This is so embedded in the networking mindset that many devices, as well as collectors, do not even support TCP. In fact, IPFIX is the only flow standard to even specify optional support for TCP. Since the Antrea exporter supports both, this isn't an issue. However, you should be aware that if people follow the documented example config exactly, some of them might encounter issues because their collector doesn't support TCP.

Hopefully this was helpful. Let me know if there are any questions that I can help answer.
@srikartati @zyiou @dreamtalen could we investigate the possibility of implementing...
Thanks @robcowart for the detailed comments. @antoninbas Yes, an IPFIX field based on the status flag in polled connections was planned to be implemented, similar to...
This is a good point. Currently, we do not have any maximum for the polling interval, so the final counts are accurate only if it is less than the time-wait delay. Regarding event-based tracking of connections, there are some ideas about using eBPF programs to get the flow information in addition to polling. We do not have any concrete proposal for now.
Closing this issue (fixed in PR #1479).
@zyiou, @srikartati this issue should be reopened. I have tested v0.11.1, and while the issue is partially fixed, there is still a problem. The flow start time is now correct. However, the flow end time is still far in the future. It is not as far out as when I originally reported this issue, but it is still wrong.
@robcowart You are right. I just checked the code. We are getting the stopTime from the conntrack table; instead, per the discussion in this issue, we should record the stopTime as the time when the flow record is created (time.Now()). @zyiou Is it possible for you to add this in one of the PRs you already opened? Thanks.
Sure. I can add this in #1582 once I finish the upgrade to go-ipfix v0.3.1.
* Move to go-ipfix package v0.3.1
  - Changes for set and information element struct and interfaces
  - Fixes antrea-io#1417 by changing flowEndSeconds to use time.Now()
  - Move to using antrea/ipfix-collector v0.3.1 (based on antrea-io#1523)
I have tested...
@robcowart Thanks for reporting this. Did you use the Flow Aggregator service when you tested, or did you export flow records directly to the flow collector from the Flow Exporter?
Directly exported from flow exporter. |
@robcowart Does it happen every time you export flow records, or does it happen occasionally?
@zyiou Yes, that is my follow-up comment. If this is not happening for every flow, then we need to look at the conntrack dumping API from...

@robcowart I just want to make sure the antrea configmap parameter is the default value:
@zyiou I checked the PCAP of records that I made. It isn't all of the records, but it is at least 80% or more. If it is relevant... this environment is not very busy. It has only a few example deployments to test the IPFIX exporter. @srikartati...
@robcowart Thanks for confirming. Could you please check if the following sysctl config is set in your system? /proc/sys/net/netfilter/nf_conntrack_timestamp |
Every node in the cluster has the same value...
Describe the bug
IPFIX start and stop timestamps are wrong/too large.
To Reproduce
I simply installed a new Kubernetes lab environment, and installed Antrea.
Expected
Start and Stop timestamps sent in the IPFIX record should be the actual start and stop time.
Actual behavior
As seen in the PCAP output below, the timestamps are ~85 1/2 years in the future.
The hex value is...
ff ff ff f1 88 6e 09 00
Versions:
Please provide the following information:
Additional context
Add any other context about the problem here, such as Antrea logs, kubelet logs, etc.
And the hex dump...