
StatsD UDP publishing stops after failing to send lines #1591

Closed
edwardsre opened this issue Sep 16, 2019 · 17 comments · Fixed by #1782
Labels: bug (A general bug), registry: statsd (A StatsD Registry related issue)

Comments

@edwardsre

edwardsre commented Sep 16, 2019

We have updated several applications to Spring Boot 2.1.8 and metric publishing using the StatsD meter registry has stopped. The actuator/metrics endpoint shows metrics being collected as expected. We publish to localhost on port 8125 at a 1m interval (all defaults).

The actuator/configprops endpoint shows that metrics publication is enabled for StatsD:

management.metrics.export.statsd-org.springframework.boot.actuate.autoconfigure.metrics.export.statsd.StatsdProperties: {
  prefix: "management.metrics.export.statsd",
  properties: {
    flavor: "DATADOG",
    publishUnchangedMeters: true,
    port: 8125,
    pollingFrequency: {
      units: [
        "SECONDS",
        "NANOS"
      ]
    },
    host: "localhost",
    maxPacketLength: 1400,
    enabled: true
  }
}
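
For reference, a sketch of the same (default) settings expressed as Spring Boot 2.1 application.properties entries; this just mirrors the configprops output above, nothing custom:

# Equivalent Spring Boot 2.1 property names for the defaults shown above
management.metrics.export.statsd.enabled=true
management.metrics.export.statsd.flavor=datadog
management.metrics.export.statsd.host=localhost
management.metrics.export.statsd.port=8125
management.metrics.export.statsd.max-packet-length=1400
management.metrics.export.statsd.publish-unchanged-meters=true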

Result summary of Boot and Micrometer versions

Spring Boot Version Micrometer Version Result
2.1.8 1.1.6 ❌ metrics not published to statsd
2.1.8 1.1.5 ✅ working
2.1.7 1.1.4 ✅ working
2.1.7 1.1.6 ❌ metrics not published to statsd

When running Micrometer 1.1.5 or 1.1.4, the Spring Boot Java process shows an established connection from an ephemeral port to port 8125.

netstat -uv | grep 8125
udp        0      0 localhost:53613             localhost:8125              ESTABLISHED 
sudo netstat -lanp | grep 53613
udp        0      0 ::ffff:127.0.0.1:53613      ::ffff:127.0.0.1:8125       ESTABLISHED 3295/java     

A tcpdump capture confirms the UDP traffic is being received:

sudo tcpdump -i lo -n udp port 8125
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on lo, link-type EN10MB (Ethernet), capture size 65535 bytes
19:17:34.053140 IP 127.0.0.1.53613 > 127.0.0.1.8125: UDP, length 1260
19:17:34.134113 IP 127.0.0.1.53613 > 127.0.0.1.8125: UDP, length 1261
19:17:34.199557 IP 127.0.0.1.53613 > 127.0.0.1.8125: UDP, length 1261

Running the same commands against the same application with 1.1.6, the Java process does not show any ephemeral port established to port 8125.

After changing the logging level to DEBUG for io.micrometer.shaded.reactor, I could see many onNextDropped debug messages, one for each meter with its associated line data:

2019-09-17 21:51:12,895 DEBUG 3319 (http-nio-5000-exec-2) [Operators]  onNextDropped: method.timed:2.426269|ms|#class:LabelJdbcStore,exception:none,method:findLabeledPids
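
(For reference, this debug output can presumably be enabled with a single Spring Boot logging property; a sketch, assuming application.properties is used:)

# enable the Reactor debug output quoted above (shaded package inside micrometer-registry-statsd)
logging.level.io.micrometer.shaded.reactor=DEBUG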

Since Operators uses multicast, that led me to look at ifconfig, where I found that multicast is not turned on for the loopback interface on our AWS EC2 instances, but it is for eth0.

ifconfig - No Multicast on loopback

ifconfig
eth0      Link encap:Ethernet  HWaddr 0A:F3:B8:FE:4E:2E  
          inet addr:10.36.73.178  Bcast:10.36.79.255  Mask:255.255.240.0
          inet6 addr: fe80::8f3:b8ff:fefe:4e2e/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:9001  Metric:1
          RX packets:5908501 errors:0 dropped:0 overruns:0 frame:0
          TX packets:3941816 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:6430044133 (5.9 GiB)  TX bytes:934258311 (890.9 MiB)
lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:2918885 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2918885 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:5714060914 (5.3 GiB)  TX bytes:5714060914 (5.3 GiB)

I turned on multicast for the loopback interface and restarted the Spring Boot app, and metrics are now being published.

Turn on multicast

sudo ifconfig lo multicast

ifconfig lo
lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING MULTICAST  MTU:65536  Metric:1
          RX packets:3058498 errors:0 dropped:0 overruns:0 frame:0
          TX packets:3058498 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:5988434965 (5.5 GiB)  TX bytes:5988434965 (5.5 GiB)

When the loopback interface is configured without multicast, metrics are published when using Micrometer 1.1.5. The eth0 interface has multicast turned on, so I'm wondering if that interface was being selected in 1.1.5 and isn't in 1.1.6. I haven't been able to narrow down which change in 1.1.6 caused this to stop working.

@breun

breun commented Sep 19, 2019

Not sure if it's the same issue, but I have noticed that some of our applications using Micrometer with StatsD recently stopped publishing metrics at a seemingly random moment, though all at the same time. They were recently upgraded from Micrometer 1.1.5 to 1.1.6, as managed by Spring Boot 2.1.7.RELEASE and 2.1.8.RELEASE respectively.

I'll try downgrading to Micrometer 1.1.5 via Spring Boot's Maven property to see if this helps.

<properties>
    <micrometer.version>1.1.5</micrometer.version><!-- Temporary workaround for StatsD bug: https://github.com/micrometer-metrics/micrometer/issues/1591 -->
</properties>

@shakuzen shakuzen added the registry: statsd A StatsD Registry related issue label Sep 20, 2019
@shakuzen shakuzen added this to the 1.1.7 milestone Sep 20, 2019
@shakuzen shakuzen added the bug A general bug label Sep 20, 2019
@shakuzen
Member

If I were to guess, the upgrade from reactor-netty 0.8.1 to 0.8.11 (see #1574 and #1561) is a likely candidate for the difference in behavior here. In previous patches, we never upgraded the shaded version of reactor-netty we were using. I will have to look into this more to confirm the root cause and what to do about this, though.

Thank you, @edwardsre, for the thorough investigation and write-up of the issue.

shakuzen added a commit that referenced this issue Sep 30, 2019
This is a workaround to #1591 until we understand the root cause and it is fixed.

Resolves #1613
@shakuzen shakuzen modified the milestones: 1.1.7, 1.1.8 Sep 30, 2019
@shakuzen
Member

As a workaround, I've downgraded reactor-netty in #1613, which restores the previous behavior as far as I can tell from my testing.

Request to affected users
Would those affected by this issue please test with Micrometer 1.1.7-SNAPSHOT or 1.1.7 when it is released to confirm this fixes things for now?

I'm leaving this issue open as we still want to identify the root cause and figure out where a fix needs to happen so we are not forever stuck on older versions.

@shakuzen shakuzen changed the title Micrometer 1.1.6 has broken StatsD meter publishing. StatsD publishing broken when multicast is disabled Oct 1, 2019
@breun

breun commented Oct 1, 2019

In the cases I witnessed and heard about, the application had been running for a couple of days already. Does anyone have a quicker way to reproduce/test?

@breun

breun commented Oct 1, 2019

@shakuzen I heard from colleagues that they also encountered this with Micrometer 1.2.0. Is that possible and does reactor-netty maybe also need a downgrade in the 1.2.x branch?

@breun

breun commented Oct 1, 2019

@shakuzen I see Micrometer 1.1.7, 1.2.2 and 1.3.0 have been released. You say reactor-netty was downgraded in 1.1.7, which seems to have been released now, and the 1.3.0 release notes mention that all changes in 1.1.7 and 1.2.2 are also incorporated into 1.3.0. However, the 1.3.0 release notes also list this issue as a known issue. So, is the issue fixed or not in 1.1.7, 1.2.2 and 1.3.0?

@edwardsre
Author

@shakuzen I have deployed one of our affected applications with the 1.1.7 release and it seems to have remedied the problem reported with 1.1.6. For us, the break did not take days to occur as suggested by @breun above; it was immediate upon deploy. Either way, I will watch it for a while to make sure it doesn't stop working.

Thanks for leaving this open and continuing to look into the root cause in reactor-netty and netty.

@shakuzen
Member

shakuzen commented Oct 1, 2019

However, the 1.3.0 release notes also mention this issue as a known issue. So, is the issue fixed or not in 1.1.7, 1.2.2 and 1.3.0?

The workaround of downgrading the reactor-netty and netty dependencies is in the 1.1.7 and 1.2.2 releases. 1.3.0 upgrades to the recent reactor-netty 0.9.0 minor release, and there were some compatibility issues with downgrading that, so the issue is still present in 1.3 for now. Once the root cause of this issue is identified and fixed, we'll get a patch release out with the fix.

@edwardsre Thank you for trying out the release and confirming things are back to working as expected so far.

@patkovskyi

Does not work for me with Micrometer 1.1.7 & Spring Boot 2.1.9.
Works with Micrometer 1.1.5 & Spring Boot 2.1.9.

@patkovskyi

Spring Boot 2.2.0 (comes with Micrometer 1.3.0) also does not work.

Could someone clarify what the recommendation is for Spring Boot users?
Stick with the latest Spring Boot and use Micrometer 1.1.5?

@breun

breun commented Oct 24, 2019

Have you tried using Spring Boot 2.2.0.RELEASE and setting <micrometer.version>1.1.7</micrometer.version> in your Maven pom.xml properties (or the equivalent in Gradle if you use that) to see if Spring Boot 2.2.0.RELEASE is compatible with Micrometer 1.1.7?

If not, I guess you need to stick with Spring Boot 2.1.x until this issue is fixed in Micrometer 1.3.x.
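
For Gradle builds, the equivalent override is presumably the dependency-management version property (a sketch, assuming the Spring Boot Gradle plugin together with the io.spring.dependency-management plugin):

// build.gradle: override the Micrometer version managed by Spring Boot's dependency management
ext['micrometer.version'] = '1.1.7'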

@shakuzen
Member

I have investigated this more and the issue seems to be that the behavior of the reactor-netty UdpClient has changed:

Before reactor-netty 0.8.7 / netty 4.1.36: UdpClient continues to send statsd lines even if the statsd host isn't accepting the connection.

Since reactor-netty 0.8.7 / netty 4.1.36: UdpClient stops sending statsd lines if it fails to send twice.

Note that there should be no problem as long as the statsd daemon is started and accepting lines before the application starts sending statsd lines, and the application is able to continually send lines to the daemon. If, however, the statsd daemon is not accepting the lines sent by the application, the application will stop sending lines with the new versions of netty and reactor-netty.

I don't know if the change in behavior is intentional or not. Maybe it was caused by netty, and not something directly in reactor-netty. Perhaps @violetagg can speak to that.

In Micrometer, we could avoid the issue by implementing recovery logic for failed UDP sends of statsd lines, as was already requested for TCP in #1212. Otherwise we would need a way to get the previous behavior from the underlying reactor-netty UdpClient.
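
To illustrate the kind of recovery logic meant here, a rough standalone sketch follows. This is not Micrometer's code; it only assumes reactor-netty 0.8.x's UdpClient API and the retryBackoff operator present in the reactor-core versions of that era, and it simply rebuilds the connect-and-send pipeline whenever it terminates:

import java.time.Duration;

import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;
import reactor.netty.udp.UdpClient;

public class UdpRecoverySketch {

    public static void main(String[] args) {
        // Placeholder statsd lines; in Micrometer these would come from the meter registry.
        Flux<String> lines = Flux.interval(Duration.ofSeconds(10))
                .map(i -> "my.counter:1|c");

        Mono<Void> sendPipeline = Mono.defer(() -> UdpClient.create()
                .host("localhost")
                .port(8125)
                .handle((in, out) -> out.sendString(lines).then())
                .connect()
                // Treat the connection going away as an error so the retry below kicks in.
                .flatMap(conn -> conn.onDispose()
                        .then(Mono.<Void>error(new IllegalStateException("UDP connection closed")))));

        sendPipeline
                // Rebuild the whole pipeline (new connection, re-subscribed lines) on failure.
                .retryBackoff(Long.MAX_VALUE, Duration.ofSeconds(1), Duration.ofMinutes(1))
                .subscribe();

        // Keep the JVM alive for this sketch.
        Mono.never().block();
    }
}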

@violetagg
Contributor

@shakuzen Can you test 0.8.13 Reactor Netty + 4.1.43 Netty because we have changes there?

@shakuzen
Member

Can you test 0.8.13 Reactor Netty + 4.1.43 Netty because we have changes there?

@violetagg I have tried with 0.8.13 Reactor Netty + 4.1.43 Netty and it has the same behavior as mentioned above for "since reactor-netty 0.8.7 / netty 4.1.36".

@shakuzen shakuzen changed the title StatsD publishing broken when multicast is disabled StatsD UDP publishing stops after failing to send lines Nov 4, 2019
@shakuzen
Member

shakuzen commented Nov 4, 2019

I have updated the title of the issue to reflect my understanding after my testing showed consistent behavior regardless of multicast. I will open a separate issue for Micrometer 1.3.1 to temporarily downgrade our shaded versions of reactor-netty / netty to 0.8.6 / 4.1.34 which have the previous behavior. I'll leave this issue open to track the long-term solution.

@russell-bie

Does it mean this issue has been fixed in version 1.3.1 by downgrading the netty packages?

@shakuzen
Member

shakuzen commented Nov 7, 2019

Does it mean this issue has been fixed in version 1.3.1 by downgrading the netty packages?

That is my expectation. It is more of a workaround than the long-term solution. I'm leaving this issue open until we have a long-term solution that allows us to upgrade to the latest versions of dependencies.

Let us know if this isn't fixed for you with 1.3.1. In my testing, it appeared to be fixed.

@shakuzen shakuzen modified the milestones: 1.1.9, 1.4.0 Nov 28, 2019
shakuzen added a commit that referenced this issue Jan 27, 2020
Upgrades to the latest version of reactor-core and reactor-netty. Uses a `DirectProcessor` instead of a `UnicastProcessor` that can only be subscribed to once. We need to be able to subscribe multiple times for reconnecting to a server with a new `TcpClient`. For multi-threaded publishing to the `Processor`, we use a `FluxSink` rather than calling `onNext` directly. Starting and stopping the registry should now work as expected and is tested.

Unlike before, there is no buffering done in the `Processor`. If the UDP/TCP client or `lineSink` cannot keep up with the rate metrics are produced at, metric lines will be dropped rather than buffered indefinitely (potentially until the application runs out of memory). A mitigation when using the UDP/TCP client is to use the `BufferingFlux` which is enabled by default - it buffers metric lines in memory up to a configurable size/time before emitting to the UDP/TCP client. As a consequence, the queue size/capacity methods and metrics (`StatsdMetrics`) are no longer available.

Resolves #1212
Resolves #1591
Resolves #1676
Resolves #1741
Supersedes #1251

Co-authored-by: Johnny Lim <izeye@naver.com>
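
As a minimal, standalone illustration of the Reactor pattern described in that commit message (not Micrometer's actual publisher), a DirectProcessor accepts repeated subscriptions while a FluxSink handles emission from multiple threads; a sketch using plain reactor-core:

import reactor.core.publisher.DirectProcessor;
import reactor.core.publisher.FluxSink;

public class ProcessorSketch {

    public static void main(String[] args) {
        // Unlike a UnicastProcessor, a DirectProcessor accepts multiple subscribers,
        // so a reconnecting UDP/TCP client can re-subscribe to the same line stream.
        DirectProcessor<String> lines = DirectProcessor.create();

        // A FluxSink gives thread-safe emission instead of calling onNext directly.
        FluxSink<String> sink = lines.sink();

        // First "client" subscribes and receives lines as they are emitted.
        lines.subscribe(line -> System.out.println("client A sent: " + line));
        sink.next("my.counter:1|c");

        // After a reconnect, a replacement "client" can subscribe again.
        // A DirectProcessor does not buffer, so lines emitted while no subscriber
        // is attached are simply dropped (as the commit message notes).
        lines.subscribe(line -> System.out.println("client B sent: " + line));
        sink.next("my.timer:3|ms");
    }
}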