Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MP Metrics 5.0 support for 4.x #7139

Merged
merged 67 commits into from
Jul 14, 2023
Merged

Conversation

tjquinno
Copy link
Member

@tjquinno tjquinno commented Jun 30, 2023

Resolves #5816

👀 It's a long list of reviewers! Changes in the metrics API affected many parts of Helidon that use metrics, including several integrations. I've added people who IIRC have worked on such integrations so you can take a look.

Overview

Note: Look for 🚧 User action flags below for what developers need to do to adapt to the changes. The MP spec section on migration hints has information also.

The latest MP Metrics release is 5.0.1.

This PR contains all the code changes that I know of to implement MP Metrics 5.0. As noted in separate meetings, the TCK is not yet running because the way we set up the Arquillian environment leads to many test failures. That remains a work in progress.

To detect other problems, I have commented out the invocation of the metrics TCK in this PR.

What this PR does not include:

  • doc changes
  • use of the new Helidon injection/builder approaches
  • reactive removal
  • removal of nima directories

MP Metrics 5.0

There were breaking changes in the MP metrics libraries--to help simplify the API and bring MP metrics in closer alignment with Micrometer and OpenTelemetry metrics. Notable changes:

  • metric types removed

    🚧 User action - use an alternative metric type and annotation

Metric type/annotation removed Alternative
ConcurrentGauge
@ConcurrentGauge
Gauge
@Gauge
Meter
@Metered
Counter
@Counted - back-ends can track the changes in the counter value over time and create time series analysis from that (meter used to maintain historical data itself in the server)
SimpleTimer
@SimplyTimed
Timer
@Timed
  • metadata removals

    • type
    • display name

    🚧 User action - Remove code that sets the type or display name in the metadata programmatically and remove settings in the @Metric annotation.

  • Other

    • The notion of a registry "type" with fixed values of base, vendor, and application has given way to the idea of a "scope" instead. There are three built-in scopes--base, vendor, and application--but users can create their own scopes.

      • MetricRegistry.Type was an enum covering base, vendor, and application. It's gone because users can create their own so an enum doesn't really work.
      • @RegistryType is deprecated, intended to be replaced by @RegistryScope. Unfortunately, @RegistryScope is not (yet) marked as a CDI qualifier (that'll change in an upcoming release of MP Metrics), so for now our CDI code still has producers for the three registry types using @RegistryType as well as a producer for injection sites that use @RegistryScope rather than @RegistryType.

      🚧 User action

      • Change uses of MetricRegistry.Type to a String with the corresponding name. The MetricRegistry interface declares String constants for the predefined scopes.
      • Over time, change uses of @RegistryType (deprecated but not yet removed) to @RegistryScope.
    • All metrics that share the same name must:

      • Have the same metadata (this is not new)
      • Have the same tag name set (this is new). Apparently the Prometheus exporter (not ours) needs this restriction for some reason in order to produce correct output; I am not at all familiar with that code but the restriction is written into the spec and checked by the TCK so our code enforces this.

      🚧 User action - Make sure all metrics that share the same name use the same tag names.

    • The metric registry API no longer permits you to create your own instance of one of the supported metric types (such as your own Counter) and register it. This is because Micrometer and OpenTel are totally in control of manufacturing the metric instances.

      Some of our code (integrations, for example) used to do this in order to wrap other technology's measurements with custom implementations of the MP metrics (typically a counter). For the most part where I had to change our code to adjust, I changed such places to use gauges instead. In line with what Micrometer supports, The Registry interface usable from SE apps includes a way to register a read-only Counter by passing a ToDoubleFunction and an object to which to apply that function. But this is not visible through the MP MetricRegistry interface. In some places I used that approach.

      🚧 User action - Use a Gauge metric type to expose a value that is updated outside of the metrics API.

    • Change in Gauge to Gauge<? extends Number> - The Gauge metric was always intended to expose numeric data. The updated API now enforces this.

      🚧 User action - Declare each of your Gauge metrics so it reflects the specific numeric type the gauge exposes.

Metric types remaining from earlier releases are Counter, Gauge<? extends Number>, Histogram, and Timer.

Why do so many files have to change?

Our SE metrics implementation has always dependend--and, for the moment, still does depend--on the MP metrics types. We plan for that to change by introducing a neutral Helidon metrics API which will eventually have implementations which delegate to Micrometer and OpenTelemetry metrics.
Then our MP metrics implementation will invoke the neutral Helidon API, as will customers who develop SE apps.
People will be able to choose whether to use Micrometer or OTel underneat for either SE or MP apps.
But we have not yet defined or implemented the neutral API yet. (I have thought a lot about what that API should look like but the code is not written yet.)

So, the MP metrics changes rippled through our SE metrics API and implementation, and into other code that uses metrics, and into examples, and into tests.

Other notes

Although this PR does not introduce the neutral Helidon metrics API, there are some beginning steps in that direction.
For example, the new MetricFactory SPI interface encapsulates the details of creating new metrics. There are two implementations so far: one that delegates to Micrometer and one that provides no-op implementations of the various metrics.

Other changes in behavior

Prometheus output

In addition to the API changes, the Prometheus output now is slightly different. Each line showing a metric value (that is, not the # TYPE or # HELP lines) has an additional tag to identify the scope where that metric was registered. In MP, the tag name is mp_scope per the spec; in SE it (for the moment) is h_scope.

This is because our code now stores metrics in the Micrometer meter registry...a single meter registry for all scopes we and our developers might create. That's the way Micrometer works.

The Micrometer world does not use different meter registries for scoping but rather for dealing with different types of backends/consumers of metrics output. Typical applications use the Micrometer global registry for adding, removing, and querying meters.
That global registry is a composite registry, meaning you can add technology-specific meter registries to it.
There is a Prometheus Micrometer meter registry Helidon creates and adds (if absent) to the global registry.
Applications create and add other types of meter registries to the global one to support other back-ends.

Further, the Prometheus Micrometer meter registry knows how to format its output so we will be able to remove our own code that does that.

JSON output

It's still there in Helidon, but the MP metrics spec no longer requires it. At this moment, I have not added back the JSON OPTIONS support for metadata.
I might do that if I have time.

We might want to remove JSON support at some point--perhaps deprecate it in 4.0--but there was some Helidon code that used it for testing and there might be user code out there that relies on it.
Users will have plenty to adapt to in 4.0 already, so it might be worth keeping JSON format output for now to ease that migration.

I already changed some of the JSON-dependent testing code to use Prometheus output instead; it is possible to do that everywhere if we need to.
Because there will be a cost to us in keeping it around--even if deprecated--until 5.0, we might want to remove it now.
To my knowledge, MP metrics never marked the JSON output as deprecated during the MP metrics 4.x lifespan, for what that's worth.

Current state

Work done

  1. AFAIK, all our MP and SE code and other files (excluding doc) are now consistent with the new MP Metrics spec and changes in the API.
  2. I've updated the MP fault tolerance dependency from 4.0 to 4.0.2 so it depends on the updated MP metrics library instead of the old one as 4.0 did.

Work left to do for full MP metrics support in M1

  1. Adapt to upcoming big merges as needed.
  2. The MP spec contains a section with suggestions (not requirements) for vendors who choose to use Micrometer for the implementation, mostly how to plug in and configure Micrometer support for other back-ends. This is nice to have but is certainly not required for M1.

@tjquinno tjquinno added this to the 4.0.0-M1 milestone Jun 30, 2023
@tjquinno tjquinno self-assigned this Jun 30, 2023
@oracle-contributor-agreement oracle-contributor-agreement bot added the OCA Verified All contributors have signed the Oracle Contributor Agreement. label Jun 30, 2023
@tjquinno tjquinno linked an issue Jun 30, 2023 that may be closed by this pull request
@tjquinno tjquinno force-pushed the mp-metrics-4.x-mm branch from bc72cfb to 05447e9 Compare July 1, 2023 23:36
@tomas-langer
Copy link
Member

tomas-langer commented Jul 3, 2023

There seems to be a problem with the calls to registry. Calls to /observe/metrics fail with 404. But if I call /observer/metrics/base, it succeeds, and then all calls to /observer/metrics succeed as well.
Investigating

Problem is in RegistryFactory.getInstance().scopes() that returns an empty set, as the registries field is empty.
Investigating

So the registries field is lazyly initialized as the various scopes are requested.
Possible fix:

    @Override
    public Iterable<String> scopes() {
        Set<String> result = new HashSet<>(registries.keySet());
        // always add base and vendor, as these are known to exist (even if not yet initialized)
        result.add(Registry.BASE_SCOPE);
        result.add(Registry.VENDOR_SCOPE);
        return result;
    }

@tjquinno
Copy link
Member Author

tjquinno commented Jul 3, 2023

Exactly. I also found this independently. Pushed a fix to the PR branch.

@tjquinno
Copy link
Member Author

tjquinno commented Jul 3, 2023

We definitely need to do this for base (but only if it is enabled; remember metrics can be disabled by scope).

But you think also for vendor? Base is the only registry lazily instantiated upon first read.

@tjquinno
Copy link
Member Author

tjquinno commented Jul 3, 2023

Still investigating, but the actual underlying cause might be different.

With the change to the scopes method to always return base I am seeing failures in the bookstore functional test because the request.count metric from the key performance indicator metrics is not present in the output.

It seems that maybe the metrics feature post-setup code is not invoked when expected. That code should be fetching the base metrics registry to trigger its creation and also set up the KPI metrics...and neither seems to be happening and in the debugger I don't see the metrics feature postSetup method invoked.

@barchetta barchetta added metrics MP dependencies Pull requests that update a dependency file labels Jul 5, 2023
@tjquinno
Copy link
Member Author

BTW, the latest pushes resolve the problem Tomas and I discovered with the base metrics.

Copy link
Member

@klustria klustria left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it relates to changes in integrations/oci/metrics, it LGTM, hence giving approval from that perspective.

tjquinno added 24 commits July 12, 2023 23:30
…less sensitive to ordering of TEST and HELP lines in Prometheus output
…d created yet, when Registry is asked for the list of scopes; also make sure metrics is properly declared as a feature
…shutdown, now that the clean-up is in the RegistryFactory stop rather than start method
private static String valueMatcher(String statName) {
// application_timerForGets_mean_seconds 0.010275403147594316 # {trace_id="cfd13196e6a9fb0c"} 0.002189822 1617799841.963000
return "application_" + GreetService.TIMER_FOR_GETS
+ "_" + statName + "_seconds [\\d\\.]+ # \\{trace_id=\"[^\"]+\"\\} [\\d\\.]+ [\\d\\.]+";
// TODO Following is the original, exemplar-matching pattern. Suppressed temporarily while we migrate
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you place the issue number here as part of the follow-up

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will update in this PR if there are other, functional, changes as well.

@@ -34,6 +34,7 @@ class MicroProfileMetricsTrackerFactory implements MetricsTrackerFactory {
this.registry = null;
}

// TODO change to RegistryScope once MP makes it a qualifier
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tracking issue inlined

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will update in this PR if there are other, functional, changes as well.

private Instance() {
}

private static final LazyValue<MetricsProgrammaticSettings> INSTANCE = LazyValue.create(() ->
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why aren't we using the @Contract and pico style approach here? Holistically, I think we need to stop using this pattern - @tomas-langer should comment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely needs to happen. Much was changing in a short time as this was coming together and I chose to defer those changes until we had MP metrics operating.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the correct approach for now. We may want to use contract based lookup, but only when running within injection service registry. It still must work outside of it

* @throws java.lang.IllegalArgumentException if the implementation cannot handle the requested media type
* @throws java.lang.UnsupportedOperationException if the implementation cannot expose its metrics
*/
Optional<?> scrape(MediaType mediaType, Iterable<String> scopeSelection, Iterable<String> meterNameSelection);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning Optional<?> seems wrong.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could return null instead if the provided selections result in no output, but it's against our guidelines to use null as part of our APIs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its not the optional i am commenting on - it is the use of the wildcard generic

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying.

If you look at the two formatters that can be called, the Prometheus one returns an Optional<String> and the JSON one returns an Optional<JsonObject>.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in that case why not return a Builder with two Optional<> attributes? I think that would be better than using generics.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As is, the value (if present inside the Optional) is sent directly as the entity in the response. The code in MetricsFeature has no need to know the specific type of the returned value. The media support handles whatever the actual type is in sending the response. It's very flexible if we ever need to add another formatter that produces another type.

Adding another builder would (IMO) unnecessarily complicate things now, and if we were to add other format types in the future then we'd have to update the builder as well. As is, a hypothetical new formatter would just return its type and we'd be done.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw Optional<Object> would work the same and it is less unusual on the API

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added to follow-up tasks issue #7193

void addCounterWithTag() {
Counter counter = reg.counter(COMMON_COUNTER_NAME, new Tag("myTag", "a"));
assertThat("New counter value", counter.getCount(), is(0L));
counter.inc();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the point of this increment call if its not getting checked later on?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The @BeforeEach exercises the metric deletion feature. The increment of the counters in each test verifies that the counter was, in fact, deleted correctly by that code. If the deletion does not work correctly, then the counter value in the other test could be non-zero as a result of the increment at the end of the first test.

* @param <T> type of the origin
*/
static <T> HelidonFunctionalCounter<T> create(MeterRegistry meterRegistry,
String scope,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: alignment

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will change in this PR if there are also other, functional, changes needed.

Copy link
Member

@tomas-langer tomas-langer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of MIcrometer as part of Metrics API was unexpected for me, and I would like to understand this.
I thought the MP was kept independent yet aligned with Micrometer, but now we have a hard dependency on it.

@@ -50,6 +50,14 @@
<groupId>org.eclipse.microprofile.metrics</groupId>
<artifactId>microprofile-metrics-api</artifactId>
</dependency>
<dependency>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is Micrometer an explicit dependency here? I thought the APIs should be MP and not Micrometere?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, not ideal, but we plan for Helidon to introduce its own neutral metrics API and, once that is in place, we will change our MP metrics implementation to depend on that rather than directly on Micrometer.

But due to scheduling complications earlier in the spring and given our goal of having MP metrics support in M1 we decided to go ahead with a first MP implementation without waiting for the neutral API, knowing we would change the MP implementation once the neutral API is in place.

The neutral API will likely be very close to the Micrometer API, partly for the convenience of developers and partly because of Micrometer's maturity in this space. In that case, the changes we will need to make in our MP implementation, although not zero, will be relatively small. Much of the code already uses a MetricFactory SPI abstraction to help minimize the changes required later.

We would definitely have preferred to have the neutral API in place first, but I just couldn't make that happen in time for MP metrics in M1.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should try to fix this for M2. It doesn't look very nice. I should be able to help Tim.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we should implement our neutral API and adapt SE and MP to it soon.

private Instance() {
}

private static final LazyValue<MetricsProgrammaticSettings> INSTANCE = LazyValue.create(() ->
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the correct approach for now. We may want to use contract based lookup, but only when running within injection service registry. It still must work outside of it

* customization.
* </p>
*/
public interface MetricsProgrammaticSettings {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a follow up for naming here - we use Config as the suffix for configuring components in other modules, I think it should be used for this class as well

* @throws java.lang.IllegalArgumentException if the implementation cannot handle the requested media type
* @throws java.lang.UnsupportedOperationException if the implementation cannot expose its metrics
*/
Optional<?> scrape(MediaType mediaType, Iterable<String> scopeSelection, Iterable<String> meterNameSelection);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw Optional<Object> would work the same and it is less unusual on the API

*
* @param <T> type of the object providing the value
*/
class HelidonFunctionalCounter<T> extends MetricImpl implements Counter {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do I understand it correctly that this is basically a Gauge? Why do we still have it? Or is it represented differently by the format?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically, this allows a Counter-style "wrapper" around an externally-updated value instead of a Gauge-style wrapper.

Earlier releases of MP metrics allowed developers to create their own implementations of metrics and register them. Helidon used this to register custom counters, for example.

MP Metrics 5.0 no longer permits this, but the ability to have a Counter-like wrapper is useful (and present in Micrometer and OTel metrics) so will likely be in the Helidon neutral API as well.


private HelidonGauge(String registryType, Metadata metadata, Gauge<T> metric) {
super(registryType, metadata);
private final io.micrometer.core.instrument.Gauge delegate;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we should have support for Micrometer and possibly other metrics implementation, but not hardcoded into our impl, but through provider modules.
Why is there a hard dependency on Micrometer?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See earlier note about the neutral API and timing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not realize from the description that we must use Micrometer to be able to generate prometheus output (as we used to generate it ourself).
I have approved the PR, we need to fix this eventually when we switch to Helidon Metrics API

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that our neutral API will layer on one or more actual implementations (Micrometer, OTel metrics), we should delegate to the underlying implementation to publish, whether Prometheus or other (push or pull) models, rather than duplicate that functionality in our own code.

Copy link
Member

@tomas-langer tomas-langer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are things to follow up on, let's merge so we have MP 6 complete

@tjquinno tjquinno merged commit 9327495 into helidon-io:main Jul 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dependencies Pull requests that update a dependency file metrics MP OCA Verified All contributors have signed the Oracle Contributor Agreement.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4.x: Upgrade MP Metrics to 5.0
6 participants