Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Messaging-System Metric semantic conventions #1077

Conversation

cwildman
Copy link

@cwildman cwildman commented Oct 8, 2020

Changes

Add semantic conventions for messaging system metrics.

Related issues:
Fixes #1014

PR Details

The general approach I took here was to take the messaging operations referenced in the tracing semantic conventions (sending, receiving, processing) and break them into 3 distinct metric groups. I created metrics for monitoring throughput and latency within these operations.

I've left individual commits to highlight some open questions/points of interest that I will elaborate on here.

Metric naming with Span Kind

@justinfoote mentioned that the preference is that when name spacing labels and metrics the second field should be the "span kind". The second commit attempts to follow this convention. However the span kind as defined in the tracing semantic conventions of messaging suggest producer/consumer should be used for async message processing, and server/client should be used for synchronous message processing. Should we change the metric names based on whether the application is sync or async? I assumed no. Also there's a bit of ambiguity here because the tracing semantic conventions yaml refers to things like messaging.consumer.synchronous instead of messaging.client.

Superfluous Counters vs. ease of use

I created a specific Counter to track the messages sent, received and processed. This information is already captured in the
count of the duration metric ValueRecorder. Personally I feel like the extra metric is worth the ease of use but I don't know where the OTEL community stands on this.

Semconv format and generator

I went ahead and attempted to use the semconv yaml format and generator. Many of these labels are identical to tracing messaging attributes but I went ahead and copied them into a separate yaml for messaging metrics. I could not get the generator to actually work though. Here's an example: docker run --rm otel/semconvgen --yaml-root ../../opentelemetry-specification/semantic_conventions markdown --markdown-root ../../opentelemetry-specification/specification

Batch metrics

The last commit adds metrics around batching because I see that as a common feature that needs to be monitored. Introducing metrics around batching does create confusion around whether some of the non-batch metrics still exist or what they mean. I'm interested to hear if others feel the behavior is sufficiently explained or if batch metrics should be dropped entirely.

Other metrics

Here's a list of other potential metric categories that I omitted. I'm assuming these are too specific to only a subset of messaging systems to be included here:

  1. Lag
  2. I/O time
  3. Connection pooling
  4. Errors
  5. Retries

@jmacd @bogdandrutu mentioning you as potential reviewers.

@cwildman cwildman requested review from a team October 8, 2020 22:48
@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Oct 8, 2020

CLA Check
The committers are authorized under a signed CLA.

@arminru
Copy link
Member

arminru commented Oct 9, 2020

Please add an entry in CHANGELOG.md

specification/metrics/semantic_conventions/messaging.md Outdated Show resolved Hide resolved
semantic_conventions/metrics/messaging.yaml Outdated Show resolved Hide resolved
brief: 'Connection string.'
examples: ['tibjmsnaming://localhost:7222', 'https://queue.amazonaws.com/80398EXAMPLE/MyQueue']
constraints:
- any_of:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the work on #1027, the net.* names should be removed as constraints and added as references like for tracing

specification/metrics/semantic_conventions/messaging.md Outdated Show resolved Hide resolved

| Name | Instrument | Units | Description |
|----------------------|---------------|--------------|-------------|
| `messaging.consumer.processed.messages` | Counter | messages | Sum of messages processed. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The metric name might need altering, or messaging.consumer.messages needs altering to include "receive" in the name.

Otherwise, this metric could be implied to be a child of the other metric by virtue of the naming.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense, I'll think about how to reword these. My instinct was to have the metric name be messaging.processed.messages or messaging.processor.messages, however following the span kind naming convention it says that the processor should use "consumer" as its span kind.

@cwildman cwildman force-pushed the metric-semantic-conventions-messaging branch 6 times, most recently from 9e89bdf to 2cd7b4c Compare October 13, 2020 04:59
@cwildman
Copy link
Author

I was able to resolve some of my issues with the semconv generator. There were a few things going on:

  1. My /Users bind mount in docker for mac doesn't work even though it is automatically included.
  2. The semconv generator doesn't allow for multiple instances of the same id, e.g. messaging. I just used a different id for the metrics messaging labels named metrics-messaging.
  3. I used a reference to point to all the labels that would have the same name as an existing tracing attribute.

While the generator works correctly for me it seems I'm failing the md-check with File /spec/metrics/semantic_conventions/messaging.md contains a table that would be reformatted. even though I'm using the same generator to create the table. Anyone have suggestions for that?

@cwildman cwildman force-pushed the metric-semantic-conventions-messaging branch from b8c94dc to 8c0d5b2 Compare October 15, 2020 17:44
@jmacd
Copy link
Contributor

jmacd commented Oct 15, 2020

Discussed in the Metrics SIG today: we think it would be best to eliminate the superfluous counters. This may be problematic in the short term, but in the long term we can address these with a Views API, we think.

@cwildman
Copy link
Author

@jmacd thanks for the heads up. Removed those.

@cwildman
Copy link
Author

@open-telemetry/specs-metrics-approvers this is ready for review!

1 similar comment
@justinfoote
Copy link
Member

@open-telemetry/specs-metrics-approvers this is ready for review!

@cwildman cwildman force-pushed the metric-semantic-conventions-messaging branch from ca01fa0 to cf9ba47 Compare October 23, 2020 00:28
@cwildman cwildman changed the title Metric semantic conventions messaging Metric semantic conventions messaging fixes #1014 Oct 23, 2020
@cwildman cwildman force-pushed the metric-semantic-conventions-messaging branch from cf9ba47 to 4988c5e Compare October 23, 2020 00:33
@jmacd jmacd changed the title Metric semantic conventions messaging fixes #1014 Messaging-System Metric semantic conventions Oct 23, 2020
Copy link
Contributor

@MrAlias MrAlias left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The structure, labels, and instrument definitions look good. Just issues with some of the terminology and formatting. Good job 👍

specification/metrics/semantic_conventions/messaging.md Outdated Show resolved Hide resolved
specification/metrics/semantic_conventions/messaging.md Outdated Show resolved Hide resolved
specification/metrics/semantic_conventions/messaging.md Outdated Show resolved Hide resolved
specification/metrics/semantic_conventions/messaging.md Outdated Show resolved Hide resolved
specification/metrics/semantic_conventions/messaging.md Outdated Show resolved Hide resolved
specification/metrics/semantic_conventions/messaging.md Outdated Show resolved Hide resolved
specification/metrics/semantic_conventions/messaging.md Outdated Show resolved Hide resolved
specification/metrics/semantic_conventions/messaging.md Outdated Show resolved Hide resolved
@andrewhsu
Copy link
Member

at the metrics sig mtg today talked about this issue's staleness

@github-actions github-actions bot removed the Stale label Nov 6, 2020
@cwildman
Copy link
Author

cwildman commented Nov 6, 2020

@MrAlias I think I addressed almost all of your feedback. I had questions/comments on a couple of them. Thanks!

@cwildman cwildman force-pushed the metric-semantic-conventions-messaging branch from aa5f02c to 8683a2a Compare November 10, 2020 19:25
Copy link
Contributor

@jmacd jmacd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I re-read this PR and wonder if the recently-added guidelines on metrics named .io should be applied here. Instead of having two metrics <something>.bytes for each direction of travel, we can use <something>.io with a label to indicate whether it was produced or consumed.

@cwildman
Copy link
Author

@jmacd interesting. To be a little more concrete you're imagining renaming these metrics:

messaging.producer.bytes -> messaging.io
messaging.consumer.received.bytes -> messaging.io

And then we'd have some labels like direction/operation which can be one of ['produce', 'consume']?

Some of my initial reactions to this:

  1. What happens to the other metrics like messaging.producer.duration? It seems confusing as a user that there's not a consistent naming scheme for my producer metrics.
  2. I'm surprised the semantic conventions recommend using .io for any bidirectional data flow. For me I/O is specific to data flowing in or out of process from external devices. This is in contrast to data flow in process that is abstracted from the raw I/O that may have been involved or may not have come directly from I/O at all.
  3. It seems like the .io convention should have some guidance on what the direction label should be called.

@aabmass
Copy link
Member

aabmass commented Nov 13, 2020

@cwildman maybe this could be helpful to decide:

As a rule of thumb, aggregations over all the dimensions of a given metric SHOULD be meaningful," as Prometheus recommends.

Would a user ever want to aggregate together the produced/consumed bytes into one timeseries (or would it be meaningful)? If not, then I suppose they should be separate.

It seems like the .io convention should have some guidance on what the direction label should be called.
That should probably be specified. For system metrics, it's label direction with value read | write

@github-actions
Copy link

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Nov 21, 2020
@github-actions
Copy link

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Nov 28, 2020
@jgals
Copy link

jgals commented Dec 4, 2020

Hey @cwildman and @jmacd, I'm commenting here to avoid having this closed by the bot. 🤖

@cwildman
Copy link
Author

cwildman commented May 3, 2021

I'd like to reopen this PR and hopefully discuss at the spec meeting this week.

@sergeykonkin
Copy link

@cwildman Hi!
Any chances to revive this PR and finish a work on this? We are looking forward on this specification

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Define metric semantic conventions for messaging systems