
Implement weaver registry infer command #1138

Open
ArthurSens wants to merge 15 commits into open-telemetry:main from ArthurSens:weaver-registry-infer

Conversation

@ArthurSens
Member

TLDR

Implements weaver registry infer command that generates a semantic convention registry YAML file by inferring the schema from incoming OTLP telemetry data.

Description

This PR adds a new weaver registry infer subcommand that starts a gRPC server to receive OTLP messages (traces, metrics, logs) and automatically infers a semantic convention schema from the observed telemetry. The command processes incoming data, deduplicates attributes across signals, and collects up to 5 unique example values per attribute to help document the inferred schema.
The inferred schema is written to a single registry.yaml file in the specified output directory (default: ./inferred-registry/). The output follows the standard semantic convention format with separate groups for resources, spans, metrics, and events. Resource attributes are currently accumulated into a single resource group; entity-based grouping (via OTLP EntityRef) is not yet supported but documented for future implementation.

Testing

Tested by using weaver registry emit to send OTLP telemetry to the infer command's gRPC endpoint. The generated registry.yaml file was verified to contain the expected groups (resources, spans, metrics, events) with properly inferred attribute types and example values.
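
For illustration, the generated registry.yaml could look roughly like this. The ids, briefs, and example values below are hypothetical; only the overall group layout follows the standard semantic convention format described above:

```yaml
groups:
  - id: resource.inferred_resource   # hypothetical id
    type: resource
    brief: Attributes observed on incoming resources
    attributes:
      - id: service.name
        type: string
        brief: ""
        examples: ["checkout-service"]
  - id: metric.http_server_request_count   # hypothetical id
    type: metric
    metric_name: http.server.request.count
    instrument: counter
    unit: "1"
    brief: Inferred from observed telemetry
```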

@github-advanced-security bot left a comment

clippy found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

@ArthurSens
Member Author

ArthurSens commented Jan 14, 2026

Opening as a draft first, manually tested and it seemed to work :)

Some questions I have:

  1. Should we build v2 schemas instead of v1?
  2. I've created an object called YamlGroup to serialize the YAML file because I couldn't find another object that already does this. Don't we have something like that already? Could we re-use the objects that deserialize YAML to also do the serialization somehow?
  3. Is the code organized correctly? I'm still struggling to understand when code should go in a separate crate and when it should be in the CLI module.
  4. Do we want to implement entity inference already? I'm not sure how stable Entities are.

@codecov

codecov bot commented Jan 14, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 80.0%. Comparing base (dfe4670) to head (03d0ad0).

Additional details and impacted files
@@          Coverage Diff          @@
##            main   #1138   +/-   ##
=====================================
  Coverage   80.0%   80.0%           
=====================================
  Files        109     109           
  Lines       8528    8528           
=====================================
  Hits        6823    6823           
  Misses      1705    1705           

☔ View full report in Codecov by Sentry.

let entry = self
    .spans
    .entry(span.name.clone())
    .or_insert_with(|| AccumulatedSpan::new(span.name.clone(), span.kind.clone()));
Member Author

Some extra thoughts here: When the same span name is received multiple times with different kind values, only the first kind is preserved.

Not sure what to do to be honest, should I use more than just span/metric/event/resource name as the identifier? Maybe use all the fields that identify a particular telemetry type?

Contributor

There currently is no identifier for a span so anything done here is guessing. Live-check cannot do span comparisons at the moment because of this too. It just looks at the attributes within.
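
One of the options floated above is to widen the map key to include more identifying fields. A minimal sketch, with hypothetical simplified stand-ins for the PR's AccumulatedSpan/SampleSpan types, keying on (name, kind) so spans that share a name but differ in kind become separate entries instead of the first kind silently winning:

```rust
use std::collections::HashMap;

// Hypothetical simplified stand-in for the PR's AccumulatedSpan.
#[derive(Debug, PartialEq)]
struct AccumulatedSpan {
    name: String,
    kind: String,
}

// Key the accumulator on (name, kind) rather than name alone.
fn accumulate(spans: &mut HashMap<(String, String), AccumulatedSpan>, name: &str, kind: &str) {
    let _ = spans
        .entry((name.to_owned(), kind.to_owned()))
        .or_insert_with(|| AccumulatedSpan {
            name: name.to_owned(),
            kind: kind.to_owned(),
        });
}
```

This trades one inferred group per span name for one per (name, kind) pair, which may or may not be what the registry should express.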

}

fn add_metric(&mut self, metric: SampleMetric) {
    let instrument = match &metric.instrument {


What does it mean? Sorry, maybe I'm not familiar with the concept of metric.instrument. Is this the metric type?

Contributor

Kind of, it's gauge, updowncounter, histogram...

Member Author

Being a bit more specific for the discussion we had in our call @nicolastakashi, summaries are indeed deprecated in the OTLP proto, and Weaver is very explicit about not supporting it:

Some(Data::Summary(_)) => SampleInstrument::Unsupported("Summary".to_owned()),

attributes,
});

// Span events as separate event groups
Contributor
@jerbly jerbly Jan 17, 2026

Interesting... are span_events and log_events the same thing in semconv? In live-check I just check the attributes for span_events, whereas logs with event_name are checked against event definitions. @jsuereth / @lmolkova ?

Member Author

Just some comments from the SIG meeting today:

both span and log events can be mapped to events. Eventually, span events will be deprecated and we can remove this functionality from infer and add a warning in live check if they are present

@jerbly
Contributor

jerbly commented Jan 17, 2026

Opening as a draft first, manually tested and it seemed to work :)

Some questions I have:

  1. Should we build v2 schemas instead of v1?
  2. I've created an object called YamlGroup to serialize the YAML file because I couldn't find another object that already does this. Don't we have something like that already? Could we re-use the objects that deserialize YAML to also do the serialization somehow?
  3. Is the code organized correctly? I'm still struggling to understand when code should go in a separate crate and when it should be in the CLI module.
  4. Do we want to implement entity inference already? I'm not sure how stable Entities are.

My answers:

  1. IMO, we should make v2.
  2. We should use the weaver_semconv crate to build the structure and then Serialize that to YAML.
  3. As it stands it's OK. See below for further thoughts that might change this...
  4. I'm not sure either; this has also not been done in live-check. @jsuereth to comment.

Overall I'm wondering what the intent of this command is. What you have made takes samples and aggregates them into entirely new definitions, I guess to use as a starting point model?

What I had in mind would have been more embedded in live-check, maybe a --infer option to live-check. You would then be comparing samples with an existing model, the otel semconv model by default. The inference would then be to create a new model that depends on and extends the model you're comparing with. This would make a definition with imports, refs and extends.

Also, live-check would be highlighting any items which would be troublesome to make an inference for: e.g. an attribute named MyAttr would fail policy checks around naming conventions (should be some_namespace.my_attr for example).

@ArthurSens
Member Author

Overall I'm wondering what the intent of this command is. What you have made takes samples and aggregates them into entirely new definitions, I guess to use as a starting point model?

Exactly. While giving talks about Weaver last year, a very common question was: "I have thousands of metrics already, I don't want to manually rewrite what I have into a schema. Is there anything to make this easier?". That's the problem I'm trying to solve here. As long as there's an appropriate receiver in the collector, you can send data in any format, translate to OTLP, send it to Weaver Infer and you'll have your OTel Schema available. It's up to you to do further modifications to the schema as needed. With a schema available, code generation could build dashboards, could generate instrumentation code that helps migrate from one SDK to another, etc etc.

To be honest, I'm even envisioning a combined functionality of weaver serve+infer, where inferred schemas could be modified through the UI before the user "commits" them to the registry.

The inference would then be to create a new model that depends on and extends the model you're comparing with. This would make a definition with imports, refs and extends.

Interesting! This hasn't crossed my mind at all before. Could you elaborate a bit on the use case for this? What are the problems you wanted to solve?

@jerbly
Contributor

jerbly commented Jan 20, 2026

Interesting! This hasn't crossed my mind at all before. Could you elaborate a bit on the use case for this? What are the problems you wanted to solve?

If you run live-check today with an empty registry it will produce an output with every sample and, where possible, it will tell you every attribute and signal is missing in the live_check_result for that sample. You could imagine taking the json report from this live-check and producing an inferred registry like you've done with your code.

Now extend this concept. Rather than starting with an empty registry, start with the OTel semconv registry. The output report can now be interpreted to infer either modifications to the registry, or extensions to it in a child registry.

At my company we have a company-registry which is dependent on the OTel registry. We often find attributes and signals we want to express that fit in the OTel namespaces for example aws. Let's say my application emits aws.s3.bucket and aws.new.attr. I don't want to define aws.s3.bucket again since it's already in the OTel registry, I just want to modify my company registry to add aws.new.attr.

As another example, you produced a registry in your PR: prometheus/prometheus#17868 - moving forward, you could run the live-check inference again with this registry loaded and infer modifications to it alongside live-check telling you what's missing or invalid.

@ArthurSens
Member Author

If you run live-check today with an empty registry it will produce an output with every sample and, where possible, it will tell you every attribute and signal is missing in the live_check_result for that sample. You could imagine taking the json report from this live-check and producing an inferred registry like you've done with your code.

So with your idea, if we add a --infer flag to live-check, instead of a json output we would get the YAML file as done in this PR so far?

I can work with that :)

Now extend this concept. Rather than starting with an empty registry, start with the OTel semconv registry. The output report can now be interpreted to infer either modifications to the registry, or extensions to it in a child registry.

Hmmm, I think I understand some parts but others I'm still feeling a bit lost.

  • The output report could be interpreted as extensions in a child registry: We can infer this information if the OTLP message includes Samples that were not present before, is that correct?
  • The output report could be interpreted as modifications to the registry: This is the part where I'm not understanding how we could tell. If our registry has a sample called metric.X, and the OTLP message doesn't include this Sample but includes metric.Y, how do I know the difference between a Sample that was renamed and a Sample that was removed completely while a new, unrelated one was added?

@jerbly
Copy link
Contributor

jerbly commented Jan 21, 2026

So with your idea, if we add a --infer flag to live-check, instead of a json output we would get the YAML file as done in this PR so far?

I can work with that :)

No, I'm doing a bad job trying to explain this I think.

  • The output report could be interpreted as extensions in a child registry: We can infer this information if the OTLP message includes Samples that were not present before, is that correct?

I'm thinking the command could be: weaver registry live-check -r https://github.com/open-telemetry/semantic-conventions/archive/refs/tags/v1.38.0.zip[model] --infer new - this would collect samples and compare them with the otel registry. Let's say one of the samples is for metric.X with attributes: server.address and server.port. metric.X is not found in the otel registry but server.address and server.port are. The inferred output would be a new registry defining metric.X with references to server.address and server.port. Since we ran --infer new, weaver would also create a registry_manifest.yaml declaring the dependency on https://github.com/open-telemetry/semantic-conventions/archive/refs/tags/v1.38.0.zip[model].
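
To make this concrete, the inferred child registry for the metric.X example might contain something like the following. This is a hypothetical sketch: the ids, instrument, and unit are assumed, and only the attribute `ref` mechanism (resolving against the OTel registry dependency) is the point being illustrated:

```yaml
groups:
  - id: metric.metric_x            # hypothetical id
    type: metric
    metric_name: metric.X
    instrument: counter            # assumed; would be inferred from the samples
    unit: "1"
    brief: Inferred from observed telemetry
    attributes:
      - ref: server.address        # resolved against the OTel registry
      - ref: server.port
```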

  • The output report could be interpreted as modifications to the registry: This is the part where I'm not understaning how we could tell. If our registry has a sample called metric.X, the OTLP message doesn't include this Sample but includes metric.Y... How do I know the difference between a Sample that was renamed or a Sample that was removed completely and a new unrelated one was added?

This use case could be a later phase.
In this case, the command could be: weaver registry live-check -r my_model_dir --infer modify - in this case, new registry files are created but suffixed with _inferred. Those registry files are a copy of the original with modifications made to them with any changes inferred from the live-check result. For example, let's say we used the registry generated in the example above. We receive a sample of metric.X with the server attributes but now also the attribute error.type. The registry is modified to add this attribute to the metric. This would retain any non-inferrable fields in the original registry e.g. brief, note, annotations.

I think we would need options to determine if weaver should add or overwrite when it finds differences. And, if you want weaver to remove definitions if they were not received in the samples.

--infer modify is quite a bit more complicated and I'm not sure it's worth it. But --infer new, where we're making a dependent child registry I think is important and inline with our multi-registry philosophy.

@ArthurSens ArthurSens marked this pull request as ready for review January 21, 2026 21:08
@ArthurSens ArthurSens requested a review from a team as a code owner January 21, 2026 21:08
@ArthurSens
Member Author

ArthurSens commented Jan 21, 2026

Ok, I think I've addressed all comments that are addressable, given what we discussed in the SIG meeting today.

I'm intentionally leaving some things undone to keep the scope of the PR small and easier to review:

  • I'm generating v1 schemas instead of v2 -- Not sure if the plan is to allow both, as generate does, or if I should replace v1 with v2 entirely in the future.
  • Functionality to compare the incoming OTLP messages with already existing registries, so inferred schemas use extends and/or imports directives instead of duplicating an entire semantic convention.

But please let me know if any of the above should be worked on in this PR, and if there's anything else you'd like to see here.

Comment on lines 193 to 199
let attr_entry = entry
    .attributes
    .entry(attr.name.clone())
    .or_insert_with(|| {
        AccumulatedAttribute::new(attr.name.clone(), attr.r#type.clone())
    });
attr_entry.add_example(&attr.value);
Contributor

Let's say I have messy real world telemetry. I may receive attr=42 and then in another sample attr="hello". This would make an attr of type int with examples [42,"hello"].

How should mismatched types be handled?

Member Author

Good catch, I didn't think of this. I'm updating the PR to ignore attribute values that differ from the original value received.

This will probably create some race conditions, but not sure what else could be done here 🤔
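
The "first type received wins" strategy described here can be sketched as follows, with a hypothetical simplified AttrValue enum standing in for the real attribute value types; later values whose type differs from the first observed one are ignored, and at most 5 examples are kept:

```rust
// Hypothetical simplified stand-in for an attribute value.
#[derive(Debug, Clone, PartialEq)]
enum AttrValue {
    Int(i64),
    Str(String),
    Bool(bool),
}

struct AccumulatedAttribute {
    examples: Vec<AttrValue>,
}

impl AccumulatedAttribute {
    fn add_example(&mut self, v: &AttrValue) {
        // The first value's type wins; mismatched later values are dropped.
        let same_type = match (self.examples.first(), v) {
            (None, _) => true,
            (Some(AttrValue::Int(_)), AttrValue::Int(_)) => true,
            (Some(AttrValue::Str(_)), AttrValue::Str(_)) => true,
            (Some(AttrValue::Bool(_)), AttrValue::Bool(_)) => true,
            _ => false,
        };
        if same_type && self.examples.len() < 5 {
            self.examples.push(v.clone());
        }
    }
}
```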

#[derive(Debug, Clone)]
struct AccumulatedAttribute {
    name: String,
    attr_type: Option<PrimitiveOrArrayTypeSpec>,
Contributor

Does this need to be optional? Attributes must have a type.

#[derive(Debug, Clone)]
struct AccumulatedMetric {
    name: String,
    instrument: Option<InstrumentSpec>,
Contributor

Should this be optional?

fn add_metric(&mut self, metric: SampleMetric) {
    let instrument = match &metric.instrument {
        SampleInstrument::Supported(i) => Some(i.clone()),
        SampleInstrument::Unsupported(_) => None,
Contributor

If we receive an unsupported instrument, we can't infer a semconv for it by definition. So we should reject the sample as not inferable.

Member Author

Ok gotcha, that makes sense. That allows removing the Option from instrument as well then

Comment on lines 455 to 458
let attr_type = attr
.attr_type
.clone()
.unwrap_or(PrimitiveOrArrayTypeSpec::String);
Contributor

As mentioned in an earlier comment. IMO we should not need to have an Optional type and therefore this goes away. I'm guessing this optionality comes from Sample* where we allow missing type or value. This makes sense for live-check, we can just compare an attribute name on its own for validity. It doesn't make sense for infer, this data is mandatory to make a semconv definition.

I would recommend removing the Options where data is mandatory and rejecting samples with None types rather than carrying the Option all through the code.

Comment on lines 476 to 479
/// Convert a vector of JSON values to the appropriate Examples type.
///
/// Uses serde to automatically match the JSON values to the correct Examples variant.
fn json_values_to_examples(values: &[Value]) -> Option<Examples> {
Contributor
@jerbly jerbly Jan 30, 2026

I mentioned this above too. All examples must be the same type and must match the type of the attribute. If you've determined int they must all be int. (Of course one strategy to handle mismatching types is to change the type to Any and then you can have Any examples).

You could build the Examples as they arrive in the samples rather than this two stage process. Perhaps a function add_example(&value, &examples) -> Result<Examples, Error> where it will make a new Examples given the current examples and value (perhaps changing a single example to an array).
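
The single-pass idea sketched, under the assumption of a much-reduced Examples stand-in (the real weaver Examples enum has more variants): a function that takes the current accumulated examples plus a new value and returns the updated Examples, promoting a single example to the array form on the second value:

```rust
// Hypothetical reduced stand-in: only the Int/Ints variants are modeled.
#[derive(Debug, Clone, PartialEq)]
enum Examples {
    Int(i64),
    Ints(Vec<i64>),
}

// Fold one incoming value into the accumulated examples.
fn add_example(value: i64, examples: Option<Examples>) -> Examples {
    match examples {
        None => Examples::Int(value),
        // Second value observed: promote the single example to an array.
        Some(Examples::Int(first)) => Examples::Ints(vec![first, value]),
        Some(Examples::Ints(mut v)) => {
            v.push(value);
            Examples::Ints(v)
        }
    }
}
```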

}

fn sanitize_id(name: &str) -> String {
    name.replace(['/', ' ', '-', '.'], "_")
Contributor

Do we want to convert . to _? The namespace separator in OTel is .

Member Author

whoops, force of habit 😅


fn sanitize_id(name: &str) -> String {
    name.replace(['/', ' ', '-', '.'], "_")
        .to_lowercase()
Contributor
@jerbly jerbly Jan 30, 2026

Perhaps there's a way to use the convert_case crate to snake_case first and then deal with invalid chars. This would convert HelloWorld to hello_world too which is preferable to helloworld.
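
A std-only sketch of this suggestion (without pulling in the convert_case crate): insert an underscore at lower-to-upper boundaries before lowercasing, so "HelloWorld" becomes "hello_world" rather than "helloworld". This is an illustration, not the PR's implementation:

```rust
fn sanitize_id(name: &str) -> String {
    let mut out = String::with_capacity(name.len());
    let mut prev_lower = false;
    for c in name.chars() {
        if matches!(c, '/' | ' ' | '-' | '.') {
            // Separator characters become underscores.
            out.push('_');
            prev_lower = false;
        } else if c.is_uppercase() {
            // CamelCase boundary: lower followed by upper gets an underscore.
            if prev_lower {
                out.push('_');
            }
            out.extend(c.to_lowercase());
            prev_lower = false;
        } else {
            out.push(c);
            prev_lower = c.is_lowercase() || c.is_ascii_digit();
        }
    }
    out
}
```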


/// Accumulated attribute with examples
#[derive(Debug, Clone)]
struct AccumulatedAttribute {
Contributor

Could we accumulate right into the semconv structs and avoid these additional structs and the two stage process?

Contributor
@jerbly jerbly left a comment

I have made a few comments:

  • some are to tidy the code which you can treat as nits
  • handling type mismatch and missing essential data I think needs to be addressed
  • optimizing with a single pass to accumulate and translate could be fun, not essential

I think, if we're not supporting v2 in this PR that's ok (it's marked experimental) but we should quickly move on to that in a follow-up. I'd also suggest, in the next PR, we move the main conversion code out to either one of the existing crates or a new one.

FYI. I've been asked for this infer tool a few times now so it's great to see it coming together. Thanks!

@ArthurSens
Member Author

  • optimizing with a single pass to accumulate and translate could be fun, not essential

I think I made it work for attributes at least, but I'm struggling a bit to make it work for metrics, spans and events. The hashmap is useful for quick lookups, and I'm not sure how to do the deduplication without the hashmaps 😬

I think, if we're not supporting v2 in this PR that's ok (it's marked experimental) but we should quickly move on to that in a follow-up. I'd also suggest, in the next PR, we move the main conversion code out to either one of the existing crates or a new one.

Happy to tackle both!

Signed-off-by: Arthur Silva Sens <arthursens2005@gmail.com>

Serde should be able to handle the YAML serialization
@ArthurSens ArthurSens force-pushed the weaver-registry-infer branch from 9823e80 to 94161ee Compare February 4, 2026 19:13
Contributor
@jerbly jerbly left a comment

Overall: Looks good for a first pass.

Perhaps when adding the v2 support there can be some refactoring to make this more idiomatic Rust. The conversion logic between Sample* types and Accumulated*/AttributeSpec types could use Rust's conversion traits:

  • From<&SampleAttribute> for AttributeSpec - Replace attribute_spec_from_sample() with a From impl
  • From<&AccumulatedSpan> for GroupSpec (and similar for Metric/Event) - Replace the inline conversion in to_semconv_spec()

Maybe add an Accumulate trait - Something like:

trait Accumulate {
    fn accumulate(&self, acc: &mut AccumulatedSamples);
}

Implement for SampleResource, SampleSpan, SampleMetric, etc. This would let add_sample become simply sample.accumulate(self).

But, this is a great addition to weaver, let's get the first iteration in.
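
A minimal illustration of the From-impl suggestion, using hypothetical simplified stand-ins for weaver's SampleAttribute and AttributeSpec (the real types carry many more fields):

```rust
// Hypothetical simplified stand-ins for the weaver types.
struct SampleAttribute {
    name: String,
    value: String,
}

struct AttributeSpec {
    id: String,
    examples: Vec<String>,
}

// Conversion trait in place of a free function like attribute_spec_from_sample().
impl From<&SampleAttribute> for AttributeSpec {
    fn from(s: &SampleAttribute) -> Self {
        AttributeSpec {
            id: s.name.clone(),
            examples: vec![s.value.clone()],
        }
    }
}
```

With this in place, call sites can use `let spec: AttributeSpec = (&sample).into();` rather than a named conversion helper.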

Comment on lines +523 to +534
if let Some(resource) = resource_log.resource {
    let mut sample_resource = SampleResource {
        attributes: Vec::new(),
        live_check_result: None,
    };
    for attribute in resource.attributes {
        sample_resource
            .attributes
            .push(sample_attribute_from_key_value(&attribute));
    }
    accumulator.add_sample(Sample::Resource(sample_resource));
}
Contributor

nit: this resource accumulation block is repeated for each signal. Maybe we can be more DRY here?
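
One way to factor the repeated block, sketched with hypothetical simplified stand-ins for the weaver types: a single helper that builds the SampleResource from any signal's resource attributes, so each signal handler calls it instead of repeating the loop:

```rust
// Hypothetical simplified stand-ins for the weaver types.
#[derive(Debug, PartialEq)]
struct SampleAttribute {
    name: String,
}

#[derive(Debug, PartialEq)]
struct SampleResource {
    attributes: Vec<SampleAttribute>,
}

// Shared helper: each signal (traces, metrics, logs) feeds its resource
// attributes through here instead of duplicating the accumulation block.
fn sample_resource_from_attrs<'a, I>(attrs: I) -> SampleResource
where
    I: IntoIterator<Item = &'a str>,
{
    SampleResource {
        attributes: attrs
            .into_iter()
            .map(|name| SampleAttribute { name: name.to_owned() })
            .collect(),
    }
}
```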
