
Service Telemetry Prometheus metrics server does not properly report error #11417

Closed
@omzmarlon

Description


Describe the bug
While using the Prometheus server for service.telemetry.metrics, if the Prometheus server cannot start (e.g. because the port it uses is already occupied), the OTel framework seems to swallow that error and does not properly report it, so the user/caller cannot do anything about it.

Steps to reproduce
It's pretty easy to reproduce with a very simple setup.

I am using the tag v0.111.0 (latest available at the time of writing this issue)

First I built an otelcol binary using the OTel Collector Builder tool. For the builder config I used the example config in the repo here, with some small adjustments:

dist:
  name: otelcol-custom
  description: Local OpenTelemetry Collector binary
  output_path: /tmp/dist
exporters:
  - gomod: go.opentelemetry.io/collector/exporter/nopexporter latest

receivers:
  - gomod: go.opentelemetry.io/collector/receiver/nopreceiver latest
  
replaces:
  - go.opentelemetry.io/collector => <path to my local cloned latest of this repo>/open-telemetry/opentelemetry-collector/
  # ...
  - go.opentelemetry.io/collector/service => <path to my local cloned latest of this repo>/opentelemetry-collector/service

The config used to run the otelcol binary is this:

receivers:
  nop:
exporters:
  nop:
service:
  pipelines:
    logs:
      receivers:
        - nop
      exporters:
        - nop
  telemetry:
    metrics:
      level: Basic
      address: 8889

Now if I run 2 instances of the otelcol binary using this same config (i.e. both will try to bind to port 8889 for metrics), the logs will be exactly the same:

2024-10-11T14:59:21.948-0700    info    service@v0.111.0/service.go:135 Setting up own telemetry...
2024-10-11T14:59:21.948-0700    info    telemetry/metrics.go:73 Serving metrics {"address": "localhost:8888", "metrics level": "Basic"}
2024-10-11T14:59:21.949-0700    info    service@v0.111.0/service.go:207 Starting otelcol-custom...      {"Version": "1.0.0", "NumCPU": 16}
2024-10-11T14:59:21.949-0700    info    extensions/extensions.go:39     Starting extensions...
2024-10-11T14:59:21.949-0700    info    service@v0.111.0/service.go:230 Everything is ready. Begin running and processing data.

Neither instance's logs say anything about port 8889 already being in use.
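For reference, Go's http.Server.ListenAndServe does return the bind failure immediately, so the error is available to be logged or propagated. A minimal standalone sketch (not collector code; port 8889 is used only to mirror the config above):

package main

import (
	"fmt"
	"net"
	"net/http"
)

func main() {
	// Occupy the port first, the same way the first collector instance does.
	ln, err := net.Listen("tcp", "127.0.0.1:8889")
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	// A second server binding the same port gets the error straight back
	// from ListenAndServe, so there is something concrete to report.
	srv := &http.Server{Addr: "127.0.0.1:8889"}
	if err := srv.ListenAndServe(); err != nil {
		fmt.Println(err) // listen tcp 127.0.0.1:8889: bind: address already in use
	}
}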

Since I was using replaces in the otelcol builder config, I was able to manually insert a fmt.Printf into the OTel source code to print the error.

I basically changed this line to this:

		if serveErr := server.ListenAndServe(); serveErr != nil {
			fmt.Printf("%s\n", serveErr) // manually added debug print
			if !errors.Is(serveErr, http.ErrServerClosed) {
				select {
				case asyncErrorChannel <- serveErr:
				case <-time.After(1 * time.Second):
				}
			}
		}

Now if I rebuild the otelcol binary and run 2 instances of it again, I do see the address already in use error from 1 of the 2 instances:

2024-10-11T14:59:31.857-0700    info    service@v0.111.0/service.go:135 Setting up own telemetry...
2024-10-11T14:59:31.857-0700    info    telemetry/metrics.go:73 Serving metrics {"address": "localhost:8888", "metrics level": "Basic"}
2024-10-11T14:59:31.858-0700    info    service@v0.111.0/service.go:207 Starting otelcol-custom...      {"Version": "1.0.0", "NumCPU": 16}
2024-10-11T14:59:31.858-0700    info    extensions/extensions.go:39     Starting extensions...
2024-10-11T14:59:31.858-0700    info    service@v0.111.0/service.go:230 Everything is ready. Begin running and processing data.
listen tcp 127.0.0.1:8889: bind: address already in use

I think the cause of this issue is this:

When OTel tries to report the metrics server error to the asyncErrorChannel, the channel is actually not set and is nil. I also added print statements and verified that the asyncErrorChannel param passed in here is indeed nil. Since a send on a nil channel blocks forever, the case asyncErrorChannel <- serveErr: branch is never executed and the select always falls through to the timeout.
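To illustrate why the error disappears (standalone sketch, not collector code): sending on a nil channel blocks forever, so a select like the one above can only ever take the time.After branch, and the error is dropped without a trace.

package main

import (
	"errors"
	"fmt"
	"time"
)

func main() {
	var asyncErrorChannel chan error // nil, as in the current wiring
	serveErr := errors.New("listen tcp 127.0.0.1:8889: bind: address already in use")

	select {
	case asyncErrorChannel <- serveErr:
		// Never taken: a send on a nil channel blocks forever.
		fmt.Println("error reported")
	case <-time.After(1 * time.Second):
		fmt.Println("timed out; error silently discarded:", serveErr)
	}
}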

I think the channel is nil because here, when telemetry.Settings is being set up, the AsyncErrorChannel field is never set.

We actually didn't have this problem when we were using an earlier version of OTel (e.g. v0.108.1) because it did set up the async error channel properly here.

This is quite problematic because if an otel pipeline cannot bind to a port to report metrics, we have no proper way to know about it (other than seeing repeated failures once Prometheus actually starts scraping), and therefore no way to retry on a different port.

What did you expect to see?
OTel should either report the metrics server error through the asyncErrorChannel or return the error so that it can be detected at start time, e.g. from Start(ctx context.Context, host Host) error.
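For example (a sketch of the general Go pattern only, with a hypothetical helper name, not the collector's actual API): binding the listener synchronously and handing only the accept loop to a goroutine would let a bad port fail Start immediately, while runtime errors still go to the async channel if one is wired up.

package main

import (
	"errors"
	"fmt"
	"net"
	"net/http"
)

// startMetricsServer is a hypothetical helper: the bind happens synchronously,
// so "address already in use" is returned to the caller at start time.
func startMetricsServer(addr string, asyncErrorChannel chan error) (*http.Server, error) {
	ln, err := net.Listen("tcp", addr)
	if err != nil {
		return nil, err // surfaces bind failures to Start
	}

	srv := &http.Server{Addr: addr}
	go func() {
		if serveErr := srv.Serve(ln); serveErr != nil && !errors.Is(serveErr, http.ErrServerClosed) {
			if asyncErrorChannel != nil {
				asyncErrorChannel <- serveErr // report runtime failures asynchronously
			}
		}
	}()
	return srv, nil
}

func main() {
	if _, err := startMetricsServer("127.0.0.1:8889", nil); err != nil {
		fmt.Println("failed to start metrics server:", err)
	}
}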

What did you see instead?
The metrics server error is completely swallowed and there is not even a log about it.

What version did you use?
v0.111.0 (see Steps to reproduce above)

What config did you use?
The builder config and collector config shown under Steps to reproduce above.

Environment
OS: macOS and Linux
Compiler: go 1.22

Additional context
n/a

Labels: bug
