
Graceful shutdown when a component panics #3955

Open
mx-psi opened this issue Sep 1, 2021 · 10 comments
Labels
area:service enhancement New feature or request

Comments

@mx-psi
Member

mx-psi commented Sep 1, 2021

Is your feature request related to a problem? Please describe.

When a component panics, the Collector does not handle the panic gracefully and stops abruptly.

Describe the solution you'd like

The Collector would recover from the panic, log it properly, and shut down gracefully. This could be done with a catch-all recover call in the service.Collector code, or at the level of the component helpers, which would recover and report a fatal error.
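A minimal sketch of the component-helper variant, assuming Go and hypothetical names (wrapStart, reportFatal) rather than actual Collector API:

```go
package main

import (
	"fmt"
	"log"
)

// wrapStart is a hypothetical helper: it runs a component's Start logic,
// recovers from a panic, and reports it as a fatal error so the service
// could shut down gracefully instead of crashing.
func wrapStart(start func() error, reportFatal func(error)) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("component panicked: %v", r)
			reportFatal(err)
		}
	}()
	return start()
}

func main() {
	err := wrapStart(
		func() error { panic("boom") },                       // a misbehaving component
		func(e error) { log.Println("reporting fatal:", e) }, // stand-in for the real mechanism
	)
	log.Println("wrapStart returned:", err)
}
```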

Describe alternatives you've considered

Continue as is without graceful shutdown.

@PaurushGarg
Member

@alolita please assign this issue to me.

@cpheps
Contributor

cpheps commented Nov 8, 2022

@PaurushGarg @mx-psi I'm wondering if this is still being worked on. In our distribution of the Collector, panics seem to be becoming more common as the number of components in contrib grows. Right now there's no good way to gracefully handle panics from components and pass them along to the caller of the Collector.

@mx-psi
Member Author

mx-psi commented Nov 8, 2022

I am not actively working on this; feel free to take it if you want. My main concern here was graceful shutdown; re-reading this, maybe we want to re-raise the panic after the shutdown. Feel free to open a PR and we can discuss there.
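A rough sketch of the "recover, shut down gracefully, then re-raise" idea; runGuarded and its arguments are placeholders, not the actual service.Collector code:

```go
package main

import "log"

// runGuarded is a hypothetical wrapper around the service's run loop:
// on panic it attempts a graceful shutdown first, then re-raises the
// panic so the process still exits with a failure.
func runGuarded(run func(), shutdown func()) {
	defer func() {
		if r := recover(); r != nil {
			log.Printf("panic during run: %v; shutting down gracefully", r)
			shutdown() // drain pipelines, stop components, flush telemetry
			panic(r)   // re-raise so the crash is still visible to the caller
		}
	}()
	run()
}

func main() {
	runGuarded(
		func() { panic("component blew up") },
		func() { log.Println("graceful shutdown completed") },
	)
}
```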

@cpheps
Contributor

cpheps commented Nov 8, 2022

Got it. I don't have capacity currently, but I do think my team is interested in solving this issue. We may take this on in the next few weeks.

@PaurushGarg PaurushGarg removed their assignment Nov 8, 2022
@atoulme atoulme added the release:required-for-ga Must be resolved before GA release label Dec 18, 2023
@crobert-1
Member

Note that I may have misunderstood something obvious here, so please let me know if that's the case.

Summary

I don't think a central recover is possible given Go's panic and recover design. recover can only catch panics occurring on its own goroutine (from Go's spec on panics). In other words, a goroutine can't recover from another goroutine's panic, even if it's a child goroutine (another reference).

In practice this means that to enforce graceful shutdown on panics, each component would need to implement recover functionality in each goroutine it spawns. Even this may not be fully possible, as components may rely on a dependency that starts its own goroutines, which could in turn panic.
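A small standalone example of that limitation, with nothing Collector-specific: the recover deferred in main never observes the child goroutine's panic, and the process crashes.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// This recover only covers panics raised on main's own goroutine.
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered in main:", r) // never runs for the panic below
		}
	}()

	go func() {
		// No recover is deferred inside this goroutine, so this panic
		// terminates the whole process; main's recover cannot intercept it.
		panic("panic in child goroutine")
	}()

	time.Sleep(time.Second)
	fmt.Println("unreachable: the process has already crashed")
}
```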

Practical example

I tested this using the OTLP receiver ingesting traces over the gRPC protocol.

receivers:
  otlp/receiver:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

I added a panic here for testing.

The following stack traces show the different goroutines spawned by the OTLP receiver on the way to the introduced panic.

Stack traces
trace.(*Receiver).Export (otlp.go:45) go.opentelemetry.io/collector/receiver/otlpreceiver/internal/trace
ptraceotlp.rawTracesServer.Export (grpc.go:89) go.opentelemetry.io/collector/pdata/ptrace/ptraceotlp
<autogenerated>:2
v1._TraceService_Export_Handler.func1 (trace_service.pb.go:310) go.opentelemetry.io/collector/pdata/internal/data/protogen/collector/trace/v1
configgrpc.enhanceWithClientInformation.func1 (configgrpc.go:396) go.opentelemetry.io/collector/config/configgrpc
v1._TraceService_Export_Handler (trace_service.pb.go:312) go.opentelemetry.io/collector/pdata/internal/data/protogen/collector/trace/v1
grpc.(*Server).processUnaryRPC (server.go:1369) google.golang.org/grpc
grpc.(*Server).handleStream (server.go:1780) google.golang.org/grpc
grpc.(*Server).serveStreams.func2.1 (server.go:1019) google.golang.org/grpc
runtime.goexit (asm_amd64.s:1695) runtime
:1
 - Async Stack Trace
grpc.(*Server).serveStreams.func2 (server.go:1030) google.golang.org/grpc

Created by:

grpc.(*Server).serveStreams.func2 (server.go:1030) google.golang.org/grpc
transport.(*http2Server).operateHeaders (http2_server.go:631) google.golang.org/grpc/internal/transport
transport.(*http2Server).HandleStreams (http2_server.go:672) google.golang.org/grpc/internal/transport
grpc.(*Server).serveStreams (server.go:1013) google.golang.org/grpc
grpc.(*Server).handleRawConn.func1 (server.go:949) google.golang.org/grpc
runtime.goexit (asm_amd64.s:1695) runtime
 - Async Stack Trace
grpc.(*Server).handleRawConn (server.go:948) google.golang.org/grpc

Created by:

grpc.(*Server).handleRawConn (server.go:926) google.golang.org/grpc
grpc.(*Server).Serve.func3 (server.go:917) google.golang.org/grpc
runtime.goexit (asm_amd64.s:1695) runtime
:1
 - Async Stack Trace
grpc.(*Server).Serve (server.go:916) google.golang.org/grpc

Created by:

grpc.(*Server).Serve (server.go:916) google.golang.org/grpc
otlpreceiver.(*otlpReceiver).startGRPCServer.func1 (otlp.go:117) go.opentelemetry.io/collector/receiver/otlpreceiver
runtime.goexit (asm_amd64.s:1695) runtime
:1
 - Async Stack Trace
otlpreceiver.(*otlpReceiver).startGRPCServer (otlp.go:109) go.opentelemetry.io/collector/receiver/otlpreceiver

The result of this is that even the receiver's main goroutine, which calls startGRPCServer, can't recover from the panic. This rules out a uniform way to handle this in the Collector, from what I can tell. The panic in the export goroutine becomes a fatal panic once all of its deferred functions have run; it never gets a chance to bubble all the way back up to the goroutine running the Collector's Run method. Once a goroutine exits from a panic that has not been recovered, the whole program terminates.
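For the gRPC path specifically, per-goroutine handling could take the form of a recover-based unary interceptor: the deferred recover runs on the same goroutine that executes the handler (and thus the receiver's Export call), so it can catch this particular class of panic. This is only an illustration of the per-goroutine approach, not something the Collector does today:

```go
package main

import (
	"context"
	"log"
	"net"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// recoveryUnaryInterceptor recovers from panics raised while a unary handler
// runs. Because gRPC invokes interceptors and handlers on the same goroutine
// that serves the RPC, the deferred recover here actually fires.
func recoveryUnaryInterceptor(
	ctx context.Context,
	req interface{},
	info *grpc.UnaryServerInfo,
	handler grpc.UnaryHandler,
) (resp interface{}, err error) {
	defer func() {
		if r := recover(); r != nil {
			log.Printf("handler panicked in %s: %v", info.FullMethod, r)
			err = status.Errorf(codes.Internal, "internal error")
		}
	}()
	return handler(ctx, req)
}

func main() {
	lis, err := net.Listen("tcp", "localhost:4317")
	if err != nil {
		log.Fatal(err)
	}
	// Register the interceptor when building the server; service registration
	// (e.g. the OTLP trace service) is omitted here.
	srv := grpc.NewServer(grpc.UnaryInterceptor(recoveryUnaryInterceptor))
	if err := srv.Serve(lis); err != nil {
		log.Fatal(err)
	}
}
```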

@mx-psi
Member Author

mx-psi commented Apr 19, 2024

@crobert-1 I think the main deciding factor here is whether the share of panics we could catch with one or more recovers is significant enough to justify the effort of implementing this. We won't be able to catch all panics, but we would be able to catch some.

@TylerHelmuth
Member

May have some dependency on #9324

@mx-psi
Member Author

mx-psi commented Jul 31, 2024

I am not sure this should be a target for service 1.0 (and therefore Collector 1.0); it's something we can do after 1.0 as an enhancement. Any thoughts?

@jpkrohling
Member

I would vote for not including this in 1.0.

@mx-psi mx-psi removed this from the go.opentelemetry.io/service 1.0 milestone Aug 19, 2024
@mx-psi mx-psi added enhancement New feature or request and removed release:required-for-ga Must be resolved before GA release labels Aug 19, 2024
@mx-psi
Member Author

mx-psi commented Aug 19, 2024

Agreed, I removed this
