Storage node doesn't stop at the shutdown stage #1270

Closed
alexvanin opened this issue Mar 25, 2022 · 5 comments
Labels: bug (Something isn't working), neofs-storage (Storage node application issues), U2 (Seriously planned)

Comments

@alexvanin
Contributor

Expected Behavior

When the neo-go node closes its connection with the storage node, the application changes its state to SHUTTING_DOWN and then the process finishes.

Current Behavior

The application freezes at the SHUTTING_DOWN stage and the process never finishes.

It seems that the gRPC server for the NeoFS API endpoint cannot be closed. We see the stopping gRPC server... log record, but the gRPC server stopped successfully message never appears.

11:10:16.450Z        warn        subscriber/subscriber.go:134        remote notification channel has been closed
11:10:16.450Z        warn        event/listener.go:236        stop event listener by notification channel
11:10:16.450Z        warn        replicator/process.go:28        context is done        {"component": "Object Replicator", "error": "context canceled"}
11:10:16.450Z        info        replicator/process.go:15        routine stopped        {"component": "Object Replicator"}
11:10:16.450Z        warn        neofs-node/main.go:111        internal application error        {"message": "event subscriber connection has been terminated"}
11:10:16.450Z        info        neofs-node/config.go:352        closing components of the storage engine...
11:10:16.451Z        info        neofs-node/config.go:360        all components of the storage engine closed successfully
11:10:16.451Z        info        neofs-node/grpc.go:96        stopping gRPC server...        {"name": "NeoFS Public API"}
11:10:17.437Z        info        policer/process.go:15        routine stopped        {"component": "Object Policer"}

In the goroutine dump we see that grpc.GracefulStop() is blocked on a condition variable, because it waits for all remaining RPC methods to finish.

1 @ 0x43b425 0x46e478 0x46e44e 0x4761d9 0x8ea40e 0xdd971e 0xde898c 0xdda593 0xdd9aa5 0x43aff6 0x472221
#	0x46e44d	sync.runtime_notifyListWait+0xcd			runtime/sema.go:513
#	0x4761d8	sync.(*Cond).Wait+0x98					sync/cond.go:56
#	0x8ea40d	google.golang.org/grpc.(*Server).GracefulStop+0x1ed	google.golang.org/grpc@v1.41.0/server.go:1774
#	0xdd971d	main.stopGRPC+0x17d					github.com/nspcc-dev/neofs-node/cmd/neofs-node/grpc.go:98
#	0xde898b	main.initGRPC.func1.1+0x4b				github.com/nspcc-dev/neofs-node/cmd/neofs-node/grpc.go:59
#	0xdda592	main.shutdown+0x52					github.com/nspcc-dev/neofs-node/cmd/neofs-node/main.go:118
#	0xdd9aa4	main.main+0x1c4						github.com/nspcc-dev/neofs-node/cmd/neofs-node/main.go:66
#	0x43aff5	runtime.main+0x255					runtime/proc.go:225

This never happens, though, because there are goroutines frozen in Send(), blocked on the HTTP/2 flow-control quota (writeQuota.get below):

2 @ 0x43b425 0x44c4d7 0x84cf7a 0x867a16 0x8f4c62 0xb34189 0xd7e942 0xd7858c 0xd7bcaa 0xb69f02 0xd79622 0xd7b70a 0xb78b07 0xd78862 0xd8714b 0xd7a4df 0xdaf5b1 0xd95c8c 0xd95f33 0xd97a0c 0xd933cc 0xd92fac 0xd96d12 0xd96b38 0xd969f6 0xdae778 0xdae6cf 0xddeacd 0xd7a716 0xd83d5c 0xd789b7 0xd7be42
#	0x84cf79	google.golang.org/grpc/internal/transport.(*writeQuota).get+0x79						google.golang.org/grpc@v1.41.0/internal/transport/flowcontrol.go:59
#	0x867a15	google.golang.org/grpc/internal/transport.(*http2Server).Write+0x135						google.golang.org/grpc@v1.41.0/internal/transport/http2_server.go:1037
#	0x8f4c61	google.golang.org/grpc.(*serverStream).SendMsg+0x241								google.golang.org/grpc@v1.41.0/stream.go:1510
#	0xb34188	github.com/nspcc-dev/neofs-api-go/v2/object/grpc.(*objectServiceGetServer).Send+0x48				github.com/nspcc-dev/neofs-api-go/v2@v2.11.2-0.20220114101721-227a871a04ac/object/grpc/service_grpc.pb.go:389

We did introduce timeouts for requests, but there is no timeout for sending responses. Such a timeout cannot be implemented on the neofs-api-go side straight away, because we use the gRPC-generated code directly to send server responses.
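
For context, the generated streaming Send in the stack trace above is just a thin wrapper over grpc.ServerStream.SendMsg. A paraphrased sketch of the usual protoc-gen-go-grpc shape (not an exact copy of the neofs-api-go source), which shows there is no natural place to plug a timeout in:

// typical shape of a generated server-stream Send method: it
// delegates straight to SendMsg, which can block indefinitely on
// HTTP/2 flow control if the client stops reading
func (x *objectServiceGetServer) Send(m *GetResponse) error {
	return x.ServerStream.SendMsg(m)
}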

Possible Solution

  1. Provide some sort of timeout for sending responses.
  2. Run grpc.GracefulStop() with a hardcoded timeout, so that the app closes anyway after a while.

Steps to Reproduce (for bugs)

Sometimes it reproduces on the main chain, when the remote neo-go node closes the connection.

Your Environment

  • Version used: neofs-node v0.27.5

goroutine-dump.txt

@alexvanin
Contributor Author

Okay, it was easier to reproduce than I thought: https://gist.github.com/alexvanin/f856d22de55e9e0f1b683bfbfb33e46b
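
For the record, the essence of the reproduction (an illustration of the mechanism, not necessarily what the gist does) is a client that opens a server-side stream and never reads from it. A minimal sketch, assuming hypothetical generated stubs client and pb:

// open a server-streaming RPC but never call Recv(); once the
// HTTP/2 flow-control window fills up, the server blocks in Send()
// and grpc.GracefulStop() hangs waiting for this RPC to finish
stream, err := client.Get(context.Background(), &pb.GetRequest{})
if err != nil {
	log.Fatal(err)
}
_ = stream // intentionally never Recv()
select {}  // keep the connection open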

@alexvanin
Contributor Author

Provide some sort of timeout for sending responses

There is an issue with this approach: it is not obvious how we can stop server.Send() execution. While this method is blocked, grpc.GracefulStop() won't finish.

So if we want to keep graceful shutdown, I suggest adding a timeout here:

	// run GracefulStop in the background; fall back to the forceful
	// Stop if it does not finish within the timeout
	done := make(chan struct{})
	go func() {
		srv.GracefulStop()
		close(done)
	}()
	select {
	case <-done:
	case <-time.After(30 * time.Second):
		srv.Stop()
	}

@cthulhu-rider
Contributor

cthulhu-rider commented Sep 2, 2022

grpc-go doesn't provide an option to specify a timeout for response writing (the stream is expected to be interrupted by the client side). A more manual deadline can be implemented via an interceptor:

// grpcServerStreamWithTimeout wraps a server stream and applies a
// per-stream deadline to outgoing messages.
type grpcServerStreamWithTimeout struct {
	ch    chan error
	timer *time.Timer
	grpc.ServerStream
}

func (x grpcServerStreamWithTimeout) SendMsg(m interface{}) error {
	go func() {
		x.ch <- x.ServerStream.SendMsg(m)
		// we also need some abort channel here to prevent the send
		// above from blocking forever after the timeout
	}()

	select {
	case err := <-x.ch:
		return err
	case <-x.timer.C:
		return errors.New("write response timeout")
	}
}

srv := grpc.NewServer(grpc.StreamInterceptor(func(srv interface{}, ss grpc.ServerStream, info *grpc.StreamServerInfo, handler grpc.StreamHandler) error {
	v := grpcServerStreamWithTimeout{
		ch:           make(chan error),
		timer:        time.NewTimer(10 * time.Second),
		ServerStream: ss,
	}

	defer v.timer.Stop()

	return handler(srv, v)
}))

The problem is that we can't abort the stream on timeout. This approach also introduces overhead: a goroutine, a timer, and a select per message. (One way to mitigate the leaked goroutine is sketched below.)
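
A small tweak for the leaked-goroutine issue mentioned in the code comment above (a sketch only: it still cannot abort the underlying write, and a stale result may be consumed by a later SendMsg call):

	v := grpcServerStreamWithTimeout{
		// a buffer of 1 lets the SendMsg goroutine deliver its result
		// and exit even after the deadline has already fired
		ch:           make(chan error, 1),
		timer:        time.NewTimer(10 * time.Second),
		ServerStream: ss,
	}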

So option 2 seems more appropriate to me.

@cthulhu-rider
Contributor

Option 1 has already been implemented. Is the problem still reproducible?

@alexvanin
Copy link
Contributor Author

Haven't heard about app termination issues since then. We can close it until new issues appear.

/cc @fyrchik

aprasolova pushed a commit to aprasolova/neofs-node that referenced this issue Oct 19, 2022
GracefulStop() may be blocked until all server-side streams
are finished. There is no control over such streams yet, so
the application may freeze at the shutdown stage.

A naive solution is to add a timeout for GracefulStop(). At that
point, healthy connections will have finished, and unhealthy
connections will be terminated by Stop().

Signed-off-by: Alex Vanin <alexey@nspcc.ru>