Storage node doesn't stop at the shutdown stage #1270

Closed
alexvanin opened this issue Mar 25, 2022 · 5 comments
Labels: bug (Something isn't working), neofs-storage (Storage node application issues), U2 (Seriously planned)

Comments

@alexvanin
Contributor

Expected Behavior

When the neo-go node closes its connection with the storage node, the application changes its state to SHUTTING_DOWN and then the process finishes.

Current Behavior

The application freezes at the SHUTTING_DOWN stage and the process never finishes.

It seems that the gRPC server for the NeoFS API endpoint cannot be closed. We see the stopping gRPC server... log record, but the gRPC server stopped successfully message never appears.

11:10:16.450Z        warn        subscriber/subscriber.go:134        remote notification channel has been closed
11:10:16.450Z        warn        event/listener.go:236        stop event listener by notification channel
11:10:16.450Z        warn        replicator/process.go:28        context is done        {"component": "Object Replicator", "error": "context canceled"}
11:10:16.450Z        info        replicator/process.go:15        routine stopped        {"component": "Object Replicator"}
11:10:16.450Z        warn        neofs-node/main.go:111        internal application error        {"message": "event subscriber connection has been terminated"}
11:10:16.450Z        info        neofs-node/config.go:352        closing components of the storage engine...
11:10:16.451Z        info        neofs-node/config.go:360        all components of the storage engine closed successfully
11:10:16.451Z        info        neofs-node/grpc.go:96        stopping gRPC server...        {"name": "NeoFS Public API"}
11:10:17.437Z        info        policer/process.go:15        routine stopped        {"component": "Object Policer"}

In the goroutine dump we see that grpc.GracefulStop() is blocked on a condition variable, because it waits for all remaining RPC methods to finish.

1 @ 0x43b425 0x46e478 0x46e44e 0x4761d9 0x8ea40e 0xdd971e 0xde898c 0xdda593 0xdd9aa5 0x43aff6 0x472221
#	0x46e44d	sync.runtime_notifyListWait+0xcd			runtime/sema.go:513
#	0x4761d8	sync.(*Cond).Wait+0x98					sync/cond.go:56
#	0x8ea40d	google.golang.org/grpc.(*Server).GracefulStop+0x1ed	google.golang.org/grpc@v1.41.0/server.go:1774
#	0xdd971d	main.stopGRPC+0x17d					github.com/nspcc-dev/neofs-node/cmd/neofs-node/grpc.go:98
#	0xde898b	main.initGRPC.func1.1+0x4b				github.com/nspcc-dev/neofs-node/cmd/neofs-node/grpc.go:59
#	0xdda592	main.shutdown+0x52					github.com/nspcc-dev/neofs-node/cmd/neofs-node/main.go:118
#	0xdd9aa4	main.main+0x1c4						github.com/nspcc-dev/neofs-node/cmd/neofs-node/main.go:66
#	0x43aff5	runtime.main+0x255					runtime/proc.go:225

This never happens, though, because there are goroutines frozen in Send(), blocked on the HTTP/2 flow-control quota (writeQuota.get below):

2 @ 0x43b425 0x44c4d7 0x84cf7a 0x867a16 0x8f4c62 0xb34189 0xd7e942 0xd7858c 0xd7bcaa 0xb69f02 0xd79622 0xd7b70a 0xb78b07 0xd78862 0xd8714b 0xd7a4df 0xdaf5b1 0xd95c8c 0xd95f33 0xd97a0c 0xd933cc 0xd92fac 0xd96d12 0xd96b38 0xd969f6 0xdae778 0xdae6cf 0xddeacd 0xd7a716 0xd83d5c 0xd789b7 0xd7be42
#	0x84cf79	google.golang.org/grpc/internal/transport.(*writeQuota).get+0x79						google.golang.org/grpc@v1.41.0/internal/transport/flowcontrol.go:59
#	0x867a15	google.golang.org/grpc/internal/transport.(*http2Server).Write+0x135						google.golang.org/grpc@v1.41.0/internal/transport/http2_server.go:1037
#	0x8f4c61	google.golang.org/grpc.(*serverStream).SendMsg+0x241								google.golang.org/grpc@v1.41.0/stream.go:1510
#	0xb34188	github.com/nspcc-dev/neofs-api-go/v2/object/grpc.(*objectServiceGetServer).Send+0x48				github.com/nspcc-dev/neofs-api-go/v2@v2.11.2-0.20220114101721-227a871a04ac/object/grpc/service_grpc.pb.go:389

We did introduce timeouts for requests, but there is no timeout for sending responses. Such a timeout cannot be implemented on the neofs-api-go side straight away, because we use the gRPC-generated code directly to send server responses.
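
For context, the generated streaming Send in the stack trace above is just a thin wrapper over grpc.ServerStream.SendMsg. A paraphrased sketch of the usual protoc-gen-go-grpc shape (not an exact copy of the neofs-api-go source), which shows there is no natural place to plug a timeout in:

// typical shape of a generated server-stream Send method: it
// delegates straight to SendMsg, which can block indefinitely on
// HTTP/2 flow control if the client stops reading
func (x *objectServiceGetServer) Send(m *GetResponse) error {
	return x.ServerStream.SendMsg(m)
}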

Possible Solution

  1. Provide some sort of timeout for sending responses.
  2. Run grpc.GracefulStop() with a hardcoded timeout, so that the app closes anyway after a while.

Steps to Reproduce (for bugs)

Sometimes it reproduces on the main chain, when the remote neo-go node closes the connection.

Your Environment

  • Version used: neofs-node v0.27.5

goroutine-dump.txt

@alexvanin
Contributor Author

Okay, it was easier to reproduce than I thought: https://gist.github.com/alexvanin/f856d22de55e9e0f1b683bfbfb33e46b
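
For the record, the essence of the reproduction (an illustration of the mechanism, not necessarily what the gist does) is a client that opens a server-side stream and never reads from it. A minimal sketch, assuming hypothetical generated stubs client and pb:

// open a server-streaming RPC but never call Recv(); once the
// HTTP/2 flow-control window fills up, the server blocks in Send()
// and grpc.GracefulStop() hangs waiting for this RPC to finish
stream, err := client.Get(context.Background(), &pb.GetRequest{})
if err != nil {
	log.Fatal(err)
}
_ = stream // intentionally never Recv()
select {}  // keep the connection open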

@alexvanin
Contributor Author

Provide some sort of timeout for sending responses

There is an issue with this approach: it is not obvious how we can stop server.Send() execution. While this method is blocked, grpc.GracefulStop() won't finish.

So if we want to keep graceful shutdown, I suggest adding a timeout here:

	// run GracefulStop in the background; fall back to the forceful
	// Stop if it does not finish within the timeout
	done := make(chan struct{})
	go func() {
		srv.GracefulStop()
		close(done)
	}()
	select {
	case <-done:
	case <-time.After(30 * time.Second):
		srv.Stop()
	}

@cthulhu-rider
Contributor

cthulhu-rider commented Sep 2, 2022

grpc-go doesn't provide an option to specify a timeout for response writing (the stream is expected to be interrupted by the client side). A more manual deadline can be implemented via an interceptor:

// grpcServerStreamWithTimeout wraps a server stream and applies a
// per-stream deadline to outgoing messages.
type grpcServerStreamWithTimeout struct {
	ch    chan error
	timer *time.Timer
	grpc.ServerStream
}

func (x grpcServerStreamWithTimeout) SendMsg(m interface{}) error {
	go func() {
		x.ch <- x.ServerStream.SendMsg(m)
		// we also need some abort channel here to prevent the send
		// above from blocking forever after the timeout
	}()

	select {
	case err := <-x.ch:
		return err
	case <-x.timer.C:
		return errors.New("write response timeout")
	}
}

srv := grpc.NewServer(grpc.StreamInterceptor(func(srv interface{}, ss grpc.ServerStream, info *grpc.StreamServerInfo, handler grpc.StreamHandler) error {
	v := grpcServerStreamWithTimeout{
		ch:           make(chan error),
		timer:        time.NewTimer(10 * time.Second),
		ServerStream: ss,
	}

	defer v.timer.Stop()

	return handler(srv, v)
}))

The problem is that we can't abort the stream on timeout. This approach also introduces overhead: a goroutine, a timer, and a select per message. (One way to mitigate the leaked goroutine is sketched below.)
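
A small tweak for the leaked-goroutine issue mentioned in the code comment above (a sketch only: it still cannot abort the underlying write, and a stale result may be consumed by a later SendMsg call):

	v := grpcServerStreamWithTimeout{
		// a buffer of 1 lets the SendMsg goroutine deliver its result
		// and exit even after the deadline has already fired
		ch:           make(chan error, 1),
		timer:        time.NewTimer(10 * time.Second),
		ServerStream: ss,
	}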

So option 2 seems more appropriate to me.

@cthulhu-rider
Contributor

Option 1 has already been implemented. Is the problem still reproducible?

@alexvanin
Copy link
Contributor Author

Haven't heard about app termination issues since then. We can close it until new issues appear.

/cc @fyrchik

aprasolova pushed a commit to aprasolova/neofs-node that referenced this issue Oct 19, 2022
GracefulStop() may be blocked until all server-side streams
are finished. There is no control over such streams yet, so
the application may freeze at the shutdown stage.

A naive solution is to add a timeout for GracefulStop(). At that
point, healthy connections will have finished, and unhealthy
connections will be terminated by Stop().

Signed-off-by: Alex Vanin <alexey@nspcc.ru>