Skip to content

NullPointerException when calling onError(StatusRuntimeException) #11973

Closed
@ulfjack

Description

@ulfjack

What version of gRPC-Java are you using?

v1.71.0

What is your environment?

Linux

What did you expect to see?

I was not expecting onError to throw NullPointerException.

What did you see instead?

It threw NPE.

Steps to reproduce the bug

We have been seeing some rare NullPointerException or ArrayIndexOutOfBoundsException when calling onError on an incoming server call. This happens several times per day across all of our services, which translates to maybe one in a few billion calls or so. The exception looks like this:

java.lang.NullPointerException: Cannot invoke "io.grpc.Metadata$LazyValue.toBytes()" because "value" is null
	at io.grpc.Metadata.valueAsBytes(Metadata.java:183)
	at io.grpc.Metadata.serialize(Metadata.java:474)
	at io.grpc.InternalMetadata.serialize(InternalMetadata.java:79)
	at io.grpc.internal.TransportFrameUtil.toHttp2Headers(TransportFrameUtil.java:51)
	at io.grpc.netty.Utils.convertTrailers(Utils.java:327)
	at io.grpc.netty.NettyServerStream$Sink.writeTrailers(NettyServerStream.java:131)
	at io.grpc.internal.AbstractServerStream.close(AbstractServerStream.java:133)
	at io.grpc.internal.ServerCallImpl.closeInternal(ServerCallImpl.java:227)
	at io.grpc.internal.ServerCallImpl.close(ServerCallImpl.java:213)
	at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onError(ServerCalls.java:389)

In order to debug this, we added some code to use reflection to dump the internal structure of the Metadata object that throws. Here are some examples:

[grpc-status,14,grpc-message,Channel shutdown invoked,null,null,null,null]
[grpc-status,14,grpc-message,Network closed for unknown reason,null,null,null,null]
[null,null,grpc-status,14,grpc-message,Channel shutdown invoked,null,null]
[grpc-status,14,grpc-message,Channel shutdown invoked,null,null,grpc-message,Channel shutdown invoked]
[grpc-status,14,grpc-message,Channel shutdown invoked,null,null,null,null]

Some of these look perfectly valid, while others have the expected content, but in an unusual order, with null elements at the beginning or between entries. The documentation for Metadata clearly says that it is mutable and not thread-safe, and this appears to be due to multi-threaded mutation of the Metadata object.

We initially found this surprising: we're not setting or mutating metadata anywhere.

After reviewing the grpc-java source code, we discovered the following sequence of events:

We call
io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onError
which calls
Metadata metadata = Status.trailersFromThrowable(t);
and then calls
io.grpc.internal.ServerCallImpl.close(ServerCallImpl.java:213)
which calls
io.grpc.internal.ServerCallImpl.closeInternal(ServerCallImpl.java:227)
which calls
io.grpc.internal.AbstractServerStream.close(AbstractServerStream.java:133)
which calls
addStatusToTrailers
which mutates the Metadata object.

The Metadata object mutated here is the Metadata object stored in the StatusRuntimeException that we passed to onError. Who owns that instance?

What actually happens in our code is that we make outgoing gRPC calls to other services which fail with a StatusRuntimeException, and then we use the same instance to reply to multiple incoming gRPC calls on different threads. When we see the concurrent mutation of the Metadata object, that's because the onError call unconditionally mutates the Metadata, which is now shared across multiple threads.

The code is racing with itself, and this happens even when the application code never sets or modifies Metadata objects. This is not great.

Personally, I think it's surprising and problematic to mutate the Metadata that's attached to a SRE in the onError call. I would prefer StatusRuntimeException to be effectively immutable.

At a minimum, it would be good to explicitly document that StatusRuntimeException is not thread-safe and should not be shared across threads (unfortunately, we end up storing it in Future objects, which is difficult to fix on our side). This would - at least - have pointed us in the right direction earlier.

Other options:

  • Make SRE immutable by copying the metadata whenever it's returned from getTrailers, but this would have performance implications.
  • Make onError not mutate the passed-in Metadata object:
    • Copy the trailers if set (performance?).
    • Don't update the Metadata if the status fields match the corresponding Status (not a complete solution).
  • Make outgoing client calls throw SRE without a Metadata object; I believe that the data is currently duplicated via the Metadata and the Status, even if no other metadata is set (not a complete solution).

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions