Description
What version of gRPC-Java are you using?
v1.71.0
What is your environment?
Linux
What did you expect to see?
I was not expecting onError to throw NullPointerException.
What did you see instead?
It threw NPE.
Steps to reproduce the bug
We have been seeing some rare NullPointerException
or ArrayIndexOutOfBoundsException
when calling onError
on an incoming server call. This happens several times per day across all of our services, which translates to maybe one in a few billion calls or so. The exception looks like this:
java.lang.NullPointerException: Cannot invoke "io.grpc.Metadata$LazyValue.toBytes()" because "value" is null
at io.grpc.Metadata.valueAsBytes(Metadata.java:183)
at io.grpc.Metadata.serialize(Metadata.java:474)
at io.grpc.InternalMetadata.serialize(InternalMetadata.java:79)
at io.grpc.internal.TransportFrameUtil.toHttp2Headers(TransportFrameUtil.java:51)
at io.grpc.netty.Utils.convertTrailers(Utils.java:327)
at io.grpc.netty.NettyServerStream$Sink.writeTrailers(NettyServerStream.java:131)
at io.grpc.internal.AbstractServerStream.close(AbstractServerStream.java:133)
at io.grpc.internal.ServerCallImpl.closeInternal(ServerCallImpl.java:227)
at io.grpc.internal.ServerCallImpl.close(ServerCallImpl.java:213)
at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onError(ServerCalls.java:389)
In order to debug this, we added some code to use reflection to dump the internal structure of the Metadata
object that throws. Here are some examples:
[grpc-status,14,grpc-message,Channel shutdown invoked,null,null,null,null]
[grpc-status,14,grpc-message,Network closed for unknown reason,null,null,null,null]
[null,null,grpc-status,14,grpc-message,Channel shutdown invoked,null,null]
[grpc-status,14,grpc-message,Channel shutdown invoked,null,null,grpc-message,Channel shutdown invoked]
[grpc-status,14,grpc-message,Channel shutdown invoked,null,null,null,null]
Some of these look perfectly valid, while others have the expected content, but in an unusual order, with null
elements at the beginning or between entries. The documentation for Metadata
clearly says that it is mutable and not thread-safe, and this appears to be due to multi-threaded mutation of the Metadata
object.
We initially found this surprising: we're not setting or mutating metadata anywhere.
After reviewing the grpc-java source code, we discovered the following sequence of events:
We call
io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onError
which calls
Metadata metadata = Status.trailersFromThrowable(t);
and then calls
io.grpc.internal.ServerCallImpl.close(ServerCallImpl.java:213)
which calls
io.grpc.internal.ServerCallImpl.closeInternal(ServerCallImpl.java:227)
which calls
io.grpc.internal.AbstractServerStream.close(AbstractServerStream.java:133)
which calls
addStatusToTrailers
which mutates the Metadata
object.
The Metadata object mutated here is the Metadata
object stored in the StatusRuntimeException
that we passed to onError
. Who owns that instance?
What actually happens in our code is that we make outgoing gRPC calls to other services which fail with a StatusRuntimeException
, and then we use the same instance to reply to multiple incoming gRPC calls on different threads. When we see the concurrent mutation of the Metadata
object, that's because the onError
call unconditionally mutates the Metadata
, which is now shared across multiple threads.
The code is racing with itself, and this happens even when the application code never sets or modifies Metadata
objects. This is not great.
Personally, I think it's surprising and problematic to mutate the Metadata
that's attached to a SRE
in the onError
call. I would prefer StatusRuntimeException
to be effectively immutable.
At a minimum, it would be good to explicitly document that StatusRuntimeException
is not thread-safe and should not be shared across threads (unfortunately, we end up storing it in Future
objects, which is difficult to fix on our side). This would - at least - have pointed us in the right direction earlier.
Other options:
- Make
SRE
immutable by copying the metadata whenever it's returned fromgetTrailers
, but this would have performance implications. - Make
onError
not mutate the passed-inMetadata
object:- Copy the
trailers
if set (performance?). - Don't update the
Metadata
if the status fields match the correspondingStatus
(not a complete solution).
- Copy the
- Make outgoing client calls throw
SRE
without aMetadata
object; I believe that the data is currently duplicated via theMetadata
and theStatus
, even if no other metadata is set (not a complete solution).