Description
We recently added a background-logging task to our server, but in some unit tests, we were seeing that very rarely the background task would fail with the following stack trace:
nested task error: IOError: stream is closed or unusable
Stacktrace:
[1] check_open
@ ./stream.jl:388 [inlined]
[2] uv_write_async(s::Base.PipeEndpoint, p::Ptr{UInt8}, n::UInt64)
@ Base ./stream.jl:1072
[3] uv_write(s::Base.PipeEndpoint, p::Ptr{UInt8}, n::UInt64)
@ Base ./stream.jl:1035
[4] unsafe_write(s::Base.PipeEndpoint, p::Ptr{UInt8}, n::UInt64)
@ Base ./stream.jl:1118
[5] unsafe_write
@ ./io.jl:683 [inlined]
[6] write(s::Base.PipeEndpoint, a::Vector{UInt8})
@ Base ./io.jl:706
[7] handle_message(logger::Logging.ConsoleLogger, level::Base.CoreLogging.LogLevel, message::Any, _module::Any, group::Any, id::Any, filepath::Any, line::Any; kwargs::Base.Pairs{Symbol, V, Tuple{Vararg{Symbol, N}}, NamedTuple{names, T}} where {V, N, names, T<:Tuple{Vararg{Any, N}}})
@ Logging /nix/store/60wrl8qkdrr2l6p9nbv8xff88a8igr0n-julia-1.8.2-patched/share/julia/stdlib/v1.8/Logging/src/ConsoleLogger.jl:178
[...]
We eventually tracked this down to a call to @suppress_err
in the unit test, from Suppressors.jl, which temporarily redirects stderr to a pipe, and then puts it back and closes the pipe when the macro is done.
We believe this (rare) error is a race condition between the test's Task closing the pipe at the end of the macro, and the background logging task attempting to write to the pipe. That is, we think that one thread is redirecting stderr and closing the pipe inbetween these two calls:
checking that stderr isn't closed:
julia/stdlib/Logging/src/ConsoleLogger.jl
Lines 121 to 123 in 6c0aa6d
and then writing to it here:
julia/stdlib/Logging/src/ConsoleLogger.jl
Line 178 in 6c0aa6d
(Here's that whole section, for posterity):
julia/stdlib/Logging/src/ConsoleLogger.jl
Lines 121 to 178 in 6c0aa6d
We're not sure what the right solution here should be.
It's possible that throwing an exception is already the right behavior: in our case, the exception stacktrace helped lead us to the issue!
On the other hand, philosophically, it seems that Julia takes the approach that an error during logging shouldn't throw an exception, so that you don't have to fear adding log messages. Additionally, it seems that after #40423, the logger is taking pains to gracefully fall back to stderr if the logger's stream has been closed. In this case, the stream was closed after we checked (in a data race), so it's possible we can be more robust in our stream-handling if this safety really is our aim.
For example, we think that this could be fixed by adding a try/catch around the write, and falling back to stderr.
Separately, we are going to file/fix the race condition in Suppressors.jl, probably by adding a delay in between redirecting stderr back to the original stderr, and closing the pipe.