Description
We've encountered that Stdlib/Out_channel.flush
may raise a Sys_error
exception when used in parallel with a close
.
Consider this reproducer program:
let path = Filename.temp_file "stm-" ""
let test () =
let channel = Atomic.make (Out_channel.open_text path) in
let wait = Atomic.make true in
let d1 = Domain.spawn (fun () ->
while Atomic.get wait do Domain.cpu_relax() done;
Out_channel.close (Atomic.get channel)) in
let d2 = Domain.spawn (fun () ->
while Atomic.get wait do Domain.cpu_relax() done;
(try
Out_channel.flush (Atomic.get channel)
with (Sys_error msg) -> Printf.printf "Out_channel.flush raised Sys_error %S\n%!" msg; assert false)) in
(* Output a sufficiently long string *)
Out_channel.output_string (Atomic.get channel) (String.make (65531 (*65536*)) 'a');
(* Let the races begin! *)
Atomic.set wait false;
let () = Domain.join d1 in
let () = Domain.join d2 in
(* Please leave the torture chamber nice and clean as you found it *)
(try Out_channel.close (Atomic.get channel) with Sys_error _ -> ());
Sys.remove path
let _ =
for i = 1 to 50_000 do
if i mod 250 = 0 then Printf.printf "#%!";
test ()
done
with this behaviour:
$ ocamlopt -g flushexc.ml
$ ./a.out
##############################################################################################################Out_channel.flush raised Sys_error "Bad file descriptor"
Fatal error: exception File "flushexc.ml", line 21, characters 94-100: Assertion failed
The documentation does not warn of unsynchronized access to Out_channel
operations, as they are generally protected by internal channel locks.
The documentation for close
mentions that flush
should not raise Sys_error
when called on a closed channel, but does
strictly speaking not specify the behaviour of flush
in parallel with close
:
val close : t -> unit
(** Close the given channel, flushing all buffered write operations. Output
functions raise a [Sys_error] exception when they are applied to a closed
output channel, except {!close} and {!flush}, which do nothing when applied
to an already closed channel. Note that {!close} may raise [Sys_error] if
the operating system signals an error when flushing or closing. *)
The problem is that flush
on a still-open channel ends up in caml_flush_partial
which calls check_pending
that may
temporarily unlock the channel to process pending actions. This creates a small window for caml_ml_close_channel
to lock and close the
underlying file descriptor, and offset channel->curr
by 1 from channel->buff
into dummy_buff
. As a result, when caml_flush_partial
resumes, it will attempt to output 1 character, fail, and raise an exception.
Before jumping into discussions of fixes, we may want to discuss how to proceed.
Here's a few suggestions:
- Do nothing, as this is misusing the Stdlib and/or invoking unspecified behaviour
- Update the documentation to warn of this behaviour
- Patch the code to prevent the exception. This option is delicate as
caml_flush_partial
is simultaniously responsible for raising an exception for its other callers. After a bit
of experimentation, my best bet at this would be to wrap a handler aroundflush
on the
OCaml-side in stdlib.ml, which shouldn't be a performance bottleneck. - ...
Note: the above example program can easily be modified to race on two close
s also resulting
in a Sys_error
. The last sentence of the documentation could account for that behaviour
though.
Thanks to @ncik-roberts for figuring out the above explanation.
CC to @damiendoligez who last had his fingers in these parts in #12678