Skip to content

flush can raise Sys_error when used in parallel #13586

Open
@jmid

Description

@jmid

We've encountered that Stdlib/Out_channel.flush may raise a Sys_error exception when used in parallel with a close.
Consider this reproducer program:

let path = Filename.temp_file "stm-" ""

let test () =
  let channel = Atomic.make (Out_channel.open_text path) in

  let wait = Atomic.make true in

  let d1 = Domain.spawn (fun () ->
      while Atomic.get wait do Domain.cpu_relax() done;
      Out_channel.close (Atomic.get channel)) in

  let d2 = Domain.spawn (fun () ->
      while Atomic.get wait do Domain.cpu_relax() done;
      (try
        Out_channel.flush (Atomic.get channel)
       with (Sys_error msg) -> Printf.printf "Out_channel.flush raised Sys_error %S\n%!" msg; assert false)) in

  (* Output a sufficiently long string *)
  Out_channel.output_string (Atomic.get channel) (String.make (65531 (*65536*)) 'a');
  (* Let the races begin! *)
  Atomic.set wait false;
  let () = Domain.join d1 in
  let () = Domain.join d2 in
  (* Please leave the torture chamber nice and clean as you found it *)
  (try Out_channel.close (Atomic.get channel) with Sys_error _ -> ());
  Sys.remove path

let _ =
  for i = 1 to 50_000 do
    if i mod 250 = 0 then Printf.printf "#%!";
    test ()
  done

with this behaviour:

$ ocamlopt -g flushexc.ml
$ ./a.out
##############################################################################################################Out_channel.flush raised Sys_error "Bad file descriptor"
Fatal error: exception File "flushexc.ml", line 21, characters 94-100: Assertion failed

The documentation does not warn of unsynchronized access to Out_channel operations, as they are generally protected by internal channel locks.

The documentation for close mentions that flush should not raise Sys_error when called on a closed channel, but does
strictly speaking not specify the behaviour of flush in parallel with close:

val close : t -> unit
(** Close the given channel, flushing all buffered write operations.  Output
    functions raise a [Sys_error] exception when they are applied to a closed
    output channel, except {!close} and {!flush}, which do nothing when applied
    to an already closed channel.  Note that {!close} may raise [Sys_error] if
    the operating system signals an error when flushing or closing. *)

The problem is that flush on a still-open channel ends up in caml_flush_partial which calls check_pending that may
temporarily unlock the channel to process pending actions. This creates a small window for caml_ml_close_channel to lock and close the
underlying file descriptor, and offset channel->curr by 1 from channel->buff into dummy_buff. As a result, when caml_flush_partial resumes, it will attempt to output 1 character, fail, and raise an exception.

Before jumping into discussions of fixes, we may want to discuss how to proceed.
Here's a few suggestions:

  • Do nothing, as this is misusing the Stdlib and/or invoking unspecified behaviour
  • Update the documentation to warn of this behaviour
  • Patch the code to prevent the exception. This option is delicate as caml_flush_partial
    is simultaniously responsible for raising an exception for its other callers. After a bit
    of experimentation, my best bet at this would be to wrap a handler around flush on the
    OCaml-side in stdlib.ml, which shouldn't be a performance bottleneck.
  • ...

Note: the above example program can easily be modified to race on two closes also resulting
in a Sys_error. The last sentence of the documentation could account for that behaviour
though.

Thanks to @ncik-roberts for figuring out the above explanation.

CC to @damiendoligez who last had his fingers in these parts in #12678

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions