`flush` can raise `Sys_error` when used in parallel

We've encountered that `Stdlib/Out_channel.flush` may raise a `Sys_error` exception when used in parallel with a `close`.
Consider this reproducer program:

```ocaml
let path = Filename.temp_file "stm-" ""

let test () =
  let channel = Atomic.make (Out_channel.open_text path) in

  let wait = Atomic.make true in

  let d1 = Domain.spawn (fun () ->
      while Atomic.get wait do Domain.cpu_relax() done;
      Out_channel.close (Atomic.get channel)) in

  let d2 = Domain.spawn (fun () ->
      while Atomic.get wait do Domain.cpu_relax() done;
      (try
        Out_channel.flush (Atomic.get channel)
       with (Sys_error msg) -> Printf.printf "Out_channel.flush raised Sys_error %S\n%!" msg; assert false)) in

  (* Output a sufficiently long string *)
  Out_channel.output_string (Atomic.get channel) (String.make (65531 (*65536*)) 'a');
  (* Let the races begin! *)
  Atomic.set wait false;
  let () = Domain.join d1 in
  let () = Domain.join d2 in
  (* Please leave the torture chamber nice and clean as you found it *)
  (try Out_channel.close (Atomic.get channel) with Sys_error _ -> ());
  Sys.remove path

let _ =
  for i = 1 to 50_000 do
    if i mod 250 = 0 then Printf.printf "#%!";
    test ()
  done
```

with this behaviour:
```
$ ocamlopt -g flushexc.ml
$ ./a.out
##############################################################################################################Out_channel.flush raised Sys_error "Bad file descriptor"
Fatal error: exception File "flushexc.ml", line 21, characters 94-100: Assertion failed
```

The documentation does not warn of unsynchronized access to `Out_channel` operations, as they are generally protected by internal channel locks.

The documentation for `close` mentions that `flush` should not raise `Sys_error` when called on a closed channel, but does
strictly speaking not specify the behaviour of `flush` in parallel with `close`:

```ocaml
val close : t -> unit
(** Close the given channel, flushing all buffered write operations.  Output
    functions raise a [Sys_error] exception when they are applied to a closed
    output channel, except {!close} and {!flush}, which do nothing when applied
    to an already closed channel.  Note that {!close} may raise [Sys_error] if
    the operating system signals an error when flushing or closing. *)
```


The problem is that `flush` on a still-open channel ends up in [`caml_flush_partial`](https://github.com/ocaml/ocaml/blob/trunk/runtime/io.c#L250) which calls [`check_pending`]( https://github.com/ocaml/ocaml/blob/trunk/runtime/io.c#L118) that may
temporarily unlock the channel to process pending actions. This creates a small window for [`caml_ml_close_channel`](https://github.com/ocaml/ocaml/blob/trunk/runtime/io.c#L706) to lock and close the
underlying file descriptor, and offset `channel->curr` by 1 from `channel->buff` into `dummy_buff`. As a result, when `caml_flush_partial` resumes, it will attempt to output 1 character, fail, and raise an exception.


Before jumping into discussions of fixes, we may want to discuss how to proceed.
Here's a few suggestions:
- Do nothing, as this is misusing the Stdlib and/or invoking unspecified behaviour
- Update the documentation to warn of this behaviour
- Patch the code to prevent the exception. This option is delicate as `caml_flush_partial`
  is simultaniously responsible for raising an exception for its other callers. After a bit
  of experimentation, my best bet at this would be to wrap a handler around `flush` on the
  OCaml-side in stdlib.ml, which shouldn't be a performance bottleneck.
- ...


Note: the above example program can easily be modified to race on two `close`s also resulting
in a `Sys_error`. The last sentence of the documentation could account for that behaviour
though.


Thanks to @ncik-roberts for figuring out the above explanation.

CC to @damiendoligez who last had his fingers in these parts in #12678

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

`flush` can raise `Sys_error` when used in parallel #13586

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

flush can raise Sys_error when used in parallel #13586

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

`flush` can raise `Sys_error` when used in parallel #13586