Skip to content

Conversation

@v-gb
Copy link
Contributor

@v-gb v-gb commented Feb 27, 2025

No description provided.

@v-gb
Copy link
Contributor Author

v-gb commented Feb 27, 2025

The failure might be because of carriage returns on windows. But before figuring out what to do, I'd like to know if the approach is reasonable.

Also it might be useful to see what comes out of ocaml/ocaml#13833, because the parser is producing loc_ghost in cases that don't make sense to me, so I'm not sure that the use of loc_ghost to distinguish modifications made to the AST always work (I got a strange test change when using loc_ghost in Fmt_ast.sequence_blank_line).

Comment on lines 367 to 368
match Source.loc_of_first_token_at src whole_loc LBRACKETPERCENT with
| Some loc -> loc
| Some _ as o -> o
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps Source.loc_of_first_token_at shouldn't attempt to read the source at all and always return None for ghost loc ? I don't see other impossible cases related to these functions returning None.

Also, this relocate_ext_cmts shouldn't raise errors at all as it's just using heuristics to move comments around. The impossible case should be removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the impossible case.

The problem with checking loc_ghost earlier is that the upstream parser sets loc_ghost incorrectly, and the bug is inherited in the ocamlformat parser, and the result is blank lines or comments moving around.
Specifically, the tests that break use begin%test e end in one case and new%foo e in the other, and in both case, we're exercising the same parser semantic action that creates Pexp_extension with a ghost location, and the inner expression with a real location, whereas the opposite is what would be correct.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, I think the use of loc_ghost is not unreasonable, but the most principled solution would be to make the printer work using the ast only, without the side information from Source.t or Cmts.t.

This can be seen with this thought experimental: imagine that you're combining files with print (parse Structure file1 @ parse Structure file2). In principle, this should have the same effect as In_channel.read_all file1 ^ "\n" ^ In_channel.read file2. But in practice, either you set loc_ghost on one ast and lose part of the formatting, or you don't and instead risk the printer raising exceptions.

Whereas if the ast was self contained, the concatenated asts would get printed the same as the concatenation of the source files, modulo the whitespace at the boundary. I don't know how complex that would be, but in this function, it would probably mean that the Pexp_extension should have a loc_of_first_token field.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right!

but the most principled solution would be to make the printer work using the ast only, without the side information from Source.t or Cmts.t.

That was once a goal but we're not there yet as it's a lot of work. The modified AST is closer to the concrete syntax than the upstream parser but not enough to stop reading the source yet.
Any work in that direction is very welcome :)

Comments is a much more complex story. Representing comments at every nodes in the AST was deemed too hard since the beginning.

This can be seen with this thought experimental: imagine that you're combining files with print (parse Structure file1 @ parse Structure file2).

In theory, you can concatenate the Cmts.t as well and not loose the comments.
For the source part, we could make use the filename field of a location but that's just adhoc. We definitely want to get rid of it.

But keep in mind that OCamlformat is a string -> string function. In a much simpler use-case than yours I made it a string -> (ast -> ast) -> string function and haven't had to deal with source and comments. Could that help ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but the most principled solution would be to make the printer work using the ast only, without the side information from Source.t or Cmts.t.

That was once a goal but we're not there yet as it's a lot of work. The modified AST is closer to the concrete syntax than the upstream parser but not enough to stop reading the source yet. Any work in that direction is very welcome :)

Yes, clearly doing this everywhere would be a lot of work. But it seems like the kind of work where you chip away at it, and the most common problems get solved, and then you solve the long tail of remaining problems over time.

I don't think it requires bringing the AST close to the concrete syntax though, it only requires capturing the relevant information from the concrete syntax into the tree.
ocamlformat's AST is actually sufficient difficult to manipulate due to some of these choices that I ended up rewriting the AST into a more usable form, processing the ast the way I want, then rewriting the AST back into a "normal" ocamlformat ast. For instance, the fact that functions are represented by 3 nodes (Pexp_function, Pexp_let, Pexp_letop), that opens are represented by 2 nodes, that applications are represented by 4 nodes (Pexp_apply, Pexp_infix, Pexp_prefix, Pexp_indexedop and potentially Pexp_letop) complicate things significantly.

Comments is a much more complex story. Representing comments at every nodes in the AST was deemed too hard since the beginning.

This can be seen with this thought experimental: imagine that you're combining files with print (parse Structure file1 @ parse Structure file2).

In theory, you can concatenate the Cmts.t as well and not loose the comments. For the source part, we could make use the filename field of a location but that's just adhoc. We definitely want to get rid of it.

Ok, good to know!

I agree that comments are likely much harder. ocamlformat handles comments in this way:

  1. parse string into ast
  2. assign comments to ast locations
  3. print ast, considering the comments assigned to the locations

Without being very familiar with these steps, I was thinking that perhaps the second step could turn comments into attributes like [@ocamlformat.comment_before] in various places. But it's very possible that it's not close to good enough.

Anyway, I don't anything urgent to solve, it's just that I've been thinking that using loc_ghost is limited in general. Given:

let x = 1 in
let y = 1 in
(x, y)

If you inlined y, you'd end up with an undesired empty line:

let x = 1 in

(x, 1)

because the position between in and the tuple are 2 lines apart, and yet every node has loc_ghost=false.
Whereas if you Pexp_let stored both the loc_of_in and also loc_of_body (which is redundant with the body, in the ast produced by the parser, but not when you start to modify the ast), then this spurious empty line wouldn't show up.

But keep in mind that OCamlformat is a string -> string function. In a much simpler use-case than yours I made it a string -> (ast -> ast) -> string function and haven't had to deal with source and comments. Could that help ?

Oh yeah, that's what I'm doing as well, but it sometimes causes problems, like the exceptions fixed in this pull request. But I was thinking more generally about "what would it take for ocamlformat to fully support printing modified asts". If the modifications are simple, ocamlformat is pretty reasonable already.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it only requires capturing the relevant information from the concrete syntax into the tree.

That's exactly what we called "closer to the concrete syntax" so far. What we called "CST" in older issues is not a tree full of lexer tokens and snippets of source code.

We found that having different nodes that represent things with similar semantic but different syntax to be extremely useful (Pexp_function, Pexp_let, Pexp_letop or Pexp_apply, Pexp_infix, Pexp_prefix, Pexp_indexedop).
We could change ocamlformat's AST again if that could help your usecase, though we don't want to make ocamlformat harder to maintain.

because the position between in and the tuple are 2 lines apart, and yet every node has loc_ghost=false.

I don't think loc_ghost is relevant here. This is a case where we tried to not use Source and used line numbers instead. To fix this issue, we could perhaps record whether there's an empty line by adding a boolean into the Pexp_let.

@v-gb v-gb force-pushed the push-mwmryunvouso branch from 6d892ea to 3445f8f Compare March 5, 2025 23:04
(* Ideally we'd improve two things about this test:
- check the new string parses, to the same AST as the original one
- use ppx_expect, so we have a nicer workflow and more readable
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer cram tests. ppx_expect requires defining a library in Dune, which becomes installable and is annoying to deal with during releases. You can build an executable with Dune that uses ocamlformat-lib within a cram test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused. Libraries are only installable if they have a public name, isn't it?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Libraries become installable (even private libraries) when they have a (package) field. This field might not be needed for ocamlformat right now but I had enough problems with ppx_expect and so much success with cram tests that I really don't want to go back.

@v-gb v-gb force-pushed the push-mwmryunvouso branch from 3445f8f to 5515457 Compare March 6, 2025 15:34
Copy link
Collaborator

@Julow Julow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@Julow Julow merged commit 9a050fb into ocaml-ppx:main Mar 10, 2025
8 of 9 checks passed
davesnx pushed a commit to ocaml-mlx/ocamlformat-mlx that referenced this pull request May 6, 2025
Julow added a commit to Julow/opam-repository that referenced this pull request Oct 24, 2025
….28.1)

CHANGES:

### Highlight

- \* Support for OCaml 5.4
  (ocaml-ppx/ocamlformat#2717, ocaml-ppx/ocamlformat#2720, ocaml-ppx/ocamlformat#2732, ocaml-ppx/ocamlformat#2733, ocaml-ppx/ocamlformat#2735, @Julow, @Octachron, @cod1r, @EmileTrotignon)
  OCamlformat now supports OCaml 5.4 syntax.
  Module packing of the form `((module M) : (module S))` are no longer
  rewritten to `(module M : S)` because these are now two different syntaxes.

- \* Reduce indentation after `|> map (fun` (ocaml-ppx/ocamlformat#2694, @EmileTrotignon)
  Notably, the indentation no longer depends on the length of the infix
  operator, for example:
  ```ocaml
  (* before *)
  v
  |>>>>>> map (fun x ->
              x )
  (* after *)
  v
  |>>>>>> map (fun x ->
      x )
  ```
  `@@ match` can now also be on one line.

### Added

- Added option `module-indent` option (ocaml-ppx/ocamlformat#2711, @HPRIOR) to control the indentation
  of items within modules. This affects modules and signatures. For example,
  module-indent=4:
  ```ocaml
  module type M = sig
      type t

      val f : (string * int) list -> int
  end
  ```

- `exp-grouping=preserve` is now the default in `default` and `ocamlformat`
  profiles. This means that its now possible to use `begin ... end` without
  tweaking ocamlformat. (ocaml-ppx/ocamlformat#2716, @EmileTrotignon)

### Deprecated

- Starting in this release, ocamlformat can use cmdliner >= 2.0.0. When that is
  the case, the tool no longer accepts unambiguous option names prefixes. For
  example, `--max-iter` is not accepted anymore, you have to pass the full
  option `--max-iters`. This does not apply to the keys in the `.ocamlformat`
  configuration files, which have always required the full name.
  See dbuenzli/cmdliner#200.
  (ocaml-ppx/ocamlformat#2680, @emillon)

### Changed

- \* The formatting of infix extensions is now consistent with regular
  formatting by construction. This reduces indentation in `f @@ match%e`
  expressions to the level of indentation in `f @@ match`. Other unknown
  inconsistencies might also be fixed. (ocaml-ppx/ocamlformat#2676, @EmileTrotignon)

- \* The spacing of infix attributes is now consistent across keywords. Every
  keyword but `begin` `function`, and `fun` had attributes stuck to the keyword:
  `match[@A]`, but `fun [@A]`. Now its also `fun[@A]`. (ocaml-ppx/ocamlformat#2676, @EmileTrotignon)

- \* The formatting of`let a = b in fun ...` is now consistent with other
  contexts like `a ; fun ...`. A check for the syntax `let a = fun ... in ...`
  was made more precise. (ocaml-ppx/ocamlformat#2705, @EmileTrotignon)

- \* `|> begin`, `~arg:begin`, `begin if`, `lazy begin`, `begin match`,
  `begin fun` and `map li begin fun`  can now be printed on the same line, with
  one less indentation level for the body of the inner expression.
  (ocaml-ppx/ocamlformat#2664, ocaml-ppx/ocamlformat#2666, ocaml-ppx/ocamlformat#2671, ocaml-ppx/ocamlformat#2672, ocaml-ppx/ocamlformat#2681, ocaml-ppx/ocamlformat#2685, ocaml-ppx/ocamlformat#2693, @EmileTrotignon)
  For example :
  ```ocaml
  (* before *)
  begin
    fun x ->
      some code
  end
  (* after *)
  begin fun x ->
    some code
  end
  ```

- \* `break-struct=natural` now also applies to `sig ... end`. (ocaml-ppx/ocamlformat#2682, @EmileTrotignon)

### Fixed

- Fixed `wrap-comments=true` not working with the janestreet profile (ocaml-ppx/ocamlformat#2645, @Julow)
  Asterisk-prefixed comments are also now formatted the same way as with the
  default profile.

- Fixed `nested-match=align` not working with `match%ext` (ocaml-ppx/ocamlformat#2648, @EmileTrotignon)

- Fixed the AST generated for bindings of the form `let pattern : type = function ...`
  (ocaml-ppx/ocamlformat#2651, @v-gb)

- Print valid syntax for the corner case (1).a (ocaml-ppx/ocamlformat#2653, @v-gb)

- `Ast_mapper.default_mapper` now iterates on the location of `in` in `let+ .. in ..`
  (ocaml-ppx/ocamlformat#2658, @v-gb)

- Fix missing parentheses in `let+ (Cstr _) : _ = _` (ocaml-ppx/ocamlformat#2661, @Julow)
  This caused a crash as the generated code wasn't valid syntax.

- Fix bad indentation of `let%ext { ...` (ocaml-ppx/ocamlformat#2663, @EmileTrotignon)
  with `dock-collection-brackets` enabled.

- ocamlformat is now more robust when used as a library to print modified ASTs
  (ocaml-ppx/ocamlformat#2659, @v-gb)

- Fix crash due to edge case with asterisk-prefixed comments (ocaml-ppx/ocamlformat#2674, @Julow)

- Fix crash when formatting `mld` files that cannot be lexed as ocaml (e.g.
  containing LaTeX or C code) (ocaml-ppx/ocamlformat#2684, @emillon)

- \* Fix double parens around module constraint in functor application :
  `module M = F ((A : T))` becomes `module M = F (A : T)`. (ocaml-ppx/ocamlformat#2678, @EmileTrotignon)

- Fix misplaced `;;` due to interaction with floating doc comments.
  (ocaml-ppx/ocamlformat#2691, @EmileTrotignon)

- The formatting of attributes of expression is now aware of the attributes
  infix or postix positions: `((fun [@A] x -> y) [@b])` is formatted without
  moving attributes. (ocaml-ppx/ocamlformat#2676, @EmileTrotignon)

- `begin%e ... end` and `begin [@A] ... end` nodes are always preserved.
  (ocaml-ppx/ocamlformat#2676, @EmileTrotignon)

- `begin end` syntax for `()` is now preserved. (ocaml-ppx/ocamlformat#2676, @EmileTrotignon)

- Fix a crash on `type 'a t = A : 'a. {a: 'a} -> 'a t`. (ocaml-ppx/ocamlformat#2710, @EmileTrotignon)

- Fix a crash where `type%e nonrec t = t` was formatted as `type nonrec%e t = t`,
  which is invalid syntax. (ocaml-ppx/ocamlformat#2712, @EmileTrotignon)

- Fix commandline parsing being quadratic in the number of arguments
  (ocaml-ppx/ocamlformat#2724, @let-def)

- \* Fix `;;` being added after a documentation comment (ocaml-ppx/ocamlformat#2683, @EmileTrotignon)
  This results in more `;;` being inserted, for example:
  ```ocaml
  (* before *)
  print_endline "foo"
  let a = 3

  (* after *)
  print_endline "foo" ;;
  let a = 3
  ```

- Fix dropped comment in `if then (* comment *) begin .. end` (ocaml-ppx/ocamlformat#2734, @Julow)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants