Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dune-project parser does not work if dune-project has a UTF-8 byte-order mark #9396

Open
jonahbeckford opened this issue Dec 6, 2023 · 9 comments

Comments

@jonahbeckford
Copy link
Collaborator

jonahbeckford commented Dec 6, 2023

Edits

  1. Narrowed the issue to just supporting UTF-8 with BOM (ie. supporting the UTF-8 spec). Previously the original problem report was UTF-16 + BOM but as mentioned by @Alizter OCaml does not support UTF-16.

Expected Behavior

In PowerShell on Windows:

echo "(lang dune 3.12)" > dune-project
dune build

I would expect the project to build.

Actual Behavior

File "dune-project", line 1, characters 0-34:
1 | ��(lang dune 3.12)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Error: Invalid first line, expected: (lang <lang> <version>)

The reason? Windows conventionally has a byte-order mark in its Unicode files. The built-in PowerShell (or 5.x or less) inserts the BOMs; newer Command Prompts do not.

> with-dkml file dune-project
dune-project: Unicode text, UTF-16, little-endian text, with CRLF line terminators

> with-dkml od -xc dune-project
0000000    feff    0028    006c    0061    006e    0067    0020    0064
        377 376   (  \0   l  \0   a  \0   n  \0   g  \0      \0   d  \0
0000020    0075    006e    0065    0020    0033    002e    0031    0032
          u  \0   n  \0   e  \0      \0   3  \0   .  \0   1  \0   2  \0
0000040    0029    000d    000a
          )  \0  \r  \0  \n  \0
0000046

Specifications

  • Version of dune (output of dune --version): 3.12.1
  • Operating system (distribution and version): Windows 11 with PowerShell 5.1
@rgrinberg rgrinberg added the bug label Dec 6, 2023
@nojb
Copy link
Collaborator

nojb commented Dec 6, 2023

> with-dkml file dune-project
dune-project: Unicode text, UTF-16, little-endian text, with CRLF line terminators

If this is accurate, then the BOM is only the start; I don't think Dune is able to decode UTF-16 text at the moment.

@rgrinberg
Copy link
Member

I imagine this is an issue for regular dune files as well. In any case, it seems reasonable for us to handle it.

@jonahbeckford jonahbeckford changed the title dune-project parser does not work if dune-project has a byte-order mark dune-project parser does not work if dune-project has a byte-order mark or is in UTF-16 Dec 6, 2023
@jonahbeckford
Copy link
Collaborator Author

Okay, there are two separate problems, both exposed on Windows. The first is UTF-16 encoding is not yet supported. The second is that BOM is not supported, even in UTF-8.

For example:

> with-dkml file dune-project
dune-project: Unicode text, UTF-8 (with BOM) text, with no line terminators

> with-dkml od -xc dune-project
0000000    bbef    28bf    616c    676e    6420    6e75    2065    2e33
        357 273 277   (   l   a   n   g       d   u   n   e       3   .
0000020    3231    0029
          1   2   )
0000023

>  dune build
File "dune-project", line 1, characters 0-19:
1 | (lang dune 3.12)
    ^^^^^^^^^^^^^^^^^^^
Error: Invalid first line, expected: (lang <lang> <version>)

@nojb
Copy link
Collaborator

nojb commented Dec 7, 2023

We will need an UTF decoder. The best one is in the standard library, but only on 4.14 and later.

@Alizter
Copy link
Collaborator

Alizter commented Dec 7, 2023

Does OCaml even compile with utf16 .ml files? Most other compilers I know, gcc, rustc don't accept any other encoding than utf8. Only ones I am aware of are Microsoft compilers and Java. It doesn't seem like a bug but more like a feature request to accept dune files in utf16 encoding.

Also from a cross platform perspective, if you want the dune files to work on other platforms, anything other than utf8 seems like a foot gun.

I've read that powershell allows you to configure the default piping output from utf16 to utf8. Apparently that is also the default on newer versions.

I'm tempted to say that echo is not the correct tool in this case, and that you should be passing PS arguments fixing the encoding. If this is for scripts, then this is probably fine, but for users, I wouldn't recommend writing Dune files this way and simply using an editor would be an improvement.

In the mean time, this is a good opportunity to improve the error message and say that we only accept utf8. Also updating the docs is a good idea.

@jonahbeckford
Copy link
Collaborator Author

Does OCaml even compile with utf16 .ml files?

I never even considered that. After testing UTF-16 ... no, it does not. But neither does it work with BOM-encoded UTF-8 ... which is valid UTF-8.

Anyway, Dune should not need to support encodings that OCaml does not support. Sadly, I can't find a specification for what encoding OCaml supports! I vaguely remember some thread that OCaml doesn't support UTF-8 source files ... I think it was only ISO 8859-1 or some Latin variant.

Flow chart:

Aside: C has no spec for source code encoding, so gcc expecting UTF-8 makes sense. rustc was a better example. But more popular languages (Javascript, C#, Python and as you mentioned Java) do support other source code encodings.

@nojb
Copy link
Collaborator

nojb commented Dec 9, 2023

Sadly, I can't find a specification for what encoding OCaml supports! I vaguely remember some thread that OCaml doesn't support UTF-8 source files ... I think it was only ISO 8859-1 or some Latin variant.

OCaml has no spec for source code encoding: source file contents are interpreted as raw bytes. There is however an ongoing discussion to require UTF-8: see ocaml/ocaml#1802 and the links mentioned there.

@jonahbeckford jonahbeckford changed the title dune-project parser does not work if dune-project has a byte-order mark or is in UTF-16 dune-project parser does not work if dune-project has a UTF-8 byte-order mark Dec 11, 2023
@jonahbeckford
Copy link
Collaborator Author

Okay ... will have you all on the Dune team decide whether UTF-8 (including optional BOM) is what Dune supports.

@rgrinberg
Copy link
Member

We'll support whatever OCaml does. I agree with your statement here:

Anyway, Dune should not need to support encodings that OCaml does not support

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants