Opened on Aug 21, 2023
It is relatively easy to mess up and include non-printable characters in docstrings, e.g.:
"""
This function works on the NULL character `\0`.
"""
function foo end
The `\0` ends up as an actual `0x00` character in the string, and will not be printed, e.g. in the REPL. There are about 10 cases of this in the ecosystem; two random examples:
- https://github.com/tpapp/StataDTAFiles.jl/blob/v0.3.0/src/StataDTAFiles.jl#L103
- https://github.com/iamed2/LibPQ.jl/blob/e39283e5d1056e5eb0ab15c893ee9447c2d9c96f/src/parsing.jl#L97
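To make the difference concrete, here is a minimal sketch comparing the escaped and unescaped forms at the byte level. The variable names and the `nul_count` helper are made up for illustration; it only uses plain string literals, not docstrings.

```julia
# In a regular string literal, `\0` is an escape sequence and becomes the single byte 0x00.
escaped = "This function works on the NULL character `\0`."

# Two ways to keep a literal backslash followed by `0` (bytes 0x5c 0x30):
doubled = "This function works on the NULL character `\\0`."    # escape the backslash
rawstr  = raw"This function works on the NULL character `\0`."  # raw string literal

# Hypothetical helper: count the 0x00 bytes in a string.
nul_count(s::AbstractString) = count(==(0x00), codeunits(s))

println(nul_count(escaped))  # 1
println(nul_count(doubled))  # 0
println(doubled == rawstr)   # true
```

The raw string and the doubled backslash produce the same text, so either is probably what the docstring author meant in the cases linked above.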
This can also cause problems downstream in tooling, with e.g. Documenter currently just emitting `\0` characters into the HTML (JuliaDocs/Documenter.jl#2226), which in turn can cause other problems (e.g. Gumbo does not seem to like `\0` in HTML).
Opening this as a discussion / tracking issue, to figure out if there is anything we can do. The point here is that having the `0x00` in the string is unlikely to ever be the intent of the docstring author; more likely they wanted `0x5c 0x30` (a literal backslash followed by `0`).
- We could warn or disallow certain characters in docstrings? Disallowing would be breaking, since as seen above, it already exists in the wild.
- We could somehow handle it in the Markdown standard library? Or maybe when you pull it up in the REPL? In the HTML spec, `U+0000` sometimes gets replaced with `U+FFFD`. This is also how the CommonMark spec handles it (and therefore CommonMark.jl). So an option here would be to fix it in the Markdown parser.
- We could try to rely on external tooling, like semgrep or docstring linting (@tecosaur), to catch these issues.
- Or we don't do anything, accept that docstrings can also contain those characters, and just handle it in the tooling?
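As a rough sketch of what the warning and replacement options could look like at the string level: the helpers below (`is_suspicious`, `find_suspicious`, `sanitize_nul`) are hypothetical names, not existing functions in Markdown, Documenter, or any linter, and the choice of which control characters count as suspicious is an assumption.

```julia
# Hypothetical helpers for illustration only; not part of any existing package.

# Assumption: control characters other than common whitespace are suspicious.
is_suspicious(c::AbstractChar) = iscntrl(c) && !(c in ('\n', '\t', '\r'))

# Lint-style check: collect (index, character) for every suspicious character.
find_suspicious(doc::AbstractString) =
    [(i, c) for (i, c) in pairs(doc) if is_suspicious(c)]

# Parser-style fix: replace U+0000 with U+FFFD, as the CommonMark spec does.
sanitize_nul(doc::AbstractString) = replace(doc, '\0' => '\ufffd')

doc = "This function works on the NULL character `\0`."

hits = find_suspicious(doc)
isempty(hits) || @warn "docstring contains non-printable characters" hits

clean = sanitize_nul(doc)
println(occursin('\0', clean))       # false
println(occursin('\ufffd', clean))   # true
```

The check would fit the linting option, while the replacement is closer to fixing it in the Markdown parser; neither recovers the `\0` text the author most likely intended.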