Introduction
The current `tr` implementation assumes that the data fed into standard input is valid UTF-8.
Most implementations of `tr` work on a byte-by-byte basis, which allows non-UTF-8 data to be handled.
BusyBox, GNU Core Utilities, and uutils' coreutils all behave this way:
❯ printf '\xFF\n' | /usr/bin/tr '\377' 'A'
A
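A minimal sketch of what byte-by-byte translation could look like, assuming the two operands have already been parsed into byte slices. The names `build_byte_table` and `translate_bytes` are hypothetical and are not taken from any of the implementations above:

```rust
/// Build a 256-entry lookup table that maps every possible byte to its
/// replacement. Bytes not listed in `from` map to themselves.
/// Assumption: `from` and `to` are paired positionally, and any surplus
/// bytes in the longer operand are ignored.
fn build_byte_table(from: &[u8], to: &[u8]) -> [u8; 256] {
    let mut table = [0u8; 256];
    for (i, slot) in table.iter_mut().enumerate() {
        *slot = i as u8; // identity mapping by default
    }
    for (&f, &t) in from.iter().zip(to.iter()) {
        table[f as usize] = t;
    }
    table
}

/// Translate the input one byte at a time. No decoding takes place, so
/// the input does not need to be valid UTF-8.
fn translate_bytes(input: &[u8], table: &[u8; 256]) -> Vec<u8> {
    input.iter().map(|&b| table[b as usize]).collect()
}
```

With a table built from `&[0xFF]` and `b"A"`, the input `b"\xFF\n"` comes out as `b"A\n"`, which matches the output shown above.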
Multibyte UTF-8 data, however, is not handled correctly, because each character is processed as a sequence of separate bytes:
❯ printf '%s\n' 'ᛆᚠᛏᚢᛆᛘᚢᚦᛌᛏᚭᚿᛏᛆᚱᚢᚿᛆᛧᚦᛆᛧ' | /usr/bin/tr -d 'ᛆᚠ'
����������������
bsdutils (a port of the FreeBSD tools to Linux) handles multibyte UTF-8 input correctly:
❯ printf '%s\n' 'ᛆᚠᛏᚢᛆᛘᚢᚦᛌᛏᚭᚿᛏᛆᚱᚢᚿᛆᛧᚦᛆᛧ' | /usr/ucb/tr -d 'ᛆᚠ'
ᛏᚢᛘᚢᚦᛌᛏᚭᚿᛏᚱᚢᚿᛧᚦᛧ
But that means it can't handle non-UTF-8 input:
❯ printf '\xFF\n' | /usr/ucb/tr '\377' 'A'
tr: Invalid or incomplete multibyte or wide character
The posixutils implementation currently behaves like the bsdutils implementation, but it panics on non-UTF-8 input instead of printing a proper error message:
❯ printf '\xFF\n' | ./target/release/tr '\377' 'A'
thread 'main' panicked at text/./tr.rs:1200:25:
assertion failed: leftover_bytes == 0_usize
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
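One possible way to turn the panic into a regular error is to validate the input with `std::str::from_utf8` and report the failure to the caller. This is only a sketch: `decode_input` is a hypothetical helper, the message wording is an assumption, and the real code in `tr.rs` reads its input incrementally, so it would also need to carry an incomplete trailing sequence over to the next read.

```rust
/// Decode a buffer as UTF-8, returning a printable error message instead
/// of panicking when the bytes are not valid UTF-8.
/// Assumption: the whole input is available in one buffer.
fn decode_input(input: &[u8]) -> Result<&str, String> {
    std::str::from_utf8(input).map_err(|err| {
        format!(
            "tr: input is not valid UTF-8 (invalid sequence at byte offset {})",
            err.valid_up_to()
        )
    })
}
```

The caller could print the returned message to standard error and exit with a non-zero status.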
Options
In no particular order:
- Keep the current implementation (assume UTF-8 input), but fix the error message when `tr` is fed non-UTF-8 data.
  - Pros: supports some use cases that the widely used implementations do not
  - Cons: users may expect to be able to use `tr` to perform arbitrary byte transformations (not thinking of their input as text)
- Mirror most implementations and just process input on a byte-by-byte basis.
  - Pros: better performance, simpler code, behaves like the most widely used implementations
  - Cons: can't perform transformations of multibyte text
- Support both byte-by-byte processing and UTF-8 input, either by starting in UTF-8 mode and then switching to byte-by-byte mode when non-UTF-8 data is detected, or by providing additional/non-standard arguments.
  - Pros: supports basically every use case
  - Cons: code complexity, deviation from existing implementations
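To make the third option more concrete, here is a rough sketch of the "start in UTF-8 mode, fall back to bytes" idea for the `-d` case. It is simplified to decide once for the whole buffer rather than switching mid-stream, and `Mode` and `delete_hybrid` are hypothetical names, not existing posixutils code:

```rust
/// Which processing mode was used for the input.
#[derive(Debug, PartialEq)]
enum Mode {
    Utf8,
    Bytes,
}

/// Delete every member of `set` from `input`.
/// Valid UTF-8 input is filtered character by character; anything else
/// is filtered byte by byte, using the individual bytes of `set`.
/// Assumption: `set` has already had escapes such as `\377` expanded.
fn delete_hybrid(input: &[u8], set: &str) -> (Mode, Vec<u8>) {
    match std::str::from_utf8(input) {
        Ok(text) => {
            // UTF-8 mode: compare whole characters.
            let out: String = text.chars().filter(|c| !set.contains(*c)).collect();
            (Mode::Utf8, out.into_bytes())
        }
        Err(_) => {
            // Byte mode: compare single bytes, as the byte-oriented
            // implementations do.
            let set_bytes = set.as_bytes();
            let out = input
                .iter()
                .copied()
                .filter(|b| !set_bytes.contains(b))
                .collect();
            (Mode::Bytes, out)
        }
    }
}
```

Even this small version needs two code paths for a single operation, which illustrates the complexity cost listed above.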
I don't think there's one clear right option here, so please chime in if you have any suggestions or thoughts.