tr: implement support for non-UTF-8 input #354

Merged

andrewliebenow
Contributor

No description provided.

@andrewliebenow
Contributor Author

Other implementations either handle only UTF-8 input (FreeBSD/bsdutils) or always operate byte by byte, which means they cannot handle Unicode characters that have a multi-byte UTF-8 representation.
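To illustrate why byte-by-byte processing breaks on multi-byte characters, here is a minimal sketch of a byte-oriented delete. The function name `delete_bytewise` is hypothetical and not taken from any of the implementations mentioned; it only assumes that each byte of the set operand is treated as an independent set member.

```rust
/// Hypothetical byte-oriented delete: every byte of `set` is treated as
/// its own set member, with no awareness of UTF-8 sequences.
fn delete_bytewise(input: &[u8], set: &[u8]) -> Vec<u8> {
    input
        .iter()
        .copied()
        .filter(|b| !set.contains(b))
        .collect()
}
```

With the set `ᚱ` (bytes `E1 9A B1`) and the input `ᚱᚢ` (bytes `E1 9A B1 E1 9A A2`), the shared lead bytes `E1 9A` are stripped from both characters, leaving the lone continuation byte `A2`, which is not valid UTF-8.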

This implementation is about 2.5x as fast as the FreeBSD/bsdutils one and a bit faster than the one in uutils' coreutils, but slower than BusyBox and GNU Core Utilities (which don't have to look multiple bytes ahead).

It also looks to be about twice as fast as the tr code was before I touched it.

❯ coreutils printf 'AA ᚱᚱ \xFF\xFF \xFE\xFE 11 ᚢᚢ BB ᛇᛇ 22\n' | ./target/release/tr -d -s 'A ᚱ \377 \376' '1 ᚢ B ᛇ 2'
1ᚢBᛇ2

❯ coreutils printf 'AA ᚱᚱ \xFF\xFF \xFE\xFE 11 ᚢᚢ BB ᛇᛇ 22\n' | /usr/bin/tr -d -s 'A ᚱ \377 \376' '1 ᚢ B ᛇ 2'
1�B����2

❯ coreutils printf 'AA ᚱᚱ \xFF\xFF \xFE\xFE 11 ᚢᚢ BB ᛇᛇ 22\n' | coreutils tr -d -s 'A ᚱ \377 \376' '1 ᚢ B ᛇ 2'
1�B����2

❯ coreutils printf 'AA ᚱᚱ \xFF\xFF \xFE\xFE 11 ᚢᚢ BB ᛇᛇ 22\n' | busybox tr -d -s 'A ᚱ \377 \376' '1 ᚢ B ᛇ 2'
1�B����2

❯ coreutils printf 'AA ᚱᚱ \xFF\xFF \xFE\xFE 11 ᚢᚢ BB ᛇᛇ 22\n' | /usr/ucb/tr -d -s 'A ᚱ \377 \376' '1 ᚢ B ᛇ 2'
tr: Invalid or incomplete multibyte or wide character

The obvious downsides are about 550 more lines of code (excluding new tests) and the added complexity of the multi-byte lookahead needed to support both binary and UTF-8 input.
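As a rough illustration of what handling mixed binary and UTF-8 input involves, the following sketch splits a byte stream into decoded characters and raw bytes using the standard library's `std::str::from_utf8` and `Utf8Error`. It is not the code from this PR: the `Unit` type and the `split_units` and `delete_units` functions are hypothetical names, and the real implementation may look ahead and buffer quite differently.

```rust
/// One unit of input: a decoded Unicode character, or a raw byte that is
/// not part of any valid UTF-8 sequence. (Hypothetical type.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Unit {
    Char(char),
    Byte(u8),
}

/// Split `input` into units: valid UTF-8 runs are decoded to characters,
/// and bytes that cannot be decoded are passed through one by one.
fn split_units(mut input: &[u8]) -> Vec<Unit> {
    let mut units = Vec::new();
    while !input.is_empty() {
        match std::str::from_utf8(input) {
            Ok(s) => {
                units.extend(s.chars().map(Unit::Char));
                break;
            }
            Err(e) => {
                // Everything before `valid_up_to()` is valid UTF-8.
                let (valid, rest) = input.split_at(e.valid_up_to());
                let s = std::str::from_utf8(valid).expect("prefix is valid UTF-8");
                units.extend(s.chars().map(Unit::Char));
                // `error_len()` is `None` when the input ends in the middle
                // of a sequence; treat all remaining bytes as raw in that case.
                let bad = e.error_len().unwrap_or(rest.len());
                units.extend(rest[..bad].iter().copied().map(Unit::Byte));
                input = &rest[bad..];
            }
        }
    }
    units
}

/// Delete every unit that appears in `set`, re-encoding what remains.
fn delete_units(input: &[u8], set: &[Unit]) -> Vec<u8> {
    let mut out = Vec::new();
    for unit in split_units(input) {
        if set.contains(&unit) {
            continue;
        }
        match unit {
            Unit::Char(c) => {
                let mut buf = [0u8; 4];
                out.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
            }
            Unit::Byte(b) => out.push(b),
        }
    }
    out
}
```

Because set membership is checked per unit rather than per byte, deleting `ᚱ` and the raw byte `\377` from `A ᚱ \377 ᚢ B`-style input leaves `ᚢ` intact even though it shares its first two bytes with `ᚱ`.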

@andrewliebenow andrewliebenow force-pushed the tr-mixed-binary-utf-8-processing branch from bf22f23 to d3948df Compare October 24, 2024 05:25
@andrewliebenow andrewliebenow force-pushed the tr-mixed-binary-utf-8-processing branch from 6a2e291 to f50e7f2 Compare October 25, 2024 22:58
@jgarzik jgarzik merged commit f957698 into rustcoreutils:main Oct 26, 2024
2 checks passed