Description
Hi, I'd like to propose a cleaner approach (IMHO) to parse UTF-8 in core::str:from_utf8
and friends. Not confident to optimize the ASCII fast forward in unsafe, but I think a state machine would describe the problem better and is more readable.
As a bonus, the state machine could be made available for other more complex parsers. My case is for ANSI escape sequences which may not be valid UTF-8 but are embedded in UTF-8 streams. The machine exposes the number of incomplete()
and needed()
bytes, it can count the valid chars seen()
. As such it would be very useful for general byte stream processing. It works with no_std
, depending only on core::str
.
The essence is quite simple:
impl Utf8Machine {
pub fn turn(&mut self, input: &u8) -> Utf8Act {
/*
* legal utf-8 byte sequence
* http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf - page 94
*
* Code Points 1st 2s 3s 4s
* U+0000..U+007F 00..7F
* U+0080..U+07FF C2..DF 80..BF
* U+0800..U+0FFF E0 A0..BF 80..BF
* U+1000..U+CFFF E1..EC 80..BF 80..BF
* U+D000..U+D7FF ED 80..9F 80..BF
* U+E000..U+FFFF EE..EF 80..BF 80..BF
* U+10000..U+3FFFF F0 90..BF 80..BF 80..BF
* U+40000..U+FFFFF F1..F3 80..BF 80..BF 80..BF
* U+100000..U+10FFFF F4 80..8F 80..BF 80..BF
*/
use Utf8Act::*;
use Utf8State::*;
let seen = self.seen;
let done = (self.seen + 1, Normal, Done);
let (seen, state, act) = match (&self.state, input) {
(Normal, 0x00...0x7F) => done,
(Normal, 0x80...0xC1) => (seen, Normal, Invalid(1)),
(Normal, 0xC2...0xDF) => (seen, C___x_, NeedMore(1)),
(Normal, 0xE0...0xE0) => (seen, C__0__, NeedMore(2)),
(Normal, 0xE1...0xEC) => (seen, C__x__, NeedMore(2)),
(Normal, 0xED...0xED) => (seen, C__D__, NeedMore(2)),
(Normal, 0xEE...0xEF) => (seen, C__x__, NeedMore(2)),
(Normal, 0xF0...0xF0) => (seen, C_0___, NeedMore(3)),
(Normal, 0xF1...0xF3) => (seen, C_x___, NeedMore(3)),
(Normal, 0xF4...0xF4) => (seen, C_4___, NeedMore(3)),
(Normal, 0xF5...0xFF) => (seen, Normal, Invalid(1)),
(C___x_, 0x80...0xBF) => done,
(C__xx_, 0x80...0xBF) => done,
(C_xxx_, 0x80...0xBF) => done,
(C__0__, 0xA0...0xBF) => (seen, C__xx_, NeedMore(1)),
(C__D__, 0x80...0x9F) => (seen, C__xx_, NeedMore(1)),
(C__x__, 0x80...0xBF) => (seen, C__xx_, NeedMore(1)),
(C_0___, 0x90...0xBF) => (seen, C_xx__, NeedMore(2)),
(C_4___, 0x80...0x8F) => (seen, C_xx__, NeedMore(2)),
(C_x___, 0x80...0xBF) => (seen, C_xx__, NeedMore(2)),
(C_xx__, 0x80...0xBF) => (seen, C_xxx_, NeedMore(1)),
(C___x_, _) => (seen, Normal, Invalid(2)),
(C__0__, _) => (seen, Normal, Invalid(2)),
(C__D__, _) => (seen, Normal, Invalid(2)),
(C__x__, _) => (seen, Normal, Invalid(2)),
(C__xx_, _) => (seen, Normal, Invalid(3)),
(C_0___, _) => (seen, Normal, Invalid(2)),
(C_4___, _) => (seen, Normal, Invalid(2)),
(C_x___, _) => (seen, Normal, Invalid(2)),
(C_xx__, _) => (seen, Normal, Invalid(3)),
(C_xxx_, _) => (seen, Normal, Invalid(4)),
};
self.seen = seen;
self.state = state;
act
}
}
The playground includes applications of the state machine where one could implement the fast forward ASCII optimization.