Skip to content

UTF-8 parsing with state machine #59399

Closed
@jocutajar

Description

@jocutajar

Hi, I'd like to propose a cleaner approach (IMHO) to parse UTF-8 in core::str:from_utf8 and friends. Not confident to optimize the ASCII fast forward in unsafe, but I think a state machine would describe the problem better and is more readable.

As a bonus, the state machine could be made available for other more complex parsers. My case is for ANSI escape sequences which may not be valid UTF-8 but are embedded in UTF-8 streams. The machine exposes the number of incomplete() and needed() bytes, it can count the valid chars seen(). As such it would be very useful for general byte stream processing. It works with no_std, depending only on core::str.

The essence is quite simple:


impl Utf8Machine {
    pub fn turn(&mut self, input: &u8) -> Utf8Act {
        /*
         * legal utf-8 byte sequence
         * http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf - page 94
         *
         *  Code Points        1st       2s       3s       4s
         * U+0000..U+007F     00..7F
         * U+0080..U+07FF     C2..DF   80..BF
         * U+0800..U+0FFF     E0       A0..BF   80..BF
         * U+1000..U+CFFF     E1..EC   80..BF   80..BF
         * U+D000..U+D7FF     ED       80..9F   80..BF
         * U+E000..U+FFFF     EE..EF   80..BF   80..BF
         * U+10000..U+3FFFF   F0       90..BF   80..BF   80..BF
         * U+40000..U+FFFFF   F1..F3   80..BF   80..BF   80..BF
         * U+100000..U+10FFFF F4       80..8F   80..BF   80..BF
         */
        use Utf8Act::*;
        use Utf8State::*;
        let seen = self.seen;
        let done = (self.seen + 1, Normal, Done);
        let (seen, state, act) = match (&self.state, input) {
            (Normal, 0x00...0x7F) => done,
            (Normal, 0x80...0xC1) => (seen, Normal, Invalid(1)),
            (Normal, 0xC2...0xDF) => (seen, C___x_, NeedMore(1)),
            (Normal, 0xE0...0xE0) => (seen, C__0__, NeedMore(2)),
            (Normal, 0xE1...0xEC) => (seen, C__x__, NeedMore(2)),
            (Normal, 0xED...0xED) => (seen, C__D__, NeedMore(2)),
            (Normal, 0xEE...0xEF) => (seen, C__x__, NeedMore(2)),
            (Normal, 0xF0...0xF0) => (seen, C_0___, NeedMore(3)),
            (Normal, 0xF1...0xF3) => (seen, C_x___, NeedMore(3)),
            (Normal, 0xF4...0xF4) => (seen, C_4___, NeedMore(3)),
            (Normal, 0xF5...0xFF) => (seen, Normal, Invalid(1)),

            (C___x_, 0x80...0xBF) => done,
            (C__xx_, 0x80...0xBF) => done,
            (C_xxx_, 0x80...0xBF) => done,

            (C__0__, 0xA0...0xBF) => (seen, C__xx_, NeedMore(1)),
            (C__D__, 0x80...0x9F) => (seen, C__xx_, NeedMore(1)),
            (C__x__, 0x80...0xBF) => (seen, C__xx_, NeedMore(1)),
            (C_0___, 0x90...0xBF) => (seen, C_xx__, NeedMore(2)),
            (C_4___, 0x80...0x8F) => (seen, C_xx__, NeedMore(2)),
            (C_x___, 0x80...0xBF) => (seen, C_xx__, NeedMore(2)),
            (C_xx__, 0x80...0xBF) => (seen, C_xxx_, NeedMore(1)),

            (C___x_, _) => (seen, Normal, Invalid(2)),
            (C__0__, _) => (seen, Normal, Invalid(2)),
            (C__D__, _) => (seen, Normal, Invalid(2)),
            (C__x__, _) => (seen, Normal, Invalid(2)),
            (C__xx_, _) => (seen, Normal, Invalid(3)),
            (C_0___, _) => (seen, Normal, Invalid(2)),
            (C_4___, _) => (seen, Normal, Invalid(2)),
            (C_x___, _) => (seen, Normal, Invalid(2)),
            (C_xx__, _) => (seen, Normal, Invalid(3)),
            (C_xxx_, _) => (seen, Normal, Invalid(4)),
        };
        self.seen = seen;
        self.state = state;
        act
    }
}

The playground includes applications of the state machine where one could implement the fast forward ASCII optimization.

Metadata

Metadata

Assignees

No one assigned

    Labels

    T-libs-apiRelevant to the library API team, which will review and decide on the PR/issue.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions