@micolous micolous commented Sep 3, 2025

Fix a number of issues relating to nom decoding [u8] buffers as UTF-8 when it shouldn't (#1679):

  • escaped, escaped_transform now accept control_char: impl AsChar, rather than char.

    This allows the functions to be used with a u8 or b'', which is useful for parsing text-like files that contain binary or non-UTF-8 data (like Lua).

  • escaped, escaped_transform, satisfy, one_of and none_of now iterate over individual bytes when using a [u8] input, rather than attempting to interpret bytes as UTF-8 sequences.
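
To illustrate the byte-wise behaviour described above, here is a minimal std-only sketch. It is not nom's implementation: `escaped_prefix_len` is a hypothetical helper, and its signature is far simpler than that of nom's `escaped`. It scans a `[u8]` input one byte at a time with a `u8` control character, so a stray non-UTF-8 byte such as `0xE9` is treated as ordinary data instead of as the start of a UTF-8 sequence.

```rust
/// Hypothetical helper (not part of nom): returns the length of the prefix of
/// `input` made up of "normal" bytes and `control`-escaped bytes.
/// The input is scanned strictly byte by byte and is never decoded as UTF-8.
fn escaped_prefix_len(input: &[u8], control: u8, is_normal: impl Fn(u8) -> bool) -> usize {
    let mut i = 0;
    while i < input.len() {
        let b = input[i];
        if b == control {
            // An escape needs one following byte; a trailing control byte
            // is left unconsumed.
            if i + 1 >= input.len() {
                break;
            }
            // Skip the control byte and the byte it escapes.
            i += 2;
        } else if is_normal(b) {
            i += 1;
        } else {
            // First byte that is neither normal nor an escape: stop here.
            break;
        }
    }
    i
}
```

With the input `b"caf\xe9\\\"x\"tail"`, a control byte of `b'\\'`, and "normal" defined as any byte other than `b'"'`, the scan stops at the first unescaped quote, even though the buffer as a whole is not valid UTF-8.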

I've added many tests demonstrating edge cases of these functions. Many existing tests incorrectly used str for [u8] inputs, which can lead to unexpected behaviour when handling binary data.

This will probably break API compatibility for a parser that takes a [u8] buffer as input and assumes everything is decoded as UTF-8. I'd argue this is incorrect usage anyway – those should be using str.
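
As a rough sketch of what that difference means for callers (std-only, with hypothetical helper names, and not nom code): a byte-oriented step sees the individual bytes of a multi-byte character, while a text-oriented step sees the whole character. A parser that wants characters would validate the buffer once with `std::str::from_utf8` and then work on the resulting `str`.

```rust
/// Hypothetical byte-oriented step: the first byte and the remaining bytes.
fn first_byte(input: &[u8]) -> Option<(u8, &[u8])> {
    let (&b, rest) = input.split_first()?;
    Some((b, rest))
}

/// Hypothetical text-oriented step: the first character and the remaining text.
fn first_char(input: &str) -> Option<(char, &str)> {
    let c = input.chars().next()?;
    // `len_utf8` gives the number of bytes this character occupies.
    Some((c, &input[c.len_utf8()..]))
}
```

For the UTF-8 bytes of `"é!"` (`0xC3 0xA9 0x21`), `first_byte` yields `0xC3` and leaves two bytes, whereas `first_char` on the validated `str` yields `'é'` and leaves `"!"`.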

There are probably other parts of nom that assume [u8] is encoded as UTF-8, but searching for these is hard.
