[lua] Fall back to built-in utf8 module on Lua 5.3+ #12596

Draft
jdonaldson wants to merge 1 commit into HaxeFoundation:development from jdonaldson:lua-utf8-fallback
@jdonaldson
Member

@jdonaldson jdonaldson commented Feb 11, 2026

Summary

  • Adds a runtime shim (_hx_utf8.lua) that pre-populates package.loaded['lua-utf8'] with either the real library or a compat table built from Lua 5.3+'s built-in utf8 module
  • The shim runs before @:luaRequire('lua-utf8') generates require, so it's transparent — no changes to Utf8.hx or String.hx
  • On Lua 5.1/5.2/LuaJIT, behavior is unchanged (no built-in utf8 module, original error preserved)

Follows the same pcall(require, ...) pattern as _hx_bit.lua.
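
The pre-population step described above can be sketched roughly as follows. This is a hypothetical illustration, not the PR's actual _hx_utf8.lua: the function name `install_utf8_shim` is an assumption, and the compat table is left mostly empty.

```lua
-- Hypothetical sketch of a shim in the spirit of _hx_utf8.lua (not the PR's code).
local function install_utf8_shim()
  -- Prefer the real third-party library when it is installed.
  -- A successful require already stores it in package.loaded['lua-utf8'].
  local ok, lib = pcall(require, 'lua-utf8')
  if ok then
    return lib
  end

  -- Lua 5.3+ exposes a global `utf8` table; Lua 5.1/5.2/LuaJIT do not.
  local builtin = rawget(_G, 'utf8')
  if type(builtin) ~= 'table' then
    -- Leave package.loaded untouched so a later require fails as before.
    return nil
  end

  -- Compat table built from the built-in module (most methods omitted here).
  local compat = {
    char = builtin.char,
    codes = builtin.codes,
  }
  package.loaded['lua-utf8'] = compat
  return compat
end

local shim = install_utf8_shim()
```

Because `require` consults `package.loaded` first, any later `require('lua-utf8')` returns whatever table the shim stored there.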

Per tobil4sk's comment: the built-in utf8 module doesn't implement all the methods provided by lua-utf8, but the ones that are provided we can use.

Compat table methods

| Method | Implementation |
| --- | --- |
| len(s,i,j,lax) | utf8.len + lax fallback to #s for invalid UTF-8 |
| char(...) | utf8.char directly |
| codes(s) | utf8.codes directly |
| byte(s,i) | utf8.offset, then utf8.codepoint |
| sub(s,i,j) | utf8.offset for char→byte, then string.sub |
| find(s,pat,init,plain) | utf8.offset for init, string.find, utf8.len for byte→char |
| upper, lower | string.upper/string.lower (ASCII only) |
| gsub, gmatch, match | string.* byte-level fallback |
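
As a rough illustration of the char→byte and byte→char translation listed for byte, sub, and find, here is a minimal sketch. It assumes Lua 5.3+ and valid UTF-8 input, handles only positive 1-based character indices, and uses hypothetical helper names (`char_sub`, `char_byte`, `char_find`) rather than the PR's actual code.

```lua
-- Sketch only: assumes Lua 5.3+ (built-in `utf8`) and valid UTF-8 input.

-- sub: translate character indices to byte indices, then use string.sub.
local function char_sub(s, i, j)
  local first = utf8.offset(s, i)        -- byte where character i starts
  if not first then return '' end
  local after = utf8.offset(s, j + 1)    -- byte where character j+1 starts
  local last = after and (after - 1) or #s
  return string.sub(s, first, last)
end

-- byte: locate the character with utf8.offset, decode it with utf8.codepoint.
local function char_byte(s, i)
  local pos = utf8.offset(s, i or 1)
  if not pos or pos > #s then return nil end
  return utf8.codepoint(s, pos)
end

-- find: char->byte for init, byte-level string.find, then byte->char.
local function char_find(s, pat, init, plain)
  local start = utf8.offset(s, init or 1)
  if not start then return nil end
  local b, e = string.find(s, pat, start, plain)
  if not b then return nil end
  -- utf8.len(s, 1, k) counts the characters that start within bytes 1..k
  return utf8.len(s, 1, b), utf8.len(s, 1, e)
end

local s = "h\195\169llo"  -- "héllo": 5 characters, 6 bytes
```

With this input, character 2 is the two-byte "é", so `char_sub(s, 2, 3)` has to return three bytes and `char_find(s, "llo", 1, true)` has to map byte positions 4..6 back to character positions 3..5.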

Limitations (without lua-utf8)

  • upper/lower are ASCII-only
  • gsub/gmatch/match operate on bytes, not characters

Closes #9412

Test plan

  • All 11,599 Lua unit tests pass with lua-utf8 installed
  • Only 4 expected failures without lua-utf8 (non-ASCII upper/lower)
  • CI passes

HaxeFoundation#9412)

On Lua 5.3+, add a runtime shim that pre-populates package.loaded['lua-utf8']
with a compat table built from the built-in utf8 module. This lets the Lua
target work without the third-party lua-utf8 library on Lua 5.3+.

Limitations without lua-utf8: upper/lower are ASCII-only; gsub/gmatch/match
operate on bytes, not characters.
@tobil4sk
Member

I think it may be worth reorganising things a bit here, rather than trying to force utf8 into the shape of lua-utf8.

  • Overwriting the lua-utf8 import doesn't seem like good practice and has the potential to cause confusion. I think it's best to have an intermediate class (or a table like _hx_bit).
  • Rather than trying to re-implement lua-utf8 using utf8, the intermediate class should just define the common API.
  • We should aim to implement Haxe std functions with the common API.
  • The more of this we can write in Haxe, the better, so DCE can remove unneeded methods from the output file.

I'm also worried that this introduces a bit of inconsistency. Currently, we either have lua-utf8 or we don't. If we have it, then strings are treated as utf8 and all operations work in terms of codepoints. If we don't have it, then all strings are treated as bytes. This is easy to explain in documentation.

With this fallback implementation, some methods use utf8 characters and others use bytes, which can break things as a user might take the output from one method and pass it to the other. In particular, falling back to #s for .len() breaks something like badString.charAt(badString.length - 1).
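
A small sketch of how that mismatch arises, assuming Lua 5.3+ and a hypothetical `lax_len` standing in for the shim's lax len:

```lua
-- Sketch only: `lax_len` is a hypothetical stand-in for the shim's lax len.
local function lax_len(s)
  -- utf8.len returns nil (plus the offending byte position) on invalid UTF-8,
  -- so the `or` falls through to the byte length.
  return utf8.len(s) or #s
end

local good = "h\195\169"      -- "hé": 2 characters, 3 bytes
local bad  = "h\195\169\255"  -- the same text plus an invalid byte 0xFF

print(lax_len(good))  -- a character count
print(lax_len(bad))   -- a byte count, because the fallback kicked in
```

The second result is measured in bytes while the character-based methods still index by codepoint, so an index derived from it (such as length - 1) no longer refers to the last character.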

I'm not exactly sure of the best way to solve all these concerns. Maybe another option is: if we use utf8 as a replacement for lua-utf8, throw on unimplemented methods instead of falling back to Unicode-incompatible ones. That way users will get an explicit error instead of inconsistent behaviour.
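
One way the "throw on unimplemented methods" option could look, as a sketch only: the wrapper name `strict_compat` and the error message are assumptions, and nothing like this exists in the PR.

```lua
-- Sketch only: make every method missing from the compat table raise an
-- explicit error instead of silently falling back to byte-level behaviour.
local function strict_compat(impl)
  return setmetatable(impl, {
    __index = function(_, name)
      error("lua-utf8 function '" .. tostring(name)
        .. "' is not available without the lua-utf8 library", 2)
    end,
  })
end

-- Only functions that are genuinely codepoint-aware are listed.
local compat = strict_compat({
  identity = function(s) return s end, -- placeholder for a real method
})

local ok, err = pcall(function() return compat.upper("x") end)
print(ok, err)
```

The trade-off is that the error is raised as soon as a missing field is looked up, so code that merely probes for a method's existence would also fail.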

What are your thoughts?

@jdonaldson jdonaldson marked this pull request as draft February 12, 2026 01:11
@jdonaldson
Member Author

Yeah, what I have here is a bad idea for the reasons you're outlining. I've marked this PR as a draft for now. I think it will require a much bigger effort to completely integrate the Lua 5.3 utf8 module, and I'm not even sure the juice is worth the squeeze here. Any additional thoughts appreciated.

@skial skial mentioned this pull request Feb 12, 2026
1 task
Development

Successfully merging this pull request may close these issues.

Lua 5.3 implements bit32 as a native and utf8 as just utf8 not lua-utf8

2 participants