NFC: Use TextEncoder for stringToUTF8Array and lengthBytesUTF8 Function #25148
Conversation
Force-pushed from 8a0522e to 955d2f3
Personally I use Sublime Text as the editor, not sure what others use. It does also sometimes get confused with the
You can see CircleCI logs on this PR that take you to URLs of live CI runs, which check whether there are any errors. Looks like there is something to fix there; check out the report pages.
Btw, if you still wanted to implement the use of TextEncoder in lengthBytesUTF8, it could look something like this:

$lengthBytesUTF8__deps: ['$UTF8Encoder'],
$lengthBytesUTF8: (str) => {
#if SHRINK_LEVEL == 2
  return UTF8Encoder.encode(str).length;
#else
  // ... the original implementation here
#endif
}
Can you update the PR description and title, since this is no longer just about lengthBytesUTF8?
For the title, can you use the ". NFC" suffix rather than the "Refactor:" prefix (this is just our local convention).
(Title updated from "Use TextEncoder for $lengthBytesUTF8 Function" to "Use TextEncoder for stringToUTF8Array and lengthBytesUTF8 Function".)
Hi team, I'd like to propose a change to our string conversion functions. Currently, our UTF-8, UTF-16, and UTF-32 conversion functions rely on a lot of custom logic that manually loops through bytes. This approach is not as performant as it could be. I've outlined a high-level approach to address this.

High-Level Architectural Change
The core idea is to move away from individual functions with custom loops and introduce dedicated Processor objects. These processors would act as a centralized, robust layer for all encoding and decoding operations.
Conceptual Code Example
This code block illustrates the core concept for a single UTF-8 processor:

// A conceptual high-level "Processor" object
const UTF8Processor = (() => {
  // Try to get native browser APIs
  const nativeDecoder = typeof TextDecoder !== 'undefined' ? new TextDecoder('utf-8') : null;
  const nativeEncoder = typeof TextEncoder !== 'undefined' ? new TextEncoder() : null;
  return {
    decode: (byteArray) => {
      if (nativeDecoder) {
        // Use the fast, native API
        return nativeDecoder.decode(byteArray);
      } else {
        // Fallback to manual JS implementation for older browsers
        // ...manual loop logic here...
      }
    },
    encode: (jsString) => {
      if (nativeEncoder) {
        // Use the fast, native API
        return nativeEncoder.encode(jsString);
      } else {
        // Fallback to manual JS implementation
        // ...manual loop logic here...
      }
    }
  };
})();

I don't have the skills to implement this refactor myself, but I believe it's an important change for improving performance. Could an existing developer please review this and help get it started?
Thanks for the idea. For the system library architecture that Emscripten has, this is unfortunately not a good approach. The system libraries need to be kept flat and C-like so that they generate minimal code size and allow effective automatic dead-code elimination. So object-oriented, class-like abstractions do not fit well within these requirements. That is why TextEncoder and TextDecoder are accessed as separate global variables. Also, another requirement is to be able to fall back to manual code when the JS objects are not available.
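For reference, a minimal sketch of that flat style, assuming a single shared global for the decoder and an ASCII-only fallback loop (the function and variable names here are illustrative, not the actual library code):

```js
// Flat, C-like style: the decoder lives in one global variable, created once.
// If nothing references it, dead-code elimination can drop it entirely.
var UTF8Decoder = typeof TextDecoder != 'undefined' ? new TextDecoder() : undefined;

// Illustrative helper: native fast path with a manual fallback.
function utf8BytesToString(bytes /* Uint8Array */) {
  if (UTF8Decoder) {
    // Fast path: hand the whole byte range to the native decoder.
    return UTF8Decoder.decode(bytes);
  }
  // Fallback for environments without TextDecoder.
  // (ASCII-only here to keep the sketch short; real code would also
  // handle multi-byte sequences.)
  var str = '';
  for (var i = 0; i < bytes.length; i++) {
    str += String.fromCharCode(bytes[i]);
  }
  return str;
}
```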
So, I had another thought based on what you said. What if we don't use a single big object at all? We could just add the logic directly to each function. That way, each one can check for its own native API and fall back to the manual code if it needs to. It keeps everything flat and simple.
@juj can you check why those 2 CI checks are not running?
src/lib/libstrings.js (outdated)
return u8array;
if (dontAddNull)
  u8array = u8array.subarray(0, numBytesWritten);
return Array.from(u8array);
Is this change needed? If so, then I guess this change technically changes the interface to stringToUTF8Array?
If this part of the change is unrelated then perhaps it can be removed, or split into its own PR?
Yes, actually intArrayFromString originally had the functionality to remove the null at the end of the array. To support this old feature, I have kept it here. Without it, some of the test cases are failing.
Can you explain more?
If the behaviour of stringToUTF8Array is unchanged, then it seems like that code in intArrayFromString would not need changing.
If the behaviour of stringToUTF8Array has changed in some way, can you explain exactly how it has changed?
In stringToUTF8Array, we updated the logic to use the set method on the heap instead of looping through and assigning values one by one. This works only when both the source and target are TypedArray objects like Uint8Array.
Previously, this code was using a regular Array(len) for u8array, which no longer works, since a plain Array has no set method. Now, we explicitly create it as a Uint8Array like this:
var u8array = new Uint8Array(len);
Additionally, earlier we had:
if (dontAddNull) u8array.length = numBytesWritten;
I changed it to:
if (dontAddNull)
  u8array = u8array.subarray(0, numBytesWritten);
This ensures we properly handle cases where the null terminator should be removed, which also fixes some failing test cases.
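A small self-contained illustration of those two points (plain Array vs. Uint8Array as the target of set, and subarray instead of truncating via .length); this is demonstration code, not the library implementation:

```js
var bytes = new TextEncoder().encode('hi');  // Uint8Array [104, 105]

// 1) Uint8Array.prototype.set copies into a typed-array target.
//    A plain Array has no .set method, so it cannot be the target.
var target = new Uint8Array(4);
target.set(bytes);                           // target is now [104, 105, 0, 0]
// new Array(4).set(bytes) would throw: .set is undefined on plain Arrays.

// 2) A typed array has a fixed length, so the trailing NUL is trimmed with
//    subarray (a view on the same buffer) rather than by assigning to
//    .length, which only truncates plain Arrays.
var withNull = new Uint8Array([104, 105, 0]);
var trimmed = withNull.subarray(0, 2);       // Uint8Array [104, 105]
console.log(Array.from(trimmed));            // [104, 105]
```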
I see. In that case I'm not sure we can just switch to using the .set() method, since that would be a breaking change to stringToUTF8Array, i.e. it would mean the function no longer accepts Array objects, only TypedArray objects.
Is that right?
It's not impossible to imagine us changing that interface, but if we do, I think we should do that first, as a separate change.
Yes, you are correct. I'll just revert the change so it accepts both arrays and typed arrays.
Meanwhile, a quick question: what is the best way to run all the test cases locally? I'm on a Mac, so is there a quick test-run command or something?
I have tried the test/runner script, but it has been running for more than 1.5 hours on my machine and is still going...
I normally run ./test/runner other --skip-slow and ./test/runner core2 --skip-slow locally. That gives pretty good coverage, and both commands should return pretty quickly.
Here I think we definitely do want to support a non-TextEncoder code path. This is because the TextEncoder-enabled paths all allocate temporary garbage, and we do want to have a build path that provides as many garbage-free paths as possible. GC stuttering is a very sensitive and difficult thing to optimize for, so being able to clean up as many sources of temp garbage as possible is desirable there. (Building no-runaway-temp-garbage real-time renderers/games is in general quite feasible with Emscripten; e.g. Unity and Unreal Engine 5 code paths do not generate per-frame temp garbage at all.)
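To illustrate the temporary-garbage point with a rough sketch (illustrative functions, not Emscripten's actual helpers; the manual path is ASCII-only just to keep it short):

```js
// TextEncoder path: every call allocates a fresh Uint8Array, which is
// copied into the heap and then immediately becomes garbage.
function writeStringWithEncoder(str, heap, ptr) {
  var tmp = new TextEncoder().encode(str); // per-call temporary allocation
  heap.set(tmp, ptr);
  return tmp.length;
}

// Manual path: bytes are written directly into the existing heap, so a
// hot per-frame loop produces no temporary garbage at all.
function writeAsciiStringNoGarbage(str, heap, ptr) {
  var i = 0;
  for (; i < str.length; ++i) {
    heap[ptr + i] = str.charCodeAt(i); // ASCII-only sketch
  }
  return i;
}

// Example usage against a plain Uint8Array standing in for the heap:
var heap = new Uint8Array(64);
writeStringWithEncoder('hello', heap, 0);
writeAsciiStringNoGarbage('hello', heap, 16);
```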
Hmm, not sure why they are not progressing. CircleCI is a bit finicky at times. I generally push a new commit to the PR to make the CircleCI backend re-do the whole thing.
In that case I'm not sure it's worth adding this complexity. I.e. if the current code has advantages in terms of code size and advantages in terms of not generating garbage, then maybe the perf win in some cases is not worth the extra complexity of adding a new setting (and the extra dimension in the testing matrix). Presumably most programs are not bottlenecked on converting JS strings to linear memory?
If we were able to use …
@juj It looks like the GitHub Actions workflow is currently blocked and waiting for maintainer approval. Here's the link to the pending run: https://github.com/emscripten-core/emscripten/actions/runs/17455947522. Could you please take a look and approve it so the checks can proceed, since all the other checks have completed successfully?
I don't see any maintainer approval button, so I'm not sure what is blocking those GitHub Actions.
Force-pushed from 33d95ea to 63a5874
Yup, I know what it looks like, since it shows up for every new contributor. It was missing for some reason, but it's there now and I just pressed it.
Force-pushed from cdc7017 to be0690b
Force-pushed from be0690b to d79f964
This PR refactors the $lengthBytesUTF8 function to use the native TextEncoder API. This change simplifies the codebase, improves correctness and reliability, and aligns the implementation with modern web standards.
The following code replaces the old, manually-coded version:
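(The PR's exact snippet is not reproduced in this page; a minimal sketch of the TextEncoder-based shape, assuming a shared UTF8Encoder instance as suggested earlier in the review, looks like this:)

```js
// Sketch only: a shared encoder instance plus a one-line length helper.
var UTF8Encoder = typeof TextEncoder != 'undefined' ? new TextEncoder() : undefined;

function lengthBytesUTF8(str) {
  // UTF-8 byte count of str, excluding the NUL terminator.
  return UTF8Encoder.encode(str).length;
}
```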
Change Log
lengthBytesUTF8 now uses the TextEncoder web API.
Explanation
The original implementation manually calculates the UTF-8 byte length by iterating through the string's characters and checking their charCodeAt values. While functional, this approach was complex and fragile.
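Roughly, the manual approach looks like the following simplified sketch (not the verbatim library code):

```js
// Manual UTF-8 byte-length calculation over the string's UTF-16 code units.
function lengthBytesUTF8_manual(str) {
  var len = 0;
  for (var i = 0; i < str.length; ++i) {
    var c = str.charCodeAt(i);
    if (c <= 0x7F) len += 1;                                // ASCII
    else if (c <= 0x7FF) len += 2;                          // 2-byte sequence
    else if (c >= 0xD800 && c <= 0xDFFF) { len += 4; ++i; } // surrogate pair -> 4 bytes
    else len += 3;                                          // 3-byte sequence
  }
  return len;
}
```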
New Implementation Benefits
The new implementation leverages the built-in TextEncoder API, which is modern, standardized, and highly optimized.
Why this is an improvement:
Simplicity
Robustness
TextEncoder correctly handles all Unicode edge cases without complex, manual checks. It works seamlessly for strings containing characters that require careful surrogate pair handling in the old implementation.
Performance
Native APIs like TextEncoder are implemented in highly optimized C++.
Example
To demonstrate correctness, consider a test case that runs both the old and the new implementation on the same string. Both methods yield the correct result, but the new implementation achieves it with a more reliable and maintainable approach.
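The original test snippet is not included above; an illustrative check in the same spirit (the string and expected value here are chosen purely for demonstration) would be:

```js
// A string mixing a 1-byte, 2-byte, 3-byte and 4-byte (surrogate pair) character.
var s = 'aé€😀';

// Expected UTF-8 size: 1 ('a') + 2 ('é') + 3 ('€') + 4 ('😀') = 10 bytes.
var expected = 10;

// New approach: let TextEncoder do the counting.
var viaEncoder = new TextEncoder().encode(s).length;

console.log(viaEncoder === expected); // true
// The manual charCodeAt loop sketched earlier in this thread returns the
// same count for this string.
```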