
Conversation

PtPrashantTripathi

This PR refactors the $lengthBytesUTF8 function to use the native TextEncoder API.
This change simplifies the codebase, improves correctness and reliability, and aligns the implementation with modern web standards.

The following code replaces the old, manually-coded version:

  /**
   * Returns the number of bytes the given JavaScript string takes if encoded as a
   * UTF8 byte array, EXCLUDING the null terminator byte.
   *
   * @param {string} str - The JavaScript string to operate on.
   * @return {number} The length, in bytes, of the UTF-8 encoded string.
   */
  $lengthBytesUTF8: (str) => {
    var UTF8Encoder = new TextEncoder();
    return UTF8Encoder.encode(str).length;
  }

Change Log

  • Removed manual UTF-16 surrogate pair and character code logic.
  • Added reliance on the native TextEncoder web API.
  • Improved code simplicity, readability, and maintainability.
  • Ensured accurate and robust UTF-8 byte length calculation for all Unicode characters.

Explanation

The original implementation manually calculates the UTF-8 byte length by iterating over the string's UTF-16 code units with charCodeAt().
While functional, this approach was complex and fragile (a sketch of this kind of loop follows the list below):

  • It required specific logic to correctly handle multi-byte characters.
  • It had to identify and process UTF-16 surrogate pairs, which represent characters outside the Basic Multilingual Plane (e.g., emojis and certain CJK ideographs).
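
A minimal sketch of the kind of manual loop being replaced (illustrative only, not the exact Emscripten source):

function manualLengthBytesUTF8(str) {
  let len = 0;
  for (let i = 0; i < str.length; ++i) {
    // charCodeAt() yields UTF-16 code units, so a surrogate pair must be
    // combined back into a single code point before its bytes are counted.
    let c = str.charCodeAt(i);
    if (c >= 0xD800 && c <= 0xDFFF) {
      c = 0x10000 + (((c & 0x3FF) << 10) | (str.charCodeAt(++i) & 0x3FF));
    }
    if (c <= 0x7F) len += 1;        // ASCII
    else if (c <= 0x7FF) len += 2;  // two-byte sequence
    else if (c <= 0xFFFF) len += 3; // three-byte sequence
    else len += 4;                  // supplementary plane character
  }
  return len;
}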

New Implementation Benefits

The new implementation leverages the built-in TextEncoder API, which is modern, standardized, and highly optimized.

Why this is an improvement:

  1. Simplicity

    • The manual loop is replaced by a single, declarative line of code.
    • Easier to read and understand, and less prone to human error.
  2. Robustness

    • TextEncoder correctly handles all Unicode edge cases without complex, manual checks.

    • Works seamlessly for multi-byte strings such as こんにちは or नमस्ते, as well as for characters outside the Basic Multilingual Plane (e.g., emojis), which require careful surrogate pair handling in the old implementation.

  3. Performance

    • Native APIs like TextEncoder are implemented in highly optimized native engine code.
    • Typically faster than a hand-written JavaScript loop, especially for long strings.

Example

To demonstrate correctness, consider this test case:

// A string with characters that are multi-byte in UTF-8
const testString = "नमस्ते विश्व!";

// Old manual function:
// lengthBytesUTF8(testString) === 35

// New TextEncoder implementation:
const byteLength = new TextEncoder().encode(testString).length;
// byteLength === 35

Both methods yield the correct result, but the new implementation achieves it with a more reliable and maintainable approach.

@PtPrashantTripathi PtPrashantTripathi force-pushed the refactor/use-text-encoder-for-utf8-length branch from 8a0522e to 955d2f3 Compare September 3, 2025 10:57
@juj
Collaborator

juj commented Sep 3, 2025

Quick question for the team: What VSCode settings/extensions do you guys use to make this Emscripten code more readable? The syntax highlighting is pretty much useless for me right now with all the preprocessor directives.

Personally I use Sublime Text as the editor, not sure what others use. It does also sometimes get confused with the #if preprocessors, though I've gotten by getting used to it.

Hope I didn't introduce any bugs

You can see the CircleCI logs on this PR, which take you to the live CI runs that check for errors. Looks like there is something to fix there - check out the report pages there.

@juj
Collaborator

juj commented Sep 3, 2025

  $lengthBytesUTF8: (str) => {
    var UTF8Encoder = new TextEncoder();
    return UTF8Encoder.encode(str).length;
  }

Btw if you still wanted to use TextEncoder in lengthBytesUTF8, you can do so with something like

$lengthBytesUTF8__deps: ['$UTF8Encoder'],
$lengthBytesUTF8: (str) => {
#if SHRINK_LEVEL == 2
  return UTF8Encoder.encode(str).length;
#else
  .. the original implementation here
#endif
}

SHRINK_LEVEL=2 corresponds to using -Oz on command line, i.e. "optimize maximally for smallest code size, ignoring any performance implications".

Collaborator

@sbc100 sbc100 left a comment

Can you update the PR description and title since this is no longer just about lengthBytesUTF8?

For the title, can you use the ". NFC" suffix rather than the "Refactor:" prefix (this is just our local convention).

@PtPrashantTripathi PtPrashantTripathi changed the title Refactor: Use TextEncoder for $lengthBytesUTF8 Function NFC: Use TextEncoder for stringToUTF8Array and lengthBytesUTF8 Function Sep 3, 2025
@PtPrashantTripathi
Author

Hi team,

I'd like to propose a change to our emscripten string utility library.

Currently, our UTF-8, UTF-16, and UTF-32 conversion functions rely on a lot of custom logic that manually loops through bytes. This approach is not as performant as it could be.

I've outlined a high-level approach to address this:


High-Level Architectural Change

The core idea is to move away from individual functions with custom loops and introduce dedicated Processor objects. These processors will act as a centralized, robust layer for all encoding and decoding operations.

+----------------+       +-------------------+
|  Old Approach  |       | Proposed Approach |
+----------------+       +-------------------+
|                |       |                   |
|  stringToUTF8  |       |  stringToUTF8     |
|       |        |       |       |           |
|  Manual Loop   |       |   UTF8Processor   |
+----------------+       |                   |
|  UTF8ToString  |       |  UTF8ToString     |
|       |        |       |       |           |
|  Manual Loop   |       |   UTF8Processor   |
+----------------+       +-------------------+
                                   |
                                   |
+------------------------------------------------+
|          UTF8Processor Logic                   |
+------------------------------------------------+
|  Try Native `TextEncoder` / `TextDecoder` API  |
|          (Faster)                              |
|------------------------------------------------|
|  Fallback to JavaScript Loop                   |
|           (Compatibility)                      |
+------------------------------------------------+


Conceptual Code Example

This code block illustrates the core concept for a single UTF8Processor object. The decode and encode methods encapsulate the logic, preferring the fast native APIs while maintaining a fallback for older browser compatibility.

// A conceptual high-level "Processor" object
const UTF8Processor = (() => {
  // Try to get native browser APIs
  const nativeDecoder = typeof TextDecoder !== 'undefined' ? new TextDecoder('utf-8') : null;
  const nativeEncoder = typeof TextEncoder !== 'undefined' ? new TextEncoder() : null;

  return {
    decode: (byteArray) => {
      if (nativeDecoder) {
        // Use the fast, native API
        return nativeDecoder.decode(byteArray);
      } else {
        // Fallback to manual JS implementation for older browsers
        // ...manual loop logic here...
      }
    },
    encode: (jsString) => {
      if (nativeEncoder) {
        // Use the fast, native API
        return nativeEncoder.encode(jsString);
      } else {
        // Fallback to manual JS implementation
        // ...manual loop logic here...
      }
    }
  };
})();

I don't have the skills to implement this refactor myself, but I believe it's an important change for improving performance. Could an existing developer please review this and help get it started?

cc: @juj @sbc100

@juj
Collaborator

juj commented Sep 4, 2025

The core idea is to move away from individual functions with custom loops and introduce dedicated Processor objects.

Thanks for the idea. For the system library architecture that Emscripten has, this is unfortunately not a good approach. The system libraries need to be kept flat and C-like so that they generate minimal code size and allow effective automatic dead code elimination.

So object-oriented class-like abstractions do not fit well within these requirements. That is why TextEncoder and TextDecoder are accessed as separate global variables.

Another requirement is to be able to fall back to manual code when the JS objects are not available. The UTF8Processor idea would tie the encoder and decoder together, even though they may have been adopted in JavaScript engines at different times, so gating their use would needlessly be tied to each other.
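
As a flat-pattern sketch (illustrative only, not the actual library code), each API sits behind its own feature check, so support for one never drags in the other:

var UTF8Decoder = typeof TextDecoder != 'undefined' ? new TextDecoder() : undefined;
var UTF8Encoder = typeof TextEncoder != 'undefined' ? new TextEncoder() : undefined;
// Call sites then branch on `if (UTF8Decoder) ... else ...` and
// `if (UTF8Encoder) ... else ...` independently of each other.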

@PtPrashantTripathi
Author

Thanks for the idea. For the system library architecture that Emscripten has, this is unfortunately not a good approach. The system libraries need to be kept flat and C-like so that they generate minimal code size and allow effective automatic dead code elimination.

So, I had another thought based on what you said. What if we don't use a single big object at all? We could just add the logic directly to each function. That way, each one can check for its own native API and fall back to the manual code if it needs to. It keeps everything flat and simple.

@PtPrashantTripathi
Author

@juj can you check why those 2 CI checks are not running?

return u8array;
if (dontAddNull)
u8array = u8array.subarray(0, numBytesWritten);
return Array.from(u8array);
Collaborator

Is this change needed? If so then I guess this change technically changes the interface to stringToUTF8Array?

If this part of the change is unrelated then perhaps it can be removed, or split into its own PR?

Author

Yes, actually intArrayFromString originally had the functionality to remove the null at the end of the array. To support this old feature, I have kept it here. Without it, some of the test cases are failing.

Collaborator

Can you explain more?

If the behaviour of stringToUTF8Array is unchanged then it seems like that code in intArrayFromString would not need changing.

If the behaviour of stringToUTF8Array has changed in some way can you explain exactly how it has changed?

Author

In stringToUTF8Array, we updated the logic to use the set() method on the heap instead of looping through and assigning values one by one. This works only when both the source and the target are TypedArray objects such as Uint8Array.

Because of that, the regular Array(len) previously used for u8array here no longer works (set() requires a typed array), so we now explicitly create it as a Uint8Array:

var u8array = new Uint8Array(len);

Additionally, earlier we had:

if (dontAddNull) u8array.length = numBytesWritten;

I changed it to:

if (dontAddNull) 
  u8array = u8array.subarray(0, numBytesWritten);

This ensures we properly handle cases where the null terminator should be removed, which also fixes some failing test cases.
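
As a small illustration of the difference (not the Emscripten source): assigning to .length truncates a plain Array, but a Uint8Array has a fixed, read-only length, so subarray() is the way to drop the trailing null byte.

const plain = [104, 105, 0];
plain.length = 2;                 // plain Array: truncates to [104, 105]

let typed = new Uint8Array([104, 105, 0]);
// typed.length = 2;              // no effect (length is read-only; throws in strict mode)
typed = typed.subarray(0, 2);     // correct: a view of just the first 2 bytes
console.log(plain.length, typed.length); // 2 2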

Collaborator

I see, in that case I'm not sure we can just switch to using the .set() method, since that would be a breaking change to stringToUTF8Array. i.e. it would mean the function no longer accepts Array objects, only TypedArray objects.

Is that right?

It's not impossible to imagine us changing that interface, but if we do I think we should do that first, as a separate change.

Author

@PtPrashantTripathi PtPrashantTripathi Sep 4, 2025

Yes, you are correct. I'll revert the change so that it accepts both Array and typed-array objects.

Meanwhile, a quick question: what is the best way to run all the test cases locally? I'm on a Mac - is there a quick test-run command or something? I have tried the test/runner script, but it has been running for more than 1.5 hours on my machine and is still going...

Collaborator

I normally run ./test/runner other --skip-slow and ./test/runner core2 --skip-slow locally. That gives pretty good coverage, and both commands should return pretty quickly.

@juj
Collaborator

juj commented Sep 4, 2025

Here I think we definitely do want to support a -sTEXTENCODER=0 option that uses the old synchronous code.

This is because the TextEncoder-enabled paths all allocate temporary garbage. We do want to have a build path that provides as many garbage-free paths as possible. GC stuttering is a very sensitive and difficult thing to optimize for, so being able to clean up as many sources of temp garbage as possible is desirable there.

(Building no-runaway-temp-garbage real-time renderers/games is in general quite feasible with Emscripten, e.g. Unity and Unreal Engine 5 code paths do not generate per-frame temp garbage at all)
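
To make the temp-garbage point concrete, here is a minimal, hypothetical illustration (HEAPU8 and ptr are stand-ins, not the actual runtime code): the encoder path allocates a fresh Uint8Array on every call, while a manual path writes straight into an existing heap view.

const HEAPU8 = new Uint8Array(1024); // stand-in for the wasm heap
const ptr = 0;                       // stand-in for a destination pointer

// TextEncoder path: every call creates a temporary Uint8Array that becomes
// garbage as soon as its bytes are copied into the heap.
function writeWithEncoder(str) {
  const bytes = new TextEncoder().encode(str); // temporary allocation
  HEAPU8.set(bytes, ptr);
  return bytes.length;
}

// Manual path (ASCII-only to keep the sketch short): writes code units
// directly into the existing heap view and allocates nothing per call.
function writeAsciiManually(str) {
  for (let i = 0; i < str.length; i++) HEAPU8[ptr + i] = str.charCodeAt(i);
  return str.length;
}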

@juj
Collaborator

juj commented Sep 4, 2025

@juj can you check why those 2 CI checks are not running?

Hmm, not sure why they are not progressing. CircleCI is a bit finicky at times. I generally push some new change commit to the PR to make the CircleCI backend re-do the whole thing.

@sbc100
Collaborator

sbc100 commented Sep 4, 2025

Here I think we definitely do want to support a -sTEXTENCODER=0 option that uses the old synchronous code.

This is because the TextEncoder-enabled paths all allocate temporary garbage. We do want to have a build path that provides as many garbage-free paths as possible. GC stuttering is a very sensitive and difficult thing to optimize for, so being able to clean up as many sources of temp garbage as possible is desirable there.

(Building no-runaway-temp-garbage real-time renderers/games is in general quite feasible with Emscripten, e.g. Unity and Unreal Engine 5 code paths do not generate per-frame temp garbage at all)

In that case I'm not sure it's worth adding this complexity. i.e. if the current code has advantages in terms of code size and advantages in terms of not generating garbage, then maybe the perf win in some cases is not worth the extra complexity of adding a new setting (and the extra dimension in the testing matrix). Presumably most programs are not bottlenecked on converting JS strings to linear memory?

@sbc100
Collaborator

sbc100 commented Sep 4, 2025

(Building no-runaway-temp-garbage real-time renderers/games is in general quite feasible with Emscripten, e.g. Unity and Unreal Engine 5 code paths do not generate per-frame temp garbage at all)

If we were able to use encodeInto would that avoid the garbage? I guess not since it would involve calling heap.subarray(..) to create the target array? Are the engines able to avoid creating garbage for transient subarrays?
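
For reference, a sketch of the encodeInto() variant being discussed (illustrative names, not a proposed patch): the encoded bytes land directly in the destination view, so the remaining per-call allocation is the transient subarray itself.

const HEAPU8 = new Uint8Array(1024); // stand-in for the wasm heap
const encoder = new TextEncoder();

function stringToHeapUTF8(str, ptr, maxBytes) {
  // encodeInto() writes into an existing Uint8Array; the subarray() call
  // still creates a small transient view object to use as the target.
  const dest = HEAPU8.subarray(ptr, ptr + maxBytes);
  const { written } = encoder.encodeInto(str, dest);
  return written; // number of UTF-8 bytes written (no null terminator)
}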

@PtPrashantTripathi
Author

PtPrashantTripathi commented Sep 4, 2025

@juj can you check why those 2 CI checks are not running?

Hmm, not sure why they are not progressing. CircleCI is a bit finicky at times. I generally push some new change commit to the PR to make the CircleCI backend re-do the whole thing.

@juj It looks like the GitHub Actions workflow is currently blocked and waiting for maintainer approval.

Here’s the link to the pending run: https://github.com/emscripten-core/emscripten/actions/runs/17455947522

Could you please take a look and approve it so the checks can proceed? All the other checks have completed successfully.

@sbc100
Collaborator

sbc100 commented Sep 4, 2025

I don't see any maintainer approval button, so I'm not sure what is blocking those GitHub Actions.

@PtPrashantTripathi PtPrashantTripathi force-pushed the refactor/use-text-encoder-for-utf8-length branch 2 times, most recently from 33d95ea to 63a5874 Compare September 4, 2025 20:24
@PtPrashantTripathi
Author

I don't see any maintainer approval button, so I'm not sure what is blocking those GitHub Actions.

It might look something like this:
[screenshot: Screenshot_2025-09-05-02-54-21-309_com android chrome-edit]

Source: https://docs.github.com/en/copilot/how-tos/use-copilot-agents/coding-agent/review-copilot-prs#managing-github-actions-workflow-runs

@sbc100
Collaborator

sbc100 commented Sep 4, 2025

It might look something like this

Yup, I know what it looks like since it shows up for every new contributor. It was missing for some reason, but it's there now and I just pressed it.

@PtPrashantTripathi PtPrashantTripathi force-pushed the refactor/use-text-encoder-for-utf8-length branch 2 times, most recently from cdc7017 to be0690b Compare September 5, 2025 09:27
@PtPrashantTripathi PtPrashantTripathi force-pushed the refactor/use-text-encoder-for-utf8-length branch from be0690b to d79f964 Compare September 5, 2025 16:31