Support a built-in type for well-formed strings #60765

@hudlow

πŸ” Search Terms

"Unicode", "well-formed Unicode", "valid Unicode", "lone surrogates", "UTF-16", "UTF-8", "isWellFormed()", "toWellFormed()"

✅ Viability Checklist

⭐ Suggestion

ES2024 added String.prototype.isWellFormed() and String.prototype.toWellFormed(), both of which are covered by TypeScript's ES2024 type definitions.

But much of the value of these methods is lost in TypeScript because there is no well-formed string type.

What I'd like to see is a "well-formed string" type (itself a subtype of string) for which isWellFormed() serves as a type guard, and which toWellFormed() (as well as functions like TextDecoder.decode()) returns.

Additionally, string literals could be determined to be of the well-formed string type at compile time.

This way TypeScript developers could get type safety for scenarios where strings need to be guaranteed to be well-formed.

📃 Motivating Example

I'm working on a TypeScript implementation of CEL which requires passing well-formed UTF-8 strings into an evaluation environment. If I want to bridge TypeScript's type safety to CEL's type safety, I'll need a well-formed string type in TypeScript.

💻 Use Cases

I can do something like this in my project:

interface WellFormedString extends String {
  __brand: "WellFormed";
}

interface String {
  isWellFormed(): this is WellFormedString;
  toWellFormed(): WellFormedString;
  toUpperCase(): this extends WellFormedString ? WellFormedString : string;
  toLowerCase(): this extends WellFormedString ? WellFormedString : string;
}

interface TextDecoder {
  decode(input?: AllowSharedBufferSource, options?: TextDecodeOptions): WellFormedString;
}

function useWellFormedString(a: WellFormedString) {
  // ...
}

// good -- no error
useWellFormedString("hello".toWellFormed());

// good -- no error
useWellFormedString("hello".toWellFormed().toUpperCase());

// good -- no error
const h = "hello";
if (h.isWellFormed()) {
  useWellFormedString(h); 
}

// good -- no error
// (the decoder coerces a lone "WTF-8" surrogate to "\ufffd\ufffd\ufffd")
useWellFormedString(new TextDecoder().decode(new Uint8Array([0xed, 0xba, 0xad])));

// good -- error
// (malformed string with lone UTF-16 surrogate)
useWellFormedString("\udead");

// bad -- error
// (a well-formed literal is not recognized as well-formed)
useWellFormedString("hello");

// bad -- error
// (the compiler rejects the cast)
useWellFormedString("hello" as WellFormedString);

But there are some significant disadvantages here:

  1. Well-formed string literals are not recognized as well-formed.
  2. It relies on a branding hack.
  3. The compiler complains about the cast (maybe this is fixable, but I don't know how).
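On the third point, one possible workaround is to brand with an intersection type instead of an interface that extends String. The sketch below is only an illustration under that assumption: the names WellFormed, hasLoneSurrogate, isWellFormed, toWellFormed, and useWellFormed are hypothetical, and the surrogate scan is written by hand so the example compiles without the ES2024 lib (with that lib, the built-in methods should do the same job). It does not address the first two points: literals still need a guard or a cast, and it is still a branding hack.

```typescript
// Hypothetical brand: an intersection with the primitive `string`, so a value
// is still a plain string at runtime and `as` casts from string are accepted.
type WellFormed = string & { readonly __brand: "WellFormed" };

const isHigh = (c: number): boolean => c >= 0xd800 && c <= 0xdbff;
const isLow = (c: number): boolean => c >= 0xdc00 && c <= 0xdfff;

// Hand-written scan for lone UTF-16 surrogates (hypothetical helper).
function hasLoneSurrogate(s: string): boolean {
  for (let i = 0; i < s.length; i++) {
    const c = s.charCodeAt(i);
    if (isHigh(c) && i + 1 < s.length && isLow(s.charCodeAt(i + 1))) {
      i++; // valid surrogate pair, skip the low half
    } else if (isHigh(c) || isLow(c)) {
      return true; // surrogate without its partner
    }
  }
  return false;
}

// Standalone type guard, standing in for a narrowing isWellFormed() method.
function isWellFormed(s: string): s is WellFormed {
  return !hasLoneSurrogate(s);
}

// Replaces each lone surrogate with U+FFFD and returns the branded type.
function toWellFormed(s: string): WellFormed {
  let out = "";
  for (let i = 0; i < s.length; i++) {
    const c = s.charCodeAt(i);
    if (isHigh(c) && i + 1 < s.length && isLow(s.charCodeAt(i + 1))) {
      out += s.slice(i, i + 2); // keep the valid pair intact
      i++;
    } else if (isHigh(c) || isLow(c)) {
      out += "\ufffd";
    } else {
      out += s.charAt(i);
    }
  }
  return out as WellFormed;
}

function useWellFormed(s: WellFormed): number {
  return s.length;
}

// The cast compiles here because the brand intersects the primitive type.
useWellFormed("hello" as WellFormed);

// The guard narrows a plain string.
const h: string = "hello";
if (isWellFormed(h)) {
  useWellFormed(h);
}
```

The cast is unchecked, so `"\udead" as WellFormed` would also compile; the guard and the normalizing function are the only paths that actually verify the string.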
