Description
🔍 Search Terms
"Unicode", "well-formed Unicode", "valid Unicode", "lone surrogates", "UTF-16", "UTF-8", "isWellFormed()", "toWellFormed()"
✅ Viability Checklist
- This wouldn't be a breaking change in existing TypeScript/JavaScript code
- This wouldn't change the runtime behavior of existing JavaScript code
- This could be implemented without emitting different JS based on the types of the expressions
- This isn't a runtime feature (e.g. library functionality, non-ECMAScript syntax with JavaScript output, new syntax sugar for JS, etc.)
- This isn't a request to add a new utility type: https://github.com/microsoft/TypeScript/wiki/No-New-Utility-Types
- [?] This feature would agree with the rest of our Design Goals: https://github.com/Microsoft/TypeScript/wiki/TypeScript-Design-Goals
⭐ Suggestion
ES2024 adds String.prototype.isWellFormed() and String.prototype.toWellFormed(), both of which are covered by TypeScript's ES2024 type definitions.
But much of the value of these functions is not realized in TypeScript, because there is no well-formed string type.
What I'd like to see is a "well-formed string" type (itself a subtype of string) for which isWellFormed() serves as a type guard, and which toWellFormed() (as well as functions like TextDecoder.decode()) returns.
Additionally, string literals could be determined to be of the well-formed string type at compile time.
This way TypeScript developers could get type safety for scenarios where strings need to be guaranteed to be well-formed.
📃 Motivating Example
I'm working on a TypeScript implementation of CEL (Common Expression Language), which requires passing well-formed UTF-8 strings into an evaluation environment. If I want to bridge TypeScript's type safety to CEL's type safety, I'll need a well-formed string type in TypeScript.
💻 Use Cases
I can do something like this in my project:
interface WellFormedString extends String {
  __brand: "WellFormed";
}
interface String {
  isWellFormed(): this is WellFormedString;
  toWellFormed(): WellFormedString;
  toUpperCase(): this extends WellFormedString ? WellFormedString : string;
  toLowerCase(): this extends WellFormedString ? WellFormedString : string;
}
interface TextDecoder {
  decode(input?: AllowSharedBufferSource, options?: TextDecodeOptions): WellFormedString;
}
function useWellFormedString(a: WellFormedString) {
  // ...
}
// good -- no error
useWellFormedString("hello".toWellFormed());
// good -- no error
useWellFormedString("hello".toWellFormed().toUpperCase());
// good -- no error
const h = "hello";
if (h.isWellFormed()) {
  useWellFormedString(h);
}
// good -- no error
// (the decoder coerces a lone "WTF-8" surrogate to "\ufffd\ufffd\ufffd")
useWellFormedString(new TextDecoder().decode(new Uint8Array([0xed, 0xba, 0xad])));
// good -- error
// (malformed string with lone UTF-16 surrogate)
useWellFormedString("\udead");
// bad -- error
useWellFormedString("hello");
// bad -- error
useWellFormedString("hello" as WellFormedString);
But this approach has some significant disadvantages:
- Well-formed string literals are not recognized as well-formed.
- It relies on a branding hack.
- The compiler rejects a direct cast such as "hello" as WellFormedString (maybe this is fixable, but I don't know how).
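On the casting point, one workaround that may help is to brand the primitive string type with an intersection instead of extending the String interface. The sketch below is only an illustration under that assumption: the names wellFormed, isWellFormedString, toWellFormedString and useWellFormedString are hypothetical, and the surrogate checks are written by hand so that the block does not depend on the ES2024 lib being enabled. It still relies on a brand and still does not recognize well-formed literals, so it does not replace the requested feature.

```typescript
// Hypothetical brand key; it exists only at the type level and emits no JS.
declare const wellFormed: unique symbol;

// Assumption: brand the primitive `string` rather than the `String` interface.
type WellFormedString = string & { readonly [wellFormed]: true };

const isHigh = (c: number): boolean => c >= 0xd800 && c <= 0xdbff;
const isLow = (c: number): boolean => c >= 0xdc00 && c <= 0xdfff;

// Hand-written check for lone surrogates, used as a type guard.
function isWellFormedString(s: string): s is WellFormedString {
  for (let i = 0; i < s.length; i++) {
    const c = s.charCodeAt(i);
    if (isHigh(c)) {
      // A high surrogate must be followed by a low surrogate.
      // charCodeAt returns NaN past the end, which fails isLow.
      if (!isLow(s.charCodeAt(i + 1))) return false;
      i++; // skip the low half of the valid pair
    } else if (isLow(c)) {
      return false; // low surrogate with no preceding high surrogate
    }
  }
  return true;
}

// Replaces each lone surrogate with U+FFFD and returns the branded type.
function toWellFormedString(s: string): WellFormedString {
  let out = "";
  for (let i = 0; i < s.length; i++) {
    const c = s.charCodeAt(i);
    if (isHigh(c) && isLow(s.charCodeAt(i + 1))) {
      out += s.charAt(i) + s.charAt(i + 1); // keep the valid pair
      i++;
    } else if (isHigh(c) || isLow(c)) {
      out += "\ufffd"; // lone surrogate
    } else {
      out += s.charAt(i);
    }
  }
  return out as WellFormedString;
}

function useWellFormedString(a: WellFormedString): number {
  return a.length; // a branded value is still usable as a plain string
}

const input: string = "a\udeadb";
if (isWellFormedString(input)) {
  useWellFormedString(input); // narrowed by the guard
}
useWellFormedString(toWellFormedString(input));
useWellFormedString("hello" as WellFormedString); // the assertion now compiles
```

With this shape the assertion in the last line type-checks, but it is unchecked: nothing stops someone from asserting a malformed literal, which is part of why a compiler-level type would be preferable.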