Description
Text document offsets are based on a UTF-16 string representation. This is odd, given that text contents are transmitted in UTF-8.
Text Documents
......... The offsets are based on a UTF-16 string representation.
Here in TextDocumentContentChangeEvent, range is specified in UTF-16 column offsets while text is transmitted in UTF-8.
interface TextDocumentContentChangeEvent {
    range?: Range;
    rangeLength?: number;
    text: string;
}
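To make the mismatch concrete, here is a minimal sketch that measures the same string in UTF-16 code units, UTF-8 bytes, and Unicode code points. The names ColumnCounts and measure are hypothetical and not part of the protocol; unpaired surrogates are counted as one code point of three bytes, which is an assumption of this sketch.

```typescript
// Hypothetical helper, not part of the LSP specification.
interface ColumnCounts {
    utf16CodeUnits: number;
    utf8Bytes: number;
    codePoints: number;
}

// Measure one string in the three units discussed above.
function measure(text: string): ColumnCounts {
    let utf8Bytes = 0;
    let codePoints = 0;
    let i = 0;
    while (i < text.length) {
        const unit = text.charCodeAt(i);
        const next = i + 1 < text.length ? text.charCodeAt(i + 1) : 0;
        const isPair =
            unit >= 0xd800 && unit <= 0xdbff && next >= 0xdc00 && next <= 0xdfff;
        if (isPair) {
            // A surrogate pair is one code point outside the BMP: 4 UTF-8 bytes.
            utf8Bytes += 4;
            i += 2;
        } else {
            // Assumption: an unpaired surrogate is counted like any other BMP unit.
            utf8Bytes += unit < 0x80 ? 1 : unit < 0x800 ? 2 : 3;
            i += 1;
        }
        codePoints += 1;
    }
    return { utf16CodeUnits: text.length, utf8Bytes, codePoints };
}

// "a" (U+0061), "é" (U+00E9), "😀" (U+1F600, written as a surrogate pair)
const counts = measure("a\u00e9\uD83D\uDE00");
```

For this three-character string, counts holds 4 UTF-16 code units, 7 UTF-8 bytes, and 3 code points, so a column after the emoji has a different value in each unit.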
Would it be more reasonable to unify these, remove UTF-16 from the wording, and use UTF-8 as the sole encoding? Line/character could then be measured in Unicode code points instead of UTF-16 code units.
Lines are rarely very long, so the extra computation needed to find the N-th Unicode code point should not place much burden on editors and language servers.
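As a rough illustration of that extra computation, the sketch below converts a column between code points and UTF-16 code units with a single linear scan of the line. The function names are hypothetical; clamping to the end of the line, and rounding up when a UTF-16 column falls inside a surrogate pair, are assumptions of this sketch rather than behavior defined by the protocol.

```typescript
// Hypothetical helpers, not part of the LSP specification.

// True when a high surrogate at `index` is followed by a low surrogate.
function isSurrogatePairAt(line: string, index: number): boolean {
    if (index + 1 >= line.length) return false;
    const high = line.charCodeAt(index);
    const low = line.charCodeAt(index + 1);
    return high >= 0xd800 && high <= 0xdbff && low >= 0xdc00 && low <= 0xdfff;
}

// Convert a code point column into a UTF-16 code unit column.
// Columns past the end of the line are clamped to the line length.
function codePointColumnToUtf16(line: string, codePointColumn: number): number {
    let utf16 = 0;
    let seen = 0;
    while (seen < codePointColumn && utf16 < line.length) {
        utf16 += isSurrogatePairAt(line, utf16) ? 2 : 1;
        seen += 1;
    }
    return utf16;
}

// Convert a UTF-16 code unit column into a code point column.
// A column that falls inside a surrogate pair is rounded up past the pair.
function utf16ColumnToCodePoint(line: string, utf16Column: number): number {
    const limit = Math.min(utf16Column, line.length);
    let utf16 = 0;
    let codePoints = 0;
    while (utf16 < limit) {
        utf16 += isSurrogatePairAt(line, utf16) ? 2 : 1;
        codePoints += 1;
    }
    return codePoints;
}

// "a", "😀" (U+1F600 as a surrogate pair), "b"
const sampleLine = "a\uD83D\uDE00b";
const utf16Col = codePointColumnToUtf16(sampleLine, 2);
const codePointCol = utf16ColumnToCodePoint(sampleLine, 3);
```

Here utf16Col is 3 and codePointCol is 2: the position just before "b" is code point column 2 but UTF-16 column 3. Both conversions are O(n) in the length of the line, which is the cost the paragraph above argues is acceptable.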
Survey: counting method of Position.character offsets supported by language servers/clients
https://docs.google.com/spreadsheets/d/168jSz68po0R09lO0xFK4OmDsQukLzSPCXqB6-728PXQ/edit#gid=0