What is the unit of a text column number?
I’ve recently published my parsing combinator library lexy. One of the things it does is issue a lexy::error if the input does not match the grammar. This error has a .position() which gives you the position where the error occurred.
In order to keep the happy path fast, .position() is not something that is easy to use for end users: it is simply an iterator into the input range. This is no good to a human user who wants something like line and column number to easily locate the problematic input.
Converting an iterator into a line/column location seems simple enough: set line = column = 1 and iterate over the entire input until you’ve reached the position of the iterator. Every time you see a newline, increment the line number and set the column number back to 1. Otherwise, the column is incremented every time you … see what exactly?
What exactly is a “column” of a text and how do I compute it?
Approach #1: Count chars
Let’s just write the basic version without thinking much of the problem:
```cpp
template <typename Input, typename Iterator>
auto find_location(const Input& input, Iterator position)
{
    auto line   = 1;
    auto column = 1;

    for (auto iter = input.begin(); iter != input.end(); ++iter)
    {
        if (iter == position)
        {
            // We found the location.
            break;
        }
        else if (*iter == '\n')
        {
            // End of a line.
            ++line;
            column = 1;
        }
        else
        {
            // Advance one column.
            ++column;
        }
    }

    return std::make_pair(line, column);
}
```
When we encounter a newline, we advance to the next line. Otherwise, we increment the column. Once we’ve reached the position in the input we’re looking for, we exit the loop and return the result.
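For illustration, here is a minimal usage sketch (a hypothetical snippet, assuming the find_location() above is in scope and C++17 for structured bindings; the input and position are made up):

```cpp
#include <string>
#include <utility>

int main()
{
    std::string input = "first line\nsecond line\n";
    auto position     = input.begin() + 17; // somewhere inside "second line"

    auto [line, column] = find_location(input, position);
    // With this counting scheme: line == 2, column == 7.
}
```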
This works and is quite simple and intuitive. I’ve tested a couple of text editors and compilers, and it seems like this algorithm is used by clang, GCC before version 11, as well as neovim’s col() function.
Yet this algorithm is “wrong”.
We’re counting the number of chars in a line, which in a Unicode world has no relation to any notion of “character”. Input like ä, €, or 𝔄 will count for 2, 3, and 4 columns respectively in UTF-8, but 1, 1, and 2 in UTF-16. So we need to do better.
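For illustration, those counts are easy to verify with string literals (a small sketch; the -1 drops the null terminator, and the expressions work the same in C++17 and C++20):

```cpp
// ä = U+00E4, € = U+20AC, 𝔄 = U+1D504
static_assert(sizeof(u8"\u00E4") - 1 == 2, "2 UTF-8 code units");
static_assert(sizeof(u8"\u20AC") - 1 == 3, "3 UTF-8 code units");
static_assert(sizeof(u8"\U0001D504") - 1 == 4, "4 UTF-8 code units");

static_assert(sizeof(u"\u00E4") / sizeof(char16_t) - 1 == 1, "1 UTF-16 code unit");
static_assert(sizeof(u"\u20AC") / sizeof(char16_t) - 1 == 1, "1 UTF-16 code unit");
static_assert(sizeof(u"\U0001D504") / sizeof(char16_t) - 1 == 2, "2 UTF-16 code units");
```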
Approach #2: Count code points
Let’s assume the input is encoded in UTF-8 for the sake of discussion. UTF-8 is a multibyte encoding, which means that some “characters” are encoded using a sequence of char. A single char is called a code unit and a sequence of code units is used to encode a code point. “Characters” like ä, €, or 𝔄 are one code point, but encoded as multiple code units.
So we need to count code points, not chars:
```cpp
for (auto iter = input.begin(); iter != input.end();)
{
    if (iter == position)
    {
        // We found the location.
        break;
    }
    else if (*iter == '\n')
    {
        // End of a line.
        ++iter;
        ++line;
        column = 1;
    }
    else
    {
        // One code point is a column.
        skip_code_point(iter, input.end());
        ++column;
    }
}
```
The function skip_code_point() does the necessary logic to advance the iterator to the next code point. This is not too complicated – just look at the bit pattern of the initial code unit, so I’ve omitted it here for brevity.
For reference, here is lexy’s straightforward implementation. It also extracts the value of the code point and handles various encoding errors such as overlong sequences. None of that is really necessary here though.
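For completeness, a minimal sketch of what such a helper could look like for UTF-8 (this is not lexy’s implementation and does no validation; it just reads the sequence length from the leading code unit):

```cpp
// Advance `iter` past one UTF-8 encoded code point, but never past `end`.
// Assumes well-formed input; a stray continuation byte is skipped on its own.
template <typename Iterator>
void skip_code_point(Iterator& iter, Iterator end)
{
    auto first = static_cast<unsigned char>(*iter);

    // For multi-byte sequences, the number of leading 1 bits
    // of the first code unit is the length of the sequence.
    auto length = 1;
    if ((first & 0b1111'1000) == 0b1111'0000)
        length = 4;
    else if ((first & 0b1111'0000) == 0b1110'0000)
        length = 3;
    else if ((first & 0b1110'0000) == 0b1100'0000)
        length = 2;

    for (auto i = 0; i != length && iter != end; ++i)
        ++iter;
}
```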
Counting code points means that even multibyte “characters” are treated as a single column and we’re no longer exposing their actual encoding. This algorithm seems to be used by the Rust compiler.
So, counting columns is a bit more complicated than you’d initially expect but it’s still manageable. lexy already provided rules to match Unicode code points, so let’s just use those in the actual implementation and call it a day.
Except it’s not that simple.
Handling text is never that simple.
Approach #3: Count grapheme clusters
Notice how I put “character” in quotes?
That’s because a “character” doesn’t really have a precise definition like code unit or code point. The closest thing to what a non-tech person would describe as a character is a Unicode grapheme cluster: a string that roughly corresponds to a single glyph in the font.
And of course, a single code point isn’t always enough to encode one grapheme cluster; you might need multiple. You can combine many Latin characters with special code points to form characters such as f̃, w͜, or s̷̙̃, which are 2, 3, and 4 code points respectively. There are also scripts such as Hangul or Thai which make use of multiple code points that are combined when rendered – and then there are emojis.
Emojis easily combine many, many code points into one symbol. It begins with flag emojis such as 🇪🇺, which is actually a special “E” followed by “U”, continues with emojis such as 🧑‍🔬 (scientist), which is 🧑 (person) glued together with 🔬 (microscope) using a special joiner code point, and ends at the absolute pinnacle of code point combinations: the family emoji 👪. How do you make a family? Easy, you take a person (with optional skin tone and gender modifier) and glue it with another person, as well as their children. That way you can easily end up with a single “character” consisting of ten or more code points!
So to properly count “characters”, we need to advance the position not by a code point, but by an entire grapheme cluster. This is what “real” text programs such as LibreOffice do.
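As a rough sketch of what that could involve (assuming ICU as the segmentation library; count_grapheme_clusters is a hypothetical helper, not what lexy or LibreOffice actually do), one can count grapheme cluster boundaries up to the error offset:

```cpp
#include <unicode/brkiter.h>
#include <unicode/locid.h>
#include <unicode/unistr.h>
#include <memory>

// Count the grapheme clusters that start before `end_offset` (a UTF-16 offset
// into `text`), using ICU's character (i.e. grapheme cluster) break iterator.
int count_grapheme_clusters(const icu::UnicodeString& text, int32_t end_offset)
{
    UErrorCode status = U_ZERO_ERROR;
    std::unique_ptr<icu::BreakIterator> breaks(
        icu::BreakIterator::createCharacterInstance(icu::Locale::getDefault(), status));
    if (U_FAILURE(status))
        return -1;

    breaks->setText(text);

    auto count = 0;
    for (auto pos = breaks->first();
         pos != icu::BreakIterator::DONE && pos < end_offset; pos = breaks->next())
        ++count;
    return count;
}
```

The column would then be that count plus one, assuming 1-based columns.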
While this is certainly doable, it seems complicated (and I’m not even sure that covers emoji sequences…?). So before implementing it, let’s make sure this is actually the approach we want.
Approach #4: Count virtual columns
When reporting an error, the compiler also underlines the relevant part of the input:
```
error: this is not how things work!
my_really_cool_program(42);
                       ^^ this is wrong
```
For that, it needs to know how many spaces to print before printing the underline. If we define a column as that number of spaces, this is also referred to as a virtual column. It is reported by neovim’s virtcol() function and used by GCC since version 11 (as recommended by the GNU standard, apparently).
Counting the number of equivalent spaces is not trivial in general, as that depends on the font. However, here we can safely assume a monospace font where every glyph has the same width (mono space, right?).
Except of course it doesn’t.
Most Chinese, Japanese or Korean characters are rendered twice as wide as most other characters, even in a monospace font:
```
1234 // 4 characters
全角 // 2 characters
```
And there are also wide versions of some normal characters, such as Ａ (not A). But there is a Unicode standard and a lookup table, so that doesn’t seem too bad.
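For what it’s worth, POSIX exposes exactly such a lookup via wcwidth(), which reports how many terminal columns a wide character occupies. A small sketch, with the caveat that this is POSIX-only, locale-dependent, and its table can lag behind the Unicode standard:

```cpp
#include <clocale>
#include <cstdio>
#include <wchar.h> // wcwidth() is POSIX, not standard C++

int main()
{
    std::setlocale(LC_ALL, ""); // wcwidth() consults the active locale

    const wchar_t examples[] = {L'A', L'\u00E4', L'\u5168', L'\uFF21'}; // A, ä, 全, Ａ
    for (auto c : examples)
        std::printf("U+%04X is %d column(s) wide\n", static_cast<unsigned>(c), wcwidth(c));
}
```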
Except that this doesn’t cover emojis, which are also rendered twice as wide:
```
12
🙂
```
And then there is \t, the tab character. Dare I say and ask: How many spaces is a tab?
GCC seems to say “8”, for some reason. This awful choice means that the underline alignment breaks when I view an error message in neovim’s embedded terminal, where \t is rendered as four spaces, but the underline assumes it’s eight.
I mean, it would break if I were to use a literal \t character in my source code. I’m not a monster, so it doesn’t.
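To make matters worse, a tab doesn’t even advance by a fixed number of columns: terminals typically jump to the next tab stop, i.e. the next multiple of the tab width, so the resulting virtual column depends on where the tab starts. A sketch, with advance_past_tab as a hypothetical helper and tab_width standing in for whatever value the environment happens to use:

```cpp
// Given a 1-based virtual column pointing at a '\t', return the column after it,
// assuming tab stops at every multiple of `tab_width` (an environment-dependent value).
constexpr int advance_past_tab(int column, int tab_width)
{
    return ((column - 1) / tab_width + 1) * tab_width + 1;
}

static_assert(advance_past_tab(1, 8) == 9, "a tab at column 1 jumps to column 9");
static_assert(advance_past_tab(5, 8) == 9, "a tab at column 5 also jumps to column 9");
static_assert(advance_past_tab(5, 4) == 9, "with a tab width of 4, it jumps to column 9 as well");
```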
The incompatibilities between GCC and neovim don’t stop there either: remember those emojis glued together from multiple code points?
Well of course neovim doesn’t render them properly. 🧑‍🔬 isn’t displayed as 🧑‍🔬 but as 🧑<200d>🔬, where 200d is the value of the glue code point. This means that, according to neovim, 🧑‍🔬’s virtual column length is 2 (first emoji) + 6 (the length of '<200d>') + 2 (second emoji), so you need 10 spaces to account for it in the underline. GCC, however, prints only 4 spaces (2 for each emoji and 0 for the invisible glue code point), which means it also gets misaligned in neovim’s terminal.
And can you really blame it?
In my “real” terminal, 🧑‍🔬 is rendered as 🧑🔬, so printing four spaces is correct there (although that’s only because my terminal doesn’t render it properly either; if it did, it would be two). So to answer “how many spaces is this character wide?”, we still need to ask the environment/font we’re using – even for monospace fonts!
Needless to say, this approach doesn’t seem right either.
And now what?
So, to recap, we’ve seen four approaches:
Counting code units: simple and fast to compute, but might be surprising to users as it has no real relation to “character”.
Counting code points: more complicated than counting bytes and “more correct”, but still no real relation to “character”.
Counting grapheme clusters: even more complicated, but at least it corresponds to “character”.
Counting virtual columns: somehow even more complicated still, but at least it allows underlining the error message.
What should we do?
To answer that we have to take a step back and actually look at why we need column information in the first place. In particular, there are two distinct use cases: editors and compilers.
For an editor, we display columns to inform the user about the cursor position. There, I think counting grapheme clusters is the right approach. This has the advantage that the column directly corresponds to “how often do I need to press l (or the right arrow key) to go to that column”, as cursor movement is also based around grapheme clusters. Telling the user “you’re at position 5”, which means “press the arrow key five times to get there”, is quite nice.
For a compiler, we display columns so the user can locate the position of an error. If the user looks at the output and then manually goes to that error location, this should also be the number of grapheme clusters, as that corresponds to arrow movement.
But nobody looks at an error message and manually navigates to the location using the column information! Your IDE/vim setup automatically jumps to the error location (or you just look at the underline and manually go there without looking at the column at all).
This means that the error location should be written in a format that is easily parseable by the IDE, in units that are easy to compute – i.e. code units. Counting code units is simple and fast and there is only one unique definition of it.
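To underline just how cheap that is, here is a sketch (code_unit_column is a hypothetical helper, and line_start is assumed to point at the first code unit of the line containing position):

```cpp
#include <iterator>

// The code-unit column is simply the 1-based distance from the start of the line.
template <typename Iterator>
auto code_unit_column(Iterator line_start, Iterator position)
{
    return std::distance(line_start, position) + 1;
}
```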
Contrast this with virtual columns, which is what GCC is going to use: to compute it properly, it depends on the environment! In particular, neovim’s and GCC’s definitions disagree, which means automatic jumping to an error location is impossible. GNU’s decision to use virtual columns by default in the future seems misguided.
Don’t get me wrong – virtual columns have their place, e.g. for computing the underline. But even then, it is entirely non-trivial to compute: do I report the correct value of two for 🧑‍🔬, or am I bug-compatible with most terminals and say it’s four? In either case, it doesn’t work inside neovim, as there it’s rendered differently still. Not to mention tab, where there is no correct answer that works everywhere.
Using such a brittle unit with no clear definition in something that should be parsable by machines is just asking for trouble. I can understand why neovim chooses to use it as its column position: it is the one that closely resembles an actual column. But I don’t think even this is useful for a user: why would you need to know the equivalent number of spaces to indicate position?
That leaves code points, which are a middle ground: complicated to compute and not really useful for users. However, unlike code units, they are independent of the actual encoding. So if you have an input file in UTF-16, but the compiler uses UTF-8 internally, giving positions in code points gives the same result for compiler and editor.
One scenario where this happens is with the use of a language server. Input files are usually UTF-8, but the language server protocol assumes UTF-16. Indicating column information in code points would be ideal, but they use UTF-16 code units instead, which requires servers to transcode. Note that there is an open issue to use code points instead, as that would be portable.
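The transcoding that the protocol currently forces on servers isn’t hard, just busywork; here is a sketch of the idea for valid UTF-8 input, using a hypothetical utf16_code_units helper (a 4-byte UTF-8 sequence encodes a code point outside the BMP, which takes a surrogate pair, i.e. two UTF-16 code units):

```cpp
#include <cstddef>

// Count how many UTF-16 code units the UTF-8 range [begin, end) would occupy.
// Assumes the input is valid UTF-8.
template <typename Iterator>
std::size_t utf16_code_units(Iterator begin, Iterator end)
{
    std::size_t result = 0;
    for (auto iter = begin; iter != end; ++iter)
    {
        auto byte = static_cast<unsigned char>(*iter);
        if ((byte & 0b1100'0000) == 0b1000'0000)
            continue; // continuation byte, not the start of a code point
        // 4-byte sequences encode code points above U+FFFF, which need a surrogate pair.
        result += (byte & 0b1111'1000) == 0b1111'0000 ? 2 : 1;
    }
    return result;
}
```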
Conclusion
One table summary:

| Counting          | Machines | Humans             | Portable       |
|-------------------|----------|--------------------|----------------|
| Code units        | easy     | not useful         | no             |
| Code points       | moderate | not useful         | yes            |
| Grapheme clusters | hard     | useful             | yes            |
| Virtual columns   | hard     | not really useful? | absolutely not |
So, use code units as the unit if the location is intended to be parsed by machines (such as compiler error messages), and use grapheme clusters as the unit if the location is intended to be useful to humans (such as in text editors).
Use code points over code units if you need to communicate between different encodings.
Use virtual columns only if that’s what you actually need (e.g. to align multiple lines). Using it as a portable output format such as in error messages is just asking for trouble.
In lexy, the unit was and is actually customizable – so you can define column as “number of As in the line” if you want. But I’m definitely going to discuss this issue a bit more in the documentation.
via foonathan::blog()
September 1, 2022 at 03:03PM