Swap the role of graphemes & characters!
The terminology is rather confusing but the Unicode standard does seem to consider graphemes as clusters of characters.
Showing 1 changed file with 16 additions and 16 deletions.
@@ -1,34 +1,34 @@
 # Text
-We represent text, at a high level, much as many natural languages do:
-using sequences of "_characters_" (e.g. A, €, Ñ, Æ, ffi, क्ष, 공, นั่, 鬱, ♬, ⭐, 🇨🇦, 𓂀, 🏴☠️, 👩🏻❤️💋👨🏾, Ǫ̵̹̻̝̳͂̌̌͘, ﷺ)
-that are in turn composed of _graphemes_ (the smallest distinct meaningful units of writing);
-some characters consist of a single grapheme (e.g. A, €, ♬, ⭐, 𓂀) while
+We represent text, at a high level, much as many natural languages do: using sequences of _graphemes_,
+the smallest distinct meaningful units of writing (e.g. A, €, Ñ, Æ, ffi, क्ष, 공, นั่, 鬱, ♬, ⭐, 🇨🇦, 𓂀, 🏴☠️, 👩🏻❤️💋👨🏾, Ǫ̵̹̻̝̳͂̌̌͘, ﷺ),
+which are in turn composed of "_characters_";
+some graphemes consist of a single character (e.g. A, €, ♬, ⭐, 𓂀) while
 others combine two (e.g. Ñ, Æ, อ์, 🇨🇦), three (e.g. ffi, क्ष, 공, นั่, 🏴☠️) or even several (e.g. 鬱, 👩🏻❤️💋👨🏾, Ǫ̵̹̻̝̳͂̌̌͘, ﷺ).

 [Unicode](https://www.joelonsoftware.com/articles/Unicode.html) is the most comprehensive and ubiquitous _character set_:
-a method for mapping between graphemes (and some precomposed characters) and integers (AKA scalar values).
-It contains integer mappings for graphemes that can represent all living scripts as well as many historical ones,
+a method for mapping between characters (including some precomposed/composite ones) and integers (AKA scalar values).
+It contains integer mappings for characters that can represent all living scripts as well as many historical ones,
 various symbols (e.g. math, music, transport, science, games), emoji, etc.
-Each of the integers can be encoded by a small number (up to 4) of bytes.
+Each of the integers can be encoded by a small number (up to 4) of _bytes_.

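To make the mapping from characters to integers to bytes concrete, here is a minimal Python sketch. It assumes UTF-8 as the byte encoding, which the text above does not name, and `describe` is a hypothetical helper introduced only for illustration.

```python
def describe(character: str) -> tuple[int, bytes]:
    """Return the scalar value of one character and its UTF-8 bytes.

    Assumption: UTF-8 is the encoding in use; the helper name is illustrative.
    """
    return ord(character), character.encode("utf-8")


# Escapes are used so that each sample is exactly one character (code point).
samples = ["A", "\u00D1", "\u20AC", "\U00013080"]  # A, Ñ (precomposed), €, 𓂀

for sample in samples:
    scalar, encoded = describe(sample)
    # e.g. U+20AC takes 3 bytes: e2 82 ac
    print(f"U+{scalar:04X} takes {len(encoded)} byte(s): {encoded.hex(' ')}")
```

Under that assumption, the four samples occupy 1, 2, 3 and 4 bytes respectively, which matches the "up to 4" bound in the paragraph above.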
-Some older software uses a legacy mapping called ASCII, which uses a single byte per character and only works for a few scripts -
-it doesn’t even allow mixing text from multiple scripts.
+**Historical note**: some older software uses a legacy mapping called ASCII, which uses a single byte per character
+and only works for a few scripts - it doesn’t even allow mixing text from multiple scripts.

-In the parlance of computer science and software development, sequences of characters are known as **strings**
-(because we form them by _stringing_ together characters).
-Unlike natural languages, computers do not typically group characters into increasingly more complex sequences like phrases, sentences and paragraphs.
+In the parlance of computer science and software development, sequences of characters/graphemes are known as **strings**
+(because we form them by _stringing_ those together).
+Unlike natural languages, computers do not typically group graphemes into increasingly more complex sequences like phrases, sentences and paragraphs.
 That said, we often refer to smaller segments of a string as substrings.

 We can store the sequence of bytes that comprise a string in an _array_ (contiguous chunk of memory).
 _Recall that each element of an array can be addressed by its position in the sequence (typically starting at 0)._
 Arrays have one primary downside: once created, their capacity cannot be changed.
 This means that, once created, strings cannot be lengthened in-place.

-Aside from the grapheme representation method, the most important property of a string is its _length_.
-This may be computed as needed using a special character that marks the end of a string: the string terminator, which is a null byte (integer value 0). However, because characters may be composed of a variable number of graphemes, which in turn may be composed of a variable number of bytes,
-computing the number of characters (or even graphemes) in a string is a complex operation.
+An important property of a string is its _length_.
+This may be computed as needed using a special character that marks the end of a string: the string terminator, which is a null byte (integer value 0). However, because graphemes may be composed of a variable number of characters, which in turn may be composed of a variable number of bytes,
+computing the number of graphemes (or even characters) in a string is a complex operation.

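The three notions of length in the paragraph above (bytes, characters, graphemes) can be sketched in Python as follows. This assumes UTF-8 bytes with a trailing null terminator; all function names are hypothetical. The grapheme count is a deliberately crude approximation that only skips combining marks: it is not the Unicode segmentation algorithm and would miscount flags, emoji sequences and similar clusters.

```python
import unicodedata


def terminated_byte_length(buffer: bytes) -> int:
    """Scan for the null terminator and return the number of bytes before it."""
    return buffer.index(0)  # raises ValueError if there is no terminator


def count_characters(utf8: bytes) -> int:
    """Count characters by skipping UTF-8 continuation bytes (10xxxxxx)."""
    return sum(1 for byte in utf8 if (byte & 0xC0) != 0x80)


def approximate_grapheme_count(text: str) -> int:
    """Rough estimate only: count characters that are not combining marks."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


# "Ño" written with a decomposed Ñ: N + combining tilde (U+0303) + o.
text = "N\u0303o"
buffer = text.encode("utf-8") + b"\x00"

n_bytes = terminated_byte_length(buffer)
n_characters = count_characters(buffer[:n_bytes])
n_graphemes = approximate_grapheme_count(text)
print(n_bytes, n_characters, n_graphemes)  # 4 3 2
```

Even for this two-grapheme string the three counts differ, and each one requires walking the whole string, which is one reason to store a length alongside the content.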
 Alternatively, the length of the string may be stored along with its content for easy retrieval.
-This length never needs to be updated if we can never change the number of characters in a string.
+This length never needs to be updated if we can never change the _number_ of graphemes or characters in a string.
 Often we want to go even further & never modify the bytes in strings at all.
 The term used to describe data that can never be changed is **immutable** - as opposed to mutable.
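As an illustration of immutable versus mutable byte sequences, the sketch below uses Python's built-in `bytes` (immutable) and `bytearray` (mutable) types. Python is an assumption here, since the document names no language, and `try_overwrite` is a hypothetical helper.

```python
def try_overwrite(sequence, index: int, value: int) -> bool:
    """Attempt an in-place write; report whether the sequence allowed it."""
    try:
        sequence[index] = value
        return True
    except TypeError:
        # Immutable sequences reject item assignment.
        return False


frozen = b"text"             # immutable: its bytes can never change
scratch = bytearray(frozen)  # mutable copy of the same bytes

print(try_overwrite(frozen, 0, ord("T")))   # False
print(try_overwrite(scratch, 0, ord("T")))  # True

# "Lengthening" an immutable string builds a new object; the original is untouched.
longer = frozen + b"!"
print(frozen, longer, bytes(scratch))  # b'text' b'text!' b'Text'
```

Because `frozen` can never change, a length stored alongside it stays valid for its whole lifetime.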