Swap the role of graphemes & characters!
The terminology is rather confusing but the Unicode standard does seem to consider graphemes as clusters of characters.
adsouza authored Jun 22, 2024
1 parent 81ac1a2 commit b586322
Showing 1 changed file with 16 additions and 16 deletions.
32 changes: 16 additions & 16 deletions text.md
@@ -1,34 +1,34 @@
# Text
-We represent text, at a high level, much as many natural languages do:
-using sequences of "_characters_" (e.g. A, €, Ñ, Æ, ffi, क्ष, 공, นั่, 鬱, ♬, ⭐, 🇨🇦, 𓂀, 🏴‍☠️, 👩🏻‍❤️‍💋‍👨🏾, Ǫ̵̹̻̝̳͂̌̌͘, ﷺ)
-that are in turn composed of _graphemes_ (the smallest distinct meaningful units of writing);
-some characters consist of a single grapheme (e.g. A, €, ♬, ⭐, 𓂀) while
+We represent text, at a high level, much as many natural languages do: using sequences of _graphemes_,
+the smallest distinct meaningful units of writing (e.g. A, €, Ñ, Æ, ffi, क्ष, 공, นั่, 鬱, ♬, ⭐, 🇨🇦, 𓂀, 🏴‍☠️, 👩🏻‍❤️‍💋‍👨🏾, Ǫ̵̹̻̝̳͂̌̌͘, ﷺ),
+which are in turn composed of "_characters_";
+some graphemes consist of a single character (e.g. A, €, ♬, ⭐, 𓂀) while
others combine two (e.g. Ñ, Æ, อ์, 🇨🇦), three (e.g. ffi, क्ष, 공, นั่, 🏴‍☠️) or even several (e.g. 鬱, 👩🏻‍❤️‍💋‍👨🏾, Ǫ̵̹̻̝̳͂̌̌͘, ﷺ).
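
As a rough illustration of graphemes built from several characters, the Python sketch below lists the code points behind a few of them. It assumes that "character" here means a Unicode code point (which is what Python's `len` counts on a `str`); the helper name `code_points` and the sample labels are hypothetical, chosen only for this example.

```python
import unicodedata


def code_points(text: str) -> list[str]:
    """Return the Unicode name of every code point in `text`."""
    return [unicodedata.name(ch, f"U+{ord(ch):04X}") for ch in text]


# Escapes are used so the exact code points are visible in the source.
samples = {
    "A": "A",
    "N with tilde (decomposed)": "N\u0303",
    "Canadian flag": "\U0001F1E8\U0001F1E6",
    "Thai syllable": "\u0E19\u0E31\u0E48",
}

for label, text in samples.items():
    # len() counts code points, not graphemes as a reader perceives them.
    print(f"{label}: {len(text)} code point(s) -> {code_points(text)}")

# Some combinations also exist as a single precomposed code point.
precomposed = unicodedata.normalize("NFC", "N\u0303")
print(len(precomposed), code_points(precomposed))
```

Each sample renders as one unit on screen, yet the counts differ, which is why counting graphemes needs more than `len`.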

[Unicode](https://www.joelonsoftware.com/articles/Unicode.html) is the most comprehensive and ubiquitous _character set_:
-a method for mapping between graphemes (and some precomposed characters) and integers (AKA scalar values).
-It contains integer mappings for graphemes that can represent all living scripts as well as many historical ones,
+a method for mapping between characters (including some precomposed/composite ones) and integers (AKA scalar values).
+It contains integer mappings for characters that can represent all living scripts as well as many historical ones,
various symbols (e.g. math, music, transport, science, games), emoji, etc.
-Each of the integers can be encoded by a small number (up to 4) of bytes.
+Each of the integers can be encoded by a small number (up to 4) of _bytes_.
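
The text does not name a specific encoding; assuming UTF-8, the short Python sketch below shows how many bytes a few of the characters above occupy. The helper name `utf8_width` is hypothetical.

```python
def utf8_width(ch: str) -> int:
    """Number of bytes needed to encode a single code point in UTF-8."""
    return len(ch.encode("utf-8"))


# One sample per possible width, written as escapes for clarity.
for ch in ["A", "\u00D1", "\u20AC", "\U00013080"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {utf8_width(ch)} byte(s): {encoded.hex(' ')}")
```

Under this assumption the width ranges from one byte for A to four bytes for 𓂀, matching the "up to 4" limit stated above.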

-Some older software uses a legacy mapping called ASCII, which uses a single byte per character and only works for a few scripts -
-it doesn’t even allow mixing text from multiple scripts.
+**Historical note**: some older software uses a legacy mapping called ASCII, which uses a single byte per character
+and only works for a few scripts - it doesn’t even allow mixing text from multiple scripts.

-In the parlance of computer science and software development, sequences of characters are known as **strings**
-(because we form them by _stringing_ together characters).
-Unlike natural languages, computers do not typically group characters into increasingly more complex sequences like phrases, sentences and paragraphs.
+In the parlance of computer science and software development, sequences of characters/graphemes are known as **strings**
+(because we form them by _stringing_ those together).
+Unlike natural languages, computers do not typically group graphemes into increasingly more complex sequences like phrases, sentences and paragraphs.
That said, we often refer to smaller segments of a string as substrings.

We can store the sequence of bytes that comprise a string in an _array_ (contiguous chunk of memory).
_Recall that each element of an array can be addressed by its position in the sequence (typically starting at 0)._
Arrays have one primary downside: once created, their capacity cannot be changed.
This means that, once created, strings cannot be lengthened in-place.

-Aside from the grapheme representation method, the most important property of a string is its _length_.
-This may be computed as needed using a special character that marks the end of a string: the string terminator, which is a null byte (integer value 0). However, because characters may be composed of a variable number of graphemes, which in turn may be composed of a variable number of bytes,
-computing the number of characters (or even graphemes) in a string is a complex operation.
+An important property of a string is its _length_.
+This may be computed as needed using a special character that marks the end of a string: the string terminator, which is a null byte (integer value 0). However, because graphemes may be composed of a variable number of characters, which in turn may be composed of a variable number of bytes,
+computing the number of graphemes (or even characters) in a string is a complex operation.
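
The terminator scan can be sketched roughly as follows. This is only an illustration: Python's own strings are not null-terminated, so a `bytes` buffer stands in for raw memory, UTF-8 is assumed as the encoding, and the function name `terminated_length` is hypothetical.

```python
def terminated_length(buffer: bytes) -> int:
    """Count the bytes that precede the first null terminator."""
    for position, byte in enumerate(buffer):
        if byte == 0:
            return position
    raise ValueError("buffer has no null terminator")


# "N" + combining tilde + "o", encoded as UTF-8 and null-terminated.
buffer = "N\u0303o".encode("utf-8") + b"\x00"

byte_count = terminated_length(buffer)
# Decoding is required before code points can be counted at all.
character_count = len(buffer[:byte_count].decode("utf-8"))

print(byte_count, character_count)
```

The scan yields 4 bytes and decoding yields 3 code points, while a reader sees only two graphemes. Counting those reliably requires Unicode's text segmentation rules, which Python's standard library does not implement.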

Alternatively, the length of the string may be stored along with its content for easy retrieval.
-This length never needs to be updated if we can never change the number of characters in a string.
+This length never needs to be updated if we can never change the _number_ of graphemes or characters in a string.
Often we want to go even further & never modify the bytes in strings at all.
The term used to describe data that can never be changed is **immutable** - as opposed to mutable.
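
One possible way to combine a stored length with immutability is sketched below in Python. The class name `ImmutableString` and its fields are hypothetical, the stored length is assumed to count code points, and UTF-8 is assumed as the encoding.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class ImmutableString:
    data: bytes   # encoded content; bytes objects are themselves immutable
    length: int   # number of code points, computed once at creation

    @classmethod
    def from_text(cls, text: str) -> "ImmutableString":
        return cls(data=text.encode("utf-8"), length=len(text))


s = ImmutableString.from_text("N\u0303o")
print(s.length, len(s.data))

try:
    s.length = 10  # any attempt to modify a frozen instance is rejected
except FrozenInstanceError:
    print("cannot modify an immutable string")
```

Because neither field can change after creation, the stored length can never drift out of sync with the content.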