Description
Consider to add an internal string data type as an alternate representation for lists of characters:
Complete strings
The tagged value is a byte-aligned pointer to a null-terminated sequence of UTF-8 characters on the heap (copy stack). This means that the full address-space of 32/64 bits cannot be used but some (1 or two) bits less than that. In many systems this is not a problem since the operating systems already makes similar assumptions anyway.
A list of n characters [C1, ..., CN]
would thus occupy between n+1 and n+8 bytes, depending on alignment, whereas the naive representation as a list of chars would take n · (3 · 8) bytes.
In this manner, the "tail" of the strings can be rapidly (constantly) computed which is key for fast parsing with DCGs. Note that SWI7's built-in string requires for sub_string(String, 1, _, 0, Tail)
time and space proportional to the length of String
.
Then, all operations from unification upwards need to be able to handle both lists of chars and that string type. That is indeed the more challenging part.
Partial strings
This would take the idea one step further. Instead of representing a list of known characters, a partial list of known characters would be represented compactly. E.g. Xs0 = [a,b,c,d|Xs]
would be represented as a null-terminated string "abcd" plus padding to the next word which represents Xs
. This would make library(pio)
very fast!
Just as a remark: Strings that contain the null-character can still be represented as a regular lists of characters.