Skip to content

Efficient alternate representations for lists of characters (strings). #24

Open
@UWN

Description

@UWN

Consider to add an internal string data type as an alternate representation for lists of characters:

Complete strings

The tagged value is a byte-aligned pointer to a null-terminated sequence of UTF-8 characters on the heap (copy stack). This means that the full address-space of 32/64 bits cannot be used but some (1 or two) bits less than that. In many systems this is not a problem since the operating systems already makes similar assumptions anyway.

A list of n characters [C1, ..., CN] would thus occupy between n+1 and n+8 bytes, depending on alignment, whereas the naive representation as a list of chars would take n · (3 · 8) bytes.

In this manner, the "tail" of the strings can be rapidly (constantly) computed which is key for fast parsing with DCGs. Note that SWI7's built-in string requires for sub_string(String, 1, _, 0, Tail) time and space proportional to the length of String.

Then, all operations from unification upwards need to be able to handle both lists of chars and that string type. That is indeed the more challenging part.

Partial strings

This would take the idea one step further. Instead of representing a list of known characters, a partial list of known characters would be represented compactly. E.g. Xs0 = [a,b,c,d|Xs] would be represented as a null-terminated string "abcd" plus padding to the next word which represents Xs. This would make library(pio) very fast!

Just as a remark: Strings that contain the null-character can still be represented as a regular lists of characters.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions