Define a string interfaceType and establish documentation for well-known types #118

Open
@PathogenDavid

  • Proposed
  • Prototype: Not Started
  • Implementation: Not Started
  • Specification: Not Started

Summary

Right now there's no canonical method for representing strings in Harp, since only one register is string-like, but there is a growing need for string registers. We should establish a clear standard to avoid diverging implementations.

Motivation

Right now the only string-type register is R_DEVICE_NAME. The structure of the string is partially specified in the register description, but is imprecise.

We're finding needs for new string registers, and ideally do not want to duplicate this description in multiple places.

Furthermore, code generators consuming the device.yml need to know how to interpret the data to naturally expose the register to their API consumers. We use interfaceType for this currently, but it's open-ended by design, and how generators interpret it is fully implementation-defined.

Detailed Design

A string-typed register is defined as having a payload consisting of an array of bytes encoded using UTF-8.

Harp messages associated with string registers have a PayloadType of Type=1, HasTimestamp=dontcare, IsFloat=0, IsSigned=0. The register specification in the device.yml shall have an interfaceType of string.

The length of the string is defined by the length of the message.

The null character ('\0') has no special meaning per this specification; strings do not need to end with a null character.
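
A minimal host-side sketch of the rules above, in Python. The function names are hypothetical, and the PayloadType bit masks are an assumption about the protocol's flag layout that should be checked against the Harp protocol spec. Only the payload bytes are modelled here, not the message framing.

```python
# Assumed PayloadType bit layout (verify against the Harp protocol spec).
TYPE_MASK = 0x0F      # element size in bytes
HAS_TIMESTAMP = 0x10  # ignored for strings (HasTimestamp=dontcare)
IS_FLOAT = 0x40
IS_SIGNED = 0x80


def is_string_payload_type(payload_type: int) -> bool:
    """True for Type=1, IsFloat=0, IsSigned=0, with or without a timestamp."""
    is_byte_sized = (payload_type & TYPE_MASK) == 1
    has_forbidden_flags = bool(payload_type & (IS_FLOAT | IS_SIGNED))
    return is_byte_sized and not has_forbidden_flags


def encode_string(value: str) -> bytes:
    """Encode a string as raw payload bytes. No terminator is appended."""
    return value.encode("utf-8")


def decode_string(payload: bytes) -> str:
    """Decode the whole payload; the message length is the string length.

    Null bytes are treated as ordinary characters. Malformed UTF-8 raises
    UnicodeDecodeError rather than being silently repaired.
    """
    return payload.decode("utf-8")
```

Because the length comes from the message, `decode_string(b"a\x00b")` returns a three-character string with the null preserved in the middle.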


Separate but related: We should establish a list of well-known types that generators are expected to support. Any core register with an interfaceType must only use a well-known type.

Drawbacks

None

Alternatives

Using null-terminated strings or another encoding like ASCII

Short answer: No

Long answer

TL;DR: David yells at legacy-shaped cloud

  • Using null-terminated strings
    • Null-terminated strings are a bad experiment established early on as a convention for representing strings in C, and have been the source of a lot of issues.
    • Basically all modern languages (including C# and Python, and arguably even C++) store strings as a character array alongside an explicit length instead.
    • We already have a message length, so we might as well use it. Null termination only makes sense when you don't know the bounds of the data.
    • Not giving specific meaning to null characters allows them to be used within the string.
      • (Such as for lists of strings, which isn't an uncommon convention even in C.)
      • On the flip side, some protocols argue that null termination shouldn't be used, but null characters should still be banned or discouraged to avoid causing problems for clients which might unintentionally ascribe meaning to them. (I've never seen this be a problem in practice outside of C.)
    • C's convention for null-terminated strings has been a major source of problems in the modern day: It's the source of many security issues, poor (usually undefined) behavior of standard library functions when a buffer doesn't contain space for the terminator, poor performance characteristics for basic operations, problems using span-based memory conventions, and they make David soapbox for tens of minutes during Harp meetings! Look, he's doing it in this proposal too!
  • Using ASCII or some other encoding
    • It is 2025. Everything besides UTF-8 is legacy. Encoding strings as anything else (especially for interchange) is baggage that should be left in the past.
    • Despite its historical prevalence, ASCII in particular is very US-English-centric. (You may think ASCII supports accented characters, but you were actually using Windows-1252 or ISO 8859 mislabeled as ASCII.)
    • Strings are mainly for human consumers; it's perfectly reasonable that a Korean scientist would want to name their device "코 찌르기 1"
    • Anything that needs to be parsed needs to be defined separately from this specification anyway. Characters beyond ASCII should naturally be excluded from needing to be handled by device code.
    • For cases where the embedded code does need to have proper UTF-8 awareness (such as slicing strings or rendering them to a display), it is trivial to implement sensibly. (i.e., we don't need to ship all of libicu or Harfbuzz to do something reasonable like replace non-ASCII characters with ? or find character boundaries.)
    • If you think David soapboxing about the evils of null-terminated strings is bad, just wait until you get him started on string encoding.
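
To illustrate the claim above that basic UTF-8 awareness is cheap, here is a sketch of the two operations mentioned (replacing non-ASCII characters with ? and finding a character boundary). It is written in Python for brevity, but it only inspects individual bytes, so the same logic should port directly to device code in C. The function names are hypothetical, and the input is assumed to be well-formed UTF-8.

```python
def ascii_fallback(payload: bytes) -> str:
    """Replace each non-ASCII character (not each byte) with a single '?'."""
    out = []
    for byte in payload:
        if byte < 0x80:
            out.append(chr(byte))  # plain ASCII, keep as-is
        elif (byte & 0xC0) != 0x80:
            out.append("?")        # lead byte of a multi-byte character
        # continuation bytes (0b10xxxxxx) produce no output
    return "".join(out)


def truncate_at_boundary(payload: bytes, max_bytes: int) -> bytes:
    """Cut payload to at most max_bytes without splitting a character."""
    if len(payload) <= max_bytes:
        return payload
    end = max_bytes
    # If the first dropped byte is a continuation byte, the cut falls inside
    # a character, so back up to that character's lead byte.
    while end > 0 and (payload[end] & 0xC0) == 0x80:
        end -= 1
    return payload[:end]
```

For the device name "코 찌르기 1", `ascii_fallback` yields "? ??? 1", and truncating its 15-byte encoding to 5 bytes keeps only the first 4 bytes ("코 ") rather than cutting the second Hangul syllable in half.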

Unresolved Questions

  • Should the well-known types be part of the protocol spec or of the currently non-existent device.yml documentation?
    • Right now things are only documented by the JSON schema, which is not very discoverable.
    • I think it'd be helpful to document things in both places as the audiences are somewhat different.
    • Maybe the protocol could just link to the device.yml documentation from the PayloadType spec: "Further interpretation of the payload data is defined by the device.yml."
    • (This would also be helpful for showing that the data may be structured/heterogeneous.)
  • Should we explicitly allow fixed-length strings in the spec?
    • Variable-length arrays are currently not supported by the device.yml spec, so this proposal as written relies on the variable-length register proposal. This would be a way to allow strings without the variable-length requirement.
    • Even if that proposal is accepted, this would allow us to retroactively turn R_DEVICE_NAME into a string without also making it variable-length.
    • Spec should be amended to include the following:

    For fixed-length registers only, the unused space beyond the end of the string (if any) shall be padded with null characters.

    • Is it weird for this specification to acknowledge the concept of a fixed-length register? That's a device.yml concept, not a Harp protocol one. (Depends on where this gets documented, I suppose.)
      • We could define a separate stringz type that is explicitly null-padded.
    • Note that this definition is subtly different from null-termination. (The string can still contain nulls as long as they're not the final character. This implicitly handles a string that matches the length of the register exactly.)
      • We could also define the string as being null-terminated if and only if it terminates before the end of the payload.
    • This also might enable using strings as fields within a payloadSpec
      • Or we say that a string within a payloadSpec consumes every byte from its offset to the end of the payload (implicitly saying it must always be the final field)
  • Do we want to disallow null characters?
    • I think it's better to be unopinionated about the contents of a string, but it is true that null characters can cause problems when they pass through C code. As such I've seen file format, protocol, and language specifications forbid or discourage their use within strings, but I'm of the opinion this is rarely a problem in practice. (It's more important in textual formats like JSON.)
    • The "array of strings" example I gave in the ramblerant above is arguably a different interface type which should perhaps be specified separately.
    • It may be tempting to intentionally allow nulls to allow smuggling binary blobs in strings, but arbitrary binary data is not always valid UTF-8.
      • I feel like the spec as-written implies the UTF-8 has to be well-formed, but we could explicitly say it.
  • Is tying this specification to things like "register" or PayloadType appropriate? One could feasibly want a string within a payloadSpec.
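
The null-padding rule floated above for fixed-length registers could be sketched as follows. This is only an illustration of that unresolved option, not settled behavior. The function names and the `register_length` parameter are hypothetical.

```python
def pad_fixed(value: str, register_length: int) -> bytes:
    """Encode value and pad the unused tail of the register with nulls."""
    data = value.encode("utf-8")
    if len(data) > register_length:
        raise ValueError("string does not fit in the register")
    return data.ljust(register_length, b"\x00")


def unpad_fixed(payload: bytes) -> str:
    """Strip trailing null padding only; interior nulls are preserved."""
    return payload.rstrip(b"\x00").decode("utf-8")
```

A string that exactly fills the register gets no padding at all, and interior nulls survive the round trip. The one thing this scheme cannot represent is a string whose final character is itself a null, which is the subtle difference from null-termination noted above.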

Design Meetings
