Define a string `interfaceType` and establish documentation for well-known types

* [x] Proposed
* [ ] Prototype: Not Started
* [ ] Implementation: Not Started
* [ ] Specification: Not Started

## Summary

Right now there's no canonical method for representing strings in Harp as only [one register](https://github.com/harp-tech/protocol/blob/016171aa9fb802b2a6f4871d1fdf5caa2d4c2a05/Device.md#r_device_name-25-bytes--devices-name) is string-like, but there is a growing need for them. We should establish a clear standard to avoid diverging implementations.

## Motivation

Right now the only string-type register is [`R_DEVICE_NAME`](https://github.com/harp-tech/protocol/blob/016171aa9fb802b2a6f4871d1fdf5caa2d4c2a05/Device.md#r_device_name-25-bytes--devices-name). The structure of the string is partially specified in the register description, but is imprecise.

We're finding needs for [new string registers](https://github.com/harp-tech/protocol/pull/68), and ideally do not want to duplicate this description in multiple places.

Furthermore, code generators consuming the `device.yml` need to know how to interpret the data to naturally expose the register to their API consumers. We use `interfaceType` for this currently, but it's open-ended by design, and how generators interpret it is fully implementation-defined.

## Detailed Design

A string-typed register is defined as having a payload consisting of an array of bytes encoded using [UTF-8](https://en.wikipedia.org/wiki/UTF-8).

Harp messages associated with string registers have a `PayloadType` of `Type=1, HasTimestamp=dontcare, IsFloat=0, IsSigned=0`. The register specification in the `device.yml` shall have an `interfaceType` of `string`.

The length of the string is defined by the length of the message.

The null character (`'\0'`) has no special meaning per this specification; strings do not need to end with a null character.
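
As a rough illustration of the rules above, here is a minimal Python sketch of how a host-side library might encode and decode the payload of a string register. The function names are hypothetical and not part of any existing Harp library. The strict decoding is also an assumption, since this proposal doesn't yet say explicitly that the UTF-8 must be well-formed.

```python
def encode_string_payload(text: str) -> bytes:
    """Encode text as a string register payload: UTF-8 bytes, no terminator."""
    return text.encode("utf-8")


def decode_string_payload(payload: bytes) -> str:
    """Decode a string register payload.

    The string's length is simply the payload's length, and null bytes are
    ordinary characters. Strict decoding (an assumption, see above) raises
    UnicodeDecodeError on malformed UTF-8.
    """
    return payload.decode("utf-8", errors="strict")


name = "코 찌르기 1"
payload = encode_string_payload(name)
print(len(name), len(payload))  # 7 characters, 15 bytes

# Embedded nulls survive a round trip because they carry no special meaning.
with_null = encode_string_payload("a\0b")
print(decode_string_payload(with_null) == "a\0b")  # True
```

Note that the payload length is in bytes, not characters, so any register size limits apply to the encoded form.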

----------------

Separate but related: We should establish a list of well-known types that generators are expected to support. Any core register with an `interfaceType` must only use a well-known type.

## Drawbacks

None

## Alternatives

**Using null-terminated strings or another encoding like ASCII**

Short answer: No

<details>
<summary>Long answer (Click to expand)</summary>

TL;DR: David yells at legacy-shaped cloud

* Using null-terminated strings
  * Null-terminated strings are a bad experiment established early on as a convention for representing strings in C, and have been the source of a lot of issues.
  * Basically all modern languages (including C# and Python, and arguably even C++) store strings as a character array alongside an explicit length instead.
  * We already have a message length, so we might as well use it. Null termination only makes sense when you don't know the bounds of the data.
  * Not giving specific meaning to null characters allows them to be used within the string.
    * (Such as for lists of strings, which isn't an uncommon convention even in C.)
    * On the flip side, some protocols argue that null termination shouldn't be used, but null characters should still be banned or discouraged to avoid causing problems for clients which might unintentionally ascribe meaning to them. (I've never seen this be a problem in practice outside of C.)
  * C's convention for null-terminated strings has been a major source of problems in the modern day: It's the source of many security issues, poor (usually undefined) behavior of standard library functions when a buffer doesn't contain space for the terminator, poor performance characteristics for basic operations, problems using span-based memory conventions, and they make David soapbox for tens of minutes during Harp meetings! Look, he's doing it in this proposal too!
* Using ASCII or some other encoding
  * It is 2025. Everything besides UTF-8 is legacy. Encoding strings as anything else (especially for interchange) is baggage that should be left in the past.
  * Despite its historical prevalence, ASCII in particular is very USA English-centric. (You may think ASCII supports accented characters, but you were actually using Windows-1252 or ISO 8859 mislabeled as ASCII.)
  * Strings are mainly for human consumers, it's perfectly reasonable that a Korean scientist would want to name their device "코 찌르기 1"
  * Anything that needs to be parsed needs to be defined separately from this specification anyway. Characters beyond ASCII should naturally be excluded from needing to be handled by device code.
  * For cases where the embedded code does need to have proper UTF-8 awareness (such as slicing strings or rendering them to a display), it is trivial to implement sensibly. (I.e., we don't need to ship all of libicu or HarfBuzz to do something reasonable like replace non-ASCII characters with `?` or find character boundaries.)
  * If you think David soapboxing about the evils of null-terminated strings is bad, just wait until you get him started on string encoding.

</details>
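
To back up the claim in the long answer that basic UTF-8 awareness is cheap, here is a rough Python sketch of the two operations mentioned there: cutting a string at a character boundary and replacing non-ASCII characters with `?`. It works on raw bytes with simple bit tests, so the same logic should port to embedded C in a few lines. The function names are hypothetical, and the sketch assumes the input is already well-formed UTF-8.

```python
def is_continuation(byte: int) -> bool:
    """UTF-8 continuation bytes always have the bit pattern 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def truncate_utf8(data: bytes, max_len: int) -> bytes:
    """Cut data to at most max_len bytes without splitting a character."""
    if len(data) <= max_len:
        return data
    end = max_len
    # If the first dropped byte is a continuation byte, the cut falls inside
    # a multi-byte character, so back up to that character's lead byte.
    while end > 0 and is_continuation(data[end]):
        end -= 1
    return data[:end]


def ascii_fallback(data: bytes) -> bytes:
    """Replace each non-ASCII character with a single '?'."""
    out = bytearray()
    for byte in data:
        if byte < 0x80:
            out.append(byte)      # plain ASCII passes through
        elif not is_continuation(byte):
            out.append(ord("?"))  # lead byte: emit one '?' per character
        # continuation bytes are dropped
    return bytes(out)


raw = "코 찌르기 1".encode("utf-8")
print(ascii_fallback(raw))                      # b'? ??? 1'
print(truncate_utf8(raw, 5).decode("utf-8"))    # '코 ' (4 bytes kept)
```

Neither function needs tables or library support, only a mask-and-compare per byte.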

## Unresolved Questions

* Should the well-known types be part of the protocol spec or the currently non-existent `device.yml` documentation?
  * Right now things are only documented by [the JSON schema](https://github.com/harp-tech/protocol/blob/016171aa9fb802b2a6f4871d1fdf5caa2d4c2a05/schema/registers.json), which is not very discoverable.
  * I think it'd be helpful to document things in both places as the audiences are somewhat different.
  * Maybe the protocol could just link to the `device.yml` documentation from the `PayloadType` spec: "Further interpretation of the payload data is defined by [the `device.yml`](https://zombo.com/)."
  * (This would also be helpful for showing that the data may be structured/heterogeneous.)
* Should we explicitly allow fixed-length strings in the spec?
  * Variable-length arrays are currently not supported by the `device.yml` spec, so this proposal as written relies on [the variable-length register proposal](https://github.com/harp-tech/protocol/issues/116). This would be a way to allow strings without the variable-length requirement.
  * Even if that proposal is accepted, this would allow us to retroactively turn [`R_DEVICE_NAME`](https://github.com/harp-tech/protocol/blob/016171aa9fb802b2a6f4871d1fdf5caa2d4c2a05/Device.md#r_device_name-25-bytes--devices-name) into a string without also making it variable-length.
  * Spec should be amended to include the following:
  > For fixed-length registers only, the unused space beyond the end of the string (if any) shall be padded with null characters.
  * Is it weird for this specification to acknowledge the concept of a fixed-length register? That's a `device.yml` concept, not a Harp protocol one. (Depends on where this gets documented, I suppose.)
    * We could define a separate `stringz` type that is explicitly null-padded.
  * Note that this definition is subtly different from null-termination. (The string can still contain nulls as long as they're not the final character. This implicitly handles a string that matches the length of the register exactly.)
    * We could also define the string as being null-terminated if and only if it terminates before the end of the payload.
  * This also might enable using strings as fields within a `payloadSpec`
    * Or we say that a string within a `payloadSpec` consumes every byte from its offset to the end of the payload (implicitly saying it must always be the final field)
* Do we want to disallow null characters?
  * I think it's better to be unopinionated about the contents of a string, but it is true that null characters can cause problems when they pass through C code. As such I've seen file format, protocol, and language specifications forbid or discourage their use within strings, but I'm of the opinion this is rarely a problem in practice. (It's more important in textual formats like JSON.)
  * The "array of strings" example I gave in the ramblerant above is arguably a *different* interface type which should perhaps be specified separately.
  * It may be tempting to intentionally allow nulls to allow smuggling binary blobs in strings, but arbitrary binary data is not always valid UTF-8.
    * I feel like the spec as-written implies the UTF-8 has to be well-formed, but we could explicitly say it.
* Is tying this specification to things like "register" or `PayloadType` appropriate? One could feasibly want a string within a `payloadSpec`.
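
For the fixed-length question above, the proposed null-padding rule could look roughly like the following Python sketch. The function names are hypothetical, and the 25-byte length is taken from `R_DEVICE_NAME` purely as an example.

```python
def pad_fixed(text: str, register_len: int) -> bytes:
    """Encode text for a fixed-length register, padding with null bytes."""
    data = text.encode("utf-8")
    if len(data) > register_len:
        raise ValueError(f"string needs {len(data)} bytes, register holds {register_len}")
    # Fill the unused space beyond the end of the string with nulls.
    return data.ljust(register_len, b"\x00")


def unpad_fixed(payload: bytes) -> str:
    """Decode a fixed-length register by dropping only the trailing nulls."""
    return payload.rstrip(b"\x00").decode("utf-8")


padded = pad_fixed("Behavior", 25)
print(len(padded))          # 25
print(unpad_fixed(padded))  # Behavior

# Interior nulls are kept; only trailing ones are treated as padding.
print(unpad_fixed(pad_fixed("a\0b", 25)) == "a\0b")  # True

# A string that fills the register exactly needs no padding at all.
print(unpad_fixed(pad_fixed("x" * 25, 25)) == "x" * 25)  # True
```

The sketch also shows the trade-off noted above: a string whose final character is a null cannot round-trip, because that null is indistinguishable from padding.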

## Design Meetings

* Proposal stemmed from discussion during [2025-02-20 SRM](https://github.com/orgs/harp-tech/discussions/115)

