
AVRO Schema Peculiarities



Naming Limitations

AVRO named items (including enum items) cannot start with a number (specification, issue ticket). Some of the location names used in early versions of JAWS had a letter prefix (S1D - S5D). Ops doesn't like this. Workarounds include:

  1. Have ops just "get over it" and accept that names can't start with numbers
    • This is a common limitation - Java Enums can't start with a number either, but it is easy to add an alias
  2. Provide a lookup-map object in the global config topic that provides an operator-friendly name (String values in a map don't have naming limits).
  3. Use the AVRO Enum aliases field (needs to be tested to check whether the same limits apply).
  4. Don't use AVRO enums when enumerating things (use a string and lose in-place enforcement of valid items).

We've landed on the last option - don't use AVRO enums for locations. There was another reason not to use enums - they're like an RDBMS CHECK constraint - part of the DDL - and what is often nicer is to use regular database rows to store this information and reference it with foreign keys. OK, there are no foreign keys in Kafka (Validation Gateway maybe?), but we now store locations in a separate alarm-locations topic so that it is easy for admins to define locations (in the admin GUI).
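To illustrate the trade-off, here is a minimal sketch (in Python, using fastavro; the field and record names are illustrative, not the actual JAWS schemas) contrasting an enum location field, whose symbols must match [A-Za-z_][A-Za-z0-9_]*, with the plain-string field we settled on:

import fastavro

# Enum variant: a symbol like "1D" would be rejected because enum symbols
# must follow the same naming rule as other AVRO names.
enum_location = {
    "name": "location",
    "type": {
        "type": "enum",
        "name": "AlarmLocation",
        "symbols": ["S1D", "S2D", "S3D", "S4D", "S5D"],
    },
}

# String variant (what we landed on): any value is accepted in-place;
# validity must be checked against the alarm-locations topic instead.
string_location = {"name": "location", "type": "string"}

# Both variants parse; the difference is where validity gets enforced.
fastavro.parse_schema({"type": "record", "name": "EnumExample", "fields": [enum_location]})
fastavro.parse_schema({"type": "record", "name": "StringExample", "fields": [string_location]})

With the string variant, the check against known locations has to happen in application code (or a future Validation Gateway) rather than in the schema itself.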

Java Specific Logical Types

When you use the AVRO built-in "compiler" to generate Java classes from an AVRO *.avsc file, it creates classes tied to a SCHEMA$ class variable, but that embedded schema may differ from the schema supplied. In particular, Java-specific logical types may be added (in order to provide metadata to other Java ecosystem tools). This causes issues when you're attempting to establish language-agnostic shared schemas (issue ticket). One specific problem is that the schema registry doesn't treat the original and modified schemas as identical, so two schemas are ultimately registered and then schema compatibility comes into play. Example Schemas:

Original:

{
     "name": "username",
     "type": "string"
}

Modified:

{
     "name": "username",
     "type": {
        "type": "string",
        "avro.java.string": "String"
      }
}

Workarounds include:

  1. Don't use the built-in compiler - hand craft Java classes
  2. Modify the generated classes (SCHEMA$ class variables) to remove Java specific logical types.
  3. Set the "one true schema" to have Java specific types (even in Python jaws-libp)
    • The AVRO specification says unknown logical types should fall back to the native base type (needs to be tested)
  4. Let Schema Compatibility do its thing (needs to be tested)

More information is needed about why the Java compiler has this behavior and the ramifications of a schema without Java-specific logical types (how critical is that metadata and can it be provided in alternative ways). By default AVRO treats a plain "string" type to mean AVRO's internal Utf8 CharSequence. However, AVRO always uses the Utf8 CharSequence internally (the binary wire format doesn't change), so I think we're simply talking about the Java class accessor methods and whether they provide a plain java.lang.String or the internal Utf8 CharSequence. If true, the classes could simply have conversion methods exposed (with optional conversion caching, as this whole debacle is about performance - of course Java developers want to simply use java.lang.String).
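One quick way to check the "wire format doesn't change" claim is to encode the same record with both schemas and compare bytes. The sketch below uses fastavro (an assumption - any AVRO implementation should do) and the example username schemas from above:

import io
import fastavro

original = fastavro.parse_schema({
    "type": "record",
    "name": "ExampleOriginal",
    "fields": [{"name": "username", "type": "string"}],
})

# Same field, but annotated the way the Java compiler annotates it.
modified = fastavro.parse_schema({
    "type": "record",
    "name": "ExampleModified",
    "fields": [{
        "name": "username",
        "type": {"type": "string", "avro.java.string": "String"},
    }],
})

def encode(schema, record):
    buf = io.BytesIO()
    fastavro.schemaless_writer(buf, schema, record)
    return buf.getvalue()

# The extra property only affects Java accessor types, not the encoding.
assert encode(original, {"username": "alice"}) == encode(modified, {"username": "alice"})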

Schema Registries

AVRO does not include schemas inside messages, but they are required to be available at runtime and optionally at buildtime. AVRO makes a distinction between the schema used to write a message and the schema used to read a message and they are always treated separately, even if they happen to be the same. This is done to support schema evolution in a partially automated way. Without this, you'd have to upgrade all apps that use a modified schema in lock-step, which doesn't scale well as the number of apps grows and probably also precludes rolling upgrades if you happen to have multiple instances of a given app. However, the number of ways that you can change a schema at runtime without breaking running apps is basically limited to adding new fields that existing clients can simply ignore, deprecating fields for future removal, and simply buying time for all apps to upgrade.
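For example, here is a sketch (using fastavro and made-up record/field names) of the one evolution that works well: the reader schema adds a field with a default, so a message written with the old schema still decodes for upgraded consumers.

import io
import fastavro

writer_schema = fastavro.parse_schema({
    "type": "record",
    "name": "AlarmV1",
    "fields": [{"name": "name", "type": "string"}],
})

reader_schema = fastavro.parse_schema({
    "type": "record",
    "name": "AlarmV1",
    "fields": [
        {"name": "name", "type": "string"},
        # New optional field: existing messages resolve to the default.
        {"name": "location", "type": ["null", "string"], "default": None},
    ],
})

buf = io.BytesIO()
fastavro.schemaless_writer(buf, writer_schema, {"name": "alarm1"})
buf.seek(0)

# Schema resolution: read old bytes with the new (reader) schema.
decoded = fastavro.schemaless_reader(buf, writer_schema, reader_schema)
assert decoded == {"name": "alarm1", "location": None}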

AVRO allows dynamic schema discovery and parsing of messages for consumers/readers with newly discovered schemas at runtime via the GenericRecord interface. GenericRecord also allows producers/writers to skip typed code generation if they desire. Alternatively, the AVRO SpecificRecord interface can be used when schemas are matched to typed classes (usually class code is auto-generated from .avsc schema files by scripts) at buildtime. We use SpecificRecords because our applications know what to expect at build time - we're not creating some generic data consumer app that gobbles up data given new schemas at runtime that it is seeing for the first time.

For runtime schema lookup we use the Confluent Schema Registry - mostly because it is sorta built-in with the "standard" Confluent AVRO serializers. For buildtime schemas we bundle them inside language API libraries: jaws-libj (Java) and jaws-libp (Python). This means the schemas are distributed in multiple places - a little awkward. In our case we ALWAYS have the schema available in advance (at build and runtime) - so do we even need the schema registry, other than the fact that it is tightly integrated with the ecosystem and the Confluent AVRO serdes REQUIRE it? As far as schema evolution goes - that's built into AVRO itself, and as far as a repo of schema versions (history) goes - that's in our git repo and could be included in API libs if needed. Not sure who Virginia is (the Confluent blog insists "Yes, Virginia, You Really Do Need a Schema Registry"), but that's not us, and it seems the Schema Registry is adding a ton of technical complexity to solve a problem we don't really have (at least not yet).
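For reference, this is roughly what the tight coupling looks like in the Confluent Python client (broker/registry URLs, topic, and schema below are placeholders, not JAWS configuration): the AVRO serializer can't even be constructed without a SchemaRegistryClient.

from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import SerializationContext, MessageField

schema_str = """
{
  "type": "record",
  "name": "ExampleValue",
  "fields": [{"name": "username", "type": "string"}]
}
"""

# The serializer registers/looks up the schema in the registry on use.
registry = SchemaRegistryClient({"url": "http://localhost:8081"})
serializer = AvroSerializer(registry, schema_str)

producer = Producer({"bootstrap.servers": "localhost:9092"})
value = serializer(
    {"username": "alice"},
    SerializationContext("example-topic", MessageField.VALUE),
)
producer.produce("example-topic", value=value)
producer.flush()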

Schema References

It is sometimes useful to define an AVRO entity in one place and then reference it from many other places. For example, the jaws-effective-processor joins many topics together (and therefore is joining schemas), and without references we'd just need to copy and paste schemas. References avoid duplicate specification and are the alternative to nesting identical type specifications. However, schema references are ill-defined and support is incomplete across the entire Kafka AVRO ecosystem.

Schema references are being used experimentally at this time. See: Confluent Python API does not support schema references.

Note: It's possible to use build tooling to copy and paste nested definitions inline at build time. This lets developers define/re-use schemas in one place, while at run time they're simply duplicated in an automated way to ease issues with support in client libraries and schema registries (a sketch of this idea follows).
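A sketch of that build-time approach (file names, namespace, and layout here are hypothetical, not the actual JAWS build): recursively replace string references to named types with their full definitions before registering or publishing the schema.

import json

def inline_refs(node, definitions, seen):
    """Recursively replace references to named types with their definitions.
    A name is only expanded on first use; later uses stay as references,
    which is still valid AVRO within a single schema document."""
    if isinstance(node, str) and node in definitions and node not in seen:
        seen.add(node)
        return inline_refs(definitions[node], definitions, seen)
    if isinstance(node, list):
        return [inline_refs(item, definitions, seen) for item in node]
    if isinstance(node, dict):
        return {key: inline_refs(value, definitions, seen) for key, value in node.items()}
    return node

# Hypothetical inputs: a shared record definition plus a schema that
# references it by fully-qualified name.
with open("AlarmLocation.avsc") as f:
    location = json.load(f)
with open("AlarmInstance.avsc") as f:
    instance = json.load(f)

definitions = {"org.jlab.jaws.entity.AlarmLocation": location}
with open("build/AlarmInstance.avsc", "w") as f:
    json.dump(inline_refs(instance, definitions, set()), f, indent=2)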

Note: The AVRO project supports references via the import statement if using the newer IDL format. Unfortunately Schema Registry doesn't support this format at the moment.


Unions

AVRO Union types allow specification of a field that can be one of many choices. This includes indicating that a field may be null (a union consisting of null and a non-null type). For us, the registered alarm producer field is a complex type with a set of different possible alarm producers, and each producer has different fields, so they're modeled as a union. It's a similar scenario for overrides - we have multiple different overrides with different fields. The same goes for active alarms.

For AVRO Unions we avoid placing this structure at the root. Instead we use a single field msg, which is a union of records. The msg field appears unnecessary, and as an alternative the entire value could have been a union. However, a nested union is less problematic than a union at the AVRO root (confluent blog). If a union were used at the root then (1) the schema must be pre-registered with the registry instead of being created on-the-fly by clients, and (2) the AVRO serializer must have additional configuration:

auto.register.schemas=false
use.latest.version=true

This may appear especially odd with messages that have no fields. For example the value is:

{"msg":{}}

instead of:

{}

Further, a union at the root makes it harder to reference the union as a whole, as discussed here: https://github.com/JeffersonLab/jaws-libp/issues/6.

When attempting to serialize a union the Python API has two strategies:

  1. Use a tuple to indicate the type and value
  2. Just use the value and let the API try to guess what branch of the union you mean

The latter is dangerous and messy and has already bitten us - make sure to use the explicit tuple specification.
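Here is a sketch of the tuple strategy with made-up override record names (the Confluent Python serializer uses fastavro under the hood, which documents this tuple notation): the tuple names the union branch so nothing is guessed.

import io
import fastavro

# Two union branches with identical fields: a bare dict would be ambiguous.
schema = fastavro.parse_schema({
    "type": "record",
    "name": "ExampleOverride",
    "namespace": "example",
    "fields": [{
        "name": "msg",
        "type": [
            {"type": "record", "name": "DisabledOverride",
             "fields": [{"name": "comments", "type": ["null", "string"], "default": None}]},
            {"type": "record", "name": "ShelvedOverride",
             "fields": [{"name": "comments", "type": ["null", "string"], "default": None}]},
        ],
    }],
})

# The tuple names the branch explicitly instead of letting the API guess.
value = {"msg": ("example.DisabledOverride", {"comments": "by ops"})}

buf = io.BytesIO()
fastavro.schemaless_writer(buf, schema, value)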

Unions are powerful, and coupled with Kafka's key=value message format you can define a composite key such that the type in the value varies based on the key. There are currently no built-in constraints enforcing that keys match union values though. We've encountered scenarios where we weren't careful and had alarm override keys indicating Disabled while the value was for something else like Shelved - applications must be careful.

Validation Gateway

The New York Times uses a gateway to validate data, so maybe we should too.

Schema Enforcement:
Schemas are a great step towards imposing order on chaos; however, they require that all clients play nicely - producers are simply assumed to use the serializers they're supposed to and to use the version of a schema they're supposed to. There is actually nothing stopping a client from writing a bunch of random bytes to a topic (assuming they're authorized to write to the topic). In other words, Kafka servers don't validate data. I think the paid Confluent server might, but stock Kafka does not. One common way to handle this is to have "raw" writable topics for producers that are read by a gateway Kafka Streams app, which validates the data and passes good data on to a new vetted topic and bad data to a dead letter queue.
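A rough sketch of that gateway in Python (topic names are placeholders, and a real deployment would more likely be a Kafka Streams app as described above): anything that fails to deserialize against its registered schema is routed to the dead letter queue.

from confluent_kafka import Consumer, Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroDeserializer
from confluent_kafka.serialization import SerializationContext, MessageField

registry = SchemaRegistryClient({"url": "http://localhost:8081"})
deserializer = AvroDeserializer(registry)

def is_valid(topic, raw_bytes):
    """Accept only messages that deserialize against their registered schema;
    additional cross-field or cross-topic checks could be added here."""
    try:
        deserializer(raw_bytes, SerializationContext(topic, MessageField.VALUE))
        return True
    except Exception:
        return False

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "validation-gateway",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["alarm-overrides-raw"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    # Good data moves on to the vetted topic; bad data goes to the DLQ.
    target = "alarm-overrides" if is_valid(msg.topic(), msg.value()) else "alarm-overrides-dlq"
    producer.produce(target, key=msg.key(), value=msg.value())
    producer.poll(0)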

Database Constraints:
This gateway could also validate Event Sourcing database-like constraints such as foreign keys and composite keys. Constraints between topics or between key and value are not captured by AVRO schemas.
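For example, a key-vs-value consistency check of the kind mentioned in the Unions section could be as simple as the sketch below (the naming convention is hypothetical - in practice it depends on how the override key and the union branch names actually correspond):

def key_matches_value(key_override_type, value_msg):
    """key_override_type is the type named in the key, e.g. "Disabled";
    value_msg is the (branch_name, record) tuple used when serializing."""
    branch_name, _record = value_msg
    return branch_name.endswith(key_override_type + "Override")

assert key_matches_value("Disabled", ("example.DisabledOverride", {"comments": None}))
assert not key_matches_value("Shelved", ("example.DisabledOverride", {"comments": None}))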