Description
What language does this apply to?
Java
If it's a proto syntax change, is it for proto2 or proto3?
If it's about generated code change, what programming language?
Not a proto syntax change (applies to both proto2 and proto3 payloads since it’s runtime parsing behavior).
Describe the problem you are trying to solve.
We recently implemented a high throughput incremental parser for a large, nested protobuf payload (telemetry/metrics style structure: repeated “resource blocks”, each with repeated “scopes”, repeated “measurements”, and repeated key/value “attributes”). The goal was to parse and validate only a subset of fields while minimizing allocations and CPU overhead (avoid constructing the full generated message object graph).
We found that protobuf’s Java runtime (CodedInputStream) supports the necessary low level primitives (readTag, pushLimit/popLimit, skipField, etc.), but it is difficult to use correctly and safely for these workloads. In particular:
- Aliasing is hard to reason about and hard to use safely.
  There is an enableAliasing(boolean) API, but it’s not obvious:
  - when it actually takes effect (implementation dependent),
  - what the correctness constraints are (buffer lifetime/pinning),
  - and what guarantees exist around readBytes() and readByteBuffer().
- Stream backed decoding can’t safely alias.
  For stream backed decoders, aliasing is typically not possible because internal buffers are reused/refilled. This is currently implied by implementation details (e.g. StreamDecoder’s enableAliasing being a no-op) rather than clearly expressed in API and docs.
- ByteBuffer views are especially risky.
  readByteBuffer() can return a ByteBuffer.slice() backed by the underlying array when aliasing is enabled (for array backed decoders). The code itself contains a TODO about making returned buffers read only. This indicates the current behavior is known to be risky: the buffer is mutable and stateful (position/limit), and it’s easy for callers to accidentally mutate it or misuse cursor state, causing subtle bugs.
- There isn’t a “how to do this” guide.
  We ended up learning patterns like the pushLimit/popLimit limit stack by reading internal code and trial/error. It’s not obvious to many developers why you must popLimit(oldLimit) after each nested message, or how the cursor advances through a length delimited field. This makes it harder than necessary to implement an incremental parser correctly.
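To make the limit stack concrete, here is a minimal sketch that uses a toy reader rather than the protobuf runtime. ToyReader, LimitStackDemo, and the hand-built payload are hypothetical names invented for illustration; only the method names pushLimit, popLimit, skipField, and isAtEnd are chosen to mirror CodedInputStream. The sketch shows the cursor advancing through a length delimited field and why the outer limit has to be restored after each nested message.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a cursor plus limit; NOT the protobuf runtime. */
class ToyReader {
    private final byte[] buf;
    private int pos = 0;   // cursor: index of the next unread byte
    private int limit;     // reads must stop before this index

    ToyReader(byte[] buf) { this.buf = buf; this.limit = buf.length; }

    boolean isAtEnd() { return pos >= limit; }

    /** Reads a base-128 varint that fits in 32 bits. */
    int readVarint32() {
        int result = 0;
        for (int shift = 0; shift < 35; shift += 7) {
            if (pos >= limit) throw new IllegalStateException("truncated");
            byte b = buf[pos++];
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) return result;
        }
        throw new IllegalStateException("malformed varint");
    }

    /** Narrows the limit to the next byteLimit bytes; returns the old limit. */
    int pushLimit(int byteLimit) {
        int newLimit = pos + byteLimit;
        if (byteLimit < 0 || newLimit > limit) throw new IllegalStateException("bad length");
        int oldLimit = limit;
        limit = newLimit;
        return oldLimit;
    }

    /** Restores the enclosing limit so the outer loop can continue. */
    void popLimit(int oldLimit) { limit = oldLimit; }

    private void skip(int n) {
        if (n < 0 || pos + n > limit) throw new IllegalStateException("truncated");
        pos += n;
    }

    /** Skips one field value; the wire type is the low 3 bits of the tag. */
    void skipField(int tag) {
        switch (tag & 7) {
            case 0: while ((buf[pos++] & 0x80) != 0) { } break; // varint
            case 1: skip(8); break;                              // fixed64
            case 2: skip(readVarint32()); break;                 // length delimited
            case 5: skip(4); break;                              // fixed32
            default: throw new IllegalStateException("unsupported wire type");
        }
    }
}

class LimitStackDemo {
    /** Collects field 1 (varint) of every nested message stored in outer field 1. */
    static List<Integer> collectIds(byte[] payload) {
        ToyReader in = new ToyReader(payload);
        List<Integer> ids = new ArrayList<>();
        while (!in.isAtEnd()) {
            int tag = in.readVarint32();
            if (tag == ((1 << 3) | 2)) {              // outer field 1, length delimited
                int length = in.readVarint32();
                int oldLimit = in.pushLimit(length);  // cursor may not pass the nested end
                while (!in.isAtEnd()) {
                    int inner = in.readVarint32();
                    if (inner == (1 << 3)) ids.add(in.readVarint32());
                    else in.skipField(inner);         // unwanted fields cost no allocation
                }
                in.popLimit(oldLimit);                // without this, the outer loop ends here
            } else {
                in.skipField(tag);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // nested{id=150, name="hi"}, outer field 2 = 7, nested{id=5}
        byte[] payload = {0x0A, 0x07, 0x08, (byte) 0x96, 0x01, 0x12, 0x02, 'h', 'i',
                          0x10, 0x07, 0x0A, 0x02, 0x08, 0x05};
        System.out.println(collectIds(payload)); // prints [150, 5]
    }
}
```

If the popLimit call is removed, the limit stays at the end of the first nested message, so the outer loop sees isAtEnd() and returns only the first id.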
We want to reduce friction for teams building allocation sensitive incremental parsers (telemetry/logs/metrics/event ingestion, etc.) while keeping protobuf’s correctness guarantees intact.
(We can’t share internal production code, but we can provide simplified snippets showing the relevant parsing patterns.)
Describe the solution you'd like
We’d like to propose a small set of Java runtime improvements split across focused PRs, so maintainers can review incrementally:
Proposed PR #1: Aliasing ergonomics and documentation in CodedInputStream (Java runtime)
Goals:
- Make aliasing behavior easier to understand and safer to use.
- Reduce “guessing” and reliance on implementation details.
Candidate changes (examples — we’d like maintainer feedback on API shape):
- Improve Javadoc and guidance around CodedInputStream.enableAliasing(boolean):
  - explain buffer lifetime requirements,
  - clarify which kinds of decoders can alias (array backed vs stream backed),
  - clarify interaction with readBytes() and readByteBuffer().
- Optionally add small introspection helpers such as:
  - supportsAliasing() (implementation capability),
  - isAliasingEnabled() (flag state),
  - and/or an explicit method like readBytesPossiblyAliased() to make “may alias” semantics visible at call sites.
- Add unit tests covering:
  - stream backed vs array backed behavior expectations,
  - correctness of returned values regardless of aliasing.
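The buffer lifetime requirement can be illustrated without the protobuf runtime. The sketch below uses only java.nio; AliasLifetimeDemo and its "internal buffer" are hypothetical stand-ins for a decoder that reuses its storage. It shows why a view that aliases a refillable buffer cannot be trusted after the next refill, while a copy can.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class AliasLifetimeDemo {
    /** Decodes the remaining bytes as UTF-8 without moving the buffer's position. */
    static String text(ByteBuffer b) {
        return StandardCharsets.UTF_8.decode(b.duplicate()).toString();
    }

    /** Returns {aliased view contents, copied contents} after the buffer is refilled. */
    static String[] viewAndCopyAfterRefill() {
        // Stand-in for a decoder's internal buffer holding the first chunk.
        byte[] internal = "key=abc".getBytes(StandardCharsets.UTF_8);

        // "Aliased" read: a view sharing storage with the internal buffer.
        ByteBuffer aliased = ByteBuffer.wrap(internal, 4, 3).slice();
        // "Copied" read: an independent array.
        byte[] copied = Arrays.copyOfRange(internal, 4, 7);

        // The decoder refills the same array with the next chunk.
        byte[] next = "key=xyz".getBytes(StandardCharsets.UTF_8);
        System.arraycopy(next, 0, internal, 0, next.length);

        return new String[] {text(aliased), new String(copied, StandardCharsets.UTF_8)};
    }

    public static void main(String[] args) {
        String[] r = viewAndCopyAfterRefill();
        System.out.println("aliased=" + r[0] + " copied=" + r[1]); // aliased=xyz copied=abc
    }
}
```

The same reasoning applies to caller-owned arrays: an aliased result stays valid only as long as the caller neither mutates nor recycles the array it handed to the decoder.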
Proposed PR #2: Long form “High performance parsing” guide (docs only)
Goals:
- Provide a clear, official guide for incremental parsing using CodedInputStream.
Contents:
- Explain tag loops, skipping fields, and nested parsing using pushLimit/popLimit.
- Include ASCII art style diagrams showing the cursor (pos) advancing through a buffer and how pushLimit/popLimit forms a limit stack.
- Include guidance on “read bytes first vs decode UTF-8 immediately”:
  - when it helps (dedup/interning / high cardinality attributes),
  - pitfalls (UTF-8 validation, security considerations),
  - and how aliasing affects buffer lifetime.
- Include a generic “telemetry style payload” case study section (no OTLP/OpenTelemetry specific code) that mirrors common real world nested/repeated payload shapes.
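As an illustration of the “read bytes first” idea, the following sketch interns attribute strings by their raw bytes and decodes UTF-8 only on a cache miss. It uses only the JDK; BytesFirstInterner and its fields are hypothetical names, and the byte range passed in stands for whatever slice a parser would hand over. The cache is left unbounded for brevity, which a real ingestion path would likely need to cap, since attacker-controlled high cardinality keys could otherwise grow it without limit.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class BytesFirstInterner {
    // ByteBuffer.equals/hashCode compare the remaining bytes, so it works as a content key.
    private final Map<ByteBuffer, String> cache = new HashMap<>();
    // REPORT makes invalid UTF-8 fail loudly instead of being silently replaced.
    private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
    int decodeCount = 0; // how many times UTF-8 decoding actually ran

    /** Returns the String for buf[off, off+len), decoding only on first sight. */
    String intern(byte[] buf, int off, int len) {
        // Lookup key is a view over the caller's array: no byte copy on a hit.
        String hit = cache.get(ByteBuffer.wrap(buf, off, len));
        if (hit != null) return hit;

        // Miss: copy the bytes so the cached key cannot change if buf is reused.
        byte[] copy = Arrays.copyOfRange(buf, off, off + len);
        String decoded;
        try {
            decoded = decoder.decode(ByteBuffer.wrap(copy)).toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("invalid UTF-8 in attribute", e);
        }
        decodeCount++;
        cache.put(ByteBuffer.wrap(copy).asReadOnlyBuffer(), decoded);
        return decoded;
    }

    public static void main(String[] args) {
        BytesFirstInterner interner = new BytesFirstInterner();
        byte[] buf = "hosthost".getBytes(StandardCharsets.UTF_8);
        String a = interner.intern(buf, 0, 4);
        String b = interner.intern(buf, 4, 4);
        System.out.println(a + " " + (a == b) + " " + interner.decodeCount); // host true 1
    }
}
```

Note that validation happens only once per distinct byte sequence, so any policy that depends on validating every occurrence needs to account for cache hits.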
Proposed PR #3: Safer ByteBuffer views when aliasing is enabled
Goals:
- Reduce accidental misuse of aliased ByteBuffer views.
Options (seeking maintainer preference):
- Minimal behavior change: when readByteBuffer() returns a view into an underlying array, return it as asReadOnlyBuffer() (aligns with existing TODO).
- Alternatively (or additionally), add a new API that makes this contract explicit (e.g. readReadOnlyByteBuffer() / readByteBufferPossiblyAliasedReadOnly()), leaving existing behavior unchanged for compatibility.
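The difference between the two kinds of view can be shown with java.nio alone. In this sketch ReadOnlyViewDemo and its helper methods are hypothetical; they model a view into a caller-owned array, not the runtime's actual code path.

```java
import java.nio.ByteBuffer;
import java.nio.ReadOnlyBufferException;

class ReadOnlyViewDemo {
    /** Mutable view that shares storage with the backing array. */
    static ByteBuffer mutableView(byte[] backing, int off, int len) {
        return ByteBuffer.wrap(backing, off, len).slice();
    }

    /** Same view, but writes through it are rejected. */
    static ByteBuffer readOnlyView(byte[] backing, int off, int len) {
        return ByteBuffer.wrap(backing, off, len).slice().asReadOnlyBuffer();
    }

    public static void main(String[] args) {
        byte[] backing = {1, 2, 3, 4};

        // A write through the mutable view silently changes the shared array.
        mutableView(backing, 1, 2).put(0, (byte) 9);
        System.out.println(backing[1]); // prints 9

        // The read-only view throws instead of corrupting the input.
        try {
            readOnlyView(backing, 1, 2).put(0, (byte) 7);
        } catch (ReadOnlyBufferException e) {
            System.out.println("write rejected");
        }

        // Compatibility note: read-only heap buffers do not expose their array.
        System.out.println(readOnlyView(backing, 1, 2).hasArray()); // prints false
    }
}
```

Two caveats follow from this. A read-only buffer still carries its own position and limit, so it addresses accidental mutation but not cursor misuse. And because hasArray() returns false on a read-only buffer, callers that currently rely on array() would break under the minimal behavior change, which is one argument for the separate explicit API.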
Describe alternatives you've considered
- Use generated message parsing (parseFrom) and accept allocations.
  This was not viable for our ingestion workload where we parse large nested batches and only need certain fields. Full object graph allocation overhead dominated runtime/GC.
- Write custom decoding without protobuf runtime.
  This would be error prone and would lose protobuf’s correctness guarantees and compatibility benefits.
- Manually copy all bytes always.
  Works but defeats the purpose of aliasing/zero copy optimizations and increases allocations significantly.
- Keep improvements internal only.
  This would avoid upstreaming but leaves the ecosystem with the same discoverability/safety issues. We’d prefer to contribute improvements so other teams can avoid the same pitfalls.
Additional context
- This request is motivated by implementing a production incremental parser for a nested, telemetry style protobuf payload with very high repetition of small strings (attribute keys/values) and large repeated arrays of messages.
- The runtime already supports the needed primitives; the primary issues are:
  - discoverability/documentation of correct patterns,
  - clearer aliasing semantics/capabilities,
  - and safer handling of ByteBuffer views (read only enforcement).
- We can provide simplified code snippets (generic parsing loops with pushLimit/popLimit) and performance/GC rationale, but we cannot share internal code verbatim due to legal review requirements.