Java CodedInputStream: aliasing ergonomics + high-performance parsing documentation #25136

### **What language does this apply to?**
Java

Not a proto syntax change (applies to both proto2 and proto3 payloads since it’s runtime parsing behavior).

### **Describe the problem you are trying to solve.**

We recently implemented a high-throughput incremental parser for a large, nested protobuf payload (a telemetry/metrics-style structure: repeated “resource blocks”, each with repeated “scopes”, repeated “measurements”, and repeated key/value “attributes”). The goal was to parse and validate only a subset of fields while minimizing allocations and CPU overhead, in particular by not constructing the full generated-message object graph.

We found that protobuf’s Java runtime (`CodedInputStream`) supports the necessary low-level primitives (`readTag`, `pushLimit`/`popLimit`, `skipField`, etc.), but they are difficult to use correctly and safely for these workloads. In particular:

* Aliasing is hard to reason about and hard to use safely.
   There is an `enableAliasing(boolean)` API, but it’s not obvious:
  * when it actually takes effect (implementation-dependent),
  * what the correctness constraints are (buffer lifetime/pinning),
  * and what guarantees exist around `readBytes()` and `readByteBuffer()`.
* Stream-backed decoding can’t safely alias.
   For stream-backed decoders, aliasing is typically not possible because internal buffers are reused/refilled. This is currently implied by implementation details (e.g. `StreamDecoder`’s `enableAliasing` being a no-op) rather than clearly expressed in the API and docs.
* `ByteBuffer` views are especially risky.
   `readByteBuffer()` can return a slice (`ByteBuffer.slice()`) that shares the underlying array when aliasing is enabled (for array-backed decoders). The code itself contains a `TODO` about making returned buffers read-only. This indicates the current behavior is known to be risky: the buffer is mutable and stateful (position/limit), and it’s easy for callers to accidentally mutate it or misuse cursor state, causing subtle bugs.
* There isn’t a “how to do this” guide.
   We ended up learning patterns like the `pushLimit`/`popLimit` limit stack by reading internal code and by trial and error. It’s not obvious to many developers why you must call `popLimit(oldLimit)` after each nested message, or how the cursor advances through a length-delimited field. This makes it harder than necessary to implement an incremental parser correctly.
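
To illustrate the limit stack mentioned in the last bullet, here is a minimal, JDK-only toy model. It is not the protobuf runtime: `LimitCursor` and `LimitStackDemo` are hypothetical names, and the toy only mimics the shape of `pushLimit`/`popLimit` (narrow the limit, hand back the previous one, restore it afterwards). The payload layout (a nested message in field 1 holding varints, plus a varint in field 2) is an assumption chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy cursor with a single "current limit"; NOT the protobuf runtime. */
final class LimitCursor {
    private final byte[] buf;
    private int pos = 0;
    private int limit; // absolute end offset of the innermost message being parsed

    LimitCursor(byte[] buf) {
        this.buf = buf;
        this.limit = buf.length;
    }

    boolean isAtEnd() {
        return pos >= limit;
    }

    /** Reads a base-128 varint (32-bit only in this toy), never reading past the limit. */
    int readVarint32() {
        int result = 0;
        for (int shift = 0; shift < 35; shift += 7) {
            if (pos >= limit) throw new IllegalStateException("truncated input");
            byte b = buf[pos++];
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) return result;
        }
        throw new IllegalStateException("malformed varint");
    }

    /** Narrows the limit to the next byteLength bytes and returns the previous limit. */
    int pushLimit(int byteLength) {
        if (byteLength < 0 || byteLength > limit - pos) {
            throw new IllegalStateException("length exceeds enclosing limit");
        }
        int oldLimit = limit;
        limit = pos + byteLength;
        return oldLimit;
    }

    /** Restores the limit returned by the matching pushLimit call. */
    void popLimit(int oldLimit) {
        limit = oldLimit;
    }
}

final class LimitStackDemo {
    /** Collects varints from nested field 1 (length-delimited) and outer varint fields. */
    static List<Integer> parse(byte[] payload) {
        LimitCursor in = new LimitCursor(payload);
        List<Integer> values = new ArrayList<>();
        while (!in.isAtEnd()) {
            int tag = in.readVarint32();
            int field = tag >>> 3;
            int wireType = tag & 7;
            if (field == 1 && wireType == 2) {
                int length = in.readVarint32();
                int oldLimit = in.pushLimit(length); // inner loop now stops at the nested end
                while (!in.isAtEnd()) {
                    in.readVarint32();               // inner tag (toy: assumed to be a varint field)
                    values.add(in.readVarint32());   // inner value
                }
                in.popLimit(oldLimit);               // without this, the outer loop would stop here
            } else if (wireType == 0) {
                values.add(in.readVarint32());
            } else {
                throw new IllegalStateException("wire type not handled by this toy: " + wireType);
            }
        }
        return values;
    }
}
```

For the six bytes `0A 02 08 05 10 07`, `LimitStackDemo.parse` yields `[5, 7]`. If the `popLimit(oldLimit)` call is removed, the outer loop finds itself at the (still narrowed) limit right after the nested message and field 2 is silently dropped, which is the kind of bug the proposed guide would help people avoid.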

We want to reduce friction for teams building allocation-sensitive incremental parsers (telemetry/logs/metrics/event ingestion, etc.) while keeping protobuf’s correctness guarantees intact.

(We can’t share internal production code, but we can provide simplified snippets showing the relevant parsing patterns.)


### **Describe the solution you'd like**

We’d like to propose a small set of Java runtime improvements split across focused PRs, so maintainers can review incrementally:

Proposed PR #1: Aliasing ergonomics and documentation in `CodedInputStream` (Java runtime)

Goals:
* Make aliasing behavior easier to understand and safer to use.
* Reduce “guessing” and reliance on implementation details.

Candidate changes (examples; we’d like maintainer feedback on API shape):

* Improve Javadoc and guidance around `CodedInputStream.enableAliasing(boolean)`:
  * explain buffer lifetime requirements,
  * clarify which kinds of decoders can alias (array-backed vs stream-backed),
  * clarify the interaction with `readBytes()` and `readByteBuffer()`.
* Optionally add small introspection helpers such as:
  * `supportsAliasing()` (implementation capability),
  * `isAliasingEnabled()` (flag state),
  * and/or an explicit method like `readBytesPossiblyAliased()` to make “may alias” semantics visible at call sites.
* Add unit tests covering:
  * stream-backed vs array-backed behavior expectations,
  * correctness of returned values regardless of aliasing.
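
The buffer lifetime requirement is easy to demonstrate with the JDK alone, which may be a useful basis for the Javadoc wording. The sketch below is not protobuf code: `AliasLifetimeDemo` and its methods are hypothetical, and `refill` merely simulates a decoder overwriting its internal buffer with the next chunk of a stream.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

final class AliasLifetimeDemo {
    /** A view that shares storage with buf, as an aliased read would. */
    static ByteBuffer aliasedView(byte[] buf, int off, int len) {
        return ByteBuffer.wrap(buf, off, len).slice();
    }

    /** A private copy, as a non-aliased read would return. */
    static ByteBuffer copiedView(byte[] buf, int off, int len) {
        return ByteBuffer.wrap(Arrays.copyOfRange(buf, off, off + len));
    }

    /** Simulates a stream-backed decoder reusing its internal buffer for the next chunk. */
    static void refill(byte[] buf, byte[] nextChunk) {
        System.arraycopy(nextChunk, 0, buf, 0, Math.min(buf.length, nextChunk.length));
    }
}
```

After `refill`, the aliased view reflects the new bytes while the copy still holds the original field value. This is why an aliased value is only valid for as long as the caller keeps the backing array unchanged, and why a decoder that refills its buffer cannot hand out aliases safely.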

Proposed PR #2: Long-form “High-performance parsing” guide (docs only)

Goals:
* Provide a clear, official guide for incremental parsing using `CodedInputStream`.

Contents:
* Explain tag loops, skipping fields, and nested parsing using `pushLimit`/`popLimit`.
* Include ASCII-art diagrams showing the cursor (`pos`) advancing through a buffer and how `pushLimit`/`popLimit` form a limit stack.
* Include guidance on “read bytes first vs decode UTF-8 immediately”:
  * when it helps (dedup/interning of frequently repeated attribute strings),
  * pitfalls (UTF-8 validation, security considerations),
  * and how aliasing affects buffer lifetime.
* Include a generic “telemetry-style payload” case study section (no OTLP/OpenTelemetry-specific code) that mirrors common real-world nested/repeated payload shapes.
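
As a rough idea of what the “read bytes first” guidance could show, here is a JDK-only sketch of interning strings by their raw bytes. Everything in it is an assumption for illustration: `Utf8Interner` is a hypothetical helper, not a protobuf API, and it takes a plain `byte[]` range where a real parser would use whatever the runtime returned.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: decodes each distinct byte sequence once and reuses the String. */
final class Utf8Interner {
    private final Map<ByteBuffer, String> cache = new HashMap<>();
    private int decodeCount = 0;

    String intern(byte[] buf, int off, int len) {
        // Probe with a view over the input: no copy and no decoding on a cache hit.
        ByteBuffer probe = ByteBuffer.wrap(buf, off, len);
        String hit = cache.get(probe);
        if (hit != null) {
            return hit;
        }
        String decoded;
        try {
            // Strict decoding: reject malformed UTF-8 instead of substituting U+FFFD.
            decoded = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(probe.duplicate())
                    .toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("invalid UTF-8 in field", e);
        }
        decodeCount++;
        // The cache key must own its bytes: the input buffer may be reused later.
        cache.put(ByteBuffer.wrap(Arrays.copyOfRange(buf, off, off + len)), decoded);
        return decoded;
    }

    int decodeCount() {
        return decodeCount;
    }
}
```

The sketch touches all three sub-points: repeated keys are decoded once, validation still happens (only on the first occurrence), and the cached key is copied so it does not depend on the lifetime of the input buffer. A production version would also need a size bound, since an unbounded cache fed by attacker-controlled or high-cardinality values can grow without limit.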

Proposed PR #3: Safer `ByteBuffer` views when aliasing is enabled

Goals:
* Reduce accidental misuse of aliased `ByteBuffer` views.

Options (seeking maintainer preference):

* Minimal behavior change: when `readByteBuffer()` returns a view into an underlying array, return a read-only view via `asReadOnlyBuffer()` (aligns with the existing `TODO`).
* Alternatively (or additionally), add a new API that makes this contract explicit (e.g. `readReadOnlyByteBuffer()` / `readByteBufferPossiblyAliasedReadOnly()`), leaving existing behavior unchanged for compatibility.
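
The difference between the two kinds of view can be shown with the JDK alone. This is a sketch of the `java.nio` behavior the first option relies on, not of protobuf internals; `ReadOnlyViewDemo` and its methods are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.ReadOnlyBufferException;

final class ReadOnlyViewDemo {
    /** A writable slice sharing storage with buf. */
    static ByteBuffer mutableView(byte[] buf, int off, int len) {
        return ByteBuffer.wrap(buf, off, len).slice();
    }

    /** The same slice, wrapped so that writes through the view are rejected. */
    static ByteBuffer readOnlyView(byte[] buf, int off, int len) {
        return ByteBuffer.wrap(buf, off, len).slice().asReadOnlyBuffer();
    }

    /** Tries to overwrite the first byte of the view; returns true if the write was rejected. */
    static boolean writeRejected(ByteBuffer view) {
        try {
            view.put(0, (byte) 0);
            return false;
        } catch (ReadOnlyBufferException e) {
            return true;
        }
    }
}
```

A write through the mutable slice lands in the shared array, whereas the read-only view throws `ReadOnlyBufferException` and does not expose the array via `hasArray()`/`array()`. Note that a read-only view only blocks writes made through that view: it still shares storage, so the buffer lifetime constraints of aliasing continue to apply.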

### **Describe alternatives you've considered**

* Use generated message parsing (`parseFrom`) and accept allocations.
   This was not viable for our ingestion workload, where we parse large nested batches and only need certain fields. Full object graph allocation overhead dominated runtime/GC.
* Write custom decoding without the protobuf runtime.
   This would be error-prone and would lose protobuf’s correctness guarantees and compatibility benefits.
* Always copy all bytes manually.
   Works, but defeats the purpose of aliasing/zero-copy optimizations and increases allocations significantly.
* Keep improvements internal only.
   This would avoid upstreaming but leaves the ecosystem with the same discoverability/safety issues. We’d prefer to contribute improvements so other teams can avoid the same pitfalls.

### **Additional context**

* This request is motivated by implementing a production incremental parser for a nested, telemetry-style protobuf payload with very high repetition of small strings (attribute keys/values) and large repeated arrays of messages.
* The runtime already supports the needed primitives; the primary issues are:
  * discoverability/documentation of correct patterns,
  * clearer aliasing semantics/capabilities,
  * and safer handling of `ByteBuffer` views (read-only enforcement).
* We can provide simplified code snippets (generic parsing loops with `pushLimit`/`popLimit`) and performance/GC rationale, but we cannot share internal code verbatim due to legal review requirements.
