KV-pair IR serialization of CLP-encoded strings inherits text IR's ~2 GiB logtype/variable size limits #2175

@junhaoliao

Bug

The KV-pair IR serializer delegates to the text IR CLP-encoding functions when serializing string values that contain spaces. These text IR functions enforce an INT32_MAX (~2 GiB) limit on logtype and dictionary variable strings. Since virtually all log messages contain spaces, this limit effectively applies to all log event message strings serialized into KV-pair IR.

How the limit propagates

  1. serialize_value_string() checks if the string contains a space:

    • No space → calls serialize_string(), which supports up to UINT32_MAX (~4 GiB) — not affected.
    • Has space → calls serialize_clp_string(), which in turn calls four_byte_encoding::serialize_message() or eight_byte_encoding::serialize_message().
  2. serialize_logtype() fails if the logtype exceeds INT32_MAX:

    } else if (length <= INT32_MAX) {
        ir_buf.push_back(cProtocol::Payload::LogtypeStrLenInt);
        serialize_int(static_cast<int32_t>(length), ir_buf);
    } else {
        // Logtype is too long for encoding
        return false;
    }
  3. DictionaryVariableHandler::operator() similarly fails if a single dictionary variable string exceeds INT32_MAX.
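The propagation above can be modeled with a minimal, self-contained sketch. This is not the CLP source: `LenTag`, `clp_string_len_tag()`, `plain_string_len_tag()`, and `uses_clp_encoding()` are hypothetical names, and the one-byte and two-byte branches are assumed from the `} else if` structure of the excerpt. It only illustrates how the two paths accept different maximum lengths.

```cpp
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical stand-ins for the length-tag choices; these are NOT the real
// cProtocol::Payload constants.
enum class LenTag { UByte, UShort, Int32, UInt32 };

// Models the logtype / dictionary-variable path: the widest length field is a
// signed 32-bit integer, so anything above INT32_MAX is rejected.
constexpr std::optional<LenTag> clp_string_len_tag(std::uint64_t length) {
    if (length <= UINT8_MAX) { return LenTag::UByte; }    // assumed branch
    if (length <= UINT16_MAX) { return LenTag::UShort; }  // assumed branch
    if (length <= static_cast<std::uint64_t>(INT32_MAX)) { return LenTag::Int32; }
    return std::nullopt;  // "Logtype is too long for encoding"
}

// Models the plain-string path: the widest length field is an unsigned 32-bit
// integer, so the ceiling is UINT32_MAX.
constexpr std::optional<LenTag> plain_string_len_tag(std::uint64_t length) {
    if (length <= UINT8_MAX) { return LenTag::UByte; }    // assumed branch
    if (length <= UINT16_MAX) { return LenTag::UShort; }  // assumed branch
    if (length <= UINT32_MAX) { return LenTag::UInt32; }
    return std::nullopt;
}

// Models the dispatch in serialize_value_string(): a single space is enough to
// select the CLP-encoded path.
constexpr bool uses_clp_encoding(std::string_view value) {
    return value.find(' ') != std::string_view::npos;
}
```

Under this model, a length of INT32_MAX + 1 is rejected on the CLP-encoded path but still accepted on the plain-string path, which is the asymmetry this issue describes. The sketch checks the logtype length directly; in the real serializer the logtype is derived from the message, so its length can differ from the length of the input string.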

The IR protocol constants (protocol_constants.hpp) define LogtypeStrLenInt (0x23), whose length field is a signed 32-bit integer; this is the root cause of the ~2 GiB ceiling. In contrast, the KV-pair IR native string path uses StrLenUInt (0x43), whose length field is an unsigned 32-bit integer (protocol_constants.hpp:61), supporting ~4 GiB.

Additionally, the IR stream preamble metadata (serialize_metadata()) is capped at UINT16_MAX (65,535 bytes), shared by both text IR and KV-pair IR.

Size limits

| Limit | Value | Source |
| --- | --- | --- |
| CLP-encoded logtype string | ~2 GiB (INT32_MAX) | encoding_methods.cpp:84-89 |
| CLP-encoded dictionary variable | ~2 GiB (INT32_MAX) | encoding_methods.cpp:61-65 |
| Plain string (KV-pair IR native) | ~4 GiB (UINT32_MAX) | utils.cpp:45-48 |
| IR stream preamble metadata | 64 KiB (UINT16_MAX) | utils.cpp:25-30 |

Practical impact today

These limits are not the binding constraint today. The log-converter has a 64 MiB buffer limit per log event (LogConverter.hpp:41) which is hit first, and JSON ingestion defaults to 512 MiB per record (--max-document-size). The ~2 GiB IR limit would become the bottleneck only if those upstream limits are raised.
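To make the ordering concrete, the limits quoted in this issue can be written out as byte counts and checked at compile time. This is a sketch only: the constant names are hypothetical and do not come from the CLP codebase, and the values are taken from the figures stated above.

```cpp
#include <cstdint>

// Hypothetical constant names; the values are the limits quoted in this issue.
constexpr std::uint64_t cKiB = 1024ULL;
constexpr std::uint64_t cMiB = 1024ULL * cKiB;
constexpr std::uint64_t cGiB = 1024ULL * cMiB;

constexpr std::uint64_t cLogConverterBufferLimit = 64 * cMiB;      // per log event
constexpr std::uint64_t cJsonMaxDocumentSizeDefault = 512 * cMiB;  // per record
constexpr std::uint64_t cClpStringLimit = INT32_MAX;    // logtype / dictionary variable
constexpr std::uint64_t cPlainStringLimit = UINT32_MAX; // KV-pair IR native string
constexpr std::uint64_t cMetadataLimit = UINT16_MAX;    // IR stream preamble metadata

// The "~" in the size table: each IR limit is one byte short of a power of two.
static_assert(cClpStringLimit == 2 * cGiB - 1);
static_assert(cPlainStringLimit == 4 * cGiB - 1);
static_assert(cMetadataLimit == 64 * cKiB - 1);

// The upstream limits are hit before the CLP-encoded string limit.
static_assert(cLogConverterBufferLimit < cJsonMaxDocumentSizeDefault);
static_assert(cJsonMaxDocumentSizeDefault < cClpStringLimit);
```

The last two assertions encode the claim in this section: an event has to pass the 64 MiB or 512 MiB upstream limit before the ~2 GiB IR limit can matter.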

Future impact on log-viewer

Once #2174 (MongoDB 16 MiB BSON limit for search results) is resolved and large log events can be retrieved through the WebUI, the log-viewer's extraction path could also be affected. Currently:

  • The clp_s log-viewer extracts ordered JSON chunks via JsonConstructor (clp-s x --ordered), which does not go through the KV-pair IR serializer. However, if this extraction path is ever changed to use KV-pair IR, the same limits would apply.
  • The clp engine log-viewer extracts text IR streams via clo i, but the text IR logtype/variable limits were already enforced during ingestion, so extraction would not introduce new failures.

CLP version

3b4d13f

Environment

Any environment that serializes string values into KV-pair IR. Today this is the log-converter during unstructured text ingestion in the CLP-JSON package, though the 64 MiB LogConverter buffer limit is hit first.

Reproduction steps

  1. Bypass the LogConverter's 64 MiB buffer by calling the KV-pair IR Serializer directly (e.g., in a unit test) with a string value larger than INT32_MAX (~2 GiB) that includes at least one space.
  2. Observe that serialize_logtype() returns false, causing the serialization to fail.
