feat(clp-s): Add support for detecting unstructured log inputs; Explicitly reject unstructured log inputs during compression. #1434

gibber9809 · 2025-10-17T17:44:39Z

Description

This PR adds support for detecting that an input might be unstructured logs during ingestion, and uses that support to explicitly reject unstructured log inputs during compression. This feature will also be used by the upcoming utility that converts unstructured logs to a structured representation for ingestion into clp-s.

The heuristic for detecting logtext is straightforwards -- we simply check if the first peeked chunk of the input stream corresponds to valid UTF-8, while accounting for the fact that the end of the buffer might contain truncated UTF-8.

Since a UTF-8 codepoint is at most 4 bytes, we know that we must scan at most 7 bytes to find the last complete UTF-8 codepoint in our peeked input. This is because in the worst case the stream terminates with a 4-byte codepoint followed by a 4-byte codepoint with its last byte truncated. The approach, then, is to scan the last 7 bytes of the peeked buffer to find the last complete codepoint and then validate that the entire buffer through that last complete codepoint corresponds to valid UTF-8. If there is not a complete codepoint in the last 7 bytes, we also reject the input without having to validate the whole buffer.

Checklist

The PR satisfies the contribution guidelines.
This is a breaking change and that has been indicated in the PR title, OR this isn't a
breaking change.
Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Manually validated that various combinations of truncated UTF-8 are accepted as logtext
Manually validated that various combinations of invalid UTF-8 are not accepted as logtext

Summary by CodeRabbit

New Features
- Added detection and classification of "logtext" input files during ingestion.
Bug Fixes / Behaviour
- When a logtext file is supplied for direct ingestion, the system now fails fast with a clear error and stops the write operation.

coderabbitai · 2025-10-17T17:45:01Z

Walkthrough

Adds LogText as a new FileType and implements detection and handling paths for logtext inputs; integrates SIMDJSON into build configuration and links it to the clp_s_io target; JsonParser explicitly rejects direct ingestion of unstructured LogText.

Changes

Cohort / File(s)	Change Summary
Build Configuration `components/core/cmake/Options/options.cmake`, `components/core/src/clp_s/CMakeLists.txt`	Adds `CLP_NEED_SIMDJSON` to CLP_NEED flags and links `simdjson::simdjson` as a `PRIVATE` dependency for `clp_s_io`.
Type Definition `components/core/src/clp_s/InputConfig.hpp`	Adds `LogText` enumerator to the `FileType` enum class.
Input Detection and Handling `components/core/src/clp_s/InputConfig.cpp`	Introduces `could_be_logtext(peek_buf, peek_size)` using UTF-8/byte-pattern checks; integrates detection into `peek_start_and_deduce_type` and `try_deduce_reader_type`; includes `simdjson.h` and UTF-8 utilities.
Ingestion Logic `components/core/src/clp_s/JsonParser.cpp`	Adds `FileType::LogText` case in `JsonParser::ingest` that logs an error, closes the archive writer, and returns `false` to prevent direct ingestion.

Sequence Diagram(s)

sequenceDiagram
  actor User
  participant ReaderDetection as "InputConfig::detection"
  participant ReaderFactory as "try_deduce_reader_type"
  participant Parser as "JsonParser::ingest"
  participant Archive as "ArchiveWriter"

  User->>ReaderDetection: provide peek_buf
  alt could_be_logtext
    ReaderDetection->>ReaderFactory: returns FileType::LogText
    ReaderFactory->>Parser: select LogText reader path
    Parser->>Archive: close()
    Parser-->>User: returns error (false)
  else non-logtext (Json/KV-IR/others)
    ReaderDetection->>ReaderFactory: returns other FileType
    ReaderFactory->>Parser: select appropriate reader
    Parser-->>User: continue ingestion
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check	✅ Passed	The pull request title "feat(clp-s): Add support for detecting unstructured log inputs; Explicitly reject unstructured log inputs during compression" accurately reflects the two primary objectives of the changeset. The changes add logtext detection via UTF-8 validation logic (introducing the could_be_logtext() function and FileType::LogText enum) and explicitly reject detected unstructured logs during compression (via error handling in JsonParser::ingest). The title is clear, specific, and concise—avoiding vague phrasing or noise—and directly maps to the substantive changes across the modified files. A teammate reviewing PR history would immediately understand that this PR introduces a new file type detection capability with corresponding rejection during the compression workflow.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2ed2c39 and 0f5b069.

📒 Files selected for processing (1)

components/core/src/clp_s/InputConfig.cpp (6 hunks)

🧰 Additional context used

📓 Path-based instructions (1)

**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

Prefer false == <expression> rather than !<expression>.

Files:

components/core/src/clp_s/InputConfig.cpp

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)

GitHub Check: ubuntu-jammy-dynamic-linked-bins
GitHub Check: ubuntu-jammy-static-linked-bins
GitHub Check: ubuntu-jammy-lint
GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
GitHub Check: musllinux_1_2-x86_64-static-linked-bins
GitHub Check: package-image
GitHub Check: centos-stream-9-static-linked-bins
GitHub Check: centos-stream-9-dynamic-linked-bins
GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
GitHub Check: manylinux_2_28-x86_64-static-linked-bins
GitHub Check: build-macos (macos-15, true)
GitHub Check: build-macos (macos-14, false)
GitHub Check: rust-checks (macos-15)
GitHub Check: rust-checks (ubuntu-22.04)
GitHub Check: rust-checks (ubuntu-24.04)
GitHub Check: lint-check (ubuntu-24.04)
GitHub Check: lint-check (macos-15)
GitHub Check: build (macos-15)

🔇 Additional comments (5)

components/core/src/clp_s/InputConfig.cpp (5)

10-10: LGTM! Required dependencies properly included.

The new includes are necessary for the logtext detection implementation and are correctly positioned.

Also applies to: 14-14, 26-26

140-146: LGTM! Documentation is clear and consistent.

The function documentation follows the established pattern and clearly describes the purpose.

256-297: LGTM! UTF-8 detection logic is sound and handles edge cases correctly.

The implementation correctly:

Scans at most the last 7 bytes to locate the last complete UTF-8 codepoint (handles worst-case truncation).

Validates the entire buffer up to that codepoint using simdjson.

Returns false when no complete codepoint exists in the scan window (preventing false positives).

Follows the coding guideline (line 292: false == legal_last_byte_index.has_value()).

The algorithm properly handles buffer truncation at UTF-8 boundaries, and the lambda correctly calculates character boundaries for multi-byte sequences.

319-321: LGTM! Logtext detection properly integrated.

The check follows the established pattern and is appropriately placed after the JSON check (more specific formats checked first).

368-368: LGTM! LogText correctly included in terminal format handling.

LogText is appropriately grouped with JSON and KeyValueIr as formats that terminate the reader chain (no further decompression/unwrapping needed).

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 221b064 and 2ed2c39.

📒 Files selected for processing (5)

components/core/cmake/Options/options.cmake (1 hunks)
components/core/src/clp_s/CMakeLists.txt (1 hunks)
components/core/src/clp_s/InputConfig.cpp (6 hunks)
components/core/src/clp_s/InputConfig.hpp (1 hunks)
components/core/src/clp_s/JsonParser.cpp (1 hunks)

🧰 Additional context used

📓 Path-based instructions (1)

**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

Prefer false == <expression> rather than !<expression>.

Files:

components/core/src/clp_s/JsonParser.cpp
components/core/src/clp_s/InputConfig.hpp
components/core/src/clp_s/InputConfig.cpp

🧠 Learnings (3)

📚 Learning: 2024-10-07T21:35:04.362Z

Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-10-07T21:35:04.362Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.

Applied to files:

components/core/src/clp_s/JsonParser.cpp

📚 Learning: 2024-10-07T21:16:41.660Z

Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.

Applied to files:

components/core/src/clp_s/JsonParser.cpp

📚 Learning: 2024-10-08T15:52:50.753Z

Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:629-733
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In the `JsonParser::parse_kv_log_event` method in `components/core/src/clp_s/JsonParser.cpp`, prefer passing exceptions to higher levels for more graceful handling, rather than handling them at that level.

Applied to files:

components/core/src/clp_s/JsonParser.cpp

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)

GitHub Check: package-image
GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
GitHub Check: musllinux_1_2-x86_64-static-linked-bins
GitHub Check: ubuntu-jammy-dynamic-linked-bins
GitHub Check: ubuntu-jammy-static-linked-bins
GitHub Check: ubuntu-jammy-lint
GitHub Check: centos-stream-9-dynamic-linked-bins
GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
GitHub Check: centos-stream-9-static-linked-bins
GitHub Check: manylinux_2_28-x86_64-static-linked-bins
GitHub Check: rust-checks (macos-15)
GitHub Check: rust-checks (ubuntu-24.04)
GitHub Check: rust-checks (ubuntu-22.04)
GitHub Check: build-macos (macos-14, true)
GitHub Check: build-macos (macos-15, true)
GitHub Check: build-macos (macos-15, false)
GitHub Check: build-macos (macos-14, false)
GitHub Check: lint-check (ubuntu-24.04)

🔇 Additional comments (8)

components/core/src/clp_s/InputConfig.hpp (1)

26-26: LGTM!

The addition of LogText to the FileType enum is straightforward and correctly positioned between KeyValueIr and Zstd.

components/core/src/clp_s/CMakeLists.txt (1)

217-217: LGTM!

The private linkage of simdjson::simdjson to the clp_s_io target is appropriate for the new LogText detection functionality that requires UTF-8 validation.

components/core/cmake/Options/options.cmake (1)

273-273: LGTM!

Adding CLP_NEED_SIMDJSON to the dependency flags for CLP_BUILD_CLP_S_IO is consistent with the CMakeLists.txt changes and properly integrates simdjson into the build configuration.

components/core/src/clp_s/JsonParser.cpp (1)

640-646: LGTM!

The error handling for FileType::LogText is correct: it logs a clear error message, properly closes the archive writer, and returns false to reject unsupported unstructured log inputs. This aligns with the PR objective to explicitly reject logtext during compression.

components/core/src/clp_s/InputConfig.cpp (4)

10-10: LGTM!

The new includes (<optional>, <simdjson.h>, utf8_utils.hpp) are correctly added to support the logtext detection functionality.

Also applies to: 14-14, 26-26

140-146: LGTM!

The function declaration and documentation are clear and consistent with the existing could_be_zstd, could_be_kvir, and could_be_json functions.

319-321: LGTM!

The integration of could_be_logtext into the type detection flow is correct, with logtext detection performed after JSON detection and before returning FileType::Unknown.

368-368: LGTM!

The FileType::LogText case is correctly added to the switch statement, allowing detected logtext inputs to return early with the appropriate reader chain.

components/core/src/clp_s/InputConfig.cpp

gibber9809 added 2 commits October 17, 2025 17:31

Add support for detecting unstructured logtext inputs.

578aa1e

Explicitly reject regular logtext as input to compression in clp-s.

2ed2c39

gibber9809 requested review from a team and wraymo as code owners October 17, 2025 17:44

coderabbitai bot reviewed Oct 17, 2025

View reviewed changes

components/core/src/clp_s/InputConfig.cpp Show resolved Hide resolved

Fix typo.

0f5b069

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(clp-s): Add support for detecting unstructured log inputs; Explicitly reject unstructured log inputs during compression. #1434

feat(clp-s): Add support for detecting unstructured log inputs; Explicitly reject unstructured log inputs during compression. #1434

gibber9809 commented Oct 17, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Oct 17, 2025 •

edited

Loading

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

feat(clp-s): Add support for detecting unstructured log inputs; Explicitly reject unstructured log inputs during compression. #1434

Are you sure you want to change the base?

feat(clp-s): Add support for detecting unstructured log inputs; Explicitly reject unstructured log inputs during compression. #1434

Conversation

gibber9809 commented Oct 17, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Checklist

Validation performed

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Oct 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

gibber9809 commented Oct 17, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Oct 17, 2025 •

edited

Loading