Skip to content

Conversation

gibber9809
Copy link
Contributor

@gibber9809 gibber9809 commented Oct 17, 2025

Description

This PR adds support for detecting that an input might be unstructured logs during ingestion, and uses that support to explicitly reject unstructured log inputs during compression. This feature will also be used by the upcoming utility that converts unstructured logs to a structured representation for ingestion into clp-s.

The heuristic for detecting logtext is straightforwards -- we simply check if the first peeked chunk of the input stream corresponds to valid UTF-8, while accounting for the fact that the end of the buffer might contain truncated UTF-8.

Since a UTF-8 codepoint is at most 4 bytes, we know that we must scan at most 7 bytes to find the last complete UTF-8 codepoint in our peeked input. This is because in the worst case the stream terminates with a 4-byte codepoint followed by a 4-byte codepoint with its last byte truncated. The approach, then, is to scan the last 7 bytes of the peeked buffer to find the last complete codepoint and then validate that the entire buffer through that last complete codepoint corresponds to valid UTF-8. If there is not a complete codepoint in the last 7 bytes, we also reject the input without having to validate the whole buffer.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • Manually validated that various combinations of truncated UTF-8 are accepted as logtext
  • Manually validated that various combinations of invalid UTF-8 are not accepted as logtext

Summary by CodeRabbit

  • New Features

    • Added detection and classification of "logtext" input files during ingestion.
  • Bug Fixes / Behaviour

    • When a logtext file is supplied for direct ingestion, the system now fails fast with a clear error and stops the write operation.

@gibber9809 gibber9809 requested review from a team and wraymo as code owners October 17, 2025 17:44
Copy link
Contributor

coderabbitai bot commented Oct 17, 2025

Walkthrough

Adds LogText as a new FileType and implements detection and handling paths for logtext inputs; integrates SIMDJSON into build configuration and links it to the clp_s_io target; JsonParser explicitly rejects direct ingestion of unstructured LogText.

Changes

Cohort / File(s) Change Summary
Build Configuration
components/core/cmake/Options/options.cmake, components/core/src/clp_s/CMakeLists.txt
Adds CLP_NEED_SIMDJSON to CLP_NEED flags and links simdjson::simdjson as a PRIVATE dependency for clp_s_io.
Type Definition
components/core/src/clp_s/InputConfig.hpp
Adds LogText enumerator to the FileType enum class.
Input Detection and Handling
components/core/src/clp_s/InputConfig.cpp
Introduces could_be_logtext(peek_buf, peek_size) using UTF-8/byte-pattern checks; integrates detection into peek_start_and_deduce_type and try_deduce_reader_type; includes simdjson.h and UTF-8 utilities.
Ingestion Logic
components/core/src/clp_s/JsonParser.cpp
Adds FileType::LogText case in JsonParser::ingest that logs an error, closes the archive writer, and returns false to prevent direct ingestion.

Sequence Diagram(s)

sequenceDiagram
  actor User
  participant ReaderDetection as "InputConfig::detection"
  participant ReaderFactory as "try_deduce_reader_type"
  participant Parser as "JsonParser::ingest"
  participant Archive as "ArchiveWriter"

  User->>ReaderDetection: provide peek_buf
  alt could_be_logtext
    ReaderDetection->>ReaderFactory: returns FileType::LogText
    ReaderFactory->>Parser: select LogText reader path
    Parser->>Archive: close()
    Parser-->>User: returns error (false)
  else non-logtext (Json/KV-IR/others)
    ReaderDetection->>ReaderFactory: returns other FileType
    ReaderFactory->>Parser: select appropriate reader
    Parser-->>User: continue ingestion
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The pull request title "feat(clp-s): Add support for detecting unstructured log inputs; Explicitly reject unstructured log inputs during compression" accurately reflects the two primary objectives of the changeset. The changes add logtext detection via UTF-8 validation logic (introducing the could_be_logtext() function and FileType::LogText enum) and explicitly reject detected unstructured logs during compression (via error handling in JsonParser::ingest). The title is clear, specific, and concise—avoiding vague phrasing or noise—and directly maps to the substantive changes across the modified files. A teammate reviewing PR history would immediately understand that this PR introduces a new file type detection capability with corresponding rejection during the compression workflow.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2ed2c39 and 0f5b069.

📒 Files selected for processing (1)
  • components/core/src/clp_s/InputConfig.cpp (6 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/src/clp_s/InputConfig.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: package-image
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: build-macos (macos-15, true)
  • GitHub Check: build-macos (macos-14, false)
  • GitHub Check: rust-checks (macos-15)
  • GitHub Check: rust-checks (ubuntu-22.04)
  • GitHub Check: rust-checks (ubuntu-24.04)
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: build (macos-15)
🔇 Additional comments (5)
components/core/src/clp_s/InputConfig.cpp (5)

10-10: LGTM! Required dependencies properly included.

The new includes are necessary for the logtext detection implementation and are correctly positioned.

Also applies to: 14-14, 26-26


140-146: LGTM! Documentation is clear and consistent.

The function documentation follows the established pattern and clearly describes the purpose.


256-297: LGTM! UTF-8 detection logic is sound and handles edge cases correctly.

The implementation correctly:

  • Scans at most the last 7 bytes to locate the last complete UTF-8 codepoint (handles worst-case truncation).
  • Validates the entire buffer up to that codepoint using simdjson.
  • Returns false when no complete codepoint exists in the scan window (preventing false positives).
  • Follows the coding guideline (line 292: false == legal_last_byte_index.has_value()).

The algorithm properly handles buffer truncation at UTF-8 boundaries, and the lambda correctly calculates character boundaries for multi-byte sequences.


319-321: LGTM! Logtext detection properly integrated.

The check follows the established pattern and is appropriately placed after the JSON check (more specific formats checked first).


368-368: LGTM! LogText correctly included in terminal format handling.

LogText is appropriately grouped with JSON and KeyValueIr as formats that terminate the reader chain (no further decompression/unwrapping needed).


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 221b064 and 2ed2c39.

📒 Files selected for processing (5)
  • components/core/cmake/Options/options.cmake (1 hunks)
  • components/core/src/clp_s/CMakeLists.txt (1 hunks)
  • components/core/src/clp_s/InputConfig.cpp (6 hunks)
  • components/core/src/clp_s/InputConfig.hpp (1 hunks)
  • components/core/src/clp_s/JsonParser.cpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/src/clp_s/JsonParser.cpp
  • components/core/src/clp_s/InputConfig.hpp
  • components/core/src/clp_s/InputConfig.cpp
🧠 Learnings (3)
📚 Learning: 2024-10-07T21:35:04.362Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-10-07T21:35:04.362Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.

Applied to files:

  • components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-10-07T21:16:41.660Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.

Applied to files:

  • components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:629-733
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In the `JsonParser::parse_kv_log_event` method in `components/core/src/clp_s/JsonParser.cpp`, prefer passing exceptions to higher levels for more graceful handling, rather than handling them at that level.

Applied to files:

  • components/core/src/clp_s/JsonParser.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)
  • GitHub Check: package-image
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: rust-checks (macos-15)
  • GitHub Check: rust-checks (ubuntu-24.04)
  • GitHub Check: rust-checks (ubuntu-22.04)
  • GitHub Check: build-macos (macos-14, true)
  • GitHub Check: build-macos (macos-15, true)
  • GitHub Check: build-macos (macos-15, false)
  • GitHub Check: build-macos (macos-14, false)
  • GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (8)
components/core/src/clp_s/InputConfig.hpp (1)

26-26: LGTM!

The addition of LogText to the FileType enum is straightforward and correctly positioned between KeyValueIr and Zstd.

components/core/src/clp_s/CMakeLists.txt (1)

217-217: LGTM!

The private linkage of simdjson::simdjson to the clp_s_io target is appropriate for the new LogText detection functionality that requires UTF-8 validation.

components/core/cmake/Options/options.cmake (1)

273-273: LGTM!

Adding CLP_NEED_SIMDJSON to the dependency flags for CLP_BUILD_CLP_S_IO is consistent with the CMakeLists.txt changes and properly integrates simdjson into the build configuration.

components/core/src/clp_s/JsonParser.cpp (1)

640-646: LGTM!

The error handling for FileType::LogText is correct: it logs a clear error message, properly closes the archive writer, and returns false to reject unsupported unstructured log inputs. This aligns with the PR objective to explicitly reject logtext during compression.

components/core/src/clp_s/InputConfig.cpp (4)

10-10: LGTM!

The new includes (<optional>, <simdjson.h>, utf8_utils.hpp) are correctly added to support the logtext detection functionality.

Also applies to: 14-14, 26-26


140-146: LGTM!

The function declaration and documentation are clear and consistent with the existing could_be_zstd, could_be_kvir, and could_be_json functions.


319-321: LGTM!

The integration of could_be_logtext into the type detection flow is correct, with logtext detection performed after JSON detection and before returning FileType::Unknown.


368-368: LGTM!

The FileType::LogText case is correctly added to the switch statement, allowing detected logtext inputs to return early with the appropriate reader chain.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant