-
Notifications
You must be signed in to change notification settings - Fork 82
feat(clp-s): Add support for detecting unstructured log inputs; Explicitly reject unstructured log inputs during compression. #1434
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
WalkthroughAdds LogText as a new FileType and implements detection and handling paths for logtext inputs; integrates SIMDJSON into build configuration and links it to the clp_s_io target; JsonParser explicitly rejects direct ingestion of unstructured LogText. Changes
Sequence Diagram(s)sequenceDiagram
actor User
participant ReaderDetection as "InputConfig::detection"
participant ReaderFactory as "try_deduce_reader_type"
participant Parser as "JsonParser::ingest"
participant Archive as "ArchiveWriter"
User->>ReaderDetection: provide peek_buf
alt could_be_logtext
ReaderDetection->>ReaderFactory: returns FileType::LogText
ReaderFactory->>Parser: select LogText reader path
Parser->>Archive: close()
Parser-->>User: returns error (false)
else non-logtext (Json/KV-IR/others)
ReaderDetection->>ReaderFactory: returns other FileType
ReaderFactory->>Parser: select appropriate reader
Parser-->>User: continue ingestion
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
📜 Recent review detailsConfiguration used: CodeRabbit UI Review profile: ASSERTIVE Plan: Pro 📒 Files selected for processing (1)
🧰 Additional context used📓 Path-based instructions (1)**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}⚙️ CodeRabbit configuration file
Files:
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)
🔇 Additional comments (5)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (5)
components/core/cmake/Options/options.cmake
(1 hunks)components/core/src/clp_s/CMakeLists.txt
(1 hunks)components/core/src/clp_s/InputConfig.cpp
(6 hunks)components/core/src/clp_s/InputConfig.hpp
(1 hunks)components/core/src/clp_s/JsonParser.cpp
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer
false == <expression>
rather than!<expression>
.
Files:
components/core/src/clp_s/JsonParser.cpp
components/core/src/clp_s/InputConfig.hpp
components/core/src/clp_s/InputConfig.cpp
🧠 Learnings (3)
📚 Learning: 2024-10-07T21:35:04.362Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-10-07T21:35:04.362Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.
Applied to files:
components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-10-07T21:16:41.660Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.
Applied to files:
components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:629-733
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In the `JsonParser::parse_kv_log_event` method in `components/core/src/clp_s/JsonParser.cpp`, prefer passing exceptions to higher levels for more graceful handling, rather than handling them at that level.
Applied to files:
components/core/src/clp_s/JsonParser.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)
- GitHub Check: package-image
- GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
- GitHub Check: musllinux_1_2-x86_64-static-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: manylinux_2_28-x86_64-static-linked-bins
- GitHub Check: rust-checks (macos-15)
- GitHub Check: rust-checks (ubuntu-24.04)
- GitHub Check: rust-checks (ubuntu-22.04)
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: build-macos (macos-15, false)
- GitHub Check: build-macos (macos-14, false)
- GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (8)
components/core/src/clp_s/InputConfig.hpp (1)
26-26
: LGTM!The addition of
LogText
to theFileType
enum is straightforward and correctly positioned betweenKeyValueIr
andZstd
.components/core/src/clp_s/CMakeLists.txt (1)
217-217
: LGTM!The private linkage of
simdjson::simdjson
to theclp_s_io
target is appropriate for the new LogText detection functionality that requires UTF-8 validation.components/core/cmake/Options/options.cmake (1)
273-273
: LGTM!Adding
CLP_NEED_SIMDJSON
to the dependency flags forCLP_BUILD_CLP_S_IO
is consistent with the CMakeLists.txt changes and properly integrates simdjson into the build configuration.components/core/src/clp_s/JsonParser.cpp (1)
640-646
: LGTM!The error handling for
FileType::LogText
is correct: it logs a clear error message, properly closes the archive writer, and returns false to reject unsupported unstructured log inputs. This aligns with the PR objective to explicitly reject logtext during compression.components/core/src/clp_s/InputConfig.cpp (4)
10-10
: LGTM!The new includes (
<optional>
,<simdjson.h>
,utf8_utils.hpp
) are correctly added to support the logtext detection functionality.Also applies to: 14-14, 26-26
140-146
: LGTM!The function declaration and documentation are clear and consistent with the existing
could_be_zstd
,could_be_kvir
, andcould_be_json
functions.
319-321
: LGTM!The integration of
could_be_logtext
into the type detection flow is correct, with logtext detection performed after JSON detection and before returningFileType::Unknown
.
368-368
: LGTM!The
FileType::LogText
case is correctly added to the switch statement, allowing detected logtext inputs to return early with the appropriate reader chain.
Description
This PR adds support for detecting that an input might be unstructured logs during ingestion, and uses that support to explicitly reject unstructured log inputs during compression. This feature will also be used by the upcoming utility that converts unstructured logs to a structured representation for ingestion into clp-s.
The heuristic for detecting logtext is straightforwards -- we simply check if the first peeked chunk of the input stream corresponds to valid UTF-8, while accounting for the fact that the end of the buffer might contain truncated UTF-8.
Since a UTF-8 codepoint is at most 4 bytes, we know that we must scan at most 7 bytes to find the last complete UTF-8 codepoint in our peeked input. This is because in the worst case the stream terminates with a 4-byte codepoint followed by a 4-byte codepoint with its last byte truncated. The approach, then, is to scan the last 7 bytes of the peeked buffer to find the last complete codepoint and then validate that the entire buffer through that last complete codepoint corresponds to valid UTF-8. If there is not a complete codepoint in the last 7 bytes, we also reject the input without having to validate the whole buffer.
Checklist
breaking change.
Validation performed
Summary by CodeRabbit
New Features
Bug Fixes / Behaviour