Skip to content

Conversation

davidlion
Copy link
Member

@davidlion davidlion commented Aug 28, 2025

Description

Previously, when using log surgeon for parsing the full match of a schema rule's regex pattern would be stored as a variable in CLP. This created differences from the heuristic parser's behaviour for certain cases.

For example, the heuristic's "equals" rule can be represented with the regex pattern .*=(?<var>.*[a-zA-Z0-9].*). The heuristic parser will only store the var capture group as a variable (storing the prefix .*= as static text). When using log surgeon without capture groups this behaviour was not possible as we would store the full match (including the prefix .*=) as a variable.

This PR allows schema rules to contain up to 1 capture group. If a capture group is present only the capture's match will be stored as a variable and anything surrounding it will be stored as static text. In the case where the capture is repeated (e.g. text(?<var>variable)+text)) all repetitions will be stored together as a single variable.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Added unit tests for schema creation with single or multiple captures.
TODO Generated archives and matched their contents.

Summary by CodeRabbit

  • New Features

    • Extract variables from a single named capture group within a token while preserving surrounding text.
    • Schema pattern updated to expose the value after '=' as a named capture for easier access.
    • Encoding now supports decomposing tokens into constant parts and a captured variable for more compact storage.
  • Bug Fixes

    • Clear validation error when a schema rule contains more than one capture group, including file and line context.
    • Improved robustness and error handling during message encoding and timestamp pattern detection.
  • Tests

    • Added tests for valid single-capture schemas and for rejecting multiple-capture schemas.

@davidlion davidlion requested a review from a team as a code owner August 28, 2025 15:37
Copy link
Contributor

coderabbitai bot commented Aug 28, 2025

Walkthrough

Adds runtime validation to schema loading enforcing ≤1 regex capture per schema variable, extends Archive writer to decompose tokens by capture groups and track encoded variables (and updates LogEventView qualification and timestamp handling), and adds tests and schema fixtures for single vs. multiple capture-group cases.

Changes

Cohort / File(s) Summary
Schema validation
components/core/src/clp/Utils.cpp
Adds #include <string> and inserts runtime checks during schema loading (in both schema_vars loops) to ensure each rule's regex has at most one positive capture; throws std::runtime_error containing schema path, rule line, rule name and capture count when >1. Keeps existing timestamp rule skip.
Archive writer — capture-aware token encoding & timestamp plumbing
components/core/src/clp/streaming_archive/writer/Archive.cpp
Adds explicit includes, changes API to accept log_surgeon::LogEventView const&, passes start/end by reference into timestamp pattern search and throws if none found, tightens end-position handling, modernizes types/initialization, and adds logic to split tokens when capture groups exist: emit pre/post constants, add capture as encoded dictionary variable, track m_var_ids/m_encoded_vars, and throw on missing register IDs; preserves original behavior if no captures.
Tests — parser/schema validation
components/core/tests/test-ParserWithUserSchema.cpp
Adds tests: asserts single named capture is detected and named correctly; asserts load_lexer_from_file throws for a rule with multiple capture groups and includes file path and rule name in the error message.
Test fixtures — schema examples
components/core/tests/test_schema_files/single_capture_group.txt, components/core/tests/test_schema_files/multiple_capture_groups.txt
Adds schema examples: single named capture (capture:text(?<group>var)text) and multicapture (multicapture:text(?<group0>var0)text(?<group1>var1)text) with delimiters: \r\n line to exercise validation.
Config schemas update
components/core/config/schemas.txt
Changes equals pattern from equals:.*=.*[a-zA-Z0-9].* to equals:.*=(?<var>.*[a-zA-Z0-9].*), introducing a named capture var for the RHS value.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant App as Loader
  participant Utils as Utils::load_lexer_from_file
  participant Rule as Rule (regex)
  note over Utils,Rule: Schema load and capture validation (runs in both schema_vars loops)
  App->>Utils: load_lexer_from_file(schema_path)
  loop For each schema_var rule
    Utils->>Rule: get_subtree_positive_captures()
    alt > 1 capture
      Utils-->>App: throw runtime_error(file, line, rule_name, capture_count)
    else <= 1 capture
      Utils-->>App: continue processing
    end
  end
Loading
sequenceDiagram
  autonumber
  participant Caller as In
  participant Arch as Archive::write_msg_using_schema
  participant TS as TimestampPattern
  participant Dict as Dictionaries/Encoders
  Caller->>Arch: write_msg_using_schema(log_surgeon::LogEventView)
  Arch->>TS: search_known_ts_patterns(buf, start&, end&)
  TS-->>Arch: start/end positions (or none -> error)
  loop For each token
    alt Token has capture group
      Arch->>Dict: add constant (pre-capture)
      Arch->>Dict: add encoded variable (capture) and record var id
      Arch->>Dict: add constant (post-capture)
    else No capture
      Arch->>Dict: add variable (entire token)
    end
  end
  Arch-->>Caller: write complete
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title succinctly and accurately summarizes the primary change — adding support for a single capture group in schema rules to achieve parity with the heuristic parser — and includes scope and a breaking-change marker; it is specific, concise, and clearly conveys the main intent for reviewers scanning history.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between fbd5a12 and a7a0be1.

📒 Files selected for processing (5)
  • components/core/src/clp/Utils.cpp (2 hunks)
  • components/core/src/clp/streaming_archive/writer/Archive.cpp (7 hunks)
  • components/core/tests/test-ParserWithUserSchema.cpp (1 hunks)
  • components/core/tests/test_schema_files/multiple_capture_groups.txt (1 hunks)
  • components/core/tests/test_schema_files/single_capture_group.txt (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/tests/test-ParserWithUserSchema.cpp
  • components/core/src/clp/Utils.cpp
  • components/core/src/clp/streaming_archive/writer/Archive.cpp
🧠 Learnings (1)
📚 Learning: 2024-10-14T03:42:10.355Z
Learnt from: LinZhihao-723
PR: y-scope/clp#558
File: components/core/tests/test-ffi_KeyValuePairLogEvent.cpp:14-14
Timestamp: 2024-10-14T03:42:10.355Z
Learning: In the file `components/core/tests/test-ffi_KeyValuePairLogEvent.cpp`, including `<json/single_include/nlohmann/json.hpp>` is consistent with the project's coding standards.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
🧬 Code graph analysis (2)
components/core/tests/test-ParserWithUserSchema.cpp (2)
components/core/src/clp/Utils.cpp (2)
  • load_lexer_from_file (125-254)
  • load_lexer_from_file (126-126)
components/core/src/clp/Utils.hpp (1)
  • load_lexer_from_file (52-55)
components/core/src/clp/streaming_archive/writer/Archive.cpp (3)
components/core/src/clp/streaming_archive/writer/Archive.hpp (2)
  • timestamp (146-146)
  • logtype_id (229-232)
components/core/src/clp/streaming_archive/writer/File.hpp (1)
  • timestamp (89-95)
components/core/src/clp/EncodedVariableInterpreter.hpp (3)
  • encoded_var (52-52)
  • encoded_var (111-111)
  • id (51-51)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: antlr-code-committed (macos-15)
  • GitHub Check: lint-check (macos-15)
🔇 Additional comments (8)
components/core/src/clp/Utils.cpp (1)

11-11: Header addition looks good.

Adding resolves downstream std::string usage here.

components/core/src/clp/streaming_archive/writer/Archive.cpp (4)

362-368: Potential off-by-one in uncompressed byte count (verify semantics).

If m_end_pos is exclusive, end_pos - start_pos is correct; if inclusive, add +1. Please verify Token.m_end_pos semantics, especially across buffer wrap.

Consider a focused unit test with a two-token message exercising boundary conditions (start_pos == end_pos and wrap).


371-382: Good: token_type made const and delimiter handling left intact.

No issues spotted; aligns with existing flow.


485-496: OK: zero-initialised logtype_id and downstream writes.

This aligns with safer initialisation and existing dictionary API.


317-341: No signature mismatch found
Header and implementation both declare write_msg_using_schema(log_surgeon::LogEventView const&); no action required.

components/core/tests/test_schema_files/single_capture_group.txt (1)

1-1: Fixture is minimal and appropriate.

Covers the intended single-capture scenario with surrounding literals.

components/core/tests/test_schema_files/multiple_capture_groups.txt (1)

1-1: Good negative fixture.

Triggers the >1 capture validation path as desired.

components/core/tests/test-ParserWithUserSchema.cpp (1)

195-204: Exact error assertion is OK; keep in sync if message changes.

If you accept the optional richer error in Utils.cpp, adjust this expectation accordingly (or match a substring).

Comment on lines 423 to 440
default: {
// Variable string looks like a dictionary variable, so encode it as so
encoded_variable_t encoded_var;
variable_dictionary_id_t id;
m_var_dict.add_entry(token.to_string(), id);
encoded_var = EncodedVariableInterpreter::encode_var_dict_id(id);
m_var_ids.push_back(id);
auto const& lexer{log_view.get_log_parser().m_lexer};
auto capture_ids{lexer.get_capture_ids_from_rule_id(token_type)};

// If the variable token contains capture groups, we break the token up by storing
// each capture as a variable and any substrings surrounding the capture as part of
// the logtype. If there are no capture groups the entire variable token is stored
// as a variable.
if (false == capture_ids.has_value()) {
variable_dictionary_id_t id{};
m_var_dict.add_entry(token.to_string(), id);
m_var_ids.push_back(id);
m_encoded_vars.push_back(EncodedVariableInterpreter::encode_var_dict_id(id));
m_logtype_dict_entry.add_dictionary_var();

break;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Guard against empty capture-id vector.

has_value() can hold an empty vector; at(0) would throw. Treat empty as “no capture.”

Apply:

-                if (false == capture_ids.has_value()) {
+                if (false == capture_ids.has_value() || capture_ids->empty()) {
                     variable_dictionary_id_t id{};
                     m_var_dict.add_entry(token.to_string(), id);
                     m_var_ids.push_back(id);
                     m_encoded_vars.push_back(EncodedVariableInterpreter::encode_var_dict_id(id));
                     m_logtype_dict_entry.add_dictionary_var();
                     break;
                 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
default: {
// Variable string looks like a dictionary variable, so encode it as so
encoded_variable_t encoded_var;
variable_dictionary_id_t id;
m_var_dict.add_entry(token.to_string(), id);
encoded_var = EncodedVariableInterpreter::encode_var_dict_id(id);
m_var_ids.push_back(id);
auto const& lexer{log_view.get_log_parser().m_lexer};
auto capture_ids{lexer.get_capture_ids_from_rule_id(token_type)};
// If the variable token contains capture groups, we break the token up by storing
// each capture as a variable and any substrings surrounding the capture as part of
// the logtype. If there are no capture groups the entire variable token is stored
// as a variable.
if (false == capture_ids.has_value()) {
variable_dictionary_id_t id{};
m_var_dict.add_entry(token.to_string(), id);
m_var_ids.push_back(id);
m_encoded_vars.push_back(EncodedVariableInterpreter::encode_var_dict_id(id));
m_logtype_dict_entry.add_dictionary_var();
break;
}
default: {
auto const& lexer{log_view.get_log_parser().m_lexer};
auto capture_ids{lexer.get_capture_ids_from_rule_id(token_type)};
// If the variable token contains capture groups, we break the token up by storing
// each capture as a variable and any substrings surrounding the capture as part of
// the logtype. If there are no capture groups the entire variable token is stored
// as a variable.
if (false == capture_ids.has_value() || capture_ids->empty()) {
variable_dictionary_id_t id{};
m_var_dict.add_entry(token.to_string(), id);
m_var_ids.push_back(id);
m_encoded_vars.push_back(EncodedVariableInterpreter::encode_var_dict_id(id));
m_logtype_dict_entry.add_dictionary_var();
break;
}
// … rest of processing when capture_ids has at least one element …
🤖 Prompt for AI Agents
In components/core/src/clp/streaming_archive/writer/Archive.cpp around lines
423-440, the code treats capture_ids.has_value() as sufficient but an empty
vector is possible and later at(0) would throw; update the guard to treat an
empty capture-id vector as “no capture” by checking both has_value() and
!capture_ids->empty() (e.g., if (!capture_ids.has_value() ||
capture_ids->empty()) handle as the no-capture path), and ensure subsequent uses
of capture_ids access elements only after this emptiness check.

Comment on lines 183 to 193
TEST_CASE("Test schema with single capture group", "[load_lexer]") {
std::string const schema_path{"../tests/test_schema_files/single_capture_group.txt"};
ByteLexer lexer;
load_lexer_from_file(schema_path, lexer);

auto const rule_id{lexer.m_symbol_id.at("capture")};
auto const capture_ids{lexer.get_capture_ids_from_rule_id(rule_id)};
REQUIRE(capture_ids.has_value());
REQUIRE(1 == capture_ids.value().size());
REQUIRE("group" == lexer.m_id_symbol.at(capture_ids->at(0)));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick (assertive)

Solid positive test; consider adding a repetition case.

This verifies single capture presence and naming. Add a case where the capture is repeated (e.g., (?a)+) to assert aggregation semantics end-to-end.

I can draft that test using the existing ByteLexer and a small sample line.

🤖 Prompt for AI Agents
In components/core/tests/test-ParserWithUserSchema.cpp around lines 183-193, add
a new TEST_CASE that mirrors the single-capture test but uses a schema with a
repeated named capture (e.g., (?<g>a)+); load that schema into ByteLexer,
retrieve the rule and capture ids and assert the capture id exists and maps to
"g", then run the lexer against a sample input like "aaa" and assert the lexer
returns multiple aggregated capture values for "g" (e.g., three captured "a"
entries) to validate repetition aggregation end-to-end.

Comment on lines 427 to 430
// If the variable token contains capture groups, we break the token up by storing
// each capture as a variable and any substrings surrounding the capture as part of
// the logtype. If there are no capture groups the entire variable token is stored
// as a variable.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flip the order of the sentences in the docstring to match the order of the cases in the code? I.e., the no capture case first, then the capture case.

Comment on lines 449 to 450
auto const capture_start{token.get_reversed_reg_positions(start_reg_id).front()};
auto const capture_end{token.get_reversed_reg_positions(end_reg_id).back()};
Copy link
Contributor

@SharafMohamed SharafMohamed Sep 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming the behavior we want is to treat multi-valued captures as a single variable. In that case the correct usage should be .back() for the start capture and .front() for the end capture (since the reg positions are reversed).

Ideally we'd have a unit-test for this case, but I think multi-valued captures are currently a log-surgeon issue.

auto const capture_start{token.get_reversed_reg_positions(start_reg_id).front()};
auto const capture_end{token.get_reversed_reg_positions(end_reg_id).back()};
auto const token_view{token.to_string_view()};
size_t token_pos{0};
Copy link
Contributor

@SharafMohamed SharafMohamed Sep 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it needed to track token_pos. Aren't the segments [0, capture_start), [capture_start, capture_end) and [capture_end, token_view.length())?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps the following is the most direct (and most log-surgeon independent) implementation. Manipulate token and call to_string_view as needed:

  • Store token_end_pos = token.m_end_pos
  • For the prefix static-text: Set token.m_end_pos= capture_start
  • For the variable itself: Set token.m_start_pos= capture_start and token.m_end_pos= capture_end
  • For the suffix static-text: Set token.m_start_pos= capture_end and token.m_end_pos= token_end_pos

This way CLP does not care about how the tokens wrap around is handled internally. Also nice for future-proofing in case LS changes its behavior.

We should also unit-test this if possible.

Copy link
Contributor

@SharafMohamed SharafMohamed Sep 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also: for clarity, maybe format the code such that the prefix is a block with a newline after it, the variable is a block with a newline after it, and the suffix is a block with a newline after it.

@@ -0,0 +1 @@
multicapture:text(?<group0>var0)text(?<group1>var1)text
Copy link
Contributor

@SharafMohamed SharafMohamed Sep 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically lexers can exist with the schema as is, so I'm not entirely sure about this comment. But, if any LogParser object is created with this schema it will error as it lacks delimiters. So this schema file is malformed according to Log Surgeon.

@@ -0,0 +1 @@
capture:text(?<group>var)text
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
components/core/src/clp/streaming_archive/writer/Archive.cpp (2)

362-366: Guard against empty buffer when computing end_pos.

pos() − 1 will underflow if the buffer is empty. Unlikely, but add a defensive check/assert.

Would you add a precondition (e.g., assert(log_output_buffer->pos() > 0)) before using pos() − 1?


391-405: Fix: track var_ids for int/float dictionary fallbacks and follow negation style.

When integer/float cannot be encoded, you add a dictionary entry but don’t push the var ID into m_var_ids, which breaks segment indexing. Also, prefer “false == …” per repo style.

-                encoded_variable_t encoded_var{};
-                if (!EncodedVariableInterpreter::convert_string_to_representable_integer_var(
+                encoded_variable_t encoded_var{};
+                if (false == EncodedVariableInterpreter::convert_string_to_representable_integer_var(
                             token.to_string(),
                             encoded_var
                     ))
                 {
                     variable_dictionary_id_t id{};
                     m_var_dict.add_entry(token.to_string(), id);
+                    m_var_ids.push_back(id);
                     encoded_var = EncodedVariableInterpreter::encode_var_dict_id(id);
                     m_logtype_dict_entry.add_dictionary_var();
                 } else {
                     m_logtype_dict_entry.add_int_var();
                 }
                 m_encoded_vars.push_back(encoded_var);
@@
-                encoded_variable_t encoded_var{};
-                if (!EncodedVariableInterpreter::convert_string_to_representable_float_var(
+                encoded_variable_t encoded_var{};
+                if (false == EncodedVariableInterpreter::convert_string_to_representable_float_var(
                             token.to_string(),
                             encoded_var
                     ))
                 {
                     variable_dictionary_id_t id{};
                     m_var_dict.add_entry(token.to_string(), id);
+                    m_var_ids.push_back(id);
                     encoded_var = EncodedVariableInterpreter::encode_var_dict_id(id);
                     m_logtype_dict_entry.add_dictionary_var();
                 } else {
                     m_logtype_dict_entry.add_float_var();
                 }
                 m_encoded_vars.push_back(encoded_var);

Also applies to: 407-421

♻️ Duplicate comments (2)
components/core/src/clp/streaming_archive/writer/Archive.cpp (2)

431-440: Also treat empty capture-id vectors as “no capture.”

has_value() can hold an empty vector; later at(0) would throw. Handle empty as the no‑capture path.

-                auto capture_ids{lexer.get_capture_ids_from_rule_id(token_type)};
-                if (false == capture_ids.has_value()) {
+                auto capture_ids{lexer.get_capture_ids_from_rule_id(token_type)};
+                if (false == capture_ids.has_value() || capture_ids->empty()) {
                     variable_dictionary_id_t id{};
                     m_var_dict.add_entry(token.to_string(), id);
                     m_var_ids.push_back(id);
                     m_encoded_vars.push_back(EncodedVariableInterpreter::encode_var_dict_id(id));
                     m_logtype_dict_entry.add_dictionary_var();
 
                     break;
                 }

442-454: Validate register positions before using front()/back().

get_reversed_reg_positions(...) may return empty; front()/back() would be UB. Throw a clear error if empty.

-                auto const [start_reg_id, end_reg_id]{register_ids.value()};
-                auto const capture_start{token.get_reversed_reg_positions(start_reg_id).back()};
-                auto const capture_end{token.get_reversed_reg_positions(end_reg_id).front()};
+                auto const [start_reg_id, end_reg_id]{register_ids.value()};
+                auto const& start_positions = token.get_reversed_reg_positions(start_reg_id);
+                auto const& end_positions = token.get_reversed_reg_positions(end_reg_id);
+                if (start_positions.empty() || end_positions.empty()) {
+                    throw(std::runtime_error(
+                            "Empty register positions for variable's capture group. Full token: "
+                            + token.to_string()
+                    ));
+                }
+                auto const capture_start{start_positions.back()};
+                auto const capture_end{end_positions.front()};
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dc591cd and 03a1f02.

📒 Files selected for processing (2)
  • components/core/src/clp/Utils.cpp (2 hunks)
  • components/core/src/clp/streaming_archive/writer/Archive.cpp (7 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/src/clp/Utils.cpp
  • components/core/src/clp/streaming_archive/writer/Archive.cpp
🧠 Learnings (1)
📚 Learning: 2024-10-14T03:42:10.355Z
Learnt from: LinZhihao-723
PR: y-scope/clp#558
File: components/core/tests/test-ffi_KeyValuePairLogEvent.cpp:14-14
Timestamp: 2024-10-14T03:42:10.355Z
Learning: In the file `components/core/tests/test-ffi_KeyValuePairLogEvent.cpp`, including `<json/single_include/nlohmann/json.hpp>` is consistent with the project's coding standards.

Applied to files:

  • components/core/src/clp/streaming_archive/writer/Archive.cpp
🧬 Code graph analysis (1)
components/core/src/clp/streaming_archive/writer/Archive.cpp (3)
components/core/src/clp/streaming_archive/writer/Archive.hpp (2)
  • timestamp (146-146)
  • logtype_id (229-232)
components/core/src/clp/streaming_archive/writer/File.hpp (1)
  • timestamp (89-95)
components/core/src/clp/EncodedVariableInterpreter.cpp (6)
  • convert_string_to_representable_integer_var (24-61)
  • convert_string_to_representable_integer_var (24-27)
  • convert_string_to_representable_float_var (63-142)
  • convert_string_to_representable_float_var (63-66)
  • encode_var_dict_id (199-201)
  • encode_var_dict_id (199-199)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: antlr-code-committed (macos-15)
🔇 Additional comments (3)
components/core/src/clp/Utils.cpp (1)

11-11: LGTM: include is appropriate.

Needed for the new error message construction.

components/core/src/clp/streaming_archive/writer/Archive.cpp (2)

5-10: LGTM: header additions are appropriate.

These headers match the new usage and remove transitive‑include reliance.

Also applies to: 15-15


479-491: LGTM: correct style and write path.

Style matches guideline (false == …) and the write/update path is consistent.

@@ -369,7 +368,7 @@ void Archive::write_msg_using_schema(LogEventView const& log_view) {
}
for (uint32_t i = 1; i < log_output_buffer->pos(); i++) {
log_surgeon::Token& token = log_output_buffer->get_mutable_token(i);
int token_type = token.m_type_ids_ptr->at(0);
auto const token_type{token.m_type_ids_ptr->at(0)};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick

Sanity‑check token type vector.

at(0) assumes m_type_ids_ptr is non‑null and non‑empty; add a brief guard/diagnostic.

Would you assert that token.m_type_ids_ptr != nullptr and token.m_type_ids_ptr->empty() == false before use?

🤖 Prompt for AI Agents
In components/core/src/clp/streaming_archive/writer/Archive.cpp around line 371,
the code calls token.m_type_ids_ptr->at(0) without checking that m_type_ids_ptr
is non-null and that the vector is non-empty; add a guard/diagnostic by
asserting token.m_type_ids_ptr != nullptr and !token.m_type_ids_ptr->empty() (or
using an explicit if that logs/error-handles and returns) immediately before
this line so callers get a clear diagnostic instead of UB, and include a brief
message describing the token/stream context in the assertion/log.

Comment on lines +192 to +203
// Currently, we only support at most a single capture group in each variable. If a capture
// group is present its match will be treated as the variable rather than the full match.
auto const num_captures = rule->m_regex_ptr->get_subtree_positive_captures().size();
if (1 < num_captures) {
throw std::runtime_error(
schema_file_path + ":" + std::to_string(rule->m_line_num + 1)
+ ": error: the schema rule '" + rule->m_name
+ "' has a regex pattern containing > 1 capture groups (found "
+ std::to_string(num_captures) + ").\n"
);
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

🧩 Analysis chain

Don’t enforce single‑capture on the 'timestamp' rule (unless intended) + tweak wording.

The validation block currently runs before the 'timestamp' early‑continue, changing behaviour for timestamp rules. If timestamps are exempt (as before), move the check below the 'timestamp' guard. Also, fix the error text (“more than one capture group”).

-        // Currently, we only support at most a single capture group in each variable. If a capture
-        // group is present its match will be treated as the variable rather than the full match.
-        auto const num_captures = rule->m_regex_ptr->get_subtree_positive_captures().size();
-        if (1 < num_captures) {
-            throw std::runtime_error(
-                    schema_file_path + ":" + std::to_string(rule->m_line_num + 1)
-                    + ": error: the schema rule '" + rule->m_name
-                    + "' has a regex pattern containing > 1 capture groups (found "
-                    + std::to_string(num_captures) + ").\n"
-            );
-        }
-
         if ("timestamp" == rule->m_name) {
             continue;
         }
+
+        // Currently, we only support at most a single capture group in each variable. If a capture
+        // group is present its match will be treated as the variable rather than the full match.
+        auto const num_captures = rule->m_regex_ptr->get_subtree_positive_captures().size();
+        if (1 < num_captures) {
+            throw std::runtime_error(
+                    schema_file_path + ":" + std::to_string(rule->m_line_num + 1)
+                    + ": error: the schema rule '" + rule->m_name
+                    + "' has a regex pattern containing more than one capture group (found "
+                    + std::to_string(num_captures) + ").\n"
+            );
+        }

Also applies to: 204-207


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Searching repository for occurrences of get_subtree_positive_captures..."
rg -n -C3 -S 'get_subtree_positive_captures' || true

echo
echo "Searching repository for occurrences of the timestamp rule guard (\"timestamp\" == rule->m_name)..."
rg -n -C5 -S '"timestamp"\s*==\s*rule->m_name' || true

Length of output: 1696


Move single-capture validation below the 'timestamp' guard and fix error wording.

The capture-count check currently runs before the 'timestamp' early-continue, changing behaviour for timestamp rules — move it after the guard and change the error text to "more than one capture group".

File: components/core/src/clp/Utils.cpp

-        // Currently, we only support at most a single capture group in each variable. If a capture
-        // group is present its match will be treated as the variable rather than the full match.
-        auto const num_captures = rule->m_regex_ptr->get_subtree_positive_captures().size();
-        if (1 < num_captures) {
-            throw std::runtime_error(
-                    schema_file_path + ":" + std::to_string(rule->m_line_num + 1)
-                    + ": error: the schema rule '" + rule->m_name
-                    + "' has a regex pattern containing > 1 capture groups (found "
-                    + std::to_string(num_captures) + ").\n"
-            );
-        }
-
         if ("timestamp" == rule->m_name) {
             continue;
         }
+
+        // Currently, we only support at most a single capture group in each variable. If a capture
+        // group is present its match will be treated as the variable rather than the full match.
+        auto const num_captures = rule->m_regex_ptr->get_subtree_positive_captures().size();
+        if (1 < num_captures) {
+            throw std::runtime_error(
+                    schema_file_path + ":" + std::to_string(rule->m_line_num + 1)
+                    + ": error: the schema rule '" + rule->m_name
+                    + "' has a regex pattern containing more than one capture group (found "
+                    + std::to_string(num_captures) + ").\n"
+            );
+        }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Currently, we only support at most a single capture group in each variable. If a capture
// group is present its match will be treated as the variable rather than the full match.
auto const num_captures = rule->m_regex_ptr->get_subtree_positive_captures().size();
if (1 < num_captures) {
throw std::runtime_error(
schema_file_path + ":" + std::to_string(rule->m_line_num + 1)
+ ": error: the schema rule '" + rule->m_name
+ "' has a regex pattern containing > 1 capture groups (found "
+ std::to_string(num_captures) + ").\n"
);
}
if ("timestamp" == rule->m_name) {
continue;
}
// Currently, we only support at most a single capture group in each variable. If a capture
// group is present its match will be treated as the variable rather than the full match.
auto const num_captures = rule->m_regex_ptr->get_subtree_positive_captures().size();
if (1 < num_captures) {
throw std::runtime_error(
schema_file_path + ":" + std::to_string(rule->m_line_num + 1)
": error: the schema rule '" + rule->m_name
"' has a regex pattern containing more than one capture group (found "
std::to_string(num_captures) + ").\n"
);
}
🤖 Prompt for AI Agents
In components/core/src/clp/Utils.cpp around lines 192–203, the check that throws
when a regex has more than one capture group must be moved to after the
early-continue/guard that handles 'timestamp' rules so timestamp rules are not
affected; relocate the entire num_captures check so it runs only when the
timestamp branch has been bypassed, and change the thrown error text to read
"... has a regex pattern with more than one capture group (found X)." keeping
the same numeric check and including schema_file_path, line number and rule name
as before.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants