feat(log-surgeon)!: Add support for a single capture group in a schema rule to have parity with the heuristic parser. #1273

davidlion · 2025-08-28T15:37:32Z

Description

Previously, when using log surgeon for parsing the full match of a schema rule's regex pattern would be stored as a variable in CLP. This created differences from the heuristic parser's behaviour for certain cases.

For example, the heuristic's "equals" rule can be represented with the regex pattern .*=(?<var>.*[a-zA-Z0-9].*). The heuristic parser will only store the var capture group as a variable (storing the prefix .*= as static text). When using log surgeon without capture groups this behaviour was not possible as we would store the full match (including the prefix .*=) as a variable.

This PR allows schema rules to contain up to 1 capture group. If a capture group is present only the capture's match will be stored as a variable and anything surrounding it will be stored as static text. In the case where the capture is repeated (e.g. text(?<var>variable)+text)) all repetitions will be stored together as a single variable.

Checklist

The PR satisfies the contribution guidelines.
This is a breaking change and that has been indicated in the PR title, OR this isn't a
breaking change.
Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Added unit tests for schema creation with single or multiple captures.
TODO Generated archives and matched their contents.

Summary by CodeRabbit

New Features
- Extract variables from a single named capture group within a token while preserving surrounding text.
- Schema pattern updated to expose the value after '=' as a named capture for easier access.
- Encoding now supports decomposing tokens into constant parts and a captured variable for more compact storage.
Bug Fixes
- Clear validation error when a schema rule contains more than one capture group, including file and line context.
- Improved robustness and error handling during message encoding and timestamp pattern detection.
Tests
- Added tests for valid single-capture schemas and for rejecting multiple-capture schemas.

…a rule to have parity with the heuristic parser.

coderabbitai · 2025-08-28T15:37:40Z

Walkthrough

Adds runtime validation to schema loading enforcing ≤1 regex capture per schema variable, extends Archive writer to decompose tokens by capture groups and track encoded variables (and updates LogEventView qualification and timestamp handling), and adds tests and schema fixtures for single vs. multiple capture-group cases.

Changes

Cohort / File(s)	Summary
Schema validation `components/core/src/clp/Utils.cpp`	Adds `#include <string>` and inserts runtime checks during schema loading (in both schema_vars loops) to ensure each rule's regex has at most one positive capture; throws `std::runtime_error` containing schema path, rule line, rule name and capture count when >1. Keeps existing timestamp rule skip.
Archive writer — capture-aware token encoding & timestamp plumbing `components/core/src/clp/streaming_archive/writer/Archive.cpp`	Adds explicit includes, changes API to accept `log_surgeon::LogEventView const&`, passes `start`/`end` by reference into timestamp pattern search and throws if none found, tightens end-position handling, modernizes types/initialization, and adds logic to split tokens when capture groups exist: emit pre/post constants, add capture as encoded dictionary variable, track `m_var_ids`/`m_encoded_vars`, and throw on missing register IDs; preserves original behavior if no captures.
Tests — parser/schema validation `components/core/tests/test-ParserWithUserSchema.cpp`	Adds tests: asserts single named capture is detected and named correctly; asserts `load_lexer_from_file` throws for a rule with multiple capture groups and includes file path and rule name in the error message.
Test fixtures — schema examples `components/core/tests/test_schema_files/single_capture_group.txt`, `components/core/tests/test_schema_files/multiple_capture_groups.txt`	Adds schema examples: single named capture (`capture:text(?<group>var)text`) and multicapture (`multicapture:text(?<group0>var0)text(?<group1>var1)text`) with `delimiters: \r\n` line to exercise validation.
Config schemas update `components/core/config/schemas.txt`	Changes `equals` pattern from `equals:.=.[a-zA-Z0-9].` to `equals:.=(?<var>.[a-zA-Z0-9].)`, introducing a named capture `var` for the RHS value.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant App as Loader
  participant Utils as Utils::load_lexer_from_file
  participant Rule as Rule (regex)
  note over Utils,Rule: Schema load and capture validation (runs in both schema_vars loops)
  App->>Utils: load_lexer_from_file(schema_path)
  loop For each schema_var rule
    Utils->>Rule: get_subtree_positive_captures()
    alt > 1 capture
      Utils-->>App: throw runtime_error(file, line, rule_name, capture_count)
    else <= 1 capture
      Utils-->>App: continue processing
    end
  end

sequenceDiagram
  autonumber
  participant Caller as In
  participant Arch as Archive::write_msg_using_schema
  participant TS as TimestampPattern
  participant Dict as Dictionaries/Encoders
  Caller->>Arch: write_msg_using_schema(log_surgeon::LogEventView)
  Arch->>TS: search_known_ts_patterns(buf, start&, end&)
  TS-->>Arch: start/end positions (or none -> error)
  loop For each token
    alt Token has capture group
      Arch->>Dict: add constant (pre-capture)
      Arch->>Dict: add encoded variable (capture) and record var id
      Arch->>Dict: add constant (post-capture)
    else No capture
      Arch->>Dict: add variable (entire token)
    end
  end
  Arch-->>Caller: write complete

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check	✅ Passed	The title succinctly and accurately summarizes the primary change — adding support for a single capture group in schema rules to achieve parity with the heuristic parser — and includes scope and a breaking-change marker; it is specific, concise, and clearly conveys the main intent for reviewers scanning history.

✨ Finishing touches

📝 Generate Docstrings

🧪 Generate unit tests

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 5

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between fbd5a12 and a7a0be1.

📒 Files selected for processing (5)

components/core/src/clp/Utils.cpp (2 hunks)
components/core/src/clp/streaming_archive/writer/Archive.cpp (7 hunks)
components/core/tests/test-ParserWithUserSchema.cpp (1 hunks)
components/core/tests/test_schema_files/multiple_capture_groups.txt (1 hunks)
components/core/tests/test_schema_files/single_capture_group.txt (1 hunks)

🧰 Additional context used

📓 Path-based instructions (1)

**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

Prefer false == <expression> rather than !<expression>.

Files:

components/core/tests/test-ParserWithUserSchema.cpp
components/core/src/clp/Utils.cpp
components/core/src/clp/streaming_archive/writer/Archive.cpp

🧠 Learnings (1)

📚 Learning: 2024-10-14T03:42:10.355Z

Learnt from: LinZhihao-723
PR: y-scope/clp#558
File: components/core/tests/test-ffi_KeyValuePairLogEvent.cpp:14-14
Timestamp: 2024-10-14T03:42:10.355Z
Learning: In the file `components/core/tests/test-ffi_KeyValuePairLogEvent.cpp`, including `<json/single_include/nlohmann/json.hpp>` is consistent with the project's coding standards.

Applied to files:

components/core/src/clp/streaming_archive/writer/Archive.cpp

🧬 Code graph analysis (2)

components/core/tests/test-ParserWithUserSchema.cpp (2)

components/core/src/clp/Utils.cpp (2)

load_lexer_from_file (125-254)

load_lexer_from_file (126-126)

components/core/src/clp/Utils.hpp (1)

load_lexer_from_file (52-55)

components/core/src/clp/streaming_archive/writer/Archive.cpp (3)

components/core/src/clp/streaming_archive/writer/Archive.hpp (2)

timestamp (146-146)

logtype_id (229-232)

components/core/src/clp/streaming_archive/writer/File.hpp (1)

timestamp (89-95)

components/core/src/clp/EncodedVariableInterpreter.hpp (3)

encoded_var (52-52)

encoded_var (111-111)

id (51-51)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: antlr-code-committed (macos-15)
GitHub Check: lint-check (macos-15)

🔇 Additional comments (8)

components/core/src/clp/Utils.cpp (1)

11-11: Header addition looks good.

Adding resolves downstream std::string usage here.

components/core/src/clp/streaming_archive/writer/Archive.cpp (4)

362-368: Potential off-by-one in uncompressed byte count (verify semantics).

If m_end_pos is exclusive, end_pos - start_pos is correct; if inclusive, add +1. Please verify Token.m_end_pos semantics, especially across buffer wrap.

Consider a focused unit test with a two-token message exercising boundary conditions (start_pos == end_pos and wrap).

371-382: Good: token_type made const and delimiter handling left intact.

No issues spotted; aligns with existing flow.

485-496: OK: zero-initialised logtype_id and downstream writes.

This aligns with safer initialisation and existing dictionary API.

317-341: No signature mismatch found
Header and implementation both declare write_msg_using_schema(log_surgeon::LogEventView const&); no action required.

components/core/tests/test_schema_files/single_capture_group.txt (1)

1-1: Fixture is minimal and appropriate.

Covers the intended single-capture scenario with surrounding literals.

components/core/tests/test_schema_files/multiple_capture_groups.txt (1)

1-1: Good negative fixture.

Triggers the >1 capture validation path as desired.

components/core/tests/test-ParserWithUserSchema.cpp (1)

195-204: Exact error assertion is OK; keep in sync if message changes.

If you accept the optional richer error in Utils.cpp, adjust this expectation accordingly (or match a substring).

coderabbitai · 2025-08-28T15:47:21Z

components/core/src/clp/streaming_archive/writer/Archive.cpp

            default: {
-                // Variable string looks like a dictionary variable, so encode it as so
-                encoded_variable_t encoded_var;
-                variable_dictionary_id_t id;
-                m_var_dict.add_entry(token.to_string(), id);
-                encoded_var = EncodedVariableInterpreter::encode_var_dict_id(id);
-                m_var_ids.push_back(id);
+                auto const& lexer{log_view.get_log_parser().m_lexer};
+                auto capture_ids{lexer.get_capture_ids_from_rule_id(token_type)};
+
+                // If the variable token contains capture groups, we break the token up by storing
+                // each capture as a variable and any substrings surrounding the capture as part of
+                // the logtype. If there are no capture groups the entire variable token is stored
+                // as a variable.
+                if (false == capture_ids.has_value()) {
+                    variable_dictionary_id_t id{};
+                    m_var_dict.add_entry(token.to_string(), id);
+                    m_var_ids.push_back(id);
+                    m_encoded_vars.push_back(EncodedVariableInterpreter::encode_var_dict_id(id));
+                    m_logtype_dict_entry.add_dictionary_var();
+
+                    break;
+                }



⚠️ Potential issue

Guard against empty capture-id vector.

has_value() can hold an empty vector; at(0) would throw. Treat empty as “no capture.”

Apply:

- if (false == capture_ids.has_value()) { + if (false == capture_ids.has_value() || capture_ids->empty()) { variable_dictionary_id_t id{}; m_var_dict.add_entry(token.to_string(), id); m_var_ids.push_back(id); m_encoded_vars.push_back(EncodedVariableInterpreter::encode_var_dict_id(id)); m_logtype_dict_entry.add_dictionary_var(); break; }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

default: {

// Variable string looks like a dictionary variable, so encode it as so

encoded_variable_t encoded_var;

variable_dictionary_id_t id;

m_var_dict.add_entry(token.to_string(), id);

encoded_var = EncodedVariableInterpreter::encode_var_dict_id(id);

m_var_ids.push_back(id);

auto const& lexer{log_view.get_log_parser().m_lexer};

auto capture_ids{lexer.get_capture_ids_from_rule_id(token_type)};

// If the variable token contains capture groups, we break the token up by storing

// each capture as a variable and any substrings surrounding the capture as part of

// the logtype. If there are no capture groups the entire variable token is stored

// as a variable.

if (false == capture_ids.has_value()) {

variable_dictionary_id_t id{};

m_var_dict.add_entry(token.to_string(), id);

m_var_ids.push_back(id);

m_encoded_vars.push_back(EncodedVariableInterpreter::encode_var_dict_id(id));

m_logtype_dict_entry.add_dictionary_var();

break;

}

default: {

auto const& lexer{log_view.get_log_parser().m_lexer};

auto capture_ids{lexer.get_capture_ids_from_rule_id(token_type)};

// If the variable token contains capture groups, we break the token up by storing

// each capture as a variable and any substrings surrounding the capture as part of

// the logtype. If there are no capture groups the entire variable token is stored

// as a variable.

if (false == capture_ids.has_value() || capture_ids->empty()) {

variable_dictionary_id_t id{};

m_var_dict.add_entry(token.to_string(), id);

m_var_ids.push_back(id);

m_encoded_vars.push_back(EncodedVariableInterpreter::encode_var_dict_id(id));

m_logtype_dict_entry.add_dictionary_var();

break;

}

// … rest of processing when capture_ids has at least one element …

🤖 Prompt for AI Agents

In components/core/src/clp/streaming_archive/writer/Archive.cpp around lines 423-440, the code treats capture_ids.has_value() as sufficient but an empty vector is possible and later at(0) would throw; update the guard to treat an empty capture-id vector as “no capture” by checking both has_value() and !capture_ids->empty() (e.g., if (!capture_ids.has_value() || capture_ids->empty()) handle as the no-capture path), and ensure subsequent uses of capture_ids access elements only after this emptiness check.

components/core/src/clp/streaming_archive/writer/Archive.cpp

components/core/src/clp/Utils.cpp

coderabbitai · 2025-08-28T15:47:22Z

components/core/tests/test-ParserWithUserSchema.cpp

+TEST_CASE("Test schema with single capture group", "[load_lexer]") {
+    std::string const schema_path{"../tests/test_schema_files/single_capture_group.txt"};
+    ByteLexer lexer;
+    load_lexer_from_file(schema_path, lexer);
+
+    auto const rule_id{lexer.m_symbol_id.at("capture")};
+    auto const capture_ids{lexer.get_capture_ids_from_rule_id(rule_id)};
+    REQUIRE(capture_ids.has_value());
+    REQUIRE(1 == capture_ids.value().size());
+    REQUIRE("group" == lexer.m_id_symbol.at(capture_ids->at(0)));
+}


🧹 Nitpick (assertive)

Solid positive test; consider adding a repetition case.

This verifies single capture presence and naming. Add a case where the capture is repeated (e.g., (?a)+) to assert aggregation semantics end-to-end.

I can draft that test using the existing ByteLexer and a small sample line.

🤖 Prompt for AI Agents

In components/core/tests/test-ParserWithUserSchema.cpp around lines 183-193, add a new TEST_CASE that mirrors the single-capture test but uses a schema with a repeated named capture (e.g., (?<g>a)+); load that schema into ByteLexer, retrieve the rule and capture ids and assert the capture id exists and maps to "g", then run the lexer against a sample input like "aaa" and assert the lexer returns multiple aggregated capture values for "g" (e.g., three captured "a" entries) to validate repetition aggregation end-to-end.

SharafMohamed · 2025-09-02T11:11:21Z

components/core/src/clp/streaming_archive/writer/Archive.cpp

+                // If the variable token contains capture groups, we break the token up by storing
+                // each capture as a variable and any substrings surrounding the capture as part of
+                // the logtype. If there are no capture groups the entire variable token is stored
+                // as a variable.


Flip the order of the sentences in the docstring to match the order of the cases in the code? I.e., the no capture case first, then the capture case.

SharafMohamed · 2025-09-02T11:22:22Z

components/core/src/clp/streaming_archive/writer/Archive.cpp

+                auto const capture_start{token.get_reversed_reg_positions(start_reg_id).front()};
+                auto const capture_end{token.get_reversed_reg_positions(end_reg_id).back()};


I'm assuming the behavior we want is to treat multi-valued captures as a single variable. In that case the correct usage should be .back() for the start capture and .front() for the end capture (since the reg positions are reversed).

Ideally we'd have a unit-test for this case, but I think multi-valued captures are currently a log-surgeon issue.

SharafMohamed · 2025-09-02T11:26:54Z

components/core/src/clp/streaming_archive/writer/Archive.cpp

+                auto const capture_start{token.get_reversed_reg_positions(start_reg_id).front()};
+                auto const capture_end{token.get_reversed_reg_positions(end_reg_id).back()};
+                auto const token_view{token.to_string_view()};
+                size_t token_pos{0};


Is it needed to track token_pos. Aren't the segments [0, capture_start), [capture_start, capture_end) and [capture_end, token_view.length())?

Perhaps the following is the most direct (and most log-surgeon independent) implementation. Manipulate token and call to_string_view as needed:

Store token_end_pos = token.m_end_pos

For the prefix static-text: Set token.m_end_pos= capture_start

For the variable itself: Set token.m_start_pos= capture_start and token.m_end_pos= capture_end

For the suffix static-text: Set token.m_start_pos= capture_end and token.m_end_pos= token_end_pos

This way CLP does not care about how the tokens wrap around is handled internally. Also nice for future-proofing in case LS changes its behavior.

We should also unit-test this if possible.

Also: for clarity, maybe format the code such that the prefix is a block with a newline after it, the variable is a block with a newline after it, and the suffix is a block with a newline after it.

SharafMohamed · 2025-09-02T11:29:59Z

components/core/tests/test_schema_files/multiple_capture_groups.txt

@@ -0,0 +1 @@
+multicapture:text(?<group0>var0)text(?<group1>var1)text


Technically lexers can exist with the schema as is, so I'm not entirely sure about this comment. But, if any LogParser object is created with this schema it will error as it lacks delimiters. So this schema file is malformed according to Log Surgeon.

SharafMohamed · 2025-09-02T11:30:13Z

components/core/tests/test_schema_files/single_capture_group.txt

@@ -0,0 +1 @@
+capture:text(?<group>var)text


Same as above.

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

components/core/src/clp/streaming_archive/writer/Archive.cpp (2)

362-366: Guard against empty buffer when computing end_pos.

pos() − 1 will underflow if the buffer is empty. Unlikely, but add a defensive check/assert.

Would you add a precondition (e.g., assert(log_output_buffer->pos() > 0)) before using pos() − 1?

391-405: Fix: track var_ids for int/float dictionary fallbacks and follow negation style.

When integer/float cannot be encoded, you add a dictionary entry but don’t push the var ID into m_var_ids, which breaks segment indexing. Also, prefer “false == …” per repo style.

-                encoded_variable_t encoded_var{};
-                if (!EncodedVariableInterpreter::convert_string_to_representable_integer_var(
+                encoded_variable_t encoded_var{};
+                if (false == EncodedVariableInterpreter::convert_string_to_representable_integer_var(
                             token.to_string(),
                             encoded_var
                     ))
                 {
                     variable_dictionary_id_t id{};
                     m_var_dict.add_entry(token.to_string(), id);
+                    m_var_ids.push_back(id);
                     encoded_var = EncodedVariableInterpreter::encode_var_dict_id(id);
                     m_logtype_dict_entry.add_dictionary_var();
                 } else {
                     m_logtype_dict_entry.add_int_var();
                 }
                 m_encoded_vars.push_back(encoded_var);
@@
-                encoded_variable_t encoded_var{};
-                if (!EncodedVariableInterpreter::convert_string_to_representable_float_var(
+                encoded_variable_t encoded_var{};
+                if (false == EncodedVariableInterpreter::convert_string_to_representable_float_var(
                             token.to_string(),
                             encoded_var
                     ))
                 {
                     variable_dictionary_id_t id{};
                     m_var_dict.add_entry(token.to_string(), id);
+                    m_var_ids.push_back(id);
                     encoded_var = EncodedVariableInterpreter::encode_var_dict_id(id);
                     m_logtype_dict_entry.add_dictionary_var();
                 } else {
                     m_logtype_dict_entry.add_float_var();
                 }
                 m_encoded_vars.push_back(encoded_var);

Also applies to: 407-421

♻️ Duplicate comments (2)

components/core/src/clp/streaming_archive/writer/Archive.cpp (2)

431-440: Also treat empty capture-id vectors as “no capture.”

has_value() can hold an empty vector; later at(0) would throw. Handle empty as the no‑capture path.

-                auto capture_ids{lexer.get_capture_ids_from_rule_id(token_type)};
-                if (false == capture_ids.has_value()) {
+                auto capture_ids{lexer.get_capture_ids_from_rule_id(token_type)};
+                if (false == capture_ids.has_value() || capture_ids->empty()) {
                     variable_dictionary_id_t id{};
                     m_var_dict.add_entry(token.to_string(), id);
                     m_var_ids.push_back(id);
                     m_encoded_vars.push_back(EncodedVariableInterpreter::encode_var_dict_id(id));
                     m_logtype_dict_entry.add_dictionary_var();
 
                     break;
                 }

442-454: Validate register positions before using front()/back().

get_reversed_reg_positions(...) may return empty; front()/back() would be UB. Throw a clear error if empty.

-                auto const [start_reg_id, end_reg_id]{register_ids.value()};
-                auto const capture_start{token.get_reversed_reg_positions(start_reg_id).back()};
-                auto const capture_end{token.get_reversed_reg_positions(end_reg_id).front()};
+                auto const [start_reg_id, end_reg_id]{register_ids.value()};
+                auto const& start_positions = token.get_reversed_reg_positions(start_reg_id);
+                auto const& end_positions = token.get_reversed_reg_positions(end_reg_id);
+                if (start_positions.empty() || end_positions.empty()) {
+                    throw(std::runtime_error(
+                            "Empty register positions for variable's capture group. Full token: "
+                            + token.to_string()
+                    ));
+                }
+                auto const capture_start{start_positions.back()};
+                auto const capture_end{end_positions.front()};

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dc591cd and 03a1f02.

📒 Files selected for processing (2)

components/core/src/clp/Utils.cpp (2 hunks)
components/core/src/clp/streaming_archive/writer/Archive.cpp (7 hunks)

🧰 Additional context used

📓 Path-based instructions (1)

**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

Prefer false == <expression> rather than !<expression>.

Files:

components/core/src/clp/Utils.cpp
components/core/src/clp/streaming_archive/writer/Archive.cpp

🧠 Learnings (1)

📚 Learning: 2024-10-14T03:42:10.355Z

Learnt from: LinZhihao-723
PR: y-scope/clp#558
File: components/core/tests/test-ffi_KeyValuePairLogEvent.cpp:14-14
Timestamp: 2024-10-14T03:42:10.355Z
Learning: In the file `components/core/tests/test-ffi_KeyValuePairLogEvent.cpp`, including `<json/single_include/nlohmann/json.hpp>` is consistent with the project's coding standards.

Applied to files:

components/core/src/clp/streaming_archive/writer/Archive.cpp

🧬 Code graph analysis (1)

components/core/src/clp/streaming_archive/writer/Archive.cpp (3)

components/core/src/clp/streaming_archive/writer/Archive.hpp (2)

timestamp (146-146)

logtype_id (229-232)

components/core/src/clp/streaming_archive/writer/File.hpp (1)

timestamp (89-95)

components/core/src/clp/EncodedVariableInterpreter.cpp (6)

convert_string_to_representable_integer_var (24-61)

convert_string_to_representable_integer_var (24-27)

convert_string_to_representable_float_var (63-142)

convert_string_to_representable_float_var (63-66)

encode_var_dict_id (199-201)

encode_var_dict_id (199-199)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)

GitHub Check: centos-stream-9-static-linked-bins
GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
GitHub Check: ubuntu-jammy-dynamic-linked-bins
GitHub Check: musllinux_1_2-x86_64-static-linked-bins
GitHub Check: antlr-code-committed (macos-15)

🔇 Additional comments (3)

components/core/src/clp/Utils.cpp (1)

11-11: LGTM: include is appropriate.

Needed for the new error message construction.

components/core/src/clp/streaming_archive/writer/Archive.cpp (2)

5-10: LGTM: header additions are appropriate.

These headers match the new usage and remove transitive‑include reliance.

Also applies to: 15-15

479-491: LGTM: correct style and write path.

Style matches guideline (false == …) and the write/update path is consistent.

coderabbitai · 2025-09-19T14:40:34Z

components/core/src/clp/streaming_archive/writer/Archive.cpp

@@ -369,7 +368,7 @@ void Archive::write_msg_using_schema(LogEventView const& log_view) {
    }
    for (uint32_t i = 1; i < log_output_buffer->pos(); i++) {
        log_surgeon::Token& token = log_output_buffer->get_mutable_token(i);
-        int token_type = token.m_type_ids_ptr->at(0);
+        auto const token_type{token.m_type_ids_ptr->at(0)};


🧹 Nitpick

Sanity‑check token type vector.

at(0) assumes m_type_ids_ptr is non‑null and non‑empty; add a brief guard/diagnostic.

Would you assert that token.m_type_ids_ptr != nullptr and token.m_type_ids_ptr->empty() == false before use?

🤖 Prompt for AI Agents

In components/core/src/clp/streaming_archive/writer/Archive.cpp around line 371, the code calls token.m_type_ids_ptr->at(0) without checking that m_type_ids_ptr is non-null and that the vector is non-empty; add a guard/diagnostic by asserting token.m_type_ids_ptr != nullptr and !token.m_type_ids_ptr->empty() (or using an explicit if that logs/error-handles and returns) immediately before this line so callers get a clear diagnostic instead of UB, and include a brief message describing the token/stream context in the assertion/log.

coderabbitai · 2025-09-19T14:40:35Z

components/core/src/clp/Utils.cpp

+        // Currently, we only support at most a single capture group in each variable. If a capture
+        // group is present its match will be treated as the variable rather than the full match.
+        auto const num_captures = rule->m_regex_ptr->get_subtree_positive_captures().size();
+        if (1 < num_captures) {
+            throw std::runtime_error(
+                    schema_file_path + ":" + std::to_string(rule->m_line_num + 1)
+                    + ": error: the schema rule '" + rule->m_name
+                    + "' has a regex pattern containing > 1 capture groups (found "
+                    + std::to_string(num_captures) + ").\n"
+            );
+        }
+


🛠️ Refactor suggestion

🧩 Analysis chain

Don’t enforce single‑capture on the 'timestamp' rule (unless intended) + tweak wording.

The validation block currently runs before the 'timestamp' early‑continue, changing behaviour for timestamp rules. If timestamps are exempt (as before), move the check below the 'timestamp' guard. Also, fix the error text (“more than one capture group”).

- // Currently, we only support at most a single capture group in each variable. If a capture - // group is present its match will be treated as the variable rather than the full match. - auto const num_captures = rule->m_regex_ptr->get_subtree_positive_captures().size(); - if (1 < num_captures) { - throw std::runtime_error( - schema_file_path + ":" + std::to_string(rule->m_line_num + 1) - + ": error: the schema rule '" + rule->m_name - + "' has a regex pattern containing > 1 capture groups (found " - + std::to_string(num_captures) + ").\n" - ); - } - if ("timestamp" == rule->m_name) { continue; } + + // Currently, we only support at most a single capture group in each variable. If a capture + // group is present its match will be treated as the variable rather than the full match. + auto const num_captures = rule->m_regex_ptr->get_subtree_positive_captures().size(); + if (1 < num_captures) { + throw std::runtime_error( + schema_file_path + ":" + std::to_string(rule->m_line_num + 1) + + ": error: the schema rule '" + rule->m_name + + "' has a regex pattern containing more than one capture group (found " + + std::to_string(num_captures) + ").\n" + ); + }

Also applies to: 204-207

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "Searching repository for occurrences of get_subtree_positive_captures..." rg -n -C3 -S 'get_subtree_positive_captures' || true echo echo "Searching repository for occurrences of the timestamp rule guard (\"timestamp\" == rule->m_name)..." rg -n -C5 -S '"timestamp"\s*==\s*rule->m_name' || true

Length of output: 1696

Move single-capture validation below the 'timestamp' guard and fix error wording.

The capture-count check currently runs before the 'timestamp' early-continue, changing behaviour for timestamp rules — move it after the guard and change the error text to "more than one capture group".

File: components/core/src/clp/Utils.cpp

- // Currently, we only support at most a single capture group in each variable. If a capture - // group is present its match will be treated as the variable rather than the full match. - auto const num_captures = rule->m_regex_ptr->get_subtree_positive_captures().size(); - if (1 < num_captures) { - throw std::runtime_error( - schema_file_path + ":" + std::to_string(rule->m_line_num + 1) - + ": error: the schema rule '" + rule->m_name - + "' has a regex pattern containing > 1 capture groups (found " - + std::to_string(num_captures) + ").\n" - ); - } - if ("timestamp" == rule->m_name) { continue; } + + // Currently, we only support at most a single capture group in each variable. If a capture + // group is present its match will be treated as the variable rather than the full match. + auto const num_captures = rule->m_regex_ptr->get_subtree_positive_captures().size(); + if (1 < num_captures) { + throw std::runtime_error( + schema_file_path + ":" + std::to_string(rule->m_line_num + 1) + + ": error: the schema rule '" + rule->m_name + + "' has a regex pattern containing more than one capture group (found " + + std::to_string(num_captures) + ").\n" + ); + }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

// Currently, we only support at most a single capture group in each variable. If a capture

// group is present its match will be treated as the variable rather than the full match.

auto const num_captures = rule->m_regex_ptr->get_subtree_positive_captures().size();

if (1 < num_captures) {

throw std::runtime_error(

schema_file_path + ":" + std::to_string(rule->m_line_num + 1)

+ ": error: the schema rule '" + rule->m_name

+ "' has a regex pattern containing > 1 capture groups (found "

+ std::to_string(num_captures) + ").\n"

);

}

if ("timestamp" == rule->m_name) {

continue;

}

// Currently, we only support at most a single capture group in each variable. If a capture

// group is present its match will be treated as the variable rather than the full match.

auto const num_captures = rule->m_regex_ptr->get_subtree_positive_captures().size();

if (1 < num_captures) {

throw std::runtime_error(

schema_file_path + ":" + std::to_string(rule->m_line_num + 1)

": error: the schema rule '" + rule->m_name

"' has a regex pattern containing more than one capture group (found "

std::to_string(num_captures) + ").\n"

);

}

🤖 Prompt for AI Agents

In components/core/src/clp/Utils.cpp around lines 192–203, the check that throws when a regex has more than one capture group must be moved to after the early-continue/guard that handles 'timestamp' rules so timestamp rules are not affected; relocate the entire num_captures check so it runs only when the timestamp branch has been bypassed, and change the thrown error text to read "... has a regex pattern with more than one capture group (found X)." keeping the same numeric check and including schema_file_path, line number and rule name as before.

davidlion added 2 commits August 28, 2025 11:20

feat(log-surgeon)!: Add support for a single capture group in a schem…

4931f6b

…a rule to have parity with the heuristic parser.

Merge remote-tracking branch 'upstream/main' into capture-support

a7a0be1

davidlion requested a review from a team as a code owner August 28, 2025 15:37

Add schema equals rule capture.

61d5715

coderabbitai bot reviewed Aug 28, 2025

View reviewed changes

SharafMohamed reviewed Sep 2, 2025

View reviewed changes

components/core/tests/test_schema_files/single_capture_group.txt

@@ -0,0 +1 @@

capture:text(?<group>var)text

Copy link

Contributor

SharafMohamed Sep 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

davidlion added 5 commits September 16, 2025 23:42

Merge remote-tracking branch 'upstream/main' into capture-support

daf18ad

Address some review comments.

20b26f9

Tweak tests.

b8089f1

Fix wrapped token logic.

98d1088

Merge remote-tracking branch 'upstream/main' into capture-support

dc591cd

davidlion requested a review from SharafMohamed September 19, 2025 14:03

davidlion added 2 commits September 19, 2025 10:28

Fix formatting.

dec9f6e

Remove wrapping logic by manipulating the token.

03a1f02

coderabbitai bot reviewed Sep 19, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(log-surgeon)!: Add support for a single capture group in a schema rule to have parity with the heuristic parser. #1273

feat(log-surgeon)!: Add support for a single capture group in a schema rule to have parity with the heuristic parser. #1273

davidlion commented Aug 28, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Aug 28, 2025 •

edited

Loading

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Aug 28, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot Aug 28, 2025

Uh oh!

SharafMohamed Sep 2, 2025

Uh oh!

SharafMohamed Sep 2, 2025 •

edited

Loading

Uh oh!

SharafMohamed Sep 2, 2025 •

edited

Loading

Uh oh!

SharafMohamed Sep 18, 2025

Uh oh!

SharafMohamed Sep 18, 2025 •

edited

Loading

Uh oh!

SharafMohamed Sep 2, 2025 •

edited

Loading

Uh oh!

SharafMohamed Sep 2, 2025

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Sep 19, 2025

Uh oh!

coderabbitai bot Sep 19, 2025

Uh oh!

Uh oh!

		auto const capture_start{token.get_reversed_reg_positions(start_reg_id).front()};
		auto const capture_end{token.get_reversed_reg_positions(end_reg_id).back()};

		@@ -0,0 +1 @@
		multicapture:text(?<group0>var0)text(?<group1>var1)text

feat(log-surgeon)!: Add support for a single capture group in a schema rule to have parity with the heuristic parser. #1273

Are you sure you want to change the base?

feat(log-surgeon)!: Add support for a single capture group in a schema rule to have parity with the heuristic parser. #1273

Conversation

davidlion commented Aug 28, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Checklist

Validation performed

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Aug 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Aug 28, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot Aug 28, 2025

Choose a reason for hiding this comment

Uh oh!

SharafMohamed Sep 2, 2025

Choose a reason for hiding this comment

Uh oh!

SharafMohamed Sep 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

SharafMohamed Sep 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

SharafMohamed Sep 18, 2025

Choose a reason for hiding this comment

Uh oh!

SharafMohamed Sep 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

SharafMohamed Sep 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

SharafMohamed Sep 2, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Sep 19, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Sep 19, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

davidlion commented Aug 28, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Aug 28, 2025 •

edited

Loading

SharafMohamed Sep 2, 2025 •

edited

Loading

SharafMohamed Sep 2, 2025 •

edited

Loading

SharafMohamed Sep 18, 2025 •

edited

Loading

SharafMohamed Sep 2, 2025 •

edited

Loading