-
Notifications
You must be signed in to change notification settings - Fork 81
feat(log-surgeon)!: Add support for a single capture group in a schema rule to have parity with the heuristic parser. #1273
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…a rule to have parity with the heuristic parser.
WalkthroughAdds runtime validation to schema loading enforcing ≤1 regex capture per schema variable, extends Archive writer to decompose tokens by capture groups and track encoded variables (and updates LogEventView qualification and timestamp handling), and adds tests and schema fixtures for single vs. multiple capture-group cases. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant App as Loader
participant Utils as Utils::load_lexer_from_file
participant Rule as Rule (regex)
note over Utils,Rule: Schema load and capture validation (runs in both schema_vars loops)
App->>Utils: load_lexer_from_file(schema_path)
loop For each schema_var rule
Utils->>Rule: get_subtree_positive_captures()
alt > 1 capture
Utils-->>App: throw runtime_error(file, line, rule_name, capture_count)
else <= 1 capture
Utils-->>App: continue processing
end
end
sequenceDiagram
autonumber
participant Caller as In
participant Arch as Archive::write_msg_using_schema
participant TS as TimestampPattern
participant Dict as Dictionaries/Encoders
Caller->>Arch: write_msg_using_schema(log_surgeon::LogEventView)
Arch->>TS: search_known_ts_patterns(buf, start&, end&)
TS-->>Arch: start/end positions (or none -> error)
loop For each token
alt Token has capture group
Arch->>Dict: add constant (pre-capture)
Arch->>Dict: add encoded variable (capture) and record var id
Arch->>Dict: add constant (post-capture)
else No capture
Arch->>Dict: add variable (entire token)
end
end
Arch-->>Caller: write complete
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 5
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (5)
components/core/src/clp/Utils.cpp
(2 hunks)components/core/src/clp/streaming_archive/writer/Archive.cpp
(7 hunks)components/core/tests/test-ParserWithUserSchema.cpp
(1 hunks)components/core/tests/test_schema_files/multiple_capture_groups.txt
(1 hunks)components/core/tests/test_schema_files/single_capture_group.txt
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer
false == <expression>
rather than!<expression>
.
Files:
components/core/tests/test-ParserWithUserSchema.cpp
components/core/src/clp/Utils.cpp
components/core/src/clp/streaming_archive/writer/Archive.cpp
🧠 Learnings (1)
📚 Learning: 2024-10-14T03:42:10.355Z
Learnt from: LinZhihao-723
PR: y-scope/clp#558
File: components/core/tests/test-ffi_KeyValuePairLogEvent.cpp:14-14
Timestamp: 2024-10-14T03:42:10.355Z
Learning: In the file `components/core/tests/test-ffi_KeyValuePairLogEvent.cpp`, including `<json/single_include/nlohmann/json.hpp>` is consistent with the project's coding standards.
Applied to files:
components/core/src/clp/streaming_archive/writer/Archive.cpp
🧬 Code graph analysis (2)
components/core/tests/test-ParserWithUserSchema.cpp (2)
components/core/src/clp/Utils.cpp (2)
load_lexer_from_file
(125-254)load_lexer_from_file
(126-126)components/core/src/clp/Utils.hpp (1)
load_lexer_from_file
(52-55)
components/core/src/clp/streaming_archive/writer/Archive.cpp (3)
components/core/src/clp/streaming_archive/writer/Archive.hpp (2)
timestamp
(146-146)logtype_id
(229-232)components/core/src/clp/streaming_archive/writer/File.hpp (1)
timestamp
(89-95)components/core/src/clp/EncodedVariableInterpreter.hpp (3)
encoded_var
(52-52)encoded_var
(111-111)id
(51-51)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: antlr-code-committed (macos-15)
- GitHub Check: lint-check (macos-15)
🔇 Additional comments (8)
components/core/src/clp/Utils.cpp (1)
11-11
: Header addition looks good.Adding resolves downstream std::string usage here.
components/core/src/clp/streaming_archive/writer/Archive.cpp (4)
362-368
: Potential off-by-one in uncompressed byte count (verify semantics).If m_end_pos is exclusive, end_pos - start_pos is correct; if inclusive, add +1. Please verify Token.m_end_pos semantics, especially across buffer wrap.
Consider a focused unit test with a two-token message exercising boundary conditions (start_pos == end_pos and wrap).
371-382
: Good: token_type made const and delimiter handling left intact.No issues spotted; aligns with existing flow.
485-496
: OK: zero-initialised logtype_id and downstream writes.This aligns with safer initialisation and existing dictionary API.
317-341
: No signature mismatch found
Header and implementation both declare write_msg_using_schema(log_surgeon::LogEventView const&); no action required.components/core/tests/test_schema_files/single_capture_group.txt (1)
1-1
: Fixture is minimal and appropriate.Covers the intended single-capture scenario with surrounding literals.
components/core/tests/test_schema_files/multiple_capture_groups.txt (1)
1-1
: Good negative fixture.Triggers the >1 capture validation path as desired.
components/core/tests/test-ParserWithUserSchema.cpp (1)
195-204
: Exact error assertion is OK; keep in sync if message changes.If you accept the optional richer error in Utils.cpp, adjust this expectation accordingly (or match a substring).
default: { | ||
// Variable string looks like a dictionary variable, so encode it as so | ||
encoded_variable_t encoded_var; | ||
variable_dictionary_id_t id; | ||
m_var_dict.add_entry(token.to_string(), id); | ||
encoded_var = EncodedVariableInterpreter::encode_var_dict_id(id); | ||
m_var_ids.push_back(id); | ||
auto const& lexer{log_view.get_log_parser().m_lexer}; | ||
auto capture_ids{lexer.get_capture_ids_from_rule_id(token_type)}; | ||
|
||
// If the variable token contains capture groups, we break the token up by storing | ||
// each capture as a variable and any substrings surrounding the capture as part of | ||
// the logtype. If there are no capture groups the entire variable token is stored | ||
// as a variable. | ||
if (false == capture_ids.has_value()) { | ||
variable_dictionary_id_t id{}; | ||
m_var_dict.add_entry(token.to_string(), id); | ||
m_var_ids.push_back(id); | ||
m_encoded_vars.push_back(EncodedVariableInterpreter::encode_var_dict_id(id)); | ||
m_logtype_dict_entry.add_dictionary_var(); | ||
|
||
break; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Guard against empty capture-id vector.
has_value() can hold an empty vector; at(0) would throw. Treat empty as “no capture.”
Apply:
- if (false == capture_ids.has_value()) {
+ if (false == capture_ids.has_value() || capture_ids->empty()) {
variable_dictionary_id_t id{};
m_var_dict.add_entry(token.to_string(), id);
m_var_ids.push_back(id);
m_encoded_vars.push_back(EncodedVariableInterpreter::encode_var_dict_id(id));
m_logtype_dict_entry.add_dictionary_var();
break;
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
default: { | |
// Variable string looks like a dictionary variable, so encode it as so | |
encoded_variable_t encoded_var; | |
variable_dictionary_id_t id; | |
m_var_dict.add_entry(token.to_string(), id); | |
encoded_var = EncodedVariableInterpreter::encode_var_dict_id(id); | |
m_var_ids.push_back(id); | |
auto const& lexer{log_view.get_log_parser().m_lexer}; | |
auto capture_ids{lexer.get_capture_ids_from_rule_id(token_type)}; | |
// If the variable token contains capture groups, we break the token up by storing | |
// each capture as a variable and any substrings surrounding the capture as part of | |
// the logtype. If there are no capture groups the entire variable token is stored | |
// as a variable. | |
if (false == capture_ids.has_value()) { | |
variable_dictionary_id_t id{}; | |
m_var_dict.add_entry(token.to_string(), id); | |
m_var_ids.push_back(id); | |
m_encoded_vars.push_back(EncodedVariableInterpreter::encode_var_dict_id(id)); | |
m_logtype_dict_entry.add_dictionary_var(); | |
break; | |
} | |
default: { | |
auto const& lexer{log_view.get_log_parser().m_lexer}; | |
auto capture_ids{lexer.get_capture_ids_from_rule_id(token_type)}; | |
// If the variable token contains capture groups, we break the token up by storing | |
// each capture as a variable and any substrings surrounding the capture as part of | |
// the logtype. If there are no capture groups the entire variable token is stored | |
// as a variable. | |
if (false == capture_ids.has_value() || capture_ids->empty()) { | |
variable_dictionary_id_t id{}; | |
m_var_dict.add_entry(token.to_string(), id); | |
m_var_ids.push_back(id); | |
m_encoded_vars.push_back(EncodedVariableInterpreter::encode_var_dict_id(id)); | |
m_logtype_dict_entry.add_dictionary_var(); | |
break; | |
} | |
// … rest of processing when capture_ids has at least one element … |
🤖 Prompt for AI Agents
In components/core/src/clp/streaming_archive/writer/Archive.cpp around lines
423-440, the code treats capture_ids.has_value() as sufficient but an empty
vector is possible and later at(0) would throw; update the guard to treat an
empty capture-id vector as “no capture” by checking both has_value() and
!capture_ids->empty() (e.g., if (!capture_ids.has_value() ||
capture_ids->empty()) handle as the no-capture path), and ensure subsequent uses
of capture_ids access elements only after this emptiness check.
TEST_CASE("Test schema with single capture group", "[load_lexer]") { | ||
std::string const schema_path{"../tests/test_schema_files/single_capture_group.txt"}; | ||
ByteLexer lexer; | ||
load_lexer_from_file(schema_path, lexer); | ||
|
||
auto const rule_id{lexer.m_symbol_id.at("capture")}; | ||
auto const capture_ids{lexer.get_capture_ids_from_rule_id(rule_id)}; | ||
REQUIRE(capture_ids.has_value()); | ||
REQUIRE(1 == capture_ids.value().size()); | ||
REQUIRE("group" == lexer.m_id_symbol.at(capture_ids->at(0))); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick (assertive)
Solid positive test; consider adding a repetition case.
This verifies single capture presence and naming. Add a case where the capture is repeated (e.g., (?a)+) to assert aggregation semantics end-to-end.
I can draft that test using the existing ByteLexer and a small sample line.
🤖 Prompt for AI Agents
In components/core/tests/test-ParserWithUserSchema.cpp around lines 183-193, add
a new TEST_CASE that mirrors the single-capture test but uses a schema with a
repeated named capture (e.g., (?<g>a)+); load that schema into ByteLexer,
retrieve the rule and capture ids and assert the capture id exists and maps to
"g", then run the lexer against a sample input like "aaa" and assert the lexer
returns multiple aggregated capture values for "g" (e.g., three captured "a"
entries) to validate repetition aggregation end-to-end.
// If the variable token contains capture groups, we break the token up by storing | ||
// each capture as a variable and any substrings surrounding the capture as part of | ||
// the logtype. If there are no capture groups the entire variable token is stored | ||
// as a variable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Flip the order of the sentences in the docstring to match the order of the cases in the code? I.e., the no capture case first, then the capture case.
auto const capture_start{token.get_reversed_reg_positions(start_reg_id).front()}; | ||
auto const capture_end{token.get_reversed_reg_positions(end_reg_id).back()}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm assuming the behavior we want is to treat multi-valued captures as a single variable. In that case the correct usage should be .back()
for the start capture and .front()
for the end capture (since the reg positions are reversed).
Ideally we'd have a unit-test for this case, but I think multi-valued captures are currently a log-surgeon issue.
auto const capture_start{token.get_reversed_reg_positions(start_reg_id).front()}; | ||
auto const capture_end{token.get_reversed_reg_positions(end_reg_id).back()}; | ||
auto const token_view{token.to_string_view()}; | ||
size_t token_pos{0}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it needed to track token_pos
. Aren't the segments [0, capture_start
), [capture_start
, capture_end
) and [capture_end
, token_view.length()
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps the following is the most direct (and most log-surgeon independent) implementation. Manipulate token and call to_string_view
as needed:
- Store
token_end_pos = token.m_end_pos
- For the prefix static-text: Set
token.m_end_pos= capture_start
- For the variable itself: Set
token.m_start_pos= capture_start
andtoken.m_end_pos= capture_end
- For the suffix static-text: Set
token.m_start_pos= capture_end
andtoken.m_end_pos= token_end_pos
This way CLP does not care about how the tokens wrap around is handled internally. Also nice for future-proofing in case LS changes its behavior.
We should also unit-test this if possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also: for clarity, maybe format the code such that the prefix is a block with a newline after it, the variable is a block with a newline after it, and the suffix is a block with a newline after it.
@@ -0,0 +1 @@ | |||
multicapture:text(?<group0>var0)text(?<group1>var1)text |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically lexers can exist with the schema as is, so I'm not entirely sure about this comment. But, if any LogParser
object is created with this schema it will error as it lacks delimiters. So this schema file is malformed according to Log Surgeon.
@@ -0,0 +1 @@ | |||
capture:text(?<group>var)text |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
components/core/src/clp/streaming_archive/writer/Archive.cpp (2)
362-366
: Guard against empty buffer when computing end_pos.pos() − 1 will underflow if the buffer is empty. Unlikely, but add a defensive check/assert.
Would you add a precondition (e.g., assert(log_output_buffer->pos() > 0)) before using pos() − 1?
391-405
: Fix: track var_ids for int/float dictionary fallbacks and follow negation style.When integer/float cannot be encoded, you add a dictionary entry but don’t push the var ID into m_var_ids, which breaks segment indexing. Also, prefer “false == …” per repo style.
- encoded_variable_t encoded_var{}; - if (!EncodedVariableInterpreter::convert_string_to_representable_integer_var( + encoded_variable_t encoded_var{}; + if (false == EncodedVariableInterpreter::convert_string_to_representable_integer_var( token.to_string(), encoded_var )) { variable_dictionary_id_t id{}; m_var_dict.add_entry(token.to_string(), id); + m_var_ids.push_back(id); encoded_var = EncodedVariableInterpreter::encode_var_dict_id(id); m_logtype_dict_entry.add_dictionary_var(); } else { m_logtype_dict_entry.add_int_var(); } m_encoded_vars.push_back(encoded_var); @@ - encoded_variable_t encoded_var{}; - if (!EncodedVariableInterpreter::convert_string_to_representable_float_var( + encoded_variable_t encoded_var{}; + if (false == EncodedVariableInterpreter::convert_string_to_representable_float_var( token.to_string(), encoded_var )) { variable_dictionary_id_t id{}; m_var_dict.add_entry(token.to_string(), id); + m_var_ids.push_back(id); encoded_var = EncodedVariableInterpreter::encode_var_dict_id(id); m_logtype_dict_entry.add_dictionary_var(); } else { m_logtype_dict_entry.add_float_var(); } m_encoded_vars.push_back(encoded_var);Also applies to: 407-421
♻️ Duplicate comments (2)
components/core/src/clp/streaming_archive/writer/Archive.cpp (2)
431-440
: Also treat empty capture-id vectors as “no capture.”has_value() can hold an empty vector; later at(0) would throw. Handle empty as the no‑capture path.
- auto capture_ids{lexer.get_capture_ids_from_rule_id(token_type)}; - if (false == capture_ids.has_value()) { + auto capture_ids{lexer.get_capture_ids_from_rule_id(token_type)}; + if (false == capture_ids.has_value() || capture_ids->empty()) { variable_dictionary_id_t id{}; m_var_dict.add_entry(token.to_string(), id); m_var_ids.push_back(id); m_encoded_vars.push_back(EncodedVariableInterpreter::encode_var_dict_id(id)); m_logtype_dict_entry.add_dictionary_var(); break; }
442-454
: Validate register positions before using front()/back().get_reversed_reg_positions(...) may return empty; front()/back() would be UB. Throw a clear error if empty.
- auto const [start_reg_id, end_reg_id]{register_ids.value()}; - auto const capture_start{token.get_reversed_reg_positions(start_reg_id).back()}; - auto const capture_end{token.get_reversed_reg_positions(end_reg_id).front()}; + auto const [start_reg_id, end_reg_id]{register_ids.value()}; + auto const& start_positions = token.get_reversed_reg_positions(start_reg_id); + auto const& end_positions = token.get_reversed_reg_positions(end_reg_id); + if (start_positions.empty() || end_positions.empty()) { + throw(std::runtime_error( + "Empty register positions for variable's capture group. Full token: " + + token.to_string() + )); + } + auto const capture_start{start_positions.back()}; + auto const capture_end{end_positions.front()};
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
components/core/src/clp/Utils.cpp
(2 hunks)components/core/src/clp/streaming_archive/writer/Archive.cpp
(7 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer
false == <expression>
rather than!<expression>
.
Files:
components/core/src/clp/Utils.cpp
components/core/src/clp/streaming_archive/writer/Archive.cpp
🧠 Learnings (1)
📚 Learning: 2024-10-14T03:42:10.355Z
Learnt from: LinZhihao-723
PR: y-scope/clp#558
File: components/core/tests/test-ffi_KeyValuePairLogEvent.cpp:14-14
Timestamp: 2024-10-14T03:42:10.355Z
Learning: In the file `components/core/tests/test-ffi_KeyValuePairLogEvent.cpp`, including `<json/single_include/nlohmann/json.hpp>` is consistent with the project's coding standards.
Applied to files:
components/core/src/clp/streaming_archive/writer/Archive.cpp
🧬 Code graph analysis (1)
components/core/src/clp/streaming_archive/writer/Archive.cpp (3)
components/core/src/clp/streaming_archive/writer/Archive.hpp (2)
timestamp
(146-146)logtype_id
(229-232)components/core/src/clp/streaming_archive/writer/File.hpp (1)
timestamp
(89-95)components/core/src/clp/EncodedVariableInterpreter.cpp (6)
convert_string_to_representable_integer_var
(24-61)convert_string_to_representable_integer_var
(24-27)convert_string_to_representable_float_var
(63-142)convert_string_to_representable_float_var
(63-66)encode_var_dict_id
(199-201)encode_var_dict_id
(199-199)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: musllinux_1_2-x86_64-static-linked-bins
- GitHub Check: antlr-code-committed (macos-15)
🔇 Additional comments (3)
components/core/src/clp/Utils.cpp (1)
11-11
: LGTM: include is appropriate.Needed for the new error message construction.
components/core/src/clp/streaming_archive/writer/Archive.cpp (2)
5-10
: LGTM: header additions are appropriate.These headers match the new usage and remove transitive‑include reliance.
Also applies to: 15-15
479-491
: LGTM: correct style and write path.Style matches guideline (false == …) and the write/update path is consistent.
@@ -369,7 +368,7 @@ void Archive::write_msg_using_schema(LogEventView const& log_view) { | |||
} | |||
for (uint32_t i = 1; i < log_output_buffer->pos(); i++) { | |||
log_surgeon::Token& token = log_output_buffer->get_mutable_token(i); | |||
int token_type = token.m_type_ids_ptr->at(0); | |||
auto const token_type{token.m_type_ids_ptr->at(0)}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick
Sanity‑check token type vector.
at(0) assumes m_type_ids_ptr is non‑null and non‑empty; add a brief guard/diagnostic.
Would you assert that token.m_type_ids_ptr != nullptr and token.m_type_ids_ptr->empty() == false before use?
🤖 Prompt for AI Agents
In components/core/src/clp/streaming_archive/writer/Archive.cpp around line 371,
the code calls token.m_type_ids_ptr->at(0) without checking that m_type_ids_ptr
is non-null and that the vector is non-empty; add a guard/diagnostic by
asserting token.m_type_ids_ptr != nullptr and !token.m_type_ids_ptr->empty() (or
using an explicit if that logs/error-handles and returns) immediately before
this line so callers get a clear diagnostic instead of UB, and include a brief
message describing the token/stream context in the assertion/log.
// Currently, we only support at most a single capture group in each variable. If a capture | ||
// group is present its match will be treated as the variable rather than the full match. | ||
auto const num_captures = rule->m_regex_ptr->get_subtree_positive_captures().size(); | ||
if (1 < num_captures) { | ||
throw std::runtime_error( | ||
schema_file_path + ":" + std::to_string(rule->m_line_num + 1) | ||
+ ": error: the schema rule '" + rule->m_name | ||
+ "' has a regex pattern containing > 1 capture groups (found " | ||
+ std::to_string(num_captures) + ").\n" | ||
); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
🧩 Analysis chain
Don’t enforce single‑capture on the 'timestamp' rule (unless intended) + tweak wording.
The validation block currently runs before the 'timestamp' early‑continue, changing behaviour for timestamp rules. If timestamps are exempt (as before), move the check below the 'timestamp' guard. Also, fix the error text (“more than one capture group”).
- // Currently, we only support at most a single capture group in each variable. If a capture
- // group is present its match will be treated as the variable rather than the full match.
- auto const num_captures = rule->m_regex_ptr->get_subtree_positive_captures().size();
- if (1 < num_captures) {
- throw std::runtime_error(
- schema_file_path + ":" + std::to_string(rule->m_line_num + 1)
- + ": error: the schema rule '" + rule->m_name
- + "' has a regex pattern containing > 1 capture groups (found "
- + std::to_string(num_captures) + ").\n"
- );
- }
-
if ("timestamp" == rule->m_name) {
continue;
}
+
+ // Currently, we only support at most a single capture group in each variable. If a capture
+ // group is present its match will be treated as the variable rather than the full match.
+ auto const num_captures = rule->m_regex_ptr->get_subtree_positive_captures().size();
+ if (1 < num_captures) {
+ throw std::runtime_error(
+ schema_file_path + ":" + std::to_string(rule->m_line_num + 1)
+ + ": error: the schema rule '" + rule->m_name
+ + "' has a regex pattern containing more than one capture group (found "
+ + std::to_string(num_captures) + ").\n"
+ );
+ }
Also applies to: 204-207
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "Searching repository for occurrences of get_subtree_positive_captures..."
rg -n -C3 -S 'get_subtree_positive_captures' || true
echo
echo "Searching repository for occurrences of the timestamp rule guard (\"timestamp\" == rule->m_name)..."
rg -n -C5 -S '"timestamp"\s*==\s*rule->m_name' || true
Length of output: 1696
Move single-capture validation below the 'timestamp' guard and fix error wording.
The capture-count check currently runs before the 'timestamp' early-continue, changing behaviour for timestamp rules — move it after the guard and change the error text to "more than one capture group".
File: components/core/src/clp/Utils.cpp
- // Currently, we only support at most a single capture group in each variable. If a capture
- // group is present its match will be treated as the variable rather than the full match.
- auto const num_captures = rule->m_regex_ptr->get_subtree_positive_captures().size();
- if (1 < num_captures) {
- throw std::runtime_error(
- schema_file_path + ":" + std::to_string(rule->m_line_num + 1)
- + ": error: the schema rule '" + rule->m_name
- + "' has a regex pattern containing > 1 capture groups (found "
- + std::to_string(num_captures) + ").\n"
- );
- }
-
if ("timestamp" == rule->m_name) {
continue;
}
+
+ // Currently, we only support at most a single capture group in each variable. If a capture
+ // group is present its match will be treated as the variable rather than the full match.
+ auto const num_captures = rule->m_regex_ptr->get_subtree_positive_captures().size();
+ if (1 < num_captures) {
+ throw std::runtime_error(
+ schema_file_path + ":" + std::to_string(rule->m_line_num + 1)
+ + ": error: the schema rule '" + rule->m_name
+ + "' has a regex pattern containing more than one capture group (found "
+ + std::to_string(num_captures) + ").\n"
+ );
+ }
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
// Currently, we only support at most a single capture group in each variable. If a capture | |
// group is present its match will be treated as the variable rather than the full match. | |
auto const num_captures = rule->m_regex_ptr->get_subtree_positive_captures().size(); | |
if (1 < num_captures) { | |
throw std::runtime_error( | |
schema_file_path + ":" + std::to_string(rule->m_line_num + 1) | |
+ ": error: the schema rule '" + rule->m_name | |
+ "' has a regex pattern containing > 1 capture groups (found " | |
+ std::to_string(num_captures) + ").\n" | |
); | |
} | |
if ("timestamp" == rule->m_name) { | |
continue; | |
} | |
// Currently, we only support at most a single capture group in each variable. If a capture | |
// group is present its match will be treated as the variable rather than the full match. | |
auto const num_captures = rule->m_regex_ptr->get_subtree_positive_captures().size(); | |
if (1 < num_captures) { | |
throw std::runtime_error( | |
schema_file_path + ":" + std::to_string(rule->m_line_num + 1) | |
": error: the schema rule '" + rule->m_name | |
"' has a regex pattern containing more than one capture group (found " | |
std::to_string(num_captures) + ").\n" | |
); | |
} |
🤖 Prompt for AI Agents
In components/core/src/clp/Utils.cpp around lines 192–203, the check that throws
when a regex has more than one capture group must be moved to after the
early-continue/guard that handles 'timestamp' rules so timestamp rules are not
affected; relocate the entire num_captures check so it runs only when the
timestamp branch has been bypassed, and change the thrown error text to read
"... has a regex pattern with more than one capture group (found X)." keeping
the same numeric check and including schema_file_path, line number and rule name
as before.
Description
Previously, when using log surgeon for parsing the full match of a schema rule's regex pattern would be stored as a variable in CLP. This created differences from the heuristic parser's behaviour for certain cases.
For example, the heuristic's "equals" rule can be represented with the regex pattern
.*=(?<var>.*[a-zA-Z0-9].*)
. The heuristic parser will only store thevar
capture group as a variable (storing the prefix.*=
as static text). When using log surgeon without capture groups this behaviour was not possible as we would store the full match (including the prefix.*=
) as a variable.This PR allows schema rules to contain up to 1 capture group. If a capture group is present only the capture's match will be stored as a variable and anything surrounding it will be stored as static text. In the case where the capture is repeated (e.g.
text(?<var>variable)+text)
) all repetitions will be stored together as a single variable.Checklist
breaking change.
Validation performed
Added unit tests for schema creation with single or multiple captures.
TODO Generated archives and matched their contents.
Summary by CodeRabbit
New Features
Bug Fixes
Tests