Merged
62 commits
88ccce7
Updated log-surgeon to use simulation branch.
SharafMohamed Mar 17, 2025
8aa0350
Update schema to have all timestamps and capture.
SharafMohamed Mar 17, 2025
b3f217b
Testing new log surgeon.
SharafMohamed Mar 24, 2025
ea09df0
Testing log surgeon x2.
SharafMohamed Mar 24, 2025
551447f
Add timers.
SharafMohamed Mar 31, 2025
ee3f682
Switch timers.
SharafMohamed Apr 7, 2025
042a4f6
Make make dictionaries compilable.
SharafMohamed Apr 23, 2025
c311fb8
Merge branch 'main' into new-log-surgeon
SharafMohamed Jun 25, 2025
4b53427
Lint and try to reduce the number of jobs during deps:core to avoid m…
SharafMohamed Jun 25, 2025
3d27fdf
Remove JOBS from task, the correct way is to set the parallel tasks v…
SharafMohamed Jun 25, 2025
018eccb
Remove components/core/submodules/log-surgeon.
davidlion Jul 10, 2025
5971025
Unset PROF_ENABLED.
davidlion Jul 10, 2025
b932b6a
Get clp building with log surgeon locally.
davidlion Jul 10, 2025
8637db5
Merge remote-tracking branch 'upstream/main' into pr-1033
davidlion Jul 10, 2025
50cfd39
Unit tests build, but fail with possible logical errors.
davidlion Jul 10, 2025
0e17e3f
Merge branch 'new-log-surgeon' of https://github.com/SharafMohamed/cl…
SharafMohamed Jul 14, 2025
2083aa4
Remove duplicate REQUIRE check.
SharafMohamed Jul 14, 2025
e778979
Fix unit-test bugs.
SharafMohamed Jul 16, 2025
da18646
Fix unit-test typo and spacing.
SharafMohamed Jul 16, 2025
8b4b24c
Merge branch 'main' into new-log-surgeon
SharafMohamed Jul 16, 2025
6dcadca
Update log-surgeon to newest version.
SharafMohamed Jul 16, 2025
0844041
Remove profiling changes.
SharafMohamed Jul 16, 2025
3c20be7
Test disabling macos-14.
SharafMohamed Jul 16, 2025
e8d5fbb
Test disabling macos-13.
SharafMohamed Jul 16, 2025
c8edd58
Readd macos-13 and macos-14 to the CI.
SharafMohamed Jul 16, 2025
c0f86e9
Bump log-surgeon version.
davidlion Jul 16, 2025
f631545
Add spacing to schemas.txt.
davidlion Jul 16, 2025
bb23a83
Drop unused parameter from load_lexer_from_file.
davidlion Jul 16, 2025
ea777c3
Remove a missed benchmark change.
SharafMohamed Jul 16, 2025
1d63a20
Format fix.
davidlion Jul 16, 2025
dcf3734
Merge commit 'refs/pull/1033/head' of https://github.com/y-scope/clp …
davidlion Jul 16, 2025
9552272
Remove reverse lexer; Rename forward lexer to just lexer.
SharafMohamed Jul 16, 2025
5cc0b14
Lint.
SharafMohamed Jul 16, 2025
23de2c3
Remove TODO.
SharafMohamed Jul 16, 2025
1fa8b79
Merge branch 'main' into new-log-surgeon
SharafMohamed Jul 16, 2025
bd2ae8e
Add heading comments to schemas.
davidlion Jul 17, 2025
b6c02bf
Merge commit 'refs/pull/1033/head' of https://github.com/y-scope/clp …
davidlion Jul 17, 2025
d43c4c7
Update schema timestamps to escape '.'.
SharafMohamed Jul 17, 2025
327d41e
Merge branch 'main' into new-log-surgeon
SharafMohamed Jul 17, 2025
d557467
Update regex in schema for leading spaces.
SharafMohamed Jul 17, 2025
a41ee92
Add issue link in schemas.txt.
davidlion Jul 19, 2025
f3f3c59
Merge commit 'refs/pull/1033/head' of https://github.com/y-scope/clp …
davidlion Jul 19, 2025
fbf6195
Make lexer safer in clo; Remove dead declaration in clg.
davidlion Jul 21, 2025
d1756ef
Update components/core/config/schemas.txt
davidlion Jul 21, 2025
8d12a2b
Update components/core/config/schemas.txt
davidlion Jul 21, 2025
46bc35f
Update components/core/config/schemas.txt
davidlion Jul 21, 2025
063297d
Delete redundant timestamps.
davidlion Jul 21, 2025
75dab67
Allow dates with space + single digit.
davidlion Jul 21, 2025
716e871
Tweak schemas.
davidlion Jul 21, 2025
70db06f
Consolidate timestamps into fewer regex supporting many more combinat…
davidlion Jul 21, 2025
64c0216
Add missing date case caught by rabbit.
davidlion Jul 21, 2025
4a05836
Avoid recreating the lexer object.
SharafMohamed Jul 21, 2025
4a730a4
Use Lexer object directly instead of a unique_ptr.
SharafMohamed Jul 21, 2025
d654f46
Throw error if parsed timestamp can't be encoded.
SharafMohamed Jul 21, 2025
a02efee
Reorganize schema slightly.
davidlion Jul 21, 2025
82f9825
Update error message; Reordered if statement; Lint.
SharafMohamed Jul 21, 2025
dd5a8a4
Merge branch 'new-log-surgeon' of https://github.com/SharafMohamed/cl…
SharafMohamed Jul 21, 2025
922053b
Add relative timestamp regex.
SharafMohamed Jul 21, 2025
88cea99
Merge branch 'main' into new-log-surgeon
davidlion Jul 23, 2025
e0169f1
Merge branch 'main' into new-log-surgeon
davidlion Jul 25, 2025
09889d6
Revert schemas.txt.
davidlion Jul 25, 2025
b37b7c1
Merge commit 'refs/pull/1033/head' of https://github.com/y-scope/clp …
davidlion Jul 25, 2025
60 changes: 18 additions & 42 deletions components/core/src/clp/Grep.cpp
@@ -502,8 +502,7 @@ std::optional<Query> Grep::process_raw_query(
 epochtime_t search_begin_ts,
 epochtime_t search_end_ts,
 bool ignore_case,
-log_surgeon::lexers::ByteLexer& forward_lexer,
-log_surgeon::lexers::ByteLexer& reverse_lexer,
+log_surgeon::lexers::ByteLexer& lexer,
 bool use_heuristic
 ) {
 // Add prefix and suffix '*' to make the search a sub-string match
@@ -546,8 +545,7 @@ std::optional<Query> Grep::process_raw_query(
 begin_pos,
 end_pos,
 is_var,
-forward_lexer,
-reverse_lexer
+lexer
 ))
 {
 query_tokens.emplace_back(search_string_for_sub_queries, begin_pos, end_pos, is_var);
@@ -752,8 +750,7 @@ bool Grep::get_bounds_of_next_potential_var(
 size_t& begin_pos,
 size_t& end_pos,
 bool& is_var,
-log_surgeon::lexers::ByteLexer& forward_lexer,
-log_surgeon::lexers::ByteLexer& reverse_lexer
+log_surgeon::lexers::ByteLexer& lexer
 ) {
 size_t const value_length = value.length();
 if (end_pos >= value_length) {
@@ -774,7 +771,7 @@ bool Grep::get_bounds_of_next_potential_var(
 if (is_escaped) {
 is_escaped = false;

-if (false == forward_lexer.is_delimiter(c)) {
+if (false == lexer.is_delimiter(c)) {
 // Found escaped non-delimiter, so reverse the index to retain the escape
 // character
 --begin_pos;
@@ -788,7 +785,7 @@ bool Grep::get_bounds_of_next_potential_var(
 contains_wildcard = true;
 break;
 }
-if (false == forward_lexer.is_delimiter(c)) {
+if (false == lexer.is_delimiter(c)) {
 break;
 }
 }
@@ -803,7 +800,7 @@ bool Grep::get_bounds_of_next_potential_var(
 if (is_escaped) {
 is_escaped = false;

-if (forward_lexer.is_delimiter(c)) {
+if (lexer.is_delimiter(c)) {
 // Found escaped delimiter, so reverse the index to retain the escape character
 --end_pos;
 break;
@@ -814,7 +811,7 @@ bool Grep::get_bounds_of_next_potential_var(
 } else {
 if (is_wildcard(c)) {
 contains_wildcard = true;
-} else if (forward_lexer.is_delimiter(c)) {
+} else if (lexer.is_delimiter(c)) {
 // Found delimiter that's not also a wildcard
 break;
 }
@@ -832,7 +829,7 @@ bool Grep::get_bounds_of_next_potential_var(
 }
 }
 SearchToken search_token;
-if (has_wildcard_in_middle || (has_prefix_wildcard && has_suffix_wildcard)) {
+if (has_wildcard_in_middle || has_prefix_wildcard) {
 // DO NOTHING
 } else {
 StringReader string_reader;
@@ -844,43 +841,22 @@ bool Grep::get_bounds_of_next_potential_var(
 // string, should be improved when adding a SearchParser to log_surgeon
 string_reader.open(value.substr(begin_pos, end_pos - begin_pos - 1));
 parser_input_buffer.read_if_safe(reader_wrapper);
-forward_lexer.reset();
-forward_lexer.scan_with_wildcard(
-parser_input_buffer,
-value[end_pos - 1],
-search_token
-);
-} else if (has_prefix_wildcard) { // *text
-std::string value_reverse
-= value.substr(begin_pos + 1, end_pos - begin_pos - 1);
-std::reverse(value_reverse.begin(), value_reverse.end());
-string_reader.open(value_reverse);
-parser_input_buffer.read_if_safe(reader_wrapper);
-reverse_lexer.reset();
-reverse_lexer.scan_with_wildcard(
-parser_input_buffer,
-value[begin_pos],
-search_token
-);
+lexer.reset();
+lexer.scan_with_wildcard(parser_input_buffer, value[end_pos - 1], search_token);
 } else { // no wildcards
 string_reader.open(value.substr(begin_pos, end_pos - begin_pos));
 parser_input_buffer.read_if_safe(reader_wrapper);
-forward_lexer.reset();
-forward_lexer.scan(parser_input_buffer, search_token);
+lexer.reset();
+auto [err, token] = lexer.scan(parser_input_buffer);
+if (log_surgeon::ErrorCode::Success != err) {
+return false;
+}
+search_token = SearchToken{token.value()};
 search_token.m_type_ids_set.insert(search_token.m_type_ids_ptr->at(0));
 }
-// TODO: use a set so its faster
-// auto const& set = search_token.m_type_ids_set;
-// if (set.find(static_cast<int>(log_surgeon::SymbolID::TokenUncaughtStringID))
-// == set.end()
-// && set.find(static_cast<int>(log_surgeon::SymbolID::TokenEndID))
-// == set.end())
-// {
-// is_var = true;
-// }
 auto const& type = search_token.m_type_ids_ptr->at(0);
-if (type != static_cast<int>(log_surgeon::SymbolID::TokenUncaughtStringID)
-&& type != static_cast<int>(log_surgeon::SymbolID::TokenEndID))
+if (type != static_cast<int>(log_surgeon::SymbolId::TokenUncaughtString)
+&& type != static_cast<int>(log_surgeon::SymbolId::TokenEnd))
 {
 is_var = true;
 }
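With the reverse lexer removed, the Grep.cpp hunks above leave three outcomes for a candidate token: it is not lexed at all (a wildcard in the middle or at the front), it is lexed up to its trailing wildcard, or it is lexed whole. The sketch below restates that decision in isolation. The names `LexAction`, `is_wildcard_char`, and `choose_lex_action` are hypothetical and not part of CLP or log-surgeon; it also assumes `*` and `?` are the only wildcards and ignores escaping, which the real function handles.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical names for illustration only; not part of CLP or log-surgeon.
enum class LexAction {
    Skip,              // the "DO NOTHING" branch: token is not lexed
    ScanWithWildcard,  // "text*": lex the text before the trailing wildcard
    Scan               // no wildcards: lex the whole token
};

// Assumption: '*' and '?' are the wildcard characters; escapes are ignored.
inline bool is_wildcard_char(char c) {
    return '*' == c || '?' == c;
}

inline LexAction choose_lex_action(std::string_view token) {
    assert(false == token.empty());  // precondition of this sketch

    bool const has_prefix_wildcard = is_wildcard_char(token.front());
    bool const has_suffix_wildcard = is_wildcard_char(token.back());
    bool has_wildcard_in_middle = false;
    for (std::size_t i = 1; i + 1 < token.size(); ++i) {
        if (is_wildcard_char(token[i])) {
            has_wildcard_in_middle = true;
            break;
        }
    }

    // Mirrors the new condition: any prefix wildcard now skips lexing,
    // where the old code handed "*text" to the reverse lexer.
    if (has_wildcard_in_middle || has_prefix_wildcard) {
        return LexAction::Skip;
    }
    if (has_suffix_wildcard) {
        return LexAction::ScanWithWildcard;
    }
    return LexAction::Scan;
}
```

Under these assumptions `*abc` is skipped rather than reverse-lexed, so in the real function `is_var` keeps whatever value it had before this block for such tokens, just as it already did for `a*c` and `*abc*`.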
12 changes: 4 additions & 8 deletions components/core/src/clp/Grep.hpp
@@ -37,8 +37,7 @@ class Grep {
 * @param search_begin_ts
 * @param search_end_ts
 * @param ignore_case
-* @param forward_lexer DFA for determining if input is in the schema
-* @param reverse_lexer DFA for determining if reverse of input is in the schema
+* @param lexer DFA for determining if input is in the schema
 * @param use_heuristic
 * @return Query if it may match a message, std::nullopt otherwise
 */
@@ -48,8 +47,7 @@ class Grep {
 epochtime_t search_begin_ts,
 epochtime_t search_end_ts,
 bool ignore_case,
-log_surgeon::lexers::ByteLexer& forward_lexer,
-log_surgeon::lexers::ByteLexer& reverse_lexer,
+log_surgeon::lexers::ByteLexer& lexer,
 bool use_heuristic
 );

@@ -76,17 +74,15 @@ class Grep {
 * @param begin_pos Begin position of last token, changes to begin position of next token
 * @param end_pos End position of last token, changes to end position of next token
 * @param is_var Whether the token is definitely a variable
-* @param forward_lexer DFA for determining if input is in the schema
-* @param reverse_lexer DFA for determining if reverse of input is in the schema
+* @param lexer DFA for determining if input is in the schema
 * @return true if another potential variable was found, false otherwise
 */
 static bool get_bounds_of_next_potential_var(
 std::string const& value,
 size_t& begin_pos,
 size_t& end_pos,
 bool& is_var,
-log_surgeon::lexers::ByteLexer& forward_lexer,
-log_surgeon::lexers::ByteLexer& reverse_lexer
+log_surgeon::lexers::ByteLexer& lexer
 );
 /**
 * Marks which sub-queries in each query are relevant to the given file
51 changes: 22 additions & 29 deletions components/core/src/clp/Utils.cpp
@@ -11,6 +11,7 @@

 #include <boost/algorithm/string.hpp>
 #include <boost/lexical_cast.hpp>
+#include <log_surgeon/Constants.hpp>
 #include <log_surgeon/SchemaParser.hpp>
 #include <spdlog/spdlog.h>
 #include <string_utils/string_utils.hpp>
@@ -120,12 +121,8 @@ ErrorCode read_list_of_paths(string const& list_path, vector<string>& paths) {
 // TODO: duplicates code in log_surgeon/parser.tpp, should implement a
 // SearchParser in log_surgeon instead and use it here. Specifically, initialization of
 // lexer.m_symbol_id, contains_delimiter error, and add_rule logic.
-void load_lexer_from_file(
-std::string const& schema_file_path,
-bool reverse,
-log_surgeon::lexers::ByteLexer& lexer
-) {
-log_surgeon::SchemaParser sp;
+void
+load_lexer_from_file(std::string const& schema_file_path, log_surgeon::lexers::ByteLexer& lexer) {
 std::unique_ptr<log_surgeon::SchemaAST> schema_ast
 = log_surgeon::SchemaParser::try_schema_file(schema_file_path);
 if (!lexer.m_symbol_id.empty()) {
@@ -134,52 +131,52 @@ void load_lexer_from_file(

 // cTokenEnd and cTokenUncaughtString never need to be added as a rule to the lexer as they are
 // not parsed
-lexer.m_symbol_id[log_surgeon::cTokenEnd] = static_cast<int>(log_surgeon::SymbolID::TokenEndID);
+lexer.m_symbol_id[log_surgeon::cTokenEnd] = static_cast<int>(log_surgeon::SymbolId::TokenEnd);
 lexer.m_symbol_id[log_surgeon::cTokenUncaughtString]
-= static_cast<int>(log_surgeon::SymbolID::TokenUncaughtStringID);
+= static_cast<int>(log_surgeon::SymbolId::TokenUncaughtString);
 // cTokenInt, cTokenFloat, cTokenFirstTimestamp, and cTokenNewlineTimestamp each have unknown
 // rule(s) until specified by the user so can't be explicitly added and are done by looping over
 // schema_vars (user schema)
-lexer.m_symbol_id[log_surgeon::cTokenInt] = static_cast<int>(log_surgeon::SymbolID::TokenIntId);
+lexer.m_symbol_id[log_surgeon::cTokenInt] = static_cast<int>(log_surgeon::SymbolId::TokenInt);
 lexer.m_symbol_id[log_surgeon::cTokenFloat]
-= static_cast<int>(log_surgeon::SymbolID::TokenFloatId);
+= static_cast<int>(log_surgeon::SymbolId::TokenFloat);
 lexer.m_symbol_id[log_surgeon::cTokenFirstTimestamp]
-= static_cast<int>(log_surgeon::SymbolID::TokenFirstTimestampId);
+= static_cast<int>(log_surgeon::SymbolId::TokenFirstTimestamp);
 lexer.m_symbol_id[log_surgeon::cTokenNewlineTimestamp]
-= static_cast<int>(log_surgeon::SymbolID::TokenNewlineTimestampId);
+= static_cast<int>(log_surgeon::SymbolId::TokenNewlineTimestamp);
 // cTokenNewline is not added in schema_vars and can be explicitly added as '\n' to catch the
 // end of non-timestamped log messages
 lexer.m_symbol_id[log_surgeon::cTokenNewline]
-= static_cast<int>(log_surgeon::SymbolID::TokenNewlineId);
+= static_cast<int>(log_surgeon::SymbolId::TokenNewline);

-lexer.m_id_symbol[static_cast<int>(log_surgeon::SymbolID::TokenEndID)] = log_surgeon::cTokenEnd;
-lexer.m_id_symbol[static_cast<int>(log_surgeon::SymbolID::TokenUncaughtStringID)]
+lexer.m_id_symbol[static_cast<int>(log_surgeon::SymbolId::TokenEnd)] = log_surgeon::cTokenEnd;
+lexer.m_id_symbol[static_cast<int>(log_surgeon::SymbolId::TokenUncaughtString)]
 = log_surgeon::cTokenUncaughtString;
-lexer.m_id_symbol[static_cast<int>(log_surgeon::SymbolID::TokenIntId)] = log_surgeon::cTokenInt;
-lexer.m_id_symbol[static_cast<int>(log_surgeon::SymbolID::TokenFloatId)]
+lexer.m_id_symbol[static_cast<int>(log_surgeon::SymbolId::TokenInt)] = log_surgeon::cTokenInt;
+lexer.m_id_symbol[static_cast<int>(log_surgeon::SymbolId::TokenFloat)]
 = log_surgeon::cTokenFloat;
-lexer.m_id_symbol[static_cast<int>(log_surgeon::SymbolID::TokenFirstTimestampId)]
+lexer.m_id_symbol[static_cast<int>(log_surgeon::SymbolId::TokenFirstTimestamp)]
 = log_surgeon::cTokenFirstTimestamp;
-lexer.m_id_symbol[static_cast<int>(log_surgeon::SymbolID::TokenNewlineTimestampId)]
+lexer.m_id_symbol[static_cast<int>(log_surgeon::SymbolId::TokenNewlineTimestamp)]
 = log_surgeon::cTokenNewlineTimestamp;
-lexer.m_id_symbol[static_cast<int>(log_surgeon::SymbolID::TokenNewlineId)]
+lexer.m_id_symbol[static_cast<int>(log_surgeon::SymbolId::TokenNewline)]
 = log_surgeon::cTokenNewline;

 lexer.add_rule(
 lexer.m_symbol_id["newLine"],
 std::move(
 std::make_unique<log_surgeon::finite_automata::RegexASTLiteral<
-log_surgeon::finite_automata::RegexNFAByteState>>(
+log_surgeon::finite_automata::ByteNfaState>>(
 log_surgeon::finite_automata::RegexASTLiteral<
-log_surgeon::finite_automata::RegexNFAByteState>('\n')
+log_surgeon::finite_automata::ByteNfaState>('\n')
 )
 )
 );

 for (auto const& delimiters_ast : schema_ast->m_delimiters) {
 auto* delimiters_ptr = dynamic_cast<log_surgeon::DelimiterStringAST*>(delimiters_ast.get());
 if (delimiters_ptr != nullptr) {
-lexer.add_delimiters(delimiters_ptr->m_delimiters);
+lexer.set_delimiters(delimiters_ptr->m_delimiters);
 }
 }
 vector<uint32_t> delimiters;
@@ -203,7 +200,7 @@ void load_lexer_from_file(
 // transform '.' from any-character into any non-delimiter character
 rule->m_regex_ptr->remove_delimiters_from_wildcard(delimiters);

-bool is_possible_input[log_surgeon::cUnicodeMax] = {false};
+std::array<bool, log_surgeon::cSizeOfUnicode> is_possible_input{};
 rule->m_regex_ptr->set_possible_inputs_to_true(is_possible_input);
 bool contains_delimiter = false;
 uint32_t delimiter_name;
@@ -242,10 +239,6 @@ void load_lexer_from_file(
 }
 lexer.add_rule(lexer.m_symbol_id[rule->m_name], std::move(rule->m_regex_ptr));
 }
-if (reverse) {
-lexer.generate_reverse();
-} else {
-lexer.generate();
-}
+lexer.generate();
 }
 } // namespace clp
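The `is_possible_input` change above swaps a C array initialized with `= {false}` for a value-initialized `std::array`, which feeds the `contains_delimiter` check that rejects schema rules able to match a delimiter. The sketch below illustrates that check on its own. `cNumByteValues` and `find_delimiter_in_rule` are hypothetical names, and 256 is an assumed stand-in for `log_surgeon::cSizeOfUnicode`, whose actual value is not shown in this diff.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for log_surgeon::cSizeOfUnicode (value assumed).
constexpr std::size_t cNumByteValues = 256;

// Returns the first delimiter that the rule could consume, if any.
// is_possible_input[c] is true when the rule's regex can match character c.
inline std::optional<std::uint32_t> find_delimiter_in_rule(
        std::array<bool, cNumByteValues> const& is_possible_input,
        std::vector<std::uint32_t> const& delimiters
) {
    for (auto const delimiter : delimiters) {
        assert(delimiter < cNumByteValues);  // precondition of this sketch
        if (is_possible_input[delimiter]) {
            return delimiter;
        }
    }
    return std::nullopt;
}
```

A `std::array<bool, N> flags{};` declaration value-initializes every element to `false`, so it starts in the same state as the old `bool flags[N] = {false};` while also carrying its size when passed by reference.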
6 changes: 2 additions & 4 deletions components/core/src/clp/Utils.hpp
@@ -47,13 +47,11 @@ ErrorCode read_list_of_paths(std::string const& list_path, std::vector<std::stri
 /**
 * Loads a lexer from a file
 * @param schema_file_path
-* @param done
-* @param forward_lexer_ptr
+* @param lexer_ptr
 */
 void load_lexer_from_file(
 std::string const& schema_file_path,
-bool done,
-log_surgeon::lexers::ByteLexer& forward_lexer_ptr
+log_surgeon::lexers::ByteLexer& lexer_ptr
 );
 } // namespace clp
