Add dynamic programming approach for schema search. #448

Draft
wants to merge 283 commits into
base: main
Changes from 250 commits
889f2f7
updated log_surgeon submodule
SharafMohamed Aug 7, 2023
8e6594f
Fixed naming for StringReader and FileReader shared_ptrs
SharafMohamed Aug 9, 2023
d4f28ce
Made shared_ptr to Reader a reference in ReaderInterfaceWrapper
SharafMohamed Aug 9, 2023
96e5df2
Fixed ReaderInterfaceWrapper to correctly set Reader::read that was p…
SharafMohamed Aug 16, 2023
fee6fd4
Removed unneeded pos_processed_string var in get_bounds_of_next_poten…
SharafMohamed Aug 16, 2023
ed23d9e
Removed post_processed_search_string in Grep.cpp
SharafMohamed Aug 16, 2023
e6315ec
Updated to match the allowance of multiple delimiters lines in log_su…
SharafMohamed Aug 25, 2023
66cdf5c
Updated log-surgeon to the newest commit.
SharafMohamed Sep 11, 2023
23f7b61
Updated example log to have floats
SharafMohamed Sep 11, 2023
0861ce3
Merge remote-tracking branch 'upstream/main' into main
SharafMohamed Sep 13, 2023
a271e0c
Fixed double to float
SharafMohamed Sep 17, 2023
7386f5a
Fixed bug where first char of first token would become static text ev…
SharafMohamed Sep 17, 2023
f21b77f
Merge remote-tracking branch 'upstream/main' into main
SharafMohamed Sep 17, 2023
fa4dd3f
Pulled latest version of log-surgeon
SharafMohamed Sep 25, 2023
0e4a6b4
Merge remote-tracking branch 'upstream/main' into main
SharafMohamed Sep 29, 2023
d8ffc74
Fixed update_segment_indices to use the passed in parameter, this was…
SharafMohamed Oct 2, 2023
e3e6911
Removed some redundancies in grep
SharafMohamed Oct 2, 2023
120342a
Correctly use the type vector when checking search_token type in grep…
SharafMohamed Oct 2, 2023
6de8355
Merge remote-tracking branch 'upstream/main' into dfa-search
SharafMohamed Nov 13, 2023
47205ac
Starting to setup schema dfa-based search
SharafMohamed Nov 17, 2023
15ef079
temp
SharafMohamed Nov 22, 2023
bac9383
logtype_matrix now correct for simple cases, added m_ to Reader members
SharafMohamed Nov 27, 2023
b65fde4
added intersect
SharafMohamed Dec 7, 2023
79809cc
removed everything other than intersect for now
SharafMohamed Dec 7, 2023
d672056
fixed name prefixes to suffixes
SharafMohamed Dec 7, 2023
21cfacc
generate logtype from intersects
SharafMohamed Dec 11, 2023
0dd02a6
DFA search now considers var dictionary
SharafMohamed Dec 15, 2023
9473401
hacky way to handle wildcard <ints> and <floats>
SharafMohamed Dec 15, 2023
6876acb
fixed how static text is handled in search query; added sanitization …
SharafMohamed Dec 17, 2023
e39ef1e
only use highest prio for non-wildcard substrings in dfa-search
SharafMohamed Jan 12, 2024
ee79d88
added delim handling to dfa-search
SharafMohamed Jan 12, 2024
cc9a70c
hack for m_next_children_start to reset to 0 before each DFA is made
SharafMohamed Jan 12, 2024
96f18d5
Completely duplicate CLP to prepare for GLT
haiqi96 Jan 15, 2024
19dbadb
rename namespace in the duplicated codebase
haiqi96 Jan 15, 2024
fd94018
Rough support compression
haiqi96 Jan 16, 2024
1196327
Fix bugs in compression
haiqi96 Jan 16, 2024
9718d56
Rough support decompression
haiqi96 Jan 16, 2024
1cf9bac
Fix size calculation
haiqi96 Jan 16, 2024
693ad94
Preliminary support for non-optimized search
haiqi96 Jan 17, 2024
979b029
Preliminary support for optimized search
haiqi96 Jan 17, 2024
a6f2025
index magic to handle the fact var_position gets updated to placeholder
haiqi96 Jan 18, 2024
2b8c883
Fix GLT specific timestamp issue
haiqi96 Jan 18, 2024
6becc48
Add get variable info for now.
haiqi96 Jan 18, 2024
7366ed5
Run linter
haiqi96 Jan 18, 2024
a44ecad
Fix variable placeholder
haiqi96 Jan 18, 2024
8f41624
Update argument interface
haiqi96 Jan 18, 2024
46725f4
Some clean and linter
haiqi96 Jan 18, 2024
12f48b7
updated log-surgeon
SharafMohamed Jan 19, 2024
5b76807
Remove logsurgeon and unused libs
haiqi96 Jan 19, 2024
8ad0793
rearrange class variables methods
haiqi96 Jan 19, 2024
50b79ba
Mark TODOs
haiqi96 Jan 19, 2024
0617c48
Compress file dict
haiqi96 Jan 19, 2024
11fd9b7
linter fix
haiqi96 Jan 19, 2024
ee16463
updated log-surgeon
SharafMohamed Jan 19, 2024
eb86d6a
Finish search query conversion to regex that log-surgeon can use; No …
SharafMohamed Jan 19, 2024
c63cccb
Remove gltg and move search into glt binary
haiqi96 Jan 19, 2024
61b3eb8
Fix output method code and hide output method option from user.
haiqi96 Jan 19, 2024
66275da
Remove prematured optimization
haiqi96 Jan 19, 2024
6702c9d
Update readme
haiqi96 Jan 20, 2024
ff5b61f
Add comments and tokenization code
haiqi96 Jan 21, 2024
93808f0
commit find boundary function
haiqi96 Jan 22, 2024
02b4a30
support optimization. except that escape is not well supported yet
haiqi96 Jan 22, 2024
87880f8
Small fix and utilities
haiqi96 Jan 22, 2024
1e69b99
Fix include and indexing boundary case for find left boundary
haiqi96 Jan 22, 2024
e9fde16
Run linter
haiqi96 Jan 22, 2024
f12aa15
Handle a corner case where none of the token contains variable.
haiqi96 Jan 23, 2024
7db1315
Support escape properly
haiqi96 Jan 23, 2024
d698c01
Remove unused string utils
haiqi96 Jan 23, 2024
67195ca
Deals with shared wildcard between vars; Remove stray return true
SharafMohamed Jan 24, 2024
27b5e38
Refactor adding * before and after suffix when needed
SharafMohamed Jan 24, 2024
cb4242c
For int/floats to be imprecise, check if the var itself has wildcard …
SharafMohamed Jan 24, 2024
190cf41
Fix whats heuristic only and whats shared with the schema grep
SharafMohamed Jan 24, 2024
5c033f4
Merge branch 'main' into glt_optimization
haiqi96 Jan 26, 2024
843933d
No longer include timestamp in compressed message for search, TS comp…
SharafMohamed Jan 26, 2024
9c60bd5
refactor comments to make the PR less confusing
haiqi96 Jan 26, 2024
c68d6d9
only build DFA if there are delims; added profiling
SharafMohamed Jan 29, 2024
003fe21
Only leave needed profiling
SharafMohamed Jan 29, 2024
dae8f3d
stuff
SharafMohamed Feb 9, 2024
d7c0c8a
Don't rebuild query matrix every time
SharafMohamed Apr 17, 2024
777800d
switched log-surgeon submodule back to open source repo
SharafMohamed Apr 17, 2024
2d95a7c
Correctly checkout main from open source repo instead of fork for log…
SharafMohamed Apr 17, 2024
07622f2
merged main
SharafMohamed Apr 17, 2024
b08eadd
CLG now working after merge
SharafMohamed Apr 18, 2024
a04ae6c
GLT + Log-Surgeon compresses/decompresses
SharafMohamed Apr 18, 2024
cce3368
Merge branch 'GLT-PR' into dfa-search
SharafMohamed Apr 18, 2024
a36a3f4
Search should now work with GLT + Log-Surgeon
SharafMohamed Apr 19, 2024
1df2298
Fixed GLT to store schema in archive
SharafMohamed Apr 19, 2024
d71f304
GLT + LS should use boundaries correctly now
SharafMohamed Apr 19, 2024
39787df
Merge remote-tracking branch 'upstream/main' into dfa-search
SharafMohamed Jun 7, 2024
57f3d8f
Removed redundant utils.cmake
SharafMohamed Jun 7, 2024
465ab74
Removed duplicate files that were moved
SharafMohamed Jun 10, 2024
f69ea8a
-Reverted GLT changes for now
SharafMohamed Jun 10, 2024
08edc7c
Fixed up QueryLogtype class; Remove uneeded changes to spacing.
SharafMohamed Jun 14, 2024
e449751
fixed changed ts to nullptr repeatedly
SharafMohamed Jun 17, 2024
b14184d
reformatted Grep.hpp
SharafMohamed Jun 17, 2024
46ca422
Fromatted Grep.cpp
SharafMohamed Jun 17, 2024
7b60f33
Reformatted StringReader.hpp StringReader.cpp Query.hpp
SharafMohamed Jun 17, 2024
667f4e3
Remove unused get_bounds_of_next_potential_var() code for schmea-case…
SharafMohamed Jul 5, 2024
c55a26a
Split into functions and add comments; Minor changes to match code st…
SharafMohamed Jul 7, 2024
b84a354
Fixed QueryLogtype class to use setters/getters, declare functions in…
SharafMohamed Jul 8, 2024
ce7f6ee
Fixed stopwatch test
SharafMohamed Jul 8, 2024
00f4982
Fixed stopwatch test again
SharafMohamed Jul 8, 2024
53d6242
Autoformatted
SharafMohamed Jul 8, 2024
b3efd94
Optimized how current_string is created for each substring
SharafMohamed Jul 8, 2024
acd8819
get_bounds_of_next_potential_var tests changed back to test heuristic…
SharafMohamed Jul 8, 2024
86a5826
Schema search now handles '?' wildcard, and cancelled literals
SharafMohamed Jul 9, 2024
2159542
Fixed bug where start and end of substring were reversed in one place…
SharafMohamed Jul 10, 2024
ff830cd
Added back in bug fix for log_surgeon::NonTerminal::m_next_children_s…
SharafMohamed Jul 10, 2024
5f2de34
Autoformatted
SharafMohamed Jul 10, 2024
3e35c04
Fixed bug where variables weren't being used in schema search
SharafMohamed Jul 10, 2024
5447c27
Move getting location of wildcard and cancel characters into its own …
SharafMohamed Jul 11, 2024
90ee13e
Autoformatted
SharafMohamed Jul 11, 2024
4f06c18
Ran autoformatter again, somehow it didn't work first time
SharafMohamed Jul 11, 2024
d4e25ff
Removed spaces
SharafMohamed Jul 11, 2024
a8219d1
get_wildcard_and_escape_locations returns tuples; cancel -> escape; u…
SharafMohamed Jul 29, 2024
5213070
Fix constant == variable in grep.cpp
SharafMohamed Jul 29, 2024
f138f99
Update search prototype and docstring in clg.cpp
SharafMohamed Jul 29, 2024
2ce2ff7
initialize lexer_ptr
SharafMohamed Jul 29, 2024
fc184d1
Correct lexer initialization style
SharafMohamed Jul 29, 2024
19c3605
uint32_t -> size_t
SharafMohamed Jul 29, 2024
bbeca87
*_star -> *_char_is_star
SharafMohamed Jul 29, 2024
a0c2546
Removed unused var
SharafMohamed Jul 29, 2024
43b9a25
Fix usage of ByteLexer class vs object; Improve DFA naming
SharafMohamed Jul 29, 2024
cf6b14b
Remove reference from variables storing non-referenced return types
SharafMohamed Jul 29, 2024
30a88d4
Fix bug processed_search_string.back() -> processed_search_string.len…
SharafMohamed Jul 29, 2024
384354b
Fix is_escaped -> is_escape in structured binding; Fix errant ==
SharafMohamed Jul 29, 2024
864f355
Change Grep.hpp to match is_cancel -> is_escape change
SharafMohamed Jul 29, 2024
16d9cdc
Remove duplicate escape logic; Explain logic using escape characters …
SharafMohamed Jul 30, 2024
7b2ceba
Change i to end_idx
SharafMohamed Jul 30, 2024
092fce2
Change j to begin_idx
SharafMohamed Jul 30, 2024
8a189fa
Change k to idx
SharafMohamed Jul 30, 2024
ee8a11f
Make end_idx exclusive
SharafMohamed Jul 30, 2024
f42d608
Make substr_end exclusive; Change i to idx
SharafMohamed Jul 30, 2024
7b6d426
Change query_logtypes loop to treat it as a stack, deleting elements …
SharafMohamed Jul 30, 2024
6e4c5a3
Rename *is_dict_var to *is_encoded_with_wildcard as the name and its …
SharafMohamed Jul 30, 2024
e8f24ec
Comment out omition of sorrounding wildcard case, as well as removing…
SharafMohamed Jul 30, 2024
ef28c42
Skip redundant iterations for substrings that begin or end with wildc…
SharafMohamed Aug 1, 2024
23929a9
Move query logtypes into a vector instead of set so we can safely add…
SharafMohamed Aug 2, 2024
b033bd8
Remove redundant brackets; Move variable_types declaration to where i…
SharafMohamed Aug 2, 2024
639de8e
Use tuple return for get_substring_variable_types; Rename var for cla…
SharafMohamed Aug 2, 2024
ceb5d4d
Remove redundant code now that we skip substrings starting/ending with *
SharafMohamed Aug 2, 2024
db8e544
Move get_possible_substr_types() into its own function; Use vector in…
SharafMohamed Aug 2, 2024
a360cd8
Add comment explaiing alraedy_added_var
SharafMohamed Aug 2, 2024
71c6b23
Merge remote-tracking branch 'upstream/main' into dfa-search
SharafMohamed Aug 2, 2024
ebabea0
Add unit-tests; Make QueryLogtype more usable with catch2; Fix typo; …
SharafMohamed Aug 9, 2024
d016f17
add static-text to unit-tests where its not fully optimized yet; make…
SharafMohamed Aug 12, 2024
e7ca083
Fix structured binding so get_possible_substr_types() doesn't always …
SharafMohamed Aug 12, 2024
3314838
Have query logtypes generate for every archive (future will be only o…
SharafMohamed Aug 12, 2024
7f30aa7
Change to has_encoded_wildcard_var to true for unit-test cases where …
SharafMohamed Aug 12, 2024
22fca92
Fix bug where it never generates subqueries
SharafMohamed Aug 12, 2024
b0f2c41
Remove encoded var checks until refactor
SharafMohamed Aug 12, 2024
09731ec
Fix expected_results vector size
SharafMohamed Aug 12, 2024
16fee6e
Rename QueryLogtype to QueryInterpretation and move it into its own f…
SharafMohamed Aug 15, 2024
5d41bf2
Remove extra newline
SharafMohamed Aug 15, 2024
fda1fa0
Change QueryInterpretation class to use a vector of static and variab…
SharafMohamed Aug 19, 2024
67bf5ed
Remove redundant false check
SharafMohamed Aug 20, 2024
c35e2c1
Move handling multiplt logtypes for encoded wildcard variables into p…
SharafMohamed Aug 20, 2024
7f75a2b
Fix naming
SharafMohamed Aug 20, 2024
9eadd97
Early return to reduce indentation
SharafMohamed Aug 20, 2024
cee9e90
Fix comment wrap around lengths
SharafMohamed Aug 20, 2024
1851586
Use constexpr for int and float strings; Fix bug
SharafMohamed Aug 20, 2024
f059d01
Add SearchString and SearchStringView class to simplify indexing; Add…
SharafMohamed Aug 21, 2024
f76765c
Fix clang-tidy error related to current PR
SharafMohamed Aug 21, 2024
d8682d9
Move logtype string generation immediately before the the full query …
SharafMohamed Aug 26, 2024
6a97f58
No longer need to consider m_logtype_string in append() as its comput…
SharafMohamed Aug 26, 2024
a0af1f0
Only do logtype_generation and insertion into query_substr_interpreta…
SharafMohamed Aug 26, 2024
1b19e26
Set operator== to compare on only m_logtype for QueryInterpretation a…
SharafMohamed Aug 26, 2024
daf3b0b
Remove duplicate class name
SharafMohamed Aug 26, 2024
ebbff2d
Reserve size for m_logtype_string
SharafMohamed Aug 26, 2024
aba48b3
Resolve conflict
SharafMohamed Aug 26, 2024
fa6d602
Autoformat
SharafMohamed Aug 26, 2024
b952ff6
Switch back to std::replaces from std::ranges::replace for macos support
SharafMohamed Aug 26, 2024
c010c55
Remove old comment
SharafMohamed Aug 26, 2024
55ac74f
Added QueryInterpretation classes to clg and clo executables
SharafMohamed Aug 26, 2024
afabaef
Also switch to std::replace in SearchString for macos support
SharafMohamed Aug 26, 2024
a0e3265
Spacing fix
SharafMohamed Aug 26, 2024
151f362
Explicitly define < and > operators, instead of default <=> operator …
SharafMohamed Aug 26, 2024
4f09be3
Move short function into header and longer functions into cpp
SharafMohamed Aug 26, 2024
497794f
Update yscope-dev-utils; Change SearchStringView to contain a ptr to …
SharafMohamed Aug 27, 2024
d54c359
Refactor and rename surrounded_by_delims to surrounded_by_delims_or_w…
kirkrodrigues Sep 3, 2024
5a8d3a7
Refactor and fix OOB in surrounded_by_delims_or_wildcards.
kirkrodrigues Sep 3, 2024
fd0cee9
Rename surrounded_by_delims_or_wildcards test case.
kirkrodrigues Sep 3, 2024
5404421
Refactor and rename extend_to_adjacent_wildcards to extend_to_adjacen…
kirkrodrigues Sep 3, 2024
369c2ac
Refactor Grep::get_substring_variable_types.
kirkrodrigues Sep 3, 2024
a3470d7
Rename SearchString -> WildcardExpression.
kirkrodrigues Sep 3, 2024
73bb830
Merge remote-tracking branch 'upstream/main' into dfa-search
SharafMohamed Sep 3, 2024
0fcc017
Fix lint violation.
kirkrodrigues Sep 3, 2024
21f16d9
Refactor Grep::get_substring_variable_types to respect new WildcardEx…
kirkrodrigues Sep 4, 2024
95c5529
Remove duplicated get_substr_copy.
kirkrodrigues Sep 5, 2024
10d3358
Move WildcardExpression & WildcardExpressionView into their own file.
kirkrodrigues Sep 9, 2024
1bdd235
Remove WildcardExpression::create_view and WildcardExpressionView for…
kirkrodrigues Sep 9, 2024
5ad17c4
Switch WildcardExpression indices from uint32_t to size_t to avoid na…
kirkrodrigues Sep 9, 2024
c51509d
Merge branch 'main' into dfa-search
kirkrodrigues Sep 9, 2024
f3fa472
Fix some docstrings.
kirkrodrigues Sep 9, 2024
ca31075
Rename WildcardExpression methods.
kirkrodrigues Sep 9, 2024
e76a371
starts_or_ends_with_greedy_wildcard: Guard against empty views.
kirkrodrigues Sep 9, 2024
1a1f8c6
Fix docstring.
kirkrodrigues Sep 9, 2024
1334847
Rename WildcardExpressionView methods.
kirkrodrigues Sep 9, 2024
a44e50c
Rename WildcardExpressionView::get_substr_copy -> get_value.
kirkrodrigues Sep 10, 2024
db2e14f
Rename WildcardExpressionView::m_search_string_ptr -> m_expression.
kirkrodrigues Sep 10, 2024
a508bfb
Merge branch 'main' into dfa-search
kirkrodrigues Sep 11, 2024
8192425
For unit-testing, compare QueryIntepretations to an expected serializ…
SharafMohamed Sep 11, 2024
37fca8a
Merge branch 'dfa-search' of https://github.com/SharafMohamed/clp int…
SharafMohamed Sep 11, 2024
8680630
Fix comments in QueryInterpretatios unit-test
SharafMohamed Sep 11, 2024
0a3ac80
use enum_to_underlying_type in unit-tests for macos support
SharafMohamed Sep 11, 2024
28cf435
Rename Grep::get_substring_variable_types -> get_matching_variable_ty…
kirkrodrigues Sep 12, 2024
ce0684d
Fix clang-tidy warning in Grep::get_matching_variable_types.
kirkrodrigues Sep 12, 2024
256669b
Reorganize get_substring_variable_types test.
kirkrodrigues Sep 12, 2024
cb69a94
Rename get_substring_variable_types test to get_matching_variable_types.
kirkrodrigues Sep 12, 2024
fb688c9
get_matching_variables test: Remove unnecessary section.
kirkrodrigues Sep 12, 2024
a9d7bcc
get_matching_variables test: Edit comments.
kirkrodrigues Sep 12, 2024
b561deb
get_matching_variables test: Rename search_string -> wildcard_expr.
kirkrodrigues Sep 12, 2024
90b27f2
get_matching_variables test: Fix clang-tidy violations.
kirkrodrigues Sep 12, 2024
845bf14
get_possible_substr_types test: Rename search_string -> wildcard_expr.
kirkrodrigues Sep 12, 2024
a5e1b0b
get_possible_substr_types test: Rename query_logtypes -> interpretati…
kirkrodrigues Sep 12, 2024
e1b8ad5
get_possible_substr_types test: Add newlines.
kirkrodrigues Sep 12, 2024
9b22f6f
get_possible_substr_types test: Create QueryInterpretation before emp…
kirkrodrigues Sep 12, 2024
b97d8ac
get_possible_substr_types test: Remove unnecessary section.
kirkrodrigues Sep 12, 2024
21fbcee
get_possible_substr_types test: Fix clang-tidy violations.
kirkrodrigues Sep 12, 2024
6f70f3a
Treat isolated '?' wildcards as any other string
SharafMohamed Sep 12, 2024
4a6a041
Merge branch 'dfa-search' of https://github.com/SharafMohamed/clp int…
SharafMohamed Sep 12, 2024
79ef576
Shorten sorrounded_by_delims_or_wildcards header comment
SharafMohamed Sep 12, 2024
53cdc1e
use prefix decrement
SharafMohamed Sep 12, 2024
a7962b2
No longer need to replace '?' with '*' wildcards for schema search
SharafMohamed Sep 12, 2024
4722167
Correct WildCardExpressionView constructor docstring
SharafMohamed Sep 12, 2024
e3ee26a
Print m_id_symbols so variable ids can be decoded if unit-test fails
SharafMohamed Sep 12, 2024
8f302dc
Remove forward and reverse lexer from heuristic unit-test
SharafMohamed Sep 12, 2024
df42ca1
Refactor Grep::get_possible_substr_types: Rewrite docstring and renam…
kirkrodrigues Sep 16, 2024
a2124d8
Refactor Grep::get_possible_substr_types: Rename to get_interpretatio…
kirkrodrigues Sep 16, 2024
89af909
Refactor Grep::get_interpretations_for_whole_wildcard_expr: Extract s…
kirkrodrigues Sep 16, 2024
7ea6211
Refactor Grep::get_interpretations_for_whole_wildcard_expr: Rename ex…
kirkrodrigues Sep 16, 2024
d077b14
Refactor Grep::get_interpretations_for_whole_wildcard_expr: Rename va…
kirkrodrigues Sep 16, 2024
a0f6a52
Refactor Grep::get_interpretations_for_whole_wildcard_expr: Rename al…
kirkrodrigues Sep 16, 2024
389f48b
Refactor Grep::get_interpretations_for_whole_wildcard_expr: Use early…
kirkrodrigues Sep 16, 2024
7153c40
Merge branch 'main' into dfa-search
kirkrodrigues Sep 16, 2024
635e848
Undo unintentional change.
kirkrodrigues Sep 16, 2024
22d82a7
Add TODO about hardcoding encoded variable type names.
kirkrodrigues Sep 18, 2024
eb2ce26
Elaborate about why we need to track whether we've already added a di…
kirkrodrigues Sep 22, 2024
8e852bc
Merge branch 'main' into dfa-search
kirkrodrigues Sep 30, 2024
eb52a94
Rephrase explanation of why we need two query interpretations for wil…
kirkrodrigues Sep 30, 2024
5e07b89
Add non-greedy wildcard unit-test; Fix comment formatting; Improve r…
SharafMohamed Sep 30, 2024
1f93b9e
Merge branch 'dfa-search' of https://github.com/SharafMohamed/clp int…
SharafMohamed Sep 30, 2024
fabad21
Trying to simplify unit tests, currently doesn't work
SharafMohamed Oct 3, 2024
9060179
Add ExpectedInterpretation class to test-Grep.cpp to make testing mor…
SharafMohamed Oct 7, 2024
28735b6
Add TODO for possible bug to tests.
SharafMohamed Oct 7, 2024
5e473f9
Removed TODO as 100?00 is not encoded
SharafMohamed Oct 7, 2024
739e0d9
Remove TODOs in favor of letting unit-test fail until interpretation …
SharafMohamed Oct 7, 2024
4c1b8db
Fix typo.
SharafMohamed Oct 7, 2024
f2ac3b5
Run linter
SharafMohamed Oct 7, 2024
7a139ee
Add wildcard tests for get_matching_variable_types and get_interpreta…
SharafMohamed Oct 8, 2024
4 changes: 4 additions & 0 deletions components/core/CMakeLists.txt
@@ -424,6 +424,8 @@ set(SOURCE_FILES_unitTest
src/clp/Profiler.hpp
src/clp/Query.cpp
src/clp/Query.hpp
src/clp/QueryInterpretation.cpp
src/clp/QueryInterpretation.hpp
src/clp/ReaderInterface.cpp
src/clp/ReaderInterface.hpp
src/clp/ReadOnlyMemoryMappedFile.cpp
@@ -489,6 +491,8 @@ set(SOURCE_FILES_unitTest
src/clp/VariableDictionaryWriter.cpp
src/clp/VariableDictionaryWriter.hpp
src/clp/version.hpp
src/clp/WildcardExpression.cpp
src/clp/WildcardExpression.hpp
src/clp/WriterInterface.cpp
src/clp/WriterInterface.hpp
submodules/sqlite3/sqlite3.c
700 changes: 466 additions & 234 deletions components/core/src/clp/Grep.cpp

Large diffs are not rendered by default.

86 changes: 63 additions & 23 deletions components/core/src/clp/Grep.hpp
@@ -8,10 +8,13 @@

#include "Defs.h"
#include "Query.hpp"
#include "QueryInterpretation.hpp"
#include "streaming_archive/reader/Archive.hpp"
#include "streaming_archive/reader/File.hpp"
#include "WildcardExpression.hpp"

namespace clp {

class Grep {
public:
// Types
@@ -37,8 +40,7 @@ class Grep {
* @param search_begin_ts
* @param search_end_ts
* @param ignore_case
* @param forward_lexer DFA for determining if input is in the schema
* @param reverse_lexer DFA for determining if reverse of input is in the schema
* @param lexer DFA for determining if input is in the schema
* @param use_heuristic
* @return Query if it may match a message, std::nullopt otherwise
*/
@@ -48,8 +50,7 @@ class Grep {
epochtime_t search_begin_ts,
epochtime_t search_end_ts,
bool ignore_case,
log_surgeon::lexers::ByteLexer& forward_lexer,
log_surgeon::lexers::ByteLexer& reverse_lexer,
log_surgeon::lexers::ByteLexer& lexer,
bool use_heuristic
);

@@ -69,25 +70,6 @@ class Grep {
bool& is_var
);

/**
* Returns bounds of next potential variable (either a definite variable or a token with
* wildcards)
* @param value String containing token
* @param begin_pos Begin position of last token, changes to begin position of next token
* @param end_pos End position of last token, changes to end position of next token
* @param is_var Whether the token is definitely a variable
* @param forward_lexer DFA for determining if input is in the schema
* @param reverse_lexer DFA for determining if reverse of input is in the schema
* @return true if another potential variable was found, false otherwise
*/
static bool get_bounds_of_next_potential_var(
std::string const& value,
size_t& begin_pos,
size_t& end_pos,
bool& is_var,
log_surgeon::lexers::ByteLexer& forward_lexer,
log_surgeon::lexers::ByteLexer& reverse_lexer
);
/**
* Marks which sub-queries in each query are relevant to the given file
* @param compressed_file
@@ -126,6 +108,7 @@ class Grep {
streaming_archive::reader::Message& compressed_msg,
std::string& decompressed_msg
);

/**
* Searches a file with the given query without outputting the results
* @param query
@@ -143,6 +126,63 @@ class Grep {
streaming_archive::reader::Archive& archive,
streaming_archive::reader::File& compressed_file
);

/**
* Generates all possible logtypes that can match each substr(0,n) of the search string.
* Requires that processed_search_string is valid, meaning that only wildcards are escaped
* and the string does not end with an escape character.
* @param processed_search_string
* @param lexer
* @return a vector of all QueryInterpretations that can match the query in
* processed_search_string.
*/
static std::set<QueryInterpretation> generate_query_substring_interpretations(
WildcardExpression const& processed_search_string,
log_surgeon::lexers::ByteLexer& lexer
);

Comment on lines +139 to +143
⚠️ Potential issue

Correct the return type description in the documentation.

The function generate_query_substring_interpretations returns a std::set<QueryInterpretation>, but the documentation states it returns a vector. Please update the documentation to accurately reflect the return type.
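
For readers following the docstring under review, here is a toy sketch of the prefix-by-prefix idea it describes: interpretations of `substr(0, n)` are built by combining interpretations of a shorter prefix with interpretations of the remaining piece taken as a whole. This is not the PR's implementation (the Grep.cpp diff is not rendered on this page). `interpret_whole` and `interpret_prefixes` are hypothetical names, plain strings stand in for `QueryInterpretation`, and the "all digits means `<int>`" rule is an assumption that replaces the lexer.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <set>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for interpreting a substring as a single token.
// Every substring can be static text; an all-digit substring can also be "<int>".
std::vector<std::string> interpret_whole(std::string_view substr) {
    std::vector<std::string> result{std::string{substr}};
    bool const all_digits = std::all_of(substr.begin(), substr.end(), [](unsigned char c) {
        return 0 != std::isdigit(c);
    });
    if (all_digits) {
        result.emplace_back("<int>");
    }
    return result;
}

// prefix_interps[n] holds every interpretation of query.substr(0, n).
std::set<std::string> interpret_prefixes(std::string_view query) {
    std::vector<std::set<std::string>> prefix_interps(query.size() + 1);
    prefix_interps[0].insert("");  // The empty prefix has one (empty) interpretation
    for (std::size_t end = 1; end <= query.size(); ++end) {
        for (std::size_t begin = 0; begin < end; ++begin) {
            // Interpret [begin, end) as one token and append it to every
            // interpretation already known for [0, begin).
            auto const suffixes = interpret_whole(query.substr(begin, end - begin));
            for (auto const& prefix : prefix_interps[begin]) {
                for (auto const& suffix : suffixes) {
                    prefix_interps[end].insert(prefix + suffix);
                }
            }
        }
    }
    return prefix_interps[query.size()];
}
```

Using a `std::set` per prefix removes duplicates that arise when adjacent static-text pieces concatenate to the same string, which may be one reason the function under review returns a set rather than a vector.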

/**
* Computes the tokens (static text or different types of variables) that the given wildcard
* expression (as a whole) could be interpreted as, generates a `QueryInterpretation` for each
* one, and returns the `QueryInterpretation`s.
* @param wildcard_expr
* @param lexer
* @return The `QueryInterpretation`s.
*/
static std::vector<QueryInterpretation> get_interpretations_for_whole_wildcard_expr(
WildcardExpressionView const& wildcard_expr,
log_surgeon::lexers::ByteLexer& lexer
);

/**
* Gets the variable types that the given wildcard expression could match.
* @param wildcard_expr
* @param lexer
* @return A tuple:
* - The set of variable types that the wildcard expression could match.
* - Whether the wildcard expression contains a wildcard.
*/
static std::tuple<std::set<uint32_t>, bool> get_matching_variable_types(
WildcardExpressionView const& wildcard_expr,
log_surgeon::lexers::ByteLexer const& lexer
);

/**
* Compares all possible query logtypes against the archive to determine all possible sub-queries
* that can match messages in the archive.
* @param query_interpretations
* @param archive
* @param lexer
* @param ignore_case
* @param sub_queries
*/
static void generate_sub_queries(
std::set<QueryInterpretation> const& query_interpretations,
streaming_archive::reader::Archive const& archive,
log_surgeon::lexers::ByteLexer& lexer,
bool ignore_case,
std::vector<SubQuery>& sub_queries
);
};
} // namespace clp

1 change: 1 addition & 0 deletions components/core/src/clp/Query.hpp
@@ -135,6 +135,7 @@ class SubQuery {
* @return true if matched, false otherwise
*/
bool matches_logtype(logtype_dictionary_id_t logtype) const;

/**
* Whether the given variables contain the subquery's variables in order (but not necessarily
* contiguously)
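
The "in order (but not necessarily contiguously)" wording in this context line describes a subsequence check. A minimal sketch of that check follows. `contains_in_order` is a hypothetical helper written for illustration, not the method declared in Query.hpp, whose signature is cut off in this hunk.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Returns true if every element of `needles` appears in `haystack` in the same
// relative order, allowing unrelated elements in between.
template <typename T>
bool contains_in_order(std::vector<T> const& haystack, std::vector<T> const& needles) {
    auto it = haystack.begin();
    for (auto const& needle : needles) {
        it = std::find(it, haystack.end(), needle);
        if (haystack.end() == it) {
            return false;
        }
        ++it;  // Continue searching after the matched element
    }
    return true;
}
```

For example, `{2, 4}` is contained in order in `{1, 2, 3, 4}`, while `{3, 2}` is not contained in order in `{1, 2, 3}`.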
195 changes: 195 additions & 0 deletions components/core/src/clp/QueryInterpretation.cpp
@@ -0,0 +1,195 @@
#include "QueryInterpretation.hpp"

#include <algorithm>
#include <cstdint>
#include <ostream>
#include <string>
#include <utility>
#include <variant>

#include "Defs.h"
#include "EncodedVariableInterpreter.hpp"
#include "log_surgeon/Lexer.hpp"
#include "LogTypeDictionaryEntry.hpp"
#include "string_utils/string_utils.hpp"

using log_surgeon::lexers::ByteLexer;
using std::string;

namespace clp {
auto VariableQueryToken::operator<(VariableQueryToken const& rhs) const -> bool {
if (m_variable_type < rhs.m_variable_type) {
return true;
}
if (m_variable_type > rhs.m_variable_type) {
return false;
}
if (m_query_substring < rhs.m_query_substring) {
return true;
}
if (m_query_substring > rhs.m_query_substring) {
return false;
}
if (m_has_wildcard != rhs.m_has_wildcard) {
return rhs.m_has_wildcard;
}
if (m_is_encoded != rhs.m_is_encoded) {
return rhs.m_is_encoded;
}
return false;
}
Comment on lines +20 to +40
⚠️ Potential issue

Logical error in 'operator<' implementation for 'VariableQueryToken'

In the operator< function, when m_has_wildcard != rhs.m_has_wildcard, the function returns rhs.m_has_wildcard. This directly returns the value of rhs.m_has_wildcard, which may not correctly reflect whether this object is less than rhs. The intended behaviour should compare the boolean values to determine ordering.

Similarly, when comparing m_is_encoded, the function returns rhs.m_is_encoded, which might not yield the correct comparison result.

Please revise the comparison logic to ensure it accurately represents the intended ordering of VariableQueryToken objects.
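
One way to check this concern is to compare the field-by-field logic against a plain lexicographic comparison. In the branch where the two booleans differ, returning `rhs.m_has_wildcard` appears to give the same result as `m_has_wildcard < rhs.m_has_wildcard` (with `false` ordered before `true`), so the operator as written may already be a consistent ordering. The sketch below uses a hypothetical `Token` struct with the same four members and `std::tie` as the reference; it is an illustration, not code from the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <tuple>

// Hypothetical stand-in mirroring the four members compared in the diff.
struct Token {
    std::uint32_t variable_type;
    std::string query_substring;
    bool has_wildcard;
    bool is_encoded;
};

// Same branch structure as the operator< shown in the diff.
bool less_as_written(Token const& lhs, Token const& rhs) {
    if (lhs.variable_type < rhs.variable_type) { return true; }
    if (lhs.variable_type > rhs.variable_type) { return false; }
    if (lhs.query_substring < rhs.query_substring) { return true; }
    if (lhs.query_substring > rhs.query_substring) { return false; }
    if (lhs.has_wildcard != rhs.has_wildcard) { return rhs.has_wildcard; }
    if (lhs.is_encoded != rhs.is_encoded) { return rhs.is_encoded; }
    return false;
}

// Reference: lexicographic comparison over the same members in the same order.
bool less_via_tie(Token const& lhs, Token const& rhs) {
    return std::tie(lhs.variable_type, lhs.query_substring, lhs.has_wildcard, lhs.is_encoded)
           < std::tie(rhs.variable_type, rhs.query_substring, rhs.has_wildcard, rhs.is_encoded);
}
```

If the two functions agree on every combination of the boolean flags, the remaining question is readability rather than correctness, and the `std::tie` form is the shorter of the two.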


auto VariableQueryToken::operator>(VariableQueryToken const& rhs) const -> bool {
if (m_variable_type > rhs.m_variable_type) {
return true;
}
if (m_variable_type < rhs.m_variable_type) {
return false;
}
if (m_query_substring > rhs.m_query_substring) {
return true;
}
if (m_query_substring < rhs.m_query_substring) {
return false;
}
if (m_has_wildcard != rhs.m_has_wildcard) {
return m_has_wildcard;
}
if (m_is_encoded != rhs.m_is_encoded) {
return m_is_encoded;
}
return false;
}
Comment on lines +42 to +62
⚠️ Potential issue

Logical error in 'operator>' implementation for 'VariableQueryToken'

In the operator> function, when m_has_wildcard != rhs.m_has_wildcard, the function returns m_has_wildcard. Returning m_has_wildcard directly may not correctly indicate whether this object is greater than rhs. The comparison should determine the ordering based on the values of the booleans.

Please re-evaluate the logic in this operator to ensure it behaves as expected and provides a correct ordering between VariableQueryToken instances.
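
A common way to avoid maintaining two mirrored comparison bodies is to define `operator>` in terms of `operator<` with the operands swapped. The sketch below shows that pattern on a hypothetical `Token` struct; the member names are stand-ins, and whether the PR should adopt this is a judgment call for the authors.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <tuple>

// Hypothetical stand-in mirroring the four members compared in the diff.
struct Token {
    std::uint32_t variable_type;
    std::string query_substring;
    bool has_wildcard;
    bool is_encoded;
};

// Single source of truth for the ordering.
bool operator<(Token const& lhs, Token const& rhs) {
    return std::tie(lhs.variable_type, lhs.query_substring, lhs.has_wildcard, lhs.is_encoded)
           < std::tie(rhs.variable_type, rhs.query_substring, rhs.has_wildcard, rhs.is_encoded);
}

// a > b holds exactly when b < a, so the two operators cannot drift apart.
bool operator>(Token const& lhs, Token const& rhs) {
    return rhs < lhs;
}
```

With this arrangement, any later change to the ordering only has to be made in one place.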


void QueryInterpretation::append_logtype(QueryInterpretation& suffix) {
auto const& first_new_token = suffix.m_logtype[0];
if (auto& prev_token = m_logtype.back();
false == m_logtype.empty() && std::holds_alternative<StaticQueryToken>(prev_token)
&& false == suffix.m_logtype.empty()
&& std::holds_alternative<StaticQueryToken>(first_new_token))
{
Comment on lines +66 to +70
🛠️ Refactor suggestion

⚠️ Potential issue

Potential undefined behaviour due to accessing 'm_logtype.back()' without checking for emptiness

In the append_logtype method, the code initializes prev_token with m_logtype.back() before checking if m_logtype is empty. If m_logtype is empty, accessing m_logtype.back() results in undefined behaviour.

To prevent this, you should check if m_logtype is not empty before accessing m_logtype.back(). Consider restructuring the condition to ensure m_logtype.back() is only accessed when m_logtype is not empty.

Apply this diff to fix the issue:

 void QueryInterpretation::append_logtype(QueryInterpretation& suffix) {
-    auto const& first_new_token = suffix.m_logtype[0];
-    if (auto& prev_token = m_logtype.back();
-        false == m_logtype.empty() && std::holds_alternative<StaticQueryToken>(prev_token)
+    if (false == m_logtype.empty() && std::holds_alternative<StaticQueryToken>(m_logtype.back())
         && false == suffix.m_logtype.empty()
-        && std::holds_alternative<StaticQueryToken>(first_new_token))
+        && std::holds_alternative<StaticQueryToken>(suffix.m_logtype[0]))
     {
+        auto& prev_token = m_logtype.back();
+        auto const& first_new_token = suffix.m_logtype[0];
         std::get<StaticQueryToken>(prev_token).append(std::get<StaticQueryToken>(first_new_token));
         m_logtype.insert(m_logtype.end(), suffix.m_logtype.begin() + 1, suffix.m_logtype.end());
     } else {
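
The guard can also be exercised in isolation. The following is a minimal sketch, not the PR's code: StaticTok, VarTok, Tok, and append_tokens are hypothetical, simplified stand-ins for StaticQueryToken, VariableQueryToken, the logtype vector's element type, and append_logtype. It assumes the two vectors are distinct objects.

```cpp
#include <string>
#include <variant>
#include <vector>

// Hypothetical, simplified token types. The real StaticQueryToken and VariableQueryToken carry
// more state than a single string.
struct StaticTok {
    std::string text;
};

struct VarTok {
    std::string text;
};

using Tok = std::variant<StaticTok, VarTok>;

// Appends `suffix` to `logtype`, merging the two boundary tokens when both are static text.
// back() and front() are evaluated only after the corresponding vector is known to be non-empty.
// Assumes `logtype` and `suffix` are distinct vectors.
inline void append_tokens(std::vector<Tok>& logtype, std::vector<Tok> const& suffix) {
    if (suffix.empty()) {
        return;
    }
    auto first = suffix.begin();
    if (false == logtype.empty() && std::holds_alternative<StaticTok>(logtype.back())
        && std::holds_alternative<StaticTok>(suffix.front()))
    {
        // Merge the first suffix token into the last existing token and skip it below.
        std::get<StaticTok>(logtype.back()).text += std::get<StaticTok>(suffix.front()).text;
        ++first;
    }
    logtype.insert(logtype.end(), first, suffix.end());
}
```

Returning early on an empty suffix and advancing a single iterator lets both branches share one insert call, which removes the duplicated insert and the `begin() + 1` arithmetic on a possibly empty range.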

        std::get<StaticQueryToken>(prev_token).append(std::get<StaticQueryToken>(first_new_token));
        m_logtype.insert(m_logtype.end(), suffix.m_logtype.begin() + 1, suffix.m_logtype.end());
    } else {
        m_logtype.insert(m_logtype.end(), suffix.m_logtype.begin(), suffix.m_logtype.end());
    }
}

void QueryInterpretation::generate_logtype_string(ByteLexer& lexer) {
    // Convert each query logtype into a set of logtype strings. Logtype strings are used in the
    // sub query as they have the correct format for comparing against the archive. Also, a
    // single query logtype might represent multiple logtype strings. While static text converts
    // one-to-one, wildcard variables that may be encoded have different logtype strings when
    // comparing against the dictionary than they do when comparing against the segment.

    // Reserve size for m_logtype_string
    uint32_t logtype_string_size = 0;
    for (uint32_t i = 0; i < get_logtype_size(); i++) {
        if (auto const& logtype_token = get_logtype_token(i);
            std::holds_alternative<StaticQueryToken>(logtype_token))
        {
            logtype_string_size
                    += std::get<StaticQueryToken>(logtype_token).get_query_substring().size();
        } else {
            logtype_string_size++;
        }
    }
    m_logtype_string.reserve(logtype_string_size);

    for (uint32_t i = 0; i < get_logtype_size(); i++) {
        if (auto const& logtype_token = get_logtype_token(i);
            std::holds_alternative<StaticQueryToken>(logtype_token))
        {
            m_logtype_string += std::get<StaticQueryToken>(logtype_token).get_query_substring();
        } else {
            auto const& variable_token = std::get<VariableQueryToken>(logtype_token);
            auto const variable_type = variable_token.get_variable_type();
            auto const& raw_string = variable_token.get_query_substring();
            auto const is_encoded_with_wildcard = variable_token.get_is_encoded_with_wildcard();
            auto const var_has_wildcard = variable_token.get_has_wildcard();
            auto& schema_type = lexer.m_id_symbol[variable_type];
            encoded_variable_t encoded_var = 0;
            if (is_encoded_with_wildcard) {
                if (cIntVarName == schema_type) {
                    LogTypeDictionaryEntry::add_int_var(m_logtype_string);
                } else if (cFloatVarName == schema_type) {
                    LogTypeDictionaryEntry::add_float_var(m_logtype_string);
                }
            } else if (false == var_has_wildcard && cIntVarName == schema_type
                       && EncodedVariableInterpreter::convert_string_to_representable_integer_var(
                               raw_string,
                               encoded_var
                       ))
            {
                LogTypeDictionaryEntry::add_int_var(m_logtype_string);
            } else if (false == var_has_wildcard && cFloatVarName == schema_type
                       && EncodedVariableInterpreter::convert_string_to_representable_float_var(
                               raw_string,
                               encoded_var
                       ))
            {
                LogTypeDictionaryEntry::add_float_var(m_logtype_string);
            } else {
                LogTypeDictionaryEntry::add_dict_var(m_logtype_string);
            }
        }
    }
}

auto QueryInterpretation::operator<(QueryInterpretation const& rhs) const -> bool {
    if (m_logtype.size() < rhs.m_logtype.size()) {
        return true;
    }
    if (m_logtype.size() > rhs.m_logtype.size()) {
        return false;
    }
    for (uint32_t i = 0; i < m_logtype.size(); i++) {
        if (m_logtype[i] < rhs.m_logtype[i]) {
            return true;
        }
        if (m_logtype[i] > rhs.m_logtype[i]) {
            return false;
        }
    }
    return false;
}

auto operator<<(std::ostream& os, QueryInterpretation const& query_logtype) -> std::ostream& {
    os << "logtype='";
    for (uint32_t idx = 0; idx < query_logtype.get_logtype_size(); idx++) {
        if (auto const& query_token = query_logtype.get_logtype_token(idx);
            std::holds_alternative<StaticQueryToken>(query_token))
        {
            os << std::get<StaticQueryToken>(query_token).get_query_substring();
        } else {
            auto const& variable_token = std::get<VariableQueryToken>(query_token);
            os << "<" << variable_token.get_variable_type() << ">("
               << variable_token.get_query_substring() << ")";
        }
    }
    os << "', has_wildcard='";
    for (uint32_t idx = 0; idx < query_logtype.get_logtype_size(); idx++) {
        if (auto const& query_token = query_logtype.get_logtype_token(idx);
            std::holds_alternative<StaticQueryToken>(query_token))
        {
            os << 0;
        } else {
            auto const& variable_token = std::get<VariableQueryToken>(query_token);
            os << variable_token.get_has_wildcard();
        }
    }
    os << "', is_encoded_with_wildcard='";
    for (uint32_t idx = 0; idx < query_logtype.get_logtype_size(); idx++) {
        if (auto const& query_token = query_logtype.get_logtype_token(idx);
            std::holds_alternative<StaticQueryToken>(query_token))
        {
            os << 0;
        } else {
            auto const& variable_token = std::get<VariableQueryToken>(query_token);
            os << variable_token.get_is_encoded_with_wildcard();
        }
    }
    os << "', logtype_string='" << query_logtype.get_logtype_string() << "'";
    return os;
}
} // namespace clp