Add comprehensive string algorithms (KMP, Z-algorithm, Aho-Corasick, suffix structures, Manacher, edit distance)#37
Conversation
|
Important Review skippedThis PR was authored by the user configured for CodeRabbit reviews. By default, CodeRabbit skips reviewing PRs authored by this user. It's recommended to use a dedicated user account to post CodeRabbit review feedback. To trigger a single review, invoke the You can disable this status message by setting the Use the checkbox below for a quick retry:
WalkthroughAdds a comprehensive string-algorithms suite (single- and multi-pattern search, suffix structures, palindromes, edit-distance), an umbrella header, many examples and tests, new counting/radix sorting APIs and concepts (including a duplicated concept), test helpers, minor utility tweaks, and CMake/install list updates. Changes
Sequence Diagram(s)Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Pull request overview
This pull request introduces comprehensive string algorithm functionality to the Aleph-w library, including pattern matching, suffix structures, palindrome detection, and edit distance algorithms.
Changes:
- Added new string algorithm headers:
String_Search.H,Aho_Corasick.H,Suffix_Structures.H,String_Palindromes.H,String_DP.H, and umbrella headerString_Algorithms.H - Enhanced sorting utilities with counting sort and radix sort implementations for integer types
- Added integer sorting concept
IntegerSortableValuetoah-concepts.Hto enforce valid types for linear-time integer sorting - Fixed spelling and grammatical errors in comments across multiple files
Reviewed changes
Copilot reviewed 34 out of 34 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| tpl_sort_utils.H | Added counting_sort_indices, counting_sort, and radix_sort implementations with support for multiple container types |
| ah-string-utils.H | Fixed typo in comment ("sringficated" → "stringified") |
| ah-dry.H | Corrected multiple grammatical errors in comments and changed copy assignment operator to default |
| ah-concepts.H | Added IntegerSortableValue concept for type-safe integer sorting |
| Tests/suffix_structures_test.cc | New test suite for suffix array, LCP, suffix tree, and suffix automaton |
| Tests/string_search_test.cc | New test suite for KMP, Z-algorithm, Horspool, and Rabin-Karp |
| Tests/string_palindromes_test.cc | New test suite for Manacher's algorithm |
| Tests/string_dp_test.cc | New test suite for edit distance and LCS algorithms |
| Tests/string_algorithms_test.cc | Integration tests for string algorithm interoperability |
| Tests/sort_utils.cc | Added comprehensive tests for counting sort and radix sort |
| Tests/aho_corasick_test.cc | New test suite for Aho-Corasick multi-pattern matching |
| Suffix_Structures.H | New header with suffix array, LCP, suffix tree, and suffix automaton implementations |
| String_Search.H | New header with KMP, Z-algorithm, Horspool, and Rabin-Karp implementations |
| String_Palindromes.H | New header with Manacher's algorithm |
| String_DP.H | New header with Levenshtein, Damerau-Levenshtein, LCS, and longest common substring |
| String_Algorithms.H | Umbrella header including all string algorithm headers |
| README.md | Updated documentation with string algorithms section and examples |
| README.es.md | Updated Spanish documentation with string algorithms references |
| Examples/*.cc | Added 12 new example files demonstrating string algorithms |
| Examples/min_cost_matching_example.cc | Added const qualifier to loop variable |
| Examples/CMakeLists.txt | Registered new string algorithm examples |
| CMakeLists.txt | Added new string algorithm headers to build system |
| Aho_Corasick.H | New header with Aho-Corasick multi-pattern matching implementation |
Comments suppressed due to low confidence (2)
ah-string-utils.H:1
- Corrected spelling of 'sringficated' to 'stringified'.
ah-dry.H:1 - The copy assignment operator should return *this when not defaulted. Since it's now defaulted, this is correct, but if it were custom, it should follow the pattern of returning *this.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: aec96eee49
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
Tests/sort_utils.cc (1)
40-122:⚠️ Potential issue | 🟡 MinorUse
std::ranges::sortinstead ofstd::sortfor consistency with coding guidelines.The file uses
std::sortat multiple locations (lines 110, 120, 273, 274, 303, 304, 313, 324, 460, etc.). Per the coding guidelines, preferstd::ranges::algorithms over classicstd::algorithms. Replace these calls withstd::ranges::sort.Note: The
std::vectorusage withreserve()is compliant with guidelines (e.g., lines 254–275 usereserve()when output size is known a priori, which is the documented exception). However, the helper functions at lines 105–122 that returnstd::vector<uintptr_t>withoutreserve()could be migrated toDynArray<uintptr_t>for consistency, if feasible.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Tests/sort_utils.cc` around lines 40 - 122, The PR uses std::sort in multiple places; update all calls to use std::ranges::sort instead (e.g., replace std::sort(ret.begin(), ret.end()) in dynlist_node_addresses and dyndlist_node_addresses and every other occurrence). Ensure the file includes the proper headers for ranges-algorithms if needed and keep the same iterator arguments (std::ranges::sort accepts the same iterator ranges). Optionally, for dynlist_node_addresses and dyndlist_node_addresses consider returning DynArray<uintptr_t> (or reserve the std::vector before push_back) to match project conventions; if you switch to DynArray, adapt push_back logic to indexing and reserve appropriately.
🧹 Nitpick comments (7)
Tests/aho_corasick_test.cc (1)
45-216: Solid test suite with good coverage of the Aho-Corasick API.The tests cover the classic dictionary example, overlapping patterns, error handling (pre-build search, empty pattern), lifecycle (clear/rebuild), large pattern sets, long texts, single-char patterns, and binary data.
Consider adding two small edge-case tests for completeness:
- Empty text —
ac.search("")should return no matches.- No matches found — search a text that contains none of the registered patterns.
These would round out the boundary coverage.
,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Tests/aho_corasick_test.cc` around lines 45 - 216, Add two small unit tests to Tests/aho_corasick_test.cc: implement TEST(AhoCorasick, EmptyText) that creates Aho_Corasick, adds a couple patterns via add_pattern, calls build(), calls search("") and asserts no matches (matches.size()==0 or matches.empty()), and implement TEST(AhoCorasick, NoMatchesFound) that adds patterns, builds, searches a text with none of the patterns (e.g., "xyzxyz"), and asserts no matches; reference Aho_Corasick, add_pattern, build, and search in the new tests and use the existing EXPECT_EQ/EXPECT_TRUE assertions style.Examples/z_algorithm_example.cc (1)
52-62: Preferstd::formatand a ranges-based printer helper.Line 52-62 uses chained iostreams and two index loops. The guidelines prefer
std::formatand algorithmic/range-based iteration, and a small helper also removes duplicate printing logic.Proposed refactor
-# include <iostream> -# include <string> +# include <algorithm> +# include <format> +# include <iostream> +# include <string> @@ - std::cout << "Z-Algorithm Example\n"; - std::cout << "Text : " << text << "\n"; - std::cout << "Pattern: " << pattern << "\n\n"; + std::cout << std::format("Z-Algorithm Example\nText : {}\nPattern: {}\n\n", + text, pattern); + + const auto print_seq = [](const auto& seq) { + bool first = true; + std::ranges::for_each(seq, [&](const auto& v) { + if (!first) std::cout << ' '; + std::cout << v; + first = false; + }); + std::cout << '\n'; + }; @@ - std::cout << "Z-array: "; - for (size_t i = 0; i < z.size(); ++i) - std::cout << z[i] << (i + 1 == z.size() ? '\n' : ' '); + std::cout << "Z-array: "; + print_seq(z); @@ - std::cout << "Matches at positions: "; - for (size_t i = 0; i < matches.size(); ++i) - std::cout << matches[i] << (i + 1 == matches.size() ? '\n' : ' '); + std::cout << "Matches at positions: "; + print_seq(matches);As per coding guidelines “Use
std::formatfor type-safe string formatting (prefer over printf/streams)”, “Use range-based for loops for simple iterations”, and “Use algorithms over hand-rolled loops for readability and correctness”.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Examples/z_algorithm_example.cc` around lines 52 - 62, Replace the iostream-based printing with std::format and a small ranges-based printer helper: create a helper function (e.g., print_range) that takes a range (used for z and matches) and a separator/terminator, then use std::format to produce the header lines for "Z-Algorithm Example", "Text", and "Pattern" (referencing variables text and pattern) and call print_range(z, ' ') and print_range(matches, ' ') to print the sequences; also use range-based iteration or std::ranges algorithms inside the helper to avoid manual index loops and ensure consistent spacing/termination.Examples/rabin_karp_example.cc (1)
51-57: Preferstd::format+ ranges-based iteration for output.Line 51-57 uses chained iostreams and an index loop. The C++ guidelines prefer
std::formatand algorithmic/range-based loops for simple iterations. Also, the current loop does not emit a newline whenmatchesis empty.Proposed refactor
-# include <iostream> -# include <string> +# include <algorithm> +# include <format> +# include <iostream> +# include <string> @@ - std::cout << "Rabin-Karp Example\n"; - std::cout << "Text : " << text << "\n"; - std::cout << "Pattern: " << pattern << "\n\n"; + std::cout << std::format("Rabin-Karp Example\nText : {}\nPattern: {}\n\n", + text, pattern); @@ - std::cout << "Matches at positions: "; - for (size_t i = 0; i < matches.size(); ++i) - std::cout << matches[i] << (i + 1 == matches.size() ? '\n' : ' '); + std::cout << "Matches at positions: "; + bool first = true; + std::ranges::for_each(matches, [&](const auto& pos) { + if (!first) std::cout << ' '; + std::cout << pos; + first = false; + }); + std::cout << '\n';As per coding guidelines “Use
std::formatfor type-safe string formatting (prefer over printf/streams)”, “Use range-based for loops for simple iterations”, and “Use algorithms over hand-rolled loops for readability and correctness”.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Examples/rabin_karp_example.cc` around lines 51 - 57, Replace the chained iostream output with type-safe std::format-based messages for the header and the text/pattern lines (use std::format and include <format>), and replace the index loop over matches with a range-based iteration (for (auto idx : matches)) or use a join-style/algorithmic approach to print space-separated positions; ensure you emit a newline even when matches is empty by printing a terminating '\n' after the sequence or handling the empty case explicitly; reference variables: text, pattern, matches.Tests/string_search_test.cc (2)
168-198: Consider strengthening the planted-pattern assertion fromGEtoEQ.Since the text is all
'a'characters and"XYZW"contains none of{a}, there cannot be any spurious matches. UsingEXPECT_EQ(kmp.size(), 6u)would make the test stricter and catch false positives from the algorithms.Proposed fix
- // Must find at least the planted positions - EXPECT_GE(kmp.size(), 6u); + // Exactly the planted positions (no character in "XYZW" matches the 'a' background) + EXPECT_EQ(kmp.size(), 6u);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Tests/string_search_test.cc` around lines 168 - 198, The test uses EXPECT_GE(kmp.size(), 6u) which allows extra (incorrect) matches; change that assertion to EXPECT_EQ(kmp.size(), 6u) in the TEST(StringSearch, StressLongRandomStringAllAlgorithmsAgree) so the number of matches reported by kmp_search (and by earlier cross-checks with z_search, boyer_moore_horspool_search, rabin_karp_search) must equal the six planted positions; update the single EXPECT_GE line to EXPECT_EQ and keep the rest of the equality checks as-is to ensure strict agreement.
43-54:expect_array_eqhelper is duplicated across test files.This identical helper also appears in
Tests/suffix_structures_test.cc. Consider extracting it into a shared test utility header (e.g.,Tests/test_helpers.H) to reduce duplication across the new string test suites.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Tests/string_search_test.cc` around lines 43 - 54, Extract the duplicated expect_array_eq helper into a shared header (e.g., Tests/test_helpers.H): move the implementation of expect_array_eq(const Array<size_t>&, std::initializer_list<size_t>) out of the anonymous namespace and into that header, ensure the header includes the necessary headers (gtest, Array definition, <initializer_list>), and make the function either inline or placed in a named namespace (e.g., test_helpers::expect_array_eq) to avoid ODR/multiple definition issues; then include Tests/test_helpers.H from both Tests/string_search_test.cc and Tests/suffix_structures_test.cc and remove the local duplicate definitions.Tests/suffix_structures_test.cc (1)
184-203: UseEXPECT_LTinstead ofEXPECT_LEfor suffix-array ordering verification.All suffixes of a string are distinct (they have different lengths), so the suffix array must be strictly ordered.
EXPECT_LEmasks a potential bug where the SA contains duplicate indices or compares equal erroneously. The same applies to theSuffixArrayLongerTexttest at Line 127, which correctly usesEXPECT_LT.Proposed fix
- EXPECT_LE(prev_suffix, curr_suffix); + EXPECT_LT(prev_suffix, curr_suffix);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Tests/suffix_structures_test.cc` around lines 184 - 203, In the TEST named StressLongerSuffixArray replace the non-strict comparison EXPECT_LE(prev_suffix, curr_suffix) with EXPECT_LT(prev_suffix, curr_suffix) to assert strict ordering of distinct suffixes; locate the assertion in the test that builds text, computes sa via suffix_array(text), and compares prev_suffix and curr_suffix (variables prev_suffix, curr_suffix, and sa) and change to EXPECT_LT, mirroring the existing behavior used in the SuffixArrayLongerText test.Tests/string_palindromes_test.cc (1)
107-112: Test nameManacherPalindromeInMiddleis slightly misleading.
"xxxracecarxxx"is itself a palindrome (symmetric around center), so the test validates whole-string detection rather than a palindrome embedded in non-palindromic surroundings. Consider renaming or using a string like"abcracecarxyz"if the intent is to test a palindrome strictly in the middle.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Tests/string_palindromes_test.cc` around lines 107 - 112, The test named ManacherPalindromeInMiddle uses "xxxracecarxxx" which is a full-string palindrome; either rename the TEST to reflect whole-string detection (e.g., ManacherWholeStringPalindrome) or change the input to a true embedded case like "abcracecarxyz" and update expectations accordingly (call manacher("abcracecarxyz") and assert r.longest_palindrome == "racecar" and r.longest_length == 7u); update the TEST identifier or the string and expected values in Tests/string_palindromes_test.cc, referencing the TEST declaration and the manacher(...) result fields r.longest_palindrome and r.longest_length.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@Aho_Corasick.H`:
- Around line 120-129: Update the Doxygen blocks for the public methods
add_pattern, search, and pattern to include `@throws` tags documenting the
exceptions raised by the Aleph error macros; for example, add_pattern should
include `@throws` std::domain_error (or the specific Aleph exception) for
empty-pattern insertion, search should document exceptions it can raise when the
automaton is not built or on invalid input, and pattern (accessor) should list
exceptions for out-of-range or invalid id access—ensure each public API has
`@brief`, `@param`, `@return`, and a clear `@throws` line describing the exception type
and the condition that triggers it.
In `@README.es.md`:
- Around line 505-513: Reemplaza los términos en inglés por equivalentes en
español para mantener consistencia: cambia "Headers principales:" por "Cabeceras
principales:" (o "Encabezados principales:") y reemplaza "include paraguas" por
una expresión en español como "archivo paraguas" o "cabecera paraguas" en la
línea que lista `String_Algorithms.H`; conserva los nombres de archivos
(`String_Search.H`, `Aho_Corasick.H`, `Suffix_Structures.H`,
`String_Palindromes.H`, `String_DP.H`, `String_Algorithms.H`) tal cual y
actualiza solo las etiquetas descriptivas alrededor de esos símbolos.
In `@Suffix_Structures.H`:
- Around line 401-739: Add complete Doxygen for each public API by augmenting
existing brief comments for Naive_Suffix_Tree methods (contains, find_all,
node_count, text_size, nodes, build, constructor) and Suffix_Automaton methods
(extend, build, contains, distinct_substring_count, longest_common_substring,
state_count, states, clear, mark_terminals if made public) to include `@param` for
each parameter (e.g., text, pattern, other), `@return` describing return values
(including empty/edge cases), `@throws` for any precondition failures or
exceptions (or state that none are thrown), and `@par` Complexity with Big-O
notation; keep descriptions concise and reference semantics in the existing
comments (e.g., mention original_size_ for text_size, that find_all returns
sorted suffix positions and behavior on empty pattern), ensuring each public
method has `@brief`, `@param`, `@return`, `@throws`, and complexity entries added to its
docblock.
In `@tpl_sort_utils.H`:
- Around line 4539-4576: Remove the std::vector public overloads: delete or move
counting_sort(std::vector<T>&) and its helper
counting_sort_impl(std::vector<IntT>&) out of the core public header so the core
API only exposes Aleph containers; instead implement the sorting logic as (or
reuse) counting_sort_impl that accepts Aleph containers (e.g., DynArray<IntT>&
or Array<IntT>&) and update callers to use the Aleph adapter (or provide a
separate adapter-layer function converting std::vector to DynArray outside the
core header) so the core header no longer exposes std::vector in its public API.
- Around line 4358-4364: The template parameter KeyFn on counting_sort_indices
must be constrained to ensure it is callable with a size_t and returns something
convertible to int; update the template declaration for counting_sort_indices to
add a C++20 requires clause or concept that enforces std::invocable<KeyFn,
size_t> and that std::invoke_result_t<KeyFn, size_t> is std::convertible_to<int>
(or create a small concept alias like KeyFnForIndex). This change targets the
counting_sort_indices template signature so the compiler enforces the callable
input type and return convertibility at compile time.
---
Outside diff comments:
In `@Tests/sort_utils.cc`:
- Around line 40-122: The PR uses std::sort in multiple places; update all calls
to use std::ranges::sort instead (e.g., replace std::sort(ret.begin(),
ret.end()) in dynlist_node_addresses and dyndlist_node_addresses and every other
occurrence). Ensure the file includes the proper headers for ranges-algorithms
if needed and keep the same iterator arguments (std::ranges::sort accepts the
same iterator ranges). Optionally, for dynlist_node_addresses and
dyndlist_node_addresses consider returning DynArray<uintptr_t> (or reserve the
std::vector before push_back) to match project conventions; if you switch to
DynArray, adapt push_back logic to indexing and reserve appropriately.
---
Nitpick comments:
In `@Examples/rabin_karp_example.cc`:
- Around line 51-57: Replace the chained iostream output with type-safe
std::format-based messages for the header and the text/pattern lines (use
std::format and include <format>), and replace the index loop over matches with
a range-based iteration (for (auto idx : matches)) or use a
join-style/algorithmic approach to print space-separated positions; ensure you
emit a newline even when matches is empty by printing a terminating '\n' after
the sequence or handling the empty case explicitly; reference variables: text,
pattern, matches.
In `@Examples/z_algorithm_example.cc`:
- Around line 52-62: Replace the iostream-based printing with std::format and a
small ranges-based printer helper: create a helper function (e.g., print_range)
that takes a range (used for z and matches) and a separator/terminator, then use
std::format to produce the header lines for "Z-Algorithm Example", "Text", and
"Pattern" (referencing variables text and pattern) and call print_range(z, ' ')
and print_range(matches, ' ') to print the sequences; also use range-based
iteration or std::ranges algorithms inside the helper to avoid manual index
loops and ensure consistent spacing/termination.
In `@Tests/aho_corasick_test.cc`:
- Around line 45-216: Add two small unit tests to Tests/aho_corasick_test.cc:
implement TEST(AhoCorasick, EmptyText) that creates Aho_Corasick, adds a couple
patterns via add_pattern, calls build(), calls search("") and asserts no matches
(matches.size()==0 or matches.empty()), and implement TEST(AhoCorasick,
NoMatchesFound) that adds patterns, builds, searches a text with none of the
patterns (e.g., "xyzxyz"), and asserts no matches; reference Aho_Corasick,
add_pattern, build, and search in the new tests and use the existing
EXPECT_EQ/EXPECT_TRUE assertions style.
In `@Tests/string_palindromes_test.cc`:
- Around line 107-112: The test named ManacherPalindromeInMiddle uses
"xxxracecarxxx" which is a full-string palindrome; either rename the TEST to
reflect whole-string detection (e.g., ManacherWholeStringPalindrome) or change
the input to a true embedded case like "abcracecarxyz" and update expectations
accordingly (call manacher("abcracecarxyz") and assert r.longest_palindrome ==
"racecar" and r.longest_length == 7u); update the TEST identifier or the string
and expected values in Tests/string_palindromes_test.cc, referencing the TEST
declaration and the manacher(...) result fields r.longest_palindrome and
r.longest_length.
In `@Tests/string_search_test.cc`:
- Around line 168-198: The test uses EXPECT_GE(kmp.size(), 6u) which allows
extra (incorrect) matches; change that assertion to EXPECT_EQ(kmp.size(), 6u) in
the TEST(StringSearch, StressLongRandomStringAllAlgorithmsAgree) so the number
of matches reported by kmp_search (and by earlier cross-checks with z_search,
boyer_moore_horspool_search, rabin_karp_search) must equal the six planted
positions; update the single EXPECT_GE line to EXPECT_EQ and keep the rest of
the equality checks as-is to ensure strict agreement.
- Around line 43-54: Extract the duplicated expect_array_eq helper into a shared
header (e.g., Tests/test_helpers.H): move the implementation of
expect_array_eq(const Array<size_t>&, std::initializer_list<size_t>) out of the
anonymous namespace and into that header, ensure the header includes the
necessary headers (gtest, Array definition, <initializer_list>), and make the
function either inline or placed in a named namespace (e.g.,
test_helpers::expect_array_eq) to avoid ODR/multiple definition issues; then
include Tests/test_helpers.H from both Tests/string_search_test.cc and
Tests/suffix_structures_test.cc and remove the local duplicate definitions.
In `@Tests/suffix_structures_test.cc`:
- Around line 184-203: In the TEST named StressLongerSuffixArray replace the
non-strict comparison EXPECT_LE(prev_suffix, curr_suffix) with
EXPECT_LT(prev_suffix, curr_suffix) to assert strict ordering of distinct
suffixes; locate the assertion in the test that builds text, computes sa via
suffix_array(text), and compares prev_suffix and curr_suffix (variables
prev_suffix, curr_suffix, and sa) and change to EXPECT_LT, mirroring the
existing behavior used in the SuffixArrayLongerText test.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (34)
Aho_Corasick.HCMakeLists.txtExamples/CMakeLists.txtExamples/aho_corasick_example.ccExamples/damerau_levenshtein_example.ccExamples/edit_distance_example.ccExamples/horspool_example.ccExamples/kmp_example.ccExamples/lcs_longest_common_substring_example.ccExamples/manacher_example.ccExamples/min_cost_matching_example.ccExamples/rabin_karp_example.ccExamples/suffix_array_lcp_example.ccExamples/suffix_automaton_example.ccExamples/suffix_tree_example.ccExamples/z_algorithm_example.ccREADME.es.mdREADME.mdString_Algorithms.HString_DP.HString_Palindromes.HString_Search.HSuffix_Structures.HTests/aho_corasick_test.ccTests/sort_utils.ccTests/string_algorithms_test.ccTests/string_dp_test.ccTests/string_palindromes_test.ccTests/string_search_test.ccTests/suffix_structures_test.ccah-concepts.Hah-dry.Hah-string-utils.Htpl_sort_utils.H
Refactor string utility functions, improve `Aho_Corasick` and `Suffix_Structures` documentation and robustness. Introduce `test_helpers.H` for better test readability. Add new stress tests for `Manacher` and `Rabin-Karp` algorithms to ensure correctness with complex and adversarial inputs. Update examples to use modern C++ formatting.
|
Note Docstrings generation - SUCCESS |
Docstrings generation was requested by @lrleon. * #37 (comment) The following files were modified: * `Examples/damerau_levenshtein_example.cc` * `Examples/edit_distance_example.cc` * `Examples/horspool_example.cc` * `Examples/kmp_example.cc` * `Examples/lcs_longest_common_substring_example.cc` * `Examples/manacher_example.cc` * `Examples/rabin_karp_example.cc` * `Examples/suffix_array_lcp_example.cc` * `Examples/suffix_automaton_example.cc` * `Examples/suffix_tree_example.cc` * `Examples/z_algorithm_example.cc` * `Tests/ah-string-utils.cc` * `Tests/sort_utils.cc`
There was a problem hiding this comment.
Actionable comments posted: 8
♻️ Duplicate comments (2)
README.es.md (1)
505-513: Past review suggestions addressed — "Cabeceras principales" and "cabecera paraguas" are now in place.The Spanish terminology is consistent. The LanguageTool punctuation warnings on the colon-prefixed markdown list items are false positives for this formatting style.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@README.es.md` around lines 505 - 513, The review indicates the Spanish README update already addresses previous suggestions and that the terminology and punctuation are acceptable; no code changes are necessary—keep the section headings and items as written (String_Search.H, Aho_Corasick.H, Suffix_Structures.H, String_Palindromes.H, String_DP.H, String_Algorithms.H) and proceed to merge/approve the PR.tpl_sort_utils.H (1)
4358-4364: ConstrainKeyFnwith C++20 concepts (std::invocable) for clearer compile-time guarantees.
counting_sort_indicesrelies onkey_of(size_t)returning anint-like key. ConstrainingKeyFnimproves diagnostics and prevents accidental misuse.Proposed change
-template <class KeyFn> +template <std::invocable<size_t> KeyFn> + requires std::convertible_to<std::invoke_result_t<KeyFn, size_t>, int> void counting_sort_indices(Array<size_t> & sa, Array<size_t> & tmp, const size_t n, const int min_key, const int max_key, const KeyFn & key_of)#!/bin/bash set -euo pipefail # Verify whether <concepts> / std::invocable helpers are available via current includes/toolchain fd -a '^ah-concepts\.H$' . -x sed -n '1,120p' {} fd -a '^tpl_sort_utils\.H$' . -x rg -n --no-heading -C2 'template <class KeyFn>|counting_sort_indices' {}As per coding guidelines: Use concepts (C++20) to constrain template parameters; avoid SFINAE when possible
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tpl_sort_utils.H` around lines 4358 - 4364, Constrain the KeyFn template using C++20 concepts: add the necessary includes (<concepts> and <type_traits> or <utility>) and change the template declaration for counting_sort_indices to require that KeyFn is invocable with a size_t and that its result is convertible to int (e.g., use requires std::invocable<KeyFn, size_t> && std::convertible_to<std::invoke_result_t<KeyFn, size_t>, int>). This keeps the function signature and body (counting_sort_indices, parameters sa/tmp/n/min_key/max_key/key_of) intact but enforces at compile time the expected callable shape and return type.
🧹 Nitpick comments (6)
Aho_Corasick.H (2)
80-92: Consider usingstd::array<int, 256>andfillinstead of C-style array with manual loop.Per coding guidelines, prefer
std::arraywhen size is known at compile-time.♻️ Suggested change
+ `#include` <array> + struct Node { - int next[256]; + std::array<int, 256> next; int fail = 0; Array<size_t> output; - Node() - { - for (size_t i = 0; i < 256; ++i) - next[i] = -1; - } + Node() : next() + { + next.fill(-1); + } };As per coding guidelines, "Use
std::arrayinstead of C-style arrays when size is known at compile-time."🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Aho_Corasick.H` around lines 80 - 92, The Node struct currently uses a C-style int next[256] and manual loop in Node() to initialize it; change next to a std::array<int, 256> (keep the name Node::next) and replace the manual initialization loop with std::fill or std::array's fill method in the constructor (or initialize inline with {-1} if preferred) so the fixed-size buffer uses std::array and is zero/ -1-initialized per guidelines; update any access to next[i] to continue working unchanged.
255-258:contains_anyperforms a full search instead of short-circuiting on first match.The current implementation collects all matches via
search()just to check if any exist. For large texts with many matches this is wasteful. A dedicated early-exit loop would be more efficient.♻️ Proposed optimization
[[nodiscard]] bool contains_any(const std::string_view text) const { - return not search(text).is_empty(); + ah_runtime_error_unless(built_) + << "Aho_Corasick::contains_any(): call build() before contains_any()"; + + size_t state = 0; + for (size_t i = 0; i < text.size(); ++i) + { + const unsigned char c = static_cast<unsigned char>(text[i]); + state = static_cast<size_t>(nodes_[state].next[c]); + if (not nodes_[state].output.is_empty()) + return true; + } + return false; }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Aho_Corasick.H` around lines 255 - 258, contains_any currently calls search(text).is_empty(), which forces a full search; change it to iterate the search results and return true on the first match to enable early exit. Replace the body of contains_any to obtain the search result (via search(text)) and loop over its matches (using the result's iterator/next/for_each style API) returning true immediately when a match is seen and false if the iterator finishes; reference contains_any and search so you locate and modify the existing implementation.Tests/string_search_test.cc (1)
45-53: Duplicatedexpect_array_eq— prefer the shared helper fromtest_helpers.H.This local definition is identical to the one in
Tests/test_helpers.H. Using the shared version avoids divergence if the helper is updated later.♻️ Suggested change
Replace the local anonymous namespace helper with:
# include <String_Search.H> +# include "test_helpers.H" using namespace Aleph; +using namespace Aleph::test_helpers; -namespace -{ - void expect_array_eq(const Array<size_t> & a, - std::initializer_list<size_t> expected) - { - ASSERT_EQ(a.size(), expected.size()); - size_t i = 0; - for (const size_t value : expected) - EXPECT_EQ(a[i++], value); - } -}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Tests/string_search_test.cc` around lines 45 - 53, The test defines a local duplicate helper expect_array_eq(const Array<size_t>&, std::initializer_list<size_t>) that is identical to the shared one in Tests/test_helpers.H; remove the local anonymous-namespace definition of expect_array_eq from Tests/string_search_test.cc and instead include or import the shared helper by adding the appropriate include for test_helpers.H (or using the existing test helper header) so the test uses the single canonical expect_array_eq implementation.Suffix_Structures.H (1)
268-283: Consider allowing'\0'as a terminal marker (or clarify why it’s excluded).
choose_terminal()only considers byte values[1..255]. For binary inputs containing all bytes1..255but not0, construction currently throws even though0would be available. If excluding0is intentional (e.g., to avoid C-string pitfalls elsewhere), it’d be good to document that rationale explicitly.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Suffix_Structures.H` around lines 268 - 283, The choose_terminal function currently skips byte value 0 and throws if 1..255 are present; either allow '\0' by including byte 0 in the search or explicitly document/guard the decision. To fix, update choose_terminal (the function choose_terminal in Naive_Suffix_Tree / Suffix_Structures.H) to consider used[0] as a valid terminal (start the for-loop at i = 0) when '\0' is acceptable, or if '\0' must be excluded keep the current loop but add a clear comment and/or an assertion explaining why '\0' is forbidden and how callers must sanitize inputs; adjust error message accordingly to reflect whether only 1..255 were considered or all 256 bytes.Tests/aho_corasick_test.cc (1)
127-187: Consider a larger deterministic stress case as a performance-regression guard.
LargePatternSet/LongTextSearchare a good start, but a “stress” test (e.g., a few thousand patterns + a few hundred KB of text) can better catch accidental O(n·m) regressions while staying deterministic (avoid wall-clock timing asserts).
Based on learnings: Applies to Tests/**/*_test.cc : Include performance regression tests for critical algorithms🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Tests/aho_corasick_test.cc` around lines 127 - 187, Add a deterministic stress test (e.g., TEST(AhoCorasick, DeterministicStress)) in Tests/aho_corasick_test.cc that programmatically generates a few thousand patterns (use a fixed algorithm/seed to produce unique strings via Aho_Corasick::add_pattern), calls Aho_Corasick::build(), constructs a few-hundred-KB deterministic text that contains known inserted patterns at fixed offsets, runs Aho_Corasick::search(), and asserts correctness by checking match counts and verifying each match via Aho_Corasick::pattern(match.pattern_id) and text.substr(match.position, pat.size()) == pat; avoid any wall-clock timing asserts so this serves as a deterministic performance-regression guard.tpl_sort_utils.H (1)
4642-4685: Addrequires radix_integer<IntT>to bothradix_sort_imploverloads for consistency.The
radix_sort_impl(DynList<IntT>&)andradix_sort_impl(DynDlist<IntT>&)overloads already haverequires radix_integer<IntT>, but the genericradix_sort_impl(C<IntT>&)andradix_sort_impl(IntT*, size_t)overloads do not. This mirrors the pattern incounting_sort_impl, where all overloads are constrained. Adding the constraint provides better compile-time error messages if these impl functions are reused elsewhere and aligns with the coding guideline to use concepts for template parameter constraints.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tpl_sort_utils.H` around lines 4642 - 4685, The two unconstrained template overloads radix_sort_impl(C<IntT>&) and radix_sort_impl(IntT*, size_t) must be constrained with requires radix_integer<IntT> to match the other overloads (the DynList/DynDlist variants) and provide clearer compile-time errors; update the template declarations for the generic container overload (template <template <typename> class C, typename IntT>) and the raw-pointer overload (template <typename IntT>) to add requires radix_integer<IntT>.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@ah-string-utils.H`:
- Around line 518-523: The boolean expression initializing const bool invalid
mixes C-style tokens (||, &&, !) with alternative tokens (or, and, not); update
the expression in the invalid initializer (the const bool invalid in this file)
to use the alternative tokens consistently (replace || with or, && with and, and
! with not) so all logical operations use and/or/not and preserve the same
grouping and semantics.
In `@Suffix_Structures.H`:
- Around line 76-89: The current insertion_sort(Array<size_t>& a) in
Suffix_Structures.H is O(k²) and is used by Naive_Suffix_Tree::find_all() via
insertion_sort(matches), conflicting with the documented O(k log k); replace the
hand-rolled insertion_sort implementation with the existing O(k log k) sorter
(e.g., call quicksort_op(matches) from tpl_sort_utils.H or std::sort on the
underlying container) so find_all() uses an O(k log k) sort; update the function
body of insertion_sort (and any direct callers like Naive_Suffix_Tree::find_all)
to invoke the chosen efficient sort to preserve API but ensure O(k log k)
complexity.
In `@Tests/aho_corasick_test.cc`:
- Around line 169-174: The comment above the test in LongTextSearch is
inaccurate: it says the text is built with "hay" repeated but the loop appends
"stack" 2000 times; update the comment to accurately describe the test data
(e.g., "Build a 10K text with 'stack' repeated and a few 'needle' insertions")
so readers aren’t misled by the mismatch between the comment and the code in
Tests/aho_corasick_test.cc within the LongTextSearch test.
- Around line 45-62: In the TEST named ClassicDictionary, change the size
assertion on the search result from EXPECT_EQ(matches.size(), 4u) to
ASSERT_EQ(matches.size(), 4u) so the test aborts immediately if matches has the
wrong length and prevents out-of-bounds indexing of matches[0]..matches[3];
locate the Aho_Corasick ac build and const auto matches = ac.search("ahishers")
lines and update that single assertion accordingly.
In `@Tests/sort_utils.cc`:
- Around line 1637-1647: The test named
SortUtilsCountingSortGeneric.std_vector_support is misleading because it uses
make_dynarray (DynArray) instead of std::vector; either rename the test to
reflect DynArray usage (e.g., change the TEST name from std_vector_support to
dynarray_support) or add a separate test that constructs a std::vector (e.g.,
std::vector<int> v = {9, -1, 3, 0, -1, 2}; counting_sort(v); ...) to exercise
the std::vector overload of counting_sort; update assertions to target the new
variable names accordingly and keep the existing assertions for ordering and
size.
In `@Tests/test_helpers.H`:
- Around line 1-22: Add the standard MIT license header and a Doxygen file-level
comment with an `@file` directive and brief description to the top of
test_helpers.H (above the include guard), matching the format used in other .H
headers; keep the existing include guard and namespace/expect_array_eq
implementation intact. Then remove the duplicate expect_array_eq definition from
string_search_test.cc and replace it with an `#include` "test_helpers.H" so the
test uses the shared expect_array_eq helper.
In `@tpl_sort_utils.H`:
- Around line 4381-4397: In counting_sort_indices() the expressions
static_cast<size_t>(key - min_key) can trigger signed int overflow before the
cast; compute the difference in a widened signed or unsigned type first (e.g.,
long or ptrdiff_t) then cast to size_t, and apply this to both uses (the
increment ++count[...] loop and the idx computation in the final loop where
tmp[--count[idx]] = sa[...] is used). Keep the existing ah_out_of_range_error_if
check, but replace key - min_key with a safe widened subtraction (e.g., auto
diff = static_cast<long>(key) - static_cast<long>(min_key); size_t idx =
static_cast<size_t>(diff);) to avoid UB and ensure correct bucket indexing.
- Around line 4795-4967: Several public overloads of counting_sort and
radix_sort (e.g., counting_sort(DynArray<T>&), counting_sort(Array<T>&),
counting_sort(DynList<T>&), counting_sort(DynDlist<T>&), counting_sort(T* a,
size_t n), counting_sort(T (&a)[N]) and the analogous radix_sort overloads) lack
required Doxygen tags; update each public template docblock to include `@tparam`
T, `@param` descriptions (e.g., "a" or "list" or "n" for array length), and for
pointer overloads add `@exception` or `@throws` runtime_error when a is null and
n>0, and keep the existing `@par` Complexity/@brief text — ensure the tag names
match the rest of the file’s style and apply these same additions to all listed
overloads so Doxygen warnings are resolved.
---
Duplicate comments:
In `@README.es.md`:
- Around line 505-513: The review indicates the Spanish README update already
addresses previous suggestions and that the terminology and punctuation are
acceptable; no code changes are necessary—keep the section headings and items as
written (String_Search.H, Aho_Corasick.H, Suffix_Structures.H,
String_Palindromes.H, String_DP.H, String_Algorithms.H) and proceed to
merge/approve the PR.
In `@tpl_sort_utils.H`:
- Around line 4358-4364: Constrain the KeyFn template using C++20 concepts: add
the necessary includes (<concepts> and <type_traits> or <utility>) and change
the template declaration for counting_sort_indices to require that KeyFn is
invocable with a size_t and that its result is convertible to int (e.g., use
requires std::invocable<KeyFn, size_t> &&
std::convertible_to<std::invoke_result_t<KeyFn, size_t>, int>). This keeps the
function signature and body (counting_sort_indices, parameters
sa/tmp/n/min_key/max_key/key_of) intact but enforces at compile time the
expected callable shape and return type.
---
Nitpick comments:
In `@Aho_Corasick.H`:
- Around line 80-92: The Node struct currently uses a C-style int next[256] and
manual loop in Node() to initialize it; change next to a std::array<int, 256>
(keep the name Node::next) and replace the manual initialization loop with
std::fill or std::array's fill method in the constructor (or initialize inline
with {-1} if preferred) so the fixed-size buffer uses std::array and is zero/
-1-initialized per guidelines; update any access to next[i] to continue working
unchanged.
- Around line 255-258: contains_any currently calls search(text).is_empty(),
which forces a full search; change it to iterate the search results and return
true on the first match to enable early exit. Replace the body of contains_any
to obtain the search result (via search(text)) and loop over its matches (using
the result's iterator/next/for_each style API) returning true immediately when a
match is seen and false if the iterator finishes; reference contains_any and
search so you locate and modify the existing implementation.
In `@Suffix_Structures.H`:
- Around line 268-283: The choose_terminal function currently skips byte value 0
and throws if 1..255 are present; either allow '\0' by including byte 0 in the
search or explicitly document/guard the decision. To fix, update choose_terminal
(the function choose_terminal in Naive_Suffix_Tree / Suffix_Structures.H) to
consider used[0] as a valid terminal (start the for-loop at i = 0) when '\0' is
acceptable, or if '\0' must be excluded keep the current loop but add a clear
comment and/or an assertion explaining why '\0' is forbidden and how callers
must sanitize inputs; adjust error message accordingly to reflect whether only
1..255 were considered or all 256 bytes.
In `@Tests/aho_corasick_test.cc`:
- Around line 127-187: Add a deterministic stress test (e.g., TEST(AhoCorasick,
DeterministicStress)) in Tests/aho_corasick_test.cc that programmatically
generates a few thousand patterns (use a fixed algorithm/seed to produce unique
strings via Aho_Corasick::add_pattern), calls Aho_Corasick::build(), constructs
a few-hundred-KB deterministic text that contains known inserted patterns at
fixed offsets, runs Aho_Corasick::search(), and asserts correctness by checking
match counts and verifying each match via
Aho_Corasick::pattern(match.pattern_id) and text.substr(match.position,
pat.size()) == pat; avoid any wall-clock timing asserts so this serves as a
deterministic performance-regression guard.
In `@Tests/string_search_test.cc`:
- Around line 45-53: The test defines a local duplicate helper
expect_array_eq(const Array<size_t>&, std::initializer_list<size_t>) that is
identical to the shared one in Tests/test_helpers.H; remove the local
anonymous-namespace definition of expect_array_eq from
Tests/string_search_test.cc and instead include or import the shared helper by
adding the appropriate include for test_helpers.H (or using the existing test
helper header) so the test uses the single canonical expect_array_eq
implementation.
In `@tpl_sort_utils.H`:
- Around line 4642-4685: The two unconstrained template overloads
radix_sort_impl(C<IntT>&) and radix_sort_impl(IntT*, size_t) must be constrained
with requires radix_integer<IntT> to match the other overloads (the
DynList/DynDlist variants) and provide clearer compile-time errors; update the
template declarations for the generic container overload (template <template
<typename> class C, typename IntT>) and the raw-pointer overload (template
<typename IntT>) to add requires radix_integer<IntT>.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (14)
Aho_Corasick.HExamples/rabin_karp_example.ccExamples/z_algorithm_example.ccREADME.es.mdSuffix_Structures.HTests/ah-string-utils.ccTests/aho_corasick_test.ccTests/sort_utils.ccTests/string_palindromes_test.ccTests/string_search_test.ccTests/suffix_structures_test.ccTests/test_helpers.Hah-string-utils.Htpl_sort_utils.H
🚧 Files skipped from review as they are similar to previous changes (2)
- Examples/z_algorithm_example.cc
- Tests/string_palindromes_test.cc
Refactored sort_utils.cc with modern C++ features and expanded test coverage for various sorting algorithms, including introsort, counting sort, and radix sort. Also includes a fix for Suffix Automaton cloning link invariant.
Docstrings generation was requested by @lrleon. * #37 (comment) The following files were modified: * `Examples/damerau_levenshtein_example.cc` * `Examples/edit_distance_example.cc` * `Examples/horspool_example.cc` * `Examples/kmp_example.cc` * `Examples/lcs_longest_common_substring_example.cc` * `Examples/manacher_example.cc` * `Examples/rabin_karp_example.cc` * `Examples/suffix_array_lcp_example.cc` * `Examples/suffix_automaton_example.cc` * `Examples/suffix_tree_example.cc` * `Examples/z_algorithm_example.cc` * `Tests/ah-string-utils.cc` * `Tests/sort_utils.cc` Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
Tests/suffix_structures_test.cc (1)
197-216: Nice to have: add a deterministic performance regression check for critical string primitives.
StressLongerSuffixArrayvalidates correctness at scale, but it does not guard runtime regressions over time.Based on learnings, "Applies to Tests/**/*_test.cc : Include performance regression tests for critical algorithms".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Tests/suffix_structures_test.cc` around lines 197 - 216, Augment the existing StressLongerSuffixArray test to include a deterministic performance regression check: before calling suffix_array(text) record a high-resolution timestamp (std::chrono::steady_clock), call suffix_array(text), record the end time and compute elapsed milliseconds, then assert the elapsed time is below a chosen threshold with an EXPECT_LE/ASSERT_LT so the test fails on regressions; keep the workload deterministic (reuse the current text generation) and include a brief comment explaining the threshold and rationale so future reviewers see why the timing assertion is present.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@Suffix_Structures.H`:
- Around line 176-178: Update the Doxygen for lcp_array_kasai() to accurately
reflect runtime checks: add a `@throws` entry stating that a domain_error (or
appropriate exception) is thrown when sa contains duplicate indices (in addition
to the existing sa[i] >= text.size() case), and ensure the function's docblock
includes `@brief`, `@param`, `@return` and the expanded `@throws` tags so the
documentation matches the behavior performed in the function (lcp_array_kasai).
In `@Tests/suffix_structures_test.cc`:
- Around line 60-73: Add a unit test to verify lcp_array_kasai rejects suffix
array entries that are out-of-range: create a test (e.g., TEST(SuffixStructures,
LcpKasaiRejectsOutOfRange)) that builds an SA containing at least one index >=
text.size() (for example for "abc" use {0,1,3}) and assert that calling
lcp_array_kasai("abc", sa) throws std::out_of_range; ensure you reference
lcp_array_kasai and the SA construction (Array<size_t> / build_array<size_t>)
consistent with the other tests.
---
Nitpick comments:
In `@Tests/suffix_structures_test.cc`:
- Around line 197-216: Augment the existing StressLongerSuffixArray test to
include a deterministic performance regression check: before calling
suffix_array(text) record a high-resolution timestamp
(std::chrono::steady_clock), call suffix_array(text), record the end time and
compute elapsed milliseconds, then assert the elapsed time is below a chosen
threshold with an EXPECT_LE/ASSERT_LT so the test fails on regressions; keep the
workload deterministic (reuse the current text generation) and include a brief
comment explaining the threshold and rationale so future reviewers see why the
timing assertion is present.
Rethink exception handling for lcp_array_kasai, adding a check for duplicate indices in the suffix array. Add new test cases to cover out-of-range and duplicate index errors in lcp_array_kasai.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
Tests/suffix_structures_test.cc (2)
143-150: Nice to have: extract duplicated suffix-order assertion into a helper.The lexicographic-order verification loop appears twice; factoring it into one helper will reduce drift and simplify future updates.
As per coding guidelines, "Refactor duplicated code into reusable functions or templates".
Also applies to: 225-232
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Tests/suffix_structures_test.cc` around lines 143 - 150, Extract the duplicated lexicographic-order check into a single helper (e.g., assert_suffixes_sorted or template assert_suffixes_sorted_view) that takes the text and the suffix array (sa) and performs the loop comparing adjacent suffixes using std::string_view and EXPECT_LT; replace both occurrences (the loop using prev_suffix/curr_suffix around EXPECT_LT) with calls to that helper to remove duplication and centralize behavior.
48-271: Should fix: add seeded randomized/property cross-checks for the core string structures.For this complexity level, add deterministic randomized tests (fixed seed) that cross-check:
suffix_arrayordering against naive suffix sorting,lcp_array_kasaiagainst naive LCP computation,Suffix_Automaton::containsagainststd::string::findon sampled substrings.Based on learnings: Applies to
Tests/**/*_test.cc: Use property-based testing or randomized tests for complex algorithms.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Tests/suffix_structures_test.cc` around lines 48 - 271, Add deterministic randomized cross-check tests (use a fixed seed) that validate suffix_array ordering against a naive sorted-suffix implementation, lcp_array_kasai against a naive LCP computation, and Suffix_Automaton::contains against std::string::find on sampled substrings; implement three new TESTs (e.g., RandomizedSuffixArray, RandomizedLcpKasai, RandomizedSuffixAutomatonContains) which generate random texts (bounded length), compute the library result via suffix_array/lcp_array_kasai/Suffix_Automaton::build+contains, compute the expected result via straightforward naive routines (sort all suffixes or compute LCP by direct comparisons, and check substring presence with std::string::find), assert equality for each sample, and ensure the RNG uses a fixed seed so tests are deterministic and repeatable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@Tests/suffix_structures_test.cc`:
- Around line 39-41: The test contains a hard wall-clock assertion that enforces
duration < 500ms (the timing check around the duration measurement in
Tests/suffix_structures_test.cc); replace this flaky default-path check by
gating it behind an opt-in mechanism: read an env var (e.g. ENABLE_SLOW_TESTS or
RUN_PERF_TESTS) and only perform the <500ms assertion when that var is set,
otherwise skip the timing assertion (or convert it to a non-failing log). Update
the code around the duration measurement and the assertion (the test case that
measures duration and currently asserts < 500ms) to consult the env flag before
asserting, or move the check into a separate benchmark-only target.
---
Nitpick comments:
In `@Tests/suffix_structures_test.cc`:
- Around line 143-150: Extract the duplicated lexicographic-order check into a
single helper (e.g., assert_suffixes_sorted or template
assert_suffixes_sorted_view) that takes the text and the suffix array (sa) and
performs the loop comparing adjacent suffixes using std::string_view and
EXPECT_LT; replace both occurrences (the loop using prev_suffix/curr_suffix
around EXPECT_LT) with calls to that helper to remove duplication and centralize
behavior.
- Around line 48-271: Add deterministic randomized cross-check tests (use a
fixed seed) that validate suffix_array ordering against a naive sorted-suffix
implementation, lcp_array_kasai against a naive LCP computation, and
Suffix_Automaton::contains against std::string::find on sampled substrings;
implement three new TESTs (e.g., RandomizedSuffixArray, RandomizedLcpKasai,
RandomizedSuffixAutomatonContains) which generate random texts (bounded length),
compute the library result via
suffix_array/lcp_array_kasai/Suffix_Automaton::build+contains, compute the
expected result via straightforward naive routines (sort all suffixes or compute
LCP by direct comparisons, and check substring presence with std::string::find),
assert equality for each sample, and ensure the RNG uses a fixed seed so tests
are deterministic and repeatable.
Refactor sort utility error handling and update documentation for Aho-Corasick and String_Palindromes.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tpl_sort_utils.H`:
- Around line 5063-5074: The Doxygen `@throws` entry for the radix_sort(T* a,
size_t n) pointer overload is incorrect: the code uses ah_invalid_argument_if
(which throws std::invalid_argument) but the comment lists std::runtime_error;
update the documentation to match the implementation by changing the `@throws`
line to std::invalid_argument for the null-pointer check in the radix_sort
template (referencing the radix_sort function and its ah_invalid_argument_if
call).
- Around line 4943-4955: The Doxygen `@throws` is incorrect: counting_sort uses
ah_invalid_argument_if (which throws std::invalid_argument) for the null-pointer
check; update the documentation for counting_sort to list std::invalid_argument
(not std::runtime_error) for the "a is null and n > 0" case, referencing the
counting_sort template and the ah_invalid_argument_if check so readers and docs
match the implementation.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
Aho_Corasick.HString_Palindromes.HTests/sort_utils.ccTests/suffix_structures_test.cctpl_sort_utils.H
🚧 Files skipped from review as they are similar to previous changes (1)
- Tests/suffix_structures_test.cc
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 39 out of 39 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (4)
tpl_avl.H:1
- The
do { ... } while (p != Node::NullPtr);structure executes the body once even whenp == Node::NullPtr(e.g., empty tree). With the new unconditional__builtin_prefetch(LLINK(p))/RLINK(p), this can dereference a null/sentinel pointer via theLLINK/RLINKmacro expansion and trigger UB. Restructure the loop to only prefetch after verifyingp != Node::NullPtr(e.g., use awhile (p != Node::NullPtr)loop), or guard the prefetch calls soLLINK/RLINKare never evaluated for a null node.
tpl_avlRk.H:1 - Same issue as in
tpl_avl.H: thisdo { ... } while (p != Node::NullPtr);executes once even ifpstarts asNode::NullPtr, and the new unconditional prefetch evaluatesLLINK(p)/RLINK(p)before any null check, which can be UB for an empty tree. Convert to awhile (p != Node::NullPtr)descent (or guard prefetch) so child-link expressions are never evaluated whenpis null.
Tests/suffix_structures_test.cc:1 - This test file later uses
std::getenvandstd::cout(perf-gated logging), but the corresponding standard headers (<cstdlib>and<iostream>) are not included here. Relying on transitive includes can be toolchain-dependent; add explicit includes for the standard facilities used in this TU.
ah-dry.H:1 - The inline code formatting is broken on this line (missing closing backtick after
insert()), and the sentence reads ungrammatically ('If all on the container...'). Please fix the backticks and rephrase the sentence for clarity (e.g., 'insert()of the container. If all of the container andT...').
There was a problem hiding this comment.
🧹 Nitpick comments (1)
Suffix_Structures.H (1)
711-718: Pre-reserve SAM state capacity inbuild()to avoid repeated reallocations.At Line 711,
extend()is called per character and can trigger repeated growth/copies. Reserving up front makes the linear build path tighter on large inputs.♻️ Proposed change
void build(const std::string_view text) { clear(); + states_.reserve(2 * text.size() + 1); for (const char c: text) extend(c); mark_terminals(); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Suffix_Structures.H` around lines 711 - 718, The build(const std::string_view text) path should pre-reserve SAM state capacity to avoid repeated reallocations: after clear() call states.reserve(2 * text.size()) (and similarly reserve any per-state transition container capacity if available) before the for-loop that calls extend(c); update references to the container holding SAM states (e.g., the vector named states or the class SuffixAutomaton's states member) so extend() can push without reallocation during linear construction.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@Suffix_Structures.H`:
- Around line 711-718: The build(const std::string_view text) path should
pre-reserve SAM state capacity to avoid repeated reallocations: after clear()
call states.reserve(2 * text.size()) (and similarly reserve any per-state
transition container capacity if available) before the for-loop that calls
extend(c); update references to the container holding SAM states (e.g., the
vector named states or the class SuffixAutomaton's states member) so extend()
can push without reallocation during linear construction.
Summary by CodeRabbit
New Features
Documentation
Examples
Tests