Add combinatorics, number theory, and streaming algorithm utilities #40
Add combinatorics enumeration, number theory, advanced sorting (bucket, timsort), and probabilistic streaming algorithms (reservoir sampling, Count-Min sketch, HyperLogLog, MinHash, SimHash). Update `ah-comb.H` for permutations/combinations and `tpl_sort_utils.H` for sorting algorithms. Integrate new features with examples and tests. Refactor `ah-dry.H` for `EqualSequenceMethod`.
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. This behavior can be configured in the CodeRabbit settings, and the usual CodeRabbit commands can be used to manage reviews.
Walkthrough

Adds many new header-only algorithms (combinatorics, Gray code, sorting), number-theory modules (modular arithmetic, primality, Pollard-Rho, NTT, modular combinatorics/linalg), streaming sketches and a demo, multiple new examples, and extensive tests and README updates. Several new public classes/functions are introduced.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Stream as Stream
    participant Demo as StreamingDemo
    participant RS as Reservoir_Sampler
    participant CMS as Count_Min_Sketch
    participant HLL as HyperLogLog
    participant MH as MinHash
    participant SH as SimHash
    rect rgba(200,200,255,0.5)
        Stream->>Demo: provide token stream
    end
    rect rgba(200,255,200,0.5)
        Demo->>RS: update(items)
        RS-->>Demo: get_sample()
    end
    rect rgba(255,230,200,0.5)
        Demo->>CMS: update(items)
        CMS-->>Demo: estimate(freq)
    end
    rect rgba(255,200,255,0.5)
        Demo->>HLL: update(items)
        HLL-->>Demo: estimate(cardinality)
    end
    rect rgba(200,255,255,0.5)
        Demo->>MH: update(sets)
        MH-->>Demo: similarity_estimate()
        Demo->>SH: update(docs)
        SH-->>Demo: fingerprint_similarity()
    end
```
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes

🚥 Pre-merge checks: ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Pull request overview
This PR introduces significant enhancements to the Aleph-w library with advanced algorithms and utilities. The implementation adds two new major sorting algorithms (Bucket Sort and Timsort), improves the random selection algorithm with 3-way partitioning, and provides a comprehensive suite of number theory and streaming algorithms.
Changes:
- Implemented Bucket Sort and Timsort with support for multiple container types
- Enhanced random_select with iterative 3-way partitioning for better duplicate handling
- Added probabilistic streaming algorithms (HyperLogLog, MinHash, SimHash, Count-Min Sketch, Reservoir Sampling)
- Introduced number theory toolkit (modular arithmetic, primality testing, NTT, Pollard's Rho factorization)
- Extended combinatorics utilities with lexicographic enumeration and Gray code functions
Reviewed changes
Copilot reviewed 30 out of 30 changed files in this pull request and generated no comments.
Summary per file:
| File | Description |
|---|---|
| tpl_sort_utils.H | Core sorting utilities: Bucket Sort, Timsort implementations, and improved random_select with 3-way partitioning |
| tpl_dynArray.H | Updated inheritance from EqualToMethod to EqualSequenceMethod |
| tpl_array.H | Updated inheritance from EqualToMethod to EqualSequenceMethod |
| simhash.H | New SimHash class for cosine similarity estimation via bitwise fingerprints |
| reservoir-sampling.H | New Reservoir_Sampler for random sampling from streams |
| primality.H | Miller-Rabin deterministic primality testing for 64-bit integers |
| pollard_rho.H | Pollard's rho algorithm for integer factorization |
| ntt.H | Number Theoretic Transform for fast polynomial multiplication |
| modular_linalg.H | Modular matrix operations (determinant, inverse) over finite fields |
| modular_combinatorics.H | Modular binomial coefficients with precomputed factorials and Lucas theorem |
| modular_arithmetic.H | Safe 64-bit modular arithmetic primitives (multiplication, exponentiation, CRT) |
| minhash.H | MinHash for Jaccard similarity estimation between sets |
| hyperloglog.H | HyperLogLog cardinality estimator for counting distinct elements |
| count-min-sketch.H | Count-Min Sketch for frequency estimation in streams |
| ah-dry.H | New EqualSequenceMethod class for sequence container equality testing |
| ah-comb.H | Extended combinatorics: lexicographic permutation/combination enumeration, Gray codes |
| Tests/streaming_algorithms_test.cc | Comprehensive tests for all streaming algorithms |
| Tests/sort_utils.cc | Tests for random_select improvements, Bucket Sort, and Timsort |
| Tests/math_nt_test.cc | Tests for number theory algorithms |
| Tests/ah-comb.cc | Tests for new combinatorics utilities |
| Tests/CMakeLists.txt | Added new test programs |
| README.md | Documentation for new algorithms and features |
| README.es.md | Spanish documentation for combinatorics enumeration |
| Examples/*.cc | Example programs demonstrating new features |
| Examples/CMakeLists.txt | Added new example programs |
Comments suppressed due to low confidence (4)
tpl_sort_utils.H:2378
- The function parameter names `l` and `r` are ambiguous single-letter names. Consider renaming them to `left` and `right` for clarity.

```cpp
const T &__random_select(T *a, const long i, long l, long r,
```
ah-comb.H:54
- The documentation clarifies the dual meaning of 'permutation' but could be clearer by explicitly stating this distinction earlier in the file header to avoid confusion for users unfamiliar with the codebase.
@note `traverse_perm`/`for_each_perm`/`build_perms` use "permutation" in the
cartesian-product sense. `next_permutation` is the classical
lexicographic permutation of a single sequence.
tpl_sort_utils.H:4908
- Inconsistent spacing in range-based for loop. The colon should have spaces on both sides (`bucket : buckets`) to match the project's style elsewhere in the file.

```cpp
for (auto & bucket: buckets)
```
tpl_sort_utils.H:4939
- Inconsistent spacing in range-based for loop. The colon should have spaces on both sides (`bucket : buckets`) to match the project's style elsewhere in the file.

```cpp
for (auto & bucket: buckets)
```
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 049e044d46
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Actionable comments posted: 19
🧹 Nitpick comments (9)
simhash.H (2)
108-116: Inconsistent array access: prefer unchecked `operator()` for internal loops. Line 112 uses bounds-checked `v_[i]` while `update()` uses unchecked `v_(i)`. Since the loop bounds are known to be valid (0 to 63), use `v_(i)` consistently for better performance.

Proposed fix:

```diff
 [[nodiscard]] uint64_t get_fingerprint() const noexcept
 {
   uint64_t fingerprint = 0;
   for (int i = 0; i < 64; ++i)
-    if (v_[i] > 0)
+    if (v_(i) > 0)
       fingerprint |= (static_cast<uint64_t>(1) << i);
   return fingerprint;
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@simhash.H` around lines 108 - 116, The get_fingerprint() method uses bounds-checked array access v_[i] inside the loop while update() uses the unchecked accessor v_(i); replace v_[i] with v_(i) in get_fingerprint() so the internal loop consistently uses the unchecked operator() for indices 0..63 (keeping the existing loop bounds and noexcept signature) to avoid the extra bounds checks and match update().
72-75: Consider using a named constant for the fingerprint size. The value `64` appears in multiple places (constructor, update, get_fingerprint, clear). A static constexpr member would improve maintainability.

Proposed improvement:

```diff
 template <typename T>
 class SimHash
 {
+  static constexpr size_t kBits = 64;
   Array<double> v_; /**< Accumulator vector for each bit. */
 public:
   /** @brief Construct SimHash accumulator. */
   SimHash()
   {
-    v_ = Array<double>(64, 0.0);
+    v_ = Array<double>(kBits, 0.0);
   }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@simhash.H` around lines 72 - 75, Introduce a single named constant for the fingerprint size (e.g., static constexpr size_t FINGERPRINT_SIZE = 64) and replace all magic literals `64` in the SimHash class with that constant; update the SimHash() constructor (v_ = Array<double>(FINGERPRINT_SIZE, 0.0)), and any other methods referenced in the diff such as update, get_fingerprint, and clear to use SimHash::FINGERPRINT_SIZE instead of the literal so the size is maintained in one place.

Tests/sort_utils.cc (1)
2103-2120: Add regression cases for zero buckets and descending comparator in BucketSort tests. Current coverage is strong, but these two scenarios are still missing and would lock down the contract for the new overloads.

💡 Suggested additions:

```diff
+TEST(BucketSort, custom_bucket_key_zero_buckets_throws)
+{
+  int a[] = {3, 1, 2};
+  auto key = [](const int & v) -> size_t { return static_cast<size_t>(v); };
+  EXPECT_THROW((bucket_sort(a, size_t{3}, size_t{0}, key)), std::domain_error);
+}
+
+TEST(BucketSort, float_descending_comparator)
+{
+  float a[] = {0.1f, 2.0f, -1.5f, 0.3f, 2.0f};
+  bucket_sort(a, size_t{5}, Aleph::greater<float>());
+  EXPECT_TRUE(std::is_sorted(std::begin(a), std::end(a), std::greater<float>()));
+}
```

As per coding guidelines: "Tests must cover edge cases: empty inputs, single element, large inputs".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Tests/sort_utils.cc` around lines 2103 - 2120, Add two regression tests for BucketSort: one TEST named BucketSort.descending_comparator that calls bucket_sort with a descending comparator (e.g., std::greater<...>) on a small descending input and asserts the result is sorted ascending according to that comparator; and one TEST named BucketSort.zero_buckets_contract that exercises the new overload taking an explicit buckets count—assert that bucket_sort(nullptr, 0) does not throw (already present) and that calling bucket_sort on a non-null buffer with buckets == 0 throws std::invalid_argument to lock down the zero-buckets contract. Reference the bucket_sort symbol and the TEST macro names when adding these cases.

Examples/random_select_example.cc (1)
50-53: Nice to have: deduplicate `print_rule()` across examples. This helper is repeated in multiple example files; pulling it from `print_rule.H` keeps examples DRY.

♻️ Proposed refactor:

```diff
 #include <iostream>
 #include <iomanip>
 #include <tpl_sort_utils.H>
 #include <tpl_dynArray.H>
+#include <print_rule.H>
@@
 namespace
 {
-  void print_rule()
-  {
-    std::cout << "--------------------------------------------------------\n";
-  }
   template <typename Container>
   void print_array(const char * label, const Container & a)
   {
```

As per coding guidelines: "Refactor duplicated code into reusable functions or templates."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Examples/random_select_example.cc` around lines 50 - 53, The helper function print_rule() is duplicated across examples; remove the local definition in Examples/random_select_example.cc and instead include the shared declaration by adding `#include` "print_rule.H" (or the header that defines/declares print_rule()) and rely on the single shared implementation; ensure the example no longer defines print_rule() and that the build links against the shared implementation so Examples::random_select_example uses the centralized print_rule() function.

Tests/streaming_algorithms_test.cc (1)
50-160: Should fix: add explicit edge-case tests for stream APIs. Please add focused cases for empty/small inputs (e.g., `k=0` reservoir, tiny HLL cardinalities, empty MinHash/SimHash updates) to lock in contracts on boundaries.

As per coding guidelines: "Tests must cover edge cases: empty inputs, single element, large inputs."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Tests/streaming_algorithms_test.cc` around lines 50 - 160, Add explicit edge-case unit tests for the streaming APIs: for reservoir_sample add tests for k=0 with non-empty input and empty input iterators (verify sample.size()==0 and no crash), for Count_Min_Sketch add updates/estimates on empty sketch and single-element sketches (using Count_Min_Sketch::from_error_bounds or the constructor used in CMSSketchMerge) and assert estimates are 0 for unseen keys and correct for single updates, for HyperLogLog construct tiny/register-low HLLs (e.g., small p) and verify estimate for empty set is 0 and for single-element updates is ~1, and for MinHash and SimHash add tests that call update zero times (empty) and once (single element) and ensure similarity/estimate/fingerprint calls do not crash and return sensible deterministic values (e.g., similarity between identical single-element sketches is ~1 and empty comparisons behave consistently). Ensure tests reference reservoir_sample, Count_Min_Sketch, HyperLogLog, MinHash, and SimHash so the behavior at boundaries is explicitly asserted.

Examples/streaming_demo.cc (1)
84-87: Nice to have: compute "Actual" frequencies dynamically. The hardcoded "Actual" numbers are easy to desync if the demo data changes; deriving them from `stream` will keep the example self-consistent.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Examples/streaming_demo.cc` around lines 84 - 87, Replace the hardcoded "Actual" counts with values computed from the demo input `stream`: build a frequency map by iterating over the existing `stream` container (e.g., an unordered_map<string,int> or map) and count occurrences, then use that map when printing the estimated frequencies (refer to `stream` and `cms.estimate`) so the printed "(Actual: ...)" values come from the computed map (e.g., map["apple"], map["banana"], map["word_0"]) instead of fixed literals.

Tests/math_nt_test.cc (2)
85-111: Should fix: add seeded property-style checks for primality/factorization. These tests are solid examples, but `miller_rabin`/`pollard_rho` would be safer with deterministic randomized cases (fixed seed) that assert invariants like "product of factors == n" and "each factor is prime".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Tests/math_nt_test.cc` around lines 85 - 111, Add deterministic randomized property-style checks to the MillerRabin and PollardRho tests: in TEST(MathNT, MillerRabin) and TEST(MathNT, PollardRho) initialize a fixed-seed RNG, generate a set of random n values (including composites and primes), and for each assert miller_rabin(n) matches a known-primality oracle (or compare with a slow deterministic test for small n); for pollard_rho, for each random composite n call pollard_rho(n) and assert the product of returned factors equals n and that every returned factor passes miller_rabin (is prime); ensure tests are deterministic by using the fixed seed and normalize factor ordering (e.g., sort) before comparisons.
98-123: Should fix: add a bounded performance regression check for critical paths. Please add at least one stable regression test case (large, deterministic input set) for hot algorithms like factorization or NTT to catch accidental complexity regressions.
Based on learnings: "Include performance regression tests for critical algorithms".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Tests/math_nt_test.cc` around lines 98 - 123, Add a bounded performance regression test in Tests/math_nt_test.cc that times a large deterministic input for the hot algorithms (use pollard_rho for a large composite like product of two known 32/64-bit primes and NTT::multiply for large power-of-two vectors), measuring elapsed time with a steady clock and asserting it stays below a fixed threshold (e.g., EXPECT_LE(duration_ms, <threshold_ms>)) to catch complexity regressions; ensure the inputs are deterministic (hard-coded primes/arrays) and reference the pollard_rho and NTT<>::multiply calls so the test fails if factorization or NTT slows unexpectedly.modular_arithmetic.H (1)
64-84: Consider documenting or validating that `m > 0`. When `m == 0`, the 128-bit path causes division by zero (undefined behavior), and the fallback path would have issues with the modulus operations. While callers typically ensure `m > 0`, a `@pre` tag or runtime check would make the contract explicit.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modular_arithmetic.H` around lines 64 - 84, The mod_mul function lacks a precondition for m > 0; add an explicit check and/or documented `@pre` to avoid division-by-zero in the __uint128_t path and incorrect behavior in the fallback loop. Update the mod_mul declaration to document the contract (e.g., "@pre m > 0") and add a runtime validation at the start of mod_mul (assert or throw/abort) that fails fast when m == 0 so callers get a clear error instead of undefined behavior; reference the mod_mul(uint64_t a, uint64_t b, uint64_t m) function when making these changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@ah-dry.H`:
- Around line 1571-1592: The method equal_to is marked noexcept but calls
element equality which may throw; change its exception specification to either
remove noexcept or make it conditional using the element-type equality trait:
replace the trailing noexcept with noexcept(noexcept(std::declval<const typename
Container::Item_Type&>() == std::declval<const typename
Container::Item_Type&>())) so the function is noexcept only when the item
operator== is noexcept; ensure you include <utility> and <type_traits> if needed
and keep the rest of equal_to (const_me, get_it, iterator loop) unchanged.
In `@count-min-sketch.H`:
- Around line 143-152: The merge implementation in Count_Min_Sketch currently
only checks w_ and d_ but must also verify seeds_ match; update
Count_Min_Sketch::merge to compare this->seeds_ with other.seeds_ (element-wise
or via equality operator for the container) and raise the ah_domain_error_if
when they differ, before modifying table_ and total_count_, so incompatible
sketches cannot be merged.
In `@Examples/streaming_demo.cc`:
- Around line 55-63: Replace all uses of std::vector<std::string> for the
variables stream, set1, and set2 with the Aleph container DynList<std::string>;
update any push_back/iteration usage to the DynList API accordingly (e.g., use
DynList's append/push semantics and iterators) and apply the same replacement at
the other occurrences mentioned (around the later 108-109 usage). Locate
declarations named stream, set1, set2 and change their types to
DynList<std::string>, then adjust the loop that extends stream to use DynList's
add/append method and any subsequent set operations to the Aleph equivalents.
In `@hyperloglog.H`:
- Around line 82-86: The constructor currently computes m_ with an unchecked
shift (m_(static_cast<size_t>(1) << b)) before the range check, which can UB for
large b; fix by validating b before performing the shift: add a static helper
function (e.g., compute_m_or_throw(uint8_t b)) that uses the existing
ah_domain_error_if check to validate b in range [4,16] and then returns
(static_cast<size_t>(1) << b), and call that helper in the initializer list as
m_(compute_m_or_throw(b)); this keeps b_ initialization and the error behavior
but ensures the shift only happens after validation in the helper.
In `@modular_arithmetic.H`:
- Around line 95-107: The function mod_exp currently returns 1 when m == 1
because res is initialized to 1 and never reduced; update mod_exp(uint64_t base,
uint64_t exp, const uint64_t m) to handle this edge by returning 0 immediately
when m == 1 (or alternatively document that m must be >1), i.e., add an early
check for m == 1 before using base %= m and the main loop so all outputs become
base^exp mod 1 == 0.
- Around line 170-181: The loop computing prod can overflow and mod_inv is
called with uint64_t values that may exceed int64_t; fix by computing the full
product in a wider type (e.g., use a __int128 accumulator or check for overflow
on prod *= m and bail/return an error if the product would exceed UINT64_MAX),
then cast or reduce p before calling mod_inv to match its signed parameter
(e.g., compute p = prod / mod[i] in the wider type, reduce p_mod = (uint64_t)(p
% mod[i]) and pass a safe cast to mod_inv, or better, change mod_inv's signature
to accept uint64_t consistently), and use the wider accumulator when calling
mod_mul/mod_inv to avoid intermediate overflows (update references to prod, p,
mod_inv, mod_mul, rem, result accordingly).
In `@modular_combinatorics.H`:
- Around line 65-81: The constructor ModularCombinatorics(size_t max_n, uint64_t
p) must validate the modulus before building fact_ and invFact_; add an early
check that p > 1 (and optionally that max_n > 0) and fail fast (throw
std::invalid_argument or use assert) if not, so methods that call
mod_mul/mod_inv never run with an invalid modulus; perform this validation at
the start of the constructor (before any use of mod_, fact_, invFact_) and
document the precondition in the constructor comment or exception message.
- Around line 89-116: The nCk and lucas_nCk functions can index fact_ and
invFact_ out of range; add explicit bounds checks against the precomputed table
size (e.g., fact_.size() or a stored max_n) inside nCk and at the point
lucas_nCk extracts ni/ki, and when the requested n (or ni) exceeds that max
either expand/ensure the factorial tables (provide/ call a
ensure_capacity(max_needed) helper) or fall back to a safe multiplicative
modular-combination routine that does not index fact_/invFact_ (iterative
numerator/denominator with modular inverses). In practice: inside nCk(uint64_t
n, uint64_t k) validate n <= max_precomputed and either expand tables or compute
via fallback; inside lucas_nCk(uint64_t n, uint64_t k) validate ni <
max_precomputed before calling nCk(ni,ki) (and similarly ensure recursive n/mod_
steps are guarded). Ensure you reference and use fact_, invFact_, nCk and
lucas_nCk symbols when implementing these checks and fallbacks.
In `@modular_linalg.H`:
- Around line 69-74: The code handling Array<Array<uint64_t>> in
modular_linalg.H doesn't validate that all inner arrays have the same length
before performing in-place modulus and later elimination/inversion; add an
explicit ragged-matrix check at the start of the Array<Array<uint64_t>> branch
(where MatrixT == Array<Array<uint64_t>> and mat is iterated) that verifies
every mat[i].size() equals mat[0].size(), and if not return/throw a clear error
(or set an error flag) to reject ragged input before any mat(i)(j) %= mod or any
elimination/inversion routines run.
- Around line 65-83: The constructor Modular_Matrix(const MatrixT & m, const
uint64_t p) must validate the modulus before doing any normalization: at the
start of the constructor check that p is a valid modulus (e.g. p > 1, or at
least p != 0 depending on project constraints) and throw a descriptive exception
(std::invalid_argument) if invalid; only after this validation assign mod and
proceed to normalize mat in both the Array<Array<uint64_t>> branch and the
DynMatrix branch that uses mat.read_ne and mat.write. Ensure the check occurs
before any use of mod and reference the constructor name Modular_Matrix, member
variables mat and mod, and the normalization code paths (Array<Array<uint64_t>>
loop and the DynMatrix read_ne/write calls).
- Line 160: The current calls to mod_inv (e.g., where const uint64_t inv =
mod_inv(static_cast<int64_t>(diag), static_cast<int64_t>(mod)); in
modular_linalg.H and other callers in modular_arithmetic.H) are forcing
narrowing casts from uint64_t to int64_t; fix by updating the mod_inv API to
accept unsigned 64-bit values (either change signature to inline int64_t
mod_inv(const uint64_t a, const uint64_t m) or add an overload inline int64_t
mod_inv(const uint64_t a, const uint64_t m) that internally handles unsigned
arithmetic) and then remove the static_casts at the call sites (e.g., references
to diag and mod) so no narrowing conversion occurs; ensure the implementation
correctly handles unsigned inputs and preserves return semantics.
In `@ntt.H`:
- Around line 82-90: Validate input size at the start of
transform(Array<uint64_t>& a, bool invert): check that n = a.size() is > 0 and
is a power of two (use bit test or loop) and that (MOD - 1) % n == 0 so the root
exponent (MOD-1)/n is integral; if either check fails, do not proceed to
bit_reverse or mod_exp(ROOT, (MOD-1)/len, MOD). Replace any raw throw usage with
the Aleph exception macros from ah-errors.H (use the project-standard macro to
report precondition failures with a clear message mentioning transform, n and
MOD). Apply the same validation and macro-based error reporting to the other
transform-related block around lines 124-131 where similar assumptions are made.
In `@pollard_rho.H`:
- Around line 71-73: The lambda f that computes the Pollard step uses
(mod_mul(val, val, n) + c) % n which can overflow before the modulus; update f
(the lambda named f) to perform the addition safely under modulus by either
using a modular-add helper (e.g., mod_add) or by promoting the sum to a wider
integer type (unsigned __int128) when adding c to the result of mod_mul(val,
val, n) and then taking % n, ensuring you only reference mod_mul, c, and n
inside f and preserve the original semantics.
In `@README.md`:
- Around line 3632-3664: The new streaming subsection (anchor
readme-streaming-algorithms and its code block referencing reservoir-sampling.H,
count-min-sketch.H, hyperloglog.H, minhash.H, simhash.H, and the example using
Aleph::Count_Min_Sketch and Aleph::HyperLogLog) was inserted in the middle of an
existing markdown table and broke its row continuity (so later rows such as the
combinatorics_enumeration_example.cc entry no longer render as table rows); fix
this by moving the entire Probabilistic Streaming Algorithms section (heading,
table snippet and code block) so it appears after the existing table finishes,
or alternatively wrap the streaming content as a single table cell if it must
stay inline, ensuring the original table row for "Lexicographic
permutation/combination enumeration" and the following rows are restored exactly
as table syntax (pipe-delimited) so the markdown table renders correctly.
In `@reservoir-sampling.H`:
- Around line 78-86: The constructor Reservoir_Sampler currently calls
gsl_rng_alloc(...) and immediately uses rng_ via gsl_rng_set without checking
for null; add a null check after gsl_rng_alloc(gsl_rng_mt19937) and handle
allocation failure (e.g., throw or call ah_domain_error_if with a descriptive
message) before calling gsl_rng_set, ensuring rng_ is only used when non-null;
update the Reservoir_Sampler constructor (rng_, gsl_rng_alloc, gsl_rng_set
references) to perform the check and fail fast if allocation returns nullptr.
In `@Tests/streaming_algorithms_test.cc`:
- Around line 84-90: The test incorrectly asserts that
cms.estimate("dragonfruit") == 0 which is not guaranteed by Count-Min Sketch due
to hash collisions; remove the EXPECT_EQ(cms.estimate("dragonfruit"), 0u) line
(or replace it with a relaxed, deterministic check such as
EXPECT_GE(cms.estimate("dragonfruit"), 0u) or an assertion that it is below a
known noise bound if you can compute one from the CMS parameters) so the test no
longer expects an exact zero for unseen keys; look for the
cms.estimate("dragonfruit") assertion in the test and delete or relax it
accordingly.
In `@tpl_sort_utils.H`:
- Around line 5260-5284: bucket_sort_impl currently assumes num_buckets > 0 and
unconditionally writes offsets[0], which will OOB when num_buckets == 0; guard
the whole bucket computation early: in bucket_sort_impl check if num_buckets ==
0 (or n <= 1) and return immediately, or handle the zero-bucket case before
creating/using Array<size_t> offsets and accessing offsets[0]; update any
related logic that creates counts/offsets and computes prefix sums (references:
bucket_sort_impl, counts, offsets, bucket_key, num_buckets) so no reads/writes
occur when num_buckets is zero.
- Around line 6047-6053: The overload timsort(T* a, long l, long r, ...)
currently dereferences a + l without validating a; add a null-pointer contract
check at the start of timsort(T*, long, long, ...) to guard non-empty ranges: if
a == nullptr and r >= l then either throw a std::invalid_argument (or call
assert) with a clear message and return, otherwise proceed to call
timsort_detail::timsort_impl(a + l, static_cast<size_t>(r - l + 1), cmp); ensure
the check references the function name timsort and the callee
timsort_detail::timsort_impl so the change is easy to locate.
- Around line 5395-5403: The key lambda ignores the provided comparator (cmp)
and always maps values from min→max; detect whether ordering is descending by
evaluating cmp(min_val, max_val) and, if descending, invert the computed bucket
index before returning it. Concretely, inside the key lambda (captures min_val,
range, num_buckets, and cmp), compute b as you already do, then if cmp(min_val,
max_val) is false (meaning a descending comparator like greater was provided)
set b = (num_buckets - 1) - b; then return the clamped b and call
bucket_sort_detail::bucket_sort_impl(a, n, num_buckets, key, cmp) as before.
---
Nitpick comments:
In `@Examples/random_select_example.cc`:
- Around line 50-53: The helper function print_rule() is duplicated across
examples; remove the local definition in Examples/random_select_example.cc and
instead include the shared declaration by adding `#include` "print_rule.H" (or the
header that defines/declares print_rule()) and rely on the single shared
implementation; ensure the example no longer defines print_rule() and that the
build links against the shared implementation so Examples::random_select_example
uses the centralized print_rule() function.
In `@Examples/streaming_demo.cc`:
- Around line 84-87: Replace the hardcoded "Actual" counts with values computed
from the demo input `stream`: build a frequency map by iterating over the
existing `stream` container (e.g., an unordered_map<string,int> or map) and
count occurrences, then use that map when printing the estimated frequencies
(refer to `stream` and `cms.estimate`) so the printed "(Actual: ...)" values
come from the computed map (e.g., map["apple"], map["banana"], map["word_0"])
instead of fixed literals.
In `@modular_arithmetic.H`:
- Around line 64-84: The mod_mul function lacks a precondition for m > 0; add an
explicit check and/or documented `@pre` to avoid division-by-zero in the
__uint128_t path and incorrect behavior in the fallback loop. Update the mod_mul
declaration to document the contract (e.g., "@pre m > 0") and add a runtime
validation at the start of mod_mul (assert or throw/abort) that fails fast when
m == 0 so callers get a clear error instead of undefined behavior; reference the
mod_mul(uint64_t a, uint64_t b, uint64_t m) function when making these changes.
In `@simhash.H`:
- Around line 108-116: The get_fingerprint() method uses bounds-checked array
access v_[i] inside the loop while update() uses the unchecked accessor v_(i);
replace v_[i] with v_(i) in get_fingerprint() so the internal loop consistently
uses the unchecked operator() for indices 0..63 (keeping the existing loop
bounds and noexcept signature) to avoid the extra bounds checks and match
update().
- Around line 72-75: Introduce a single named constant for the fingerprint size
(e.g., static constexpr size_t FINGERPRINT_SIZE = 64) and replace all magic
literals `64` in the SimHash class with that constant; update the SimHash()
constructor (v_ = Array<double>(FINGERPRINT_SIZE, 0.0)), and any other methods
referenced in the diff such as update, get_fingerprint, and clear to use
SimHash::FINGERPRINT_SIZE instead of the literal so the size is maintained in
one place.
In `@Tests/math_nt_test.cc`:
- Around line 85-111: Add deterministic randomized property-style checks to the
MillerRabin and PollardRho tests: in TEST(MathNT, MillerRabin) and TEST(MathNT,
PollardRho) initialize a fixed-seed RNG, generate a set of random n values
(including composites and primes), and for each assert miller_rabin(n) matches a
known-primality oracle (or compare with a slow deterministic test for small n);
for pollard_rho, for each random composite n call pollard_rho(n) and assert the
product of returned factors equals n and that every returned factor passes
miller_rabin (is prime); ensure tests are deterministic by using the fixed seed
and normalize factor ordering (e.g., sort) before comparisons.
- Around line 98-123: Add a bounded performance regression test in
Tests/math_nt_test.cc that times a large deterministic input for the hot
algorithms (use pollard_rho for a large composite like product of two known
32/64-bit primes and NTT::multiply for large power-of-two vectors), measuring
elapsed time with a steady clock and asserting it stays below a fixed threshold
(e.g., EXPECT_LE(duration_ms, <threshold_ms>)) to catch complexity regressions;
ensure the inputs are deterministic (hard-coded primes/arrays) and reference the
pollard_rho and NTT<>::multiply calls so the test fails if factorization or NTT
slows unexpectedly.
In `@Tests/sort_utils.cc`:
- Around line 2103-2120: Add two regression tests for BucketSort: one TEST named
BucketSort.descending_comparator that calls bucket_sort with a descending
comparator (e.g., std::greater<...>) on a small descending input and asserts the
result is sorted ascending according to that comparator; and one TEST named
BucketSort.zero_buckets_contract that exercises the new overload taking an
explicit buckets count—assert that bucket_sort(nullptr, 0) does not throw
(already present) and that calling bucket_sort on a non-null buffer with buckets
== 0 throws std::invalid_argument to lock down the zero-buckets contract.
Reference the bucket_sort symbol and the TEST macro names when adding these
cases.
In `@Tests/streaming_algorithms_test.cc`:
- Around line 50-160: Add explicit edge-case unit tests for the streaming APIs:
for reservoir_sample add tests for k=0 with non-empty input and empty input
iterators (verify sample.size()==0 and no crash), for Count_Min_Sketch add
updates/estimates on empty sketch and single-element sketches (using
Count_Min_Sketch::from_error_bounds or the constructor used in CMSSketchMerge)
and assert estimates are 0 for unseen keys and correct for single updates, for
HyperLogLog construct tiny/register-low HLLs (e.g., small p) and verify estimate
for empty set is 0 and for single-element updates is ~1, and for MinHash and
SimHash add tests that call update zero times (empty) and once (single element)
and ensure similarity/estimate/fingerprint calls do not crash and return
sensible deterministic values (e.g., similarity between identical single-element
sketches is ~1 and empty comparisons behave consistently). Ensure tests
reference reservoir_sample, Count_Min_Sketch, HyperLogLog, MinHash, and SimHash
so the behavior at boundaries is explicitly asserted.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (30)
- Examples/CMakeLists.txt
- Examples/combinatorics_enumeration_example.cc
- Examples/gray_code_example.cc
- Examples/math_nt_example.cc
- Examples/print_rule.H
- Examples/random_select_example.cc
- Examples/streaming_demo.cc
- README.es.md
- README.md
- Tests/CMakeLists.txt
- Tests/ah-comb.cc
- Tests/math_nt_test.cc
- Tests/sort_utils.cc
- Tests/streaming_algorithms_test.cc
- ah-comb.H
- ah-dry.H
- count-min-sketch.H
- hyperloglog.H
- minhash.H
- modular_arithmetic.H
- modular_combinatorics.H
- modular_linalg.H
- ntt.H
- pollard_rho.H
- primality.H
- reservoir-sampling.H
- simhash.H
- tpl_array.H
- tpl_dynArray.H
- tpl_sort_utils.H
Enhances several sorting algorithms (mergesort, quicksort, introsort, heapsort, counting sort, radix sort, bucket sort) and adds new tests for streaming algorithms (Count-Min Sketch, HyperLogLog). Refines modular arithmetic operations and improves noexcept specifications.
Actionable comments posted: 3
♻️ Duplicate comments (5)
modular_arithmetic.H (1)
207-215: ⚠️ Potential issue | 🔴 Critical
Must fix: `crt` needs modulus validation and overflow checks for `prod`. Line 209 can silently overflow, and Line 214 can divide by zero if any modulus is 0.
🧩 Suggested fix
```diff
   uint64_t prod = 1;
-  for (const uint64_t m : mod)
-    prod *= m;
+  for (const uint64_t m : mod)
+    {
+      ah_invalid_argument_if(m < 2) << "crt: modulus must be >= 2";
+      ah_runtime_error_if(prod > std::numeric_limits<uint64_t>::max() / m)
+        << "crt: product of moduli overflows uint64_t";
+      prod *= m;
+    }
```

As per coding guidelines: "Validate input preconditions" and "Check boundary conditions and off-by-one errors".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modular_arithmetic.H` around lines 207 - 215, The crt implementation must validate inputs and prevent overflow: first check rem.size() == mod.size() and that every mod[i] > 0 (reject/throw on zero) before computing prod; compute the product using a wider type or checked multiplication (e.g., use __int128 for intermediate multiplies) and detect if prod exceeds UINT64_MAX (return error/throw) to avoid silent overflow; when computing p = prod / mod[i] you can assume mod[i] > 0 after validation, and pass correctly sized/checked values into mod_inv; update symbols referenced: crt, prod, p, mod, rem, mod_inv.

modular_combinatorics.H (2)

96-125: ⚠️ Potential issue | 🔴 Critical
Must fix: `nCk`/`lucas_nCk` can read beyond precomputed tables. Line 99 and Line 124 index `fact_`/`invFact_` without verifying the requested range, which can cause out-of-bounds access. If you add Aleph range checks, these methods should no longer be `noexcept`.

🧩 Suggested fix

```diff
-  [[nodiscard]] uint64_t nCk(uint64_t n, uint64_t k) const noexcept
+  [[nodiscard]] uint64_t nCk(uint64_t n, uint64_t k) const
   {
     if (k > n) return 0;
+    ah_range_error_if(n >= fact_.size())
+      << "nCk: n exceeds precomputed range";
     const uint64_t num = fact_[n];
     const uint64_t den = mod_mul(invFact_[k], invFact_[n - k], mod_);
     return mod_mul(num, den, mod_);
   }
 @@
-  [[nodiscard]] uint64_t lucas_nCk(uint64_t n, uint64_t k) const noexcept
+  [[nodiscard]] uint64_t lucas_nCk(uint64_t n, uint64_t k) const
   {
     if (k > n) return 0;
     if (k == 0) return 1;
     const uint64_t ni = n % mod_;
     const uint64_t ki = k % mod_;
+    ah_range_error_if(ni >= fact_.size())
+      << "lucas_nCk: ni exceeds precomputed range";
     if (ki > ni) return 0; // Lucas theorem property
```

As per coding guidelines: "Check boundary conditions and off-by-one errors" and "Validate input preconditions".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modular_combinatorics.H` around lines 96 - 125, nCk and lucas_nCk currently index fact_ and invFact_ without bounds checks which can OOB; update both to validate that the requested n (and ni in lucas_nCk) is within the precomputed range (e.g. compare n and ni against fact_.size() or a max_n_ member) before accessing fact_/invFact_, and handle out-of-range inputs (throw an exception or return 0 per project convention) instead of undefined behavior; also remove noexcept from both nCk and lucas_nCk signatures so they can propagate errors. Ensure references to fact_, invFact_, nCk, lucas_nCk and mod_ are updated accordingly.
68-72: ⚠️ Potential issue | 🔴 Critical
Must fix: validate modulus before any table construction. Line 70 can underflow `mod_ - 1` when `p == 0`, and later operations can hit invalid modulo paths.

🛡️ Suggested fix

```diff
   ModularCombinatorics(size_t max_n, uint64_t p) : mod_(p)
   {
+    ah_invalid_argument_if(mod_ < 2)
+      << "ModularCombinatorics: modulus must be >= 2";
+    if (max_n >= mod_) max_n = static_cast<size_t>(mod_ - 1);
```

As per coding guidelines: "Validate input preconditions" and "ALWAYS use Aleph exception macros from `ah-errors.H` instead of raw `throw` statements".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modular_combinatorics.H` around lines 68 - 72, Validate the incoming modulus p at the start of the ModularCombinatorics(size_t max_n, uint64_t p) constructor before any use of mod_ or arithmetic like mod_ - 1 (e.g., check p == 0 and reject it), and if invalid raise an Aleph-style exception using the macros from ah-errors.H (use the project's ALEPH/ah-errors macro rather than raw throw); only after successful validation adjust max_n relative to mod_ and proceed with table construction.

ntt.H (1)

92-99: ⚠️ Potential issue | 🔴 Critical
Must fix: validate transform size constraints before butterfly loops. Without checks, non-power-of-two lengths can access past bounds (Line 112), and unsupported sizes produce invalid roots. `n == 0` is also invalid for inverse scaling.

🛡️ Suggested fix

```diff
+#include <ah-errors.H>
 @@
   static void transform(Array<uint64_t>& a, bool invert)
   {
-    int n = a.size();
+    const size_t n = a.size();
+    ah_invalid_argument_if(n == 0)
+      << "NTT::transform: n must be > 0";
+    ah_invalid_argument_if((n & (n - 1)) != 0)
+      << "NTT::transform: n must be a power of two";
+    ah_invalid_argument_if(((MOD - 1) % n) != 0)
+      << "NTT::transform: n must divide MOD-1";
     bit_reverse(a);
 @@
-    for (int len = 2; len <= n; len <<= 1)
+    for (size_t len = 2; len <= n; len <<= 1)
 @@
-    for (int i = 0; i < n; i += len)
+    for (size_t i = 0; i < n; i += len)
 @@
-    for (int j = 0; j < len / 2; j++)
+    for (size_t j = 0; j < len / 2; j++)
```

As per coding guidelines: "Validate input preconditions" and "ALWAYS use Aleph exception macros from `ah-errors.H` instead of raw `throw` statements".

Also applies to: 106-124
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ntt.H` around lines 92 - 99, The transform function must validate input size and root-compatibility before running butterfly loops: check that n = a.size() > 0 and is a power of two (e.g., verify (n & (n-1)) == 0) and raise an Aleph error macro from ah-errors.H if not; before computing wlen with mod_exp(ROOT, (MOD-1)/len, MOD) ensure (MOD-1) % len == 0 (or otherwise fail with an Aleph error) so roots are valid; also guard loops so len never exceeds n; replace any raw throw with the appropriate Aleph exception macro; ensure the inverse branch handles n==0 validation prior to scaling (e.g., require valid n before calling modular inverse).

Tests/streaming_algorithms_test.cc (1)

89-89: ⚠️ Potential issue | 🟠 Major
Must fix: remove the hard upper bound for unseen Count-Min Sketch keys. Line 89 still assumes low collision noise (`< 5`), which is not guaranteed by CMS and can become flaky across hash layouts/implementations.

💡 Deterministic alternative

```diff
-  EXPECT_LT(cms.estimate("dragonfruit"), 5u); // unseen key: might have collisions but should be low
+  const auto unseen_before = cms.estimate("dragonfruit");
+  cms.update("dragonfruit", 1);
+  EXPECT_GE(cms.estimate("dragonfruit"), unseen_before + 1);
```
Verify each finding against the current code and only fix it if needed. In `@Tests/streaming_algorithms_test.cc` at line 89, Remove the flaky hard upper bound check on Count-Min Sketch collisions: delete the EXPECT_LT(cms.estimate("dragonfruit"), 5u) assertion and instead assert a deterministic property such as non-negativity (e.g., use EXPECT_GE(cms.estimate("dragonfruit"), 0u)) or otherwise remove the check; locate the check by the test variable/name cms and its estimate("dragonfruit") call in the streaming_algorithms_test.cc test and update that assertion accordingly.
🧹 Nitpick comments (4)
ah-dry.H (1)
1539-1609: Well-designed equality mixin with proper exception safety. The `EqualSequenceMethod` class is correctly implemented:

- Self-comparison optimization (line 1575-1576)
- Size short-circuit before element-wise comparison
- Using `not (a == b)` instead of `a != b` is a good choice: it only requires `operator==` on the element type
- Conditional `noexcept` properly propagates the exception specification from the element's `operator==`

Minor consistency note: `const_me()` at line 1555 could be marked `noexcept` to match similar helpers elsewhere in the file (e.g., lines 169, 557).

Optional: Add noexcept to const_me()

```diff
-  const Container * const_me() const
+  const Container * const_me() const noexcept
   {
     return static_cast<const Container*>(this);
   }
```
Verify each finding against the current code and only fix it if needed. In `@ah-dry.H` around lines 1539 - 1609, Mark the helper method const_me() in class EqualSequenceMethod as noexcept to match similar helpers and improve consistency; update the signature of const_me() (the const helper that returns static_cast<const Container*>) to be declared noexcept while leaving its body unchanged so exception-specification propagation for equal_to() remains consistent.

Tests/math_nt_test.cc (1)

117-140: Should fix: add explicit failure-path tests for NTT and modular-combinatorics contracts. Please add `EXPECT_THROW` coverage for invalid NTT lengths and out-of-range combinatorics requests (after bounds checks are added), plus minimal empty/singleton edge cases.

Based on learnings: "Applies to Tests//*_test.cc : Tests must cover edge cases: empty inputs, single element, large inputs" and "Applies to Tests//*_test.cc : Tests must verify algorithmic correctness, not just absence of crashes".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Tests/math_nt_test.cc` around lines 117 - 140, Add explicit edge-case and failure-path assertions: extend the NTT tests to include empty and singleton inputs (call NTT<>::multiply with size 0 and size 1 arrays and assert correct results) and add EXPECT_THROW checks when calling NTT<>::multiply with invalid lengths (e.g., non-power-of-two or lengths that the implementation documents as invalid). For ModularCombinatorics, add tests that call nCk with out-of-range parameters (k>n and negative/invalid n/k where applicable) and assert EXPECT_EQ(..., 0) or EXPECT_THROW depending on the new bounds-checking behavior, plus tests for nCk with n=0 and n=1 and lucas_nCk on small edge cases (e.g., n or k = 0 or 1) to verify correct outputs.

Tests/streaming_algorithms_test.cc (1)

50-169: Should fix: add edge-case contract tests for the new streaming APIs. Current coverage is mostly happy-path. Please add deterministic edge-case tests (empty/singleton input, invalid constructor bounds, and invalid merge/precondition cases) for each algorithm class.
Based on learnings: "Applies to Tests//*_test.cc : Tests must cover edge cases: empty inputs, single element, large inputs" and "Applies to Tests//*_test.cc : Tests must be deterministic and repeatable".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Tests/streaming_algorithms_test.cc` around lines 50 - 169, Add deterministic edge-case unit tests for each streaming algorithm: for reservoir_sample call reservoir_sample with k=0, k=1 and empty input and a fixed seed to assert sizes and bounds; for Count_Min_Sketch add tests that from_error_bounds throws for invalid eps/delta (already partially present) plus tests for update/estimate on empty/singleton streams and a deterministic merge precondition test that merge throws or fails when dimensions differ (use Count_Min_Sketch::from_error_bounds and merge with mismatched width/depth); for HyperLogLog add constructor validation tests (e.g., invalid register bits) and tests for estimate on empty and singleton streams using HyperLogLog::update and estimate; for MinHash add tests for empty sets, singleton sets and deterministic similarity for identical singletons using MinHash::update and similarity; for SimHash add tests that empty documents and singleton-word docs produce deterministic fingerprints via SimHash::update/get_fingerprint and that similarity behaves for identical vs disjoint inputs; ensure all tests use fixed seeds or deterministic inputs and assert exact sizes/expected behavior or thrown std::domain_error where preconditions are violated.

Tests/sort_utils.cc (1)

2220-2233: `random_large` is useful, but consider reducing unit-test load or reclassifying as stress/perf. `N = 100000` in the default unit suite may noticeably increase CI runtime. Consider a smaller unit size plus a dedicated performance/stress target for the large case.
Verify each finding against the current code and only fix it if needed. In `@Tests/sort_utils.cc` around lines 2220 - 2233, The Timsort::random_large test uses N = 100000 which is heavy for unit CI; change the test in TEST(Timsort, random_large) to use a smaller N (e.g., 1000 or 10000) for the default unit suite and move or duplicate the large N case into a new perf/stress test target (e.g., TEST(Timsort, perf_large) or a separate benchmark file) that runs timsort on N = 100000; update any test registration or build/test config so only the smaller test runs in fast CI while the large case runs in the performance/stress job.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modular_arithmetic.H`:
- Around line 64-71: The modular routines must validate the modulus before
performing any % operations: add an early precondition check in mod_mul (and the
other modular primitive around lines 95-99) to ensure m > 0 and handle
violations (e.g., throw std::invalid_argument("modulus must be > 0") or assert)
so no code path does (x % m) with m==0; place the check at the start of the
functions (mod_mul and the other modular function) and return/throw immediately
if the check fails.
In `@ntt.H`:
- Around line 141-144: The loop that sets n = 1; while (n < a.size() + b.size()
- 1) n <<= 1; must be bounded by the maximum NTT length supported by (MOD,
ROOT); compute max_ntt_len as the largest power-of-two divisor of MOD-1 (i.e.
largest n such that n | (MOD-1)) and after computing the desired n check if n >
max_ntt_len; if so either reduce n to max_ntt_len and handle/truncate inputs
accordingly or return/throw a clear error; update the convolution/ntt entry
point (the function using n, a.size(), b.size(), MOD, ROOT) to validate and
handle the oversized-input case.
In `@Tests/sort_utils.cc`:
- Around line 2172-2181: Update the TEST named BucketSort.zero_buckets to assert
post-conditions beyond crash-safety: after calling bucket_sort(a, n,
num_buckets, key) verify that the contents of the input array `a` remain
unchanged (or match the documented behavior) — e.g., compare each element of `a`
to its original values (3,1,2) using the existing test framework assertions
rather than only EXPECT_NO_THROW; reference the test's local variables `a`, `n`,
`num_buckets`, `key`, and the function `bucket_sort` when adding these
assertions.
---
Duplicate comments:
In `@modular_arithmetic.H`:
- Around line 207-215: The crt implementation must validate inputs and prevent
overflow: first check rem.size() == mod.size() and that every mod[i] > 0
(reject/throw on zero) before computing prod; compute the product using a wider
type or checked multiplication (e.g., use __int128 for intermediate multiplies)
and detect if prod exceeds UINT64_MAX (return error/throw) to avoid silent
overflow; when computing p = prod / mod[i] you can assume mod[i] > 0 after
validation, and pass correctly sized/checked values into mod_inv; update symbols
referenced: crt, prod, p, mod, rem, mod_inv.
In `@modular_combinatorics.H`:
- Around line 96-125: nCk and lucas_nCk currently index fact_ and invFact_
without bounds checks which can OOB; update both to validate that the requested
n (and ni in lucas_nCk) is within the precomputed range (e.g. compare n and ni
against fact_.size() or a max_n_ member) before accessing fact_/invFact_, and
handle out-of-range inputs (throw an exception or return 0 per project
convention) instead of undefined behavior; also remove noexcept from both nCk
and lucas_nCk signatures so they can propagate errors. Ensure references to
fact_, invFact_, nCk, lucas_nCk and mod_ are updated accordingly.
- Around line 68-72: Validate the incoming modulus p at the start of the
ModularCombinatorics(size_t max_n, uint64_t p) constructor before any use of
mod_ or arithmetic like mod_ - 1 (e.g., check p == 0 and reject it), and if
invalid raise an Aleph-style exception using the macros from ah-errors.H (use
the project’s ALEPH/ah-errors macro rather than raw throw); only after
successful validation adjust max_n relative to mod_ and proceed with table
construction.
In `@ntt.H`:
- Around line 92-99: The transform function must validate input size and
root-compatibility before running butterfly loops: check that n = a.size() > 0
and is a power of two (e.g., verify (n & (n-1)) == 0) and raise an Aleph error
macro from ah-errors.H if not; before computing wlen with mod_exp(ROOT,
(MOD-1)/len, MOD) ensure (MOD-1) % len == 0 (or otherwise fail with an Aleph
error) so roots are valid; also guard loops so len never exceeds n; replace any
raw throw with the appropriate Aleph exception macro; ensure the inverse branch
handles n==0 validation prior to scaling (e.g., require valid n before calling
modular inverse).
In `@Tests/streaming_algorithms_test.cc`:
- Line 89: Remove the flaky hard upper bound check on Count-Min Sketch
collisions: delete the EXPECT_LT(cms.estimate("dragonfruit"), 5u) assertion and
instead assert a deterministic property such as non-negativity (e.g., use
EXPECT_GE(cms.estimate("dragonfruit"), 0u)) or otherwise remove the check;
locate the check by the test variable/name cms and its estimate("dragonfruit")
call in the streaming_algorithms_test.cc test and update that assertion
accordingly.
---
Nitpick comments:
In `@ah-dry.H`:
- Around line 1539-1609: Mark the helper method const_me() in class
EqualSequenceMethod as noexcept to match similar helpers and improve
consistency; update the signature of const_me() (the const helper that returns
static_cast<const Container*>) to be declared noexcept while leaving its body
unchanged so exception-specification propagation for equal_to() remains
consistent.
In `@Tests/math_nt_test.cc`:
- Around line 117-140: Add explicit edge-case and failure-path assertions:
extend the NTT tests to include empty and singleton inputs (call NTT<>::multiply
with size 0 and size 1 arrays and assert correct results) and add EXPECT_THROW
checks when calling NTT<>::multiply with invalid lengths (e.g., non-power-of-two
or lengths that the implementation documents as invalid). For
ModularCombinatorics, add tests that call nCk with out-of-range parameters (k>n
and negative/invalid n/k where applicable) and assert EXPECT_EQ(..., 0) or
EXPECT_THROW depending on the new bounds-checking behavior, plus tests for nCk
with n=0 and n=1 and lucas_nCk on small edge cases (e.g., n or k = 0 or 1) to
verify correct outputs.
In `@Tests/sort_utils.cc`:
- Around line 2220-2233: The Timsort::random_large test uses N = 100000 which is
heavy for unit CI; change the test in TEST(Timsort, random_large) to use a
smaller N (e.g., 1000 or 10000) for the default unit suite and move or duplicate
the large N case into a new perf/stress test target (e.g., TEST(Timsort,
perf_large) or a separate benchmark file) that runs timsort on N = 100000;
update any test registration or build/test config so only the smaller test runs
in fast CI while the large case runs in the performance/stress job.
In `@Tests/streaming_algorithms_test.cc`:
- Around line 50-169: Add deterministic edge-case unit tests for each streaming
algorithm: for reservoir_sample call reservoir_sample with k=0, k=1 and empty
input and a fixed seed to assert sizes and bounds; for Count_Min_Sketch add
tests that from_error_bounds throws for invalid eps/delta (already partially
present) plus tests for update/estimate on empty/singleton streams and a
deterministic merge precondition test that merge throws or fails when dimensions
differ (use Count_Min_Sketch::from_error_bounds and merge with mismatched
width/depth); for HyperLogLog add constructor validation tests (e.g., invalid
register bits) and tests for estimate on empty and singleton streams using
HyperLogLog::update and estimate; for MinHash add tests for empty sets,
singleton sets and deterministic similarity for identical singletons using
MinHash::update and similarity; for SimHash add tests that empty documents and
singleton-word docs produce deterministic fingerprints via
SimHash::update/get_fingerprint and that similarity behaves for identical vs
disjoint inputs; ensure all tests use fixed seeds or deterministic inputs and
assert exact sizes/expected behavior or thrown std::domain_error where
preconditions are violated.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
- Examples/streaming_demo.cc
- Tests/math_nt_test.cc
- Tests/sort_utils.cc
- Tests/streaming_algorithms_test.cc
- ah-dry.H
- count-min-sketch.H
- hyperloglog.H
- modular_arithmetic.H
- modular_combinatorics.H
- modular_linalg.H
- ntt.H
- reservoir-sampling.H
- tpl_sort_utils.H
🚧 Files skipped from review as they are similar to previous changes (4)
- modular_linalg.H
- count-min-sketch.H
- hyperloglog.H
- reservoir-sampling.H
Refactor modular arithmetic, Pollard Rho, and various streaming algorithms for improved robustness, clarity, and error handling. Updates include noexcept specifiers, precondition checks, and optimized calculations. Also updates MinHash to include a merge method. Test updates to reflect these changes and add performance regression tests. README.md also updated to reflect new example structure.
Actionable comments posted: 6
♻️ Duplicate comments (1)
Tests/streaming_algorithms_test.cc (1)
89-90: ⚠️ Potential issue | 🟠 Major
Avoid collision-sensitive bound for unseen-key CMS assertion. `EXPECT_LT(..., 5u)` is still brittle for a sketch; assert a deterministic monotonic property instead.

✅ Deterministic alternative

```diff
-  EXPECT_LT(cms.estimate("dragonfruit"), 5u); // unseen key: might have collisions but should be low
+  const auto unseen_before = cms.estimate("dragonfruit");
+  cms.update("dragonfruit", 1);
+  EXPECT_GE(cms.estimate("dragonfruit"), unseen_before + 1u);
```
Verify each finding against the current code and only fix it if needed. In `@Tests/streaming_algorithms_test.cc` around lines 89 - 90, The brittle numeric bound on unseen-key ("dragonfruit") should be replaced with a deterministic monotonicity assertion: record size_t before = cms.estimate("dragonfruit"); perform a known update to that same key using the test's existing update method (e.g., cms.add or cms.increment — use the actual update function used elsewhere in this test file), then assert cms.estimate("dragonfruit") >= before (and optionally non-negative). Update the EXPECT_LT(cms.estimate("dragonfruit"), 5u) line to these monotonic checks referencing cms.estimate and the test's existing update method.
🧹 Nitpick comments (3)
pollard_rho.H (1)
105-127: Consider limiting recursion depth or using iteration for very large inputs. The `extract_prime_factors` function uses recursion, and for numbers with many small prime factors, this could lead to deep recursion. However, for practical 64-bit integers, the recursion depth is bounded by log₂(n) ≤ 64, which is safe.

The error handling via `ah_runtime_error_unless(false)` at line 123 is unconventional but functionally correct: it always throws with the given message.

💡 Optional: Consider using ah_runtime_error() for unconditional errors

```diff
-  ah_runtime_error_unless(false)
+  ah_runtime_error()
     << "pollard_rho: failed to find a factor of " << n
     << " after " << max_attempts << " attempts";
```

This is clearer in intent as an unconditional error.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pollard_rho.H` around lines 105 - 127, The unconditional error at the end of find_any_factor uses ah_runtime_error_unless(false) which is confusing in intent; replace that call with ah_runtime_error() and keep the same streamed message so the function unconditionally throws with a clear API (refer to find_any_factor and the error message block). Also review extract_prime_factors for deep recursion on pathological inputs and consider converting it to an iterative loop or adding a shallow recursion guard if you expect many small factors, but the immediate change is to swap ah_runtime_error_unless(false) to ah_runtime_error() in find_any_factor.

tpl_sort_utils.H (1)

5930-5933: Unsigned arithmetic relies on wraparound in edge case. When `hint == 0` and searching left, `ofs` can reach `max_ofs = 1`. Line 5931 then computes `hint - ofs = 0 - 1`, which underflows for `size_t`. The algorithm still produces correct results because line 5936's `++last_ofs` wraps it back to 0, but this relies on implicit unsigned wraparound semantics.

Consider adding a comment documenting this intentional reliance on wraparound, or refactoring to avoid the subtlety:

```cpp
// Note: When hint == 0 and ofs == 1, last_ofs underflows to SIZE_MAX.
// This is intentional: the subsequent ++last_ofs wraps it to 0,
// yielding the correct binary search bounds.
const size_t tmp = last_ofs;
last_ofs = hint - ofs;
ofs = hint - tmp;
```
Verify each finding against the current code and only fix it if needed. In `@tpl_sort_utils.H` around lines 5930 - 5933, The subtraction last_ofs = hint - ofs can intentionally underflow for size_t when hint == 0 and ofs == 1; either document this behavior with a clear comment referring to hint, ofs, last_ofs and tmp explaining that the subsequent ++last_ofs wraps to 0, or eliminate the reliance on unsigned wraparound by guarding the subtraction (e.g. if (hint < ofs) set last_ofs to SIZE_MAX - (ofs - hint) + 1 or switch to a signed difference) so the binary-search bounds computed by tmp, last_ofs, hint and ofs are correct without implicit wraparound.

Tests/sort_utils.cc (1)

2235-2254: Performance test threshold may be environment-sensitive. The 500ms threshold for sorting 100k integers may pass/fail inconsistently across different CI runners or developer machines. Consider either:
- Using a relative benchmark (compare against a baseline algorithm)
- Documenting the expected hardware characteristics
- Making the threshold configurable via environment variable
That said, 500ms for 100k integers is quite generous, so this is likely fine in practice.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Tests/sort_utils.cc` around lines 2235 - 2254, The perf test TEST(Timsort, perf_large) uses a fixed 500ms threshold which can be flaky across environments; change the test to use a configurable threshold: read an environment variable (e.g., TIMSORT_MAX_MS) with a sensible default of 500, or compute a relative threshold by timing a baseline sort (e.g., std::sort on the same array) and asserting timsort is within a multiplier; update the assertion that uses duration and keep the correctness loop (ASSERT_LE on a(i-1), a(i)) unchanged; reference TEST(Timsort, perf_large), the N constant, the timsort call and the duration check when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@count-min-sketch.H`:
- Around line 67-68: The Count_Min_Sketch template currently allows negative
counters and silent arithmetic wraparound; update() and merge() must be
hardened: first constrain the CounterT template to an unsigned integral (add a
static_assert or use std::unsigned_integral for CounterT and/or change the
default to size_t) and then add runtime protections in
Count_Min_Sketch::update(...) and Count_Min_Sketch::merge(const
Count_Min_Sketch&) to reject or clamp negative deltas and to detect overflow
before performing additions (e.g., check if current_value >
std::numeric_limits<CounterT>::max() - delta and handle by capping to max or
returning an error/throwing), and replicate the same checks where counters are
incremented in the class (references: Count_Min_Sketch, update(), merge()).
- Around line 81-88: In the Count_Min_Sketch constructor (Count_Min_Sketch and
related overloads) validate the computed dimensions before casting/allocating:
ensure computed product (w_ * d_ or the values derived from epsilon/delta) is
finite, non-zero, and does not overflow size_t (i.e., check against
numeric_limits<size_t>::max() / other_dim), and reject or clamp inputs with a
clear ah_domain_error_if when invalid; only after these checks allocate table_ =
Array<CounterT>(w_ * d_, ...) and reserve seeds_. Apply the same pre-allocation
validation to the other constructor/logic referenced (the one converting
epsilon/delta to w/d) so floating-point-to-size_t conversions are checked for
non-finite, negative, zero, or overflow before casting or memory allocation.
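The overflow guard requested for `update()`/`merge()` can be sketched like this; `sat_add` is an illustrative stand-in, not the actual `Count_Min_Sketch` API.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <type_traits>

// Hypothetical sketch of the overflow check requested above: before
// adding a delta to a Count-Min counter, compare against the type's
// maximum and saturate instead of silently wrapping.
template <typename CounterT>
CounterT sat_add(CounterT current, CounterT delta)
{
  static_assert(std::is_unsigned_v<CounterT>,
                "CMS counters should be unsigned");
  if (current > std::numeric_limits<CounterT>::max() - delta)
    return std::numeric_limits<CounterT>::max();   // cap instead of wrap
  return current + delta;
}
```

Saturating keeps the CMS over-estimate invariant (estimates never drop below the true count), whereas wraparound would silently violate it.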
In `@Examples/random_select_example.cc`:
- Around line 59-63: There is an unmatched extra closing brace after the end of
print_array that causes a compilation error; remove the stray `}` following the
print_array closing brace so the function's scope ends correctly and main()
starts immediately (verify the braces around print_array and main are balanced
and that the only closing brace for print_array is the one currently at the end
of that function).
In `@Examples/streaming_demo.cc`:
- Line 39: Replace uses of std::unordered_map / std::unordered_set in
Examples/streaming_demo.cc with the Aleph equivalents (DynHashMap and
DynHashSet): add the Aleph header include(s) for DynHashMap/DynHashSet, change
the type declarations and any construction/usage sites (e.g., the include at the
top that currently references <unordered_map> and all usages around the blocks
noted (around lines referenced in the review) to use DynHashMap<Key,Value> and
DynHashSet<Key> instead, and update any namespace qualifiers if needed) so the
example follows the project container policy; ensure template parameters and
APIs are adapted if Aleph's containers use different constructor or method
names.
In `@Tests/math_nt_test.cc`:
- Around line 232-251: The test MathNT.ModularLinalg currently computes products
using inv_mat and the original matrix entries (inv_mat * m_data), so the comment
"Check A * A^-1 = I" is misleading; update the comment to "Check A^-1 * A = I"
or change the product order to multiply the original matrix (mat or m_data) on
the left and inv_mat on the right so the code actually verifies A * A^-1 = I;
adjust the four EXPECT_EQ lines accordingly to reference the correct
multiplication order involving inv_mat and m_data (or mat) to match the chosen
comment.
In `@Tests/streaming_algorithms_test.cc`:
- Around line 37-39: Replace uses of std::vector and std::set in
Tests/streaming_algorithms_test.cc with the Aleph equivalents across the noted
locations (lines around 37-39, 52-53, 68-70, and 138-140): find every
declaration/usage of std::vector and std::set in this test file and change them
to the corresponding Aleph container types (e.g., Aleph::vector / Aleph::set or
the project-specific container names used elsewhere in the repo), adjust any
construction/iteration APIs if needed to match Aleph semantics, and run the
tests to ensure compilation and behavior remain unchanged.
---
Duplicate comments:
In `@Tests/streaming_algorithms_test.cc`:
- Around line 89-90: The brittle numeric bound on unseen-key ("dragonfruit")
should be replaced with a deterministic monotonicity assertion: record size_t
before = cms.estimate("dragonfruit"); perform a known update to that same key
using the test's existing update method (e.g., cms.add or cms.increment — use
the actual update function used elsewhere in this test file), then assert
cms.estimate("dragonfruit") >= before (and optionally non-negative). Update the
EXPECT_LT(cms.estimate("dragonfruit"), 5u) line to these monotonic checks
referencing cms.estimate and the test's existing update method.
---
Nitpick comments:
In `@pollard_rho.H`:
- Around line 105-127: The unconditional error at the end of find_any_factor
uses ah_runtime_error_unless(false) which is confusing in intent; replace that
call with ah_runtime_error() and keep the same streamed message so the function
unconditionally throws with a clear API (refer to find_any_factor and the error
message block). Also review extract_prime_factors for deep recursion on
pathological inputs and consider converting it to an iterative loop or adding a
shallow recursion guard if you expect many small factors, but the immediate
change is to swap ah_runtime_error_unless(false) to ah_runtime_error() in
find_any_factor.
In `@Tests/sort_utils.cc`:
- Around line 2235-2254: The perf test TEST(Timsort, perf_large) uses a fixed
500ms threshold which can be flaky across environments; change the test to use a
configurable threshold: read an environment variable (e.g., TIMSORT_MAX_MS) with
a sensible default of 500, or compute a relative threshold by timing a baseline
sort (e.g., std::sort on the same array) and asserting timsort is within a
multiplier; update the assertion that uses duration and keep the correctness
loop (ASSERT_LE on a(i-1), a(i)) unchanged; reference TEST(Timsort, perf_large),
the N constant, the timsort call and the duration check when making the change.
In `@tpl_sort_utils.H`:
- Around line 5930-5933: The subtraction last_ofs = hint - ofs can intentionally
underflow for size_t when hint == 0 and ofs == 1; either document this behavior
with a clear comment referring to hint, ofs, last_ofs and tmp explaining that
the subsequent ++last_ofs wraps to 0, or eliminate the reliance on unsigned
wraparound by guarding the subtraction (e.g. if (hint < ofs) set last_ofs to
SIZE_MAX - (ofs - hint) + 1 or switch to a signed difference) so the
binary-search bounds computed by tmp, last_ofs, hint and ofs are correct without
implicit wraparound.
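The guarded alternative this nitpick describes can be sketched as follows; the helper and the `hint`/`ofs`/`last_ofs` names are hypothetical illustrations of the idea, not the library's gallop code. The original computes `last_ofs = hint - ofs` and relies on unsigned wraparound plus a later `++last_ofs` when `ofs == hint + 1`; the sketch produces the same half-open bounds with an explicit branch so no intermediate value underflows.

```cpp
#include <cassert>
#include <cstddef>

// Half-open search window [lo, hi) measured in absolute positions.
struct GallopBounds { std::size_t lo, hi; };

// Assumes last_ofs <= ofs <= hint + 1, as the gallop loop guarantees.
GallopBounds remap_bounds(std::size_t hint, std::size_t ofs,
                          std::size_t last_ofs)
{
  const std::size_t hi = hint - last_ofs;                  // last_ofs <= hint
  const std::size_t lo = ofs > hint ? 0 : hint - ofs + 1;  // guarded branch
  return {lo, hi};
}
```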
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (20)
- Examples/random_select_example.cc
- Examples/streaming_demo.cc
- README.md
- Tests/ah-comb.cc
- Tests/math_nt_test.cc
- Tests/sort_utils.cc
- Tests/streaming_algorithms_test.cc
- ah-dry.H
- count-min-sketch.H
- hyperloglog.H
- minhash.H
- modular_arithmetic.H
- modular_combinatorics.H
- modular_linalg.H
- ntt.H
- pollard_rho.H
- primality.H
- reservoir-sampling.H
- simhash.H
- tpl_sort_utils.H
🚧 Files skipped from review as they are similar to previous changes (4)
- Tests/ah-comb.cc
- hyperloglog.H
- ah-dry.H
- modular_combinatorics.H
Pull request overview
Copilot reviewed 30 out of 30 changed files in this pull request and generated 8 comments.
Comments suppressed due to low confidence (1)
tpl_sort_utils.H:5600
- The floating-point `bucket_sort` lambda captures `max_val`, but it is never used inside the lambda. This can trigger `-Wunused-lambda-capture` under stricter warning settings; remove the unused capture (or use it if intended).
auto key = [min_val, max_val, range, num_buckets, cmp](const T & val) -> size_t
{
auto b = static_cast<size_t>((val - min_val) / range *
static_cast<T>(num_buckets - 1));
if (b >= num_buckets)
…ng tests

Refactor sorting algorithms in tpl_sort_utils.H to improve clarity, performance, and add new overloads. Enhance Count-Min Sketch for robustness against overflow and add new error bounds validation. Update reservoir sampling for better type handling. Convert vector and set usage to DynArray and DynHashSet in streaming algorithm tests. Adjust random_select example and modular_linalg for type consistency.

- Refactor `Count_Min_Sketch` to handle counter overflow, improve error bounds validation, and use `std::unsigned_integral` for `CounterT`.
- Update `Reservoir_Sampler` to use `std::iterator_traits` for better iterator compatibility.
- Improve `tpl_sort_utils.H` with more detailed Doxygen comments and add new `timsort` overloads for different container types (DynArray, Array, C-array, DynList) and range-based sorting.
- Add overloads for `counting_sort` and `radix_sort` to support various container types (DynArray, Array, DynList, DynDlist, raw pointers, C-arrays) with proper constraint checks.
- Enhance `binary_search` and `build_index` utilities with additional overloads for pointer types and improved documentation.
- Update `sort_utils_test.cc` to use `DynArray` and `DynHashSet` instead of `std::vector` and `std::set` in streaming tests.
- Fix logic for checking the inverse in `math_nt_test.cc` from `A * A^-1 = I` to `A^-1 * A = I`.
- Update `random_select_example.cc` and `streaming_demo.cc` to reflect API changes and use `DynArray` and `DynHashMap`.
Actionable comments posted: 7
♻️ Duplicate comments (1)
Tests/streaming_algorithms_test.cc (1)
90-90: ⚠️ Potential issue | 🟡 Minor
Avoid absolute upper-bound assertions for unseen CMS keys.
`EXPECT_LT(cms.estimate("dragonfruit"), 5u)` still depends on collision/layout specifics and is brittle. Prefer a contract-based deterministic check.
💡 Proposed fix
- EXPECT_LT(cms.estimate("dragonfruit"), 5u); // unseen key: might have collisions but should be low
+ const auto unseen_before = cms.estimate("dragonfruit");
+ cms.update("dragonfruit", 1);
+ EXPECT_GE(cms.estimate("dragonfruit"), unseen_before + 1u);
As per coding guidelines: "Tests must verify algorithmic correctness, not just absence of crashes" and "Tests must be deterministic and repeatable".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Tests/streaming_algorithms_test.cc` at line 90, The test's brittle absolute upper-bound should be replaced with a deterministic contract: assert that cms.estimate("dragonfruit") honors CMS invariants (it must be >= the true count 0 and <= the total number of inserted items). Replace the EXPECT_LT(cms.estimate("dragonfruit"), 5u) check with two assertions using cms.estimate — e.g., EXPECT_GE(cms.estimate("dragonfruit"), 0u) and EXPECT_LE(cms.estimate("dragonfruit"), total_inserted_count) — where total_inserted_count is the known sum of all increments used earlier in the test; this uses cms.estimate as the unique symbol to locate the assertion to change.
🧹 Nitpick comments (3)
tpl_hash.H (1)
134-135: Optional: preserve full template configurability in the aliases.
If you want `Dyn*` aliases to fully mirror `HashSet`/`HashMap` flexibility, expose the table/comparator template parameters too.
♻️ Suggested refactor
- template <typename T>
- using DynHashSet = HashSet<T>;
+ template <typename T,
+           template <typename, class> class HashSetTable = ODhashTable,
+           class Cmp = Aleph::equal_to<T>>
+ using DynHashSet = HashSet<T, HashSetTable, Cmp>;
- template <typename K, typename V>
- using DynHashMap = HashMap<K, V>;
+ template <typename K, typename V,
+           template <typename, typename, class> class HashMapTable = MapODhash,
+           class Cmp = Aleph::equal_to<K>>
+ using DynHashMap = HashMap<K, V, HashMapTable, Cmp>;
Also applies to: 138-139
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tpl_hash.H` around lines 134 - 135, The Dyn* type aliases currently fix template parameters by aliasing template <typename T> using DynHashSet = HashSet<T>; (and the similar DynHashMap aliases at the other spots), which loses HashSet/HashMap's table/comparator configurability; change these aliases to forward all template parameters (e.g., template <typename K, typename Table = /*default*/, typename Cmp = /*default*/> using DynHashSet = HashSet<K, Table, Cmp> and similarly for DynHashMap) so the aliases mirror the original templates' full parameter lists (update the aliases named DynHashSet and the DynHashMap aliases referenced near the other lines).
tpl_sort_utils.H (1)
5880-5902: Nice to have: deduplicate adjacent Doxygen blocks for the same symbol.
There are repeated back-to-back doc blocks for `gallop_left`, `gallop_right`, and `timsort_impl`. Keeping one canonical block per symbol will reduce doc maintenance noise.
As per coding guidelines: "Doxygen must generate no warnings for changed files".
Also applies to: 5952-5970, 6174-6199
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tpl_sort_utils.H` around lines 5880 - 5902, The file contains duplicated adjacent Doxygen blocks for the same symbols (e.g., gallop_left, gallop_right, timsort_impl); remove the redundant consecutive comment blocks and keep a single canonical Doxygen block per symbol that includes all required tags (`@tparam`, `@param`, `@return`) and the full descriptive text. Locate the duplicate blocks around the symbols named gallop_left, gallop_right and timsort_impl and merge their content into one doc comment each (preserve the most complete wording), then delete the extra block(s) so only one Doxygen comment immediately precedes each function declaration/definition to avoid duplicate-doc warnings.
Examples/streaming_demo.cc (1)
105-105: Use a semantically accurate name for the set.
`unique_map` is a `DynHashSet<string>`. Renaming to `unique_set` improves readability.
As per coding guidelines: "Use descriptive variable and function names."
♻️ Proposed rename
- DynHashSet<string> unique_map;
+ DynHashSet<string> unique_set;
@@
- unique_map.insert(s);
+ unique_set.insert(s);
@@
- cout << "Estimated unique elements: " << fixed << setprecision(1) << hll.estimate() << " (Actual: " << unique_map.size() << ")\n";
+ cout << "Estimated unique elements: " << fixed << setprecision(1) << hll.estimate() << " (Actual: " << unique_set.size() << ")\n";
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Examples/streaming_demo.cc` at line 105, Rename the variable unique_map (a DynHashSet<string>) to a semantically accurate name such as unique_set throughout the file and any related scopes; update all references to unique_map to the new identifier to reflect that it is a set (e.g., change declarations, usages, and any comments referencing unique_map) so the symbol DynHashSet<string> is paired with a clear, descriptive name like unique_set.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@Examples/streaming_demo.cc`:
- Around line 112-113: The cout stream is left in fixed/precision mode after
printing HyperLogLog (hll.estimate()), which corrupts later MinHash/other
numeric outputs; after the line that prints hll.estimate() (and before
subsequent prints around the MinHash output), restore the stream state by either
calling cout.unsetf(ios::floatfield) and resetting precision to the previous
value or by saving/restoring the stream state with ios::fmtflags and precision;
locate the HyperLogLog print (uses hll.estimate()) and the MinHash print sites
to wrap or restore the formatting so subsequent numeric output is not rounded.
In `@modular_linalg.H`:
- Around line 203-205: The subtraction (get_val(...)+mod-term)%mod in the
elimination step overflows for large uint64_t moduli; update the code paths that
compute 'term' via mod_mul and then set_val(a, j, k, ...) to perform
branch-based modular subtraction instead: load current = get_val(a, j, k), if
current >= term then new = current - term else new = current + mod - term, and
pass new to set_val; apply the same change to all similar sites (the elimination
loops using get_val/set_val and term from mod_mul) to ensure overflow-safe
results.
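The branch-based modular subtraction described above can be sketched as a small free function; `mod_sub` is an illustrative helper, not the `modular_linalg.H` API.

```cpp
#include <cassert>
#include <cstdint>

// Computes (a - b) mod m without ever forming a + m, which can overflow
// uint64_t when m is close to 2^64.  Assumes a < m and b < m.
std::uint64_t mod_sub(std::uint64_t a, std::uint64_t b, std::uint64_t m)
{
  // When a < b, the result a + (m - b) equals m - (b - a) < m, so the
  // addition cannot wrap even for m near the top of the uint64_t range.
  return a >= b ? a - b : a + (m - b);
}
```

In the elimination loop this replaces `(get_val(...) + mod - term) % mod` with `mod_sub(get_val(a, j, k), term, mod)`.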
In `@Tests/math_nt_test.cc`:
- Around line 152-184: Remove or disable hard wall-clock assertions in the
TEST(MathNT, PerformanceRegression) test: keep the correctness checks that call
pollard_rho(n) and NTT<>::multiply(a,b) and their EXPECT_EQ/EXPECT_TRUE
assertions, but remove the two EXPECT_LE(duration, ...) timing assertions (the
ones after pollard_rho and after NTT<>::multiply) or relocate them to a
dedicated benchmark/perf test; alternatively, guard those timing checks behind
an explicit opt-in (e.g., an environment flag like RUN_PERF_TESTS) so normal CI
runs omit the flaky wall-clock checks.
- Around line 123-133: The test assumes pollard_rho returns factors in a fixed
order but it can validly return factors in any order; update the assertions in
Tests/math_nt_test.cc to compare order-insensitively by converting the result of
pollard_rho(...) and the expected Array<uint64_t> (exp12, exp97, exp_big) into
comparable multisets or by sorting both containers before calling EXPECT_EQ (or
using EXPECT_EQ on sorted vectors) so the checks for pollard_rho(12),
pollard_rho(97), and pollard_rho(1000000007ULL * 97ULL) pass regardless of
factor order.
In `@Tests/sort_utils.cc`:
- Around line 2236-2252: The TEST(Timsort, perf_large) currently asserts a hard
wall-clock (EXPECT_LE(duration, 500)) which makes CI flaky; change the test to
make the performance check opt-in: remove or disable the hard assertion and
instead gate the timing assertion behind an environment/config flag (e.g. check
getenv("ENABLE_PERF_TESTS") or a command-line test parameter) so that when the
flag is set you assert EXPECT_LE(duration, 500) for timsort otherwise just log
or SUCCEED() with the measured duration; ensure you reference the same symbols
(TEST(Timsort, perf_large), N, timsort, duration) when implementing the
conditional so behavior is unchanged when the opt-in flag is active.
In `@tpl_sort_utils.H`:
- Around line 6312-6323: The timsort function currently does pointer arithmetic
with a + l before validating that l and r are non-negative; add a bounds check
at the start of timsort(T *a, const long l, const long r, const Compare & cmp)
that rejects negative indices (e.g., if (l < 0 || r < 0)
ah_invalid_argument_if(true) << "timsort(): negative bounds ...") and ensure
this check runs before any use of a or a + l; keep the existing empty-range
check (l >= r) and the existing null-pointer check (ah_invalid_argument_if(a ==
nullptr)) but move or add the negative-range validation so
timsort_detail::timsort_impl is only called with safe, non-negative size_t
length.
- Around line 5800-5807: The listed Timsort helper templates (reverse_range,
binary_insertion_sort, gallop_left, gallop_right and other timsort_detail
helpers that call user cmp, std::move or std::swap) must not be marked
unconditional noexcept; change their exception specification to a conditional
one that reflects whether the comparator and moves/swaps are nothrow (e.g.
noexcept(noexcept(cmp(std::declval<T&>(), std::declval<T&>())) &&
std::is_nothrow_swappable_v<T>()) for functions using cmp+swap, or include
std::is_nothrow_move_constructible_v<T> when using std::move), and remove the
plain noexcept from reverse_range and the other identified functions so
exception propagation is preserved.
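The conditional noexcept this comment proposes can be sketched with a hypothetical helper; `cmp_and_swap` stands in for the `timsort_detail` functions that call a user comparator plus `std::swap`, and is not the library's actual code.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Nothrow only when both the user comparator and swap are nothrow, so
// exceptions from a throwing comparator still propagate normally.
template <typename T, typename Compare>
void cmp_and_swap(T & a, T & b, Compare & cmp)
  noexcept(noexcept(cmp(std::declval<T &>(), std::declval<T &>()))
           and std::is_nothrow_swappable_v<T>)
{
  if (cmp(b, a))      // place the smaller element first
    std::swap(a, b);
}
```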
---
Duplicate comments:
In `@Tests/streaming_algorithms_test.cc`:
- Line 90: The test's brittle absolute upper-bound should be replaced with a
deterministic contract: assert that cms.estimate("dragonfruit") honors CMS
invariants (it must be >= the true count 0 and <= the total number of inserted
items). Replace the EXPECT_LT(cms.estimate("dragonfruit"), 5u) check with two
assertions using cms.estimate — e.g., EXPECT_GE(cms.estimate("dragonfruit"), 0u)
and EXPECT_LE(cms.estimate("dragonfruit"), total_inserted_count) — where
total_inserted_count is the known sum of all increments used earlier in the
test; this uses cms.estimate as the unique symbol to locate the assertion to
change.
---
Nitpick comments:
In `@Examples/streaming_demo.cc`:
- Line 105: Rename the variable unique_map (a DynHashSet<string>) to a
semantically accurate name such as unique_set throughout the file and any
related scopes; update all references to unique_map to the new identifier to
reflect that it is a set (e.g., change declarations, usages, and any comments
referencing unique_map) so the symbol DynHashSet<string> is paired with a clear,
descriptive name like unique_set.
In `@tpl_hash.H`:
- Around line 134-135: The Dyn* type aliases currently fix template parameters
by aliasing template <typename T> using DynHashSet = HashSet<T>; (and the
similar DynHashMap aliases at the other spots), which loses HashSet/HashMap's
table/comparator configurability; change these aliases to forward all template
parameters (e.g., template <typename K, typename Table = /*default*/, typename
Cmp = /*default*/> using DynHashSet = HashSet<K, Table, Cmp> and similarly for
DynHashMap) so the aliases mirror the original templates’ full parameter lists
(update the aliases named DynHashSet and the DynHashMap aliases referenced near
the other lines).
In `@tpl_sort_utils.H`:
- Around line 5880-5902: The file contains duplicated adjacent Doxygen blocks
for the same symbols (e.g., gallop_left, gallop_right, timsort_impl); remove the
redundant consecutive comment blocks and keep a single canonical Doxygen block
per symbol that includes all required tags (`@tparam`, `@param`, `@return`) and the
full descriptive text. Locate the duplicate blocks around the symbols named
gallop_left, gallop_right and timsort_impl and merge their content into one doc
comment each (preserve the most complete wording), then delete the extra
block(s) so only one Doxygen comment immediately precedes each function
declaration/definition to avoid duplicate-doc warnings.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- Examples/random_select_example.cc
- Examples/streaming_demo.cc
- Tests/math_nt_test.cc
- Tests/sort_utils.cc
- Tests/streaming_algorithms_test.cc
- count-min-sketch.H
- modular_linalg.H
- reservoir-sampling.H
- tpl_hash.H
- tpl_sort_utils.H
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
ah-dry.H (1)
377-383: ⚠️ Potential issue | 🟡 Minor
Complete required Doxygen tags on touched overload docs.
Line 377 updates this public overload's docs, but the block still omits explicit `@brief` and `@throws` tags (same pattern appears in other touched overload-doc blocks in this patch).
📝 Suggested doc pattern
- /** Overload of `find_ptr(Operation&) const` that accepts rvalues.
+ /** @brief Overload of `find_ptr(Operation&) const` that accepts rvalues.
  /// The operation is passed (invoked) as a named lvalue to the
  /// lvalue-reference overload rather than being perfectly forwarded.
@@
  @param[in] operation matching criteria
  @return const pointer to matching element or nullptr
+ @throws Anything propagated by `operation` or traversal/iterator operations.
  */
As per coding guidelines: "All public member functions must have `@brief`, `@param`, `@return`, and `@throws` Doxygen tags."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ah-dry.H` around lines 377 - 383, The Doxygen block for the rvalue overload of find_ptr is missing required tags; update the comment for the overload (the one describing find_ptr(Operation&) const that accepts rvalues) to include an explicit `@brief` describing the purpose, add the existing `@param` and `@return` tags if not already present, and add an `@throws` tag documenting any exceptions the function may propagate (or `@throws none` if it never throws). Apply the same `@brief`/`@throws` addition pattern to the other overload-doc blocks touched in this patch so all public member overloads comply with the `@brief`, `@param`, `@return`, and `@throws` requirement.
♻️ Duplicate comments (1)
tpl_sort_utils.H (1)
6234-6239: ⚠️ Potential issue | 🟠 Major
`timsort(T*, long, long, ...)` skips null-contract validation for singleton ranges.
With current ordering, `l == r` returns before checking `a == nullptr`. This diverges from the pointer-length overload contract and weakens API consistency.
Proposed fix
  ah_range_error_if(l < 0 or r < 0)
    << "timsort(): negative bounds [" << l << ", " << r
    << "] are not allowed";
- if (l >= r)
-   return;
-
- ah_invalid_argument_if(a == nullptr)
+ ah_invalid_argument_if(a == nullptr and l <= r)
    << "timsort(): null pointer contract violation for non-empty range ["
    << l << ", " << r << "]; cannot call timsort_detail::timsort_impl()";
+
+ if (l >= r)
+   return;
As per coding guidelines: "Validate external inputs before processing" and "Check boundary conditions and off-by-one errors".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tpl_sort_utils.H` around lines 6234 - 6239, The null-pointer contract check for timsort is skipped for singleton ranges because the early-return "if (l >= r) return;" precedes the ah_invalid_argument_if(a == nullptr) validation; move the ah_invalid_argument_if(a == nullptr) check (the null-pointer validation that emits "timsort(): null pointer contract violation...") to occur before the range-size early return in the timsort(T*, long, long, ...) entry (the overload that calls timsort_detail::timsort_impl) so that a == nullptr is validated for all calls, including l == r cases.
🧹 Nitpick comments (3)
hyperloglog.H (1)
67-69: Nice to have: constrain `T` with a hashability concept for clearer compile-time diagnostics.
`HyperLogLog<T>` currently accepts any `T`, but it requires `murmur3hash(const T&, seed)` to exist.
As per coding guidelines: "Use concepts (C++20) to constrain template parameters; avoid SFINAE when possible".
♻️ Proposed refactor
+# include <concepts>
...
namespace Aleph
{
+ template <typename U>
+ concept MurmurHashable = requires(const U & v)
+ {
+   { murmur3hash(v, 0x12345678) } -> std::convertible_to<size_t>;
+ };
+
- template <typename T>
+ template <MurmurHashable T>
  class HyperLogLog
Also applies to: 78-83
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hyperloglog.H` around lines 67 - 69, The template HyperLogLog<T> currently accepts any T but implicitly requires a murmur3hash(const T&, uint32_t) function; add a C++20 concept (e.g., Hashable<T> or Murmur3Hashable<T>) that expresses the requirement for a callable murmur3hash(const T&, std::uint32_t) (or equivalent signature used in your code) and constrain the class template with it (e.g., template<typename T> requires Hashable<T> class HyperLogLog). Update the other templated declarations around the same region (lines referenced 78-83) to use the same concept so compile-time diagnostics are clear and consistent, and ensure the concept name and signature match the existing murmur3hash usage in the file.
ah-dry.H (1)
132-170: Constrain template parameters with C++20 concepts to improve diagnostics.
The new `traverse` free functions (lines 132–170) and the `EqualSequenceMethod` template (lines 1607–1621) lack concept constraints on their template parameters. Applying concepts would:
- Clarify expected capabilities (Container must provide `traverse(Operation&)`, Operation must be invocable as `bool(Item_Type &)`)
- Provide better compiler diagnostics on template instantiation failure
- Align with the codebase pattern established in `ah-ranges.H`, `ahFunctional.H`, and `graph-dry.H`
This follows the coding guideline: "Use concepts (C++20) to constrain template parameters; avoid SFINAE when possible."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ah-dry.H` around lines 132 - 170, Constrain the two free-function templates traverse and the EqualSequenceMethod template with C++20 concepts: require that Container provides a member traverse(Operation&) and that Operation is invocable with the container's element type as bool(Item_Type&). Specifically, add requires-clauses (or use a named concept) to the templates for traverse(const Container&, Operation&), traverse(const Container&, Operation&&) and EqualSequenceMethod so instantiation fails with clear diagnostics when Container::traverse(op) is not available or when Operation cannot be called as bool(Item_Type&); update the noexcept expressions if necessary to reference the constrained parameter names.
Tests/sort_utils.cc (1)
2283-2307: Remove the redundant `ENABLE_PERF_TESTS` lookup in `perf_large`.
Line 2283 already skips when perf tests are disabled, so the second check at line 2300 is unnecessary and can be flattened for readability.
♻️ Proposed cleanup
TEST(Timsort, perf_large)
{
-   if (not std::getenv("ENABLE_PERF_TESTS"))
+   const char * perf_enabled = std::getenv("ENABLE_PERF_TESTS");
+   if (not perf_enabled)
      GTEST_SKIP() << "Skipping perf test (set ENABLE_PERF_TESTS to enable)";
@@
-   if (std::getenv("ENABLE_PERF_TESTS"))
-     {
-       long max_ms = 500;
-       if (const char * env_ms = std::getenv("TIMSORT_MAX_MS"))
-         max_ms = std::atol(env_ms);
-
-       EXPECT_LE(duration, max_ms) << "Timsort performance regression detected";
-     }
+   long max_ms = 500;
+   if (const char * env_ms = std::getenv("TIMSORT_MAX_MS"))
+     max_ms = std::atol(env_ms);
+   EXPECT_LE(duration, max_ms) << "Timsort performance regression detected";
Verify each finding against the current code and only fix it if needed. In `@Tests/sort_utils.cc` around lines 2283 - 2307, The second getenv check in the perf_large test is redundant because the test already returns early when ENABLE_PERF_TESTS is unset; remove the outer "if (std::getenv(\"ENABLE_PERF_TESTS\")) { ... }" block (the one that wraps max_ms and EXPECT_LE) and unindent its body so the max_ms logic and the EXPECT_LE(duration, max_ms) assertion run unconditionally after computing duration; keep the TIMSORT_MAX_MS lookup (const char* env_ms = std::getenv("TIMSORT_MAX_MS")) and the variables N, a, timsort, duration, max_ms and the EXPECT_LE assertion unchanged otherwise.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@hyperloglog.H`:
- Around line 107-110: Add explicit Doxygen `@pre` tags for the constructor and
merge() method: in the HyperLogLog constructor comment (the block documenting
the constructor that currently describes parameter b and domain_error) add "@pre
4 <= b && b <= 16" (or "@pre 4 <= b <= 16") to document the required range; in
the merge() method comment (the docblock around the merge(const HyperLogLog&
other) function referenced at lines ~181-189) add "@pre b_ == other.b_" to state
the required matching precision. Ensure the new `@pre` lines are placed alongside
the existing `@param/`@throws notes in their respective Doxygen comment blocks so
the preconditions are explicit.
In `@tpl_sort_utils.H`:
- Around line 975-977: The insertion_sort noexcept is unsafe because the body
currently uses copy ops but the noexcept checks only move-assignability; change
the algorithm to use move semantics and update the noexcept to reflect move
construction/assignment: inside insertion_sort (function name insertion_sort in
tpl_sort_utils.H) create the temporary via move (e.g., T tmp = std::move(a[i]))
and perform shifts with move-assign (a[j] = std::move(a[j-1])) and place tmp
with move (a[j] = std::move(tmp)); then change the noexcept expression to
require std::is_nothrow_move_constructible_v<T> and
std::is_nothrow_move_assignable_v<T> (keep the cmp noexcept check) so the
declared noexcept matches the operations performed.
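The move-based rewrite described in this prompt can be sketched as follows; `insertion_sort_moves` is an illustrative stand-in for the library's `insertion_sort` (it uses `operator<` directly rather than a comparator parameter, for brevity), with a noexcept specification that matches the operations actually performed.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <utility>

// Hoist each element once via move, shift with move-assignment, and
// place with move; noexcept covers move construction, move assignment,
// and the comparison actually used.
template <typename T>
void insertion_sort_moves(T * a, std::size_t n)
  noexcept(std::is_nothrow_move_constructible_v<T>
           and std::is_nothrow_move_assignable_v<T>
           and noexcept(std::declval<T &>() < std::declval<T &>()))
{
  for (std::size_t i = 1; i < n; ++i)
    {
      T tmp = std::move(a[i]);           // hoist the element once
      std::size_t j = i;
      for (; j > 0 and tmp < a[j - 1]; --j)
        a[j] = std::move(a[j - 1]);      // shift with move-assignment
      a[j] = std::move(tmp);             // place with move
    }
}
```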
---
Outside diff comments:
In `@ah-dry.H`:
- Around line 377-383: The Doxygen block for the rvalue overload of find_ptr is
missing required tags; update the comment for the overload (the one describing
find_ptr(Operation&) const that accepts rvalues) to include an explicit `@brief`
describing the purpose, add the existing `@param` and `@return` tags if not already
present, and add an `@throws` tag documenting any exceptions the function may
propagate (or `@throws none` if it never throws). Apply the same `@brief`/`@throws`
addition pattern to the other overload-doc blocks touched in this patch so all
public member overloads comply with the `@brief`, `@param`, `@return`, and `@throws`
requirement.
---
Duplicate comments:
In `@tpl_sort_utils.H`:
- Around line 6234-6239: The null-pointer contract check for timsort is skipped
for singleton ranges because the early-return "if (l >= r) return;" precedes the
ah_invalid_argument_if(a == nullptr) validation; move the
ah_invalid_argument_if(a == nullptr) check (the null-pointer validation that
emits "timsort(): null pointer contract violation...") to occur before the
range-size early return in the timsort(T*, long, long, ...) entry (the overload
that calls timsort_detail::timsort_impl) so that a == nullptr is validated for
all calls, including l == r cases.
---
Nitpick comments:
In `@ah-dry.H`:
- Around line 132-170: Constrain the two free-function templates traverse and
the EqualSequenceMethod template with C++20 concepts: require that Container
provides a member traverse(Operation&) and that Operation is invocable with the
container's element type as bool(Item_Type&). Specifically, add requires-clauses
(or use a named concept) to the templates for traverse(const Container&,
Operation&), traverse(const Container&, Operation&&) and EqualSequenceMethod so
instantiation fails with clear diagnostics when Container::traverse(op) is not
available or when Operation cannot be called as bool(Item_Type&); update the
noexcept expressions if necessary to reference the constrained parameter names.
In `@hyperloglog.H`:
- Around line 67-69: The template HyperLogLog<T> currently accepts any T but
implicitly requires a murmur3hash(const T&, uint32_t) function; add a C++20
concept (e.g., Hashable<T> or Murmur3Hashable<T>) that expresses the requirement
for a callable murmur3hash(const T&, std::uint32_t) (or equivalent signature
used in your code) and constrain the class template with it (e.g.,
template<typename T> requires Hashable<T> class HyperLogLog). Update the other
templated declarations around the same region (lines referenced 78-83) to use
the same concept so compile-time diagnostics are clear and consistent, and
ensure the concept name and signature match the existing murmur3hash usage in
the file.
In `@Tests/sort_utils.cc`:
- Around line 2283-2307: The second getenv check in the perf_large test is
redundant because the test already returns early when ENABLE_PERF_TESTS is
unset; remove the outer "if (std::getenv(\"ENABLE_PERF_TESTS\")) { ... }" block
(the one that wraps max_ms and EXPECT_LE) and unindent its body so the max_ms
logic and the EXPECT_LE(duration, max_ms) assertion run unconditionally after
computing duration; keep the TIMSORT_MAX_MS lookup (const char* env_ms =
std::getenv("TIMSORT_MAX_MS")) and the variables N, a, timsort, duration, max_ms
and the EXPECT_LE assertion unchanged otherwise.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- Tests/sort_utils.cc
- Tree_Decomposition.H
- ah-dry.H
- hyperloglog.H
- tpl_sort_utils.H
🚧 Files skipped from review as they are similar to previous changes (1)
- Tree_Decomposition.H
Pull request overview
Copilot reviewed 40 out of 40 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (3)
Tree_Decomposition.H:200
- The standalone scope block starts on the same line as the preceding statement (`id_to_node_ = ...; {`), which makes the control flow harder to read and is inconsistent with the surrounding brace style. Please move the `{ ... }` to its own line (or remove the extra scope if it is not needed) to keep formatting clear and consistent.

```cpp
id_to_node_ = Array<Node *>::create(n_); {
  MapOLhash<Node *, size_t> tmp(n_);
  node_to_id_.swap(tmp);
```
htlist.H:1140
- There is an extra leading space before the Doxygen comment on the new copy constructor, which makes it misaligned with the other iterator comments in this section. Align the comment indentation with the surrounding `//!\~English ...` lines for consistency.

```cpp
//!\~English Copy constructor \~Spanish Constructor de copia
Iterator(const Iterator &) = default;
```
Examples/random_select_example.cc:55
- The helper `print_array` is indented as if it were inside a block (the `template` line starts with extra whitespace) even though it is at global scope. This looks unintentional and makes the file formatting inconsistent; please left-align it with the other top-level declarations.

```cpp
template <typename Container>
void print_array(const char * label, const Container & a)
{
  std::cout << std::left << std::setw(15) << label << ": [";
  for (size_t i = 0; i < a.size(); ++i)
    {
      std::cout << a[i];
```
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tpl_sort_utils.H (1)
1181-1189: ⚠️ Potential issue | 🟠 Major

Must fix: `merge()` is not stable on equal keys.

On ties, the merge picks from the right partition (the `else` branch), which reorders equal elements across halves and violates the documented stable behavior of merge sort.

💡 Proposed fix

```diff
- for (long k = l; k <= r; ++k)
-   if (cmp(buf[i], buf[j]))
-     a[k] = std::move(buf[i++]);
-   else
-     a[k] = std::move(buf[j--]);
+ for (long k = l; k <= r; ++k)
+   if (cmp(buf[j], buf[i]))
+     a[k] = std::move(buf[j--]);
+   else
+     a[k] = std::move(buf[i++]); // left side wins ties => stable
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tpl_sort_utils.H` around lines 1181 - 1189, The merge loop in merge() is unstable because on ties it takes from the right; update the selection logic so equal keys come from the left partition to preserve stability: using the existing symbols cmp, buf, a, i, j, k, change the condition to choose buf[i] when elements compare equal (i.e., when both cmp(buf[i], buf[j]) and cmp(buf[j], buf[i]) are false), otherwise use cmp(buf[i], buf[j]) as before — effectively prefer left when !cmp(buf[j], buf[i]) so equal keys keep original order.
♻️ Duplicate comments (2)
hyperloglog.H (2)
138-144: ⚠️ Potential issue | 🟡 Minor

Nice to have: make the `estimate()` thread-safety note fully explicit.

Line 143 omits `clear()` as a concurrent writer; readers are only safe when no concurrent `update()`, `merge()`, or `clear()` runs.

As per coding guidelines, "Document thread-safety guarantees in class/function comments" and "Document whether algorithms are thread-safe, reentrant, or require external synchronization."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hyperloglog.H` around lines 138 - 144, Update the estimate() comment to explicitly state its thread-safety guarantees include that concurrent reads are only safe when there are no concurrent writers calling update(), merge(), or clear(); specifically modify the comment block above the estimate() method to add "or clear()" to the list of prohibited concurrent writers (referencing estimate(), update(), merge(), clear()) so it clearly documents required external synchronization for readers.
119-125: ⚠️ Potential issue | 🟠 Major

Should fix: complete Doxygen contract tags for void public methods.

`update()`, `merge()`, and `clear()` still miss explicit `@return` tags, so the public API docs remain inconsistent with the repository’s strict Doxygen contract rule.

Proposed doc-only patch

```diff
@@
 /** @brief Add an element to the set.
  *
  * @param[in] val Element to incorporate into the cardinality sketch.
+ * @return void
  * @throws none
  * @note Complexity: O(1).
  * @note Not thread-safe: concurrent calls to update() must be externally synchronized.
  */
@@
 /** @brief Merge another HyperLogLog into this one (union of sets).
@@
  * @param[in] other HyperLogLog with the same precision parameter b.
  * @pre b_ == other.b_
+ * @return void
  * @throws std::domain_error if the precision parameters (b) differ.
  * @note Complexity: O(m) where m = 2^b.
  * @note Not thread-safe: concurrent calls must be externally synchronized.
  */
@@
 /** @brief Reset all registers to zero.
  *
+ * @return void
  * @throws none
  * @note Complexity: O(m) where m = 2^b.
  * @note Not thread-safe: concurrent calls must be externally synchronized.
  */
```

As per coding guidelines, "All public member functions must have `@brief`, `@param`, `@return`, and `@throws` Doxygen tags."

Also applies to: 182-206
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hyperloglog.H` around lines 119 - 125, Add explicit Doxygen `@return` tags for the public void methods update(), merge(), and clear() so the API docs comply with the repository's Doxygen contract; locate the declarations/definitions of HyperLogLog::update(), HyperLogLog::merge(const HyperLogLog&), and HyperLogLog::clear() and add a line like "@return void" (or equivalent phrasing used across the repo) to each comment block, preserving existing `@brief`, `@param`, and `@throws` tags and formatting conventions.
🧹 Nitpick comments (1)
tpl_sort_utils.H (1)
5571-5574: Nice to have: deduplicate repeated `apply_sort_with_temp` lambdas.

These wrappers repeat the same lambda shape for the bucket_sort/timsort container overloads. A small helper would reduce duplication and ease future maintenance.

As per coding guidelines: "Refactor duplicated code into reusable functions or templates".
Also applies to: 5631-5634, 6257-6260, 6315-6318, 6337-6340
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tpl_sort_utils.H` around lines 5571 - 5574, The repeated lambdas passed to bucket_sort_detail::apply_sort_with_temp (which simply forward T* p, size_t n, const Compare& c to bucket_sort(p,n,c) or tim_sort(p,n,c)) should be refactored into a small reusable helper — e.g., a template function or factory like make_forwarding_sort_invoker<SortFn>() that captures a sort function (bucket_sort or tim_sort) and returns the appropriate callable; replace the inline lambdas at the shown call site and the other locations (lines around 5631-5634, 6257-6260, 6315-6318, 6337-6340) to pass the helper result to bucket_sort_detail::apply_sort_with_temp instead of repeating the lambda. This keeps the call sites concise and centralizes the forwarding logic for apply_sort_with_temp.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tpl_sort_utils.H`:
- Around line 5432-5449: Add a `@pre` that states the required monotonic
relationship between bucket_key and cmp for global ordering: e.g.,
"Precondition: for all elements x,y, if cmp(x,y) is true then bucket_key(x) <=
bucket_key(y) (i.e., bucket_key is monotonic with respect to cmp)"; update the
bucket_sort documentation (the comment block for function bucket_sort) to
include this `@pre` and a matching `@post` that says "If the precondition holds, the
output array is globally ordered according to cmp after sorting"; reference the
parameter names bucket_key and cmp in the text so readers can locate the
contract.
---
Outside diff comments:
In `@tpl_sort_utils.H`:
- Around line 1181-1189: The merge loop in merge() is unstable because on ties
it takes from the right; update the selection logic so equal keys come from the
left partition to preserve stability: using the existing symbols cmp, buf, a, i,
j, k, change the condition to choose buf[i] when elements compare equal (i.e.,
when both cmp(buf[i], buf[j]) and cmp(buf[j], buf[i]) are false), otherwise use
cmp(buf[i], buf[j]) as before — effectively prefer left when !cmp(buf[j],
buf[i]) so equal keys keep original order.
---
Duplicate comments:
In `@hyperloglog.H`:
- Around line 138-144: Update the estimate() comment to explicitly state its
thread-safety guarantees include that concurrent reads are only safe when there
are no concurrent writers calling update(), merge(), or clear(); specifically
modify the comment block above the estimate() method to add "or clear()" to the
list of prohibited concurrent writers (referencing estimate(), update(),
merge(), clear()) so it clearly documents required external synchronization for
readers.
- Around line 119-125: Add explicit Doxygen `@return` tags for the public void
methods update(), merge(), and clear() so the API docs comply with the
repository's Doxygen contract; locate the declarations/definitions of
HyperLogLog::update(), HyperLogLog::merge(const HyperLogLog&), and
HyperLogLog::clear() and add a line like "@return void" (or equivalent phrasing
used across the repo) to each comment block, preserving existing `@brief`, `@param`,
and `@throws` tags and formatting conventions.
---
Nitpick comments:
In `@tpl_sort_utils.H`:
- Around line 5571-5574: The repeated lambdas passed to
bucket_sort_detail::apply_sort_with_temp (which simply forward T* p, size_t n,
const Compare& c to bucket_sort(p,n,c) or tim_sort(p,n,c)) should be refactored
into a small reusable helper — e.g., a template function or factory like
make_forwarding_sort_invoker<SortFn>() that captures a sort function
(bucket_sort or tim_sort) and returns the appropriate callable; replace the
inline lambdas at the shown call site and the other locations (lines around
5631-5634, 6257-6260, 6315-6318, 6337-6340) to pass the helper result to
bucket_sort_detail::apply_sort_with_temp instead of repeating the lambda. This
keeps the call sites concise and centralizes the forwarding logic for
apply_sort_with_temp.
Refactor bucket_sort and timsort to use sort_invoker_factory. Introduce new concepts for HyperLogLog and traversable containers. Adjust existing Timsort tests for the updated implementation.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tpl_sort_utils.H (1)
4972-4973: ⚠️ Potential issue | 🟠 Major

Must fix: replace `std::array` buckets with Aleph containers in radix-list implementations.

These new bucket buffers use `std::array`, which conflicts with repository policy for container selection in this codebase.

Proposed fix

```diff
- std::array<HTList, 256> buckets;
+ Array<HTList> buckets = Array<HTList>::create(256);
@@
- for (auto & bucket: buckets)
-   list.HTList::append(bucket);
+ for (size_t b = 0; b < 256; ++b)
+   list.HTList::append(buckets[b]);
```

```diff
- std::array<Dnode<IntT>, 256> buckets;
+ Array<Dnode<IntT>> buckets = Array<Dnode<IntT>>::create(256);
@@
- for (auto & bucket: buckets)
-   list.template Dnode<IntT>::append_list(&bucket);
+ for (size_t b = 0; b < 256; ++b)
+   list.template Dnode<IntT>::append_list(&buckets[b]);
```

As per coding guidelines: "MUST use Aleph containers when an equivalent exists. Standard library containers are ONLY permitted when no Aleph equivalent is available."
Also applies to: 5011-5011
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tpl_sort_utils.H` around lines 4972 - 4973, Replace the std::array bucket buffers in the radix-list code with the Aleph container equivalent: change the declaration std::array<HTList, 256> buckets (and the duplicate at the other occurrence) to use the Aleph array/fixed-array type used in this codebase (e.g., aleph::array<HTList, 256> buckets), and update any necessary includes or namespace qualifiers so the HTList buckets are constructed and used exactly as before but using the Aleph container instead of std::array.
ah-dry.H (1)
48-50: ⚠️ Potential issue | 🟠 Major

Add explicit `<concepts>` include for `std::convertible_to`.

The concepts at lines 62 and 74 use `std::convertible_to`, which is defined in `<concepts>`. While this header is currently included transitively via `ahFunctional.H`, explicit inclusion is required to avoid brittle build-order dependencies.

Patch

```diff
 # include "ahFunctional.H"
+# include <concepts>
 # include <utility>
 # include <vector>
 # include <ah-errors.H>
```

Also applies to: 54-63, 74-75
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ah-dry.H` around lines 48 - 50, Add an explicit include for the C++ concepts header so uses of std::convertible_to in ah-dry.H are guaranteed to compile; update the include block that currently has <utility>, <vector>, and <ah-errors.H> to also include <concepts> so the concept expressions (those using std::convertible_to at the concepts around the declarations at lines ~62 and ~74) resolve without relying on transitive includes from ahFunctional.H.
♻️ Duplicate comments (1)
tpl_sort_utils.H (1)
6264-6269: ⚠️ Potential issue | 🟡 Minor

Should fix: null-pointer contract is too strict for empty ranges in `timsort(T*, long, long, ...)`.

This still throws on `a == nullptr` even when `l > r` (empty range). That contradicts the current early-return semantics and the “non-empty range” intent.

Proposed fix

```diff
- ah_invalid_argument_if(a == nullptr)
+ ah_invalid_argument_if(a == nullptr and l <= r)
     << "timsort(): null pointer contract violation for range ["
     << l << ", " << r << "]; cannot call timsort_detail::timsort_impl()";
```

As per coding guidelines: "Validate external inputs before processing" and "Check boundary conditions and off-by-one errors".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tpl_sort_utils.H` around lines 6264 - 6269, The null-pointer check using ah_invalid_argument_if(a == nullptr) is too strict for empty ranges in timsort(T* a, long l, long r, ...); change the logic so empty ranges are allowed: either move the early-return (if (l >= r) return;) before the null-pointer assertion, or tighten the assertion to only trigger for non-empty ranges (e.g., assert a == nullptr only when l < r); update the check near timsort and timsort_detail::timsort_impl to reference parameters a, l, r accordingly.
🧹 Nitpick comments (3)
hyperloglog.H (1)
181-181: Nice to have: replace hard-coded `2^64` literal with computed constant.

Using `std::ldexp(1.0, 64)` avoids a long magic literal and is easier to audit. (Caveat: `std::ldexp` is `constexpr` only since C++23; under C++20 the hex-float literal `0x1p64` is a constexpr-friendly alternative.)

Optional refactor

```diff
- static constexpr double two_to_64 = 18446744073709551616.0;
+ static constexpr double two_to_64 = std::ldexp(1.0, 64);
```

As per coding guidelines: "Avoid magic numbers; use named constants or enums".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hyperloglog.H` at line 181, The constant two_to_64 is defined with a hard-coded 2^64 literal; replace its initializer with a computed value using std::ldexp(1.0, 64) in the static constexpr double two_to_64 declaration, and add the needed include for <cmath> (or <math.h>) so ldexp is available; keep the name and constexpr qualifier unchanged to preserve semantics in any code using two_to_64.
Tests/sort_utils.cc (1)
1918-1964: Nice to have: add one seeded property-style randomized loop for `random_select`.

The crafted cases are good. Adding a deterministic randomized cross-check (compare `random_select` against a sorted-reference rank over many generated arrays) would improve regression sensitivity for partition edge interactions.

Based on learnings: “Use property-based testing or randomized tests for complex algorithms”.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Tests/sort_utils.cc` around lines 1918 - 1964, Add a deterministic property-style randomized test that seeds the RNG and repeatedly generates random arrays and indices, then validates Aleph::random_select against a sorted-reference rank; implement a new TEST in the SortUtilsRandomSelect suite (e.g., RandomizedSeededProperty) that creates a std::mt19937 with a fixed seed, for many iterations builds random DynArray<int> values, computes expected result by sorting or using std::nth_element on a copy, calls Aleph::random_select(&copy(0), static_cast<long>(k), static_cast<long>(copy.size()), std::less<int>()), and uses EXPECT_EQ to compare to the reference value to catch partition-edge regressions.
ah-dry.H (1)
1658-1659: Constrain `EqualSequenceMethod` with concepts for clearer diagnostics.

Given this file already introduces `AlephTraversable`, adding constraints here would make contract violations fail earlier and more clearly.

Refactor sketch

```diff
-template <class Container>
+template <class Container>
+  requires AlephTraversable<Container> and
+           requires (const Container & c)
+           {
+             { c.size() } -> std::convertible_to<size_t>;
+             c.get_it();
+           }
 class EqualSequenceMethod
```

As per coding guidelines: “Use concepts to constrain template parameters”.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ah-dry.H` around lines 1658 - 1659, The template class EqualSequenceMethod should be constrained with the AlephTraversable concept so misuse yields clearer diagnostics; change its declaration to require AlephTraversable (e.g. use either “template <class Container> requires AlephTraversable<Container> class EqualSequenceMethod” or the abbreviated form “template <AlephTraversable Container> class EqualSequenceMethod”) and update any forward declarations/usages accordingly, ensuring the AlephTraversable concept is visible (include or import the header/namespace where AlephTraversable is defined).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@ah-dry.H`:
- Around line 406-415: Update the Doxygen for the rvalue overload of find_ptr()
(and the other two similar overloads referenced) to not claim "@throws none";
instead state that the function may propagate exceptions thrown by the provided
Operation and provide the same exception guarantees as the corresponding
lvalue-reference overload (e.g., wording like "May throw the same exceptions as
the corresponding lvalue overload; no additional exception-safety guarantees are
provided"). Locate the rvalue overload docs for find_ptr() (the comment block
describing the rvalue wrapper) and the two other comment blocks at the indicated
spots and replace "@throws none" with this accurate exception-safety wording.
- Around line 132-136: The const overload of traverse on GenericTraverse
currently breaks const-correctness by const_casting and calling the mutable
traverse; instead remove the const_cast and provide a const-safe overload that
constrains the Operation to accept a const reference to the element type (e.g.
require TraverseOperationConst<Operation, typename Container::Item_Type const&>
or a similar trait), call a new const-qualified helper that passes const
Item_Type& to the callback, and ensure noexcept uses
traverse_is_noexcept<Operation>() appropriately; do not forward to the non-const
traverse and document the behaviour if you prefer the alternative of explicitly
allowing mutation.
In `@hyperloglog.H`:
- Around line 44-48: The code uses std::convertible_to in the MurmurHashable
concept's requires-clause but does not include <concepts>, relying on transitive
includes; add an explicit `#include` <concepts> near the other standard includes
(alongside <cmath>, <algorithm>, <bit>, <cstdint>) so the MurmurHashable concept
(and any other use of std::convertible_to) is well-defined and portable across
toolchains.
---
Outside diff comments:
In `@ah-dry.H`:
- Around line 48-50: Add an explicit include for the C++ concepts header so uses
of std::convertible_to in ah-dry.H are guaranteed to compile; update the include
block that currently has <utility>, <vector>, and <ah-errors.H> to also include
<concepts> so the concept expressions (those using std::convertible_to at the
concepts around the declarations at lines ~62 and ~74) resolve without relying
on transitive includes from ahFunctional.H.
In `@tpl_sort_utils.H`:
- Around line 4972-4973: Replace the std::array bucket buffers in the radix-list
code with the Aleph container equivalent: change the declaration
std::array<HTList, 256> buckets (and the duplicate at the other occurrence) to
use the Aleph array/fixed-array type used in this codebase (e.g.,
aleph::array<HTList, 256> buckets), and update any necessary includes or
namespace qualifiers so the HTList buckets are constructed and used exactly as
before but using the Aleph container instead of std::array.
---
Duplicate comments:
In `@tpl_sort_utils.H`:
- Around line 6264-6269: The null-pointer check using ah_invalid_argument_if(a
== nullptr) is too strict for empty ranges in timsort(T* a, long l, long r,
...); change the logic so empty ranges are allowed: either move the early-return
(if (l >= r) return;) before the null-pointer assertion, or tighten the
assertion to only trigger for non-empty ranges (e.g., assert a == nullptr only
when l < r); update the check near timsort and timsort_detail::timsort_impl to
reference parameters a, l, r accordingly.
---
Nitpick comments:
In `@ah-dry.H`:
- Around line 1658-1659: The template class EqualSequenceMethod should be
constrained with the AlephTraversable concept so misuse yields clearer
diagnostics; change its declaration to require AlephTraversable (e.g. use either
“template <class Container> requires AlephTraversable<Container> class
EqualSequenceMethod” or the abbreviated form “template <AlephTraversable
Container> class EqualSequenceMethod”) and update any forward
declarations/usages accordingly, ensuring the AlephTraversable concept is
visible (include or import the header/namespace where AlephTraversable is
defined).
In `@hyperloglog.H`:
- Line 181: The constant two_to_64 is defined with a hard-coded 2^64 literal;
replace its initializer with a computed value using std::ldexp(1.0, 64) in the
static constexpr double two_to_64 declaration, and add the needed include for
<cmath> (or <math.h>) so ldexp is available; keep the name and constexpr
qualifier unchanged to preserve semantics in any code using two_to_64.
In `@Tests/sort_utils.cc`:
- Around line 1918-1964: Add a deterministic property-style randomized test that
seeds the RNG and repeatedly generates random arrays and indices, then validates
Aleph::random_select against a sorted-reference rank; implement a new TEST in
the SortUtilsRandomSelect suite (e.g., RandomizedSeededProperty) that creates a
std::mt19937 with a fixed seed, for many iterations builds random DynArray<int>
values, computes expected result by sorting or using std::nth_element on a copy,
calls Aleph::random_select(&copy(0), static_cast<long>(k),
static_cast<long>(copy.size()), std::less<int>()), and uses EXPECT_EQ to compare
to the reference value to catch partition-edge regressions.
Refactor `SimHash` to use two 32-bit hashes for 64-bit fingerprint on 32-bit platforms. Update `HyperLogLog` to include <concepts> header. Refactor `radix_sort_impl` in `tpl_sort_utils.H` to use `Array` instead of `std::array` for buckets to avoid default constructible requirement. Add `TEST(SortUtilsRandomSelect, RandomizedSeededProperty)` to `sort_utils.cc`.
Refactor HyperLogLog methods for clarity. Standardize GenericItems aliases and equality checks.
Refactor HyperLogLog to remove MurmurHashable concept and directly use hash64. Update simhash.H to use dereferenced iterator for update calls.
Summary by CodeRabbit
New Features
Refactor
Tests
Documentation