
Add combinatorics, number theory, and streaming algorithm utilities #40

Merged

lrleon merged 49 commits into master from utils on Mar 3, 2026

Conversation

@lrleon
Owner

@lrleon lrleon commented Feb 27, 2026

Summary by CodeRabbit

  • New Features

    • Many new examples and demos plus new public algorithm libraries: streaming estimators (reservoir sampling, Count‑Min Sketch, HyperLogLog, MinHash, SimHash), number‑theory & math (NTT, Pollard‑Rho, Miller‑Rabin, modular arithmetic, modular combinatorics, modular linear algebra), expanded sorting (Bucket Sort, TimSort), and combinatorics utilities (permutations, combinations, Gray code).
  • Refactor

    • Container equality now uses sequence semantics; several iterator types made explicitly copyable.
  • Tests

    • Large expansion of unit tests across streaming, number theory, combinatorics, sorting, and utilities.
  • Documentation

    • README updated in English and Spanish with new sections and usage examples.

Copilot AI review requested due to automatic review settings February 27, 2026 17:51
@coderabbitai
Contributor

coderabbitai bot commented Feb 27, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds many new header-only algorithms (combinatorics, Gray code, sorting), number‑theory modules (modular arithmetic, primality, Pollard‑Rho, NTT, modular combinatorics/linalg), streaming sketches and demo, multiple new examples, and extensive tests and README updates. Several new public classes/functions are introduced.

Changes

  • Build files (Examples/CMakeLists.txt, Tests/CMakeLists.txt): Register multiple new example executables and two new test targets.
  • Combinatorics & Gray code (ah-comb.H, Examples/combinatorics_enumeration_example.cc, Examples/gray_code_example.cc, Tests/ah-comb.cc): Add lexicographic permutation/combination APIs, masks/counts, Gray-code conversion/generator, examples, and comprehensive tests — check reset semantics, validation, and overflow in combination_count.
  • Number theory & NTT (modular_arithmetic.H, primality.H, pollard_rho.H, ntt.H, modular_combinatorics.H, modular_linalg.H, Examples/math_nt_example.cc, Tests/math_nt_test.cc): Introduce safe modular ops, deterministic Miller‑Rabin, Pollard‑Rho factorization, NTT convolution, and modular combinatorics and linear algebra, plus an example and tests — review overflow, CRT/inverse logic, NTT size constraints, and primality bases.
  • Streaming sketches & demo (reservoir-sampling.H, count-min-sketch.H, hyperloglog.H, minhash.H, simhash.H, Examples/streaming_demo.cc, Tests/streaming_algorithms_test.cc): Add Reservoir Sampler, Count‑Min Sketch, HyperLogLog, MinHash, SimHash, a demo program, and tests — verify hashing/seed choices, merge semantics, error-bound computations, and RNG determinism.
  • Sorting: BucketSort & TimSort (tpl_sort_utils.H, Tests/sort_utils.cc): Large addition/refactor adding bucket_sort and timsort implementations with many overloads and adapters; extensive tests added — inspect stability, galloping/merge invariants, and exception/noexcept behavior.
  • Container equality & mixins (ah-dry.H, tpl_array.H, tpl_dynArray.H, Tests/ah-dry-mixin_test.cc, Tests/ah-dry.cc): Add EqualSequenceMethod mixin and switch Array/DynArray to inherit it; add tests validating sequence equality semantics — check CRTP/inheritance impacts.
  • Utilities & examples (Examples/print_rule.H, Examples/random_select_example.cc, Examples/streaming_demo.cc, Examples/math_nt_example.cc, Examples/gray_code_example.cc, Examples/combinatorics_enumeration_example.cc): Add a small helper header and multiple example programs demonstrating the new APIs; verify includes and that example target names match sources.
  • Documentation (README.md, README.es.md): Add new sections and example references for combinatorics, number theory, Gray code, and streaming algorithms — confirm referenced filenames/anchors exist.
  • Tests: new & expanded suites (Tests/streaming_algorithms_test.cc, Tests/math_nt_test.cc, Tests/ah-comb.cc, Tests/sort_utils.cc, Tests/*): Many new and extended tests across streaming, number theory, combinatorics, and sorting — watch for determinism, flakiness, and increased runtime.
  • Hash/container small API tweaks (tpl_hash.H, htlist.H, tpl_dynDlist.H, Tests/geom_algorithms_test_common.h): Add DynHash aliases, explicit iterator copy ctors, and minor const/formatting fixes — review ABI/CRTP effects and iterator copyability.
  • Misc refactors & formatting (Knapsack.H, Subset_Sum.H, Blossom.H, Tree_Decomposition.H, tpl_array.H): Minor refactors and formatting tweaks; ensure semantics unchanged and compatibility preserved.

Sequence Diagram(s)

sequenceDiagram
  participant Stream as Stream
  participant Demo as StreamingDemo
  participant RS as Reservoir_Sampler
  participant CMS as Count_Min_Sketch
  participant HLL as HyperLogLog
  participant MH as MinHash
  participant SH as SimHash

  rect rgba(200,200,255,0.5)
  Stream->>Demo: provide token stream
  end

  rect rgba(200,255,200,0.5)
  Demo->>RS: update(items)
  RS-->>Demo: get_sample()
  end

  rect rgba(255,230,200,0.5)
  Demo->>CMS: update(items)
  CMS-->>Demo: estimate(freq)
  end

  rect rgba(255,200,255,0.5)
  Demo->>HLL: update(items)
  HLL-->>Demo: estimate(cardinality)
  end

  rect rgba(200,255,255,0.5)
  Demo->>MH: update(sets)
  MH-->>Demo: similarity_estimate()
  Demo->>SH: update(docs)
  SH-->>Demo: fingerprint_similarity()
  end

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

🚥 Pre-merge checks: ✅ 2 passed, ❌ 1 failed

❌ Failed checks (1 warning)

  • Header Docstring Coverage ⚠️ Warning: Overall documentation coverage is 23.6% (34 of 144 public declarations documented), significantly below the required 80% threshold. Resolution: add Doxygen-style documentation blocks for all public declarations in failing headers, prioritizing those with 0–25% coverage such as modular_arithmetic.H, modular_combinatorics.H, ntt.H, primality.H, and tpl_hash.H.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed: Check skipped because CodeRabbit’s high-level summary is enabled.
  • Title check ✅ Passed: The title accurately summarizes the main changes: it adds combinatorics, number theory, and streaming algorithm utilities to the library.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



Contributor

Copilot AI left a comment


Pull request overview

This PR introduces significant enhancements to the Aleph-w library with advanced algorithms and utilities. The implementation adds two new major sorting algorithms (Bucket Sort and Timsort), improves the random selection algorithm with 3-way partitioning, and provides a comprehensive suite of number theory and streaming algorithms.

Changes:

  • Implemented Bucket Sort and Timsort with support for multiple container types
  • Enhanced random_select with iterative 3-way partitioning for better duplicate handling
  • Added probabilistic streaming algorithms (HyperLogLog, MinHash, SimHash, Count-Min Sketch, Reservoir Sampling)
  • Introduced number theory toolkit (modular arithmetic, primality testing, NTT, Pollard's Rho factorization)
  • Extended combinatorics utilities with lexicographic enumeration and Gray code functions

Reviewed changes

Copilot reviewed 30 out of 30 changed files in this pull request and generated no comments.

Show a summary per file
  • tpl_sort_utils.H: Core sorting utilities: Bucket Sort and Timsort implementations, plus improved random_select with 3-way partitioning
  • tpl_dynArray.H: Updated inheritance from EqualToMethod to EqualSequenceMethod
  • tpl_array.H: Updated inheritance from EqualToMethod to EqualSequenceMethod
  • simhash.H: New SimHash class for cosine similarity estimation via bitwise fingerprints
  • reservoir-sampling.H: New Reservoir_Sampler for random sampling from streams
  • primality.H: Miller-Rabin deterministic primality testing for 64-bit integers
  • pollard_rho.H: Pollard's rho algorithm for integer factorization
  • ntt.H: Number Theoretic Transform for fast polynomial multiplication
  • modular_linalg.H: Modular matrix operations (determinant, inverse) over finite fields
  • modular_combinatorics.H: Modular binomial coefficients with precomputed factorials and Lucas' theorem
  • modular_arithmetic.H: Safe 64-bit modular arithmetic primitives (multiplication, exponentiation, CRT)
  • minhash.H: MinHash for Jaccard similarity estimation between sets
  • hyperloglog.H: HyperLogLog cardinality estimator for counting distinct elements
  • count-min-sketch.H: Count-Min Sketch for frequency estimation in streams
  • ah-dry.H: New EqualSequenceMethod class for sequence container equality testing
  • ah-comb.H: Extended combinatorics: lexicographic permutation/combination enumeration, Gray codes
  • Tests/streaming_algorithms_test.cc: Comprehensive tests for all streaming algorithms
  • Tests/sort_utils.cc: Tests for random_select improvements, Bucket Sort, and Timsort
  • Tests/math_nt_test.cc: Tests for number theory algorithms
  • Tests/ah-comb.cc: Tests for new combinatorics utilities
  • Tests/CMakeLists.txt: Added new test programs
  • README.md: Documentation for new algorithms and features
  • README.es.md: Spanish documentation for combinatorics enumeration
  • Examples/*.cc: Example programs demonstrating new features
  • Examples/CMakeLists.txt: Added new example programs
Comments suppressed due to low confidence (4)

tpl_sort_utils.H:2378

  • The function parameter names l and r are ambiguous single-letter names. Consider renaming them to left and right for clarity.
  const T &__random_select(T *a, const long i, long l, long r,

ah-comb.H:54

  • The documentation clarifies the dual meaning of 'permutation' but could be clearer by explicitly stating this distinction earlier in the file header to avoid confusion for users unfamiliar with the codebase.
    @note `traverse_perm`/`for_each_perm`/`build_perms` use "permutation" in the
          cartesian-product sense. `next_permutation` is the classical
          lexicographic permutation of a single sequence.

tpl_sort_utils.H:4908

  • Inconsistent spacing in range-based for loop. The colon should have spaces on both sides (bucket : buckets) to match the project's style elsewhere in the file.
          for (auto & bucket: buckets)

tpl_sort_utils.H:4939

  • Inconsistent spacing in range-based for loop. The colon should have spaces on both sides (bucket : buckets) to match the project's style elsewhere in the file.
          for (auto & bucket: buckets)


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 049e044d46

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 19

🧹 Nitpick comments (9)
simhash.H (2)

108-116: Inconsistent array access: prefer unchecked operator() for internal loops.

Line 112 uses bounds-checked v_[i] while update() uses unchecked v_(i). Since the loop bounds are known to be valid (0 to 63), use v_(i) consistently for better performance.

Proposed fix
     [[nodiscard]] uint64_t get_fingerprint() const noexcept
     {
       uint64_t fingerprint = 0;
       for (int i = 0; i < 64; ++i)
-        if (v_[i] > 0)
+        if (v_(i) > 0)
           fingerprint |= (static_cast<uint64_t>(1) << i);
       
       return fingerprint;
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@simhash.H` around lines 108 - 116, The get_fingerprint() method uses
bounds-checked array access v_[i] inside the loop while update() uses the
unchecked accessor v_(i); replace v_[i] with v_(i) in get_fingerprint() so the
internal loop consistently uses the unchecked operator() for indices 0..63
(keeping the existing loop bounds and noexcept signature) to avoid the extra
bounds checks and match update().

72-75: Consider using a named constant for the fingerprint size.

The value 64 appears in multiple places (constructor, update, get_fingerprint, clear). A static constexpr member would improve maintainability.

Proposed improvement
   template <typename T>
   class SimHash
   {
+    static constexpr size_t kBits = 64;
     Array<double> v_; /**< Accumulator vector for each bit. */

   public:
     /** @brief Construct SimHash accumulator. */
     SimHash()
     {
-      v_ = Array<double>(64, 0.0);
+      v_ = Array<double>(kBits, 0.0);
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@simhash.H` around lines 72 - 75, Introduce a single named constant for the
fingerprint size (e.g., static constexpr size_t FINGERPRINT_SIZE = 64) and
replace all magic literals `64` in the SimHash class with that constant; update
the SimHash() constructor (v_ = Array<double>(FINGERPRINT_SIZE, 0.0)), and any
other methods referenced in the diff such as update, get_fingerprint, and clear
to use SimHash::FINGERPRINT_SIZE instead of the literal so the size is
maintained in one place.
Tests/sort_utils.cc (1)

2103-2120: Add regression cases for zero buckets and descending comparator in BucketSort tests.

Current coverage is strong, but these two scenarios are still missing and would lock down the contract for the new overloads.

💡 Suggested additions
+TEST(BucketSort, custom_bucket_key_zero_buckets_throws)
+{
+  int a[] = {3, 1, 2};
+  auto key = [](const int & v) -> size_t { return static_cast<size_t>(v); };
+  EXPECT_THROW((bucket_sort(a, size_t{3}, size_t{0}, key)), std::domain_error);
+}
+
+TEST(BucketSort, float_descending_comparator)
+{
+  float a[] = {0.1f, 2.0f, -1.5f, 0.3f, 2.0f};
+  bucket_sort(a, size_t{5}, Aleph::greater<float>());
+  EXPECT_TRUE(std::is_sorted(std::begin(a), std::end(a), std::greater<float>()));
+}

As per coding guidelines: "Tests must cover edge cases: empty inputs, single element, large inputs".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Tests/sort_utils.cc` around lines 2103 - 2120, Add two regression tests for
BucketSort: one TEST named BucketSort.descending_comparator that calls
bucket_sort with a descending comparator (e.g., std::greater<...>) on a small
descending input and asserts the result is sorted ascending according to that
comparator; and one TEST named BucketSort.zero_buckets_contract that exercises
the new overload taking an explicit buckets count—assert that
bucket_sort(nullptr, 0) does not throw (already present) and that calling
bucket_sort on a non-null buffer with buckets == 0 throws std::invalid_argument
to lock down the zero-buckets contract. Reference the bucket_sort symbol and the
TEST macro names when adding these cases.
Examples/random_select_example.cc (1)

50-53: Nice to have: deduplicate print_rule() across examples.

This helper is repeated in multiple example files; pulling it from print_rule.H keeps examples DRY.

♻️ Proposed refactor
 #include <iostream>
 #include <iomanip>
 
 #include <tpl_sort_utils.H>
 #include <tpl_dynArray.H>
+#include <print_rule.H>
@@
 namespace
 {
-  void print_rule()
-  {
-    std::cout << "--------------------------------------------------------\n";
-  }
-
   template <typename Container>
   void print_array(const char * label, const Container & a)
   {

As per coding guidelines "Refactor duplicated code into reusable functions or templates."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Examples/random_select_example.cc` around lines 50 - 53, The helper function
print_rule() is duplicated across examples; remove the local definition in
Examples/random_select_example.cc and instead include the shared declaration by
adding `#include` "print_rule.H" (or the header that defines/declares
print_rule()) and rely on the single shared implementation; ensure the example
no longer defines print_rule() and that the build links against the shared
implementation so Examples::random_select_example uses the centralized
print_rule() function.
Tests/streaming_algorithms_test.cc (1)

50-160: Should fix: add explicit edge-case tests for stream APIs.

Please add focused cases for empty/small inputs (e.g., k=0 reservoir, tiny HLL cardinalities, empty MinHash/SimHash updates) to lock in contracts on boundaries.

As per coding guidelines "Tests must cover edge cases: empty inputs, single element, large inputs."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Tests/streaming_algorithms_test.cc` around lines 50 - 160, Add explicit
edge-case unit tests for the streaming APIs: for reservoir_sample add tests for
k=0 with non-empty input and empty input iterators (verify sample.size()==0 and
no crash), for Count_Min_Sketch add updates/estimates on empty sketch and
single-element sketches (using Count_Min_Sketch::from_error_bounds or the
constructor used in CMSSketchMerge) and assert estimates are 0 for unseen keys
and correct for single updates, for HyperLogLog construct tiny/register-low HLLs
(e.g., small p) and verify estimate for empty set is 0 and for single-element
updates is ~1, and for MinHash and SimHash add tests that call update zero times
(empty) and once (single element) and ensure similarity/estimate/fingerprint
calls do not crash and return sensible deterministic values (e.g., similarity
between identical single-element sketches is ~1 and empty comparisons behave
consistently). Ensure tests reference reservoir_sample, Count_Min_Sketch,
HyperLogLog, MinHash, and SimHash so the behavior at boundaries is explicitly
asserted.
Examples/streaming_demo.cc (1)

84-87: Nice to have: compute “Actual” frequencies dynamically.

The hardcoded “Actual” numbers are easy to desync if the demo data changes; deriving them from stream will keep the example self-consistent.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Examples/streaming_demo.cc` around lines 84 - 87, Replace the hardcoded
"Actual" counts with values computed from the demo input `stream`: build a
frequency map by iterating over the existing `stream` container (e.g., an
unordered_map<string,int> or map) and count occurrences, then use that map when
printing the estimated frequencies (refer to `stream` and `cms.estimate`) so the
printed "(Actual: ...)" values come from the computed map (e.g., map["apple"],
map["banana"], map["word_0"]) instead of fixed literals.
Tests/math_nt_test.cc (2)

85-111: Should fix: add seeded property-style checks for primality/factorization.

These tests are solid examples, but miller_rabin/pollard_rho would be safer with deterministic randomized cases (fixed seed) that assert invariants like “product of factors == n” and “each factor is prime.”

Based on learnings: "Use property-based testing or randomized tests for complex algorithms" and "Tests must verify algorithmic correctness, not just absence of crashes".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Tests/math_nt_test.cc` around lines 85 - 111, Add deterministic randomized
property-style checks to the MillerRabin and PollardRho tests: in TEST(MathNT,
MillerRabin) and TEST(MathNT, PollardRho) initialize a fixed-seed RNG, generate
a set of random n values (including composites and primes), and for each assert
miller_rabin(n) matches a known-primality oracle (or compare with a slow
deterministic test for small n); for pollard_rho, for each random composite n
call pollard_rho(n) and assert the product of returned factors equals n and that
every returned factor passes miller_rabin (is prime); ensure tests are
deterministic by using the fixed seed and normalize factor ordering (e.g., sort)
before comparisons.

98-123: Should fix: add a bounded performance regression check for critical paths.

Please add at least one stable regression test case (large, deterministic input set) for hot algorithms like factorization or NTT to catch accidental complexity regressions.

Based on learnings: "Include performance regression tests for critical algorithms".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Tests/math_nt_test.cc` around lines 98 - 123, Add a bounded performance
regression test in Tests/math_nt_test.cc that times a large deterministic input
for the hot algorithms (use pollard_rho for a large composite like product of
two known 32/64-bit primes and NTT::multiply for large power-of-two vectors),
measuring elapsed time with a steady clock and asserting it stays below a fixed
threshold (e.g., EXPECT_LE(duration_ms, <threshold_ms>)) to catch complexity
regressions; ensure the inputs are deterministic (hard-coded primes/arrays) and
reference the pollard_rho and NTT<>::multiply calls so the test fails if
factorization or NTT slows unexpectedly.
modular_arithmetic.H (1)

64-84: Consider documenting or validating that m > 0.

When m == 0, the 128-bit path causes division by zero (undefined behavior), and the fallback path would have issues with the modulus operations. While callers typically ensure m > 0, a @pre tag or runtime check would make the contract explicit.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modular_arithmetic.H` around lines 64 - 84, The mod_mul function lacks a
precondition for m > 0; add an explicit check and/or documented `@pre` to avoid
division-by-zero in the __uint128_t path and incorrect behavior in the fallback
loop. Update the mod_mul declaration to document the contract (e.g., "@pre m >
0") and add a runtime validation at the start of mod_mul (assert or throw/abort)
that fails fast when m == 0 so callers get a clear error instead of undefined
behavior; reference the mod_mul(uint64_t a, uint64_t b, uint64_t m) function
when making these changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ah-dry.H`:
- Around line 1571-1592: The method equal_to is marked noexcept but calls
element equality which may throw; change its exception specification to either
remove noexcept or make it conditional using the element-type equality trait:
replace the trailing noexcept with noexcept(noexcept(std::declval<const typename
Container::Item_Type&>() == std::declval<const typename
Container::Item_Type&>())) so the function is noexcept only when the item
operator== is noexcept; ensure you include <utility> and <type_traits> if needed
and keep the rest of equal_to (const_me, get_it, iterator loop) unchanged.

In `@count-min-sketch.H`:
- Around line 143-152: The merge implementation in Count_Min_Sketch currently
only checks w_ and d_ but must also verify seeds_ match; update
Count_Min_Sketch::merge to compare this->seeds_ with other.seeds_ (element-wise
or via equality operator for the container) and raise the ah_domain_error_if
when they differ, before modifying table_ and total_count_, so incompatible
sketches cannot be merged.

In `@Examples/streaming_demo.cc`:
- Around line 55-63: Replace all uses of std::vector<std::string> for the
variables stream, set1, and set2 with the Aleph container DynList<std::string>;
update any push_back/iteration usage to the DynList API accordingly (e.g., use
DynList's append/push semantics and iterators) and apply the same replacement at
the other occurrences mentioned (around the later 108-109 usage). Locate
declarations named stream, set1, set2 and change their types to
DynList<std::string>, then adjust the loop that extends stream to use DynList's
add/append method and any subsequent set operations to the Aleph equivalents.

In `@hyperloglog.H`:
- Around line 82-86: The constructor currently computes m_ with an unchecked
shift (m_(static_cast<size_t>(1) << b)) before the range check, which can UB for
large b; fix by validating b before performing the shift: add a static helper
function (e.g., compute_m_or_throw(uint8_t b)) that uses the existing
ah_domain_error_if check to validate b in range [4,16] and then returns
(static_cast<size_t>(1) << b), and call that helper in the initializer list as
m_(compute_m_or_throw(b)); this keeps b_ initialization and the error behavior
but ensures the shift only happens after validation in the helper.

In `@modular_arithmetic.H`:
- Around line 95-107: The function mod_exp currently returns 1 when m == 1
because res is initialized to 1 and never reduced; update mod_exp(uint64_t base,
uint64_t exp, const uint64_t m) to handle this edge by returning 0 immediately
when m == 1 (or alternatively document that m must be >1), i.e., add an early
check for m == 1 before using base %= m and the main loop so all outputs become
base^exp mod 1 == 0.
- Around line 170-181: The loop computing prod can overflow and mod_inv is
called with uint64_t values that may exceed int64_t; fix by computing the full
product in a wider type (e.g., use a __int128 accumulator or check for overflow
on prod *= m and bail/return an error if the product would exceed UINT64_MAX),
then cast or reduce p before calling mod_inv to match its signed parameter
(e.g., compute p = prod / mod[i] in the wider type, reduce p_mod = (uint64_t)(p
% mod[i]) and pass a safe cast to mod_inv, or better, change mod_inv's signature
to accept uint64_t consistently), and use the wider accumulator when calling
mod_mul/mod_inv to avoid intermediate overflows (update references to prod, p,
mod_inv, mod_mul, rem, result accordingly).

In `@modular_combinatorics.H`:
- Around line 65-81: The constructor ModularCombinatorics(size_t max_n, uint64_t
p) must validate the modulus before building fact_ and invFact_; add an early
check that p > 1 (and optionally that max_n > 0) and fail fast (throw
std::invalid_argument or use assert) if not, so methods that call
mod_mul/mod_inv never run with an invalid modulus; perform this validation at
the start of the constructor (before any use of mod_, fact_, invFact_) and
document the precondition in the constructor comment or exception message.
- Around line 89-116: The nCk and lucas_nCk functions can index fact_ and
invFact_ out of range; add explicit bounds checks against the precomputed table
size (e.g., fact_.size() or a stored max_n) inside nCk and at the point
lucas_nCk extracts ni/ki, and when the requested n (or ni) exceeds that max
either expand/ensure the factorial tables (provide/ call a
ensure_capacity(max_needed) helper) or fall back to a safe multiplicative
modular-combination routine that does not index fact_/invFact_ (iterative
numerator/denominator with modular inverses). In practice: inside nCk(uint64_t
n, uint64_t k) validate n <= max_precomputed and either expand tables or compute
via fallback; inside lucas_nCk(uint64_t n, uint64_t k) validate ni <
max_precomputed before calling nCk(ni,ki) (and similarly ensure recursive n/mod_
steps are guarded). Ensure you reference and use fact_, invFact_, nCk and
lucas_nCk symbols when implementing these checks and fallbacks.

In `@modular_linalg.H`:
- Around line 69-74: The code handling Array<Array<uint64_t>> in
modular_linalg.H doesn't validate that all inner arrays have the same length
before performing in-place modulus and later elimination/inversion; add an
explicit ragged-matrix check at the start of the Array<Array<uint64_t>> branch
(where MatrixT == Array<Array<uint64_t>> and mat is iterated) that verifies
every mat[i].size() equals mat[0].size(), and if not return/throw a clear error
(or set an error flag) to reject ragged input before any mat(i)(j) %= mod or any
elimination/inversion routines run.
- Around line 65-83: The constructor Modular_Matrix(const MatrixT & m, const
uint64_t p) must validate the modulus before doing any normalization: at the
start of the constructor check that p is a valid modulus (e.g. p > 1, or at
least p != 0 depending on project constraints) and throw a descriptive exception
(std::invalid_argument) if invalid; only after this validation assign mod and
proceed to normalize mat in both the Array<Array<uint64_t>> branch and the
DynMatrix branch that uses mat.read_ne and mat.write. Ensure the check occurs
before any use of mod and reference the constructor name Modular_Matrix, member
variables mat and mod, and the normalization code paths (Array<Array<uint64_t>>
loop and the DynMatrix read_ne/write calls).
- Line 160: The current calls to mod_inv (e.g., where const uint64_t inv =
mod_inv(static_cast<int64_t>(diag), static_cast<int64_t>(mod)); in
modular_linalg.H and other callers in modular_arithmetic.H) are forcing
narrowing casts from uint64_t to int64_t; fix by updating the mod_inv API to
accept unsigned 64-bit values (either change signature to inline int64_t
mod_inv(const uint64_t a, const uint64_t m) or add an overload inline int64_t
mod_inv(const uint64_t a, const uint64_t m) that internally handles unsigned
arithmetic) and then remove the static_casts at the call sites (e.g., references
to diag and mod) so no narrowing conversion occurs; ensure the implementation
correctly handles unsigned inputs and preserves return semantics.

In `@ntt.H`:
- Around line 82-90: Validate input size at the start of
transform(Array<uint64_t>& a, bool invert): check that n = a.size() is > 0 and
is a power of two (use bit test or loop) and that (MOD - 1) % n == 0 so the root
exponent (MOD-1)/n is integral; if either check fails, do not proceed to
bit_reverse or mod_exp(ROOT, (MOD-1)/len, MOD). Replace any raw throw usage with
the Aleph exception macros from ah-errors.H (use the project-standard macro to
report precondition failures with a clear message mentioning transform, n and
MOD). Apply the same validation and macro-based error reporting to the other
transform-related block around lines 124-131 where similar assumptions are made.

In `@pollard_rho.H`:
- Around line 71-73: The lambda f that computes the Pollard step uses
(mod_mul(val, val, n) + c) % n which can overflow before the modulus; update f
(the lambda named f) to perform the addition safely under modulus by either
using a modular-add helper (e.g., mod_add) or by promoting the sum to a wider
integer type (unsigned __int128) when adding c to the result of mod_mul(val,
val, n) and then taking % n, ensuring you only reference mod_mul, c, and n
inside f and preserve the original semantics.

In `@README.md`:
- Around line 3632-3664: The new streaming subsection (anchor
readme-streaming-algorithms and its code block referencing reservoir-sampling.H,
count-min-sketch.H, hyperloglog.H, minhash.H, simhash.H, and the example using
Aleph::Count_Min_Sketch and Aleph::HyperLogLog) was inserted in the middle of an
existing markdown table and broke its row continuity (so later rows such as the
combinatorics_enumeration_example.cc entry no longer render as table rows); fix
this by moving the entire Probabilistic Streaming Algorithms section (heading,
table snippet and code block) so it appears after the existing table finishes,
or alternatively wrap the streaming content as a single table cell if it must
stay inline, ensuring the original table row for "Lexicographic
permutation/combination enumeration" and the following rows are restored exactly
as table syntax (pipe-delimited) so the markdown table renders correctly.

In `@reservoir-sampling.H`:
- Around line 78-86: The constructor Reservoir_Sampler currently calls
gsl_rng_alloc(...) and immediately uses rng_ via gsl_rng_set without checking
for null; add a null check after gsl_rng_alloc(gsl_rng_mt19937) and handle
allocation failure (e.g., throw or call ah_domain_error_if with a descriptive
message) before calling gsl_rng_set, ensuring rng_ is only used when non-null;
update the Reservoir_Sampler constructor (rng_, gsl_rng_alloc, gsl_rng_set
references) to perform the check and fail fast if allocation returns nullptr.
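The fail-fast pattern can be shown without linking GSL by stubbing the allocator. `fake_rng_alloc` and `rng_state` below are hypothetical stand-ins for `gsl_rng_alloc(gsl_rng_mt19937)` and `gsl_rng*`; the shape of the check is what matters, since `gsl_rng_alloc` is documented to return a null pointer when allocation fails.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-ins for gsl_rng / gsl_rng_alloc, so the pattern is
// testable without GSL.  The `fail` flag simulates an allocation failure.
struct rng_state { unsigned long seed = 0; };

inline rng_state* fake_rng_alloc(const bool fail)
{ return fail ? nullptr : new rng_state; }

struct Reservoir_Sampler
{
  rng_state* rng_;

  explicit Reservoir_Sampler(const unsigned long seed,
                             const bool fail_alloc = false)
    : rng_(fake_rng_alloc(fail_alloc))
  {
    if (rng_ == nullptr) // gsl_rng_alloc returns nullptr on failure
      throw std::runtime_error("Reservoir_Sampler: rng allocation failed");
    rng_->seed = seed;   // only reached when rng_ is non-null (gsl_rng_set)
  }

  ~Reservoir_Sampler() { delete rng_; }
};
```

In the real constructor the throw would go through `ah_domain_error_if` (or a similar macro) before `gsl_rng_set` is called.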

In `@Tests/streaming_algorithms_test.cc`:
- Around line 84-90: The test incorrectly asserts that
cms.estimate("dragonfruit") == 0 which is not guaranteed by Count-Min Sketch due
to hash collisions; remove the EXPECT_EQ(cms.estimate("dragonfruit"), 0u) line
(or replace it with a relaxed, deterministic check such as
EXPECT_GE(cms.estimate("dragonfruit"), 0u) or an assertion that it is below a
known noise bound if you can compute one from the CMS parameters) so the test no
longer expects an exact zero for unseen keys; look for the
cms.estimate("dragonfruit") assertion in the test and delete or relax it
accordingly.

In `@tpl_sort_utils.H`:
- Around line 5260-5284: bucket_sort_impl currently assumes num_buckets > 0 and
unconditionally writes offsets[0], which will OOB when num_buckets == 0; guard
the whole bucket computation early: in bucket_sort_impl check if num_buckets ==
0 (or n <= 1) and return immediately, or handle the zero-bucket case before
creating/using Array<size_t> offsets and accessing offsets[0]; update any
related logic that creates counts/offsets and computes prefix sums (references:
bucket_sort_impl, counts, offsets, bucket_key, num_buckets) so no reads/writes
occur when num_buckets is zero.
- Around line 6047-6053: The overload timsort(T* a, long l, long r, ...)
currently dereferences a + l without validating a; add a null-pointer contract
check at the start of timsort(T*, long, long, ...) to guard non-empty ranges: if
a == nullptr and r >= l then either throw a std::invalid_argument (or call
assert) with a clear message and return, otherwise proceed to call
timsort_detail::timsort_impl(a + l, static_cast<size_t>(r - l + 1), cmp); ensure
the check references the function name timsort and the callee
timsort_detail::timsort_impl so the change is easy to locate.
- Around line 5395-5403: The key lambda ignores the provided comparator (cmp)
and always maps values from min→max; detect whether ordering is descending by
evaluating cmp(min_val, max_val) and, if descending, invert the computed bucket
index before returning it. Concretely, inside the key lambda (captures min_val,
range, num_buckets, and cmp), compute b as you already do, then if cmp(min_val,
max_val) is false (meaning a descending comparator like greater was provided)
set b = (num_buckets - 1) - b; then return the clamped b and call
bucket_sort_detail::bucket_sort_impl(a, n, num_buckets, key, cmp) as before.
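For the first bullet (the `num_buckets == 0` guard), here is a minimal self-contained sketch of `bucket_sort_impl` with the early return placed before any `counts`/`offsets` allocation. `std::vector` stands in for the library's `Array<size_t>`, and the distribution pass is simplified to a stable counting scatter.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch: guard num_buckets == 0 (and trivial inputs) up front so
// offsets[0] is never written when there are no buckets.
template <class T, class Key>
void bucket_sort_impl(T* a, const size_t n, const size_t num_buckets, Key key)
{
  if (num_buckets == 0 || n <= 1)
    return; // nothing to distribute; avoids the offsets[0] OOB write

  std::vector<size_t> counts(num_buckets, 0);
  for (size_t i = 0; i < n; ++i)
    ++counts[key(a[i])];

  std::vector<size_t> offsets(num_buckets, 0);
  offsets[0] = 0; // safe: num_buckets > 0 on this path
  for (size_t b = 1; b < num_buckets; ++b)
    offsets[b] = offsets[b - 1] + counts[b - 1];

  std::vector<T> out(n);
  for (size_t i = 0; i < n; ++i)
    out[offsets[key(a[i])]++] = a[i];
  for (size_t i = 0; i < n; ++i)
    a[i] = out[i];
}
```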
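For the second bullet (the null-pointer contract on the pointer overload of `timsort`), a sketch of the guard. `std::sort` is a placeholder for `timsort_detail::timsort_impl`, which is internal to the library; only the precondition check is the point here.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <stdexcept>

// Sketch: reject a null base pointer for non-empty ranges [l, r] before
// any dereference; empty ranges (r < l) remain a harmless no-op.
template <class T, class Cmp = std::less<T>>
void timsort(T* a, const long l, const long r, Cmp cmp = Cmp())
{
  if (r < l)
    return; // empty range: nothing to sort, even with a == nullptr

  if (a == nullptr)
    throw std::invalid_argument("timsort: null array with non-empty range");

  // placeholder for timsort_detail::timsort_impl(a + l, r - l + 1, cmp)
  std::sort(a + l, a + r + 1, cmp);
}
```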
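For the third bullet (the comparator-aware key), the mirroring logic can be sketched as a free function. The capture names `min_val`, `range`, and `num_buckets` mirror the lambda in tpl_sort_utils.H; `num_buckets >= 1` is assumed, as in the surrounding code.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Sketch: map v into [0, num_buckets) as before (min -> max), then
// mirror the index when cmp(min_val, max_val) is false, i.e. when a
// descending comparator such as std::greater was supplied.
// Assumes num_buckets >= 1.
template <class Cmp>
size_t bucket_index(const double v, const double min_val,
                    const double max_val, const size_t num_buckets, Cmp cmp)
{
  const double range = max_val - min_val;
  size_t b = range > 0
    ? static_cast<size_t>((v - min_val) / range * (num_buckets - 1))
    : 0;
  if (b >= num_buckets)
    b = num_buckets - 1;           // clamp, as in the original lambda
  if (!cmp(min_val, max_val))      // descending order requested
    b = (num_buckets - 1) - b;     // mirror so bucket order matches cmp
  return b;
}
```

With this in place, `bucket_sort_detail::bucket_sort_impl(a, n, num_buckets, key, cmp)` is called exactly as before.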

---

Nitpick comments:
In `@Examples/random_select_example.cc`:
- Around line 50-53: The helper function print_rule() is duplicated across
examples; remove the local definition in Examples/random_select_example.cc and
instead include the shared declaration by adding `#include "print_rule.H"` (or the
header that defines/declares print_rule()) and rely on the single shared
implementation; ensure the example no longer defines print_rule() and that the
build links against the shared implementation so Examples::random_select_example
uses the centralized print_rule() function.

In `@Examples/streaming_demo.cc`:
- Around line 84-87: Replace the hardcoded "Actual" counts with values computed
from the demo input `stream`: build a frequency map by iterating over the
existing `stream` container (e.g., an unordered_map<string,int> or map) and
count occurrences, then use that map when printing the estimated frequencies
(refer to `stream` and `cms.estimate`) so the printed "(Actual: ...)" values
come from the computed map (e.g., map["apple"], map["banana"], map["word_0"])
instead of fixed literals.

In `@modular_arithmetic.H`:
- Around line 64-84: The mod_mul function lacks a precondition for m > 0; add an
explicit check and/or documented `@pre` to avoid division-by-zero in the
__uint128_t path and incorrect behavior in the fallback loop. Update the mod_mul
declaration to document the contract (e.g., "@pre m > 0") and add a runtime
validation at the start of mod_mul (assert or throw/abort) that fails fast when
m == 0 so callers get a clear error instead of undefined behavior; reference the
mod_mul(uint64_t a, uint64_t b, uint64_t m) function when making these changes.
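A sketch of `mod_mul` with the `m > 0` contract made explicit. The `__uint128_t` path matches the header's fast path; the raw `throw` is a placeholder for whatever validation style the project settles on (assert or an ah-errors.H macro).

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Sketch of mod_mul with an explicit precondition.
// @pre m > 0
inline uint64_t mod_mul(const uint64_t a, const uint64_t b, const uint64_t m)
{
  if (m == 0) // fail fast instead of dividing by zero below
    throw std::invalid_argument("mod_mul: modulus must be > 0");
  return static_cast<uint64_t>((static_cast<__uint128_t>(a) * b) % m);
}
```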

In `@simhash.H`:
- Around line 108-116: The get_fingerprint() method uses bounds-checked array
access v_[i] inside the loop while update() uses the unchecked accessor v_(i);
replace v_[i] with v_(i) in get_fingerprint() so the internal loop consistently
uses the unchecked operator() for indices 0..63 (keeping the existing loop
bounds and noexcept signature) to avoid the extra bounds checks and match
update().
- Around line 72-75: Introduce a single named constant for the fingerprint size
(e.g., static constexpr size_t FINGERPRINT_SIZE = 64) and replace all magic
literals `64` in the SimHash class with that constant; update the SimHash()
constructor (v_ = Array<double>(FINGERPRINT_SIZE, 0.0)), and any other methods
referenced in the diff such as update, get_fingerprint, and clear to use
SimHash::FINGERPRINT_SIZE instead of the literal so the size is maintained in
one place.
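The constant extraction in the second bullet is a small change, sketched below with `std::vector<double>` standing in for the library's `Array<double>`; only the shape of the refactor is shown, not the real SimHash logic.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch: one named constant for the fingerprint width replaces the
// scattered `64` literals in the constructor, update, get_fingerprint
// and clear.
class SimHash
{
public:
  static constexpr size_t FINGERPRINT_SIZE = 64;

  SimHash() : v_(FINGERPRINT_SIZE, 0.0) {}

  void clear() { v_.assign(FINGERPRINT_SIZE, 0.0); }

  size_t width() const noexcept { return v_.size(); }

private:
  std::vector<double> v_; // stand-in for Array<double>
};
```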

In `@Tests/math_nt_test.cc`:
- Around line 85-111: Add deterministic randomized property-style checks to the
MillerRabin and PollardRho tests: in TEST(MathNT, MillerRabin) and TEST(MathNT,
PollardRho) initialize a fixed-seed RNG, generate a set of random n values
(including composites and primes), and for each assert miller_rabin(n) matches a
known-primality oracle (or compare with a slow deterministic test for small n);
for pollard_rho, for each random composite n call pollard_rho(n) and assert the
product of returned factors equals n and that every returned factor passes
miller_rabin (is prime); ensure tests are deterministic by using the fixed seed
and normalize factor ordering (e.g., sort) before comparisons.
- Around line 98-123: Add a bounded performance regression test in
Tests/math_nt_test.cc that times a large deterministic input for the hot
algorithms (use pollard_rho for a large composite like product of two known
32/64-bit primes and NTT::multiply for large power-of-two vectors), measuring
elapsed time with a steady clock and asserting it stays below a fixed threshold
(e.g., EXPECT_LE(duration_ms, <threshold_ms>)) to catch complexity regressions;
ensure the inputs are deterministic (hard-coded primes/arrays) and reference the
pollard_rho and NTT<>::multiply calls so the test fails if factorization or NTT
slows unexpectedly.
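For the first bullet, here is a self-contained sketch of the fixed-seed property check: a deterministic-witness Miller-Rabin (the witness set below is known to be valid for all 64-bit inputs) is cross-checked against a slow trial-division oracle on pseudo-random values. The library's `miller_rabin` in primality.H may be implemented differently; this only shows the test shape.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

inline uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m)
{ return static_cast<uint64_t>(static_cast<__uint128_t>(a) * b % m); }

inline uint64_t powmod(uint64_t b, uint64_t e, uint64_t m)
{
  uint64_t r = 1 % m;
  for (b %= m; e != 0; e >>= 1, b = mulmod(b, b, m))
    if (e & 1) r = mulmod(r, b, m);
  return r;
}

// Deterministic Miller-Rabin for 64-bit n using a fixed witness set.
inline bool miller_rabin(const uint64_t n)
{
  if (n < 2) return false;
  for (uint64_t p : {2ull, 3ull, 5ull, 7ull})
    if (n % p == 0) return n == p;

  uint64_t d = n - 1;
  int s = 0;
  while ((d & 1) == 0) { d >>= 1; ++s; }

  for (uint64_t a : {2ull, 3ull, 5ull, 7ull, 11ull, 13ull, 17ull,
                     19ull, 23ull, 29ull, 31ull, 37ull})
    {
      if (a % n == 0) continue;      // witness equals n: skip
      uint64_t x = powmod(a, d, n);
      if (x == 1 || x == n - 1) continue;
      bool composite = true;
      for (int i = 1; i < s && composite; ++i)
        {
          x = mulmod(x, x, n);
          if (x == n - 1) composite = false;
        }
      if (composite) return false;
    }
  return true;
}

// Slow oracle for the cross-check.
inline bool is_prime_slow(const uint64_t n)
{
  if (n < 2) return false;
  for (uint64_t d = 2; d * d <= n; ++d)
    if (n % d == 0) return false;
  return true;
}

// Deterministic because the seed is fixed, as the review suggests.
inline bool primality_property_holds(const unsigned seed, const int trials)
{
  std::mt19937_64 rng(seed);
  std::uniform_int_distribution<uint64_t> dist(0, 1'000'000);
  for (int i = 0; i < trials; ++i)
    if (miller_rabin(dist(rng)) != is_prime_slow(dist(rng) * 0 + dist(rng)))
      return false;
  return true;
}
```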

In `@Tests/sort_utils.cc`:
- Around line 2103-2120: Add two regression tests for BucketSort: one TEST named
BucketSort.descending_comparator that calls bucket_sort with a descending
comparator (e.g., std::greater<...>) on a small descending input and asserts the
result is sorted ascending according to that comparator; and one TEST named
BucketSort.zero_buckets_contract that exercises the new overload taking an
explicit buckets count—assert that bucket_sort(nullptr, 0) does not throw
(already present) and that calling bucket_sort on a non-null buffer with buckets
== 0 throws std::invalid_argument to lock down the zero-buckets contract.
Reference the bucket_sort symbol and the TEST macro names when adding these
cases.

In `@Tests/streaming_algorithms_test.cc`:
- Around line 50-160: Add explicit edge-case unit tests for the streaming APIs:
for reservoir_sample add tests for k=0 with non-empty input and empty input
iterators (verify sample.size()==0 and no crash), for Count_Min_Sketch add
updates/estimates on empty sketch and single-element sketches (using
Count_Min_Sketch::from_error_bounds or the constructor used in CMSSketchMerge)
and assert estimates are 0 for unseen keys and correct for single updates, for
HyperLogLog construct tiny/register-low HLLs (e.g., small p) and verify estimate
for empty set is 0 and for single-element updates is ~1, and for MinHash and
SimHash add tests that call update zero times (empty) and once (single element)
and ensure similarity/estimate/fingerprint calls do not crash and return
sensible deterministic values (e.g., similarity between identical single-element
sketches is ~1 and empty comparisons behave consistently). Ensure tests
reference reservoir_sample, Count_Min_Sketch, HyperLogLog, MinHash, and SimHash
so the behavior at boundaries is explicitly asserted.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8d3296d and 049e044.

📒 Files selected for processing (30)
  • Examples/CMakeLists.txt
  • Examples/combinatorics_enumeration_example.cc
  • Examples/gray_code_example.cc
  • Examples/math_nt_example.cc
  • Examples/print_rule.H
  • Examples/random_select_example.cc
  • Examples/streaming_demo.cc
  • README.es.md
  • README.md
  • Tests/CMakeLists.txt
  • Tests/ah-comb.cc
  • Tests/math_nt_test.cc
  • Tests/sort_utils.cc
  • Tests/streaming_algorithms_test.cc
  • ah-comb.H
  • ah-dry.H
  • count-min-sketch.H
  • hyperloglog.H
  • minhash.H
  • modular_arithmetic.H
  • modular_combinatorics.H
  • modular_linalg.H
  • ntt.H
  • pollard_rho.H
  • primality.H
  • reservoir-sampling.H
  • simhash.H
  • tpl_array.H
  • tpl_dynArray.H
  • tpl_sort_utils.H

@lrleon lrleon changed the title Utils Add combinatorics, number theory, and streaming algorithm utilities' or 'Add ah-comb, NTT, modular arithmetic, and streaming algorithms Feb 27, 2026
Enhances several sorting algorithms (mergesort, quicksort, introsort, heapsort, counting sort, radix sort, bucket sort) and adds new tests for streaming algorithms (Count-Min Sketch, HyperLogLog). Refines modular arithmetic operations and improves noexcept specifications.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (5)
modular_arithmetic.H (1)

207-215: ⚠️ Potential issue | 🔴 Critical

Must fix: crt needs modulus validation and overflow checks for prod.

Line 209 can silently overflow, and Line 214 can divide by zero if any modulus is 0.

🧩 Suggested fix
     uint64_t prod = 1;
-    for (const uint64_t m : mod)
-        prod *= m;
+    for (const uint64_t m : mod)
+      {
+        ah_invalid_argument_if(m < 2) << "crt: modulus must be >= 2";
+        ah_runtime_error_if(prod > std::numeric_limits<uint64_t>::max() / m)
+          << "crt: product of moduli overflows uint64_t";
+        prod *= m;
+      }

As per coding guidelines: "Validate input preconditions" and "Check boundary conditions and off-by-one errors".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modular_arithmetic.H` around lines 207 - 215, The crt implementation must
validate inputs and prevent overflow: first check rem.size() == mod.size() and
that every mod[i] > 0 (reject/throw on zero) before computing prod; compute the
product using a wider type or checked multiplication (e.g., use __int128 for
intermediate multiplies) and detect if prod exceeds UINT64_MAX (return
error/throw) to avoid silent overflow; when computing p = prod / mod[i] you can
assume mod[i] > 0 after validation, and pass correctly sized/checked values into
mod_inv; update symbols referenced: crt, prod, p, mod, rem, mod_inv.
modular_combinatorics.H (2)

96-125: ⚠️ Potential issue | 🔴 Critical

Must fix: nCk/lucas_nCk can read beyond precomputed tables.

Line 99 and Line 124 index fact_/invFact_ without verifying requested range, which can cause out-of-bounds access. If you add Aleph range checks, these methods should no longer be noexcept.

🧩 Suggested fix
-    [[nodiscard]] uint64_t nCk(uint64_t n, uint64_t k) const noexcept
+    [[nodiscard]] uint64_t nCk(uint64_t n, uint64_t k) const
     {
         if (k > n) return 0;
+        ah_range_error_if(n >= fact_.size())
+          << "nCk: n exceeds precomputed range";
         const uint64_t num = fact_[n];
         const uint64_t den = mod_mul(invFact_[k], invFact_[n - k], mod_);
         return mod_mul(num, den, mod_);
     }
@@
-    [[nodiscard]] uint64_t lucas_nCk(uint64_t n, uint64_t k) const noexcept
+    [[nodiscard]] uint64_t lucas_nCk(uint64_t n, uint64_t k) const
     {
         if (k > n) return 0;
         if (k == 0) return 1;

         const uint64_t ni = n % mod_;
         const uint64_t ki = k % mod_;
+        ah_range_error_if(ni >= fact_.size())
+          << "lucas_nCk: ni exceeds precomputed range";

         if (ki > ni) return 0; // Lucas theorem property

As per coding guidelines: "Check boundary conditions and off-by-one errors" and "Validate input preconditions".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modular_combinatorics.H` around lines 96 - 125, nCk and lucas_nCk currently
index fact_ and invFact_ without bounds checks which can OOB; update both to
validate that the requested n (and ni in lucas_nCk) is within the precomputed
range (e.g. compare n and ni against fact_.size() or a max_n_ member) before
accessing fact_/invFact_, and handle out-of-range inputs (throw an exception or
return 0 per project convention) instead of undefined behavior; also remove
noexcept from both nCk and lucas_nCk signatures so they can propagate errors.
Ensure references to fact_, invFact_, nCk, lucas_nCk and mod_ are updated
accordingly.

68-72: ⚠️ Potential issue | 🔴 Critical

Must fix: validate modulus before any table construction.

Line 70 can underflow mod_ - 1 when p == 0, and later operations can hit invalid modulo paths.

🛡️ Suggested fix
 ModularCombinatorics(size_t max_n, uint64_t p) : mod_(p)
 {
+    ah_invalid_argument_if(mod_ < 2)
+      << "ModularCombinatorics: modulus must be >= 2";
+
     if (max_n >= mod_)
         max_n = static_cast<size_t>(mod_ - 1);

As per coding guidelines: "Validate input preconditions" and "ALWAYS use Aleph exception macros from ah-errors.H instead of raw throw statements".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modular_combinatorics.H` around lines 68 - 72, Validate the incoming modulus
p at the start of the ModularCombinatorics(size_t max_n, uint64_t p) constructor
before any use of mod_ or arithmetic like mod_ - 1 (e.g., check p == 0 and
reject it), and if invalid raise an Aleph-style exception using the macros from
ah-errors.H (use the project’s ALEPH/ah-errors macro rather than raw throw);
only after successful validation adjust max_n relative to mod_ and proceed with
table construction.
ntt.H (1)

92-99: ⚠️ Potential issue | 🔴 Critical

Must fix: validate transform size constraints before butterfly loops.

Without checks, non-power-of-two lengths can access past bounds (Line 112), and unsupported sizes produce invalid roots. n == 0 is also invalid for inverse scaling.

🛡️ Suggested fix
+#include <ah-errors.H>
@@
     static void transform(Array<uint64_t>& a, bool invert)
     {
-        int n = a.size();
+        const size_t n = a.size();
+        ah_invalid_argument_if(n == 0)
+          << "NTT::transform: n must be > 0";
+        ah_invalid_argument_if((n & (n - 1)) != 0)
+          << "NTT::transform: n must be a power of two";
+        ah_invalid_argument_if(((MOD - 1) % n) != 0)
+          << "NTT::transform: n must divide MOD-1";
         bit_reverse(a);
@@
-        for (int len = 2; len <= n; len <<= 1)
+        for (size_t len = 2; len <= n; len <<= 1)
@@
-            for (int i = 0; i < n; i += len)
+            for (size_t i = 0; i < n; i += len)
@@
-                for (int j = 0; j < len / 2; j++)
+                for (size_t j = 0; j < len / 2; j++)

As per coding guidelines: "Validate input preconditions" and "ALWAYS use Aleph exception macros from ah-errors.H instead of raw throw statements".

Also applies to: 106-124

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ntt.H` around lines 92 - 99, The transform function must validate input size
and root-compatibility before running butterfly loops: check that n = a.size() >
0 and is a power of two (e.g., verify (n & (n-1)) == 0) and raise an Aleph error
macro from ah-errors.H if not; before computing wlen with mod_exp(ROOT,
(MOD-1)/len, MOD) ensure (MOD-1) % len == 0 (or otherwise fail with an Aleph
error) so roots are valid; also guard loops so len never exceeds n; replace any
raw throw with the appropriate Aleph exception macro; ensure the inverse branch
handles n==0 validation prior to scaling (e.g., require valid n before calling
modular inverse).
Tests/streaming_algorithms_test.cc (1)

89-89: ⚠️ Potential issue | 🟠 Major

Must fix: remove the hard upper bound for unseen Count-Min Sketch keys.

Line 89 still assumes low collision noise (< 5), which is not guaranteed by CMS and can become flaky across hash layouts/implementations.

💡 Deterministic alternative
-  EXPECT_LT(cms.estimate("dragonfruit"), 5u); // unseen key: might have collisions but should be low
+  const auto unseen_before = cms.estimate("dragonfruit");
+  cms.update("dragonfruit", 1);
+  EXPECT_GE(cms.estimate("dragonfruit"), unseen_before + 1);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Tests/streaming_algorithms_test.cc` at line 89, Remove the flaky hard upper
bound check on Count-Min Sketch collisions: delete the
EXPECT_LT(cms.estimate("dragonfruit"), 5u) assertion and instead assert a
deterministic property such as non-negativity (e.g., use
EXPECT_GE(cms.estimate("dragonfruit"), 0u)) or otherwise remove the check;
locate the check by the test variable/name cms and its estimate("dragonfruit")
call in the streaming_algorithms_test.cc test and update that assertion
accordingly.
🧹 Nitpick comments (4)
ah-dry.H (1)

1539-1609: Well-designed equality mixin with proper exception safety.

The EqualSequenceMethod class is correctly implemented:

  • Self-comparison optimization (line 1575-1576)
  • Size short-circuit before element-wise comparison
  • Using not (a == b) instead of a != b is a good choice—it only requires operator== on the element type
  • Conditional noexcept properly propagates the exception specification from the element's operator==

Minor consistency note: const_me() at line 1555 could be marked noexcept to match similar helpers elsewhere in the file (e.g., lines 169, 557).


Optional: Add noexcept to const_me()
-  const Container * const_me() const
+  const Container * const_me() const noexcept
   {
     return static_cast<const Container*>(this);
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ah-dry.H` around lines 1539 - 1609, Mark the helper method const_me() in
class EqualSequenceMethod as noexcept to match similar helpers and improve
consistency; update the signature of const_me() (the const helper that returns
static_cast<const Container*>) to be declared noexcept while leaving its body
unchanged so exception-specification propagation for equal_to() remains
consistent.
Tests/math_nt_test.cc (1)

117-140: Should fix: add explicit failure-path tests for NTT and modular-combinatorics contracts.

Please add EXPECT_THROW coverage for invalid NTT lengths and out-of-range combinatorics requests (after bounds checks are added), plus minimal empty/singleton edge cases.

Based on learnings: "Applies to Tests//*_test.cc : Tests must cover edge cases: empty inputs, single element, large inputs" and "Applies to Tests//*_test.cc : Tests must verify algorithmic correctness, not just absence of crashes".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Tests/math_nt_test.cc` around lines 117 - 140, Add explicit edge-case and
failure-path assertions: extend the NTT tests to include empty and singleton
inputs (call NTT<>::multiply with size 0 and size 1 arrays and assert correct
results) and add EXPECT_THROW checks when calling NTT<>::multiply with invalid
lengths (e.g., non-power-of-two or lengths that the implementation documents as
invalid). For ModularCombinatorics, add tests that call nCk with out-of-range
parameters (k>n and negative/invalid n/k where applicable) and assert
EXPECT_EQ(..., 0) or EXPECT_THROW depending on the new bounds-checking behavior,
plus tests for nCk with n=0 and n=1 and lucas_nCk on small edge cases (e.g., n
or k = 0 or 1) to verify correct outputs.
Tests/streaming_algorithms_test.cc (1)

50-169: Should fix: add edge-case contract tests for the new streaming APIs.

Current coverage is mostly happy-path. Please add deterministic edge-case tests (empty/singleton input, invalid constructor bounds, and invalid merge/precondition cases) for each algorithm class.

Based on learnings: "Applies to Tests//*_test.cc : Tests must cover edge cases: empty inputs, single element, large inputs" and "Applies to Tests//*_test.cc : Tests must be deterministic and repeatable".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Tests/streaming_algorithms_test.cc` around lines 50 - 169, Add deterministic
edge-case unit tests for each streaming algorithm: for reservoir_sample call
reservoir_sample with k=0, k=1 and empty input and a fixed seed to assert sizes
and bounds; for Count_Min_Sketch add tests that from_error_bounds throws for
invalid eps/delta (already partially present) plus tests for update/estimate on
empty/singleton streams and a deterministic merge precondition test that merge
throws or fails when dimensions differ (use Count_Min_Sketch::from_error_bounds
and merge with mismatched width/depth); for HyperLogLog add constructor
validation tests (e.g., invalid register bits) and tests for estimate on empty
and singleton streams using HyperLogLog::update and estimate; for MinHash add
tests for empty sets, singleton sets and deterministic similarity for identical
singletons using MinHash::update and similarity; for SimHash add tests that
empty documents and singleton-word docs produce deterministic fingerprints via
SimHash::update/get_fingerprint and that similarity behaves for identical vs
disjoint inputs; ensure all tests use fixed seeds or deterministic inputs and
assert exact sizes/expected behavior or thrown std::domain_error where
preconditions are violated.
Tests/sort_utils.cc (1)

2220-2233: random_large is useful, but consider reducing unit-test load or reclassifying as stress/perf.

N = 100000 in the default unit suite may noticeably increase CI runtime. Consider a smaller unit size plus a dedicated performance/stress target for the large case.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Tests/sort_utils.cc` around lines 2220 - 2233, The Timsort::random_large test
uses N = 100000 which is heavy for unit CI; change the test in TEST(Timsort,
random_large) to use a smaller N (e.g., 1000 or 10000) for the default unit
suite and move or duplicate the large N case into a new perf/stress test target
(e.g., TEST(Timsort, perf_large) or a separate benchmark file) that runs timsort
on N = 100000; update any test registration or build/test config so only the
smaller test runs in fast CI while the large case runs in the performance/stress
job.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modular_arithmetic.H`:
- Around line 64-71: The modular routines must validate the modulus before
performing any % operations: add an early precondition check in mod_mul (and the
other modular primitive around lines 95-99) to ensure m > 0 and handle
violations (e.g., throw std::invalid_argument("modulus must be > 0") or assert)
so no code path does (x % m) with m==0; place the check at the start of the
functions (mod_mul and the other modular function) and return/throw immediately
if the check fails.

In `@ntt.H`:
- Around line 141-144: The loop that sets n = 1; while (n < a.size() + b.size()
- 1) n <<= 1; must be bounded by the maximum NTT length supported by (MOD,
ROOT); compute max_ntt_len as the largest power-of-two divisor of MOD-1 (i.e.
largest n such that n | (MOD-1)) and after computing the desired n check if n >
max_ntt_len; if so either reduce n to max_ntt_len and handle/truncate inputs
accordingly or return/throw a clear error; update the convolution/ntt entry
point (the function using n, a.size(), b.size(), MOD, ROOT) to validate and
handle the oversized-input case.
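The bound described above can be computed directly from `MOD - 1`. The value `998244353` is the conventional NTT prime and stands in for the header's actual `MOD`; `max_ntt_len` and `fits_ntt` are illustrative helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr uint64_t MOD = 998244353; // MOD - 1 = 119 * 2^23

// Largest power of two dividing MOD - 1, i.e. the largest supported
// transform length.  m & -m isolates the lowest set bit.
inline size_t max_ntt_len() noexcept
{
  const uint64_t m = MOD - 1;
  return static_cast<size_t>(m & (~m + 1)); // 2^23 for this MOD
}

// True when the padded convolution size for result_len fits the root.
inline bool fits_ntt(const size_t result_len) noexcept
{
  size_t n = 1;
  while (n < result_len)
    n <<= 1; // same doubling loop as the convolution entry point
  return n <= max_ntt_len();
}
```

The convolution entry point would call `fits_ntt(a.size() + b.size() - 1)` and raise an error (or split the work) when it returns false.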

In `@Tests/sort_utils.cc`:
- Around line 2172-2181: Update the TEST named BucketSort.zero_buckets to assert
post-conditions beyond crash-safety: after calling bucket_sort(a, n,
num_buckets, key) verify that the contents of the input array `a` remain
unchanged (or match the documented behavior) — e.g., compare each element of `a`
to its original values (3,1,2) using the existing test framework assertions
rather than only EXPECT_NO_THROW; reference the test's local variables `a`, `n`,
`num_buckets`, `key`, and the function `bucket_sort` when adding these
assertions.


ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 049e044 and 17989db.

📒 Files selected for processing (13)
  • Examples/streaming_demo.cc
  • Tests/math_nt_test.cc
  • Tests/sort_utils.cc
  • Tests/streaming_algorithms_test.cc
  • ah-dry.H
  • count-min-sketch.H
  • hyperloglog.H
  • modular_arithmetic.H
  • modular_combinatorics.H
  • modular_linalg.H
  • ntt.H
  • reservoir-sampling.H
  • tpl_sort_utils.H
🚧 Files skipped from review as they are similar to previous changes (4)
  • modular_linalg.H
  • count-min-sketch.H
  • hyperloglog.H
  • reservoir-sampling.H

Refactor modular arithmetic, Pollard Rho, and various streaming algorithms for improved robustness, clarity, and error handling. Updates include noexcept specifiers, precondition checks, and optimized calculations. Also updates MinHash to include a merge method.

Test updates to reflect these changes and add performance regression tests. README.md also updated to reflect new example structure.
Copilot AI review requested due to automatic review settings February 28, 2026 22:00

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

♻️ Duplicate comments (1)
Tests/streaming_algorithms_test.cc (1)

89-90: ⚠️ Potential issue | 🟠 Major

Avoid collision-sensitive bound for unseen-key CMS assertion.

EXPECT_LT(..., 5u) is still brittle for a sketch; assert a deterministic monotonic property instead.

✅ Deterministic alternative
-  EXPECT_LT(cms.estimate("dragonfruit"), 5u); // unseen key: might have collisions but should be low
+  const auto unseen_before = cms.estimate("dragonfruit");
+  cms.update("dragonfruit", 1);
+  EXPECT_GE(cms.estimate("dragonfruit"), unseen_before + 1u);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Tests/streaming_algorithms_test.cc` around lines 89 - 90, The brittle numeric
bound on unseen-key ("dragonfruit") should be replaced with a deterministic
monotonicity assertion: record size_t before = cms.estimate("dragonfruit");
perform a known update to that same key using the test's existing update method
(e.g., cms.add or cms.increment — use the actual update function used elsewhere
in this test file), then assert cms.estimate("dragonfruit") >= before (and
optionally non-negative). Update the EXPECT_LT(cms.estimate("dragonfruit"), 5u)
line to these monotonic checks referencing cms.estimate and the test's existing
update method.
🧹 Nitpick comments (3)
pollard_rho.H (1)

105-127: Consider limiting recursion depth or using iteration for very large inputs.

The extract_prime_factors function uses recursion, and for numbers with many small prime factors, this could lead to deep recursion. However, for practical 64-bit integers, the recursion depth is bounded by log₂(n) ≤ 64, which is safe.

The error handling via ah_runtime_error_unless(false) at line 123 is unconventional but functionally correct—it always throws with the given message.

💡 Optional: Consider using ah_runtime_error() for unconditional errors
-    ah_runtime_error_unless(false)
+    ah_runtime_error()
       << "pollard_rho: failed to find a factor of " << n
       << " after " << max_attempts << " attempts";

This is clearer in intent as an unconditional error.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pollard_rho.H` around lines 105 - 127, The unconditional error at the end of
find_any_factor uses ah_runtime_error_unless(false) which is confusing in
intent; replace that call with ah_runtime_error() and keep the same streamed
message so the function unconditionally throws with a clear API (refer to
find_any_factor and the error message block). Also review extract_prime_factors
for deep recursion on pathological inputs and consider converting it to an
iterative loop or adding a shallow recursion guard if you expect many small
factors, but the immediate change is to swap ah_runtime_error_unless(false) to
ah_runtime_error() in find_any_factor.
tpl_sort_utils.H (1)

5930-5933: Unsigned arithmetic relies on wraparound in edge case.

When hint == 0 and searching left, ofs can reach max_ofs = 1. Line 5931 then computes hint - ofs = 0 - 1, which underflows for size_t. The algorithm still produces correct results because line 5936's ++last_ofs wraps it back to 0, but this relies on implicit unsigned wraparound semantics.

Consider adding a comment documenting this intentional reliance on wraparound, or refactoring to avoid the subtlety:

// Note: When hint == 0 and ofs == 1, last_ofs underflows to SIZE_MAX.
// This is intentional: the subsequent ++last_ofs wraps it to 0,
// yielding the correct binary search bounds.
const size_t tmp = last_ofs;
last_ofs = hint - ofs;
ofs = hint - tmp;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tpl_sort_utils.H` around lines 5930 - 5933, The subtraction last_ofs = hint -
ofs can intentionally underflow for size_t when hint == 0 and ofs == 1; either
document this behavior with a clear comment referring to hint, ofs, last_ofs and
tmp explaining that the subsequent ++last_ofs wraps to 0, or eliminate the
reliance on unsigned wraparound by guarding the subtraction (e.g. if (hint <
ofs) set last_ofs to SIZE_MAX - (ofs - hint) + 1 or switch to a signed
difference) so the binary-search bounds computed by tmp, last_ofs, hint and ofs
are correct without implicit wraparound.
Tests/sort_utils.cc (1)

2235-2254: Performance test threshold may be environment-sensitive.

The 500ms threshold for sorting 100k integers may pass/fail inconsistently across different CI runners or developer machines. Consider either:

  1. Using a relative benchmark (compare against a baseline algorithm)
  2. Documenting the expected hardware characteristics
  3. Making the threshold configurable via environment variable

That said, 500ms for 100k integers is quite generous, so this is likely fine in practice.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Tests/sort_utils.cc` around lines 2235 - 2254, The perf test TEST(Timsort,
perf_large) uses a fixed 500ms threshold which can be flaky across environments;
change the test to use a configurable threshold: read an environment variable
(e.g., TIMSORT_MAX_MS) with a sensible default of 500, or compute a relative
threshold by timing a baseline sort (e.g., std::sort on the same array) and
asserting timsort is within a multiplier; update the assertion that uses
duration and keep the correctness loop (ASSERT_LE on a(i-1), a(i)) unchanged;
reference TEST(Timsort, perf_large), the N constant, the timsort call and the
duration check when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@count-min-sketch.H`:
- Around line 67-68: The Count_Min_Sketch template currently allows negative
counters and silent arithmetic wraparound; update() and merge() must be
hardened: first constrain the CounterT template to an unsigned integral (add a
static_assert or use std::unsigned_integral for CounterT and/or change the
default to size_t) and then add runtime protections in
Count_Min_Sketch::update(...) and Count_Min_Sketch::merge(const
Count_Min_Sketch&) to reject or clamp negative deltas and to detect overflow
before performing additions (e.g., check if current_value >
std::numeric_limits<CounterT>::max() - delta and handle by capping to max or
returning an error/throwing), and replicate the same checks where counters are
incremented in the class (references: Count_Min_Sketch, update(), merge()).
- Around line 81-88: In the Count_Min_Sketch constructor (Count_Min_Sketch and
related overloads) validate the computed dimensions before casting/allocating:
ensure computed product (w_ * d_ or the values derived from epsilon/delta) is
finite, non-zero, and does not overflow size_t (i.e., check against
numeric_limits<size_t>::max() / other_dim), and reject or clamp inputs with a
clear ah_domain_error_if when invalid; only after these checks allocate table_ =
Array<CounterT>(w_ * d_, ...) and reserve seeds_. Apply the same pre-allocation
validation to the other constructor/logic referenced (the one converting
epsilon/delta to w/d) so floating-point-to-size_t conversions are checked for
non-finite, negative, zero, or overflow before casting or memory allocation.

In `@Examples/random_select_example.cc`:
- Around line 59-63: There is an unmatched extra closing brace after the end of
print_array that causes a compilation error; remove the stray `}` following the
print_array closing brace so the function's scope ends correctly and main()
starts immediately (verify the braces around print_array and main are balanced
and that the only closing brace for print_array is the one currently at the end
of that function).

In `@Examples/streaming_demo.cc`:
- Line 39: Replace uses of std::unordered_map / std::unordered_set in
Examples/streaming_demo.cc with the Aleph equivalents (DynHashMap and
DynHashSet): add the Aleph header include(s) for DynHashMap/DynHashSet, change
the type declarations and any construction/usage sites (e.g., the include at the
top that currently references <unordered_map> and all usages around the blocks
noted (around lines referenced in the review) to use DynHashMap<Key,Value> and
DynHashSet<Key> instead, and update any namespace qualifiers if needed) so the
example follows the project container policy; ensure template parameters and
APIs are adapted if Aleph's containers use different constructor or method
names.

In `@Tests/math_nt_test.cc`:
- Around line 232-251: The test MathNT.ModularLinalg currently computes products
using inv_mat and the original matrix entries (inv_mat * m_data), so the comment
"Check A * A^-1 = I" is misleading; update the comment to "Check A^-1 * A = I"
or change the product order to multiply the original matrix (mat or m_data) on
the left and inv_mat on the right so the code actually verifies A * A^-1 = I;
adjust the four EXPECT_EQ lines accordingly to reference the correct
multiplication order involving inv_mat and m_data (or mat) to match the chosen
comment.

In `@Tests/streaming_algorithms_test.cc`:
- Around line 37-39: Replace uses of std::vector and std::set in
Tests/streaming_algorithms_test.cc with the Aleph equivalents across the noted
locations (lines around 37-39, 52-53, 68-70, and 138-140): find every
declaration/usage of std::vector and std::set in this test file and change them
to the corresponding Aleph container types (e.g., Aleph::vector / Aleph::set or
the project-specific container names used elsewhere in the repo), adjust any
construction/iteration APIs if needed to match Aleph semantics, and run the
tests to ensure compilation and behavior remain unchanged.

---

Duplicate comments:
In `@Tests/streaming_algorithms_test.cc`:
- Around line 89-90: The brittle numeric bound on unseen-key ("dragonfruit")
should be replaced with a deterministic monotonicity assertion: record size_t
before = cms.estimate("dragonfruit"); perform a known update to that same key
using the test's existing update method (e.g., cms.add or cms.increment — use
the actual update function used elsewhere in this test file), then assert
cms.estimate("dragonfruit") >= before (and optionally non-negative). Update the
EXPECT_LT(cms.estimate("dragonfruit"), 5u) line to these monotonic checks
referencing cms.estimate and the test's existing update method.

---

Nitpick comments:
In `@pollard_rho.H`:
- Around line 105-127: The unconditional error at the end of find_any_factor
uses ah_runtime_error_unless(false) which is confusing in intent; replace that
call with ah_runtime_error() and keep the same streamed message so the function
unconditionally throws with a clear API (refer to find_any_factor and the error
message block). Also review extract_prime_factors for deep recursion on
pathological inputs and consider converting it to an iterative loop or adding a
shallow recursion guard if you expect many small factors, but the immediate
change is to swap ah_runtime_error_unless(false) to ah_runtime_error() in
find_any_factor.

In `@Tests/sort_utils.cc`:
- Around line 2235-2254: The perf test TEST(Timsort, perf_large) uses a fixed
500ms threshold which can be flaky across environments; change the test to use a
configurable threshold: read an environment variable (e.g., TIMSORT_MAX_MS) with
a sensible default of 500, or compute a relative threshold by timing a baseline
sort (e.g., std::sort on the same array) and asserting timsort is within a
multiplier; update the assertion that uses duration and keep the correctness
loop (ASSERT_LE on a(i-1), a(i)) unchanged; reference TEST(Timsort, perf_large),
the N constant, the timsort call and the duration check when making the change.

In `@tpl_sort_utils.H`:
- Around line 5930-5933: The subtraction last_ofs = hint - ofs can intentionally
underflow for size_t when hint == 0 and ofs == 1; either document this behavior
with a clear comment referring to hint, ofs, last_ofs and tmp explaining that
the subsequent ++last_ofs wraps to 0, or eliminate the reliance on unsigned
wraparound by guarding the subtraction (e.g. if (hint < ofs) set last_ofs to
SIZE_MAX - (ofs - hint) + 1 or switch to a signed difference) so the
binary-search bounds computed by tmp, last_ofs, hint and ofs are correct without
implicit wraparound.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 17989db and aede16b.

📒 Files selected for processing (20)
  • Examples/random_select_example.cc
  • Examples/streaming_demo.cc
  • README.md
  • Tests/ah-comb.cc
  • Tests/math_nt_test.cc
  • Tests/sort_utils.cc
  • Tests/streaming_algorithms_test.cc
  • ah-dry.H
  • count-min-sketch.H
  • hyperloglog.H
  • minhash.H
  • modular_arithmetic.H
  • modular_combinatorics.H
  • modular_linalg.H
  • ntt.H
  • pollard_rho.H
  • primality.H
  • reservoir-sampling.H
  • simhash.H
  • tpl_sort_utils.H
🚧 Files skipped from review as they are similar to previous changes (4)
  • Tests/ah-comb.cc
  • hyperloglog.H
  • ah-dry.H
  • modular_combinatorics.H


Copilot AI left a comment


Pull request overview

Copilot reviewed 30 out of 30 changed files in this pull request and generated 8 comments.

Comments suppressed due to low confidence (1)

tpl_sort_utils.H:5600

  • The floating-point bucket_sort lambda captures max_val, but it is never used inside the lambda. This can trigger -Wunused-lambda-capture under stricter warning settings; remove the unused capture (or use it if intended).
    auto key = [min_val, max_val, range, num_buckets, cmp](const T & val) -> size_t
      {
        auto b = static_cast<size_t>((val - min_val) / range *
                                      static_cast<T>(num_buckets - 1));
        if (b >= num_buckets)

…ng tests

Refactor sorting algorithms in tpl_sort_utils.H to improve clarity, performance, and add new overloads. Enhance Count-Min Sketch for robustness against overflow and add new error bounds validation. Update reservoir sampling for better type handling. Convert vector and set usage to DynArray and DynHashSet in streaming algorithm tests. Adjust random_select example and modular_linalg for type consistency.

- Refactor `Count_Min_Sketch` to handle counter overflow, improve error bounds validation, and use `std::unsigned_integral` for `CounterT`.
- Update `Reservoir_Sampler` to use `std::iterator_traits` for better iterator compatibility.
- Improve `tpl_sort_utils.H` with more detailed Doxygen comments and add new `timsort` overloads for different container types (DynArray, Array, C-array, DynList) and range-based sorting.
- Add overloads for `counting_sort` and `radix_sort` to support various container types (DynArray, Array, DynList, DynDlist, raw pointers, C-arrays) with proper constraint checks.
- Enhance `binary_search` and `build_index` utilities with additional overloads for pointer types and improved documentation.
- Update `sort_utils_test.cc` to use `DynArray` and `DynHashSet` instead of `std::vector` and `std::set` in streaming tests.
- Fix logic for checking inverse in `math_nt_test.cc` from `A * A^-1 = I` to `A^-1 * A = I`.
- Update `random_select_example.cc` and `streaming_demo.cc` to reflect API changes and use `DynArray` and `DynHashMap`.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

♻️ Duplicate comments (1)
Tests/streaming_algorithms_test.cc (1)

90-90: ⚠️ Potential issue | 🟡 Minor

Avoid absolute upper-bound assertions for unseen CMS keys.

EXPECT_LT(cms.estimate("dragonfruit"), 5u) still depends on collision/layout specifics and is brittle. Prefer a contract-based deterministic check.

💡 Proposed fix
-  EXPECT_LT(cms.estimate("dragonfruit"), 5u); // unseen key: might have collisions but should be low
+  const auto unseen_before = cms.estimate("dragonfruit");
+  cms.update("dragonfruit", 1);
+  EXPECT_GE(cms.estimate("dragonfruit"), unseen_before + 1u);

As per coding guidelines: "Tests must verify algorithmic correctness, not just absence of crashes" and "Tests must be deterministic and repeatable".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Tests/streaming_algorithms_test.cc` at line 90, The test's brittle absolute
upper-bound should be replaced with a deterministic contract: assert that
cms.estimate("dragonfruit") honors CMS invariants (it must be >= the true count
0 and <= the total number of inserted items). Replace the
EXPECT_LT(cms.estimate("dragonfruit"), 5u) check with two assertions using
cms.estimate — e.g., EXPECT_GE(cms.estimate("dragonfruit"), 0u) and
EXPECT_LE(cms.estimate("dragonfruit"), total_inserted_count) — where
total_inserted_count is the known sum of all increments used earlier in the
test; this uses cms.estimate as the unique symbol to locate the assertion to
change.
🧹 Nitpick comments (3)
tpl_hash.H (1)

134-135: Optional: preserve full template configurability in the aliases.

If you want Dyn* aliases to fully mirror HashSet/HashMap flexibility, expose the table/comparator template parameters too.

♻️ Suggested refactor
-  template <typename T>
-  using DynHashSet = HashSet<T>;
+  template <typename T,
+            template <typename, class> class HashSetTable = ODhashTable,
+            class Cmp = Aleph::equal_to<T>>
+  using DynHashSet = HashSet<T, HashSetTable, Cmp>;

-  template <typename K, typename V>
-  using DynHashMap = HashMap<K, V>;
+  template <typename K, typename V,
+            template <typename, typename, class> class HashMapTable = MapODhash,
+            class Cmp = Aleph::equal_to<K>>
+  using DynHashMap = HashMap<K, V, HashMapTable, Cmp>;

Also applies to: 138-139

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tpl_hash.H` around lines 134 - 135, The Dyn* type aliases currently fix
template parameters by aliasing template <typename T> using DynHashSet =
HashSet<T>; (and the similar DynHashMap aliases at the other spots), which loses
HashSet/HashMap's table/comparator configurability; change these aliases to
forward all template parameters (e.g., template <typename K, typename Table =
/*default*/, typename Cmp = /*default*/> using DynHashSet = HashSet<K, Table,
Cmp> and similarly for DynHashMap) so the aliases mirror the original templates’
full parameter lists (update the aliases named DynHashSet and the DynHashMap
aliases referenced near the other lines).
tpl_sort_utils.H (1)

5880-5902: Nice to have: deduplicate adjacent Doxygen blocks for the same symbol.

There are repeated back-to-back doc blocks for gallop_left, gallop_right, and timsort_impl. Keeping one canonical block per symbol will reduce doc maintenance noise.

As per coding guidelines: "Doxygen must generate no warnings for changed files".

Also applies to: 5952-5970, 6174-6199

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tpl_sort_utils.H` around lines 5880 - 5902, The file contains duplicated
adjacent Doxygen blocks for the same symbols (e.g., gallop_left, gallop_right,
timsort_impl); remove the redundant consecutive comment blocks and keep a single
canonical Doxygen block per symbol that includes all required tags (`@tparam`,
`@param`, `@return`) and the full descriptive text. Locate the duplicate blocks
around the symbols named gallop_left, gallop_right and timsort_impl and merge
their content into one doc comment each (preserve the most complete wording),
then delete the extra block(s) so only one Doxygen comment immediately precedes
each function declaration/definition to avoid duplicate-doc warnings.
Examples/streaming_demo.cc (1)

105-105: Use a semantically accurate name for the set.

unique_map is a DynHashSet<string>. Renaming to unique_set improves readability.

♻️ Proposed rename
-    DynHashSet<string> unique_map;
+    DynHashSet<string> unique_set;
@@
-        unique_map.insert(s);
+        unique_set.insert(s);
@@
-    cout << "Estimated unique elements: " << fixed << setprecision(1) << hll.estimate() << " (Actual: " << unique_map.size() << ")\n";
+    cout << "Estimated unique elements: " << fixed << setprecision(1) << hll.estimate() << " (Actual: " << unique_set.size() << ")\n";
As per coding guidelines: "Use descriptive variable and function names."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Examples/streaming_demo.cc` at line 105, Rename the variable unique_map (a
DynHashSet<string>) to a semantically accurate name such as unique_set
throughout the file and any related scopes; update all references to unique_map
to the new identifier to reflect that it is a set (e.g., change declarations,
usages, and any comments referencing unique_map) so the symbol
DynHashSet<string> is paired with a clear, descriptive name like unique_set.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@Examples/streaming_demo.cc`:
- Around line 112-113: The cout stream is left in fixed/precision mode after
printing HyperLogLog (hll.estimate()), which corrupts later MinHash/other
numeric outputs; after the line that prints hll.estimate() (and before
subsequent prints around the MinHash output), restore the stream state by either
calling cout.unsetf(ios::floatfield) and resetting precision to the previous
value or by saving/restoring the stream state with ios::fmtflags and precision;
locate the HyperLogLog print (uses hll.estimate()) and the MinHash print sites
to wrap or restore the formatting so subsequent numeric output is not rounded.

In `@modular_linalg.H`:
- Around line 203-205: The subtraction (get_val(...)+mod-term)%mod in the
elimination step overflows for large uint64_t moduli; update the code paths that
compute 'term' via mod_mul and then set_val(a, j, k, ...) to perform
branch-based modular subtraction instead: load current = get_val(a, j, k), if
current >= term then new = current - term else new = current + mod - term, and
pass new to set_val; apply the same change to all similar sites (the elimination
loops using get_val/set_val and term from mod_mul) to ensure overflow-safe
results.

In `@Tests/math_nt_test.cc`:
- Around line 152-184: Remove or disable hard wall-clock assertions in the
TEST(MathNT, PerformanceRegression) test: keep the correctness checks that call
pollard_rho(n) and NTT<>::multiply(a,b) and their EXPECT_EQ/EXPECT_TRUE
assertions, but remove the two EXPECT_LE(duration, ...) timing assertions (the
ones after pollard_rho and after NTT<>::multiply) or relocate them to a
dedicated benchmark/perf test; alternatively, guard those timing checks behind
an explicit opt-in (e.g., an environment flag like RUN_PERF_TESTS) so normal CI
runs omit the flaky wall-clock checks.
- Around line 123-133: The test assumes pollard_rho returns factors in a fixed
order but it can validly return factors in any order; update the assertions in
Tests/math_nt_test.cc to compare order-insensitively by converting the result of
pollard_rho(...) and the expected Array<uint64_t> (exp12, exp97, exp_big) into
comparable multisets or by sorting both containers before calling EXPECT_EQ (or
using EXPECT_EQ on sorted vectors) so the checks for pollard_rho(12),
pollard_rho(97), and pollard_rho(1000000007ULL * 97ULL) pass regardless of
factor order.

In `@Tests/sort_utils.cc`:
- Around line 2236-2252: The TEST(Timsort, perf_large) currently asserts a hard
wall-clock (EXPECT_LE(duration, 500)) which makes CI flaky; change the test to
make the performance check opt-in: remove or disable the hard assertion and
instead gate the timing assertion behind an environment/config flag (e.g. check
getenv("ENABLE_PERF_TESTS") or a command-line test parameter) so that when the
flag is set you assert EXPECT_LE(duration, 500) for timsort otherwise just log
or SUCCEED() with the measured duration; ensure you reference the same symbols
(TEST(Timsort, perf_large), N, timsort, duration) when implementing the
conditional so behavior is unchanged when the opt-in flag is active.

In `@tpl_sort_utils.H`:
- Around line 6312-6323: The timsort function currently does pointer arithmetic
with a + l before validating that l and r are non-negative; add a bounds check
at the start of timsort(T *a, const long l, const long r, const Compare & cmp)
that rejects negative indices (e.g., if (l < 0 || r < 0)
ah_invalid_argument_if(true) << "timsort(): negative bounds ...") and ensure
this check runs before any use of a or a + l; keep the existing empty-range
check (l >= r) and the existing null-pointer check (ah_invalid_argument_if(a ==
nullptr)) but move or add the negative-range validation so
timsort_detail::timsort_impl is only called with safe, non-negative size_t
length.
- Around line 5800-5807: The listed Timsort helper templates (reverse_range,
binary_insertion_sort, gallop_left, gallop_right and other timsort_detail
helpers that call user cmp, std::move or std::swap) must not be marked
unconditional noexcept; change their exception specification to a conditional
one that reflects whether the comparator and moves/swaps are nothrow (e.g.
noexcept(noexcept(cmp(std::declval<T&>(), std::declval<T&>())) &&
std::is_nothrow_swappable_v<T>()) for functions using cmp+swap, or include
std::is_nothrow_move_constructible_v<T> when using std::move), and remove the
plain noexcept from reverse_range and the other identified functions so
exception propagation is preserved.

---

Duplicate comments:
In `@Tests/streaming_algorithms_test.cc`:
- Line 90: The test's brittle absolute upper-bound should be replaced with a
deterministic contract: assert that cms.estimate("dragonfruit") honors CMS
invariants (it must be >= the true count 0 and <= the total number of inserted
items). Replace the EXPECT_LT(cms.estimate("dragonfruit"), 5u) check with two
assertions using cms.estimate — e.g., EXPECT_GE(cms.estimate("dragonfruit"), 0u)
and EXPECT_LE(cms.estimate("dragonfruit"), total_inserted_count) — where
total_inserted_count is the known sum of all increments used earlier in the
test; this uses cms.estimate as the unique symbol to locate the assertion to
change.

---

Nitpick comments:
In `@Examples/streaming_demo.cc`:
- Line 105: Rename the variable unique_map (a DynHashSet<string>) to a
semantically accurate name such as unique_set throughout the file and any
related scopes; update all references to unique_map to the new identifier to
reflect that it is a set (e.g., change declarations, usages, and any comments
referencing unique_map) so the symbol DynHashSet<string> is paired with a clear,
descriptive name like unique_set.

In `@tpl_hash.H`:
- Around line 134-135: The Dyn* type aliases currently fix template parameters
by aliasing template <typename T> using DynHashSet = HashSet<T>; (and the
similar DynHashMap aliases at the other spots), which loses HashSet/HashMap's
table/comparator configurability; change these aliases to forward all template
parameters (e.g., template <typename K, typename Table = /*default*/, typename
Cmp = /*default*/> using DynHashSet = HashSet<K, Table, Cmp> and similarly for
DynHashMap) so the aliases mirror the original templates’ full parameter lists
(update the aliases named DynHashSet and the DynHashMap aliases referenced near
the other lines).

In `@tpl_sort_utils.H`:
- Around line 5880-5902: The file contains duplicated adjacent Doxygen blocks
for the same symbols (e.g., gallop_left, gallop_right, timsort_impl); remove the
redundant consecutive comment blocks and keep a single canonical Doxygen block
per symbol that includes all required tags (`@tparam`, `@param`, `@return`) and the
full descriptive text. Locate the duplicate blocks around the symbols named
gallop_left, gallop_right and timsort_impl and merge their content into one doc
comment each (preserve the most complete wording), then delete the extra
block(s) so only one Doxygen comment immediately precedes each function
declaration/definition to avoid duplicate-doc warnings.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between aede16b and 97b2c61.

📒 Files selected for processing (10)
  • Examples/random_select_example.cc
  • Examples/streaming_demo.cc
  • Tests/math_nt_test.cc
  • Tests/sort_utils.cc
  • Tests/streaming_algorithms_test.cc
  • count-min-sketch.H
  • modular_linalg.H
  • reservoir-sampling.H
  • tpl_hash.H
  • tpl_sort_utils.H


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
ah-dry.H (1)

377-383: ⚠️ Potential issue | 🟡 Minor

Complete required Doxygen tags on touched overload docs.

Line 377 updates this public overload’s docs, but the block still omits explicit @brief and @throws tags (same pattern appears in other touched overload-doc blocks in this patch).

📝 Suggested doc pattern
-  /** Overload of `find_ptr(Operation&) const` that accepts rvalues.
+  /** @brief Overload of `find_ptr(Operation&) const` that accepts rvalues.
   /// The operation is passed (invoked) as a named lvalue to the
   /// lvalue-reference overload rather than being perfectly forwarded.
@@
       @param[in] operation matching criteria
       @return const pointer to matching element or nullptr
+      @throws Anything propagated by `operation` or traversal/iterator operations.
   */

As per coding guidelines: “All public member functions must have @brief, @param, @return, and @throws Doxygen tags.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ah-dry.H` around lines 377 - 383, The Doxygen block for the rvalue overload
of find_ptr is missing required tags; update the comment for the overload (the
one describing find_ptr(Operation&) const that accepts rvalues) to include an
explicit `@brief` describing the purpose, add the existing `@param` and `@return` tags
if not already present, and add an `@throws` tag documenting any exceptions the
function may propagate (or `@throws none` if it never throws). Apply the same
`@brief`/`@throws` addition pattern to the other overload-doc blocks touched in this
patch so all public member overloads comply with the `@brief`, `@param`, `@return`,
and `@throws` requirement.
♻️ Duplicate comments (1)
tpl_sort_utils.H (1)

6234-6239: ⚠️ Potential issue | 🟠 Major

timsort(T*, long, long, ...) skips null-contract validation for singleton ranges.

With current ordering, l == r returns before checking a == nullptr. This diverges from the pointer-length overload contract and weakens API consistency.

Proposed fix
   ah_range_error_if(l < 0 or r < 0)
     << "timsort(): negative bounds [" << l << ", " << r << "] are not allowed";
 
-  if (l >= r)
-    return;
-
-  ah_invalid_argument_if(a == nullptr)
+  ah_invalid_argument_if(a == nullptr and l <= r)
     << "timsort(): null pointer contract violation for non-empty range ["
     << l << ", " << r << "]; cannot call timsort_detail::timsort_impl()";
+
+  if (l >= r)
+    return;

As per coding guidelines: "Validate external inputs before processing" and "Check boundary conditions and off-by-one errors".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tpl_sort_utils.H` around lines 6234 - 6239, The null-pointer contract check
for timsort is skipped for singleton ranges because the early-return "if (l >=
r) return;" precedes the ah_invalid_argument_if(a == nullptr) validation; move
the ah_invalid_argument_if(a == nullptr) check (the null-pointer validation that
emits "timsort(): null pointer contract violation...") to occur before the
range-size early return in the timsort(T*, long, long, ...) entry (the overload
that calls timsort_detail::timsort_impl) so that a == nullptr is validated for
all calls, including l == r cases.
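The ordering the review asks for can be sketched in isolation; `timsort_entry_sketch` below is a hypothetical stand-in for the library's `timsort(T*, long, long)` entry point, not Aleph-w's actual implementation:

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for the timsort(T*, long, long) entry point: the
// pointer contract is checked before the size-based early return, so
// a == nullptr is rejected even for singleton (l == r) ranges.
template <typename T>
void timsort_entry_sketch(T * a, long l, long r)
{
  if (l < 0 or r < 0)
    throw std::range_error("timsort(): negative bounds are not allowed");

  // Null check first: the contract now holds for every call, including l == r.
  if (a == nullptr and l <= r)
    throw std::invalid_argument("timsort(): null pointer for non-empty range");

  if (l >= r)
    return; // zero or one element: nothing to sort

  // ... a real implementation would dispatch to timsort_impl(a, l, r) here ...
}
```

With this ordering, `timsort_entry_sketch<int>(nullptr, 5, 5)` throws instead of silently returning, matching the pointer-length overload's contract.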
🧹 Nitpick comments (3)
hyperloglog.H (1)

67-69: Nice to have: constrain T with a hashability concept for clearer compile-time diagnostics.

HyperLogLog<T> currently accepts any T, but it requires murmur3hash(const T&, seed) to exist.

♻️ Proposed refactor
+# include <concepts>
 ...
 namespace Aleph
 {
+  template <typename U>
+  concept MurmurHashable = requires(const U & v)
+  {
+    { murmur3hash(v, 0x12345678) } -> std::convertible_to<size_t>;
+  };
+
-  template <typename T>
+  template <MurmurHashable T>
   class HyperLogLog
As per coding guidelines: "Use concepts (C++20) to constrain template parameters; avoid SFINAE when possible".

Also applies to: 78-83

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hyperloglog.H` around lines 67 - 69, The template HyperLogLog<T> currently
accepts any T but implicitly requires a murmur3hash(const T&, uint32_t)
function; add a C++20 concept (e.g., Hashable<T> or Murmur3Hashable<T>) that
expresses the requirement for a callable murmur3hash(const T&, std::uint32_t)
(or equivalent signature used in your code) and constrain the class template
with it (e.g., template<typename T> requires Hashable<T> class HyperLogLog).
Update the other templated declarations around the same region (lines referenced
78-83) to use the same concept so compile-time diagnostics are clear and
consistent, and ensure the concept name and signature match the existing
murmur3hash usage in the file.
ah-dry.H (1)

132-170: Constrain template parameters with C++20 concepts to improve diagnostics.

The new traverse free functions (lines 132–170) and EqualSequenceMethod template (lines 1607–1621) lack concept constraints on their template parameters. Applying concepts would:

  • Clarify expected capabilities (Container must provide traverse(Operation&), Operation must be invocable as bool(Item_Type &))
  • Provide better compiler diagnostics on template instantiation failure
  • Align with the codebase pattern established in ah-ranges.H, ahFunctional.H, and graph-dry.H

This follows the coding guideline: "Use concepts (C++20) to constrain template parameters; avoid SFINAE when possible."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ah-dry.H` around lines 132 - 170, Constrain the two free-function templates
traverse and the EqualSequenceMethod template with C++20 concepts: require that
Container provides a member traverse(Operation&) and that Operation is invocable
with the container's element type as bool(Item_Type&). Specifically, add
requires-clauses (or use a named concept) to the templates for traverse(const
Container&, Operation&), traverse(const Container&, Operation&&) and
EqualSequenceMethod so instantiation fails with clear diagnostics when
Container::traverse(op) is not available or when Operation cannot be called as
bool(Item_Type&); update the noexcept expressions if necessary to reference the
constrained parameter names.
Tests/sort_utils.cc (1)

2283-2307: Remove the redundant ENABLE_PERF_TESTS lookup in perf_large.

Line 2283 already skips when perf tests are disabled, so the second check at line 2300 is unnecessary and can be flattened for readability.

♻️ Proposed cleanup
 TEST(Timsort, perf_large)
 {
-  if (not std::getenv("ENABLE_PERF_TESTS"))
+  const char * perf_enabled = std::getenv("ENABLE_PERF_TESTS");
+  if (not perf_enabled)
     GTEST_SKIP() << "Skipping perf test (set ENABLE_PERF_TESTS to enable)";
@@
-  if (std::getenv("ENABLE_PERF_TESTS"))
-    {
-      long max_ms = 500;
-      if (const char * env_ms = std::getenv("TIMSORT_MAX_MS"))
-        max_ms = std::atol(env_ms);
-
-      EXPECT_LE(duration, max_ms) << "Timsort performance regression detected";
-    }
+  long max_ms = 500;
+  if (const char * env_ms = std::getenv("TIMSORT_MAX_MS"))
+    max_ms = std::atol(env_ms);
+  EXPECT_LE(duration, max_ms) << "Timsort performance regression detected";
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Tests/sort_utils.cc` around lines 2283 - 2307, The second getenv check in the
perf_large test is redundant because the test already returns early when
ENABLE_PERF_TESTS is unset; remove the outer "if
(std::getenv(\"ENABLE_PERF_TESTS\")) { ... }" block (the one that wraps max_ms
and EXPECT_LE) and unindent its body so the max_ms logic and the
EXPECT_LE(duration, max_ms) assertion run unconditionally after computing
duration; keep the TIMSORT_MAX_MS lookup (const char* env_ms =
std::getenv("TIMSORT_MAX_MS")) and the variables N, a, timsort, duration, max_ms
and the EXPECT_LE assertion unchanged otherwise.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@hyperloglog.H`:
- Around line 107-110: Add explicit Doxygen `@pre` tags for the constructor and
merge() method: in the HyperLogLog constructor comment (the block documenting
the constructor that currently describes parameter b and domain_error) add "@pre
4 <= b && b <= 16" (or "@pre 4 <= b <= 16") to document the required range; in
the merge() method comment (the docblock around the merge(const HyperLogLog&
other) function referenced at lines ~181-189) add "@pre b_ == other.b_" to state
the required matching precision. Ensure the new `@pre` lines are placed alongside
the existing `@param`/`@throws` notes in their respective Doxygen comment blocks so
the preconditions are explicit.

In `@tpl_sort_utils.H`:
- Around line 975-977: The insertion_sort noexcept is unsafe because the body
currently uses copy ops but the noexcept checks only move-assignability; change
the algorithm to use move semantics and update the noexcept to reflect move
construction/assignment: inside insertion_sort (function name insertion_sort in
tpl_sort_utils.H) create the temporary via move (e.g., T tmp = std::move(a[i]))
and perform shifts with move-assign (a[j] = std::move(a[j-1])) and place tmp
with move (a[j] = std::move(tmp)); then change the noexcept expression to
require std::is_nothrow_move_constructible_v<T> and
std::is_nothrow_move_assignable_v<T> (keep the cmp noexcept check) so the
declared noexcept matches the operations performed.

---

Outside diff comments:
In `@ah-dry.H`:
- Around line 377-383: The Doxygen block for the rvalue overload of find_ptr is
missing required tags; update the comment for the overload (the one describing
find_ptr(Operation&) const that accepts rvalues) to include an explicit `@brief`
describing the purpose, add the existing `@param` and `@return` tags if not already
present, and add an `@throws` tag documenting any exceptions the function may
propagate (or `@throws none` if it never throws). Apply the same `@brief`/`@throws`
addition pattern to the other overload-doc blocks touched in this patch so all
public member overloads comply with the `@brief`, `@param`, `@return`, and `@throws`
requirement.

---

Duplicate comments:
In `@tpl_sort_utils.H`:
- Around line 6234-6239: The null-pointer contract check for timsort is skipped
for singleton ranges because the early-return "if (l >= r) return;" precedes the
ah_invalid_argument_if(a == nullptr) validation; move the
ah_invalid_argument_if(a == nullptr) check (the null-pointer validation that
emits "timsort(): null pointer contract violation...") to occur before the
range-size early return in the timsort(T*, long, long, ...) entry (the overload
that calls timsort_detail::timsort_impl) so that a == nullptr is validated for
all calls, including l == r cases.

---

Nitpick comments:
In `@ah-dry.H`:
- Around line 132-170: Constrain the two free-function templates traverse and
the EqualSequenceMethod template with C++20 concepts: require that Container
provides a member traverse(Operation&) and that Operation is invocable with the
container's element type as bool(Item_Type&). Specifically, add requires-clauses
(or use a named concept) to the templates for traverse(const Container&,
Operation&), traverse(const Container&, Operation&&) and EqualSequenceMethod so
instantiation fails with clear diagnostics when Container::traverse(op) is not
available or when Operation cannot be called as bool(Item_Type&); update the
noexcept expressions if necessary to reference the constrained parameter names.

In `@hyperloglog.H`:
- Around line 67-69: The template HyperLogLog<T> currently accepts any T but
implicitly requires a murmur3hash(const T&, uint32_t) function; add a C++20
concept (e.g., Hashable<T> or Murmur3Hashable<T>) that expresses the requirement
for a callable murmur3hash(const T&, std::uint32_t) (or equivalent signature
used in your code) and constrain the class template with it (e.g.,
template<typename T> requires Hashable<T> class HyperLogLog). Update the other
templated declarations around the same region (lines referenced 78-83) to use
the same concept so compile-time diagnostics are clear and consistent, and
ensure the concept name and signature match the existing murmur3hash usage in
the file.

In `@Tests/sort_utils.cc`:
- Around line 2283-2307: The second getenv check in the perf_large test is
redundant because the test already returns early when ENABLE_PERF_TESTS is
unset; remove the outer "if (std::getenv(\"ENABLE_PERF_TESTS\")) { ... }" block
(the one that wraps max_ms and EXPECT_LE) and unindent its body so the max_ms
logic and the EXPECT_LE(duration, max_ms) assertion run unconditionally after
computing duration; keep the TIMSORT_MAX_MS lookup (const char* env_ms =
std::getenv("TIMSORT_MAX_MS")) and the variables N, a, timsort, duration, max_ms
and the EXPECT_LE assertion unchanged otherwise.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8c68b67 and 09ad60b.

📒 Files selected for processing (5)
  • Tests/sort_utils.cc
  • Tree_Decomposition.H
  • ah-dry.H
  • hyperloglog.H
  • tpl_sort_utils.H
🚧 Files skipped from review as they are similar to previous changes (1)
  • Tree_Decomposition.H

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 40 out of 40 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (3)

Tree_Decomposition.H:200

  • The standalone scope block starts on the same line as the preceding statement (id_to_node_ = ...; {), which makes the control flow harder to read and is inconsistent with the surrounding brace style. Please move the { ... } to its own line (or remove the extra scope if it’s not needed) to keep formatting clear and consistent.
        id_to_node_ = Array<Node *>::create(n_); {
          MapOLhash<Node *, size_t> tmp(n_);
          node_to_id_.swap(tmp);

htlist.H:1140

  • There is an extra leading space before the Doxygen comment on the new copy constructor, which makes it misaligned with the other iterator comments in this section. Align the comment indentation with the surrounding //!\~English ... lines for consistency.
     //!\~English Copy constructor \~Spanish Constructor de copia
    Iterator(const Iterator &) = default;

Examples/random_select_example.cc:55

  • The helper print_array is indented as if it were inside a block (template line starts with extra whitespace) even though it is at global scope. This looks unintentional and makes the file formatting inconsistent—please left-align it with the other top-level declarations.
  template <typename Container>
  void print_array(const char * label, const Container & a)
  {
    std::cout << std::left << std::setw(15) << label << ": [";
    for (size_t i = 0; i < a.size(); ++i)
      {
        std::cout << a[i];

coderabbitai[bot]
coderabbitai bot previously approved these changes Mar 3, 2026
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tpl_sort_utils.H (1)

1181-1189: ⚠️ Potential issue | 🟠 Major

Must fix: merge() is not stable on equal keys.

On ties, the merge picks from the right partition (else branch), which reorders equal elements across halves and violates the documented stable behavior of merge sort.

💡 Proposed fix
-    for (long k = l; k <= r; ++k)
-      if (cmp(buf[i], buf[j]))
-        a[k] = std::move(buf[i++]);
-      else
-        a[k] = std::move(buf[j--]);
+    for (long k = l; k <= r; ++k)
+      if (cmp(buf[j], buf[i]))
+        a[k] = std::move(buf[j--]);
+      else
+        a[k] = std::move(buf[i++]); // left side wins ties => stable
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tpl_sort_utils.H` around lines 1181 - 1189, The merge loop in merge() is
unstable because on ties it takes from the right; update the selection logic so
equal keys come from the left partition to preserve stability: using the
existing symbols cmp, buf, a, i, j, k, change the condition to choose buf[i]
when elements compare equal (i.e., when both cmp(buf[i], buf[j]) and cmp(buf[j],
buf[i]) are false), otherwise use cmp(buf[i], buf[j]) as before — effectively
prefer left when !cmp(buf[j], buf[i]) so equal keys keep original order.
♻️ Duplicate comments (2)
hyperloglog.H (2)

138-144: ⚠️ Potential issue | 🟡 Minor

Nice to have: make the estimate() thread-safety note fully explicit.

Line 143 omits clear() as a concurrent writer; readers are only safe when no concurrent update(), merge(), or clear() runs.

As per coding guidelines, "Document thread-safety guarantees in class/function comments" and "Document whether algorithms are thread-safe, reentrant, or require external synchronization."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hyperloglog.H` around lines 138 - 144, Update the estimate() comment to
explicitly state its thread-safety guarantees include that concurrent reads are
only safe when there are no concurrent writers calling update(), merge(), or
clear(); specifically modify the comment block above the estimate() method to
add "or clear()" to the list of prohibited concurrent writers (referencing
estimate(), update(), merge(), clear()) so it clearly documents required
external synchronization for readers.

119-125: ⚠️ Potential issue | 🟠 Major

Should fix: complete Doxygen contract tags for void public methods.

update(), merge(), and clear() still miss explicit @return tags, so the public API docs remain inconsistent with the repository’s strict Doxygen contract rule.

Proposed doc-only patch
@@
     /** @brief Add an element to the set.
      *
      *  @param[in] val Element to incorporate into the cardinality sketch.
+     *  @return void
      *  @throws none
      *  @note Complexity: O(1).
      *  @note Not thread-safe: concurrent calls to update() must be externally synchronized.
      */
@@
     /** @brief Merge another HyperLogLog into this one (union of sets).
@@
      *  @param[in] other HyperLogLog with the same precision parameter b.
      *  @pre b_ == other.b_
+     *  @return void
      *  @throws std::domain_error if the precision parameters (b) differ.
      *  @note Complexity: O(m) where m = 2^b.
      *  @note Not thread-safe: concurrent calls must be externally synchronized.
      */
@@
     /** @brief Reset all registers to zero.
      *
+     *  @return void
      *  @throws none
      *  @note Complexity: O(m) where m = 2^b.
      *  @note Not thread-safe: concurrent calls must be externally synchronized.
      */

As per coding guidelines, "All public member functions must have @brief, @param, @return, and @throws Doxygen tags."

Also applies to: 182-206

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hyperloglog.H` around lines 119 - 125, Add explicit Doxygen `@return` tags for
the public void methods update(), merge(), and clear() so the API docs comply
with the repository's Doxygen contract; locate the declarations/definitions of
HyperLogLog::update(), HyperLogLog::merge(const HyperLogLog&), and
HyperLogLog::clear() and add a line like "@return void" (or equivalent phrasing
used across the repo) to each comment block, preserving existing `@brief`, `@param`,
and `@throws` tags and formatting conventions.
🧹 Nitpick comments (1)
tpl_sort_utils.H (1)

5571-5574: Nice to have: deduplicate repeated apply_sort_with_temp lambdas.

These wrappers repeat the same lambda shape for bucket_sort/timsort container overloads. A small helper would reduce duplication and ease future maintenance.

As per coding guidelines: "Refactor duplicated code into reusable functions or templates".

Also applies to: 5631-5634, 6257-6260, 6315-6318, 6337-6340

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tpl_sort_utils.H` around lines 5571 - 5574, The repeated lambdas passed to
bucket_sort_detail::apply_sort_with_temp (which simply forward T* p, size_t n,
const Compare& c to bucket_sort(p,n,c) or tim_sort(p,n,c)) should be refactored
into a small reusable helper — e.g., a template function or factory like
make_forwarding_sort_invoker<SortFn>() that captures a sort function
(bucket_sort or tim_sort) and returns the appropriate callable; replace the
inline lambdas at the shown call site and the other locations (lines around
5631-5634, 6257-6260, 6315-6318, 6337-6340) to pass the helper result to
bucket_sort_detail::apply_sort_with_temp instead of repeating the lambda. This
keeps the call sites concise and centralizes the forwarding logic for
apply_sort_with_temp.
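A sketch of the suggested factory. The names `make_forwarding_sort_invoker` and `demo_sort` are illustrative; the real call sites would capture `bucket_sort` or `timsort` (for template entry points, the captured callable can itself be a generic lambda written once):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>

// Hypothetical helper along the lines the review suggests: capture a sort
// entry point once and reuse the resulting callable at every
// apply_sort_with_temp call site instead of repeating the lambda.
template <typename SortFn>
auto make_forwarding_sort_invoker(SortFn sort_fn)
{
  return [sort_fn] (auto * p, std::size_t n, const auto & cmp)
  {
    sort_fn(p, n, cmp); // forward pointer, length, comparator unchanged
  };
}

// Stand-in for one of the real sorts (bucket_sort / timsort).
inline void demo_sort(int * p, std::size_t n, const std::less<int> & cmp)
{
  std::sort(p, p + n, cmp);
}
```

Each container overload then passes `make_forwarding_sort_invoker(...)` to `apply_sort_with_temp`, keeping the forwarding logic in one place.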
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tpl_sort_utils.H`:
- Around line 5432-5449: Add a `@pre` that states the required monotonic
relationship between bucket_key and cmp for global ordering: e.g.,
"Precondition: for all elements x,y, if cmp(x,y) is true then bucket_key(x) <=
bucket_key(y) (i.e., bucket_key is monotonic with respect to cmp)"; update the
bucket_sort documentation (the comment block for function bucket_sort) to
include this `@pre` and a matching `@post` that says "If the precondition holds, the
output array is globally ordered according to cmp after sorting"; reference the
parameter names bucket_key and cmp in the text so readers can locate the
contract.

---

Outside diff comments:
In `@tpl_sort_utils.H`:
- Around line 1181-1189: The merge loop in merge() is unstable because on ties
it takes from the right; update the selection logic so equal keys come from the
left partition to preserve stability: using the existing symbols cmp, buf, a, i,
j, k, change the condition to choose buf[i] when elements compare equal (i.e.,
when both cmp(buf[i], buf[j]) and cmp(buf[j], buf[i]) are false), otherwise use
cmp(buf[i], buf[j]) as before — effectively prefer left when !cmp(buf[j],
buf[i]) so equal keys keep original order.

---

Duplicate comments:
In `@hyperloglog.H`:
- Around line 138-144: Update the estimate() comment to explicitly state its
thread-safety guarantees include that concurrent reads are only safe when there
are no concurrent writers calling update(), merge(), or clear(); specifically
modify the comment block above the estimate() method to add "or clear()" to the
list of prohibited concurrent writers (referencing estimate(), update(),
merge(), clear()) so it clearly documents required external synchronization for
readers.
- Around line 119-125: Add explicit Doxygen `@return` tags for the public void
methods update(), merge(), and clear() so the API docs comply with the
repository's Doxygen contract; locate the declarations/definitions of
HyperLogLog::update(), HyperLogLog::merge(const HyperLogLog&), and
HyperLogLog::clear() and add a line like "@return void" (or equivalent phrasing
used across the repo) to each comment block, preserving existing `@brief`, `@param`,
and `@throws` tags and formatting conventions.

---

Nitpick comments:
In `@tpl_sort_utils.H`:
- Around line 5571-5574: The repeated lambdas passed to
bucket_sort_detail::apply_sort_with_temp (which simply forward T* p, size_t n,
const Compare& c to bucket_sort(p,n,c) or tim_sort(p,n,c)) should be refactored
into a small reusable helper — e.g., a template function or factory like
make_forwarding_sort_invoker<SortFn>() that captures a sort function
(bucket_sort or tim_sort) and returns the appropriate callable; replace the
inline lambdas at the shown call site and the other locations (lines around
5631-5634, 6257-6260, 6315-6318, 6337-6340) to pass the helper result to
bucket_sort_detail::apply_sort_with_temp instead of repeating the lambda. This
keeps the call sites concise and centralizes the forwarding logic for
apply_sort_with_temp.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 09ad60b and 481d330.

📒 Files selected for processing (2)
  • hyperloglog.H
  • tpl_sort_utils.H

coderabbitai[bot]
coderabbitai bot previously approved these changes Mar 3, 2026
Refactor bucket_sort and timsort to use sort_invoker_factory.
Introduce new concepts for HyperLogLog and traversable containers.
Adjust existing Timsort tests for the updated implementation.
Copilot AI review requested due to automatic review settings March 3, 2026 03:20
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tpl_sort_utils.H (1)

4972-4973: ⚠️ Potential issue | 🟠 Major

Must fix: replace std::array buckets with Aleph containers in radix-list implementations.

These new bucket buffers use std::array, which conflicts with repository policy for container selection in this codebase.

Proposed fix
-      std::array<HTList, 256> buckets;
+      Array<HTList> buckets = Array<HTList>::create(256);
@@
-          for (auto & bucket: buckets)
-            list.HTList::append(bucket);
+          for (size_t b = 0; b < 256; ++b)
+            list.HTList::append(buckets[b]);
-      std::array<Dnode<IntT>, 256> buckets;
+      Array<Dnode<IntT>> buckets = Array<Dnode<IntT>>::create(256);
@@
-          for (auto & bucket: buckets)
-            list.template Dnode<IntT>::append_list(&bucket);
+          for (size_t b = 0; b < 256; ++b)
+            list.template Dnode<IntT>::append_list(&buckets[b]);

As per coding guidelines: "MUST use Aleph containers when an equivalent exists. Standard library containers are ONLY permitted when no Aleph equivalent is available."

Also applies to: 5011-5011

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tpl_sort_utils.H` around lines 4972 - 4973, Replace the std::array bucket
buffers in the radix-list code with the Aleph container equivalent: change the
declaration std::array<HTList, 256> buckets (and the duplicate at the other
occurrence) to use the Aleph array/fixed-array type used in this codebase (e.g.,
aleph::array<HTList, 256> buckets), and update any necessary includes or
namespace qualifiers so the HTList buckets are constructed and used exactly as
before but using the Aleph container instead of std::array.
ah-dry.H (1)

48-50: ⚠️ Potential issue | 🟠 Major

Add explicit <concepts> include for std::convertible_to.

The concepts at lines 62 and 74 use std::convertible_to, which is defined in <concepts>. While this header is currently included transitively via ahFunctional.H, explicit inclusion is required to avoid brittle build-order dependencies.

Patch
 # include "ahFunctional.H"
+# include <concepts>
 # include <utility>
 # include <vector>
 # include <ah-errors.H>

Also applies to: 54-63, 74-75

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ah-dry.H` around lines 48 - 50, Add an explicit include for the C++ concepts
header so uses of std::convertible_to in ah-dry.H are guaranteed to compile;
update the include block that currently has <utility>, <vector>, and
<ah-errors.H> to also include <concepts> so the concept expressions (those using
std::convertible_to at the concepts around the declarations at lines ~62 and
~74) resolve without relying on transitive includes from ahFunctional.H.
♻️ Duplicate comments (1)
tpl_sort_utils.H (1)

6264-6269: ⚠️ Potential issue | 🟡 Minor

Should fix: null-pointer contract is too strict for empty ranges in timsort(T*, long, long, ...).

This still throws on a == nullptr even when l > r (empty range). That contradicts the current early-return semantics and the “non-empty range” intent.

Proposed fix
-    ah_invalid_argument_if(a == nullptr)
+    ah_invalid_argument_if(a == nullptr and l <= r)
       << "timsort(): null pointer contract violation for range ["
       << l << ", " << r << "]; cannot call timsort_detail::timsort_impl()";

As per coding guidelines: "Validate external inputs before processing" and "Check boundary conditions and off-by-one errors".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tpl_sort_utils.H` around lines 6264 - 6269, The null-pointer check using
ah_invalid_argument_if(a == nullptr) is too strict for empty ranges in
timsort(T* a, long l, long r, ...); change the logic so empty ranges are
allowed: either move the early-return (if (l >= r) return;) before the
null-pointer assertion, or tighten the assertion to only trigger for non-empty
ranges (e.g., assert a == nullptr only when l < r); update the check near
timsort and timsort_detail::timsort_impl to reference parameters a, l, r
accordingly.
🧹 Nitpick comments (3)
hyperloglog.H (1)

181-181: Nice to have: replace hard-coded 2^64 literal with computed constant.

Using std::ldexp(1.0, 64) avoids a long magic literal and is easier to audit.

Optional refactor
-          static constexpr double two_to_64 = 18446744073709551616.0;
+          static constexpr double two_to_64 = std::ldexp(1.0, 64);

As per coding guidelines: "Avoid magic numbers; use named constants or enums".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hyperloglog.H` at line 181, The constant two_to_64 is defined with a
hard-coded 2^64 literal; replace its initializer with a computed value using
std::ldexp(1.0, 64) in the static constexpr double two_to_64 declaration, and
add the needed include for <cmath> (or <math.h>) so ldexp is available; keep the
name and constexpr qualifier unchanged to preserve semantics in any code using
two_to_64.
Tests/sort_utils.cc (1)

1918-1964: Nice to have: add one seeded property-style randomized loop for random_select.

The crafted cases are good. Adding a deterministic randomized cross-check (compare random_select against sorted-reference rank over many generated arrays) would improve regression sensitivity for partition edge interactions.

Based on learnings: “Use property-based testing or randomized tests for complex algorithms”.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Tests/sort_utils.cc` around lines 1918 - 1964, Add a deterministic
property-style randomized test that seeds the RNG and repeatedly generates
random arrays and indices, then validates Aleph::random_select against a
sorted-reference rank; implement a new TEST in the SortUtilsRandomSelect suite
(e.g., RandomizedSeededProperty) that creates a std::mt19937 with a fixed seed,
for many iterations builds random DynArray<int> values, computes expected result
by sorting or using std::nth_element on a copy, calls
Aleph::random_select(&copy(0), static_cast<long>(k),
static_cast<long>(copy.size()), std::less<int>()), and uses EXPECT_EQ to compare
to the reference value to catch partition-edge regressions.
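The shape of such a check, with `std::nth_element` standing in for `Aleph::random_select` (the real test would call the Aleph routine on a `DynArray` inside a GoogleTest case):

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Deterministic property-style check in the shape the review describes.
// std::nth_element stands in for Aleph::random_select: for every generated
// array, the k-th order statistic must equal the sorted reference at rank k.
bool selection_matches_sorted_rank(unsigned seed, int iterations)
{
  std::mt19937 rng(seed); // fixed seed => reproducible failures
  std::uniform_int_distribution<int> value(-1000, 1000);
  std::uniform_int_distribution<std::size_t> length(1, 64);

  for (int it = 0; it < iterations; ++it)
    {
      std::vector<int> v(length(rng));
      for (int & x : v)
        x = value(rng);

      const std::size_t k =
        std::uniform_int_distribution<std::size_t>(0, v.size() - 1)(rng);

      std::vector<int> by_select = v, by_sort = v;
      std::nth_element(by_select.begin(),
                       by_select.begin() + static_cast<long>(k),
                       by_select.end());
      std::sort(by_sort.begin(), by_sort.end());

      if (by_select[k] != by_sort[k])
        return false; // would indicate a selection regression
    }
  return true;
}
```

A fixed seed keeps failures reproducible; logging the seed and iteration on mismatch makes a regression easy to replay.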
ah-dry.H (1)

1658-1659: Constrain EqualSequenceMethod with concepts for clearer diagnostics.

Given this file already introduces AlephTraversable, adding constraints here would make contract violations fail earlier and more clearly.

Refactor sketch
-template <class Container>
+template <class Container>
+  requires AlephTraversable<Container> and
+           requires (const Container & c)
+           {
+             { c.size() } -> std::convertible_to<size_t>;
+             c.get_it();
+           }
 class EqualSequenceMethod

As per coding guidelines: “Use concepts to constrain template parameters”.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ah-dry.H` around lines 1658 - 1659, The template class EqualSequenceMethod
should be constrained with the AlephTraversable concept so misuse yields clearer
diagnostics; change its declaration to require AlephTraversable (e.g. use either
“template <class Container> requires AlephTraversable<Container> class
EqualSequenceMethod” or the abbreviated form “template <AlephTraversable
Container> class EqualSequenceMethod”) and update any forward
declarations/usages accordingly, ensuring the AlephTraversable concept is
visible (include or import the header/namespace where AlephTraversable is
defined).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ah-dry.H`:
- Around line 406-415: Update the Doxygen for the rvalue overload of find_ptr()
(and the other two similar overloads referenced) to not claim "@throws none";
instead state that the function may propagate exceptions thrown by the provided
Operation and provide the same exception guarantees as the corresponding
lvalue-reference overload (e.g., wording like "May throw the same exceptions as
the corresponding lvalue overload; no additional exception-safety guarantees are
provided"). Locate the rvalue overload docs for find_ptr() (the comment block
describing the rvalue wrapper) and the two other comment blocks at the indicated
spots and replace "@throws none" with this accurate exception-safety wording.
- Around line 132-136: The const overload of traverse on GenericTraverse
currently breaks const-correctness by const_casting and calling the mutable
traverse; instead remove the const_cast and provide a const-safe overload that
constrains the Operation to accept a const reference to the element type (e.g.
require TraverseOperationConst<Operation, typename Container::Item_Type const&>
or a similar trait), call a new const-qualified helper that passes const
Item_Type& to the callback, and ensure noexcept uses
traverse_is_noexcept<Operation>() appropriately; do not forward to the non-const
traverse and document the behaviour if you prefer the alternative of explicitly
allowing mutation.

In `@hyperloglog.H`:
- Around lines 44-48: The code uses std::convertible_to in the MurmurHashable
concept's requires-clause but does not include <concepts>, relying on transitive
includes; add an explicit `#include <concepts>` near the other standard includes
(alongside <cmath>, <algorithm>, <bit>, <cstdint>) so the MurmurHashable concept
(and any other use of std::convertible_to) is well-defined and portable across
toolchains.

---

Outside diff comments:
In `@ah-dry.H`:
- Around line 48-50: Add an explicit include for the C++ concepts header so uses
of std::convertible_to in ah-dry.H are guaranteed to compile; update the include
block that currently has <utility>, <vector>, and <ah-errors.H> to also include
<concepts> so the concept expressions (those using std::convertible_to at the
concepts around the declarations at lines ~62 and ~74) resolve without relying
on transitive includes from ahFunctional.H.

In `@tpl_sort_utils.H`:
- Around line 4972-4973: Replace the std::array bucket buffers in the radix-list
code with the Aleph container equivalent: change the declaration
std::array<HTList, 256> buckets (and the duplicate at the other occurrence) to
use the Aleph array/fixed-array type used in this codebase (e.g.,
aleph::array<HTList, 256> buckets), and update any necessary includes or
namespace qualifiers so the HTList buckets are constructed and used exactly as
before but using the Aleph container instead of std::array.

---

Duplicate comments:
In `@tpl_sort_utils.H`:
- Around line 6264-6269: The null-pointer check using ah_invalid_argument_if(a
== nullptr) is too strict for empty ranges in timsort(T* a, long l, long r,
...); change the logic so empty ranges are allowed: either move the early-return
(if (l >= r) return;) before the null-pointer assertion, or tighten the
assertion to only trigger for non-empty ranges (e.g., assert a == nullptr only
when l < r); update the check near timsort and timsort_detail::timsort_impl to
reference parameters a, l, r accordingly.

---

Nitpick comments:

In `@hyperloglog.H`:
- Line 181: The constant two_to_64 is defined with a hard-coded 2^64 literal;
replace its initializer with a computed value using std::ldexp(1.0, 64) in the
static constexpr double two_to_64 declaration, and add the needed include for
<cmath> (or <math.h>) so ldexp is available; keep the name and constexpr
qualifier unchanged to preserve semantics in any code using two_to_64.


ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 481d330 and def85a3.

📒 Files selected for processing (4)
  • Tests/sort_utils.cc
  • ah-dry.H
  • hyperloglog.H
  • tpl_sort_utils.H

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 40 out of 40 changed files in this pull request and generated 6 comments.

coderabbitai[bot]
coderabbitai bot previously approved these changes Mar 3, 2026
Refactor `SimHash` to use two 32-bit hashes for 64-bit fingerprint on 32-bit platforms.
Update `HyperLogLog` to include <concepts> header.
Refactor `radix_sort_impl` in `tpl_sort_utils.H` to use `Array` instead of `std::array` for buckets to avoid default constructible requirement.
Add `TEST(SortUtilsRandomSelect, RandomizedSeededProperty)` to `sort_utils.cc`.
Refactor HyperLogLog methods for clarity.
Standardize GenericItems aliases and equality checks.
Copilot AI review requested due to automatic review settings March 3, 2026 14:47
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 41 out of 41 changed files in this pull request and generated 4 comments.

coderabbitai[bot]
coderabbitai bot previously approved these changes Mar 3, 2026
Refactor HyperLogLog to remove MurmurHashable concept and directly use hash64.
Update simhash.H to use dereferenced iterator for update calls.