Skip to content

CNDB-16051: Include pq vectors in CompactionGraph bytesUsed calculation#2144

Merged
michaeljmarshall merged 4 commits intomainfrom
cndb-16051
Feb 10, 2026
Merged

CNDB-16051: Include pq vectors in CompactionGraph bytesUsed calculation#2144
michaeljmarshall merged 4 commits intomainfrom
cndb-16051

Conversation

@michaeljmarshall
Copy link
Member

@michaeljmarshall michaeljmarshall commented Dec 1, 2025

What is the issue

Fixes: https://github.com/riptano/cndb/issues/16051
CNDB PR: https://github.com/riptano/cndb/pull/16675

What does this PR fix and why was it fixed

The SAI segment builder logic keeps track of bytes used by a segment builder to ensure proper flushing to prevent OOM. This change tracks the PQ and BQ vector byte utilization per insertion.

@github-actions
Copy link

github-actions bot commented Dec 1, 2025

Checklist before you submit for review

  • This PR adheres to the Definition of Done
  • Make sure there is a PR in the CNDB project updating the Converged Cassandra version
  • Use NoSpamLogger for log lines that may appear frequently in the logs
  • Verify test results on Butler
  • Test coverage for new/modified code is > 80%
  • Proper code formatting
  • Proper title for each commit staring with the project-issue number, like CNDB-1234
  • Each commit has a meaningful description
  • Each commit is not very long and contains related changes
  • Renames, moves and reformatting are in distinct commits
  • All new files should contain the DataStax copyright header instead of the Apache License one

@sonarqubecloud
Copy link

sonarqubecloud bot commented Dec 1, 2025

Comment on lines +226 to +228
// Set force PQ training size to ensure we hit the refine code path and apply it to half the vectors.
// TODO this test fails as of this commit due to recall issues. Will investigate further.
CompactionGraph.PQ_TRAINING_SIZE = baseVectors.size() / 2;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem here seems likely to be with GraphIndexBuilder.rescore in the CompactionGraph. The failure is unrelated to this PR. I am looking into possible fixes. Generally speaking, I think we can entirely remove the rescore logic, which duplicates work unnecessarily. We could instead accumulate N vectors, refine a PQ, encode the vectors, and then insert them into a graph builder properly built using these PQ values. I have a local version of this that passes, but is a bit hacky.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain how is this added test here related to the memory consumption tracking modification in the code under test? I can see we're likely invoking vector compression here, but we're not testing the memory consumption counters. Or am I missing sth?

Copy link
Member Author

@michaeljmarshall michaeljmarshall Feb 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By reducing the PQ_TRAINING_SIZE, we hit code that was previously uncovered and then we asserted on the recall for the graph built by that code and the test didn't pass. The failure is unrelated to my change.

@michaeljmarshall michaeljmarshall self-assigned this Feb 5, 2026
Copy link

@pkolaczk pkolaczk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code change looks good.
The failing test - let's move it to a separate issue.
The code change is simple enough that we can live with it not covered for a bit longer (it wasn't covered anyway before).

for (int topK : List.of(1, 100))
{
var recall = testRecall(topK, queryVectors, groundTruth);
assertTrue("Post-compaction recall is " + recall, recall > postCompactionRecall);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is failing, maybe relax this condition for a while, instead of removing the whole test.
This way you still invoke the memory tracking code modification added in this PR.
BTW: is there a way to assert for the memory counting somehow?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is failing, maybe relax this condition for a while, instead of removing the whole test.

I removed the assertion entirely since it isn't valuable assertion > 0.

BTW: is there a way to assert for the memory counting somehow?

We do have ways of doing that in the project, but I'm not sure if it is worth adding just yet. The current accounting is not exact (see #2002). The main point of this PR is to make sure we track one of the largest contributors to heap memory consumption.

As a follow up, I create datastax/jvector#614 to see if jvector can return the bytes added as a result of the invocation.

@cassci-bot
Copy link

❌ Build ds-cassandra-pr-gate/PR-2144 rejected by Butler


5 regressions found
See build details here


Found 5 new test failures

Test Explanation Runs Upstream
o.a.c.distributed.test.AbortedQueryLoggerTest.testLogsReadMetrics NEW 🔵🔴 0 / 22
o.a.c.index.sai.cql.VectorCompaction100dTest.testOneToManyCompaction[dc false] NEW 🔴 0 / 22
o.a.c.index.sai.cql.VectorSiftSmallTest.testSiftSmall[db false] NEW 🔴 0 / 22
o.a.c.metrics.TrieMemtableMetricsTest.testContentionMetrics (compression) NEW 🔴🔵 2 / 22
o.a.c.net.ProxyHandlerConnectionsTest.suddenDisconnect (compression) NEW 🔵🔴 0 / 22

Found 2 known test failures

@sonarqubecloud
Copy link

@michaeljmarshall
Copy link
Member Author

Test failures are all compression related time outs. Code coverage is sufficiently high, merging.

@michaeljmarshall michaeljmarshall merged commit 7a0ee2a into main Feb 10, 2026
2 of 4 checks passed
@michaeljmarshall michaeljmarshall deleted the cndb-16051 branch February 10, 2026 19:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants