[CASSANDRA-21083][trunk] Optimize memtable flush logic #4536
Conversation
…location for the flushing thread
…st clustering key in a partition
…per (avoid megamorphic call + cache isCounterColumn)
… auto-boxing for logging of primitive parameters
…tes.hashCode if not needed, avoid capturing lambda allocation in UnfilteredSerializer.serializeRowBody
…nd ShardedSkipListMemtable
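Two of the commit subjects above mention avoiding auto-boxing when logging primitive parameters and avoiding a capturing-lambda allocation in UnfilteredSerializer.serializeRowBody. A minimal, self-contained sketch of those two allocation patterns, assuming an SLF4J-style logger; the class and method names are illustrative, not the PR's actual code:

```java
import java.util.function.LongSupplier;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative class, not the PR's code.
public final class AllocationPatterns
{
    private static final Logger logger = LoggerFactory.getLogger(AllocationPatterns.class);

    // Auto-boxing: a primitive passed to an Object-taking logging method is boxed on
    // every call, even when the level is disabled.
    static void logFlushedBytes(long bytes)
    {
        // logger.trace("flushed {} bytes", bytes);   // boxes 'bytes' unconditionally
        if (logger.isTraceEnabled())                  // guard avoids the box when TRACE is off
            logger.trace("flushed {} bytes", bytes);
    }

    // Capturing lambda: closing over a local or parameter allocates a new lambda
    // instance per call; computing the value directly (or using a non-capturing
    // lambda) does not.
    static long capturing(long base)
    {
        LongSupplier s = () -> base + 1;  // captures 'base' -> fresh object each call
        return s.getAsLong();
    }

    static long nonCapturing(long base)
    {
        return base + 1;                  // no lambda object on the hot path
    }
}
```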
src/java/org/apache/cassandra/io/sstable/metadata/MetadataCollector.java
addRow(key, (Row) unfiltered);
addRow(key, (Row) unfiltered, isRowFirstOrLast);
else
addRangeTomstoneMarker((RangeTombstoneMarker) unfiltered);
This optimization can also apply to range tombstones.
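Reading this diff together with the "…st clustering key in a partition" commit subject, the optimization appears to rely on rows reaching the writer in clustering order during a flush, so only the first and the last row of a partition can move the min/max clustering bounds. A minimal, self-contained sketch of that idea; the class and method names are hypothetical, not the PR's actual MetadataCollector change:

```java
// Hypothetical tracker, not Cassandra's MetadataCollector: during a flush, rows are
// written in clustering order, so only the first and last row of a partition can
// change the min/max clustering bounds; all other rows can skip the update.
final class ClusteringBoundsTracker<C extends Comparable<C>>
{
    private C min;
    private C max;

    void addRow(C clustering, boolean isRowFirstOrLast)
    {
        if (!isRowFirstOrLast)
            return; // intermediate rows cannot move the bounds

        if (min == null || clustering.compareTo(min) < 0)
            min = clustering;
        if (max == null || clustering.compareTo(max) > 0)
            max = clustering;
    }
}
```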
ValueAccessor<T> accessor;
// to avoid megamorphic calls we split call sites
// we have ArrayCell, BufferCell and NativeCell and all of them can be here in different scenarios
if (cell.getClass() == NativeCell.class)
I'm curious if you can quantify the effect of this?
(If you don't already have a way to do it, the easiest is probably to run ReadSmallPartitionsBench with flush=YES and get the flush time log line.)
Yes, I have interim results and profiles captured for each commit in this PR.
before this commit:
INFO [PerDiskMemtableFlushWriter_0:20] 2025-12-18T13:29:20,419 Flushing.java:196 - Completed flushing /u02/dmko_cassandra/data/stress/stress_table-1b255f4def2540a60000000000000005/pa-46-big-Data.db (226.600MiB) for commitlog position CommitLogPosition(segmentId=1766064398832, position=234105), time spent: 5540 ms, bytes flushed: 237607365/(47521473 per sec), partitions flushed: 258735/(51747 per sec), rows: 177129833/(35425966 per sec), cpu time: 3930 ms, allocated: 764.749MiB
after this commit:
INFO [PerDiskMemtableFlushWriter_0:11] 2025-12-18T14:32:46,404 Flushing.java:196 - Completed flushing /u02/dmko_cassandra/data/stress/stress_table-1b255f4def2540a60000000000000005/pa-36-big-Data.db (227.192MiB) for commitlog position CommitLogPosition(segmentId=1766068252348, position=233003), time spent: 4959 ms, bytes flushed: 238228477/(59557119 per sec), partitions flushed: 259283/(64820 per sec), rows: 177545115/(44386278 per sec), cpu time: 3485 ms, allocated: 766.370MiB
So, the delta for CPU time is quite visible: 3930 ms vs 3485 ms.
Note: please ignore the rows stats in the log output for these interim results; in those old versions there was a bug there. As a workaround, the partitions stats can be used - the test has 10 rows per partition.
flush_2_cell_stats.log
flush_1_stat_stats.log
flush_2_cell_cpu.html
flush_1_stat_cpu.html
In my test I have 5 value text columns. So, for a write rate of 2,000,000 rows per second (which I am close to now), we have 10 million cells and 8 calls per cell object in the serialization logic -> 80 million calls per second for flushing alone; at this level even a small overhead matters.
This change also slightly improves serialization for Mutation, so the overall win is about 1-1.1% of CPU in total, based on the async-profiler CPU profile.
Obviously, for tables with just a single value column the result will not be that impressive, but in my experience tables with 10 or even more than 100 cells (UDTs/collections) per row are not rare.
Notes:
- I prefer an e2e test for validating this kind of overhead because of JIT profile pollution: when serialization and/or compaction are not executed within the same JVM run, we may get overly optimistic results, because the related logic was not invoked for the different cell classes in a microbenchmark and the JVM can do monomorphic inlining. The cell serialization logic is invoked for the commit log, server-to-server messaging, flushing, reads and compaction, so in the worst case all 3 Cell types are used at the same call site and we get megamorphic calls (async-profiler shows them as a vtable stub, see https://github.com/async-profiler/async-profiler/blob/master/docs/AdvancedStacktraceFeatures.md#display-actual-implementation-in-vtable). In this logic we invoke cell methods several times per cell, so the cost of the type check I introduced is amortised.
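A minimal, self-contained sketch of the call-site splitting described above; the interface and class names are illustrative, not Cassandra's actual Cell hierarchy:

```java
// Illustrative types, not Cassandra's Cell hierarchy.
interface Value { int size(); }
final class ArrayValue  implements Value { public int size() { return 1; } }
final class BufferValue implements Value { public int size() { return 2; } }
final class NativeValue implements Value { public int size() { return 3; } }

final class Dispatch
{
    // If all three implementations reach this single call site, the JIT sees a
    // megamorphic call (the vtable stub in async-profiler output) and cannot inline it.
    static int megamorphic(Value v)
    {
        return v.size();
    }

    // Splitting the call site with exact-class checks turns each branch into a
    // monomorphic, inlinable call; since several methods are invoked per cell,
    // the cost of the check is amortised.
    static int split(Value v)
    {
        if (v.getClass() == NativeValue.class)
            return ((NativeValue) v).size();
        if (v.getClass() == BufferValue.class)
            return ((BufferValue) v).size();
        return v.size(); // ArrayValue and anything else
    }
}
```

With the split, each branch has a single receiver type, so the JIT can inline the call instead of going through a vtable stub.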
public V get(int i);

default void writeValue(AbstractType<?> type, int i, DataOutputPlus out) throws IOException
The skipping of empty/null values is quite surprising given that we not only want them, but we also make a distinction between the two kinds. Could you rename the two methods in a way that states this is being done? (E.g. add SkipNullAndEmpty to the name)
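A hypothetical sketch of the rename suggested above; the interface, method names and signature are illustrative, not the actual ValueAccessor API:

```java
import java.io.DataOutput;
import java.io.IOException;

// Illustrative interface, not the actual ValueAccessor API.
interface ValueWriter
{
    boolean isNullOrEmpty(int i);

    void writeRaw(int i, DataOutput out) throws IOException;

    // The name states at every call site that null and empty values are not written.
    default void writeValueSkipNullAndEmpty(int i, DataOutput out) throws IOException
    {
        if (isNullOrEmpty(i))
            return;
        writeRaw(i, out);
    }
}
```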
flushSet.columns(),
flushSet.encodingStats()),
flushSet.encodingStats(),
flushSet.columnsChangedAfterCreation()),
As we are using the flush set's columns, I don't think the metadata's definition (and the question whether it's changed) is relevant at all. Is it really possible to serialize with a different set than the one the rows are already using?
public UnfilteredRowIterator flushingIterator(TableMetadata tableMetadata)
{
    return unfilteredIterator(ColumnFilter.all(columns()), Slices.ALL, false);
Why do we do selection in the method above if columns() must contain all the iterator has? Can we change the method above so every non-restricted unfilteredIterator call also benefits?
long timestamp;
int ttl;
long localDeletionTime;
if (cell.getClass() == ArrayCell.class)
Aren't we more interested in NativeCell.class? I.e. shouldn't the checks here and in serialize be consistent?