Conversation

@netudima (Contributor)

No description provided.

@netudima force-pushed the CASSANDRA-21083-trunk branch from 7fa2bf3 to 716d6be on December 20, 2025 at 18:22.
addRow(key, (Row) unfiltered);
addRow(key, (Row) unfiltered, isRowFirstOrLast);
else
addRangeTombstoneMarker((RangeTombstoneMarker) unfiltered);

Contributor:

This optimization can also apply to range tombstones.

ValueAccessor<T> accessor;
// to avoid megamorphic calls we split call sites
// we have ArrayCell, BufferCell and NativeCell and all of them can be here in different scenarios
if (cell.getClass() == NativeCell.class)

Contributor:

I'm curious if you can quantify the effect of this?

(If you don't already have a way to do it, the easiest is probably to run ReadSmallPartitionsBench with flush=YES and get the flush time log line.)

Contributor Author @netudima, Jan 9, 2026:

Yes, I have interim results and profiles captured for each commit in this PR.

before this commit:

INFO  [PerDiskMemtableFlushWriter_0:20] 2025-12-18T13:29:20,419 Flushing.java:196 - Completed flushing /u02/dmko_cassandra/data/stress/stress_table-1b255f4def2540a60000000000000005/pa-46-big-Data.db (226.600MiB) for commitlog position CommitLogPosition(segmentId=1766064398832, position=234105), time spent: 5540 ms, bytes flushed: 237607365/(47521473 per sec), partitions flushed: 258735/(51747 per sec), rows: 177129833/(35425966 per sec), cpu time: 3930 ms, allocated: 764.749MiB

after this commit:

INFO  [PerDiskMemtableFlushWriter_0:11] 2025-12-18T14:32:46,404 Flushing.java:196 - Completed flushing /u02/dmko_cassandra/data/stress/stress_table-1b255f4def2540a60000000000000005/pa-36-big-Data.db (227.192MiB) for commitlog position CommitLogPosition(segmentId=1766068252348, position=233003), time spent: 4959 ms, bytes flushed: 238228477/(59557119 per sec), partitions flushed: 259283/(64820 per sec), rows: 177545115/(44386278 per sec), cpu time: 3485 ms, allocated: 766.370MiB

So the delta for CPU time is quite visible: 3930 ms vs 3485 ms.
Note: please ignore the rows stats in the log output for these interim results; those old versions had a bug there. As a workaround, the partitions stats can be used instead (the test has 10 rows per partition).

flush_2_cell_stats.log
flush_1_stat_stats.log

flush_2_cell_cpu.html
flush_1_stat_cpu.html

In my test I have 5 text value columns. So, at a write rate of 2,000,000 rows per second (which I am close to now), we have 10 million cells per second and 8 calls per cell object in the serialization logic, i.e. 80 million calls per second for flushing alone; at this level even a small overhead matters.

This change also slightly improves serialization for Mutation, so the overall win is about 1-1.1% of total CPU, based on the async-profiler CPU profile.

Obviously, for tables with just a single value column the result will not be that impressive, but in my experience tables with 10 or even more than 100 cells (UDTs/collections) per row are not rare.

Notes:

  1. I prefer an e2e test for validating this kind of overhead because of JIT profile pollution: when serialization and/or compaction are not executed within the same JVM run, we may get too optimistic results, because the related logic was never invoked for the different cell classes in a microbenchmark and the JVM can do monomorphic inlining. The cell serialization logic is invoked for the commit log, server-to-server messaging, flushing, reads and compaction, so in the worst case we may have all 3 Cell types used at the same call site and we get megamorphic calls (async-profiler shows them as vtable stubs: https://github.com/async-profiler/async-profiler/blob/master/docs/AdvancedStacktraceFeatures.md#display-actual-implementation-in-vtable).
  2. In this logic we invoke cell methods several times per cell, so the cost of the type check I introduced is amortised (see the sketch after these notes).
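
To make the idea concrete, here is a minimal self-contained sketch of the call-site splitting (the class names are hypothetical, not the actual Cassandra Cell hierarchy): an exact getClass() check keeps each branch monomorphic, so the JIT can devirtualize and inline instead of dispatching through a vtable stub once three receiver types are seen at a single call site.

import java.nio.ByteBuffer;

public class CallSiteSplitSketch
{
    interface Value { int size(); }

    static final class ArrayValue implements Value
    {
        final byte[] bytes;
        ArrayValue(byte[] bytes) { this.bytes = bytes; }
        public int size() { return bytes.length; }
    }

    static final class BufferValue implements Value
    {
        final ByteBuffer buffer;
        BufferValue(ByteBuffer buffer) { this.buffer = buffer; }
        public int size() { return buffer.remaining(); }
    }

    static final class NativeValue implements Value
    {
        final int offHeapSize;
        NativeValue(int offHeapSize) { this.offHeapSize = offHeapSize; }
        public int size() { return offHeapSize; }
    }

    // Single virtual call site: once all three receiver classes have been seen
    // here, the JIT stops inlining and dispatches through a vtable stub.
    static long sizeMegamorphic(Value v)
    {
        return v.size();
    }

    // Split call sites: each branch sees exactly one concrete class, so size()
    // can be devirtualized and inlined; the exact-class check is cheap and its
    // cost is amortised over the several calls made per cell.
    static long sizeSplit(Value v)
    {
        if (v.getClass() == NativeValue.class)
            return ((NativeValue) v).size();
        else if (v.getClass() == BufferValue.class)
            return ((BufferValue) v).size();
        else
            return ((ArrayValue) v).size();
    }

    public static void main(String[] args)
    {
        Value[] values = { new ArrayValue(new byte[8]), new BufferValue(ByteBuffer.allocate(8)), new NativeValue(8) };
        long total = 0;
        for (int i = 0; i < 1_000_000; i++)
            total += sizeSplit(values[i % values.length]);
        System.out.println(total);
    }
}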

public V get(int i);


default void writeValue(AbstractType<?> type, int i, DataOutputPlus out) throws IOException

Contributor:

The skipping of empty/null values is quite surprising given that we not only want them, but we also make a distinction between the two kinds. Could you rename the two methods in a way that states this is being done? (E.g. add SkipNullAndEmpty to the name)
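
As a purely hypothetical illustration of the suggested rename (the ExampleAccessor interface and its methods below are assumed for this sketch, not the actual ValueAccessor API), a SkipNullAndEmpty suffix states the behaviour up front at the call site:

import java.io.DataOutputStream;
import java.io.IOException;

// hypothetical accessor, not Cassandra's ValueAccessor
interface ExampleAccessor<V>
{
    V get(int i);
    boolean isEmpty(V value);
    void write(V value, DataOutputStream out) throws IOException;

    // the SkipNullAndEmpty suffix makes it explicit that both absent (null)
    // and present-but-empty values are silently dropped by this method
    default void writeValueSkipNullAndEmpty(int i, DataOutputStream out) throws IOException
    {
        V value = get(i);
        if (value == null || isEmpty(value))
            return;
        write(value, out);
    }
}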

flushSet.columns(),
flushSet.encodingStats()),
flushSet.encodingStats(),
flushSet.columnsChangedAfterCreation()),

Contributor:

As we are using the flush set's columns, I don't think the metadata's definition (and the question whether it's changed) is relevant at all. Is it really possible to serialize with a different set than the one the rows are already using?


public UnfilteredRowIterator flushingIterator(TableMetadata tableMetadata)
{
return unfilteredIterator(ColumnFilter.all(columns()), Slices.ALL, false);

Contributor:

Why do we do selection in the method above if columns() must contain all the iterator has? Can we change the method above so that every non-restricted unfilteredIterator call also benefits?

long timestamp;
int ttl;
long localDeletionTime;
if (cell.getClass() == ArrayCell.class)

Contributor:

Aren't we more interested in NativeCell.class? I.e. shouldn't the checks here and in serialize be consistent?
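
As a hedged sketch of what consistent checks could look like (hypothetical types, not the PR's code), both paths would test the concrete classes in the same order, with the class that is hot during a flush on the first branch:

public class ConsistentCheckSketch
{
    interface Cell { int payload(); }
    static final class ArrayCell implements Cell { public int payload() { return 1; } }
    static final class BufferCell implements Cell { public int payload() { return 2; } }
    static final class NativeCell implements Cell { public int payload() { return 3; } }

    // serialize path: NativeCell is tested first, since it dominates during a flush
    static void serialize(Cell c, StringBuilder out)
    {
        if (c.getClass() == NativeCell.class)
            out.append(((NativeCell) c).payload());
        else if (c.getClass() == BufferCell.class)
            out.append(((BufferCell) c).payload());
        else
            out.append(((ArrayCell) c).payload());
    }

    // serializedSize path: same classes checked in the same order, so the two
    // call sites cannot silently drift apart and the hot class stays first
    static int serializedSize(Cell c)
    {
        if (c.getClass() == NativeCell.class)
            return ((NativeCell) c).payload();
        else if (c.getClass() == BufferCell.class)
            return ((BufferCell) c).payload();
        else
            return ((ArrayCell) c).payload();
    }
}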
