[CASSANDRA-21083][trunk] Optimize memtable flush logic #4536
Conversation
…location for the flushing thread
…st clustering key in a partition
…per (avoid megamorphic call + cache isCounterColumn)
… auto-boxing for logging of primitive parameters
…tes.hashCode if not needed, avoid capturing lambda allocation in UnfilteredSerializer.serializeRowBody
…nd ShardedSkipListMemtable
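Two of the commit subjects above mention avoiding auto-boxing when logging primitive parameters and avoiding a capturing-lambda allocation in UnfilteredSerializer.serializeRowBody. A minimal, self-contained sketch of those two allocation patterns, assuming an SLF4J-style logger; the class and method names are illustrative, not the PR's actual code:

```java
import java.util.function.LongSupplier;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative class, not the PR's code.
public final class AllocationPatterns
{
    private static final Logger logger = LoggerFactory.getLogger(AllocationPatterns.class);

    // Auto-boxing: a primitive passed to an Object-taking logging method is boxed on
    // every call, even when the level is disabled.
    static void logFlushedBytes(long bytes)
    {
        // logger.trace("flushed {} bytes", bytes);   // boxes 'bytes' unconditionally
        if (logger.isTraceEnabled())                  // guard avoids the box when TRACE is off
            logger.trace("flushed {} bytes", bytes);
    }

    // Capturing lambda: closing over a local or parameter allocates a new lambda
    // instance per call; computing the value directly (or using a non-capturing
    // lambda) does not.
    static long capturing(long base)
    {
        LongSupplier s = () -> base + 1;  // captures 'base' -> fresh object each call
        return s.getAsLong();
    }

    static long nonCapturing(long base)
    {
        return base + 1;                  // no lambda object on the hot path
    }
}
```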
src/java/org/apache/cassandra/io/sstable/metadata/MetadataCollector.java
addRow(key, (Row) unfiltered);
addRow(key, (Row) unfiltered, isRowFirstOrLast);
else
addRangeTomstoneMarker((RangeTombstoneMarker) unfiltered);
This optimization can also apply to range tombstones.
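Reading this diff together with the "…st clustering key in a partition" commit subject, the optimization appears to rely on rows reaching the writer in clustering order during a flush, so only the first and the last row of a partition can move the min/max clustering bounds. A minimal, self-contained sketch of that idea; the class and method names are hypothetical, not the PR's actual MetadataCollector change:

```java
// Hypothetical tracker, not Cassandra's MetadataCollector: during a flush, rows are
// written in clustering order, so only the first and last row of a partition can
// change the min/max clustering bounds; all other rows can skip the update.
final class ClusteringBoundsTracker<C extends Comparable<C>>
{
    private C min;
    private C max;

    void addRow(C clustering, boolean isRowFirstOrLast)
    {
        if (!isRowFirstOrLast)
            return; // intermediate rows cannot move the bounds

        if (min == null || clustering.compareTo(min) < 0)
            min = clustering;
        if (max == null || clustering.compareTo(max) > 0)
            max = clustering;
    }
}
```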
ValueAccessor<T> accessor;
// to avoid megamorphic calls we split call sites
// we have ArrayCell, BufferCell and NativeCell and all of them can be here in different scenarios
if (cell.getClass() == NativeCell.class)
I'm curious if you can quantify the effect of this?
(If you don't already have a way to do it, the easiest is probably to run ReadSmallPartitionsBench with flush=YES and get the flush time log line.)
Yes, I have interim results and profiles captured for each commit in this PR.
before this commit:
INFO [PerDiskMemtableFlushWriter_0:20] 2025-12-18T13:29:20,419 Flushing.java:196 - Completed flushing /u02/dmko_cassandra/data/stress/stress_table-1b255f4def2540a60000000000000005/pa-46-big-Data.db (226.600MiB) for commitlog position CommitLogPosition(segmentId=1766064398832, position=234105), time spent: 5540 ms, bytes flushed: 237607365/(47521473 per sec), partitions flushed: 258735/(51747 per sec), rows: 177129833/(35425966 per sec), cpu time: 3930 ms, allocated: 764.749MiB
after this commit:
INFO [PerDiskMemtableFlushWriter_0:11] 2025-12-18T14:32:46,404 Flushing.java:196 - Completed flushing /u02/dmko_cassandra/data/stress/stress_table-1b255f4def2540a60000000000000005/pa-36-big-Data.db (227.192MiB) for commitlog position CommitLogPosition(segmentId=1766068252348, position=233003), time spent: 4959 ms, bytes flushed: 238228477/(59557119 per sec), partitions flushed: 259283/(64820 per sec), rows: 177545115/(44386278 per sec), cpu time: 3485 ms, allocated: 766.370MiB
So, the delta for CPU time is quite visible: 3930 ms vs 3485 ms.
Note: please ignore the rows stats in the log output for these interim results; in those old versions there was a bug there. As a workaround, the partitions stats can be used - the test has 10 rows per partition.
flush_2_cell_stats.log
flush_1_stat_stats.log
flush_2_cell_cpu.html
flush_1_stat_cpu.html
In my test I have 5 value text columns. So, for a write rate of 2,000,000 rows per second (which I am close to now), we have 10 million cells and 8 calls per cell object in the serialization logic -> 80 million calls per second for flushing alone; at this level even a small overhead matters.
This change also slightly improves serialization for Mutation, so the overall win is about 1-1.1% of CPU in total, based on the async-profiler CPU profile.
Obviously, for tables with just a single value column the result will not be that impressive, but in my experience tables with 10 or even more than 100 cells (UDTs/collections) per row are not rare.
Notes:
- I prefer an e2e test for validating this kind of overhead because of JIT profile pollution: when serialization and/or compaction are not executed within the same JVM run, we may get overly optimistic results, because the related logic was not invoked for the different cell classes in a microbenchmark and the JVM can do monomorphic inlining. The cell serialization logic is invoked for the commit log, server-to-server messaging, flushing, reads and compaction, so in the worst case all 3 Cell types are used at the same call site and we get megamorphic calls (async-profiler shows them as a vtable stub, see https://github.com/async-profiler/async-profiler/blob/master/docs/AdvancedStacktraceFeatures.md#display-actual-implementation-in-vtable). In this logic we invoke cell methods several times per cell, so the cost of the type check I introduced is amortised.
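A minimal, self-contained sketch of the call-site splitting described above; the interface and class names are illustrative, not Cassandra's actual Cell hierarchy:

```java
// Illustrative types, not Cassandra's Cell hierarchy.
interface Value { int size(); }
final class ArrayValue  implements Value { public int size() { return 1; } }
final class BufferValue implements Value { public int size() { return 2; } }
final class NativeValue implements Value { public int size() { return 3; } }

final class Dispatch
{
    // If all three implementations reach this single call site, the JIT sees a
    // megamorphic call (the vtable stub in async-profiler output) and cannot inline it.
    static int megamorphic(Value v)
    {
        return v.size();
    }

    // Splitting the call site with exact-class checks turns each branch into a
    // monomorphic, inlinable call; since several methods are invoked per cell,
    // the cost of the check is amortised.
    static int split(Value v)
    {
        if (v.getClass() == NativeValue.class)
            return ((NativeValue) v).size();
        if (v.getClass() == BufferValue.class)
            return ((BufferValue) v).size();
        return v.size(); // ArrayValue and anything else
    }
}
```

With the split, each branch has a single receiver type, so the JIT can inline the call instead of going through a vtable stub.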
public V get(int i);

default void writeValue(AbstractType<?> type, int i, DataOutputPlus out) throws IOException
The skipping of empty/null values is quite surprising given that we not only want them, but we also make a distinction between the two kinds. Could you rename the two methods in a way that states this is being done? (E.g. add SkipNullAndEmpty to the name)
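A hypothetical sketch of the rename suggested above; the interface, method names and signature are illustrative, not the actual ValueAccessor API:

```java
import java.io.DataOutput;
import java.io.IOException;

// Illustrative interface, not the actual ValueAccessor API.
interface ValueWriter
{
    boolean isNullOrEmpty(int i);

    void writeRaw(int i, DataOutput out) throws IOException;

    // The name states at every call site that null and empty values are not written.
    default void writeValueSkipNullAndEmpty(int i, DataOutput out) throws IOException
    {
        if (isNullOrEmpty(i))
            return;
        writeRaw(i, out);
    }
}
```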
flushSet.columns(),
flushSet.encodingStats()),
flushSet.encodingStats(),
flushSet.columnsChangedAfterCreation()),
As we are using the flush set's columns, I don't think the metadata's definition (and the question whether it's changed) is relevant at all. Is it really possible to serialize with a different set than the one the rows are already using?
public UnfilteredRowIterator flushingIterator(TableMetadata tableMetadata)
{
    return unfilteredIterator(ColumnFilter.all(columns()), Slices.ALL, false);
Why do we do selection in the method above if columns() must contain all the iterator has? Can we change the method above so every non-restricted unfilteredIterator call also benefits?
long timestamp;
int ttl;
long localDeletionTime;
if (cell.getClass() == ArrayCell.class)
Aren't we more interested in NativeCell.class? I.e. shouldn't the checks here and in serialize be consistent?