CASSANDRA-19987 - Add Direct IO support for compaction reads (DRAFT) #4178

samueldlightfoot · 2025-05-23T17:31:31Z

Status: draft for early review (by Ariel).

This work adds Direct IO support specifically for compaction reads of compressed tables. In its current state it should be easy to extend to other bulk operations that would benefit from bypassing the page cache (and thus not polluting it).

It introduces the ability to perform SSTable scans using either the default open data file, or an ephemeral data file opened using DIO for the lifetime of the scan. This could be extended to use any alternative disk access mode in the future, if required.
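
As a rough illustration of the "ephemeral DIO file" idea (not the code in this patch), the sketch below opens a channel with ExtendedOpenOption.DIRECT only for the duration of one read and closes it afterwards. The class and method names (DirectScanSketch, readDirect, alignDown, alignUp) are hypothetical. It assumes the file store block size is a power of two, and it relies on the JDK requirement that direct reads use a block-aligned position, buffer address, and length.

```java
import com.sun.nio.file.ExtendedOpenOption;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration; not code from this patch.
final class DirectScanSketch
{
    /** Rounds position down to a multiple of blockSize (assumed to be a power of two). */
    static long alignDown(long position, int blockSize)
    {
        return position & -(long) blockSize;
    }

    /** Rounds length up to a multiple of blockSize (assumed to be a power of two). */
    static int alignUp(int length, int blockSize)
    {
        return (length + blockSize - 1) & -blockSize;
    }

    /**
     * Reads up to length bytes starting at position through a channel opened
     * with DIRECT. The channel lives only for this call.
     */
    static ByteBuffer readDirect(Path dataFile, long position, int length) throws IOException
    {
        int blockSize = Math.toIntExact(Files.getFileStore(dataFile).getBlockSize());
        long start = alignDown(position, blockSize);   // aligned file offset
        int skip = (int) (position - start);           // bytes before the requested range
        int span = alignUp(skip + length, blockSize);  // aligned read length

        // Over-allocate so that an aligned slice of at least `span` bytes exists.
        ByteBuffer aligned = ByteBuffer.allocateDirect(span + blockSize - 1).alignedSlice(blockSize);
        aligned.limit(span);

        int read;
        try (FileChannel channel = FileChannel.open(dataFile, StandardOpenOption.READ, ExtendedOpenOption.DIRECT))
        {
            // Single read for brevity; a real reader would handle short reads.
            read = channel.read(aligned, start);
        }

        // Expose only the requested bytes that were actually read.
        int end = Math.min(Math.max(read, 0), skip + length);
        int begin = Math.min(skip, end);
        aligned.limit(end);
        aligned.position(begin);
        return aligned.slice();
    }
}
```

Whether opening with DIRECT succeeds depends on the filesystem, which is why the capability check discussed under "Points of interest" matters.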

Points of interest

We currently check DIO capability per SSTableReader data file. Perhaps we can/need to be smarter and check per data file directory.
FileHandle now supports toBuilder
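
One possible shape for the per-directory check mentioned above is a small cache keyed by the data file's parent directory, so the capability probe runs once per directory rather than once per data file. This is only a sketch under that assumption: DirectIoSupportCache is a hypothetical name, and the probe itself (for example, attempting a direct open in that directory) is passed in rather than implemented here.

```java
import java.nio.file.Path;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

// Hypothetical illustration; not code from this patch.
final class DirectIoSupportCache
{
    private final ConcurrentHashMap<Path, Boolean> supportByDirectory = new ConcurrentHashMap<>();
    private final Predicate<Path> probe; // caller-supplied check for one directory

    DirectIoSupportCache(Predicate<Path> probe)
    {
        this.probe = probe;
    }

    /** Returns the cached probe result for the directory containing dataFile. */
    boolean supportsDirectIo(Path dataFile)
    {
        Path absolute = dataFile.toAbsolutePath();
        Path parent = absolute.getParent();
        Path directory = parent != null ? parent : absolute;
        // The probe runs at most once per distinct directory.
        return supportByDirectory.computeIfAbsent(directory, probe::test);
    }
}
```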

Remaining work

Compaction integration tests
Enhance Direct CompressedChunkReader tests
Publish performance numbers
Start-up verifications for DIO support (currently done per data file within an SSTableReader)
StartupChecks for data file locations

patch by Sam Lightfoot; to be reviewed by Ariel Weisberg & Maxwell Guo for CASSANDRA-19987

https://issues.apache.org/jira/browse/CASSANDRA-19987

Maxwell-Guo · 2025-05-27T08:33:30Z

src/java/org/apache/cassandra/config/Config.java

@@ -1245,6 +1246,12 @@ public enum DiskAccessMode
        direct
    }

+    public enum ScanDiskAccessMode
+    {
+        disk_default,


Why not use DiskAccessMode, which already covers all the disk access modes?

This enum was intended to differentiate scans using the default disk access mode (conf.disk_access_mode via the existing dFile handle) from scans requiring a new, direct file handle.

Update: code has now been updated to use DiskAccessMode. I think this integrates well with the existing setup.
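
To make the distinction concrete, here is a minimal sketch of how an "auto means use the default" rule could look. It is not the patch's code: ScanModeSketch, Mode, resolve, and needsDedicatedHandle are hypothetical names, and the Mode members other than direct are assumed for illustration.

```java
// Hypothetical illustration; not code from this patch.
final class ScanModeSketch
{
    // Stand-in for DiskAccessMode; members other than `direct` are assumptions.
    enum Mode { auto, standard, mmap, direct }

    /** `auto` defers to the node-wide default; anything else is used as given. */
    static Mode resolve(Mode scanMode, Mode defaultMode)
    {
        return scanMode == Mode.auto ? defaultMode : scanMode;
    }

    /** A scan needs its own file handle only when its mode differs from the default. */
    static boolean needsDedicatedHandle(Mode scanMode, Mode defaultMode)
    {
        return resolve(scanMode, defaultMode) != defaultMode;
    }
}
```

With this rule, a scan configured as auto reuses the existing data file handle, while a scan configured as direct on an mmap node opens a separate handle for the scan.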

Maxwell-Guo · 2025-05-27T08:35:38Z

src/java/org/apache/cassandra/config/DatabaseDescriptor.java

@@ -901,6 +903,10 @@ else if (conf.repair_session_space.toMebibytes() > (int) (Runtime.getRuntime().m
        applyRepairCommandPoolSize(conf);
        applyReadThresholdsValidations(conf);

+        initializeCompactionScanDiskAccessMode();
+        if (compactionScanDiskAccessMode != conf.compaction_scan_disk_access_mode)


Why not just log the compaction scan access mode unconditionally? Then we can remove the if check.

Maxwell-Guo · 2025-05-27T08:39:08Z

src/java/org/apache/cassandra/io/util/ChannelProxy.java

+        switch (ioMode)
+        {
+            case DIRECT:
+                return new OpenOption[]{ StandardOpenOption.READ, ExtendedOpenOption.DIRECT };


Is the ExtendedOpenOption.DIRECT flag enough for direct IO?

Yes. Confirmed with blktrace, and async-profiler shows the page cache being bypassed.

What I mean is: is the DIRECT flag enough to ensure that all data (including metadata) is synced to disk if a crash happens? See this quote:
"To provide this guarantee, the application must use fsync, or set the O_SYNC or O_DSYNC flag on the file descriptor via fcntl"
from
https://archive.kernel.org/oldwiki/ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics.html

Thanks for the links. This definitely seems applicable to the write path, but ChannelProxy is for a read-only channel, which I do not believe requires the O_SYNC flag or similar, as there's no control data to flush.

Compaction writes will still use buffered IO (for now). However, and I think you've already spotted this, the commit log DIO writes will likely need to be checked to confirm that metadata is flushed on block allocation.

Uh... the commit log is not safe as currently written. It writes to the file, but doesn't sync the metadata on flush. That means the commit log may claim it has flushed the data, but the filesystem journal has not been flushed so the file length will be wrong and could truncate the file on restart.

Additionally, Direct IO doesn't actually make data durable on disk (that is, emit write barriers); it just flushes the data to the disk's cache. If your disk cache is volatile then you can lose data.
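
For the write-path concern raised above, the usual JDK-level tool is FileChannel.force(true), which is documented to force both content and metadata updates (such as a changed file length) to the storage device. The sketch below is a generic illustration, not commit log code: DurableAppendSketch and appendAndSync are hypothetical names, and it uses a buffered channel for simplicity. Whether a volatile device cache is also flushed depends on the OS, filesystem, and hardware.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration; not code from this patch or the commit log.
final class DurableAppendSketch
{
    /** Appends payload, then syncs data and metadata. Returns the file size afterwards. */
    static long appendAndSync(Path file, byte[] payload) throws IOException
    {
        try (FileChannel channel = FileChannel.open(file,
                                                    StandardOpenOption.CREATE,
                                                    StandardOpenOption.WRITE,
                                                    StandardOpenOption.APPEND))
        {
            ByteBuffer buffer = ByteBuffer.wrap(payload);
            while (buffer.hasRemaining())
                channel.write(buffer);

            // true = also force metadata, so the new file length is persisted.
            channel.force(true);
            return channel.size();
        }
    }
}
```

Calling force(false) instead would request only the file content to be forced, which is the gap described above when appends grow the file.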


Add Direct IO support for compaction reads

671184a

samueldlightfoot force-pushed the direct-io-5.0-wiring branch from d751233 to 671184a Compare May 23, 2025 18:33

samueldlightfoot added 2 commits May 24, 2025 10:58

Ensure direct file supplier copied for unbuildTo

b3a4b2d

improve enum handling

c317589

Maxwell-Guo reviewed May 27, 2025


DirectThreadLocalByteBufferHolder threading & cleaning fix

e10016d

samueldlightfoot force-pushed the direct-io-5.0-wiring branch from da6cce3 to e10016d Compare May 27, 2025 12:02

samueldlightfoot added 2 commits May 27, 2025 19:09

Remove ScanDiskAccessMode enum in favour of reusing DiskAccessMode (with auto = default disk mode)

d43ef45

Replace directDataFileSupplier with dataFileFactory driven by DiskAccessMode. Integrate DiskAccessMode to replace ScanDiskAccessMode

ab1a45c

samueldlightfoot force-pushed the direct-io-5.0-wiring branch from 8da92c0 to ab1a45c Compare May 27, 2025 19:24

samueldlightfoot added 10 commits May 27, 2025 21:15

Refactor out dataFileFactory in favour of FileHandle::toBuilder

3cff88c

Revert changes post-removal of factory

c5315a5

Introduce DirectThreadLocalReadAheadBuffer factory to prevent duplicate buffer size calc

eceef62

tests

3270450

Fix test using JDK16 toList

7ee44a5

Unify non-scan readers

7ed27c5

CompressedChunkReader tests

d9e674e

Remove redundant ByteBuffer flip

9a53814

Tidy IOMode selection

1ded646

Improve exception detail

2f584c4


samueldlightfoot commented May 23, 2025
Maxwell-Guo May 27, 2025
samueldlightfoot May 27, 2025
Maxwell-Guo May 27, 2025
Maxwell-Guo May 27, 2025
samueldlightfoot May 27, 2025
Maxwell-Guo May 28, 2025
samueldlightfoot May 28, 2025
aweisberg May 30, 2025
