Skip to content

CASSANDRA-19987 - Add Direct IO support for compaction reads (DRAFT) #4178

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 16 commits into
base: trunk
Choose a base branch
from

Conversation

samueldlightfoot
Copy link
Contributor

@samueldlightfoot samueldlightfoot commented May 23, 2025

Status: draft for early review (by Ariel).

This work adds Direct IO support specifically for compaction reads of compressed tables. In its current state it should be easily extendable to be used by other bulk operations which would benefit from bypassing the page cache (and thus not polluting it).

It introduces the ability to perform SSTable scans using either the default open data file, or an ephemeral data file opened using DIO for the lifetime of the scan. This could be extended to use any alternative disk access mode in the future, if required.

Points of interest

  • We currently check DIO capability per SSTableReader data file. Perhaps we can/need to be smarter and check per data file directory.
  • FileHandle now supports toBuilder

Remaining work

  • Compaction integration tests
  • Enhance Direct CompressedChunkReader tests
  • Publish performance numbers
  • Start-up verifications for DIO support (currently done per data file within an SSTableReader)
  • StartupChecks for data file locations

patch by Sam Lightfoot; to be reviewed by Ariel Weisberg & Maxwell Guo for CASSANDRA-19987

https://issues.apache.org/jira/browse/CASSANDRA-19987

@@ -1245,6 +1246,12 @@ public enum DiskAccessMode
direct
}

public enum ScanDiskAccessMode
{
disk_default,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why we do not use DiskAccessMode which has all the disk access mode?

Copy link
Contributor Author

@samueldlightfoot samueldlightfoot May 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This enum was intended to differentiate scans using the default disk access mode (conf.disk_access_mode via the existing dFile handle) from scans requiring a new, direct file handle.

Update: code has now been updated to use DiskAccessMode. I think this integrates well with the existing setup.

@@ -901,6 +903,10 @@ else if (conf.repair_session_space.toMebibytes() > (int) (Runtime.getRuntime().m
applyRepairCommandPoolSize(conf);
applyReadThresholdsValidations(conf);

initializeCompactionScanDiskAccessMode();
if (compactionScanDiskAccessMode != conf.compaction_scan_disk_access_mode)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why we don't just logging the compaction scan access mode ? we can remove the if judegment.

switch (ioMode)
{
case DIRECT:
return new OpenOption[]{ StandardOpenOption.READ, ExtendedOpenOption.DIRECT };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is ExtendedOpenOption.DIRECT flag enough for direct io ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - confirmed with blktrace and async-prof shows bypassing of the page cache.

Copy link
Contributor

@Maxwell-Guo Maxwell-Guo May 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I mean is that does DIRECT flag is enough to ensure that all data (including metadata) sync to disk when crash happens. see
To provide this guarantee, the application must use fsync, or set the O_SYNC or O_DSYNC flag on the file descriptor via fcntl
from
https://archive.kernel.org/oldwiki/ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics.html

Copy link
Contributor Author

@samueldlightfoot samueldlightfoot May 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the links. This definitely seems applicable to the write path, but ChannelProxy is for a read only channel, which I do not believe requires the O_SYNC flag or similar, as there's no control data to flush.

We will currently still perform compaction writes with buffered IO (for now). However, and I think you've already spotted this, the commit log DIO writes will likely need to be verified that meta is flushed on block allocation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uh... the commit log is not safe as currently written. It writes to the file, but doesn't sync the metadata on flush. That means the commit log may claim it has flushed the data, but the filesystem journal has not been flushed so the file length will be wrong and could truncate the file on restart.

Additionally Direct IO doesn't actually make data durable on disk (emit write barriers) it just flushes it to the cache of the disk. If your disk cache is volatile then you can lose data.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants