-
Notifications
You must be signed in to change notification settings - Fork 3.7k
CASSANDRA-19987 - Add Direct IO support for compaction reads (DRAFT) #4178
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: trunk
Are you sure you want to change the base?
CASSANDRA-19987 - Add Direct IO support for compaction reads (DRAFT) #4178
Conversation
d751233
to
671184a
Compare
@@ -1245,6 +1246,12 @@ public enum DiskAccessMode | |||
direct | |||
} | |||
|
|||
public enum ScanDiskAccessMode | |||
{ | |||
disk_default, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why we do not use DiskAccessMode which has all the disk access mode?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This enum was intended to differentiate scans using the default disk access mode (conf.disk_access_mode via the existing dFile handle) from scans requiring a new, direct file handle.
Update: code has now been updated to use DiskAccessMode. I think this integrates well with the existing setup.
@@ -901,6 +903,10 @@ else if (conf.repair_session_space.toMebibytes() > (int) (Runtime.getRuntime().m | |||
applyRepairCommandPoolSize(conf); | |||
applyReadThresholdsValidations(conf); | |||
|
|||
initializeCompactionScanDiskAccessMode(); | |||
if (compactionScanDiskAccessMode != conf.compaction_scan_disk_access_mode) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why we don't just logging the compaction scan access mode ? we can remove the if judegment.
switch (ioMode) | ||
{ | ||
case DIRECT: | ||
return new OpenOption[]{ StandardOpenOption.READ, ExtendedOpenOption.DIRECT }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is ExtendedOpenOption.DIRECT flag enough for direct io ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes - confirmed with blktrace and async-prof shows bypassing of the page cache.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I mean is that does DIRECT flag is enough to ensure that all data (including metadata) sync to disk when crash happens. see
To provide this guarantee, the application must use fsync, or set the O_SYNC or O_DSYNC flag on the file descriptor via fcntl
from
https://archive.kernel.org/oldwiki/ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the links. This definitely seems applicable to the write path, but ChannelProxy is for a read only channel, which I do not believe requires the O_SYNC flag or similar, as there's no control data to flush.
We will currently still perform compaction writes with buffered IO (for now). However, and I think you've already spotted this, the commit log DIO writes will likely need to be verified that meta is flushed on block allocation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uh... the commit log is not safe as currently written. It writes to the file, but doesn't sync the metadata on flush. That means the commit log may claim it has flushed the data, but the filesystem journal has not been flushed so the file length will be wrong and could truncate the file on restart.
Additionally Direct IO doesn't actually make data durable on disk (emit write barriers) it just flushes it to the cache of the disk. If your disk cache is volatile then you can lose data.
da6cce3
to
e10016d
Compare
…ith auto = default disk mode)
…essMode. Integrate DiskAccessMode to replace ScanDiskAccessMode
8da92c0
to
ab1a45c
Compare
…te buffer size calc
Status: draft for early review (by Ariel).
This work adds Direct IO support specifically for compaction reads of compressed tables. In its current state it should be easily extendable to be used by other bulk operations which would benefit from bypassing the page cache (and thus not polluting it).
It introduces the ability to perform SSTable scans using either the default open data file, or an ephemeral data file opened using DIO for the lifetime of the scan. This could be extended to use any alternative disk access mode in the future, if required.
Points of interest
Remaining work
patch by Sam Lightfoot; to be reviewed by Ariel Weisberg & Maxwell Guo for CASSANDRA-19987
https://issues.apache.org/jira/browse/CASSANDRA-19987