Detect a slow raidz child during reads #16900
Conversation
Initial thoughts: This is going to be really useful on large pools. We often see SCSI disks "lose their mind" for a few minutes in our larger pools, tanking performance. It's nice that this feature will basically read and rebuild the data from parity first rather than attempt to read from the bad disk. I can see us using this in production.
- Will disks that are "sitting out" still be read during a scrub?
- If a disk is sitting out, and a read is unable to reconstruct from the other "healthy" disks, will ZFS read from the sitting-out disk as a last-ditch effort?
- Could this work for mirrors as well?
- What happens if the algorithm wants to sit out more disks in a raid group than there is parity?
- What happens if a pool has dissimilar disks in its raid group? Like, say, 7 NVMe and 1 HDD. Will it sit out the HDD since it will be so much slower than the other drives? That might actually be a nice side effect...
It would be good to see which disks are currently sitting out via a vdev property. I would expect users would want to see this for troubleshooting performance problems and locating sick disks. I can see users wanting to manually set it on sick disks as well:
You could tell if … Alternatively you could name the property …
Latest changes...
Yes. Both resilver and scrub reads will not sit out.
Possibly, but that would require a different outlier detection.
Currently only one outlier sit-out at a time is allowed.
Yes, it would likely end up sitting out the HDD since it will be an outlier. I did test this exact scenario and could hear the sit-out period. 😃
I was using …
A read-only vdev property seems like the way to go. That would provide some additional visibility into when a drive is sitting out. I don't think we'd want to make it writable, for exactly the reasons you mentioned.
Added a read-only sit_out_reads vdev property.
Force-pushed from 1140ee3 to 3965f73.
Some recent feedback changes. Rebased to latest master and squashed commits.
I gave this a try today using the script below. I was able to sit out the disk I delayed, but I wasn't able to remove the sit-out state afterwards. Maybe I need non-ARC reads to trigger the sitting-out -> not-sitting-out logic?
#!/bin/bash
# Run this from a zfs git workspace
#
# shorten sit out period for testing
sudo bash -c 'echo 5 > /sys/module/zfs/parameters/raidz_read_sit_out_secs'
truncate -s 200M file{0,1,2,3,4,5,6,7,8,9}
sudo ./zpool create tank raidz2 `pwd`/file{0..9}
sudo dd if=/dev/urandom bs=1M count=100 of=/tank/bigfile
sudo ./zpool export tank
sudo ./zpool import -d . tank
echo "Initial state"
# We should not be sitting out yet
sudo ./zpool get -Hp sit_out_reads tank `pwd`/file9
# Add 500ms delay on last disk
sudo ./zinject -d `pwd`/file9 -D500:1 tank
# Do some reads
sudo dd if=/tank/bigfile of=/dev/null bs=4k
echo "Should be sitting out"
# We should be sitting out
sudo ./zpool get -Hp sit_out_reads tank `pwd`/file9
# Clear fault injection
sudo ./zinject -c all
echo "wait for us to stop sitting out part 1"
sleep 6
# Are we still sitting out?
sudo ./zpool get -Hp sit_out_reads tank `pwd`/file9
# Do some more reads to see if we can trigger the vdev to stop sitting out
sudo dd if=/tank/bigfile of=/dev/null bs=4k
echo "wait for us to stop sitting out part 2"
sleep 6
# Are we still sitting out?
sudo ./zpool get -Hp sit_out_reads tank `pwd`/file9
Force-pushed from 3965f73 to 77e4878.
A single slow responding disk can affect the overall read performance of a raidz group. When a raidz child disk is determined to be a persistent slow outlier, have it sit out during reads for a period of time. The raidz group can use parity to reconstruct the data that was skipped. Each time a slow disk is placed into a sit out period, its vdev_stat.vs_slow_ios count is incremented and a zevent of class ereport.fs.zfs.delay is posted. The length of the sit out period can be changed using the raidz_read_sit_out_secs module parameter. Setting it to zero disables slow outlier detection.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Force-pushed from 77e4878 to 64d5138.
A response said that resilver reads would not be affected. I'd like to suggest that this may be a bad idea. We currently have a RAIDZ2 that is rebuilding. One of the disks is very, very slow but still responds, and it's really slowing down the rebuild. I'd like the resilver to avoid reading from it. For a scrub I can see that you want to read everything, but I'm not so sure about resilver.
I agree with this - we would want the resilver to finish as quickly as possible. That would mean sitting out the sick disk if we could reconstruct from parity.
hrtime_t now = gethrtime();
uint64_t last = atomic_load_64(&vd->vdev_last_latency_check);

if ((now - last) < MSEC2NSEC(raid_outlier_check_interval_ms) ||
Same as above, I think this could be switched to seconds.
Tell me more about this? I haven't studied the code closely, but on the surface it seems that would change the default check from 20ms to a minimum of 1000ms, i.e. 50 samples vs. just 1? Seems like a lot more room for outliers to hide? Though I suppose an "outlier" that disappears within a second is not persistent or sustained, so yeah, maybe you're right.
It was 3 months ago, but as I wrote, my thinking started from the additional gethrtime() per I/O. As for sufficiency of data, it might be subjective, but I don't feel confident about that side either way. The code in calculate_ewma() has an averaging factor of 0.5, which means it barely accounts for any I/Os beyond the last few. Whether we do it every 20ms or once a second, I believe the data might still be too noisy for any reliable math. The code tries to collect more data points from the number of child vdevs, but once it looks for the worst one, it effectively only looks at several of its last I/Os. Should some disk get seriously bad, its latency may start to measure in seconds, so it makes no sense to sample it every 20ms if we may not even get a single data point at a 50-times-lower rate.
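To make the 0.5 factor concrete, here is a minimal, hypothetical sketch of the kind of exponentially weighted moving average update described above. The name calculate_ewma() comes from the comment; the signature and field usage below are assumptions for illustration, not the PR's actual code.

#include <stdint.h>

/*
 * Hypothetical EWMA update with an averaging factor of 0.5:
 *   avg' = (avg + sample) / 2
 * With this factor the most recent sample carries weight 1/2, the one
 * before it 1/4, and so on, so the fifth-most-recent sample only
 * contributes about 3% -- older I/Os are effectively forgotten.
 */
static inline uint64_t
ewma_update(uint64_t avg, uint64_t sample)
{
	if (avg == 0)
		return (sample);	/* first observation seeds the average */
	return ((avg + sample) / 2);
}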
module/zfs/vdev_raidz.c (outdated)
vdev_child_slow_outlier(zio_t *zio)
{
	vdev_t *vd = zio->io_vd;
	if (raidz_read_sit_out_secs == 0 || vd->vdev_children < LAT_SAMPLES_MIN)
While some limit on the number of samples makes sense, I am not sure it makes sense to apply it to the number of children. A 3-wide RAIDZ1 may have the same problems, though maybe there we could use a wider safety interval. If anywhere, the limit should be applied to the number of samples we have averaged for each child before we can trust any statistics. Averaging hides the dispersion, though, so it might need more thought on how to account for it better.
@don-brady I added a test case in my https://github.com/tonyhutter/zfs/tree/slow_vdev_sit_out branch. Commit is here: tonyhutter@8a58493. Would you mind pulling it into your patch stack? I have run it locally in a VM but have not tested it on the qemu runners yet.
Also added test from Tony.
Signed-off-by: Don Brady <don.brady@klarasystems.com>
	}
}

/*
To skip vdevs which are sitting out during a resilver we need to modify this check slightly. Each resilvering drive will add one to rr->rr_missingdata, and if we're going to sit out one more drive there needs to be sufficient parity available. This updated code should do the trick. You'll need to update the raidz path similarly.
Can you also add versions of tests/functional/replacement/attach_resilver.ksh and tests/functional/replacement/replace_resilver.ksh which sit out a drive during the resilver? We'll want to test raidz[1-3] and draid[1-3].
/*
* When the row contains a latency outlier and sufficient parity
* exists to reconstruct the column data, then skip reading the
* known slow child vdev as a performance optimization.
*/
if (rr->rr_outlier_cnt > 0) {
int available_parity = rr->rr_firstdatacol - rr->rr_missingparity;
int required_parity = rr->rr_missingdata + 1;
if (available_parity >= required_parity) {
for (int c = rr->rr_cols - 1;
c >= rr->rr_firstdatacol; c--) {
raidz_col_t *rc = &rr->rr_col[c];
if (rc->rc_error == 0 &&
rc->rc_latency_outlier) {
rr->rr_missingdata++;
rc->rc_error = SET_ERROR(EAGAIN);
rc->rc_skipped = 1;
break;
}
}
}
}
Looks like this change has been made, and the replace_resilver variant is here, just not an attach_resilver one. I'll look at that shortly.
"raidz" should be removed from the parameter name in Linux to be consistent with the FreeBSD name. Right now it is:
raidz_read_sit_out_secs
@don-brady this will resolve the test case changes and get it passing on FreeBSD:
diff --git a/tests/zfs-tests/include/libtest.shlib b/tests/zfs-tests/include/libtest.shlib
index fbad1747b..975a50141 100644
--- a/tests/zfs-tests/include/libtest.shlib
+++ b/tests/zfs-tests/include/libtest.shlib
@@ -3090,6 +3090,26 @@ function wait_scrubbed #pool timeout
done
}
+# Wait for vdev 'sit_out' property to be cleared.
+#
+# $1 pool name
+# $2 vdev name
+# $3 timeout
+#
+function wait_sit_out #pool vdev timeout
+{
+ typeset timeout=${3:-300}
+ typeset vdev=$2
+ typeset pool=$1
+ for (( timer = 0; timer < $timeout; timer++ )); do
+ if [[ "$(get_vdev_prop sit_out $pool $vdev)" == "off" ]] ; then
+ return 0
+ fi
+ sleep 1
+ done
+ false
+}
+
# Backup the zed.rc in our test directory so that we can edit it for our test.
#
# Returns: Backup file name. You will need to pass this to zed_rc_restore().
diff --git a/tests/zfs-tests/include/tunables.cfg b/tests/zfs-tests/include/tunables.cfg
index 7d667c553..1e82f6606 100644
--- a/tests/zfs-tests/include/tunables.cfg
+++ b/tests/zfs-tests/include/tunables.cfg
@@ -69,7 +69,7 @@ MULTIHOST_INTERVAL multihost.interval zfs_multihost_interval
OVERRIDE_ESTIMATE_RECORDSIZE send.override_estimate_recordsize zfs_override_estimate_recordsize
PREFETCH_DISABLE prefetch.disable zfs_prefetch_disable
RAIDZ_EXPAND_MAX_REFLOW_BYTES vdev.expand_max_reflow_bytes raidz_expand_max_reflow_bytes
-RAIDZ_READ_SIT_OUT_SECS vdev.raidz_read_sit_out_secs raidz_read_sit_out_secs
+READ_SIT_OUT_SECS vdev.read_sit_out_secs raidz_read_sit_out_secs
REBUILD_SCRUB_ENABLED rebuild_scrub_enabled zfs_rebuild_scrub_enabled
REMOVAL_SUSPEND_PROGRESS removal_suspend_progress zfs_removal_suspend_progress
REMOVE_MAX_SEGMENT remove_max_segment zfs_remove_max_segment
diff --git a/tests/zfs-tests/tests/functional/events/slow_vdev_sit_out.ksh b/tests/zfs-tests/tests/functional/events/slow_vdev_sit_out.ksh
index 27c3a8f5f..d859d1681 100755
--- a/tests/zfs-tests/tests/functional/events/slow_vdev_sit_out.ksh
+++ b/tests/zfs-tests/tests/functional/events/slow_vdev_sit_out.ksh
@@ -29,17 +29,14 @@
# 1. Create various raidz/draid pools
# 2. Inject delays into one of the disks
# 3. Verify disk is set to 'sit out' for awhile.
-# 4. Wait for RAIDZ_READ_SIT_OUT_SECS and verify sit out state is lifted.
+# 4. Wait for READ_SIT_OUT_SECS and verify sit out state is lifted.
#
. $STF_SUITE/include/libtest.shlib
-# Avoid name conflicts with the default pool
-TESTPOOL2=testpool2
-
function cleanup
{
- restore_tunable RAIDZ_READ_SIT_OUT_SECS
+ restore_tunable READ_SIT_OUT_SECS
log_must zinject -c all
destroy_pool $TESTPOOL2
log_must rm -f $TEST_BASE_DIR/vdev.$$.*
@@ -50,8 +47,8 @@ log_assert "Verify sit_out works"
log_onexit cleanup
# shorten sit out period for testing
-save_tunable RAIDZ_READ_SIT_OUT_SECS
-set_tunable32 RAIDZ_READ_SIT_OUT_SECS 5
+save_tunable READ_SIT_OUT_SECS
+set_tunable32 READ_SIT_OUT_SECS 5
log_must truncate -s 150M $TEST_BASE_DIR/vdev.$$.{0..9}
@@ -71,7 +68,7 @@ for raidtype in raidz raidz2 raidz3 draid1 draid2 draid3 ; do
# Do some reads and wait for us to sit out
for i in {1..100} ; do
- dd if=/$TESTPOOL2/bigfile skip=$i bs=1M count=1 of=/dev/null &> /dev/null
+ dd if=/$TESTPOOL2/bigfile skip=$i bs=1M count=1 of=/dev/null
sit_out=$(get_vdev_prop sit_out $TESTPOOL2 $BAD_VDEV)
if [[ "$sit_out" == "on" ]] ; then
@@ -85,8 +82,7 @@ for raidtype in raidz raidz2 raidz3 draid1 draid2 draid3 ; do
log_must zinject -c all
# Wait for us to exit our sit out period
- sleep 6
- log_must test "$(get_vdev_prop sit_out $TESTPOOL2 $BAD_VDEV)" == "off"
+ log_must wait_sit_out $TESTPOOL2 $BAD_VDEV 10
destroy_pool $TESTPOOL2
done
I was matching the existing naming convention. On Linux, tunables should have some prefix scope since they all end up in /sys/module/zfs/parameters/.
Also changed 'raidz_read_sit_out_secs' to 'vdev_read_sit_out_secs'.
Signed-off-by: Don Brady <don.brady@klarasystems.com>
@don-brady I need to move the slow_vdev_sit_out test out of the events group.
@don-brady this gets my test case working again on FreeBSD:
diff --git a/tests/runfiles/common.run b/tests/runfiles/common.run
index 1ed9478c4..917650739 100644
--- a/tests/runfiles/common.run
+++ b/tests/runfiles/common.run
@@ -702,9 +702,11 @@ tests = ['dio_aligned_block', 'dio_async_always', 'dio_async_fio_ioengines',
'dio_unaligned_block', 'dio_unaligned_filesize']
tags = ['functional', 'direct']
-[tests/functional/events]
+[tests/functional/sit_out]
tests = ['slow_vdev_sit_out']
-tags = ['functional', 'events']
+pre =
+post =
+tags = ['functional', 'sit_out']
[tests/functional/exec]
tests = ['exec_001_pos', 'exec_002_neg']
@@ -904,7 +906,7 @@ tags = ['functional', 'rename_dirs']
tests = ['attach_import', 'attach_multiple', 'attach_rebuild',
'attach_resilver', 'detach', 'rebuild_disabled_feature',
'rebuild_multiple', 'rebuild_raidz', 'replace_import', 'replace_rebuild',
- 'replace_resilver', 'replace_resilver_sit_out' 'resilver_restart_001',
+ 'replace_resilver', 'replace_resilver_sit_out', 'resilver_restart_001',
'resilver_restart_002', 'scrub_cancel']
tags = ['functional', 'replacement']
diff --git a/tests/zfs-tests/tests/Makefile.am b/tests/zfs-tests/tests/Makefile.am
index 42e7bce85..a24d17bc5 100644
--- a/tests/zfs-tests/tests/Makefile.am
+++ b/tests/zfs-tests/tests/Makefile.am
@@ -1497,7 +1497,6 @@ nobase_dist_datadir_zfs_tests_tests_SCRIPTS += \
functional/events/events_001_pos.ksh \
functional/events/events_002_pos.ksh \
functional/events/setup.ksh \
- functional/events/slow_vdev_sit_out.ksh \
functional/events/zed_cksum_config.ksh \
functional/events/zed_cksum_reported.ksh \
functional/events/zed_diagnose_multiple.ksh \
@@ -1994,6 +1993,7 @@ nobase_dist_datadir_zfs_tests_tests_SCRIPTS += \
functional/scrub_mirror/scrub_mirror_003_pos.ksh \
functional/scrub_mirror/scrub_mirror_004_pos.ksh \
functional/scrub_mirror/setup.ksh \
+ functional/sit_out/slow_vdev_sit_out.ksh \
functional/slog/cleanup.ksh \
functional/slog/setup.ksh \
functional/slog/slog_001_pos.ksh \
diff --git a/tests/zfs-tests/tests/functional/events/slow_vdev_sit_out.ksh b/tests/zfs-tests/tests/functional/sit_out/slow_vdev_sit_out.ksh
similarity index 100%
rename from tests/zfs-tests/tests/functional/events/slow_vdev_sit_out.ksh
rename to tests/zfs-tests/tests/functional/sit_out/slow_vdev_sit_out.ksh
I've been asked to take over this PR and shepherd it across the line. Since I can't access the original branch, I have opened a new PR #17227 to replace this one. I will close out the remaining review comments here, and then we can all bounce over to that one. Thank you!
@tonyhutter Don included your first test update, but not this one. I'll wait to see what the FreeBSD builders make of it in #17227 and then do the thing, or not.
Replaced by #17227
Motivation and Context
There is a concern, and it has been observed in practice, that a slow disk can bring down the overall read performance of raidz. Currently in ZFS, a slow disk is detected by comparing its read latency to a fixed threshold value, such as 30 seconds. This can be tuned to a lower threshold, but that requires understanding the context in which it will be applied, and hybrid pools can have a wide range of expected disk latencies.
A better approach might be to identify a slow disk outlier based on its latency distance from the latencies of its peers. This offers a more dynamic solution that can adapt to different types of media and workloads.
Description
The solution proposed here comes in two parts
Detecting Outliers
The most recent latency value for each child is saved in the vdev_t. Periodically, the samples from all the children are sorted and a statistical outlier can be detected if present. The code uses Tukey's fence, with K = 2, to detect extreme outliers. This rule defines extreme outliers as data points beyond the third quartile plus two times the interquartile range (IQR), where the IQR is the distance between the first and third quartiles. A sketch of this fence check is included at the end of this description.
Sitting Out
After a vdev has encountered multiple outlier detections (> 50), it is marked as being in a sit out period that by default lasts for 10 minutes.
Each time a slow disk is placed into a sit out period, its vdev_stat.vs_slow_ios count is incremented and a zevent of class ereport.fs.zfs.delay is posted. The length of the sit out period can be changed using the raidz_read_sit_out_secs module parameter. Setting it to zero disables slow outlier detection.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
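To illustrate the detection rule, here is a small, self-contained sketch of a Tukey's-fence check with K = 2 over one latency sample per child. It is a simplified illustration of the approach described above, not the PR's actual vdev_child_slow_outlier() implementation; the function name latency_outlier_fence() and its interface are hypothetical.

#include <stdint.h>
#include <stdlib.h>

/* qsort comparator for 64-bit latency values */
static int
cmp_u64(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
	return (x < y ? -1 : x > y);
}

/*
 * Tukey's fence with K = 2: sort the latest latency sample from each
 * child, take Q1 and Q3, and flag any child whose latency exceeds
 * Q3 + 2 * IQR as an extreme outlier.  Returns the index of the worst
 * outlier child, or -1 if there is none.
 */
static int
latency_outlier_fence(const uint64_t *lat, int children)
{
	uint64_t *sorted = malloc(children * sizeof (uint64_t));
	if (sorted == NULL)
		return (-1);
	for (int i = 0; i < children; i++)
		sorted[i] = lat[i];
	qsort(sorted, children, sizeof (uint64_t), cmp_u64);

	uint64_t q1 = sorted[children / 4];
	uint64_t q3 = sorted[(3 * children) / 4];
	uint64_t fence = q3 + 2 * (q3 - q1);
	free(sorted);

	int worst = -1;
	for (int i = 0; i < children; i++) {
		if (lat[i] > fence &&
		    (worst == -1 || lat[i] > lat[worst]))
			worst = i;	/* only one child may sit out at a time */
	}
	return (worst);
}

In the patch itself this analysis only runs periodically (gated by raid_outlier_check_interval_ms, as seen in the review snippet above), and a child is only placed into a sit out period after it has been flagged as an outlier more than 50 times.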
How Has This Been Tested?
Tested with various configs, including dRAID.
For an extreme example, an HDD was used in an 8-wide SSD raidz2 and it was compared to taking the HDD offline. This was using a fio(1) streaming read workload across 4 threads to 20GB files. Both the record size and IO request size were 1MB.
Also measured the cost over time of vdev_child_slow_outlier(), where the statistical analysis occurs (every 20ms).
Types of changes
Checklist:
All commit messages are properly formatted and contain Signed-off-by.