
iotune reports lower-than-expected IOPS on some large instances #17261

Closed
travisdowns opened this issue Mar 22, 2024 · 6 comments
Labels
kind/bug (Something isn't working), performance

Comments

@travisdowns
Member

travisdowns commented Mar 22, 2024

Version & Environment

Redpanda version: 23.3

What went wrong?

When running rpk iotune on large instance types, some results (especially IOPS) are often significantly lower than the vendor advertised numbers.

See i3en for example:

        Instance  Disks    Read IOPS          Read BW   Write IOPS         Write BW
      i3en.large    n/a        42705        328001088        32485        162821712
     i3en.xlarge    n/a        85373        659501824        65265        326548864
    i3en.2xlarge    n/a       170723       1318909056       130508        653094592
    i3en.3xlarge    n/a       242725       2065906688       201103       1012843968
    i3en.6xlarge    n/a       485579       4128679424       402086       2025674368
   i3en.12xlarge    n/a       550798       6819085312       496401       4051611392
   i3en.24xlarge    n/a      1086137       8334144000      1005340       8104002048

Up to and including 6xlarge, the values closely track the advertised IOPS. However, 12xlarge reports 550k IOPS versus the advertised value of 1000k, and 24xlarge reports ~1000k versus 2000k advertised.

I believe this is measurement/tuning error, not a fundamental hardware limitation.

This applies to other instance types as well, see #17220 for details.

What should have happened instead?

iotune produces results reflecting the hardware capabilities.

How to reproduce the issue?

  1. Run rpk iotune on an i3en.12xlarge instance and observe the output
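
A rough reproduction sketch (device names and the mount point are assumptions, and the rpk option for pointing iotune at the mounted array is omitted since its name varies by version):

# build the usual 4-drive RAID0, format it, and mount it
sudo mdadm --create --run /dev/md0 --level=0 --raid-devices=4 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1
sudo mkfs.xfs -f /dev/md0
sudo mkdir -p /mnt/xfs && sudo mount /dev/md0 /mnt/xfs

# run iotune (pointed at /mnt/xfs via its data-directory option) and compare the
# reported IOPS/BW against the vendor-advertised figures
rpk iotune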

JIRA Link: CORE-1915

@travisdowns
Member Author

Perhaps this is not, in fact, entirely an iotune problem.

I investigated a bit more on i3en.12xlarge, which had the following iotune results (as in the OP):

        Instance  Disks    Read IOPS          Read BW   Write IOPS         Write BW
    i3en.6xlarge    n/a       485579       4128679424       402086       2025674368
   i3en.12xlarge    n/a       550798       6819085312       496401       4051611392

(6xlarge is also shown for reference: one would expect each 12xlarge number to be double the corresponding 6xlarge number)

I focused only on read IOPS. Using fio, I was able to get only about 750k-760k read IOPS regardless of what configuration I tried. This is still better than the 550k reported by iotune, but much less than the 1000k we expect. This was with the usual md configuration of RAID0 across the 4 drives.

I then tried creating an array of only the first two drives; this would presumably give results identical to 6xlarge, i.e., ~500k read IOPS. However, it did not: it only gave half of the 4-drive output, so the same performance problem was evidently still present, in the same proportion as in the 4-drive case. The next odd thing is that this effect seems to depend on which drives are paired up in the RAID. If drives 1 and 2 are paired, or 3 and 4 are paired, the result is slow as described (~350k IOPS), but any other pair gives a fast configuration (~475k IOPS, just slightly shy of the theoretical value), shown here based on testing each combination:

$ grep IOPS *
nvme1n1,nvme2n1:  read: IOPS=376k, BW=1468MiB/s (1539MB/s)(86.0GiB/60003msec)
nvme1n1,nvme3n1:  read: IOPS=476k, BW=1858MiB/s (1949MB/s)(109GiB/60005msec)
nvme1n1,nvme4n1:  read: IOPS=475k, BW=1857MiB/s (1947MB/s)(109GiB/60004msec)
nvme2n1,nvme3n1:  read: IOPS=476k, BW=1858MiB/s (1948MB/s)(109GiB/60004msec)
nvme2n1,nvme4n1:  read: IOPS=476k, BW=1858MiB/s (1948MB/s)(109GiB/60005msec)
nvme3n1,nvme4n1:  read: IOPS=377k, BW=1472MiB/s (1543MB/s)(86.2GiB/60003msec)

This behavior was consistent and quite stable from run to run and confirmed on 2 different machines.

Finally, I confirmed that this doesn't have anything to do with md: the same effect is present if you format each drive individually, without using md at all, and then run separate benchmarks concurrently on two drives: the aggregate performance is as above (and each drive splits the IOPS equally). If you run just one test on a single drive, you get very close to 250k IOPS, i.e., the expected advertised per-drive performance.
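
For reference, the no-md variant can be driven with something like this sketch (device names and the job-file name jobfile.fio are placeholders; the job file itself is the fio configuration shown in the next comment):

# format and mount each drive separately, no md involved
for d in nvme1n1 nvme2n1; do
  sudo mkfs.xfs -f /dev/$d
  sudo mkdir -p /mnt/$d
  sudo mount /dev/$d /mnt/$d
done

# run one fio instance per drive concurrently, then sum the per-drive IOPS
for d in nvme1n1 nvme2n1; do
  fio --directory=/mnt/$d --output=$d.log jobfile.fio &
done
wait
grep IOPS *.log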

@travisdowns
Member Author

travisdowns commented Mar 25, 2024

fio configuration:

[file1]
name=fio-seq-write
rw=randread
bs=4K
direct=1
numjobs=8

time_based
runtime=5m
size=10GB
ioengine=libaio
iodepth=128

This isn't a special config: performance is similar under many parameter changes to the above. The main things are that you need enough total queue depth (which is numjobs * iodepth; the above has 8k, but even 4k is probably enough) and enough jobs to avoid saturating a single CPU (so numjobs=1 doesn't cut it at this level because it will saturate a CPU, but 4 is generally enough).

I didn't notice any change in performance with increased file size, runtime, or depth. read (sequential reads) and randread perform similarly as long as merging is disabled at the block layer (otherwise sequential reads may be merged, resulting in many fewer actual IOs and therefore an inflated IOPS figure).
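
For completeness, block-layer merging can be toggled per device through the nomerges sysfs attribute; a minimal sketch, assuming the md/NVMe device names used above (writing 0 restores the default behavior):

for d in md0 nvme1n1 nvme2n1 nvme3n1 nvme4n1; do
  echo 2 | sudo tee /sys/block/$d/queue/nomerges   # 2 = disable all merging
done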

@travisdowns
Member Author

travisdowns commented Mar 25, 2024

Script to rebuild & remount the array, used to test every combination of array members:

#!/usr/bin/env bash
set -euo pipefail

# Devices to include in the array are passed via $DEVICES (space- or
# comma-separated), e.g. DEVICES="nvme1n1 nvme3n1"
echo "MD=${MD:=md0}"
echo "DEVICES=${DEVICES:=nvme1n1 nvme2n1}"
echo "MOUNT=${MOUNT:=/mnt/xfs}"
IFS=', ' read -r -a DA <<< "$DEVICES"
echo "DEVICE_COUNT=${#DA[@]}"

MDD=/dev/$MD

# tear down any existing mounts and arrays
sudo umount /mnt/xfs* || true
sudo mdadm --stop /dev/md* || true

# set -x
# build a RAID0 array over the requested devices
sudo mdadm --create --run --verbose $MDD --level=0 --raid-devices=${#DA[@]} $(for d in "${DA[@]}"; do echo -n "/dev/$d "; done)

# format and mount the new array
sudo mkfs.xfs -f $MDD
sudo mkdir -p $MOUNT
sudo mount $MDD $MOUNT
sudo chmod a+w $MOUNT

echo "OK - mounted at: $MOUNT"

cat /proc/mdstat
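
A driver loop along these lines (a sketch; the script filename rebuild-array.sh and the job-file name jobfile.fio are placeholders) reproduces the per-pair results grepped in the earlier comment:

for pair in "nvme1n1 nvme2n1" "nvme1n1 nvme3n1" "nvme1n1 nvme4n1" \
            "nvme2n1 nvme3n1" "nvme2n1 nvme4n1" "nvme3n1 nvme4n1"; do
  DEVICES="$pair" ./rebuild-array.sh                       # the script above
  fio --directory=/mnt/xfs --output="${pair// /,}.log" jobfile.fio
done
grep IOPS *.log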

@travisdowns
Member Author

Miscellaneous notes:

  • I tried Ubuntu 20.04 and 22.04 with their default linux-aws kernels but no other distros or kernels yet. It would be interesting to test say an Amazon Linux 2 AMI.
  • The 12xlarge size has 48 vCPUs and so is the first size that exceeds the maximum NVMe queue count of 32 per drive, meaning not every CPU has a queue assigned for every drive. However, I couldn't immediately tie this to the effect, though more tests here are warranted.
  • In the slower 2-drive runs, the fio-reported CPU use was significantly higher than in the faster runs; and since the faster runs are also doing more IOPS, the slower runs are even worse in a CPU-per-IO sense.
  • I didn't yet have time to investigate how drive interrupt counts vary between the two scenarios (a quick way to check the queue and interrupt counts is sketched after this list).
  • I have the full fio results for the above runs; I will upload them "at some point" or "on request".
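
A quick way to inspect the queue and interrupt points mentioned above (a sketch; device names are assumptions):

# number of blk-mq hardware queues the drive exposes (expected to be 32 here)
ls /sys/block/nvme1n1/mq | wc -l

# per-queue NVMe interrupt lines, and which CPUs they fire on during a run
grep -c nvme1q /proc/interrupts
watch -d 'grep nvme /proc/interrupts'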


This issue hasn't seen activity in 3 months. If you want to keep it open, post a comment or remove the stale label – otherwise this will be closed in two weeks.

@github-actions github-actions bot added the stale label Sep 24, 2024

This issue was closed due to lack of activity. Feel free to reopen if it's still relevant.

@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Oct 17, 2024
@travisdowns travisdowns removed the stale label Oct 17, 2024