
improve performance of zone install #12

Merged — jclulow merged 5 commits into main, Mar 15, 2024
Conversation

@jclulow (Collaborator) commented Feb 18, 2024

This change improves zone install performance through two means:

  • It seems like std::fs::copy() uses a deeply regrettable 8KB buffer size for each read() and write(), which makes for a lot of system call overhead. I haven't confirmed the details, but this may be surprisingly pathological in the case of a ZFS pool with gzip-9 compression. Restructuring the copy code so that we can increase this buffer size to 1MB improves throughput a fair bit (see the sketch just after this list).
  • Our computers are quite wide, so we'll create 16 threads to offload the heavy lifting of the file copy operation.
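For illustration, a minimal sketch of the restructured copy; the copy_file() name and the error handling here are assumptions, not the exact code from this change:

use std::fs::File;
use std::io::{Read, Write};
use std::path::Path;

/// Copy a file through a 1MB buffer rather than the small buffer
/// std::fs::copy() uses internally, trading a little memory for far
/// fewer read(2)/write(2) system calls.
fn copy_file(src: &Path, dst: &Path) -> std::io::Result<u64> {
    const BUF_SIZE: usize = 1024 * 1024;

    let mut input = File::open(src)?;
    let mut output = File::create(dst)?;
    let mut buf = vec![0u8; BUF_SIZE];
    let mut total = 0u64;

    loop {
        let n = input.read(&mut buf)?;
        if n == 0 {
            break;
        }
        output.write_all(&buf[..n])?;
        total += n as u64;
    }
    Ok(total)
}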

On the Gimlet where I did a brief smoke test, we go from:

gimlet-sn07 # zoneadm -z test uninstall -F ; ptime -m zoneadm -z test install
INFO: omicron: uninstalling zone test...
INFO: omicron: installing zone test @ "/data/zones/test"...
INFO: omicron: replicating /usr tree...
INFO: omicron: replicating /lib tree...
INFO: omicron: replicating /sbin tree...
INFO: omicron: pruning SMF manifests...
INFO: omicron: pruning global-only files...
INFO: omicron: unpacking baseline archive...
INFO: omicron: copying /etc/default/init...
INFO: omicron: install complete, probably!

real        7.748228113
user        0.256719377
sys         2.142117414
trap        0.000018704
tflt        0.000000000
dflt        0.000003035
kflt        0.000000000
lock        0.004332631
slp        20.370977556
lat         0.094301330
stop        0.000364079

... to ...

gimlet-sn07 # zoneadm -z test uninstall -F ; ptime -m zoneadm -z test install
INFO: omicron: uninstalling zone test...
INFO: omicron: installing zone test @ "/data/zones/test"...
INFO: omicron: replicating /usr tree...
INFO: omicron: replicated /usr: 9514 files, 121518507 bytes, 562 msec
INFO: omicron: replicating /lib tree...
INFO: omicron: replicated /lib: 163 files, 2849749 bytes, 21 msec
INFO: omicron: replicating /sbin tree...
INFO: omicron: replicated /sbin: 0 files, 0 bytes, 1 msec
INFO: omicron: pruning SMF manifests...
INFO: omicron: pruning global-only files...
INFO: omicron: unpacking baseline archive...
INFO: omicron: copying /etc/default/init...
INFO: omicron: install complete, probably!

real        0.945731000
user        0.409957787
sys         3.937423102
trap        0.000085716
tflt        0.000028022
dflt        0.000026368
kflt        0.000041306
lock        4.048764217
slp         2.708285516
lat         0.076202687
stop        0.000471146

... which is at least a bit faster!

NOTE: I have not yet done any serious verification of the actual result of the copies here. It appears to work, and I have spot checked a file here and there, but I'll want to do a more rigorous evaluation of both the metadata and the contents of the files as a result of this change.

@bcantrill (Contributor) commented

I explored this a bit; here are the results from varying parallelism:

[graph: install time as a function of copy-thread parallelism]

Wondering if batching operations wouldn't make it a tad more efficient, I modified the fix to allow for such batching -- and indeed, it was a tad more efficient, but truly only a tad:

[graph: install time with job batching enabled]

And looking at varying thread parallelism at the roughly optimal batch size of 32:

[graph: install time vs. thread parallelism at the roughly optimal batch size of 32]

It feels like we should be able to squeeze a bit more out of this, but for our purposes: tuning the parallelism to ~10 (8-12) should result in the best performance; allowing the work to be batched will buy another couple of percent.
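To make the batching shape concrete, a hedged sketch (CopyJob and run_pool are invented names for illustration, not taken from the change): the producer sends work over a channel in batches, so workers synchronise on the queue once per batch rather than once per file.

use std::path::PathBuf;
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// Hypothetical work description; the real change queues file copies
// for the new zone root.
struct CopyJob {
    src: PathBuf,
    dst: PathBuf,
}

fn run_pool(mut jobs: Vec<CopyJob>, nthreads: usize, batch_size: usize) {
    let (tx, rx) = mpsc::channel::<Vec<CopyJob>>();
    let rx = Arc::new(Mutex::new(rx));

    let workers: Vec<_> = (0..nthreads)
        .map(|_| {
            let rx = Arc::clone(&rx);
            thread::spawn(move || loop {
                // Hold the queue lock just long enough to receive one
                // whole batch of jobs.
                let chunk = rx.lock().unwrap().recv();
                let Ok(chunk) = chunk else { break };
                for job in chunk {
                    // A large-buffer copy_file() would go here.
                    std::fs::copy(&job.src, &job.dst).unwrap();
                }
            })
        })
        .collect();

    // Producer: send the jobs in batch_size chunks.
    while !jobs.is_empty() {
        let n = batch_size.min(jobs.len());
        tx.send(jobs.drain(..n).collect()).unwrap();
    }
    drop(tx); // close the channel so idle workers exit

    for w in workers {
        w.join().unwrap();
    }
}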

@jclulow (Collaborator, Author) commented Feb 22, 2024

I have pulled in Bryan's latest additions and moved the tuneables override file into /etc/default. There is now a file for future investigation:

 $ cat /etc/default/helios-omicron1
#
# Copyright 2024 Oxide Computer Company
#

#
# This file contains various tuning settings used by various Helios "omicron1"
# brand tools that may be useful to override during experiments.  The format of
# this file is not a Committed interface, and it is not expected that end users
# or customers will need to alter it.
#

#
# How many writer threads should we use when copying from the root file
# system to a new zone root?
#
COPY_THREADS=16

#
# What batch size should the writer thread queue use?
#
COPY_BATCH=32
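Purely as a sketch of how such a KEY=VALUE override might be consumed (the function name and parsing rules are assumptions, not the actual brand-tools code):

use std::fs;

// Read an integer tuneable (e.g., COPY_THREADS or COPY_BATCH) from the
// override file, falling back to the built-in default if the file or
// the key is absent.
fn tuneable(key: &str, default: usize) -> usize {
    let Ok(text) = fs::read_to_string("/etc/default/helios-omicron1") else {
        return default;
    };
    text.lines()
        .map(str::trim)
        .filter(|l| !l.is_empty() && !l.starts_with('#'))
        .find_map(|l| {
            let (k, v) = l.split_once('=')?;
            if k.trim() == key { v.trim().parse().ok() } else { None }
        })
        .unwrap_or(default)
}

A caller would then do something like: let threads = tuneable("COPY_THREADS", 16);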

@bcantrill (Contributor) commented

I wanted to understand why we weren't seeing much improvement from the increased parallelism. To explore thread activity, I used statemaps with the following script:

#!/usr/sbin/dtrace -Cs 

#pragma D option quiet
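/*
 * Generate statemap input for the omicron1 brand installer ("brand"):
 * print a JSON header describing the thread states of interest, then
 * one JSON record per thread state transition (syscall entry/return,
 * on-cpu/off-cpu, wakeups, LWP creation and exit).
 */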

#define T_WAKEABLE	0x0002

typedef enum {
	STATE_ON_CPU = 0,
	STATE_ON_CPU_SYSCALL,
	STATE_OFF_CPU_WAITING,
	STATE_OFF_CPU_BLOCKED,
	STATE_OFF_CPU_IO_READ,
	STATE_OFF_CPU_IO_WRITE,
	STATE_OFF_CPU_DEAD,
	STATE_MAX
} state_t;

#define STATE_METADATA(_state, _str, _color) \
	printf("\t\t\"%s\": {\"value\": %d, \"color\": \"%s\" }%s\n", \
	    _str, _state, _color, _state < STATE_MAX - 1 ? "," : "");

BEGIN
{
	wall = walltimestamp;
	printf("{\n\t\"start\": [ %d, %d ],\n",
	    wall / 1000000000, wall % 1000000000);
	printf("\t\"title\": \"installing Omicron brand\",\n");
	printf("\t\"host\": \"%s\",\n", `utsname.nodename);
	printf("\t\"entityKind\": \"Thread\",\n");
	printf("\t\"states\": {\n");

	STATE_METADATA(STATE_ON_CPU, "on-cpu", "#DAF7A6")
	STATE_METADATA(STATE_ON_CPU_SYSCALL, "on-cpu-kernel", "#6d7d51")
	STATE_METADATA(STATE_OFF_CPU_WAITING, "off-cpu-waiting", "#f9f9f9")
	STATE_METADATA(STATE_OFF_CPU_BLOCKED, "off-cpu-blocked", "#C70039")
	STATE_METADATA(STATE_OFF_CPU_IO_READ, "off-cpu-io-read", "#FFC300")
	STATE_METADATA(STATE_OFF_CPU_IO_WRITE, "off-cpu-io-write", "#338AFF")
	STATE_METADATA(STATE_OFF_CPU_DEAD, "off-cpu-dead", "#E0E0E0")

	printf("\t}\n}\n");
	start = timestamp;
}

proc:::lwp-create
/execname == "brand"/
{
	printf("{ \"time\": \"%d\", \"entity\": \"%d\", ",
	    timestamp - start, tid);
	printf("\"event\": \"create\", \"target\": \"%d\" }\n",
	    args[0]->pr_lwpid);
}

sched:::wakeup
/execname == "brand" && args[1]->pr_pid == pid/
{
	printf("{ \"time\": \"%d\", \"entity\": \"%d\", ",
	    timestamp - start, tid);
	printf("\"event\": \"wakeup\", \"target\": \"%d\" }\n",
	    args[0]->pr_lwpid);
}


syscall::read:entry
/execname == "brand"/
{
	self->state = STATE_OFF_CPU_IO_READ;
}

syscall::write:entry
/execname == "brand"/
{
	self->state = STATE_OFF_CPU_IO_WRITE;
}

syscall:::entry
/execname == "brand"/
{
	printf("{ \"time\": \"%d\", \"entity\": \"%d\", \"state\": %d }\n",
	    timestamp - start, tid, STATE_ON_CPU_SYSCALL);
}

syscall::read:return,
syscall::write:return
/execname == "brand"/
{
	self->state = STATE_ON_CPU;
}

syscall:::return
/execname == "brand"/
{
	printf("{ \"time\": \"%d\", \"entity\": \"%d\", \"state\": %d }\n",
	    timestamp - start, tid, STATE_ON_CPU);
}

sched:::off-cpu
/execname == "brand"/
{
	printf("{ \"time\": \"%d\", \"entity\": \"%d\", ",
	    timestamp - start, tid);

	printf("\"state\": %d }\n", self->state != STATE_ON_CPU ?
	    self->state : curthread->t_flag & T_WAKEABLE ?
	    STATE_OFF_CPU_WAITING : STATE_OFF_CPU_BLOCKED);
}

sched:::on-cpu
/execname == "brand"/
{
	self->state = STATE_ON_CPU;
	printf("{ \"time\": \"%d\", \"entity\": \"%d\", ",
	    timestamp - start, tid);
	printf("\"state\": %d }\n", self->state);
}

proc:::lwp-exit
/execname == "brand"/
{
	self->exiting = tid;
}

sched:::off-cpu
/execname != "brand" && self->exiting/
{
	printf("{ \"time\": \"%d\", \"entity\": \"%d\", ",
	    timestamp - start, self->exiting);

	printf("\"state\": %d }\n", STATE_OFF_CPU_DEAD);
	self->exiting = 0;
	self->state = 0;
}

Given this data and resulting statemaps, here is the statemap from running with the default parallelism and batch size:

[statemap: default parallelism and batch size]

This shows that we are only parallel for a small amount of the overall copy time. The problem is that while the file copies are being made by worker threads, the symlink generation is not. Parallelising symlink generation as well makes this look much better from a utilization and performance perspective:

[statemap: with symlink generation parallelised]

(Unsurprisingly, contention is up quite a bit.)
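As a sketch of that change (the WorkItem name is invented for illustration), the queue fed to the worker pool can carry symlinks alongside file copies, so the directory walker no longer creates links serially:

use std::path::PathBuf;

// Hypothetical work item: symlink creation is queued to the same
// worker pool as file copies instead of running serially on the
// directory-walking thread.
enum WorkItem {
    CopyFile { src: PathBuf, dst: PathBuf },
    Symlink { target: PathBuf, link: PathBuf },
}

fn perform(item: WorkItem) -> std::io::Result<()> {
    match item {
        WorkItem::CopyFile { src, dst } => {
            // Large-buffer copy as sketched earlier.
            std::fs::copy(&src, &dst)?;
            Ok(())
        }
        WorkItem::Symlink { target, link } => {
            // symlink(2) on illumos, as on other Unix systems.
            std::os::unix::fs::symlink(&target, &link)
        }
    }
}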

Here is what the relative performance for the two approaches looks like at a batch size of 1, varying the number of worker threads:

[graph: install time vs. worker threads at a batch size of 1, both approaches]

A pretty clear win. And increasing the batch size is also a win; here are the results of varying the batch size at different numbers of workers for both approaches:

[graph: install time across batch sizes and worker counts, both approaches]

The best overall performance is with a thread pool of ~8 workers and a batch size of ~128. I have pushed all of this to the must-go-even-faster branch.

@jclulow (Collaborator, Author) commented Mar 15, 2024

I have done a final smoke test of zone installation on the bench Gimlet that underpins the propolis factory for buildomat VMs, and things appear to function as expected (and are indeed faster).

@jclulow merged commit 0e46273 into main on Mar 15, 2024
@jclulow (Collaborator, Author) commented Mar 15, 2024

New packages built and published:

consolidation/oxide/omicron1-brand-incorporation@1.0.22,5.11:20240315T231141Z
system/zones/brand/omicron1@1.0.22,5.11:20240315T231124Z
system/zones/brand/omicron1/tools@1.0.22,5.11:20240315T231132Z
