
improve performance of zone install #12

Merged — jclulow merged 5 commits into main, Mar 15, 2024
Conversation

@jclulow (Collaborator) commented Feb 18, 2024

This change improves zone install performance through two means:

  • It seems like std::fs::copy() uses a deeply regrettable 8KB buffer size for each read() and write(), which makes for a lot of system call overhead. I haven't confirmed the details, but this may be surprisingly pathological in the case of a ZFS pool with gzip-9 compression. Restructuring the copy code so that we can increase this buffer size to 1MB improves throughput a fair bit (see the sketch just after this list).
  • Our computers are quite wide, so we'll create 16 threads to offload the heavy lifting of the file copy operation.
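For illustration, a minimal sketch of the restructured copy; the copy_file() name and the error handling here are assumptions, not the exact code from this change:

use std::fs::File;
use std::io::{Read, Write};
use std::path::Path;

/// Copy a file through a 1MB buffer rather than the small buffer
/// std::fs::copy() uses internally, trading a little memory for far
/// fewer read(2)/write(2) system calls.
fn copy_file(src: &Path, dst: &Path) -> std::io::Result<u64> {
    const BUF_SIZE: usize = 1024 * 1024;

    let mut input = File::open(src)?;
    let mut output = File::create(dst)?;
    let mut buf = vec![0u8; BUF_SIZE];
    let mut total = 0u64;

    loop {
        let n = input.read(&mut buf)?;
        if n == 0 {
            break;
        }
        output.write_all(&buf[..n])?;
        total += n as u64;
    }
    Ok(total)
}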

On the Gimlet where I did a brief smoke test, we go from:

gimlet-sn07 # zoneadm -z test uninstall -F ; ptime -m zoneadm -z test install
INFO: omicron: uninstalling zone test...
INFO: omicron: installing zone test @ "/data/zones/test"...
INFO: omicron: replicating /usr tree...
INFO: omicron: replicating /lib tree...
INFO: omicron: replicating /sbin tree...
INFO: omicron: pruning SMF manifests...
INFO: omicron: pruning global-only files...
INFO: omicron: unpacking baseline archive...
INFO: omicron: copying /etc/default/init...
INFO: omicron: install complete, probably!

real        7.748228113
user        0.256719377
sys         2.142117414
trap        0.000018704
tflt        0.000000000
dflt        0.000003035
kflt        0.000000000
lock        0.004332631
slp        20.370977556
lat         0.094301330
stop        0.000364079

... to ...

gimlet-sn07 # zoneadm -z test uninstall -F ; ptime -m zoneadm -z test install
INFO: omicron: uninstalling zone test...
INFO: omicron: installing zone test @ "/data/zones/test"...
INFO: omicron: replicating /usr tree...
INFO: omicron: replicated /usr: 9514 files, 121518507 bytes, 562 msec
INFO: omicron: replicating /lib tree...
INFO: omicron: replicated /lib: 163 files, 2849749 bytes, 21 msec
INFO: omicron: replicating /sbin tree...
INFO: omicron: replicated /sbin: 0 files, 0 bytes, 1 msec
INFO: omicron: pruning SMF manifests...
INFO: omicron: pruning global-only files...
INFO: omicron: unpacking baseline archive...
INFO: omicron: copying /etc/default/init...
INFO: omicron: install complete, probably!

real        0.945731000
user        0.409957787
sys         3.937423102
trap        0.000085716
tflt        0.000028022
dflt        0.000026368
kflt        0.000041306
lock        4.048764217
slp         2.708285516
lat         0.076202687
stop        0.000471146

... which is at least a bit faster!

NOTE: I have not yet done any serious verification of the actual result of the copies here. It appears to work, and I have spot checked a file here and there, but I'll want to do a more rigorous evaluation of both the metadata and the contents of the files as a result of this change.

@bcantrill (Contributor) commented

I explored this a bit; here are the results from varying parallelism:

[graph: install time as a function of copy-thread parallelism]

Wondering if batching operations wouldn't make it a tad more efficient, I modified the fix to allow for such batching -- and indeed, it was a tad more efficient, but truly only a tad:

[graph: install time with job batching enabled]

And looking at varying thread parallelism at the roughly optimal batch size of 32:

[graph: install time vs. thread parallelism at the roughly optimal batch size of 32]

It feels like we should be able to squeeze a bit more out of this, but for our purposes: tuning the parallelism to ~10 (8-12) should result in the best performance; allowing the work to be batched will buy another couple of percent.
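To make the batching shape concrete, a hedged sketch (CopyJob and run_pool are invented names for illustration, not taken from the change): the producer sends work over a channel in batches, so workers synchronise on the queue once per batch rather than once per file.

use std::path::PathBuf;
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// Hypothetical work description; the real change queues file copies
// for the new zone root.
struct CopyJob {
    src: PathBuf,
    dst: PathBuf,
}

fn run_pool(mut jobs: Vec<CopyJob>, nthreads: usize, batch_size: usize) {
    let (tx, rx) = mpsc::channel::<Vec<CopyJob>>();
    let rx = Arc::new(Mutex::new(rx));

    let workers: Vec<_> = (0..nthreads)
        .map(|_| {
            let rx = Arc::clone(&rx);
            thread::spawn(move || loop {
                // Hold the queue lock just long enough to receive one
                // whole batch of jobs.
                let chunk = rx.lock().unwrap().recv();
                let Ok(chunk) = chunk else { break };
                for job in chunk {
                    // A large-buffer copy_file() would go here.
                    std::fs::copy(&job.src, &job.dst).unwrap();
                }
            })
        })
        .collect();

    // Producer: send the jobs in batch_size chunks.
    while !jobs.is_empty() {
        let n = batch_size.min(jobs.len());
        tx.send(jobs.drain(..n).collect()).unwrap();
    }
    drop(tx); // close the channel so idle workers exit

    for w in workers {
        w.join().unwrap();
    }
}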

@jclulow (Collaborator, Author) commented Feb 22, 2024

I have pulled in Bryan's latest additions and moved the tuneables override file into /etc/default. There is now a file for future investigation:

 $ cat /etc/default/helios-omicron1
#
# Copyright 2024 Oxide Computer Company
#

#
# This file contains various tuning settings used by various Helios "omicron1"
# brand tools that may be useful to override during experiments.  The format of
# this file is not a Committed interface, and it is not expected that end users
# or customers will need to alter it.
#

#
# How many writer threads should we use when copying from the root file
# system to a new zone root?
#
COPY_THREADS=16

#
# What batch size should the writer thread queue use?
#
COPY_BATCH=32
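Purely as a sketch of how such a KEY=VALUE override might be consumed (the function name and parsing rules are assumptions, not the actual brand-tools code):

use std::fs;

// Read an integer tuneable (e.g., COPY_THREADS or COPY_BATCH) from the
// override file, falling back to the built-in default if the file or
// the key is absent.
fn tuneable(key: &str, default: usize) -> usize {
    let Ok(text) = fs::read_to_string("/etc/default/helios-omicron1") else {
        return default;
    };
    text.lines()
        .map(str::trim)
        .filter(|l| !l.is_empty() && !l.starts_with('#'))
        .find_map(|l| {
            let (k, v) = l.split_once('=')?;
            if k.trim() == key { v.trim().parse().ok() } else { None }
        })
        .unwrap_or(default)
}

A caller would then do something like: let threads = tuneable("COPY_THREADS", 16);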

@bcantrill (Contributor) commented

I wanted to understand why we weren't seeing much improvement from the increased parallelism. To explore thread activity, I used statemaps with the following script:

#!/usr/sbin/dtrace -Cs 

#pragma D option quiet
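/*
 * Generate statemap input for the omicron1 brand installer ("brand"):
 * print a JSON header describing the thread states of interest, then
 * one JSON record per thread state transition (syscall entry/return,
 * on-cpu/off-cpu, wakeups, LWP creation and exit).
 */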

#define T_WAKEABLE	0x0002

typedef enum {
	STATE_ON_CPU = 0,
	STATE_ON_CPU_SYSCALL,
	STATE_OFF_CPU_WAITING,
	STATE_OFF_CPU_BLOCKED,
	STATE_OFF_CPU_IO_READ,
	STATE_OFF_CPU_IO_WRITE,
	STATE_OFF_CPU_DEAD,
	STATE_MAX
} state_t;

#define STATE_METADATA(_state, _str, _color) \
	printf("\t\t\"%s\": {\"value\": %d, \"color\": \"%s\" }%s\n", \
	    _str, _state, _color, _state < STATE_MAX - 1 ? "," : "");

BEGIN
{
	wall = walltimestamp;
	printf("{\n\t\"start\": [ %d, %d ],\n",
	    wall / 1000000000, wall % 1000000000);
	printf("\t\"title\": \"installing Omicron brand\",\n");
	printf("\t\"host\": \"%s\",\n", `utsname.nodename);
	printf("\t\"entityKind\": \"Thread\",\n");
	printf("\t\"states\": {\n");

	STATE_METADATA(STATE_ON_CPU, "on-cpu", "#DAF7A6")
	STATE_METADATA(STATE_ON_CPU_SYSCALL, "on-cpu-kernel", "#6d7d51")
	STATE_METADATA(STATE_OFF_CPU_WAITING, "off-cpu-waiting", "#f9f9f9")
	STATE_METADATA(STATE_OFF_CPU_BLOCKED, "off-cpu-blocked", "#C70039")
	STATE_METADATA(STATE_OFF_CPU_IO_READ, "off-cpu-io-read", "#FFC300")
	STATE_METADATA(STATE_OFF_CPU_IO_WRITE, "off-cpu-io-write", "#338AFF")
	STATE_METADATA(STATE_OFF_CPU_DEAD, "off-cpu-dead", "#E0E0E0")

	printf("\t}\n}\n");
	start = timestamp;
}

proc:::lwp-create
/execname == "brand"/
{
	printf("{ \"time\": \"%d\", \"entity\": \"%d\", ",
	    timestamp - start, tid);
	printf("\"event\": \"create\", \"target\": \"%d\" }\n",
	    args[0]->pr_lwpid);
}

sched:::wakeup
/execname == "brand" && args[1]->pr_pid == pid/
{
	printf("{ \"time\": \"%d\", \"entity\": \"%d\", ",
	    timestamp - start, tid);
	printf("\"event\": \"wakeup\", \"target\": \"%d\" }\n",
	    args[0]->pr_lwpid);
}


syscall::read:entry
/execname == "brand"/
{
	self->state = STATE_OFF_CPU_IO_READ;
}

syscall::write:entry
/execname == "brand"/
{
	self->state = STATE_OFF_CPU_IO_WRITE;
}

syscall:::entry
/execname == "brand"/
{
	printf("{ \"time\": \"%d\", \"entity\": \"%d\", \"state\": %d }\n",
	    timestamp - start, tid, STATE_ON_CPU_SYSCALL);
}

syscall::read:return,
syscall::write:return
/execname == "brand"/
{
	self->state = STATE_ON_CPU;
}

syscall:::return
/execname == "brand"/
{
	printf("{ \"time\": \"%d\", \"entity\": \"%d\", \"state\": %d }\n",
	    timestamp - start, tid, STATE_ON_CPU);
}

sched:::off-cpu
/execname == "brand"/
{
	printf("{ \"time\": \"%d\", \"entity\": \"%d\", ",
	    timestamp - start, tid);

	printf("\"state\": %d }\n", self->state != STATE_ON_CPU ?
	    self->state : curthread->t_flag & T_WAKEABLE ?
	    STATE_OFF_CPU_WAITING : STATE_OFF_CPU_BLOCKED);
}

sched:::on-cpu
/execname == "brand"/
{
	self->state = STATE_ON_CPU;
	printf("{ \"time\": \"%d\", \"entity\": \"%d\", ",
	    timestamp - start, tid);
	printf("\"state\": %d }\n", self->state);
}

proc:::lwp-exit
/execname == "brand"/
{
	self->exiting = tid;
}

sched:::off-cpu
/execname != "brand" && self->exiting/
{
	printf("{ \"time\": \"%d\", \"entity\": \"%d\", ",
	    timestamp - start, self->exiting);

	printf("\"state\": %d }\n", STATE_OFF_CPU_DEAD);
	self->exiting = 0;
	self->state = 0;
}

Given this data and resulting statemaps, here is the statemap from running with the default parallelism and batch size:

[statemap: default parallelism and batch size]

This shows that we are only parallel for a small amount of the overall copy time. The problem is that while the file copies are being made by worker threads, the symlink generation is not. Parallelising symlink generation as well makes this look much better from a utilization and performance perspective:

[statemap: with symlink generation parallelised]

(Unsurprisingly, contention is up quite a bit.)
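As a sketch of that change (the WorkItem name is invented for illustration), the queue fed to the worker pool can carry symlinks alongside file copies, so the directory walker no longer creates links serially:

use std::path::PathBuf;

// Hypothetical work item: symlink creation is queued to the same
// worker pool as file copies instead of running serially on the
// directory-walking thread.
enum WorkItem {
    CopyFile { src: PathBuf, dst: PathBuf },
    Symlink { target: PathBuf, link: PathBuf },
}

fn perform(item: WorkItem) -> std::io::Result<()> {
    match item {
        WorkItem::CopyFile { src, dst } => {
            // Large-buffer copy as sketched earlier.
            std::fs::copy(&src, &dst)?;
            Ok(())
        }
        WorkItem::Symlink { target, link } => {
            // symlink(2) on illumos, as on other Unix systems.
            std::os::unix::fs::symlink(&target, &link)
        }
    }
}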

Here is what the relative performance for the two approaches looks like at a batch size of 1, varying the number of worker threads:

[graph: install time vs. worker threads at a batch size of 1, both approaches]

A pretty clear win. And increasing the batch size is also a win; here are the results of varying the batch size at different numbers of workers for both approaches:

[graph: install time across batch sizes and worker counts, both approaches]

The best overall performance is with a thread pool of ~8 workers and a batch size of ~128. I have pushed all of this to the must-go-even-faster branch.

@jclulow (Collaborator, Author) commented Mar 15, 2024

I have done a final smoke test of zone installation on the bench Gimlet that underpins the propolis factory for buildomat VMs, and things appear to function as expected (and are indeed faster).

@jclulow merged commit 0e46273 into main on Mar 15, 2024
@jclulow (Collaborator, Author) commented Mar 15, 2024

New packages built and published:

consolidation/oxide/omicron1-brand-incorporation@1.0.22,5.11:20240315T231141Z
system/zones/brand/omicron1@1.0.22,5.11:20240315T231124Z
system/zones/brand/omicron1/tools@1.0.22,5.11:20240315T231132Z
