improve performance of zone install #12
Conversation
I explored this a bit; here are the results from varying parallelism:

Wondering if batching operations wouldn't make it a tad more efficient, I modified the fix to allow for such batching -- and indeed, it was a tad more efficient, but truly only a tad:

And looking at varying thread parallelism at the roughly optimal batch size of 32:

It feels like we should be able to squeeze a bit more out of this, but for our purposes: tuning the parallelism to ~10 (8-12) should result in the best performance; allowing the work to be batched will buy another couple of percent.
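To make the batching concrete, here is a minimal sketch of the worker-pool shape being measured, assuming a hypothetical `install_file` helper standing in for the actual copy logic (the real implementation in this repository differs in its details):

```rust
use std::path::PathBuf;
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// Tunables per the measurements above: ~10 workers (8-12) and a
// batch size of ~32 were the sweet spot at this stage.
const WORKERS: usize = 10;
const BATCH_SIZE: usize = 32;

// Hypothetical per-file work; stands in for the actual copy logic.
fn install_file(src: &PathBuf) {
    let _ = src; // ... copy src into the zone root ...
}

fn install_all(files: Vec<PathBuf>) {
    let (tx, rx) = mpsc::channel::<Vec<PathBuf>>();
    let rx = Arc::new(Mutex::new(rx));

    let workers: Vec<_> = (0..WORKERS)
        .map(|_| {
            let rx = Arc::clone(&rx);
            thread::spawn(move || loop {
                // Pulling a whole batch per recv() amortizes the
                // channel synchronization across BATCH_SIZE files.
                let batch = match rx.lock().unwrap().recv() {
                    Ok(batch) => batch,
                    Err(_) => return, // channel closed: no more work
                };
                for f in &batch {
                    install_file(f);
                }
            })
        })
        .collect();

    for chunk in files.chunks(BATCH_SIZE) {
        tx.send(chunk.to_vec()).unwrap();
    }
    drop(tx); // close the channel so idle workers exit

    for w in workers {
        w.join().unwrap();
    }
}

fn main() {
    install_all(vec![PathBuf::from("/tmp/example")]);
}
```

The `Arc<Mutex<Receiver>>` is the standard way to share a single `mpsc` receiver among several workers; batching reduces how often that shared lock is taken.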
I have pulled in Bryan's latest additions and moved the tuneables override files into
I wanted to understand why we weren't seeing much improvement from the increased parallelism. To explore thread activity, I wrote a DTrace script to generate statemap data:

```
#!/usr/sbin/dtrace -Cs
#pragma D option quiet
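/* T_WAKEABLE: kernel t_flag bit, set while a thread sleeps interruptibly */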
#define T_WAKEABLE 0x0002
typedef enum {
STATE_ON_CPU = 0,
STATE_ON_CPU_SYSCALL,
STATE_OFF_CPU_WAITING,
STATE_OFF_CPU_BLOCKED,
STATE_OFF_CPU_IO_READ,
STATE_OFF_CPU_IO_WRITE,
STATE_OFF_CPU_DEAD,
STATE_MAX
} state_t;
#define STATE_METADATA(_state, _str, _color) \
printf("\t\t\"%s\": {\"value\": %d, \"color\": \"%s\" }%s\n", \
_str, _state, _color, _state < STATE_MAX - 1 ? "," : "");
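/*
 * Emit the statemap JSON header: start time, title, host, and the
 * legend mapping each state to a color.
 */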
BEGIN
{
wall = walltimestamp;
printf("{\n\t\"start\": [ %d, %d ],\n",
wall / 1000000000, wall % 1000000000);
printf("\t\"title\": \"installing Omicron brand\",\n");
printf("\t\"host\": \"%s\",\n", `utsname.nodename);
printf("\t\"entityKind\": \"Thread\",\n");
printf("\t\"states\": {\n");
STATE_METADATA(STATE_ON_CPU, "on-cpu", "#DAF7A6")
STATE_METADATA(STATE_ON_CPU_SYSCALL, "on-cpu-kernel", "#6d7d51")
STATE_METADATA(STATE_OFF_CPU_WAITING, "off-cpu-waiting", "#f9f9f9")
STATE_METADATA(STATE_OFF_CPU_BLOCKED, "off-cpu-blocked", "#C70039")
STATE_METADATA(STATE_OFF_CPU_IO_READ, "off-cpu-io-read", "#FFC300")
STATE_METADATA(STATE_OFF_CPU_IO_WRITE, "off-cpu-io-write", "#338AFF")
STATE_METADATA(STATE_OFF_CPU_DEAD, "off-cpu-dead", "#E0E0E0")
printf("\t}\n}\n");
start = timestamp;
}
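/*
 * Emit an event for each LWP that the brand command creates.
 */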
proc:::lwp-create
/execname == "brand"/
{
printf("{ \"time\": \"%d\", \"entity\": \"%d\", ",
timestamp - start, tid);
printf("\"event\": \"create\", \"target\": \"%d\" }\n",
args[0]->pr_lwpid);
}
sched:::wakeup
/execname == "brand" && args[1]->pr_pid == pid/
{
printf("{ \"time\": \"%d\", \"entity\": \"%d\", ",
timestamp - start, tid);
printf("\"event\": \"wakeup\", \"target\": \"%d\" }\n",
args[0]->pr_lwpid);
}
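/*
 * Note reads and writes on entry so that any off-cpu time inside them
 * is attributed to I/O rather than to waiting or blocking.
 */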
syscall::read:entry
/execname == "brand"/
{
self->state = STATE_OFF_CPU_IO_READ;
}
syscall::write:entry
/execname == "brand"/
{
self->state = STATE_OFF_CPU_IO_WRITE;
}
syscall:::entry
/execname == "brand"/
{
printf("{ \"time\": \"%d\", \"entity\": \"%d\", \"state\": %d }\n",
timestamp - start, tid, STATE_ON_CPU_SYSCALL);
}
syscall::read:return,
syscall::write:return
/execname == "brand"/
{
self->state = STATE_ON_CPU;
}
syscall:::return
/execname == "brand"/
{
printf("{ \"time\": \"%d\", \"entity\": \"%d\", \"state\": %d }\n",
timestamp - start, tid, STATE_ON_CPU);
}
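/*
 * On going off CPU: if we were in a read or write, charge the time to
 * I/O; otherwise use T_WAKEABLE to distinguish an interruptible wait
 * for work from being blocked (e.g., on a lock).
 */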
sched:::off-cpu
/execname == "brand"/
{
printf("{ \"time\": \"%d\", \"entity\": \"%d\", ",
timestamp - start, tid);
printf("\"state\": %d }\n", self->state != STATE_ON_CPU ?
self->state : curthread->t_flag & T_WAKEABLE ?
STATE_OFF_CPU_WAITING : STATE_OFF_CPU_BLOCKED);
}
sched:::on-cpu
/execname == "brand"/
{
self->state = STATE_ON_CPU;
printf("{ \"time\": \"%d\", \"entity\": \"%d\", ",
timestamp - start, tid);
printf("\"state\": %d }\n", self->state);
}
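/*
 * Record an exiting LWP's final transition to the dead state.
 */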
proc:::lwp-exit
/execname == "brand"/
{
self->exiting = tid;
}
sched:::off-cpu
/execname != "brand" && self->exiting/
{
printf("{ \"time\": \"%d\", \"entity\": \"%d\", ",
timestamp - start, self->exiting);
printf("\"state\": %d }\n", STATE_OFF_CPU_DEAD);
self->exiting = 0;
self->state = 0;
}
```

Given this data and the resulting statemaps, here is the statemap from running with the default parallelism and batch size:

This shows that we are only parallel for a small amount of the overall copy time. The problem is that while the file copies are being made by worker threads, the symlink generation is not. Making this parallel at all makes things look much better from a utilization and performance perspective:

(Unsurprisingly, contention is up quite a bit.)

Here is what the relative performance of the two approaches looks like at a batch size of 1, varying the number of worker threads:

A pretty clear win. And increasing the batch size is also a win; here are the results of varying the batch size at different numbers of workers for both approaches:

The best overall performance is with a thread pool of ~8 workers and a batch size of ~128. I have pushed all of this to the must-go-even-faster branch.
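Structurally, the change amounts to making symlink creation just another kind of work item for the pool. A minimal sketch of that shape, under the same assumptions as the earlier sketch (the names here are illustrative, not the actual implementation):

```rust
use std::os::unix::fs::symlink;
use std::path::PathBuf;

// Hypothetical work items: with symlink creation folded into the
// pool, workers stay busy for the whole install, not just the copies.
enum WorkItem {
    CopyFile { src: PathBuf, dst: PathBuf },
    Symlink { target: PathBuf, link: PathBuf },
}

impl WorkItem {
    fn run(&self) -> std::io::Result<()> {
        match self {
            WorkItem::CopyFile { src, dst } => {
                std::fs::copy(src, dst)?;
                Ok(())
            }
            // Previously done serially on the main thread while the
            // workers copied files; now just another item in the queue.
            WorkItem::Symlink { target, link } => symlink(target, link),
        }
    }
}

fn main() -> std::io::Result<()> {
    WorkItem::Symlink {
        target: PathBuf::from("../lib/libfoo.so.1"),
        link: PathBuf::from("/tmp/libfoo.so"),
    }
    .run()
}
```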
I have done a final smoke test of zone installation on the bench Gimlet that underpins the propolis factory for VMs for buildomat, and things appear to function as expected (and are indeed faster).
New packages built and published:
This change improves zone install performance through two means:
- parallelizing the installation work -- file copies and symlink creation alike -- across a pool of worker threads
- batching the operations handed to each worker to reduce synchronization overhead
On the Gimlet where I did a brief smoke test, we go from:
... to ...
... which is at least a bit faster!
NOTE: I have not yet done any serious verification of the actual result of the copies here. It appears to work, and I have spot-checked a file here and there, but I'll want to do a more rigorous evaluation of both the metadata and the contents of the files resulting from this change.
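For that later verification, something along these lines would do: a std-only sketch (the paths are placeholders) that walks two trees and compares symlink targets and file contents byte-for-byte; a metadata check would additionally compare the `fs::Metadata` fields:

```rust
use std::fs;
use std::path::Path;

// Recursively compare regular files and symlinks under two roots,
// printing any divergence in contents or link targets.
fn compare_trees(a: &Path, b: &Path) -> std::io::Result<()> {
    for entry in fs::read_dir(a)? {
        let entry = entry?;
        let pa = entry.path();
        let pb = b.join(entry.file_name());
        let meta = fs::symlink_metadata(&pa)?;
        if meta.file_type().is_symlink() {
            if fs::read_link(&pa)? != fs::read_link(&pb)? {
                println!("symlink target mismatch: {}", pa.display());
            }
        } else if meta.is_dir() {
            compare_trees(&pa, &pb)?;
        } else if fs::read(&pa)? != fs::read(&pb)? {
            println!("content mismatch: {}", pa.display());
        }
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    // Hypothetical paths: the installed zone root vs. a reference copy.
    compare_trees(Path::new("/zone/root"), Path::new("/reference/root"))
}
```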