
Add option to skip unpacking the same layer in parallel #6318

Closed

dcantah wants to merge 1 commit into containerd:main from dcantah:sameunpack

Conversation


@dcantah dcantah commented Dec 2, 2021

With the way things work right now, there's nothing stopping a parallel unpack of the exact
same layer to a snapshot. The first one to get committed lives on while the other(s) get garbage collected,
so in the end things work out, but it's wasted work regardless. The real issue is that while unpack
should be pretty cheap on Linux, the opposite is true for the Windows and LCOW formats. Kicking off
10 parallel pulls of the same image brings my 6-core machine to a halt and pushes CPU utilization to 100%.
All of this ends up causing drastically slower parallel pull times for images that share layers,
or for parallel pulls of the same image.

I'm not sure if this is a "sound" way to approach this, or if there's a much easier way to go about this change. I tried
to model it in a way that wouldn't disrupt things from a client's perspective, so the logic lives in the metadata snapshotter
layer. The gist of the change is that if a new RemoteContext option is specified, the snapshotter keeps track of which active
snapshots are "in progress". Any other caller of Prepare with the same key as a snapshot that is already in progress
will simply wait for one of two things to occur (a rough sketch follows the list):

  1. The first active snapshot it's waiting on gets removed via Remove (so it was never committed). In this case there
    was likely an error during setup for the first snapshot/unpack, so any waiters continue as normal and create a new snapshot.
  2. The first active snapshot gets committed, which notifies any waiting snapshots that the commit succeeded and
    they can simply exit (the layer now exists, so there's no need to create a new snapshot and unpack again). ErrAlreadyExists is returned to
    all of the waiting snapshots to let the client know that a snapshot with this content already exists.
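
Roughly, that wait logic could look like the following Go sketch; the names and types here (`dedupSnapshotter`, `inflight`) are illustrative stand-ins, not the actual patch:

```go
package snapshotdedup

import (
	"context"
	"errors"
	"sync"
)

var errAlreadyExists = errors.New("snapshot already exists") // stand-in for errdefs.ErrAlreadyExists

// inflight tracks one in-progress snapshot; done is closed on Commit or Remove.
type inflight struct {
	done      chan struct{}
	committed bool // only read after done is closed
}

type dedupSnapshotter struct {
	mu       sync.Mutex
	inflight map[string]*inflight
}

// Prepare waits out any in-progress snapshot with the same key before
// deciding whether to create a new one.
func (s *dedupSnapshotter) Prepare(ctx context.Context, key string) error {
	for {
		s.mu.Lock()
		f, ok := s.inflight[key]
		if !ok {
			// Nobody is unpacking this key: claim it and proceed to
			// create the active snapshot as usual.
			s.inflight[key] = &inflight{done: make(chan struct{})}
			s.mu.Unlock()
			return nil
		}
		s.mu.Unlock()

		select {
		case <-f.done:
			if f.committed {
				// Case 2: the first snapshot committed, so the layer
				// already exists; no new snapshot or unpack needed.
				return errAlreadyExists
			}
			// Case 1: the first snapshot was removed without being
			// committed (likely an error during setup), so loop and
			// try to claim the key ourselves.
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// Commit would set f.committed = true before closing f.done; Remove would
// close f.done with committed left false. Both delete the map entry.
```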

Below are some numbers from testing this fix vs. containerd built from main:

Single Pull With This Commit
PS C:\Users\dcanter\Desktop\ctrd> $a=1..1 | %{start-job {C:\Users\dcanter\Desktop\ctrd\crictl.exe pull cplatpublic.azurecr.io/nanoserver_many_layers:latest}}; $a | wait-job | receive-job; $a | %{$_.psendtime-$_.psbegintime} | % totalseconds
Image is up to date for sha256:f86afbf779d83cf5aa2cc8377ea0b9f1cea140de22e1189270634d3a8ca4cf0e
23.2647532
PS C:\Users\dcanter\Desktop\ctrd> .\crictl.exe rmi cplatpublic.azurecr.io/nanoserver_many_layers:latest
Deleted: cplatpublic.azurecr.io/nanoserver_many_layers:latest

10 Parallel Pulls With This Commit
PS C:\Users\dcanter\Desktop\ctrd> $a=1..10 | %{start-job {C:\Users\dcanter\Desktop\ctrd\crictl.exe pull cplatpublic.azurecr.io/nanoserver_many_layers:latest}}; $a | wait-job | receive-job; $a | %{$_.psendtime-$_.psbegintime} | % totalseconds
Image is up to date for sha256:f86afbf779d83cf5aa2cc8377ea0b9f1cea140de22e1189270634d3a8ca4cf0e
Image is up to date for sha256:f86afbf779d83cf5aa2cc8377ea0b9f1cea140de22e1189270634d3a8ca4cf0e
Image is up to date for sha256:f86afbf779d83cf5aa2cc8377ea0b9f1cea140de22e1189270634d3a8ca4cf0e
Image is up to date for sha256:f86afbf779d83cf5aa2cc8377ea0b9f1cea140de22e1189270634d3a8ca4cf0e
Image is up to date for sha256:f86afbf779d83cf5aa2cc8377ea0b9f1cea140de22e1189270634d3a8ca4cf0e
Image is up to date for sha256:f86afbf779d83cf5aa2cc8377ea0b9f1cea140de22e1189270634d3a8ca4cf0e
Image is up to date for sha256:f86afbf779d83cf5aa2cc8377ea0b9f1cea140de22e1189270634d3a8ca4cf0e
Image is up to date for sha256:f86afbf779d83cf5aa2cc8377ea0b9f1cea140de22e1189270634d3a8ca4cf0e
Image is up to date for sha256:f86afbf779d83cf5aa2cc8377ea0b9f1cea140de22e1189270634d3a8ca4cf0e
Image is up to date for sha256:f86afbf779d83cf5aa2cc8377ea0b9f1cea140de22e1189270634d3a8ca4cf0e
25.3406896
25.1486873
25.0266887
24.8586885
24.7116929
24.5726943
24.4816912
24.375693
24.250694
24.1176895
PS C:\Users\dcanter\Desktop\ctrd> .\crictl.exe rmi cplatpublic.azurecr.io/nanoserver_many_layers:latest
Deleted: cplatpublic.azurecr.io/nanoserver_many_layers:latest

10 Parallel Pulls With Containerd Built Off Main:
PS C:\Users\dcanter\Desktop\ctrd> $a=1..10 | %{start-job {C:\Users\dcanter\Desktop\ctrd\crictl.exe pull cplatpublic.azurecr.io/nanoserver_many_layers:latest}}; $a | wait-job | receive-job; $a | %{$_.psendtime-$_.psbegintime} | % totalseconds
Image is up to date for sha256:f86afbf779d83cf5aa2cc8377ea0b9f1cea140de22e1189270634d3a8ca4cf0e
Image is up to date for sha256:f86afbf779d83cf5aa2cc8377ea0b9f1cea140de22e1189270634d3a8ca4cf0e
Image is up to date for sha256:f86afbf779d83cf5aa2cc8377ea0b9f1cea140de22e1189270634d3a8ca4cf0e
Image is up to date for sha256:f86afbf779d83cf5aa2cc8377ea0b9f1cea140de22e1189270634d3a8ca4cf0e
Image is up to date for sha256:f86afbf779d83cf5aa2cc8377ea0b9f1cea140de22e1189270634d3a8ca4cf0e
Image is up to date for sha256:f86afbf779d83cf5aa2cc8377ea0b9f1cea140de22e1189270634d3a8ca4cf0e
Image is up to date for sha256:f86afbf779d83cf5aa2cc8377ea0b9f1cea140de22e1189270634d3a8ca4cf0e
Image is up to date for sha256:f86afbf779d83cf5aa2cc8377ea0b9f1cea140de22e1189270634d3a8ca4cf0e
Image is up to date for sha256:f86afbf779d83cf5aa2cc8377ea0b9f1cea140de22e1189270634d3a8ca4cf0e
Image is up to date for sha256:f86afbf779d83cf5aa2cc8377ea0b9f1cea140de22e1189270634d3a8ca4cf0e
134.416318
134.2893209
134.1643238
134.027323
133.8813211
133.7333303
133.5593271
133.4253269
133.2963282
133.1643288

@k8s-ci-robot

Hi @dcantah. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


dcantah commented Dec 2, 2021

Checking CI signal, not ready for review (if the draft wasn't a good signal haha). This is an old change that I'd played around with that needs some cleanup.


dmcgowan commented Dec 2, 2021

Kicking off 10 parallel pulls of the same image...

Is this seen happening naturally? You could change this to say "Kicking off 10 parallel pulls of different images" and expect bad performance as well. We do have limiters, at least on the number of parallel fetch requests; is this a case where the content now exists for all of them and the unpack becomes a thundering herd?


theopenlab-ci bot commented Dec 2, 2021

Build succeeded.


dcantah commented Dec 6, 2021

@dmcgowan Sorry for the late reply! We had some services in prod that would get workloads allocated with the same image, which would kick off N pulls of the same image all at the same time. Obviously in that scenario the client can just be a little smarter about when to issue a pull (no need if we just kicked one off for the same image), but I'd still have thought there'd be some sort of de-duping for unpack.

The other issue is if you kick off a bunch of pulls for different images that share layers, and the same layer starts getting unpacked in parallel. This is more of an issue on Windows, where the base layer of an image is typically the "windows" layer that contains the operating system bits and is fairly large and expensive to unpack. There's a high chance that the first layer of Windows images will be the same, as they need to match the host's build, so there's really only a handful of images to base the rest of your image off of. Obviously the problem isn't only with unpacking the Windows layer N times, unpacking any layer N times isn't good, but the pain is felt more on the first 😆.

For instance if you pulled mcr.microsoft.com/windows/nanoserver:1809 and mycoolwindowsimage:latest (where mycoolwindowsimage:latest is based off of nanoserver:1809) in parallel, you'd still end up in the scenario where the first "Windows" layer is going to get unpacked twice when it shouldn't have to be.

We do have limiters at least on number of parallel fetch requests, is this a case where the content now exists for all and the unpack becomes a thundering herd?

Right. Using the 10 parallel pulls example, the fetch requests complete and then 10 unpacks all occur at roughly the same time shortly after. Then they all complete, and shortly after that the other 9 get garbage collected, which just seems like wasted work.

fuweid added a commit to fuweid/containerd that referenced this pull request Mar 19, 2022
Background:

With the current design, the content backend uses a key-lock for long-lived
write transactions. If a content reference has been marked for a write
transaction, other requests on the same reference fail fast with an
unavailable error. Since the metadata plugin is based on boltdb, which only
supports a single writer, the content backend can't block or hold a request
for too long. It requires the client to handle retries itself, like the
OpenWriter backoff-retry helper, but the maximum retry interval can be up
to 2 seconds. If there are several concurrent requests for the same image,
the waiters may all wake up at the same time while only one of them can
continue. Many waiters go back to sleep, so finishing all the pull jobs
takes a long time, and it gets worse if the image has many layers, as
mentioned in issue containerd#4937.
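
For illustration, here is a minimal sketch of the backoff-retry shape the message refers to (the real helper is `OpenWriter` in containerd's content package; `errUnavailable` and `openWriterWithRetry` are hypothetical stand-ins). With N concurrent pulls of the same ref, only one writer wins each round while the rest sleep again, for up to the 2-second cap per round:

```go
package contentretry

import (
	"context"
	"errors"
	"time"
)

// Stand-in for the content store's "unavailable" error.
var errUnavailable = errors.New("ref is locked by another writer")

// openWriterWithRetry retries open with exponential backoff capped at 2s,
// mirroring the behavior the commit message describes.
func openWriterWithRetry(ctx context.Context, open func() error) error {
	backoff := 10 * time.Millisecond
	const maxBackoff = 2 * time.Second
	for {
		err := open()
		if err == nil || !errors.Is(err, errUnavailable) {
			return err
		}
		// The ref is held by another writer: sleep, then try again.
		select {
		case <-time.After(backoff):
		case <-ctx.Done():
			return ctx.Err()
		}
		backoff *= 2
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}
```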

After fetching, the containerd.Pull API allows several handlers to commit
the same ChainID snapshot, but only one can succeed. Since unpacking a
tar.gz is a time-consuming job, unpacking the same ChainID snapshot in
parallel hurts performance.

For instance, Request 2 below doesn't need to prepare and commit; it
should just wait for Request 1 to finish, as mentioned in pull request
containerd#6318.

```text
	Request 1	Request 2

	Prepare
	   |
	   |
	   |
	   |		Prepare
	Commit		   |
			   |
			   |
			   |
			Commit(failed on exist)
```

Both the content backoff retry and the unnecessary unpack hurt performance.

Solution:

Introduce duplicate suppression in the fetch and unpack contexts. The
duplicate suppression uses a key-mutex and single-waiter-notify to support
singleflight. Callers can use the duplicate suppression in the different
PullImage handlers so that we avoid both the unnecessary unpack and the
spin-lock in OpenWriter.
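
As a rough illustration of the suppression idea, here is a sketch using the stock `golang.org/x/sync/singleflight` package instead of the patch's own key-mutex implementation; `unpackOnce` and its callback are hypothetical, but the effect is the same: concurrent unpacks of one ChainID collapse into a single execution.

```go
package unpackdedup

import (
	"golang.org/x/sync/singleflight"
)

var unpackGroup singleflight.Group

// unpackOnce runs doUnpack at most once per chainID at a time; concurrent
// callers with the same chainID block and share the first call's result
// instead of unpacking the layer again themselves.
func unpackOnce(chainID string, doUnpack func() error) error {
	_, err, _ := unpackGroup.Do(chainID, func() (interface{}, error) {
		return nil, doUnpack()
	})
	return err
}
```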

Test Result:

Before enhancement:

```bash
➜  /tmp sudo bash testing.sh "localhost:5000/redis:latest" 20
crictl pull localhost:5000/redis:latest (x20) takes ...

real	1m6.172s
user	0m0.268s
sys	0m0.193s

docker pull localhost:5000/redis:latest (x20) takes ...

real	0m1.324s
user	0m0.441s
sys	0m0.316s

➜  /tmp sudo bash testing.sh "localhost:5000/golang:latest" 20
crictl pull localhost:5000/golang:latest (x20) takes ...

real	1m47.657s
user	0m0.284s
sys	0m0.224s

docker pull localhost:5000/golang:latest (x20) takes ...

real	0m6.381s
user	0m0.488s
sys	0m0.358s
```

With this enhancement:

```bash
➜  /tmp sudo bash testing.sh "localhost:5000/redis:latest" 20
crictl pull localhost:5000/redis:latest (x20) takes ...

real	0m1.140s
user	0m0.243s
sys	0m0.178s

docker pull localhost:5000/redis:latest (x20) takes ...

real	0m1.239s
user	0m0.463s
sys	0m0.275s

➜  /tmp sudo bash testing.sh "localhost:5000/golang:latest" 20
crictl pull localhost:5000/golang:latest (x20) takes ...

real	0m5.546s
user	0m0.217s
sys	0m0.219s

docker pull localhost:5000/golang:latest (x20) takes ...

real	0m6.090s
user	0m0.501s
sys	0m0.331s
```

Test Script:

localhost:5000/{redis|golang}:latest is equal to
docker.io/library/{redis|golang}:latest. The image is held in a local
registry service started by `docker run -d -p 5000:5000 --name registry registry:2`.

```bash

# Usage: testing.sh <image> [pull_times]
image_name="${1}"
pull_times="${2:-10}"

cleanup() {
  ctr image rmi "${image_name}"
  ctr -n k8s.io image rmi "${image_name}"
  crictl rmi "${image_name}"
  docker rmi "${image_name}"
  sleep 2
}

crictl_testing() {
  for idx in $(seq 1 ${pull_times}); do
    crictl pull "${image_name}" > /dev/null 2>&1 &
  done
  wait
}

docker_testing() {
  for idx in $(seq 1 ${pull_times}); do
    docker pull "${image_name}" > /dev/null 2>&1 &
  done
  wait
}

cleanup > /dev/null 2>&1

# Drop the page cache so each run starts cold.
echo 3 > /proc/sys/vm/drop_caches
sleep 3
echo "crictl pull $image_name (x${pull_times}) takes ..."
time crictl_testing
echo

echo 3 > /proc/sys/vm/drop_caches
sleep 3
echo "docker pull $image_name (x${pull_times}) takes ..."
time docker_testing
```

Fixes: containerd#4937
Close: containerd#4985
Close: containerd#6318

Signed-off-by: Wei Fu <fuweid89@gmail.com>
fuweid added a commit to fuweid/containerd that referenced this pull request Mar 19, 2022
fuweid added a commit to fuweid/containerd that referenced this pull request Mar 20, 2022
fuweid added a commit to fuweid/containerd that referenced this pull request Mar 20, 2022
fuweid added a commit to fuweid/containerd that referenced this pull request Mar 20, 2022

dcantah commented Mar 22, 2022

Closing, as #6702 is out to solve the same problem and is a much better approach that looks to be client-side. Thanks @fuweid!!

@dcantah dcantah closed this Mar 22, 2022
fuweid added a commit to fuweid/containerd that referenced this pull request Mar 22, 2022
fuweid added a commit to fuweid/containerd that referenced this pull request Mar 23, 2022
fuweid added a commit to fuweid/containerd that referenced this pull request Mar 24, 2022
fuweid added a commit to fuweid/containerd that referenced this pull request Mar 24, 2022
fuweid added a commit to fuweid/containerd that referenced this pull request Mar 26, 2022
fuweid added a commit to fuweid/containerd that referenced this pull request Mar 27, 2022
fuweid added a commit to fuweid/containerd that referenced this pull request Apr 5, 2022
fuweid added a commit to fuweid/containerd that referenced this pull request Apr 5, 2022
fuweid added a commit to fuweid/containerd that referenced this pull request Apr 5, 2022
fuweid added a commit to fuweid/containerd that referenced this pull request Apr 5, 2022
dcantah pushed a commit to dcantah/containerd that referenced this pull request Apr 11, 2022
(cherry picked from commit 8113758)
Signed-off-by: Daniel Canter <dcanter@microsoft.com>
dims pushed a commit to dims/containerd that referenced this pull request Apr 15, 2022
(cherry picked from commit 8113758)
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
dims pushed a commit to dims/containerd that referenced this pull request Apr 19, 2022
(cherry picked from commit 8113758)
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
dewjam pushed a commit to dewjam/containerd that referenced this pull request Apr 22, 2022
(cherry picked from commit 8113758)
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
(cherry picked from commit 1764ea9)
Signed-off-by: Jim DeWaard <dewaard@amazon.com>
dewjam pushed a commit to dewjam/containerd that referenced this pull request Apr 22, 2022
(cherry picked from commit 8113758)
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
(cherry picked from commit 1764ea9)
Signed-off-by: Jim DeWaard <dewaard@amazon.com>
ambarve pushed a commit to ambarve/containerd that referenced this pull request Jun 28, 2022
(cherry picked from commit 8113758)
Signed-off-by: Amit Barve <ambarve@microsoft.com>
kiashok pushed a commit to kiashok/containerd that referenced this pull request Oct 23, 2024