
Add option to skip unpacking the same layer in parallel #6318

Closed

dcantah wants to merge 1 commit into containerd:main from dcantah:sameunpack

Conversation


@dcantah dcantah commented Dec 2, 2021

With the way things work right now, there's nothing stopping a parallel unpack of the exact
same layer to a snapshot. The first one to get committed lives on while the other(s) get garbage collected,
so in the end things work out, but it's wasted work regardless. The real issue is that while unpack
should be pretty cheap on Linux, the opposite is true for the Windows and LCOW formats. Kicking off
10 parallel pulls of the same image brings my 6-core machine to a halt and pushes CPU utilization to 100%.
All of this ends up causing drastically slower parallel pull times for images that share layers,
or for parallel pulls of the same image.

I'm not sure if this is a "sound" way to approach this, or if there's a much easier way to go about this change. I tried
to model it in a way that wouldn't disrupt things from a client's perspective, so the logic lives in the metadata snapshotter
layer. The gist of the change is that if a new RemoteContext option is specified, the snapshotter keeps track of which active
snapshots are "in progress". Any other caller of Prepare with the same key as a snapshot that is already in progress
will simply wait for one of two things to occur (a rough sketch follows the list):

  1. The first active snapshot it's waiting on gets removed via Remove (so it was never committed). In this case there
    was likely an error during setup for the first snapshot/unpack, so any waiters continue as normal and create a new snapshot.
  2. The first active snapshot gets committed, which notifies any waiting snapshots that the commit succeeded and
    they can simply exit (the layer now exists, so there's no need to create a new snapshot and unpack again). ErrAlreadyExists is returned to
    all of the waiting snapshots to let the client know that a snapshot with this content already exists.
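
Roughly, that wait logic could look like the following Go sketch; the names and types here (`dedupSnapshotter`, `inflight`) are illustrative stand-ins, not the actual patch:

```go
package snapshotdedup

import (
	"context"
	"errors"
	"sync"
)

var errAlreadyExists = errors.New("snapshot already exists") // stand-in for errdefs.ErrAlreadyExists

// inflight tracks one in-progress snapshot; done is closed on Commit or Remove.
type inflight struct {
	done      chan struct{}
	committed bool // only read after done is closed
}

type dedupSnapshotter struct {
	mu       sync.Mutex
	inflight map[string]*inflight
}

// Prepare waits out any in-progress snapshot with the same key before
// deciding whether to create a new one.
func (s *dedupSnapshotter) Prepare(ctx context.Context, key string) error {
	for {
		s.mu.Lock()
		f, ok := s.inflight[key]
		if !ok {
			// Nobody is unpacking this key: claim it and proceed to
			// create the active snapshot as usual.
			s.inflight[key] = &inflight{done: make(chan struct{})}
			s.mu.Unlock()
			return nil
		}
		s.mu.Unlock()

		select {
		case <-f.done:
			if f.committed {
				// Case 2: the first snapshot committed, so the layer
				// already exists; no new snapshot or unpack needed.
				return errAlreadyExists
			}
			// Case 1: the first snapshot was removed without being
			// committed (likely an error during setup), so loop and
			// try to claim the key ourselves.
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// Commit would set f.committed = true before closing f.done; Remove would
// close f.done with committed left false. Both delete the map entry.
```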

Below are some numbers from testing this fix vs. containerd built from main:

Single Pull With This Commit
PS C:\Users\dcanter\Desktop\ctrd> $a=1..1 | %{start-job {C:\Users\dcanter\Desktop\ctrd\crictl.exe pull cplatpublic.azurecr.io/nanoserver_many_layers:latest}}; $a | wait-job | receive-job; $a | %{$_.psendtime-$_.psbegintime} | % totalseconds
Image is up to date for sha256:f86afbf779d83cf5aa2cc8377ea0b9f1cea140de22e1189270634d3a8ca4cf0e
23.2647532
PS C:\Users\dcanter\Desktop\ctrd> .\crictl.exe rmi cplatpublic.azurecr.io/nanoserver_many_layers:latest
Deleted: cplatpublic.azurecr.io/nanoserver_many_layers:latest

10 Parallel Pulls With This Commit
PS C:\Users\dcanter\Desktop\ctrd> $a=1..10 | %{start-job {C:\Users\dcanter\Desktop\ctrd\crictl.exe pull cplatpublic.azurecr.io/nanoserver_many_layers:latest}}; $a | wait-job | receive-job; $a | %{$_.psendtime-$_.psbegintime} | % totalseconds
Image is up to date for sha256:f86afbf779d83cf5aa2cc8377ea0b9f1cea140de22e1189270634d3a8ca4cf0e
Image is up to date for sha256:f86afbf779d83cf5aa2cc8377ea0b9f1cea140de22e1189270634d3a8ca4cf0e
Image is up to date for sha256:f86afbf779d83cf5aa2cc8377ea0b9f1cea140de22e1189270634d3a8ca4cf0e
Image is up to date for sha256:f86afbf779d83cf5aa2cc8377ea0b9f1cea140de22e1189270634d3a8ca4cf0e
Image is up to date for sha256:f86afbf779d83cf5aa2cc8377ea0b9f1cea140de22e1189270634d3a8ca4cf0e
Image is up to date for sha256:f86afbf779d83cf5aa2cc8377ea0b9f1cea140de22e1189270634d3a8ca4cf0e
Image is up to date for sha256:f86afbf779d83cf5aa2cc8377ea0b9f1cea140de22e1189270634d3a8ca4cf0e
Image is up to date for sha256:f86afbf779d83cf5aa2cc8377ea0b9f1cea140de22e1189270634d3a8ca4cf0e
Image is up to date for sha256:f86afbf779d83cf5aa2cc8377ea0b9f1cea140de22e1189270634d3a8ca4cf0e
Image is up to date for sha256:f86afbf779d83cf5aa2cc8377ea0b9f1cea140de22e1189270634d3a8ca4cf0e
25.3406896
25.1486873
25.0266887
24.8586885
24.7116929
24.5726943
24.4816912
24.375693
24.250694
24.1176895
PS C:\Users\dcanter\Desktop\ctrd> .\crictl.exe rmi cplatpublic.azurecr.io/nanoserver_many_layers:latest
Deleted: cplatpublic.azurecr.io/nanoserver_many_layers:latest

10 Parallel Pulls With Containerd Built Off Main:
PS C:\Users\dcanter\Desktop\ctrd> $a=1..10 | %{start-job {C:\Users\dcanter\Desktop\ctrd\crictl.exe pull cplatpublic.azurecr.io/nanoserver_many_layers:latest}}; $a | wait-job | receive-job; $a | %{$_.psendtime-$_.psbegintime} | % totalseconds
Image is up to date for sha256:f86afbf779d83cf5aa2cc8377ea0b9f1cea140de22e1189270634d3a8ca4cf0e
Image is up to date for sha256:f86afbf779d83cf5aa2cc8377ea0b9f1cea140de22e1189270634d3a8ca4cf0e
Image is up to date for sha256:f86afbf779d83cf5aa2cc8377ea0b9f1cea140de22e1189270634d3a8ca4cf0e
Image is up to date for sha256:f86afbf779d83cf5aa2cc8377ea0b9f1cea140de22e1189270634d3a8ca4cf0e
Image is up to date for sha256:f86afbf779d83cf5aa2cc8377ea0b9f1cea140de22e1189270634d3a8ca4cf0e
Image is up to date for sha256:f86afbf779d83cf5aa2cc8377ea0b9f1cea140de22e1189270634d3a8ca4cf0e
Image is up to date for sha256:f86afbf779d83cf5aa2cc8377ea0b9f1cea140de22e1189270634d3a8ca4cf0e
Image is up to date for sha256:f86afbf779d83cf5aa2cc8377ea0b9f1cea140de22e1189270634d3a8ca4cf0e
Image is up to date for sha256:f86afbf779d83cf5aa2cc8377ea0b9f1cea140de22e1189270634d3a8ca4cf0e
Image is up to date for sha256:f86afbf779d83cf5aa2cc8377ea0b9f1cea140de22e1189270634d3a8ca4cf0e
134.416318
134.2893209
134.1643238
134.027323
133.8813211
133.7333303
133.5593271
133.4253269
133.2963282
133.1643288

@k8s-ci-robot

Hi @dcantah. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


dcantah commented Dec 2, 2021

Checking CI signal, not ready for review (if the draft wasn't a good signal haha). This is an old change that I'd played around with that needs some cleanup.


dmcgowan commented Dec 2, 2021

Kicking off 10 parallel pulls of the same image...

Is this seen happening naturally? You could change this to say "Kicking off 10 parallel pulls of different images" and expect bad performance as well. We do have limiters, at least on the number of parallel fetch requests; is this a case where the content now exists for all of them and the unpack becomes a thundering herd?


theopenlab-ci bot commented Dec 2, 2021

Build succeeded.


dcantah commented Dec 6, 2021

@dmcgowan Sorry for the late reply! We had some services in prod that would get workloads allocated with the same image, which would kick off N pulls of the same image all at the same time. Obviously in that scenario the client can just be a little smarter about when to issue a pull (no need if we just kicked one off for the same image), but I'd still have thought there'd be some sort of de-duping for unpack.

The other issue is if you kick off a bunch of pulls for different images that share layers, and the same layer starts getting unpacked in parallel. This is more of an issue on Windows, where the base layer of an image is typically the "windows" layer that contains the operating system bits and is fairly large and expensive to unpack. There's a high chance that the first layer of Windows images will be the same, as they need to match the host's build, so there's really only a handful of images to base the rest of your image off of. Obviously the problem isn't only with unpacking the Windows layer N times, unpacking any layer N times isn't good, but the pain is felt more on the first 😆.

For instance if you pulled mcr.microsoft.com/windows/nanoserver:1809 and mycoolwindowsimage:latest (where mycoolwindowsimage:latest is based off of nanoserver:1809) in parallel, you'd still end up in the scenario where the first "Windows" layer is going to get unpacked twice when it shouldn't have to be.

We do have limiters at least on number of parallel fetch requests, is this a case where the content now exists for all and the unpack becomes a thundering herd?

Right. Using the 10 parallel pulls example, the fetch requests complete and then 10 unpacks all occur at roughly the same time shortly after. Then they all complete, and shortly after that the other 9 get garbage collected, which just seems like wasted work.

fuweid added a commit to fuweid/containerd that referenced this pull request Mar 19, 2022
Background:

With the current design, the content backend uses a key-lock for long-lived
write transactions. If a content reference has been marked for a write
transaction, other requests on the same reference fail fast with an
unavailable error. Since the metadata plugin is based on boltdb, which only
supports a single writer, the content backend can't block or hold a request
for too long. It requires the client to handle retries itself, like the
OpenWriter backoff-retry helper, but the maximum retry interval can be up
to 2 seconds. If there are several concurrent requests for the same image,
the waiters may all wake up at the same time while only one of them can
continue. Many waiters go back to sleep, so finishing all the pull jobs
takes a long time, and it gets worse if the image has many layers, as
mentioned in issue containerd#4937.
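
For illustration, here is a minimal sketch of the backoff-retry shape the message refers to (the real helper is `OpenWriter` in containerd's content package; `errUnavailable` and `openWriterWithRetry` are hypothetical stand-ins). With N concurrent pulls of the same ref, only one writer wins each round while the rest sleep again, for up to the 2-second cap per round:

```go
package contentretry

import (
	"context"
	"errors"
	"time"
)

// Stand-in for the content store's "unavailable" error.
var errUnavailable = errors.New("ref is locked by another writer")

// openWriterWithRetry retries open with exponential backoff capped at 2s,
// mirroring the behavior the commit message describes.
func openWriterWithRetry(ctx context.Context, open func() error) error {
	backoff := 10 * time.Millisecond
	const maxBackoff = 2 * time.Second
	for {
		err := open()
		if err == nil || !errors.Is(err, errUnavailable) {
			return err
		}
		// The ref is held by another writer: sleep, then try again.
		select {
		case <-time.After(backoff):
		case <-ctx.Done():
			return ctx.Err()
		}
		backoff *= 2
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}
```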

After fetching, the containerd.Pull API allows several handlers to commit
the same ChainID snapshot, but only one can succeed. Since unpacking a
tar.gz is a time-consuming job, unpacking the same ChainID snapshot in
parallel hurts performance.

For instance, Request 2 below doesn't need to prepare and commit; it
should just wait for Request 1 to finish, as mentioned in pull request
containerd#6318.

```text
	Request 1	Request 2

	Prepare
	   |
	   |
	   |
	   |		Prepare
	Commit		   |
			   |
			   |
			   |
			Commit(failed on exist)
```

Both the content backoff retry and the unnecessary unpack hurt performance.

Solution:

Introduce duplicate suppression in the fetch and unpack contexts. The
duplicate suppression uses a key-mutex and single-waiter-notify to support
singleflight. Callers can use the duplicate suppression in the different
PullImage handlers so that we avoid both the unnecessary unpack and the
spin-lock in OpenWriter.
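
As a rough illustration of the suppression idea, here is a sketch using the stock `golang.org/x/sync/singleflight` package instead of the patch's own key-mutex implementation; `unpackOnce` and its callback are hypothetical, but the effect is the same: concurrent unpacks of one ChainID collapse into a single execution.

```go
package unpackdedup

import (
	"golang.org/x/sync/singleflight"
)

var unpackGroup singleflight.Group

// unpackOnce runs doUnpack at most once per chainID at a time; concurrent
// callers with the same chainID block and share the first call's result
// instead of unpacking the layer again themselves.
func unpackOnce(chainID string, doUnpack func() error) error {
	_, err, _ := unpackGroup.Do(chainID, func() (interface{}, error) {
		return nil, doUnpack()
	})
	return err
}
```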

Test Result:

Before enhancement:

```bash
➜  /tmp sudo bash testing.sh "localhost:5000/redis:latest" 20
crictl pull localhost:5000/redis:latest (x20) takes ...

real	1m6.172s
user	0m0.268s
sys	0m0.193s

docker pull localhost:5000/redis:latest (x20) takes ...

real	0m1.324s
user	0m0.441s
sys	0m0.316s

➜  /tmp sudo bash testing.sh "localhost:5000/golang:latest" 20
crictl pull localhost:5000/golang:latest (x20) takes ...

real	1m47.657s
user	0m0.284s
sys	0m0.224s

docker pull localhost:5000/golang:latest (x20) takes ...

real	0m6.381s
user	0m0.488s
sys	0m0.358s
```

With this enhancement:

```bash
➜  /tmp sudo bash testing.sh "localhost:5000/redis:latest" 20
crictl pull localhost:5000/redis:latest (x20) takes ...

real	0m1.140s
user	0m0.243s
sys	0m0.178s

docker pull localhost:5000/redis:latest (x20) takes ...

real	0m1.239s
user	0m0.463s
sys	0m0.275s

➜  /tmp sudo bash testing.sh "localhost:5000/golang:latest" 20
crictl pull localhost:5000/golang:latest (x20) takes ...

real	0m5.546s
user	0m0.217s
sys	0m0.219s

docker pull localhost:5000/golang:latest (x20) takes ...

real	0m6.090s
user	0m0.501s
sys	0m0.331s
```

Test Script:

localhost:5000/{redis|golang}:latest is equal to
docker.io/library/{redis|golang}:latest. The image is held in a local
registry service started by `docker run -d -p 5000:5000 --name registry registry:2`.

```bash

# Usage: testing.sh <image> [pull_times]
image_name="${1}"
pull_times="${2:-10}"

cleanup() {
  ctr image rmi "${image_name}"
  ctr -n k8s.io image rmi "${image_name}"
  crictl rmi "${image_name}"
  docker rmi "${image_name}"
  sleep 2
}

crictl_testing() {
  for idx in $(seq 1 ${pull_times}); do
    crictl pull "${image_name}" > /dev/null 2>&1 &
  done
  wait
}

docker_testing() {
  for idx in $(seq 1 ${pull_times}); do
    docker pull "${image_name}" > /dev/null 2>&1 &
  done
  wait
}

cleanup > /dev/null 2>&1

# Drop the page cache so each run starts cold.
echo 3 > /proc/sys/vm/drop_caches
sleep 3
echo "crictl pull $image_name (x${pull_times}) takes ..."
time crictl_testing
echo

echo 3 > /proc/sys/vm/drop_caches
sleep 3
echo "docker pull $image_name (x${pull_times}) takes ..."
time docker_testing
```

Fixes: containerd#4937
Close: containerd#4985
Close: containerd#6318

Signed-off-by: Wei Fu <fuweid89@gmail.com>
fuweid added a commit to fuweid/containerd that referenced this pull request Mar 19, 2022
fuweid added a commit to fuweid/containerd that referenced this pull request Mar 20, 2022
fuweid added a commit to fuweid/containerd that referenced this pull request Mar 20, 2022
fuweid added a commit to fuweid/containerd that referenced this pull request Mar 20, 2022

dcantah commented Mar 22, 2022

Closing, as #6702 is out to solve the same problem and is a much better approach that looks to be client-side. Thanks @fuweid!!

@dcantah dcantah closed this Mar 22, 2022
fuweid added a commit to fuweid/containerd that referenced this pull request Mar 22, 2022
fuweid added a commit to fuweid/containerd that referenced this pull request Mar 23, 2022
fuweid added a commit to fuweid/containerd that referenced this pull request Mar 24, 2022
fuweid added a commit to fuweid/containerd that referenced this pull request Mar 24, 2022
fuweid added a commit to fuweid/containerd that referenced this pull request Mar 26, 2022
fuweid added a commit to fuweid/containerd that referenced this pull request Mar 27, 2022
fuweid added a commit to fuweid/containerd that referenced this pull request Apr 5, 2022
fuweid added a commit to fuweid/containerd that referenced this pull request Apr 5, 2022
fuweid added a commit to fuweid/containerd that referenced this pull request Apr 5, 2022
fuweid added a commit to fuweid/containerd that referenced this pull request Apr 5, 2022
dcantah pushed a commit to dcantah/containerd that referenced this pull request Apr 11, 2022
(cherry picked from commit 8113758)
Signed-off-by: Daniel Canter <dcanter@microsoft.com>
dims pushed a commit to dims/containerd that referenced this pull request Apr 15, 2022
(cherry picked from commit 8113758)
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
dims pushed a commit to dims/containerd that referenced this pull request Apr 19, 2022
(cherry picked from commit 8113758)
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
dewjam pushed a commit to dewjam/containerd that referenced this pull request Apr 22, 2022
(cherry picked from commit 8113758)
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
(cherry picked from commit 1764ea9)
Signed-off-by: Jim DeWaard <dewaard@amazon.com>
dewjam pushed a commit to dewjam/containerd that referenced this pull request Apr 22, 2022
(cherry picked from commit 8113758)
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
(cherry picked from commit 1764ea9)
Signed-off-by: Jim DeWaard <dewaard@amazon.com>
ambarve pushed a commit to ambarve/containerd that referenced this pull request Jun 28, 2022
(cherry picked from commit 8113758)
Signed-off-by: Amit Barve <ambarve@microsoft.com>
kiashok pushed a commit to kiashok/containerd that referenced this pull request Oct 23, 2024