Use ostree-rs-ext 0.5.0 #3139

Merged 1 commit into coreos:main on Nov 15, 2021

Conversation

@cgwalters (Member) commented on Sep 26, 2021:

Comment on lines 28 to 30
match &imgref.sigverify {
ostree_container::SignatureSource::OstreeRemote(_) => {
cgwalters (Member, Author):

The logic here: right now, if we're pulling an ostree-remote-image (i.e. ostree-signed), we default to un-encapsulation, since we don't yet support layered, ostree-signed imports. If we're pulling an image that isn't ostree-signed, we use layering.

But basically this is ostreedev/ostree-rs-ext#99 (comment)
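
Roughly, the decision looks like this (a minimal sketch; use_layered_import is a made-up helper name, but OstreeImageReference and SignatureSource are the real ostree-rs-ext types):

use ostree_ext::container::{OstreeImageReference, SignatureSource};

// Sketch only: pick between plain un-encapsulation and the layered
// "container store" path based on how the image is meant to be verified.
fn use_layered_import(imgref: &OstreeImageReference) -> bool {
    match &imgref.sigverify {
        // Signed via an ostree remote: layered, ostree-signed imports aren't
        // supported yet, so fall back to un-encapsulating the single commit.
        SignatureSource::OstreeRemote(_) => false,
        // Images that aren't ostree-signed take the layering path.
        _ => true,
    }
}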

@cgwalters marked this pull request as draft on October 6, 2021 17:38
@cgwalters changed the title from "WIP: Update to ostree-rs-ext 0.4" to "Use ostree-rs-ext (0.4.alpha) from git main for now" on Oct 7, 2021
@cgwalters (Member, Author):

OK, updated. But this blocks on resolution of #3160.

@cgwalters (Member, Author):

OK, I updated this. It turned out the tests required some changes, because we now default to the "container store" path, which currently creates a new commit even when there's only one layer.

That'd be nice to fix, but I need to think about how to do so most elegantly.

For now, I tweaked the test to compare content checksums.

@cgwalters (Member, Author):

Blah, https://bodhi.fedoraproject.org/updates/FEDORA-2021-1854bd6a04 has to hit bodhi stable first.

@cgwalters (Member, Author):

Oh fun, I hit:

[root@cosa-devsh ~]# rpm-ostree rebase --experimental ostree-unverified-image:registry:quay.io/cgwalters/fcos-derivation-example
Pulling manifest: ostree-unverified-image:docker://quay.io/cgwalters/fcos-derivation-example
Importing: ostree-unverified-image:docker://quay.io/cgwalters/fcos-derivation-example (digest: sha256:865f204378bee8060b83881cb2d6c858d12ee970ae4e3c6f020155676977a454)
Downloading base layer: sha256:32d62d73008776158de3532b560bb4d1c087f9600eecc732594a5a31cdccaf5b (745.5 MB)
Downloading layer: sha256:809c66ca43fa2ac72c15a83adb27b1e1794a7f5ae8463b3a8c559164a9201d8b (1.1 MB)
Downloading layer: sha256:0f79a34559f64a74b08fd5bbad34072f8c1a7784f33bd750d878c7f387f1b5a1 (302 bytes)
Downloading layer: sha256:655734cf6267d3ba1506d4dac15329679216c9169093a5a99b60b7484e4320d4 (6.4 MB)
Staging deployment... done
**
rpm-ostreed:ERROR:src/lib/rpmostree-package.c:278:_rpm_ostree_package_list_for_commit: assertion failed (pkglist->len > 0): (0 > 0)
Aborted (core dumped)

Which happens I think because the derived commit written by ostree doesn't have the packagelist metadata. Hmmmm...I guess in theory, what we need is for the ostree APIs here to expose a hook/callback so that rpm-ostree can inject that data.

But, we should also be robust to the absence of it.
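
For the "robust to absence" part, a minimal sketch of the idea (in Rust for illustration — the real code is C in rpmostree-package.c, and this helper is hypothetical) would be to treat a commit without the injected package-list metadata as having no RPM data rather than asserting:

use glib::{Variant, VariantDict};

// Hypothetical helper: returns None for commits that lack the package-list
// metadata (e.g. derived commits written by ostree-ext today), so callers can
// fall back to an empty list instead of aborting.
fn package_list_for_commit(commit_meta: &VariantDict) -> Option<Variant> {
    // Assumption: "rpmostree.rpmdb.pkglist" is the metadata key rpm-ostree
    // injects into the commits it composes.
    commit_meta.lookup_value("rpmostree.rpmdb.pkglist", None)
}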

cgwalters added a commit to cgwalters/os that referenced this pull request Oct 19, 2021
@cgwalters force-pushed the bump-ostree-rs-ext-0.4 branch 2 times, most recently from 7de4033 to 772c272, on October 28, 2021 18:51
@cgwalters (Member, Author):

/test all

OK, this is updated now.

@cgwalters (Member, Author):

Argh, buildroot container out of date again.

cgwalters added a commit to cgwalters/os that referenced this pull request Oct 29, 2021
cgwalters added a commit to cgwalters/os that referenced this pull request Oct 29, 2021
@cgwalters marked this pull request as ready for review on November 2, 2021 13:57
@cgwalters (Member, Author):

Lifting draft on this one.

@cgwalters changed the title from "Use ostree-rs-ext (0.4.alpha) from git main for now" to "Use ostree-rs-ext 0.4.0" on Nov 2, 2021
@cgwalters force-pushed the bump-ostree-rs-ext-0.4 branch 3 times, most recently from 5d09d75 to 597ac8d, on November 9, 2021 19:50
@cgwalters (Member, Author):

OK 🆕 - the ex-container test was basically duplicating the container-image test; I've condensed them down to one, and extended that to test layering.

In the process of doing that I hit a few issues — e.g. ostreedev/ostree-rs-ext#153 would make local builds much more ergonomic — but I've worked around that by copying the image again.

@cgwalters (Member, Author):

/retest

if (cur_digest)
auto refspec_s = std::string(refspec);
auto import = CXX_TRY_VAL(pull_container(*self->repo, *cancellable, refspec_s), error);
if (!import->changed)
Review comment from a project member:

Does this change detection logic correctly handle the case where a newer image was already pulled but not yet deployed?

cgwalters (Member, Author) replied:

Huh, good catch. I think the bug here dates to 7ac7ef4.

AFAICS, the return there was always unnecessary. But testing now.

cgwalters (Member, Author) replied:

OK — actually, being able to remove that in turn means we can remove the now-unnecessary changed boolean, which in turn means we can drop that wrapper struct, etc. 🎉
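
To illustrate the shape of that cleanup (the types here are illustrative, not the actual rpm-ostree structs):

// Before (roughly): the pull path returned a wrapper carrying a `changed`
// flag, which callers used for an early return.
// After (roughly): return just the imported state; a caller that cares whether
// anything changed compares the digest it already tracks against the new one.
struct ContainerImport {
    merge_commit: String,
    image_digest: String,
}

fn is_changed(previous_digest: Option<&str>, import: &ContainerImport) -> bool {
    previous_digest != Some(import.image_digest.as_str())
}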

Cargo.toml Outdated
@@ -53,7 +53,7 @@ openat = "0.1.21"
 openat-ext = "^0.2.2"
 openssl = "0.10.38"
 os-release = "0.1.0"
-ostree-ext = "0.3.0"
+ostree-ext = { git = "https://github.com/ostreedev/ostree-rs-ext", branch = "main" }
Review comment from a project member:

The commit message says 0.5.0 but this is pointing to the main branch. Are you planning to cut another release and update this?

cgwalters (Member, Author) replied:

Yeah, I'll do that soon; I wanted to have green CI here first, which I think we mostly have. But then I started doing more API changes, so I'll land those first.

cgwalters (Member, Author) replied:

And that's done!

jlebon previously approved these changes on Nov 10, 2021

@jlebon (Member) left a comment:

LGTM!

@cgwalters (Member, Author):

Oops, forgot to push some local bits. Done now.

@lucab (Contributor) commented on Nov 12, 2021:

/retest

@cgwalters (Member, Author):

Hmm. This may be reproducing a hang I've seen sometimes locally but haven't managed to track down. I thought it would at least be surfaced by containers/containers-image-proxy-rs#3, but perhaps not.

@cgwalters (Member, Author):

OK, I reproduced the hang locally in a kola run, but ssh'ing into the VM and rerunning rpm-ostree upgrade works... and... oh, OK — if I delete the cached ref, this reproduces.

Nov 12 15:26:16 qemu0 rpm-ostree[1641]: Initiated txn Upgrade for client(id:cli dbus:1.164 unit:session-13.scope uid:0): /org/projectatomic/rpmostree1/fedora_coreos
Nov 12 15:26:16 qemu0 rpm-ostree[1641]: Process [pid: 3110 uid: 0 unit: session-13.scope] connected to transaction progress
Nov 12 15:26:16 qemu0 rpm-ostree[1641]: 2021-11-12T15:26:16.794771Z DEBUG rpmostree_rust::origin: Failed to roundtrip origin: Fetching origin/container-image-reference: Key file does not have key “container-image-reference” in group “origin”; Unexpected new key: origin/refspec
Nov 12 15:26:16 qemu0 rpm-ostree[1641]: 2021-11-12T15:26:16.795055Z DEBUG rpmostree_rust::origin: Failed to roundtrip origin: Fetching origin/container-image-reference: Key file does not have key “container-image-reference” in group “origin”; Unexpected new key: origin/refspec
Nov 12 15:26:16 qemu0 rpm-ostree[1641]: 2021-11-12T15:26:16.795275Z DEBUG rpmostree_rust::origin: Failed to roundtrip origin: Fetching origin/container-image-reference: Key file does not have key “container-image-reference” in group “origin”; Unexpected new key: origin/refspec
Nov 12 15:26:16 qemu0 rpm-ostree[1641]: 2021-11-12T15:26:16.795474Z DEBUG rpmostree_rust::origin: Failed to roundtrip origin: Fetching origin/container-image-reference: Key file does not have key “container-image-reference” in group “origin”; Unexpected new key: origin/refspec
Nov 12 15:26:16 qemu0 rpm-ostree[1641]: 2021-11-12T15:26:16.795644Z DEBUG rpmostree_rust::origin: Failed to roundtrip origin: Fetching origin/container-image-reference: Key file does not have key “container-image-reference” in group “origin”; Unexpected new key: origin/refspec
Nov 12 15:26:16 qemu0 rpm-ostree[1641]: 2021-11-12T15:26:16.796521Z TRACE new_with_config{config=ImageProxyConfig { authfile: None, insecure_skip_tls_verification: None }}: mio::poll: registering event source with poller: token=Token(805306369), interests=READABLE | WRITABLE
Nov 12 15:26:16 qemu0 rpm-ostree[1641]: 2021-11-12T15:26:16.796640Z DEBUG new_with_config{config=ImageProxyConfig { authfile: None, insecure_skip_tls_verification: None }}: containers_image_proxy: Spawned skopeo pid=Some(3116)
Nov 12 15:26:16 qemu0 rpm-ostree[1641]: 2021-11-12T15:26:16.796743Z TRACE new_with_config{config=ImageProxyConfig { authfile: None, insecure_skip_tls_verification: None }}:impl_request{self=ImageProxy method="Initialize"}: containers_image_proxy: sending request Initialize
Nov 12 15:26:16 qemu0 rpm-ostree[1641]: 2021-11-12T15:26:16.816539Z TRACE new_with_config{config=ImageProxyConfig { authfile: None, insecure_skip_tls_verification: None }}:impl_request{self=ImageProxy method="Initialize"}: containers_image_proxy: completed request
Nov 12 15:26:16 qemu0 rpm-ostree[1641]: 2021-11-12T15:26:16.816683Z DEBUG open_image{self=ImageProxy imgref="oci:/var/tmp/fcos"}: containers_image_proxy: opening image
Nov 12 15:26:16 qemu0 rpm-ostree[1641]: 2021-11-12T15:26:16.816739Z TRACE open_image{self=ImageProxy imgref="oci:/var/tmp/fcos"}:impl_request{self=ImageProxy method="OpenImage"}: containers_image_proxy: sending request OpenImage
Nov 12 15:26:16 qemu0 rpm-ostree[1641]: 2021-11-12T15:26:16.817102Z TRACE open_image{self=ImageProxy imgref="oci:/var/tmp/fcos"}:impl_request{self=ImageProxy method="OpenImage"}: containers_image_proxy: completed request
Nov 12 15:26:16 qemu0 rpm-ostree[1641]: 2021-11-12T15:26:16.817168Z TRACE impl_request{self=ImageProxy method="GetManifest"}: containers_image_proxy: sending request GetManifest
Nov 12 15:26:16 qemu0 rpm-ostree[1641]: 2021-11-12T15:26:16.817647Z TRACE impl_request{self=ImageProxy method="GetManifest"}: containers_image_proxy: completed request
Nov 12 15:26:16 qemu0 rpm-ostree[1641]: 2021-11-12T15:26:16.817732Z DEBUG finish_pipe{self=ImageProxy pipeid=8}: containers_image_proxy: closing pipe
Nov 12 15:26:16 qemu0 rpm-ostree[1641]: 2021-11-12T15:26:16.817793Z TRACE finish_pipe{self=ImageProxy pipeid=8}:impl_request{self=ImageProxy method="FinishPipe"}: containers_image_proxy: sending request FinishPipe
Nov 12 15:26:16 qemu0 rpm-ostree[1641]: 2021-11-12T15:26:16.817948Z TRACE finish_pipe{self=ImageProxy pipeid=8}:impl_request{self=ImageProxy method="FinishPipe"}: containers_image_proxy: completed request
Nov 12 15:26:16 qemu0 rpm-ostree[1641]: 2021-11-12T15:26:16.818285Z DEBUG fetch_layer_decompress: ostree_ext::container::unencapsulate: fetching sha256:9e0cc0cd599302d557267a020563166445cd395fba56ac7f91413b30c1d6fc3b
Nov 12 15:26:16 qemu0 rpm-ostree[1641]: 2021-11-12T15:26:16.818722Z DEBUG fetch_layer_decompress:get_blob{self=ImageProxy img=OpenedImage(1) digest="sha256:9e0cc0cd599302d557267a020563166445cd395fba56ac7f91413b30c1d6fc3b" size=1614709339}: containers_image_proxy: fetching blob
Nov 12 15:26:16 qemu0 rpm-ostree[1641]: 2021-11-12T15:26:16.818857Z TRACE fetch_layer_decompress:get_blob{self=ImageProxy img=OpenedImage(1) digest="sha256:9e0cc0cd599302d557267a020563166445cd395fba56ac7f91413b30c1d6fc3b" size=1614709339}:impl_request{self=ImageProxy method="GetBlob"}: containers_image_proxy: sending request GetBlob
Nov 12 15:26:16 qemu0 rpm-ostree[1641]: 2021-11-12T15:26:16.819119Z TRACE fetch_layer_decompress:get_blob{self=ImageProxy img=OpenedImage(1) digest="sha256:9e0cc0cd599302d557267a020563166445cd395fba56ac7f91413b30c1d6fc3b" size=1614709339}:impl_request{self=ImageProxy method="GetBlob"}: containers_image_proxy: completed request
Nov 12 15:26:16 qemu0 rpm-ostree[1641]: 2021-11-12T15:26:16.819224Z DEBUG finish_pipe{self=ImageProxy pipeid=9}: containers_image_proxy: closing pipe
Nov 12 15:26:16 qemu0 rpm-ostree[1641]: 2021-11-12T15:26:16.819307Z TRACE finish_pipe{self=ImageProxy pipeid=9}:impl_request{self=ImageProxy method="FinishPipe"}: containers_image_proxy: sending request FinishPipe
Nov 12 15:26:16 qemu0 rpm-ostree[1641]: 2021-11-12T15:26:16.820941Z DEBUG ostree_ext::tar::import: Imported 554ac1bc0b4fc85d86525b6d0b54f437c16493bc74d54149faf93d003171c1ab.commit
Nov 12 15:26:21 qemu0 rpm-ostree[1641]: 2021-11-12T15:26:21.986290Z TRACE finish_pipe{self=ImageProxy pipeid=9}:impl_request{self=ImageProxy method="FinishPipe"}: containers_image_proxy: completed request
Nov 12 15:26:21 qemu0 rpm-ostree[1641]: 2021-11-12T15:26:21.988996Z DEBUG finalize{self=ImageProxy}: containers_image_proxy: sent shutdown request
Nov 12 15:26:21 qemu0 rpm-ostree[1641]: 2021-11-12T15:26:21.989781Z TRACE finalize{self=ImageProxy}: mio::poll: deregistering event source from poller

@cgwalters (Member, Author):

                Stack trace of thread 1641:
                #0  0x00007f913a3a07ff __poll (libc.so.6 + 0x1057ff)
                #1  0x00007f913ae2624c g_main_context_iterate.constprop.0 (libglib-2.0.so.0 + 0xaa24c)
                #2  0x00007f913adce9e3 g_main_context_iteration (libglib-2.0.so.0 + 0x529e3)
                #3  0x000056513c898dcc rpmostreed_daemon_run_until_idle_exit (rpm-ostree + 0x6c9dcc)
                #4  0x000056513c8907ec rpmostree_builtin_start_daemon (rpm-ostree + 0x6c17ec)
                #5  0x000056513c887782 _ZN12rpmostreecxx14rpmostree_mainEN4rust10cxxbridge15SliceIKNS1_3StrEEE (rpm-ostree + 0x6b8782)
                #6  0x000056513c884880 operator() (rpm-ostree + 0x6b5880)
                #7  0x000056513c592ded _ZN97_$LT$core..future..from_generator..GenFuture$LT$T$GT$$u20$as$u20$core..future..future..Future$GT$4poll17h482ae2c176c6a5a1E (rpm-ostree + 0x3c3ded)
                #8  0x000056513c5cd536 _ZN10rpm_ostree4main17h1efb03c0dd703546E (rpm-ostree + 0x3fe536)
                #9  0x000056513c551eb3 _ZN3std10sys_common9backtrace28__rust_begin_short_backtrace17he8fed9103c1d6223E (rpm-ostree + 0x382eb3)
                #10 0x000056513c5ca522 main (rpm-ostree + 0x3fb522)
                #11 0x00007f913a2c8560 __libc_start_call_main (libc.so.6 + 0x2d560)
                #12 0x00007f913a2c860c __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x2d60c)
                #13 0x000056513c2f658e _start (rpm-ostree + 0x12758e)
                
                Stack trace of thread 1642:
                #0  0x00007f913a3acc7e epoll_wait (libc.so.6 + 0x111c7e)
                #1  0x000056513c8432ec _ZN5tokio2io6driver6Driver4turn17ha023eb265e1b6f8cE.llvm.9700796789054656221 (rpm-ostree + 0x6742ec)
                #2  0x000056513c84fe41 _ZN5tokio4time6driver15Driver$LT$P$GT$13park_internal17hf13ff95fbf3f0ce0E.llvm.1214285945216753108 (rpm-ostree + 0x680e41)
                #3  0x000056513c84a3e3 _ZN78_$LT$tokio..park..either..Either$LT$A$C$B$GT$$u20$as$u20$tokio..park..Park$GT$4park17h6bdfcfddf3e3c68bE (rpm-ostree + 0x67b3e3)
                #4  0x000056513c853780 _ZN5tokio7runtime11thread_pool6worker7Context12park_timeout17h33c175c2c7952417E (rpm-ostree + 0x684780)
                #5  0x000056513c847916 _ZN5tokio7runtime4task3raw4poll17ha60e29aebe400a98E.llvm.255271045138269416 (rpm-ostree + 0x678916)
                #6  0x000056513c842369 _ZN3std10sys_common9backtrace28__rust_begin_short_backtrace17h6bb5387d1cc9e0f1E.llvm.9700796789054656221 (rpm-ostree + 0x673369)
                #7  0x000056513c857bdb _ZN4core3ops8function6FnOnce40call_once$u7b$$u7b$vtable.shim$u7d$$u7d$17h0fc9f53d851a613aE (rpm-ostree + 0x688bdb)
                #8  0x000056513c82a2d5 _ZN3std3sys4unix6thread6Thread3new12thread_start17h1fbdb50adbde1cabE (rpm-ostree + 0x65b2d5)
                #9  0x00007f913a328b17 start_thread (libc.so.6 + 0x8db17)
                #10 0x00007f913a3ad6a0 __clone3 (libc.so.6 + 0x1126a0)
                
                Stack trace of thread 1643:
                #0  0x00007f913a3a07ff __poll (libc.so.6 + 0x1057ff)
                #1  0x00007f913ae2624c g_main_context_iterate.constprop.0 (libglib-2.0.so.0 + 0xaa24c)
                #2  0x00007f913adce9e3 g_main_context_iteration (libglib-2.0.so.0 + 0x529e3)
                #3  0x00007f913adcea31 glib_worker_main (libglib-2.0.so.0 + 0x52a31)
                #4  0x00007f913ae00122 g_thread_proxy (libglib-2.0.so.0 + 0x84122)
                #5  0x00007f913a328b17 start_thread (libc.so.6 + 0x8db17)
                #6  0x00007f913a3ad6a0 __clone3 (libc.so.6 + 0x1126a0)
                
                Stack trace of thread 1644:
                #0  0x00007f913a3a07ff __poll (libc.so.6 + 0x1057ff)
                #1  0x00007f913ae2624c g_main_context_iterate.constprop.0 (libglib-2.0.so.0 + 0xaa24c)
                #2  0x00007f913add0903 g_main_loop_run (libglib-2.0.so.0 + 0x54903)
                #3  0x00007f913b02583a gdbus_shared_thread_func.lto_priv.0 (libgio-2.0.so.0 + 0x11383a)
                #4  0x00007f913ae00122 g_thread_proxy (libglib-2.0.so.0 + 0x84122)
                #5  0x00007f913a328b17 start_thread (libc.so.6 + 0x8db17)
                #6  0x00007f913a3ad6a0 __clone3 (libc.so.6 + 0x1126a0)
                
                Stack trace of thread 3030:
                #0  0x00007f913a3a5f3d syscall (libc.so.6 + 0x10af3d)
                #1  0x000056513c4242a9 _ZN11parking_lot7condvar7Condvar19wait_until_internal17h7464b111c075020cE (rpm-ostree + 0x2552a9)
                #2  0x000056513c851f77 _ZN75_$LT$tokio..park..thread..CachedParkThread$u20$as$u20$tokio..park..Park$GT$4park17h41d642008114c83bE (rpm-ostree + 0x682f77)
                #3  0x000056513c6c505d rpmostreecxx$cxxbridge1$pull_container (rpm-ostree + 0x4f605d)
                #4  0x000056513c884ed0 _ZN12rpmostreecxx14pull_containerER10OstreeRepoR13_GCancellableN4rust10cxxbridge13StrE (rpm-ostree + 0x6b5ed0)
                #5  0x000056513c8e6b49 rpmostree_sysroot_upgrader_pull_base (rpm-ostree + 0x717b49)
                #6  0x000056513c8a698a deploy_transaction_execute (rpm-ostree + 0x6d798a)
                #7  0x000056513c8a0035 transaction_execute_thread (rpm-ostree + 0x6d1035)
                #8  0x00007f913afc33ef g_task_thread_pool_thread (libgio-2.0.so.0 + 0xb13ef)
                #9  0x00007f913ae03054 g_thread_pool_thread_proxy.lto_priv.0 (libglib-2.0.so.0 + 0x87054)
                #10 0x00007f913ae00122 g_thread_proxy (libglib-2.0.so.0 + 0x84122)
                #11 0x00007f913a328b17 start_thread (libc.so.6 + 0x8db17)
                #12 0x00007f913a3ad6a0 __clone3 (libc.so.6 + 0x1126a0)

I think the problem here may trace back to #3082.

Basically we have a mess due to using both the glib mainloop and C threads, as well as tokio.

Hmm. I think what we should be doing is using tokio::spawn_blocking for inner_async_main.
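
Something like this (a minimal sketch; run_daemon_blocking is a hypothetical stand-in for the code that currently iterates the GLib main loop):

use anyhow::Result;

// Hypothetical stand-in for the existing blocking entrypoint that drives the
// GLib main loop until the daemon decides to exit.
fn run_daemon_blocking() -> Result<i32> {
    Ok(0)
}

#[tokio::main]
async fn main() -> Result<()> {
    // Keep the GLib main loop and C threads on tokio's dedicated blocking
    // pool, so they don't tie up the async worker threads doing e.g. the
    // container fetch.
    let code = tokio::task::spawn_blocking(run_daemon_blocking).await??;
    std::process::exit(code);
}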

@cgwalters (Member, Author):

OK, I am disappointed I didn't debug this today. But I made some progress in understanding things. One specific discovery is that tokio takes over the global SIGCHLD handler, which will totally break glib's use of it.

We may need to add some sort of hook in tokio to allow deferring waitpid() invocations...or in general better bridging between tokio and glib's mainloop.

Now that said, I'm not totally sure that's the problem yet.

What's really maddening is that, for some reason, this reliably reproduces when run as a kola test but almost never in my interactive setup, where I have debug tools.

Using this tokio patch:

diff --git a/tokio/src/process/mod.rs b/tokio/src/process/mod.rs
index 6eeefdbe..2ac6fa39 100644
--- a/tokio/src/process/mod.rs
+++ b/tokio/src/process/mod.rs
@@ -1062,7 +1062,9 @@ impl Child {
         match &mut self.child {
             FusedChild::Done(exit) => Ok(*exit),
             FusedChild::Child(child) => {
+                eprintln!("Awaiting child exit");
                 let ret = child.await;
+                eprintln!("child exited");
 
                 if let Ok(exit) = ret {
                     self.child = FusedChild::Done(exit);
diff --git a/tokio/src/process/unix/reap.rs b/tokio/src/process/unix/reap.rs
index f7f4d3cc..c3ce4089 100644
--- a/tokio/src/process/unix/reap.rs
+++ b/tokio/src/process/unix/reap.rs
@@ -86,9 +86,12 @@ where
             // should not cause significant issues with parent futures.
             let registered_interest = self.signal.poll_recv(cx).is_pending();
 
+            eprintln!("Trying wait, interest: {}", registered_interest);
             if let Some(status) = self.inner_mut().try_wait()? {
+                eprintln!("Child is ready");
                 return Poll::Ready(Ok(status));
             }
+            eprintln!("Child is not ready");
 
             // If our attempt to poll for the next signal was not ready, then
             // we've arranged for our task to get notified and we can bail out.

I get:

Nov 12 22:53:35 qemu0 rpm-ostree[795]: 2021-11-12T22:53:35.289919Z TRACE finish_pipe{self=ImageProxy pipeid=10}:impl_request{self=ImageProxy method="FinishPipe"}: containers_image_proxy: completed request
Nov 12 22:53:35 qemu0 rpm-ostree[795]: 2021-11-12T22:53:35.292127Z DEBUG finalize{self=ImageProxy}: containers_image_proxy: sent shutdown request
Nov 12 22:53:35 qemu0 rpm-ostree[795]: 2021-11-12T22:53:35.292180Z TRACE finalize{self=ImageProxy}: containers_image_proxy: got child process lock
Nov 12 22:53:35 qemu0 rpm-ostree[795]: Trying wait, interest: true
Nov 12 22:53:35 qemu0 rpm-ostree[795]: Child is not ready
Nov 12 22:53:35 qemu0 rpm-ostree[795]: 2021-11-12T22:53:35.293163Z TRACE mio::poll: deregistering event source from poller
Nov 12 22:53:35 qemu0 rpm-ostree[795]: Trying wait, interest: true
Nov 12 22:53:35 qemu0 rpm-ostree[795]: Child is not ready
Nov 12 22:53:35 qemu0 rpm-ostree[795]: 2021-11-12T22:53:35.293182Z TRACE mio::poll: deregistering event source from poller

And the skopeo process has clearly exited:
root 1313 1.3 0.0 0 0 ? Z 22:53 0:03 \_ [skopeo] <defunct>

So...I think this is something going wrong in the guts of the skopeo child process machinery, possibly due to interaction with the glib mainloop.

I started hacking on forking off the whole thing as a subprocess (from the daemon), which would be "sustainably hacky", but what's somewhat annoying about that is that ostree doesn't expose the whole "remount /sysroot as writable" bits. Hmm, I guess we could fork via systemd-run. Will potentially look at that.

@cgwalters (Member, Author):

containers/containers-image-proxy-rs#13 — once that merges, I'll cut a containers-image-proxy 0.3.1 release, then bump the Cargo.lock here.

@cgwalters (Member, Author) commented on Nov 15, 2021:

> What's really maddening is that for some reason this reliably reproduces when run in kola test but almost never when run in my interactive setup where I have debug tools.

What seems obvious in retrospect is that the flow here is:

  • tokio pull container
  • tokio installs SIGCHLD handler (at most one attempt per process though!)
  • We spawn bwrap to do the sanity check using glib
  • glib installs SIGCHLD handler

And then rerunning that loop will fail... but if we auto-exit (or if one restarts the process), then it will work fine. And some of my interactive debugging either hit the auto-exit timeout, or I was e.g. adding some new debug code via systemctl restart rpm-ostreed.

Whereas the kola run is fast enough that it's hitting the same daemon instance multiple times reliably.

jlebon previously approved these changes on Nov 15, 2021
rust/src/main.rs: outdated review thread (resolved)
@cgwalters merged commit bd1cdb7 into coreos:main on Nov 15, 2021
@cgwalters (Member, Author) commented on Nov 15, 2021:

Just for reference, I also tried the below.

Originally, I thought we could simply do gio::Subprocess::spawn("true"), but right now glib refcounts its use of SIGCHLD and uninstalls the handler if there are no children, which defeats that hack. So then I did...

From 3bdc1bc117065934b1ba901d4d6003b7de17cd2d Mon Sep 17 00:00:00 2001
From: Colin Walters <walters@verbum.org>
Date: Mon, 15 Nov 2021 15:21:37 -0500
Subject: [PATCH] main: Force glib to install SIGCHLD handler early

This is a distinct workaround for
https://github.com/containers/containers-image-proxy-rs/commit/6e232a659a9c4e0bec627847d67edca740bfa20e

By having GLib install its handler before we set up the tokio
runtime, we ensure that `signal-hook` as used by tokio will correctly
chain to the GLib handler.
---
 Cargo.lock       |  4 ++--
 rust/src/main.rs | 13 +++++++++++++
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/Cargo.lock b/Cargo.lock
index 6f6df906..ebcab8bb 100644
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -261,9 +261,9 @@ dependencies = [
 
 [[package]]
 name = "containers-image-proxy"
-version = "0.3.1"
+version = "0.3.0"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "cd1740fa7c4f7f0cbee0bb65e98f6ea0262de0c0e6bf26f62e6a394ec7ce94a2"
+checksum = "cb05ae39fc87403530466dffb6936d98a318ff2dd7bce1cf05323fff522f814a"
 dependencies = [
  "anyhow",
  "fn-error-context",
diff --git a/rust/src/main.rs b/rust/src/main.rs
index 226f65bf..03e40625 100644
--- a/rust/src/main.rs
+++ b/rust/src/main.rs
@@ -3,6 +3,7 @@
 
 use anyhow::{Context, Result};
 use nix::sys::signal;
+use ostree_ext::glib;
 use std::io::Write;
 use std::os::unix::prelude::CommandExt;
 use termcolor::WriteColor;
@@ -84,6 +85,18 @@ fn inner_main() -> Result<i32> {
         .with_writer(std::io::stderr)
         .init();
     tracing::trace!("starting");
+
+    // Force glib to install its SIGCHLD handler now, because we also use tokio in process
+    // which also installs a SIGCHLD handler, but tokio uses signal-hook which knows
+    // how to chain to the glib handler, but the reverse is currently not true.
+    // https://github.com/tokio-rs/tokio/issues/3520#issuecomment-969225304
+    let _watch = glib::child_watch_source_new(
+        glib::Pid(nix::unistd::getpid().into()),
+        None,
+        glib::Priority::default(),
+        |_, _| {},
+    );
+
     // Gather our arguments.
     let args: Result<Vec<String>> = std::env::args_os()
         .map(|s| -> Result<String> {
-- 
2.33.1

but it currently causes glib to error out later. We'd have to have a persistent sleep infinity child process etc. Eww.
