Project or editor crashes randomly from xcb XInitThreads #75308

Open
Cykyrios opened this issue Mar 25, 2023 · 44 comments

Comments

@Cykyrios
Contributor

Godot version

v4.1.dev.custom_build [0291fcd]

System information

Linux Manjaro, kernel 6.1.19, X11

Issue description

The editor or the running project sometimes crashes with the following error:

[xcb] Unknown sequence number while processing queue
[xcb] You called XInitThreads, this is not your fault
[xcb] Aborting, sorry about that.
godot.linuxbsd.editor.x86_64: xcb_io.c:278: poll_for_event: Assertion `!xcb_xlib_threads_sequence_lost' failed.

Crashes are more common while a project is running, but the editor also crashed because of this a couple of times over the past week or so.

I am not using any thread-related functions in my project, and physics/rendering are not threaded. The project I'm working on when this happens is a simple GUI-based game.

Steps to reproduce

This seems to happen fairly randomly.

Minimal reproduction project

N/A

@Calinou
Member

Calinou commented Mar 26, 2023

I haven't been able to reproduce this so far on Fedora 37 KDE (GeForce RTX 4090 with NVIDIA 525.89.02).

What graphics card model, driver version and desktop environment are you using?

Edit: As of November 2023, I've started to be able to reproduce this issue on the same setup as mentioned above (with Fedora 38 and then 39).

@Cykyrios
Contributor Author

Oh right, I forgot about GPU-related info. I have an AMD 7900 XT, running on the open-source amdgpu drivers with Mesa 22.3.5 (amdgpu version is "kernel").
The desktop is Plasma 5.26.5.

@geowarin
Contributor

Saw this happening (only once) on a totally different config:
RTX 2080Ti
archlinux
i3wm

@cg9999
Contributor

cg9999 commented Mar 28, 2023

I have this too, again. Seems very reminiscent of #69352
Happens quite frequently here.
The message varies a bit, last one I got is:

[xcb] Unknown request in queue while dequeuing
[xcb] You called XInitThreads, this is not your fault
[xcb] Aborting, sorry about that.
godot: xcb_io.c:175: dequeue_pending_request: Assertion `!xcb_xlib_unknown_req_in_deq' failed.

Godot: v4.0.1.stable.arch_linux
libx11: 1.8.4-1
arch linux 64 bit, kernel 6.2.8-zen1
On a laptop with Intel HD Graphics 620

@DrRevert
Contributor

DrRevert commented Apr 8, 2023

Manjaro kernel version: 6.1.22-1
As for GPUs: AMD RX 6800 XT but also Intel UHD Graphics 770 (CPU i7-12700)
I have my screens connected to integrated graphics as I'm doing some GPU passthrough, this setup used to cause issues during the beta whenever I opened a new window or a submenu.
Happened like 4 times mostly randomly when the editor was idling.

Managed to reproduce it when connected to gdb, adding backtrace as the attachment
gdb.txt

Forgot to mention Godot version: custom build based on 4.0.2 stable

@Eraph

Eraph commented Apr 9, 2023

Seeing the same thing on Manjaro here, I have integrated Intel graphics (Intel i7-1165G7). Gnome on Wayland. Common factor seems to be Arch based distros?

Full backtrace:

handle_crash: Program crashed with signal 11
Engine version: Godot Engine v4.0.2.stable.mono.official (7a0977ce2c558fe6219f0a14f8bd4d05aea8f019)
Dumping the backtrace. Please include this when reporting the bug to the project developer.
[1] /usr/share/dotnet/shared/Microsoft.NETCore.App/7.0.3/libcoreclr.so(+0x4a90a4) [0x7f5d0658b0a4] (??:0)
[2] /usr/lib/libc.so.6(+0x38f50) [0x7f5d35225f50] (??:0)
[3] /usr/share/dotnet/shared/Microsoft.NETCore.App/7.0.3/libcoreclr.so(+0x49296b) [0x7f5d0657496b] (??:0)
[4] /usr/share/dotnet/shared/Microsoft.NETCore.App/7.0.3/libcoreclr.so(+0x4a8d18) [0x7f5d0658ad18] (??:0)
[5] /usr/lib/libc.so.6(+0x38f50) [0x7f5d35225f50] (??:0)
[6] /usr/lib/libc.so.6(+0x878ec) [0x7f5d352748ec] (??:0)
[7] /usr/lib/libc.so.6(gsignal+0x18) [0x7f5d35225ea8] (??:0)
[8] /usr/lib/libc.so.6(abort+0xd7) [0x7f5d3520f53d] (??:0)
[9] /usr/lib/libc.so.6(+0x2245c) [0x7f5d3520f45c] (??:0)
[10] /usr/lib/libc.so.6(+0x319f6) [0x7f5d3521e9f6] (??:0)
[11] /usr/lib/libX11.so.6(+0x3eb8f) [0x7f5d2d888b8f] (??:0)
[12] /usr/lib/libX11.so.6(+0x41995) [0x7f5d2d88b995] (??:0)
[13] /usr/lib/libX11.so.6(_XEventsQueued+0x62) [0x7f5d2d88e642] (??:0)
[14] /usr/lib/libX11.so.6(XFlush+0x1f) [0x7f5d2d86bc1f] (??:0)
[15] /opt/godot-mono-bin/godot/Godot_v4.0.2-stable_mono_linux.x86_64() [0x4d62051] (??:0)
[16] /opt/godot-mono-bin/godot/Godot_v4.0.2-stable_mono_linux.x86_64() [0xe792eb] (??:0)
[17] /opt/godot-mono-bin/godot/Godot_v4.0.2-stable_mono_linux.x86_64() [0x4217f35] (??:0)
[18] /opt/godot-mono-bin/godot/Godot_v4.0.2-stable_mono_linux.x86_64() [0x4e38160] (??:0)
[19] /usr/lib/libc.so.6(+0x85bb5) [0x7f5d35272bb5] (??:0)
[20] /usr/lib/libc.so.6(+0x107d90) [0x7f5d352f4d90] (??:0)
-- END OF BACKTRACE --

@vypxl

vypxl commented Apr 12, 2023

Having the same issue, Manjaro with Hyprland / Wayland here. Also Godot 4.0.1 stable, libx11 v1.8.4-1, intel integrated graphics.

@ilmagico

Can confirm on Manjaro with kernel 6.1.23, X11 (no wayland) with libx11 1.8.4-1 as well, intel integrated, godot 4.1 compiled from source (from a fork not far from master, but judging from this report the issue is in godot, I could confirm if necessary), backtrace is exactly the same as @Eraph above.

Also, I noticed this is with .NET 7.0.3, while if I download the official stable godot mono from godotengine.org (not from Manjaro's pacman) it never crashes this way, and it's on .NET6, not sure if it matters.

Any other info I could provide to help debug this?

akien-mga added this to the 4.1 milestone on Apr 20, 2023

@ju5tevg3niy
Contributor

ju5tevg3niy commented Apr 23, 2023

Same issue.

godot:  4.0.2.stable.official.7a0977ce2
render: Vulkan API 1.3.230 - Forward Mobile - Using Vulkan Device #0: Intel - Intel(R) HD Graphics 620 (KBL GT2)
os:     Gentoo
kernel: 6.1.22
de:     Xfce 4.18 / X11
libX11: 1.8.4-r1

@comminux

comminux commented Jun 5, 2023

It seems that the problem no longer occurs in version 1.8.5 (Arch Linux official extra repository).

UPD. The problem appears again on libX11 1.8.7

@jivvy

jivvy commented Jul 23, 2023

Just had this happen

swaywm
Arch Linux
AMD 6700XT
Godot v4.2.dev.custom_build [f8dbed4]
libx11 1.8.6-1

[xcb] Unknown request in queue while dequeuing
[xcb] You called XInitThreads, this is not your fault
[xcb] Aborting, sorry about that.
godot.linuxbsd.editor.x86_64: xcb_io.c:175: dequeue_pending_request: Assertion `!xcb_xlib_unknown_req_in_deq' failed.

@krendil

krendil commented Oct 4, 2023

I'm getting the same error, with a recent libX11 and non-Arch Linux

Godot Engine v4.1.1.stable.custom_build
OpenGL API 4.6 (Core Profile) Mesa 23.1.3 - Compatibility - Using Device: AMD - AMD Radeon RX 6600 (navi23, LLVM 15.0.7, DRM 3.52, 6.3.13_1)

Void Linux
XFCE4 / xfwm 4.18.0_1
libX11 1.8.6_1
libxcb 1.16_1

[xcb] Unknown request in queue while dequeuing
[xcb] You called XInitThreads, this is not your fault
[xcb] Aborting, sorry about that.
godot: xcb_io.c:175: dequeue_pending_request: Assertion `!xcb_xlib_unknown_req_in_deq' failed.

@Leshy-YA

Leshy-YA commented Oct 6, 2023

Confirmed on Fedora running KDE with Mesa Intel® Xe Graphics.
It would seem there's a regression in libX11 1.8.7, probably related to https://gitlab.freedesktop.org/xorg/lib/libx11/-/issues/170
Downgrading libX11 to 1.8.4 removes the issue.

@ZwieBit

ZwieBit commented Oct 18, 2023

Can also confirm that a downgrade to libX11 1.8.4 fixed the issue. I already thought that godot is somewhat unstable but now even 4.2 beta1 works like a charm :-)

@Pshy0

Pshy0 commented Oct 19, 2023

I am having similar crashes involving xcb_in.c. They are unpredictable: sometimes they crash the project, sometimes the editor, and sometimes they just show up in the logs without a crash.

Ubuntu 23.04
Godot 4.2.dev4.official.549fcce5f
libx11 1.8.4-2ubuntu0.3
libxcb 1.15-1

The error messages are as follows:

[xcb] Unknown request in queue while dequeuing
[xcb] You called XInitThreads, this is not your fault
[xcb] Aborting, sorry about that.
godot: ../../src/xcb_io.c:175: dequeue_pending_request: Assertion `!xcb_xlib_unknown_req_in_deq' failed.

or

[xcb] Unknown sequence number while awaiting reply
[xcb] You called XInitThreads, this is not your fault
[xcb] Aborting, sorry about that.
godot: ../../src/xcb_io.c:374: poll_for_response: Assertion `!xcb_xlib_threads_sequence_lost' failed.

or

godot: ../../src/xcb_in.c:757: xcb_request_check: Assertion `!reply' failed.

It also sometimes crashes without an error message.

YuriSizov modified the milestones: 4.2, 4.3 on Nov 15, 2023

@vvvvvvitor

Keeps happening here all the time; it makes the editor basically unusable because of how often it happens. It's frustrating to the point that I don't want to work on my project anymore.

[xcb] Unknown sequence number while awaiting reply
[xcb] You called XInitThreads, this is not your fault
[xcb] Aborting, sorry about that.
4.1.3.x86_64: xcb_io.c:374: poll_for_response: Assertion `!xcb_xlib_threads_sequence_lost' failed.

@ygingras

ygingras commented Nov 16, 2023

On Ubuntu 23.04, all of Godot 4.1.1, 4.1.2, 4.1.3 crash about three times per hour. If I downgrade xserver-xorg-core from 2:21.1.7-1ubuntu3.1 to 2:21.1.7-1ubuntu3, then the crashes happen only once every few days.

Edit: fixed the version numbers

@Lamby777

Yep, now I have a reason not to feel bad about catching myself subconsciously spamming CTRL+S.

Milestone shows 4.3, would the fix last or do you guys think an update gonna break it again soon after? Might just downgrade libx11 if it gets really annoying but I'm kinda too busy to troubleshoot stuff rn if doing so happens to break any of my other packages. ;-;

@granitrocky

granitrocky commented Dec 17, 2023

@Lamby777 Upgrading libx11 works, too. I compiled and installed the latest libx11 from master on fedora 39 by doing

https://gitlab.freedesktop.org/xorg/lib/libx11/-/tree/master

./autogen.sh
./configure --prefix=/usr
make
sudo make install

and then reboot and I haven't had a crash yet.

@QueenOfSquiggles

Gonna share I'm experiencing this regularly on my Manjaro KDE machine.
System details here:

Operating System: Manjaro Linux 
KDE Plasma Version: 5.27.10
KDE Frameworks Version: 5.113.0
Qt Version: 5.15.11
Kernel Version: 6.5.13-7-MANJARO (64-bit)
Graphics Platform: X11
Processors: 4 × AMD Athlon(tm) X4 880K Quad Core Processor
Memory: 15.6 GiB of RAM
Graphics Processor: NVIDIA GeForce GTX 1070/PCIe/SSE2
Manufacturer: Gigabyte Technology Co., Ltd.

Currently working with Godot 4.2-Stable running it from the command line so I get more interesting details.

Error from terminal:

[xcb] Unknown sequence number while processing queue
[xcb] You called XInitThreads, this is not your fault
[xcb] Aborting, sorry about that.
Godot_v4.2-stable_linux.x86_64: xcb_io.c:278: poll_for_event: Assertion `!xcb_xlib_threads_sequence_lost' failed.
zsh: IOT instruction (core dumped)  $GODOT4_BIN -e .

@acolbert1986

System info...

Operating System: Manjaro Linux 
KDE Plasma Version: 5.27.10
KDE Frameworks Version: 5.113.0
Qt Version: 5.15.11
Kernel Version: 6.1.69-1-MANJARO (64-bit)
Graphics Platform: X11
Processors: 12 × 12th Gen Intel® Core™ i5-12500T
Memory: 7.5 GiB of RAM
Graphics Processor: Mesa Intel® UHD Graphics 770
Manufacturer: Dell Inc.
Product Name: OptiPlex 3000

Error in terminal...

Godot Engine v4.2.1.stable.arch_linux - https://godotengine.org

[xcb] Unknown request in queue while dequeuing
[xcb] You called XInitThreads, this is not your fault
[xcb] Aborting, sorry about that.
godot: xcb_io.c:175: dequeue_pending_request: Assertion `!xcb_xlib_unknown_req_in_deq' failed.
[1]    2223 IOT instruction (core dumped)  godot --editor --verbose --single-window

@akien-mga
Member

akien-mga commented Jan 2, 2024

> @Lamby777 Upgrading libx11 works, too. I compiled and installed the latest libx11 from master on fedora 39 by doing
>
> https://gitlab.freedesktop.org/xorg/lib/libx11/-/tree/master
>
> ./autogen.sh
> ./configure --prefix=/usr
> make
> sudo make install
>
> and then reboot and I haven't had a crash yet.

That's weird, as the latest master commit of libx11 is the 1.8.7 release which is shipped by Fedora.

So there's a difference between the Fedora package for libX11-1.8.7-1.fc39 and the one you compiled locally. It could be a different set of installed dependencies, so that your local build adds or removes a feature, or this patch that Fedora is carrying https://src.fedoraproject.org/rpms/libX11/blob/rawhide/f/dont-forward-keycode-0.patch, or any of the other custom tweaks in their .spec file, though I don't see much that sounds relevant: https://src.fedoraproject.org/rpms/libX11/blob/rawhide/f/libX11.spec

Either way, I suggest also opening a Fedora bug report, as the upstream libX11 report isn't getting any traction and we now have evidence that a self-compiled libx11 performs differently.

@Leshy-YA

Leshy-YA commented Jan 2, 2024

> Either way, I suggest also opening a Fedora bug report, as the upstream libX11 report isn't getting any traction and we now have evidence that a self-compiled libx11 performs differently.

Done: https://bugzilla.redhat.com/show_bug.cgi?id=2256495

@ionenwks

ionenwks commented Jan 3, 2024

Possibly CFLAGS are involved and some UB is causing this? That is, when you built libX11 manually from master, I'd assume you didn't use the same flags (possibly just a plain -O2). I don't really keep up with Fedora, but I believe they use LTO nowadays, for one? (Edit: as for autoconf options, they don't pass anything notable that I can see, and the keycode patch sounds harmless too, not that I looked too closely.)

Haven't run into crashes with 1.8.7 myself on Gentoo, albeit I may not use it enough to run into these (I just test godot a bit for packaging, haven't got bug reports either way).

@akien-mga
Member

@ionenwks That's a good call. Here are the build flags used on Fedora (as of Fedora 38 on my VM, but it's likely similar on F39).

$ rpm --eval %build_cflags
-O2 -flto=auto -ffat-lto-objects -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-U_FORTIFY_SOURCE,-D_FORTIFY_SOURCE=3 -Wp,-D_GLIBCXX_ASSERTIONS -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -fstack-protector-strong -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1  -m64  -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer 

$ rpm --eval %build_cxxflags
-O2 -flto=auto -ffat-lto-objects -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-U_FORTIFY_SOURCE,-D_FORTIFY_SOURCE=3 -Wp,-D_GLIBCXX_ASSERTIONS -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -fstack-protector-strong -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1  -m64  -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer 

$ rpm --eval %build_ldflags
-Wl,-z,relro -Wl,--as-needed  -Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1  -Wl,--build-id=sha1

@ZwieBit

ZwieBit commented Jan 17, 2024

Just another confirmation: after having crashes every 10 minutes, I've now been using the Godot editor for almost 4 hours without any issues after compiling libx11 myself.

@novalis
Contributor

novalis commented Feb 27, 2024

(FWIW, this isn't RedHat-specific: I just got it with Debian, using libx11 1.8.7-1. FWIW I'm running Wayland.)

The code in xcb_io.c looks like some hairy thing that's trying hard to be thread-safe. But if it's not successful, then this is the sort of error one might expect to see. And certainly different compilation options could affect how aggressively stuff gets reordered, which could affect thread safety. So I thought, why doesn't Godot just put a mutex around the call? And then I looked, and it already had -- in most cases.

But consider the following rule (from xcb_io.c):

 * A single thread cannot be both the the event-reading and the
 * reply-reading thread at the same time.

So we would expect the call to XQueryTree (which, inside libX11, calls _XReply, which seems to be "reply-reading") to also lock the same mutex -- but it doesn't.

So it seems possible that there's a race there. XGetWindowProperty is another possible culprit (only the calls in screen_get_usable_rect), as is XGetInputFocus. I haven't fully audited the code to see whether there are other cases beyond these three.

I have only read the code, so this could be totally bogus. But it seems plausible.
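
The locking shape suggested here can be sketched without X11 at all. What follows is a minimal model in C, not Godot's actual code: events_mutex, fake_reply_call and call_with_events_lock are hypothetical names, and the stub stands in for a reply-reading call such as XQueryTree. It assumes a plain, non-recursive pthread mutex, for which pthread_mutex_trylock reports EBUSY when the lock is already held.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the mutex that already guards event polling. */
static pthread_mutex_t events_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Records whether the stub saw the lock as held when it ran. */
static bool lock_was_held_during_call = false;

/* Returns true if the (non-recursive) mutex is currently locked. */
bool events_mutex_is_locked(void) {
    int rc = pthread_mutex_trylock(&events_mutex);
    if (rc == 0) {
        /* We acquired it, so nobody held it; release it again. */
        pthread_mutex_unlock(&events_mutex);
        return false;
    }
    return rc == EBUSY;
}

/* Hypothetical stand-in for a reply-reading Xlib call such as XQueryTree. */
int fake_reply_call(int window_id) {
    lock_was_held_during_call = events_mutex_is_locked();
    return window_id + 1; /* Placeholder result. */
}

/* Runs a reply-reading call while holding the event-polling mutex. */
int call_with_events_lock(int (*call)(int), int arg) {
    pthread_mutex_lock(&events_mutex);
    int result = call(arg);
    pthread_mutex_unlock(&events_mutex);
    return result;
}
```

Calling call_with_events_lock(fake_reply_call, 41) runs the stub with the mutex held and releases it afterwards. Whether the real calls can take Godot's existing mutex without deadlocking is something this sketch does not answer.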

@Lamby777

> @Lamby777 Upgrading libx11 works, too. I compiled and installed the latest libx11 from master on fedora 39 by doing
>
> https://gitlab.freedesktop.org/xorg/lib/libx11/-/tree/master
>
> ./autogen.sh
> ./configure --prefix=/usr
> make
> sudo make install
>
> and then reboot and I haven't had a crash yet.

it's been annoying me so much that i finally decided to go looking for this thread again... :P

sadly, it doesn't work :(

Not that compiling your own version doesn't work; that I don't know. The actual compiling part doesn't work. I tried ./autogen.sh and it was giving some error about xorg macros not being installed so i installed this package called xorg-util-macros and now it's complaining about some macro XTRANS_CONNECTION_FLAGS being possibly undefined... Is the macro package I installed just outdated? Cuz i just pulled libx11 source from master so maybe they changed some macros that haven't been put onto arch repos yet. Is that even the right package to install? Seems to be, since the error's gone, but idk

@Lamby777

At least the error message apologizes, which I found somewhat amusing.

@imaducklol

I'll add another data point I guess, got the same crash on two different machines:
Godot 4.2.1, Fedora 39 (Gnome 45 on Wayland), libX11 1.8.7, i3-12100F, 5700xt, (Running godot under xwayland)
Godot 4.2.1, Arch (Gnome 46 on Xorg), libX11 1.8.9, i7-1185G7, Iris Xe, (Running godot under xorg)

Happened on both machines at around twice per hour. Wasn't able to pick up on anything specifically that caused them.
I'll report back if I compile libX11 from source and that fixes anything.

@Tichau

Tichau commented May 28, 2024

Don't know if it's useful, but here is a new repro:

[xcb] Unknown request in queue while dequeuing
[xcb] You called XInitThreads, this is not your fault
[xcb] Aborting, sorry about that.
TheGuild: ../../src/xcb_io.c:175: dequeue_pending_request: Assertion `!xcb_xlib_unknown_req_in_deq' failed.
[1] 631819 segmentation fault ./Path/To/Exe

Debian 12
Gnome 43.9
X.Org version: 1.22.1.9

If any other info is needed, I can edit this post.

@ttencate
Contributor

// NOTE: Generated from Xlib 1.6.9.

That version is almost five years old. Is it possible that we're looking at some ABI incompatibility here?

@akien-mga
Member

> // NOTE: Generated from Xlib 1.6.9.
>
> That version is almost five years old. Is it possible that we're looking at some ABI incompatibility here?

That's a good question.

This hypothesis could be tested by someone who can reproduce the issue reliably, by making a custom build with scons use_sowrap=no, which will disable the dynamic library wrappers and link the system libraries instead. To compile successfully, you might need to install more dev libraries (the ones from https://docs.godotengine.org/en/latest/contributing/development/compiling/compiling_for_linuxbsd.html#distro-specific-one-liners, which wasn't updated now that we default to dlopen'ing these deps).

@ttencate
Contributor

Before I read your comment, I tried another track: I replaced the vendored Xlib.h, XKBlib.h and Xutil.h by the ones from my system (Arch Linux, libx11 1.8.9), and re-ran the generator (version cb59cc4fc69a3f05aed6ca6fa998a934788794f4, which is the first one marked as "0.3" in the source) as instructed in the header. The differences are only additions and one replacement of a char* argument by const char* in XkbOpenDisplay. It still crashes.

I can reproduce it fairly reliably at the moment: my game usually crashes within tens of seconds. The editor fares better, but also crashes about once an hour or so.

With use_sowrap=no, it initially seemed a bit better, but after a few minutes it also crashed.

For the record, here are the commands I used to build (it gets simpler without mono):

$ git checkout 4.2.2-stable
$ scons platform=linuxbsd target=editor arch=x86_64 module_mono_enabled=yes use_sowrap=no
$ bin/godot.linuxbsd.editor.x86_64.mono --headless --generate-mono-glue modules/mono/glue
$ ./modules/mono/build_scripts/build_assemblies.py --godot-output-dir=./bin

@ttencate
Contributor

ttencate commented May 29, 2024

Summarizing the reports above:

  • libx11 1.8.2 - OK (unless patched)
  • libx11 1.8.3 - broken
  • libx11 1.8.4 - mixed
  • libx11 1.8.5 - OK (but see below)
  • libx11 1.8.6 - broken
  • libx11 1.8.7 - broken
  • libx11 1.8.8 - unknown
  • libx11 1.8.9 - broken

The only difference between 1.8.5 and 1.8.6 is 304a654, which seems unrelated to me. So I'm inclined to assume that there was only one breakage, not two, and 1.8.5 is broken as well.

I tried rebuilding the Arch package from the official PKGBUILD. Even with this, I could not trigger the crash! So for Arch users, this is a local workaround. After reinstalling the official binary package, I got a crash within a minute or two.

The two libX11.so.6.4.0 files are indeed different, but I can't tell if the differences are meaningful. Addresses and orders are different, but the list of exported symbols is the same. The two libX11-xcb.so.1.0.0 files are the same size (13976 bytes), and the diff is small:

--- official-xcb.hex	2024-05-29 12:23:39.711349095 +0200
+++ mine-xcb.hex	2024-05-29 12:23:45.944791641 +0200
@@ -45,8 +45,8 @@
 000002c0: 0300 0000 0000 0000 0100 01c0 0400 0000  ................
 000002d0: 0100 0000 0000 0000 0200 01c0 0400 0000  ................
 000002e0: 0000 0000 0000 0000 0400 0000 1400 0000  ................
-000002f0: 0300 0000 474e 5500 cf41 1b11 8b82 2d5d  ....GNU..A....-]
-00000300: ad43 7b2d 9bad ab79 884a bc70 0000 0000  .C{-...y.J.p....
+000002f0: 0300 0000 474e 5500 7e7a c198 eaf0 26ab  ....GNU.~z....&.
+00000300: 1636 87ec 7be5 6f04 a5b9 943b 0000 0000  .6..{.o....;....
 00000310: 0200 0000 0500 0000 0100 0000 0600 0000  ................
 00000320: 0000 0200 0005 0008 0500 0000 0600 0000  ................
 00000330: 6be4 cc2e 3b9a cb9a 0000 0000 0000 0000  k...;...........
@@ -767,10 +767,10 @@
 00002fe0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 00002ff0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 00003000: 0040 0000 0000 0000 4743 433a 2028 474e  .@......GCC: (GN
-00003010: 5529 2031 332e 322e 3120 3230 3233 3038  U) 13.2.1 202308
-00003020: 3031 0000 6c69 6258 3131 2d78 6362 2e73  01..libX11-xcb.s
+00003010: 5529 2031 342e 312e 3120 3230 3234 3035  U) 14.1.1 202405
+00003020: 3037 0000 6c69 6258 3131 2d78 6362 2e73  07..libX11-xcb.s
 00003030: 6f2e 312e 302e 302e 6465 6275 6700 0000  o.1.0.0.debug...
-00003040: 164f e2c3 002e 7368 7374 7274 6162 002e  .O....shstrtab..
+00003040: fdef bdc9 002e 7368 7374 7274 6162 002e  ......shstrtab..
 00003050: 6e6f 7465 2e67 6e75 2e70 726f 7065 7274  note.gnu.propert
 00003060: 7900 2e6e 6f74 652e 676e 752e 6275 696c  y..note.gnu.buil
 00003070: 642d 6964 002e 676e 752e 6861 7368 002e  d-id..gnu.hash..

This does give a clue: apparently the official binary package was compiled with GCC 13.2.1, whereas I'm using GCC 14.1.1. This explains the differences in libX11.so.6.4.0 as well. But I don't think GCC is to blame here – it's probably just a subtle difference that causes the actual (probably thread-related) bug to manifest or not.

Not being able to reproduce this in my own build, even before adding debug information, makes this thing very hard to debug, but I'll keep trying.

@ttencate
Contributor

I installed the gcc13 package and used it to compile libX11 again from the official PKGBUILD, but modified with CC=gcc-13 CPP=cpp-13 AR=gcc-ar-13 NM=gcc-nm-13 RANLIB=gcc-ranlib-13 before the ./configure command. (Not sure all of these are necessary or even correct; CC is the main one.) Even this didn't help to reproduce the crash.


Something I found in the core dump: at the time of the crash, there were two threads interacting with xcb. The main thread, that aborted:

...
#19 0x00007418b78c1c67 in __assert_fail (
    assertion=assertion@entry=0x7418b6d64528 "!xcb_xlib_unknown_req_in_deq", 
    file=file@entry=0x7418b6d644df "xcb_io.c", line=line@entry=175, 
    function=function@entry=0x7418b6d77310 <__PRETTY_FUNCTION__.6> "dequeue_pending_request") at assert.c:103
#20 0x00007418b6cfbcef in dequeue_pending_request (dpy=dpy@entry=0x62c40271a710, req=req@entry=0x74187000c270)
    at /usr/src/debug/libx11/libX11-1.8.9/src/xcb_io.c:175
#21 0x00007418b6cfec95 in poll_for_response (dpy=dpy@entry=0x62c40271a710)
    at /usr/src/debug/libx11/libX11-1.8.9/src/xcb_io.c:381
#22 0x00007418b6d019b2 in _XEventsQueued (dpy=0x62c40271a710, mode=<optimized out>)
    at /usr/src/debug/libx11/libX11-1.8.9/src/xcb_io.c:441
#23 0x00007418b6cdecdf in XFlush (dpy=0x62c40271a710) at /usr/src/debug/libx11/libX11-1.8.9/src/Flush.c:39
#24 0x000062c3f636ae5c in DisplayServerX11::_wait_for_events (this=this@entry=0x62c4026fbe50)
    at platform/linuxbsd/x11/display_server_x11.cpp:4048
#25 0x000062c3f636d070 in DisplayServerX11::_poll_events (this=0x62c4026fbe50)
    at platform/linuxbsd/x11/display_server_x11.cpp:4074
#26 0x000062c3f9fc3e2d in Thread::callback (p_caller_id=<optimized out>, p_settings=..., 
    p_callback=0x62c3f636d0b0 <DisplayServerX11::_poll_events_thread(void*)>, p_userdata=0x62c4026fbe50)
    at core/os/thread.cpp:61
#27 0x000062c3fa8a60e4 in execute_native_thread_routine ()
#28 0x00007418b791fded in start_thread (arg=<optimized out>) at pthread_create.c:447
#29 0x00007418b79a30dc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

And a thread that appears to belong to AMD's Vulkan driver:

#0  0x00007418b799539d in __GI___poll (fds=fds@entry=0x74187edffae8, nfds=nfds@entry=1, 
    timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007418b6ca420b in poll (__timeout=-1, __nfds=1, __fds=0x74187edffae8) at /usr/include/bits/poll2.h:39
#2  _xcb_conn_wait (c=c@entry=0x62c40271b9d0, vector=vector@entry=0x0, count=count@entry=0x0, 
    cond=<optimized out>) at /usr/src/debug/libxcb/libxcb-1.17.0/src/xcb_conn.c:510
#3  0x00007418b6ca629b in _xcb_conn_wait (count=0x0, vector=0x0, cond=<optimized out>, c=0x62c40271b9d0)
    at /usr/src/debug/libxcb/libxcb-1.17.0/src/xcb_conn.c:476
#4  xcb_wait_for_special_event (c=0x62c40271b9d0, se=0x62c402b86190)
    at /usr/src/debug/libxcb/libxcb-1.17.0/src/xcb_in.c:806
#5  0x00007418a31c18f0 in ?? () from /usr/lib/amdvlk64.so
#6  0x00007418a31bd495 in ?? () from /usr/lib/amdvlk64.so
#7  0x00007418a31df714 in ?? () from /usr/lib/amdvlk64.so
#8  0x00007418a322ed61 in ?? () from /usr/lib/amdvlk64.so
#9  0x00007418b791fded in start_thread (arg=<optimized out>) at pthread_create.c:447
#10 0x00007418b79a30dc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

The latter is hanging in a poll call, so it wasn't actively racing at the time of the crash, but it's an interesting tidbit that might be a reason why Godot suffers from this bug and other applications don't. I tried the vkcube spinning-cube Vulkan demo; I couldn't get it to crash, but upon killing it with SIGQUIT (Ctrl+\), it shows the same amdvlk backtrace in its core dump as well.

@ttencate
Contributor

ttencate commented May 29, 2024

Line numbers refer to libx11 1.8.9, although the file src/xcb_io.c hasn't been touched in two years.

On line 319 in poll_for_response(), we set:

        req = dpy->xcb->pending_requests;

There is no code that modifies the req pointer in the meantime. Then, if there is actually a pending request and some other conditions hold, the pending request is dequeued:

        dequeue_pending_request(dpy, req);

And the first thing that function does is fail the assertion:

    if (req != dpy->xcb->pending_requests)
        throw_thread_fail_assert("Unknown request in queue while "
                                 "dequeuing",
                                 xcb_xlib_unknown_req_in_deq);

Since req is a local variable and hasn't been changed, this must mean that dpy->xcb->pending_requests has been changed in the meantime. The culprit must have been either some invalid memory access on the same thread, or a race condition from a different thread. My money is on the latter. (It could theoretically also have been some callback that performed a reentrant libx11 call, but I don't see any place where callbacks are invoked here; also, it would imply a lack of locking somewhere, same as a threading issue.)
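
To make that failure mode concrete, here is a small single-threaded model of the check, written as a sketch rather than a copy of libX11. The Model* types and model_* functions are hypothetical, and the "other thread" is simulated by an extra dequeue between taking the snapshot and using it. Instead of aborting the way throw_thread_fail_assert does, the model returns an error code.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for libX11's pending-request list. */
typedef struct ModelRequest {
    unsigned long sequence;
    struct ModelRequest *next;
} ModelRequest;

typedef struct {
    ModelRequest *pending_requests; /* Head of the queue. */
} ModelDisplay;

/* Pops the head of the queue, but only if the caller's snapshot still
 * matches it. Returns 0 on success, or -1 in the situation where libX11
 * would abort with "Unknown request in queue while dequeuing". */
int model_dequeue(ModelDisplay *dpy, ModelRequest *req) {
    if (req == NULL || req != dpy->pending_requests) {
        return -1;
    }
    dpy->pending_requests = req->next;
    return 0;
}

/* Mimics the flow described above: snapshot the head, optionally let
 * "someone else" dequeue in between, then dequeue using the snapshot. */
int model_poll(ModelDisplay *dpy, int simulate_interference) {
    ModelRequest *req = dpy->pending_requests; /* Local snapshot. */
    if (simulate_interference) {
        /* Stands in for an unlocked modification from another thread. */
        model_dequeue(dpy, dpy->pending_requests);
    }
    return model_dequeue(dpy, req);
}
```

Without interference, model_poll succeeds and advances the queue. With the simulated extra dequeue, the snapshot is stale by the time it is used and the call fails, which corresponds to the assertion seen in the logs.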

It should be noted that we are in an XFlush() call, which is a critical section, calling LockDisplay() at the start and UnlockDisplay() at the end. So if this is a threading issue, we'd want to look for places that modify pending_requests without issuing such a lock.

There are only two such places that matter: append_pending_request and dequeue_pending_request. So I set a conditional breakpoint in both, with the condition dpy->lock->mutex->__data->__owner == 0 (relying on some pthread internals to check if the mutex is locked). After a few minutes, the breakpoint was hit, yielding the following stack trace:

#0  dequeue_pending_request (dpy=dpy@entry=0x55555cd1fde0, req=req@entry=0x55556a1df6f0)
    at /usr/src/debug/libx11/libX11-1.8.9/src/xcb_io.c:174
#1  0x00007ffff7103343 in _XReply (dpy=0x55555cd1fde0, rep=0x7fffffffdb00, extra=0, discard=0)
    at /usr/src/debug/libx11/libX11-1.8.9/src/xcb_io.c:736
#2  0x00007ffff70e40f4 in XGetWindowProperty (dpy=0x55555cd1fde0, w=25165826, property=372, offset=0, 
    length=32, delete=<optimized out>, req_type=4, actual_type=0x7fffffffdbb8, actual_format=0x7fffffffdbb4, 
    nitems=0x7fffffffdbc0, bytesafter=0x7fffffffdbc8, prop=0x7fffffffdbd0)
    at /usr/src/debug/libx11/libX11-1.8.9/src/GetProp.c:69
#3  0x0000555555af1360 in DisplayServerX11::_window_minimize_check (this=this@entry=0x55555ccfc9f0, 
    p_window=p_window@entry=0) at platform/linuxbsd/x11/display_server_x11.cpp:2375
#4  0x0000555555af167f in DisplayServerX11::window_get_mode (this=0x55555ccfc9f0, p_window=0)
    at platform/linuxbsd/x11/display_server_x11.cpp:2705
#5  0x0000555555aeba48 in DisplayServerX11::can_any_window_draw (this=0x55555ccfc9f0)
    at platform/linuxbsd/x11/display_server_x11.cpp:2912
#6  0x0000555555b45426 in Main::iteration () at main/main.cpp:3685
#7  0x0000555555ad7311 in OS_LinuxBSD::run (this=this@entry=0x7fffffffddb0)
    at platform/linuxbsd/os_linuxbsd.cpp:958
#8  0x0000555555ac5176 in main (argc=<optimized out>, argv=0x7fffffffe398)
    at platform/linuxbsd/godot_linuxbsd.cpp:74

When continuing the program after the breakpoint is hit, it immediately crashes apologetically.

The API function XGetWindowProperty called from Godot does lock the mutex, but _XReply transiently unlocks it for a while. And apparently, by the time dequeue_pending_request is called here, the mutex is somehow not locked.
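
Why a transient unlock matters can be shown with a toy model. This is only a sketch under stated assumptions: ModelConn, queue_generation and both functions are hypothetical and do not mirror _XReply's real control flow, and the "other thread" is simulated by a callback that runs inside the unlocked window. The point is that anything read before the lock is dropped may be stale once the lock is taken again.

```c
#include <pthread.h>

/* Hypothetical connection state guarded by a single mutex. */
typedef struct {
    pthread_mutex_t lock;
    int queue_generation; /* Bumped whenever the queue is modified. */
} ModelConn;

/* Stands in for another thread touching the queue; it locks properly. */
void model_other_thread_work(ModelConn *conn) {
    pthread_mutex_lock(&conn->lock);
    conn->queue_generation++;
    pthread_mutex_unlock(&conn->lock);
}

/* Models a call that holds the lock, drops it around a blocking wait,
 * and takes it again. Returns 1 if the state seen before the wait is
 * stale afterwards, 0 if nothing changed. */
int model_wait_for_reply(ModelConn *conn, void (*during_wait)(ModelConn *)) {
    pthread_mutex_lock(&conn->lock);
    int seen_before = conn->queue_generation;

    pthread_mutex_unlock(&conn->lock); /* Transient unlock while blocking. */
    if (during_wait) {
        during_wait(conn); /* Other threads may run here. */
    }
    pthread_mutex_lock(&conn->lock);

    int stale = (conn->queue_generation != seen_before);
    pthread_mutex_unlock(&conn->lock);
    return stale;
}
```

Passing model_other_thread_work as the callback makes the function report stale state, even though every individual access was made under the lock.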

This is as far as I got for today. I tried setting more breakpoints in _XReply to find out where the lock is lost, but the breakpoints all end up at the top of the function for some reason, and also seem to interfere with my ability to trigger the crash. Stupid Heisenbug.
