Arch Linux: xcb_xlib_threads_sequence_lost assertion failed

Started by qwattash, July 13, 2021, 05:43:33 PM


qwattash

Hi All,

Has anybody else hit the following crash at startup?
[xcb] Unknown sequence number while processing queue
[xcb] Most likely this is a multi-threaded client and XInitThreads has not been called
[xcb] Aborting, sorry about that.
RimWorldLinux: xcb_io.c:269: poll_for_event: Assertion `!xcb_xlib_threads_sequence_lost' failed.
/home/qwattash/.local/share/Steam/steamapps/common/RimWorld/start_RimWorld.sh: line 27: 21149 Aborted                 (core dumped) LC_ALL=C ./$GAMEFILE $LOG


I tried a clean RimWorld install with nothing subscribed in the workshop, but that does not appear to help.
Offending versions:
RimWorld 1.2.3005 rev1191
libxcb: 1.14-1

Update: I ran RimWorld under gdb and got a stack trace for the SIGABRT. I will follow up if it leads anywhere.
#0  0x00007ffff7c6ad22 in raise () at /usr/lib/libc.so.6
#1  0x00007ffff7c54862 in abort () at /usr/lib/libc.so.6
#2  0x00007ffff7c54747 in _nl_load_domain.cold () at /usr/lib/libc.so.6
#3  0x00007ffff7c63616 in  () at /usr/lib/libc.so.6
#4  0x00007ffff4f1ad2d in  () at /usr/lib/libX11.so.6
#5  0x00007ffff4f1adc8 in  () at /usr/lib/libX11.so.6
#6  0x00007ffff4f1b182 in _XEventsQueued () at /usr/lib/libX11.so.6
#7  0x00007ffff4f1e176 in _XGetRequest () at /usr/lib/libX11.so.6
#8  0x00007ffff4f09395 in XNoOp () at /usr/lib/libX11.so.6
#9  0x00007ffff49036e3 in  () at /usr/lib/libGLX_mesa.so.0
#10 0x00007ffff4ab5428 in  () at /usr/lib/libGLX.so.0
#11 0x00000000013c3674 in  ()
#12 0x000000000138b1f1 in  ()
#13 0x0000000000dde8a8 in  ()
#14 0x0000000000de0bee in  ()
#15 0x0000000000de0c90 in  ()
#16 0x0000000000dcb8da in  ()
#17 0x0000000000436446 in  ()
#18 0x00007ffff7c55b25 in __libc_start_main () at /usr/lib/libc.so.6
#19 0x0000000000445d93 in  ()
#20 0x00007fffffffdb08 in  ()
#21 0x000000000000001c in  ()
#22 0x0000000000000003 in  ()
#23 0x00007fffffffdeec in  ()
#24 0x00007fffffffdf36 in  ()
#25 0x00007fffffffdf40 in  ()
#26 0x0000000000000000 in  ()

The error message mentions XInitThreads not being called, but I can confirm that it is called before the crash, so the cause must be something else.

qwattash

Update:
It looks like a Mesa update is at fault here. I downgraded to mesa 20.3.4-3 and this appears to fix the issue, although I have not yet been able to track the bug down in Mesa GLX / libX11.
The breakage was likely introduced with mesa 21.x: I tested both 21.1.2 and 21.1.4, and both cause the crash.
With mesa 20.3.4 I'm currently getting a SIGSEGV when closing the game, from `XCloseDisplay()`, which ends up calling `fclose()` from i965_dri.so, the Intel direct rendering library. But at least the game is runnable.

qwattash

Update 2:
So I got a debug build of Mesa. I believe the `XNoOp()` call added to `glXCreateContextAttribsARB(..)` is what triggers the symptom. It seems to have been introduced here:
commit 960c86d6787437b643825baa230bc0cd7f9f7540
Author: Bastian Beranek <[email protected]>
Date:   Sat May 1 09:52:01 2021 +0200

    glx: Assign unique serial number to GLXBadFBConfig error

    Since commit f39fd3dce72 a new GLX error is issued in case context creation
    fails. This broke wine on certain hardware: While wine installs an error handler
    to ignore this kind of error, it does not function because it expects the
    dpy->request serial number of the error to be incremented since the installation
    of the handler.

    Workaround this by artificially increasing the request number. This also
    guarantees a unique serial number for the error.

    Fixes: f39fd3dce72eaef59ab39a23b75030ef9efc2a40
    Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/3969
    Signed-off-by: Bastian Beranek <[email protected]>
    Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/10565>

diff --git a/src/glx/create_context.c b/src/glx/create_context.c
index e3a513f58f6..7e1cec98c64 100644
--- a/src/glx/create_context.c
+++ b/src/glx/create_context.c
@@ -146,6 +146,9 @@ glXCreateContextAttribsARB(Display *dpy, GLXFBConfig config,
        * somehow on the client side. clean up the server resource and panic.
        */
       xcb_glx_destroy_context(c, xid);
+      /* increment dpy->request in order to give a unique serial number to the
+       * error */
+      XNoOp(dpy);
       __glXSendError(dpy, GLXBadFBConfig, xid, 0, False);
    } else {
       gc->xid = xid;


I am unsure whether the issue is that the caller does not expect to end up in the libX11 event polling from here, or whether something else is going on.
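
For context on why ending up in the event polling matters: the assertion in the log fires from `poll_for_event` in libX11's xcb_io.c. As far as I can tell, the condition behind "Unknown sequence number while processing queue" boils down to an event arriving with a sequence number newer than the last request Xlib has on record. Below is a simplified stand-alone model of that condition. The struct and function names are made up for illustration; this is not the libX11 code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the one piece of Display state that matters here. */
struct model_display {
    uint64_t request; /* serial of the last request Xlib knows it issued */
};

/* Model of what a request such as XNoOp() does to the counter:
 * issuing any request bumps the serial by one. */
static void model_issue_request(struct model_display *dpy)
{
    dpy->request++;
}

/* True when an event carries a serial newer than any request on record.
 * This models (as an assumption) the condition that makes Xlib give up
 * with "Unknown sequence number while processing queue". */
static bool model_sequence_lost(const struct model_display *dpy,
                                uint64_t event_sequence)
{
    return event_sequence > dpy->request;
}
```

With `request` at 5, an event stamped 6 is flagged by `model_sequence_lost()`. After one `model_issue_request()` the same event is accepted. Whether the real crash follows exactly this path is unconfirmed.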

qwattash

Update 3:
The bug appears to be racy. I now have debug builds of both Mesa and libX11, and setting a breakpoint on `XNoOp` sometimes makes the issue disappear. I'll debug this offline and consider it off-topic for this thread at this point.

delirium

I'm very likely not as proficient as you are on Arch, but have you considered trying out proprietary (HERESY) graphics drivers? I found them to deliver better performance and stability. I haven't actually tried them with RimWorld, as I keep my Arch for work only.

delirium

Btw kudos for debugging this all by yourself! I wouldn't wanna touch GDB with a stick.

qwattash

Thanks for the suggestion! I have Intel graphics, so AFAIK the open source drivers in Mesa are directly supported by Intel. I'm not aware of proprietary Intel drivers, but I will look it up!
GDB is not that terrible, but I get that it's not the friendliest debugger out there...  ;) I'd like to switch to radare2 at some point but I haven't gotten around to it yet. With plugins or a GUI for easier visualization it is more bearable, although without debugging symbols it's kind of a mess. That's why I ended up rebuilding libX11 and Mesa locally.

Btw, if you are curious, I have debugged a bit more. I think the crash is triggered by a synchronization issue with libxcb. Essentially, Xlib uses xcb to exchange messages with the X server, and xcb manages the IPC channel for the messages that the client sends to the X server. At some point Xlib grabs the IPC channel to write to it directly, and to do so it needs to synchronize the message sequence numbers. I suspect the failure is triggered when the `XNoOp()` happens to cross the 1-byte boundary of the message sequence number, from 0xff to 0x100. I have no idea why this would be, though.
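
To make "synchronize the message sequence numbers" a bit more concrete: the serials attached to incoming events are narrower than the counter kept on the Xlib side, so they have to be widened against the last value seen. Here is a minimal sketch of such a widening step, assuming a 32-bit narrow serial and a 64-bit wide counter. `model_widen` is a hypothetical name, and the code illustrates the technique rather than reproducing libX11.

```c
#include <stdint.h>

/* Widen a 32-bit serial against the last known 64-bit serial.
 * Assumption: fewer than 2^31 requests separate the two values. */
static uint64_t model_widen(uint64_t last_wide, uint32_t narrow)
{
    /* Start by reusing the upper 32 bits of the last known value. */
    uint64_t candidate = (last_wide & ~UINT64_C(0xFFFFFFFF)) | narrow;

    /* If that makes the serial drop by more than 2^31, assume the low
     * half wrapped around and carry into the upper half. */
    if (candidate + (UINT64_C(1) << 31) < last_wide)
        candidate += UINT64_C(1) << 32;

    return candidate;
}
```

In this model, `model_widen(0xff, 0x100)` simply yields `0x100`, so the sketch does not by itself account for the 0xff to 0x100 suspicion. It only shows the kind of bookkeeping involved.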