Messages - qwattash

#2
Thanks for the suggestion! I have Intel graphics, so AFAIK the open source drivers in Mesa are directly supported by Intel. I'm not aware of proprietary Intel drivers, but I will look it up!
GDB is not that terrible, but I get that it's not the friendliest debugger out there...  ;) I'd like to switch to radare2 at some point but I haven't gotten around to it yet. With plugins or a GUI for easier visualization it is more bearable, although without debugging symbols it's kind of a mess, which is why I ended up rebuilding libx11 and mesa locally.

Btw, if you are curious, I have debugged a bit more. I think the issue is triggered by a synchronization problem with libxcb. Essentially, Xlib uses xcb to exchange messages with the X server, and xcb manages the IPC channel for the messages that the client sends to the X server. At some point Xlib grabs the IPC channel to write to it directly, and to do so it needs to synchronize the message sequence numbers. I suspect this mess is triggered when the `XNoOp()` request happens to cross the 1-byte boundary of the message sequence number, from 0xff to 0x100. I have no idea why this happens, though.
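To make "crossing the 1-byte boundary" concrete, here is a minimal C sketch. It only illustrates the arithmetic of the suspicion above: the helper names are hypothetical, and nothing here claims that Xlib or xcb actually truncates sequence numbers this way.

```c
#include <stdint.h>

/* Hypothetical helper: keep only the low 8 bits of a sequence number. */
uint8_t low_byte(uint64_t sequence)
{
    return (uint8_t)(sequence & 0xffu);
}

/* Returns 1 when stepping from `sequence` to `sequence + 1` carries out of
 * the low byte (for example 0xff -> 0x100), and 0 otherwise. */
int crosses_byte_boundary(uint64_t sequence)
{
    return low_byte(sequence + 1) < low_byte(sequence);
}
```

Any code that compared only the low byte would see the sequence number appear to go backwards at exactly that step, which is the kind of situation the "Unknown sequence number" assertion guards against.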
#3
Hi All,

I have been experiencing crashes when closing RimWorld for some time. After some debugging I think I tracked down what is happening.

System: Arch Linux arch-dell 5.12.15-arch1-1 #1 SMP PREEMPT Wed, 07 Jul 2021 23:35:29 +0000 x86_64 GNU/Linux
RimWorld version: 1.2.3005 rev1191
Mesa version: 21.1.4 (bug reproduced with a Mesa debug build at commit fae28b0fce7)

Reproducing:
Just run the game and quit to OS from the main menu.

Symptom:
When quitting to the OS, the game crashes with SIGSEGV in `fclose()`, called from the Mesa disk_cache teardown, due to a NULL pointer dereference.
Thread 1 "RimWorldLinux" received signal SIGSEGV, Segmentation fault.
0x00007ffff7ca263b in fclose@@GLIBC_2.2.5 () from /usr/lib/libc.so.6
(gdb) bt
#0  0x00007ffff7ca263b in fclose@@GLIBC_2.2.5 () at /usr/lib/libc.so.6
#1  0x00007ffff3ab5eec in foz_destroy (foz_db=foz_db@entry=0x23f3e30)
    at ../src/util/fossilize_db.c:337
#2  0x00007ffff3ab4404 in disk_cache_destroy (cache=0x23f3d10)
    at ../src/util/disk_cache.c:238
#3  0x00007ffff3a45811 in brw_destroy_screen (sPriv=0x23e8360)
    at ../src/mesa/drivers/dri/i965/brw_screen.c:1747
#4  0x00007ffff3ab2a1f in driDestroyScreen (psp=0x23e8360)
    at ../src/mesa/drivers/dri/common/dri_util.c:238
#5  0x00007ffff4b2dd47 in dri3_destroy_screen (base=0x23b8620) at ../src/glx/dri3_glx.c:619
#6  0x00007ffff4b1f28a in FreeScreenConfigs (priv=0x23b5210) at ../src/glx/glxext.c:259
#7  glx_display_free (priv=0x23b5210) at ../src/glx/glxext.c:282
#8  0x00007ffff4b1f3dd in __glXCloseDisplay (dpy=0x2371480, codes=<optimized out>)
    at ../src/glx/glxext.c:331
#9  0x00007ffff4f0230b in XCloseDisplay (dpy=0x2371480) at ClDisplay.c:65
#10 0x000000000138deed in  ()
#11 0x000000000138be0e in  ()
#12 0x00000000013798fc in  ()


Cause:
The call to `fclose()` is made via `disk_cache_destroy()`, which only takes that code path if the environment variable `MESA_DISK_CACHE_SINGLE_FILE` is set.
This assumes that `disk_cache_create()` initialized the file pointer under the same condition.
During RimWorld startup the following happens:
- the `disk_cache_create()` function is called by the DRI layer during a call to `glXChooseVisual()`. This occurs before the MESA env var is set, causing the initialization to be skipped.
#0  disk_cache_create (gpu_name=gpu_name@entry=0x7fffffffc225 "i965_0166",
    driver_id=driver_id@entry=0x7fffffffc230 "fece63fd0bf0705104a035fbcf0dce8de6d956e6",
    driver_flags=0) at ../src/util/disk_cache.c:74
#1  0x00007ffff3a14699 in brw_disk_cache_init (screen=screen@entry=0x23e9640)
    at ../src/mesa/drivers/dri/i965/brw_disk_cache.c:415
#2  0x00007ffff3a48595 in brw_init_screen (dri_screen=<optimized out>)
    at ../src/mesa/drivers/dri/i965/brw_screen.c:2836
#3  0x00007ffff3ab2c46 in driCreateNewScreen2 (scrn=0, fd=4, extensions=<optimized out>,
    driver_extensions=<optimized out>, driver_configs=0x7fffffffce78, data=0x23b8620)
    at ../src/mesa/drivers/dri/common/dri_util.c:160
#4  0x00007ffff4b2e33c in dri3_create_screen (screen=0, priv=<optimized out>)
    at ../src/glx/dri3_glx.c:930
#5  0x00007ffff4b1f689 in AllocAndFetchScreenConfigs (priv=0x23b5210, dpy=0x2371480)
    at ../src/glx/glxext.c:830
#6  __glXInitialize (dpy=dpy@entry=0x2371480) at ../src/glx/glxext.c:953
#7  0x00007ffff4b1ae04 in GetGLXPrivScreenConfig (ppsc=<synthetic pointer>,
    ppriv=<synthetic pointer>, scrn=0, dpy=0x2371480) at ../src/glx/glxcmds.c:173
#8  glXChooseVisual (dpy=0x2371480, screen=0, attribList=0x7fffffffd0c0)
    at ../src/glx/glxcmds.c:1259
#9  0x00000000013c2907 in ?? ()

- the `MESA_DISK_CACHE_SINGLE_FILE` environment variable is set by `SteamAPI_Init()`, as detected by breaking on `setenv`.
Thread 1 "RimWorldLinux" hit Breakpoint 4, 0x00007ffff7c6d0f0 in setenv ()
   from /usr/lib/libc.so.6
(gdb) printf "%s\n", $rdi         /* arg0 */
MESA_DISK_CACHE_SINGLE_FILE
(gdb) p/c *$rsi                       /* arg1 */
$6 = 49 '1'
(gdb) bt
#0  0x00007ffff7c6d0f0 in setenv () at /usr/lib/libc.so.6
#1  0x00007fff81ef4a04 in  () at /home/qwattash/.local/share/Steam/linux64/steamclient.so
#2  0x00007ffff180622b in SteamAPI_Init ()
    at /home/qwattash/.local/share/Steam/steamapps/common/RimWorld/RimWorldLinux_Data/Plugins/libsteam_api.so
#3  0x000000004002a077 in  ()
#4  0x0000000004f4c070 in  ()
#5  0x0000000000000000 in  ()
(gdb)


I do not know enough about how RimWorld works internally or about the Steam API, but it looks like the call ordering may be wrong. If so, simply moving the call to `SteamAPI_Init()` earlier might fix the issue.
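The ordering problem can be sketched in a few lines of C. This is a toy model under stated assumptions, not the Mesa code: `struct toy_cache` and the `toy_*` functions are hypothetical, and the environment check is simplified to "variable is present". The destroy function here deliberately checks the pointer recorded at creation time instead of re-reading the environment. That is a defensive pattern swapped in for illustration, not the reordering fix suggested above.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for the cache object; not the real Mesa struct. */
struct toy_cache {
    FILE *db_file; /* opened only if the env var is set at creation time */
};

/* Simplification: treat the variable as enabled whenever it is present. */
static int single_file_enabled(void)
{
    return getenv("MESA_DISK_CACHE_SINGLE_FILE") != NULL;
}

void toy_cache_create(struct toy_cache *cache)
{
    cache->db_file = NULL;
    if (single_file_enabled())
        cache->db_file = tmpfile();
}

/* Returns 1 if a file was closed, 0 if there was nothing to close.
 * Checking the environment again here instead of the pointer would end up
 * calling fclose(NULL) in the sequence below. */
int toy_cache_destroy(struct toy_cache *cache)
{
    if (cache->db_file == NULL)
        return 0;
    fclose(cache->db_file);
    cache->db_file = NULL;
    return 1;
}

/* Replays the order seen in the backtraces: create, then setenv, then destroy. */
int toy_shutdown_sequence(void)
{
    struct toy_cache cache;

    unsetenv("MESA_DISK_CACHE_SINGLE_FILE");
    toy_cache_create(&cache);                      /* glXChooseVisual() path */
    setenv("MESA_DISK_CACHE_SINGLE_FILE", "1", 1); /* SteamAPI_Init() path */
    return toy_cache_destroy(&cache);              /* XCloseDisplay() path */
}
```

Calling `toy_shutdown_sequence()` returns 0 because no file was ever opened, so the teardown has nothing to close and does not touch a NULL pointer.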

Hope this helps!
#4
Update 3:
The bug appears to be racy. I now have debug builds for both mesa and libx11, and setting a breakpoint in `XNoOp` appears to sometimes skip past the issue. I'll debug this offline and consider this off-topic for this thread at this point.
#5
Update 2:
So I got a debug build of mesa. I believe the addition of the call to `XNoOp()` to `glXCreateContextAttribsARB(..)` is the cause of the symptom. It seems to have been introduced here:
commit 960c86d6787437b643825baa230bc0cd7f9f7540
Author: Bastian Beranek <[email protected]>
Date:   Sat May 1 09:52:01 2021 +0200

    glx: Assign unique serial number to GLXBadFBConfig error

    Since commit f39fd3dce72 a new GLX error is issued in case context creation
    fails. This broke wine on certain hardware: While wine installs an error handler
    to ignore this kind of error, it does not function because it expects the
    dpy->request serial number of the error to be incremented since the installation
    of the handler.

    Workaround this by artificially increasing the request number. This also
    guarantees a unique serial number for the error.

    Fixes: f39fd3dce72eaef59ab39a23b75030ef9efc2a40
    Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/3969
    Signed-off-by: Bastian Beranek <[email protected]>
    Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/10565>

diff --git a/src/glx/create_context.c b/src/glx/create_context.c
index e3a513f58f6..7e1cec98c64 100644
--- a/src/glx/create_context.c
+++ b/src/glx/create_context.c
@@ -146,6 +146,9 @@ glXCreateContextAttribsARB(Display *dpy, GLXFBConfig config,
        * somehow on the client side. clean up the server resource and panic.
        */
       xcb_glx_destroy_context(c, xid);
+      /* increment dpy->request in order to give a unique serial number to the
+       * error */
+      XNoOp(dpy);
       __glXSendError(dpy, GLXBadFBConfig, xid, 0, False);
    } else {
       gc->xid = xid;


I am unsure whether the issue lies with the caller not expecting to get into the libx11 event polling from here or there is something else going on.
#6
Update:
Looks like a mesa update is at fault here. I downgraded to mesa 20.3.4-3 and this appears to fix the issue, although I have not yet been able to track the bug down within mesa GLX / libx11.
The breakage was likely introduced with mesa 21.x: I tested both 21.1.2 and 21.1.4, and both cause the crash.
With mesa 20.3.4 I'm currently getting a SIGSEGV when closing the game, from `XCloseDisplay()`, which ends up calling `fclose()` from the i965_dri.so Intel direct rendering library. But at least the game is runnable.
#7
Hi All,

Anybody else got the following crash at startup?
[xcb] Unknown sequence number while processing queue
[xcb] Most likely this is a multi-threaded client and XInitThreads has not been called
[xcb] Aborting, sorry about that.
RimWorldLinux: xcb_io.c:269: poll_for_event: Assertion `!xcb_xlib_threads_sequence_lost' failed.
/home/qwattash/.local/share/Steam/steamapps/common/RimWorld/start_RimWorld.sh: line 27: 21149 Aborted                 (core dumped) LC_ALL=C ./$GAMEFILE $LOG


I tried a clean RimWorld install with nothing subscribed in the Workshop, but that does not appear to help.
Offending versions:
RimWorld 1.2.3005 rev1191
libxcb: 1.14-1

Update: I ran RimWorld under gdb and got a stack trace for the SIGABRT. Will follow up if it leads somewhere.
#0  0x00007ffff7c6ad22 in raise () at /usr/lib/libc.so.6
#1  0x00007ffff7c54862 in abort () at /usr/lib/libc.so.6
#2  0x00007ffff7c54747 in _nl_load_domain.cold () at /usr/lib/libc.so.6
#3  0x00007ffff7c63616 in  () at /usr/lib/libc.so.6
#4  0x00007ffff4f1ad2d in  () at /usr/lib/libX11.so.6
#5  0x00007ffff4f1adc8 in  () at /usr/lib/libX11.so.6
#6  0x00007ffff4f1b182 in _XEventsQueued () at /usr/lib/libX11.so.6
#7  0x00007ffff4f1e176 in _XGetRequest () at /usr/lib/libX11.so.6
#8  0x00007ffff4f09395 in XNoOp () at /usr/lib/libX11.so.6
#9  0x00007ffff49036e3 in  () at /usr/lib/libGLX_mesa.so.0
#10 0x00007ffff4ab5428 in  () at /usr/lib/libGLX.so.0
#11 0x00000000013c3674 in  ()
#12 0x000000000138b1f1 in  ()
#13 0x0000000000dde8a8 in  ()
#14 0x0000000000de0bee in  ()
#15 0x0000000000de0c90 in  ()
#16 0x0000000000dcb8da in  ()
#17 0x0000000000436446 in  ()
#18 0x00007ffff7c55b25 in __libc_start_main () at /usr/lib/libc.so.6
#19 0x0000000000445d93 in  ()
#20 0x00007fffffffdb08 in  ()
#21 0x000000000000001c in  ()
#22 0x0000000000000003 in  ()
#23 0x00007fffffffdeec in  ()
#24 0x00007fffffffdf36 in  ()
#25 0x00007fffffffdf40 in  ()
#26 0x0000000000000000 in  ()

The error message mentions XInitThreads not being called but I can confirm that it is being called before the crash, so the issue is different.