linux

Rant : SystemD is too complex for Linux distributors

I could say too complex for Linux distributions, but it looks more like a lack of understanding of how SystemD works, and the fact that unit files get installed without asking leading to blocking issues.

The story

During the beginning of September, this server started to be inaccessible. I guess it’s during this period. Turns out that I checked my blog during August and it was still up !
Then holidays ended, I went back to work and… I checked my blog back at the start of October.

Note that, I receive emails from the hosting company I use, if the servers were to be “inaccessible” from their point of view.
But they only check their local network.

So, I tried to reach https://miouyouyou.fr … “Connection refused”….
Uh-oh…
Ok, let’s start a SSH shell ! “Connection refused”.
!!? What’s happening !? Was my server hijacked !? What the fuck !

I then tried to use my provider (Scaleway) “Web console” : Nothing.
(Spoiler alert : Turns out that Scaleway web console sucks)

Ugh… Okay… I got locked out from my server ? Is that some kind of “HA-HA ! You forgot to secure this part of your server ! pwned you, m0r0n !” ?

Let’s try a nmap -sS -v on my server !

Discovered open port 22/tcp

Wait, what !? What’s running on port 22 !? The SSH server is supposed to run on another port !

… I changed my client ~/.ssh/config to use port 22 instead of the “configured” port. Then I retried to get a SSH Shell and…
Got a shell on my server !

… ?

Alright… ps auxww… nothing unusual…
Checked dmesg, checked journalctl… Nothing unusual !

Threw a tcpdump not port 22 for kicks… nothing unusual…

Maybe the machine wasn’t hacked ? I’m flairing some retarded system update now…

… Let’s see if something happens if we put the the system back together for the moment… I mean, I’m just hosting a static blog which content is available on a public git repository, which I can redeploy at any moment so, if it blows, I’ll order a new server unit.

docker container ls … Ok, the containers are down… iptables -L … The firewall was reset ?

Fine, ran my script to restablish the firewall, deleted all the containers, updated the docker images, deployed the new instances with docker-compose AAAND, then, I tried to reach https://miouyouyou.fr
Success !
I got my blog back !

Updated the SSL certificates, the OCSP staples : Alright !

Then I tried to create a script to automate the blog updates as much as possible, discovered that there’s some big differences between the Hugo 0.46 on my machine and the Hugo 0.56 on the server, which fucked up my templates real bad.
After a few hours, checking each Hugo release note, to understand which update fucked up the templates, I pinpointed the issue, updated the templates and Voila : My blog is restored !

Now, I can focus on the real issue !

WHY THE FUCK IS MY SSH SERVER LISTENING ON PORT 22 !

Checked /etc/ssh/sshd_service … It’s clearly written to listen on another port.

I don’t get it… Either sshd is executed with specific instructions to not listen on this port or there’s a rogue ssh server !

root# ps auxww | grep ssh
root      2119  0.0  0.3  14272  7572 ?        Ss   18:29   0:00 sshd: root@pts/0

… ? sshd: root@pts/0 … ? Whaaat ?

Maybe there’s an environment variable fucking with the sshd server ?

root# ps auxwwe | grep sshd
root      2119  0.0  0.3  14272  7572 ?        Ss   18:29   0:01 sshd: root@pts/0                                                                                                                                                                                                                                          =
root# systemctl status ssh
● ssh.service - OpenBSD Secure Shell server
   Loaded: loaded (/lib/systemd/system/ssh.service; enabled; vendor preset: enabled)
   Active: inactive (dead)
     Docs: man:sshd(8)
           man:sshd_config(5)

DEAD !!!? WHO’S LAUNCHING THIS SERVER THEN ?

… Why don’t I see any real sshd server with ps auxwww ? Here, it’s listing my connection as a server… ??? Whut ?
I can’t see anything related to sshd “as-is”.

Maybe it’s executed from the initrd file ? Hmm… This server is certainly booted from the network so I’m not going to access the initrd easily… (Well, turns out that I could actually, but didn’t know that back then).
But even then, I should see it in the list…

Wait, if I run the ssh service with systemctl start ssh… and then try to connect on the good port… Success !

WHHAAAAT !?

Then WHY !? Why systemd refused to execute this service on startup !?

… Maybe it’s not executed on startup ?

How do I check the services ran on startup again ?

root# systemctl -t service --state=active
UNIT                                               LOAD   ACTIVE SUB     DESCRIPTION                                                                  
blk-availability.service                           loaded active exited  Availability of block devices                                                
containerd.service                                 loaded active running containerd container runtime                                                 
cron.service                                       loaded active running Regular background program processing daemon                                 
dbus.service                                       loaded active running D-Bus System Message Bus                                                     
docker.service                                     loaded active running Docker Application Container Engine                                          
exim4.service                                      loaded active running LSB: exim Mail Transport Agent                                               
getty@tty1.service                                 loaded active running Getty on tty1                                                                
getty@ttyAMA0.service                              loaded active running Getty on ttyAMA0                                                             
haveged.service                                    loaded active running Entropy daemon using the HAVEGE algorithm                                    
kmod-static-nodes.service                          loaded active exited  Create list of required static device nodes for the current kernel           
lvm2-monitor.service                               loaded active exited  Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling
ntp.service                                        loaded active running Network Time Service                                                         
polkit.service                                     loaded active running Authorization Manager                                                        
rsyslog.service                                    loaded active running System Logging Service                                                       
serial-getty@ttyAMA0.service                       loaded active running Serial Getty on ttyAMA0                                                      
ssh@0-10.X.Y.Z:22-A.B.C.D:39924.service loaded active running OpenBSD Secure Shell server per-connection daemon                            
sysstat.service                                    loaded active exited  Resets System Activity Data Collector                                        
systemd-journal-flush.service                      loaded active exited  Flush Journal to Persistent Storage                                          
systemd-journald.service                           loaded active running Journal Service                                                              
systemd-logind.service                             loaded active running Login Service                                                                
systemd-modules-load.service                       loaded active exited  Load Kernel Modules                                                          
systemd-networkd-wait-online.service               loaded active exited  Wait for Network to be Configured                                            
systemd-networkd.service                           loaded active running Network Service                                                              
systemd-random-seed.service                        loaded active exited  Load/Save Random Seed                                                        
systemd-remount-fs.service                         loaded active exited  Remount Root and Kernel File Systems                                         
systemd-resolved.service                           loaded active running Network Name Resolution                                                      
systemd-sysctl.service                             loaded active exited  Apply Kernel Variables                                                       
systemd-sysusers.service                           loaded active exited  Create System Users                                                          
systemd-tmpfiles-setup-dev.service                 loaded active exited  Create Static Device Nodes in /dev                                           
systemd-tmpfiles-setup.service                     loaded active exited  Create Volatile Files and Directories                                        
systemd-udev-trigger.service                       loaded active exited  udev Coldplug all Devices                                                    
systemd-udevd.service                              loaded active running udev Kernel Device Manager                                                   
systemd-update-utmp.service                        loaded active exited  Update UTMP about System Boot/Shutdown                                       
systemd-user-sessions.service                      loaded active exited  Permit User Sessions                                                         
ufw.service                                        loaded active exited  Uncomplicated firewall                                                       
unattended-upgrades.service                        loaded active running Unattended Upgrades Shutdown                                                 
user-runtime-dir@0.service                         loaded active exited  User Runtime Directory /run/user/0                                           
user@0.service                                     loaded active running User Manager for UID 0     

The IP were replaced by 10.X.Y.Z and A.B.C.D in this copy.

… What the fuck ? There’s a service executed for my IP, but there’s no ssh.service running the server ?

This shit is insane ! What the fuck is wrong with SystemD !?

No wait… maybe I’m blaming SystemD while it’s actually System-V generating issues.

root# grep sshd /etc/* -r
/etc/default/ssh:# Options to pass to sshd
/etc/init.d/ssh:# Provides:             sshd
/etc/init.d/ssh:test -x /usr/sbin/sshd || exit 0
/etc/init.d/ssh:( /usr/sbin/sshd -\? 2>&1 | grep -q OpenSSH ) 2>/dev/null || exit 0
/etc/init.d/ssh:    # forget it if we're trying to start, and /etc/ssh/sshd_not_to_be_run exists
/etc/init.d/ssh:    if [ -e /etc/ssh/sshd_not_to_be_run ]; then 
/etc/init.d/ssh:            log_action_msg "OpenBSD Secure Shell server not in use (/etc/ssh/sshd_not_to_be_run)" || true
/etc/init.d/ssh:    if [ ! -d /run/sshd ]; then
/etc/init.d/ssh:        mkdir /run/sshd
/etc/init.d/ssh:        chmod 0755 /run/sshd
/etc/init.d/ssh:    if [ ! -e /etc/ssh/sshd_not_to_be_run ]; then
/etc/init.d/ssh:        /usr/sbin/sshd $SSHD_OPTS -t || exit 1
/etc/init.d/ssh:        log_daemon_msg "Starting OpenBSD Secure Shell server" "sshd" || true
/etc/init.d/ssh:        if start-stop-daemon --start --quiet --oknodo --chuid 0:0 --pidfile /run/sshd.pid --exec /usr/sbin/sshd -- $SSHD_OPTS; then
/etc/init.d/ssh:        log_daemon_msg "Stopping OpenBSD Secure Shell server" "sshd" || true
/etc/init.d/ssh:        if start-stop-daemon --stop --quiet --oknodo --pidfile /run/sshd.pid --exec /usr/sbin/sshd; then
/etc/init.d/ssh:        log_daemon_msg "Reloading OpenBSD Secure Shell server's configuration" "sshd" || true
/etc/init.d/ssh:        if start-stop-daemon --stop --signal 1 --quiet --oknodo --pidfile /run/sshd.pid --exec /usr/sbin/sshd; then
/etc/init.d/ssh:        log_daemon_msg "Restarting OpenBSD Secure Shell server" "sshd" || true
/etc/init.d/ssh:        start-stop-daemon --stop --quiet --oknodo --retry 30 --pidfile /run/sshd.pid --exec /usr/sbin/sshd
/etc/init.d/ssh:        if start-stop-daemon --start --quiet --oknodo --chuid 0:0 --pidfile /run/sshd.pid --exec /usr/sbin/sshd -- $SSHD_OPTS; then
/etc/init.d/ssh:        log_daemon_msg "Restarting OpenBSD Secure Shell server" "sshd" || true
/etc/init.d/ssh:        start-stop-daemon --stop --quiet --retry 30 --pidfile /run/sshd.pid --exec /usr/sbin/sshd || RET="$?"
/etc/init.d/ssh:                if start-stop-daemon --start --quiet --oknodo --chuid 0:0 --pidfile /run/sshd.pid --exec /usr/sbin/sshd -- $SSHD_OPTS; then
/etc/init.d/ssh:        status_of_proc -p /run/sshd.pid /usr/sbin/sshd sshd && exit 0 || exit $?
grep: /etc/motd: No such file or directory
/etc/pam.d/sshd:# access limits that are hard to express in sshd_config.
/etc/passwd:sshd:x:109:65534::/run/sshd:/usr/sbin/nologin
/etc/passwd-:sshd:x:109:65534::/run/sshd:/usr/sbin/nologin
/etc/shadow:sshd:*:17346:0:99999:7:::
/etc/shadow-:sshd:*:17346:0:99999:7:::
/etc/ssh/sshd_config:# See the sshd_config(5) manpage for details
/etc/ssh/sshd_config:# Use these options to restrict which interfaces/protocols sshd will bind to
/etc/ssh/sshd_config.ucf-dist:# $OpenBSD: sshd_config,v 1.103 2018/04/09 20:41:22 tj Exp $
/etc/ssh/sshd_config.ucf-dist:# This is the sshd server system-wide configuration file.  See
/etc/ssh/sshd_config.ucf-dist:# sshd_config(5) for more information.
/etc/ssh/sshd_config.ucf-dist:# This sshd was compiled with PATH=/usr/bin:/bin:/usr/sbin:/sbin
/etc/ssh/sshd_config.ucf-dist:# The strategy used for options in the default sshd_config shipped with
/etc/ssh/sshd_config.ucf-dist:#PidFile /var/run/sshd.pid
/etc/systemd/system/scw-generate-ssh-keys.service:Before=sshd.service
/etc/systemd/system/scw-fetch-ssh-keys.service:Before=sshd.service

Ha ! /etc/init.d/ssh ! Maybe it’s that stupid service that’s generating issues ! … Why is Debian mixing System-V with SystemD ? Either go System-V or go SystemD… Don’t do both, it’s irritating…

Ok, let’s check if this script is actually run ! Let’s edit it and add a echo MEOW > /tmp/stoopid after set -e.

root# /etc/init.d/ssh restart
[ ok ] Restarting ssh (via systemctl): ssh.service.
root# cat /tmp/stoopid
MEOW
root# systemctl status ssh
● ssh.service - OpenBSD Secure Shell server
   Loaded: loaded (/lib/systemd/system/ssh.service; enabled; vendor preset: enabled)
   Active: active (running) since Fri 2019-10-25 19:48:28 UTC; 13min ago
     Docs: man:sshd(8)
           man:sshd_config(5)
  Process: 3914 ExecStartPre=/usr/sbin/sshd -t (code=exited, status=0/SUCCESS)
 Main PID: 3915 (sshd)
    Tasks: 1 (limit: 2377)
   Memory: 1016.0K
   CGroup: /system.slice/ssh.service
           └─3915 /usr/sbin/sshd -D

Oct 25 19:48:27 myy-blargh systemd[1]: Starting OpenBSD Secure Shell server...
Oct 25 19:48:28 myy-blargh sshd[3915]: Server listening on 0.0.0.0 port N.
Oct 25 19:48:28 myy-blargh sshd[3915]: Server listening on :: port N.
Oct 25 19:48:28 myy-blargh systemd[1]: Started OpenBSD Secure Shell server.

The actual port was replaced by N in this copy of the logs.

Hmm… If reload it, SystemD is considering the service loaded as well… Both systems are cooperating correctly then ?

Alright, let’s reboot and see if the file is present on reboot !

After reboot :

root# cat /tmp/stoopid
cat: /tmp/stoopid: No such file or directory

… This script is not executed ?? … Maybe it’s the initrd after all ?

Turns out that Scaleway mount the initrd file in /run/initramfs.

root# grep ssh /run/initramfs/*
Binary file /run/initramfs/bin/busybox matches
/run/initramfs/functions:start_sshd() {
/run/initramfs/functions:    run mkdir -p /etc/dropbear /root/.ssh
/run/initramfs/functions:    run chmod 700 /root/.ssh
/run/initramfs/functions:    run sh -ec "scw-metadata --cached | grep 'SSH_PUBLIC_KEYS_.*_KEY' | cut -d'=' -f 2- | tr -d \' > /root/.ssh/authorized_keys"
/run/initramfs/functions:    run sh -ec "scw-metadata --cached | grep 'TAGS_.*=AUTHORIZED_KEY' | cut -d'=' -f 3- | sed 's/_/\ /g' >> /root/.ssh/authorized_keys"
/run/initramfs/init:log_begin_msg "Checking metadata for debug sshd (dropbear)"
/run/initramfs/init:    log_success_msg "Starting a debug sshd"
/run/initramfs/init:    start_sshd
/run/initramfs/init:    ewarn "You can connect to your server with 'scw' or 'ssh'"
/run/initramfs/init:    ewarn " -- ssh root@${PUBLIC_IP_ADDRESS}"
/run/initramfs/init:    ewarn "You can connect to your server with 'ssh'"
/run/initramfs/init:    ewarn " -- ssh root@${PUBLIC_IP_ADDRESS}"
/run/initramfs/init:# Ensure sshd is killed if running
Binary file /run/initramfs/lib/aarch64-linux-gnu/libtinfo.so.5.9 matches
Binary file /run/initramfs/usr/sbin/dropbear matches
Binary file /run/initramfs/usr/bin/dropbearkey matches

Ooooh, here you go ! Let’s edit /run/initramfs/init and check how they # Ensure sshd is killed if running :

# Ensure sshd is killed if running
if [ "$(pgrep dropbear)" != "" ]; then
    run killall dropbear
fi

Hmm :

root# ps auxww | grep drop
root      4016  0.0  0.0   5796   648 pts/0    S+   20:07   0:00 grep drop

Ok… Nothing… I don’t get it…
Let’s check for rootkits, just in case.

root# apt install chkrootkit
root# chkrootkit

No problems detected…

I’m tired… I’m tired of this shit, there’s a fucking SSH server running on my machine, I have NO idea who’s spawning it !

Oh wait ! I forgot about lsof !

lsof -t

No, wrong one… Couldn’t they use netstat syntax ?

lsof -i
systemd      1            root   52u  IPv6  17371      0t0  TCP TCP *:22 (LISTEN)

O_O … …

O_O !

YOU’RE FUCKING KIDDING ME !? SystemD ITSELF IS LISTENING ON PORT 22 !? But the SSH service is dead ! HOW ?

A little search on the internet, got me this gem : https://bbs.archlinux.org/viewtopic.php?id=166582

“So, ive found out that sshd.socket was enabled and this was the cause”

root# find /etc -name "ssh*.socket"
/etc/systemd/system/sockets.target.wants/ssh.socket
root# cat /etc/systemd/system/sockets.target.wants/ssh.socket
[Unit]
Description=OpenBSD Secure Shell server socket
Before=ssh.service
Conflicts=ssh.service
ConditionPathExists=!/etc/ssh/sshd_not_to_be_run

[Socket]
ListenStream=22
Accept=yes

[Install]
WantedBy=sockets.target

FUCK YOU ! FUCK THIS SHIT !
I AM DONE WITH SYSTEMD !

The real issue with SystemD

It’s OVERLY COMPLEX ! If you don’t remember all the commands and, most importantly, ALL THE WAYS SYSTEMD CAN START A SERVICE, YOU WILL NOT be able to understand what’s going on.

systemctl status ssh was indicating the status of the ssh.service file, not the status or presence of ssh.socket.

Understand that I threw a journactl and searched for SSH and saw sshd being started and receiving connections as sshd[PID], but not why, nor how it was started ! And the logs were about sshd[PID] not systemd[PID] !

So until you understand that SystemD can start services using socket connections (Overkill feature for simple servers), you will NEVER KNOW :

  • Why is the SSH service not started
  • Why is there a SSH server listening on another port than the one provided in the configuration

Understand that I had a running server until start of September, the server got rebooted for whatever reasons and THEN, POOF, the SSH server started to listen on port 22.

So, the real issue with SystemD is that it’s TOO COMPLEX FOR DISTRIBUTORS !.

It can do a ton of things and maybe, once you understand it perfectly, you’ll be happy to use all the bells and whistles…
If you need them, of course…

However :

  • systemctl status ssh just shown the ssh service as DEAD !
  • journalctl had some “SSH connections” entries, but didn’t show why it started an SSH service !
  • ps auxww | grep sshd made it look like SSH servers are run “on the fly”.
  • MY ACTUAL SSHD CONFIGURATION WAS COMPLETELY IGNORED.

It’s those things combined which make me hate SystemD for server management.
It’s UNRELIABLE !
If something breaks, I want AS MUCH INFORMATION AS POSSIBLE !

If SystemD start a sshd server, systemctl status ssh or systemctl status sshd should show information about it !
I don’t give two shits about the file extension of the unit file triggering the execution of “sshd” !

That said, the issue here is not only due to SystemD overcomplexity. It’s also the fact that some distributors thought :

“Hey ! Let’s put a ssh.socket in /etc/systemd/system/sockets.target.wants/ , so that the user will never know why his SSH server start ignoring /etc/ssh/sshd_config !
This will be so much fun !”

The fact is : even if I remove this .socket file, it might just come back after an update ! And fuck up my system again !

Do you understand the issue here ? I’M LOSING CONTROL OF THE SYSTEM ! Because distributors started to ship with an OVERCOMPLEX INIT, and started using features of this init system without understanding the consequences !

If I put CONFIGURATION DIRECTIVES in a CONFIGURATION FILE, I don’t want them to be OVERRIDEN BY SOME RANDOM .whatever FILE used by my init system. If I want a different configuration, I’ll either edit the configuration file, or force the daemon to use another configuration file by editing the appropriate init file.

I’d appreciate if distributions using SystemD went with the “least amount of SystemD unit files” on server configurations.

Look at “Clear Linux”, there’s a way to use SystemD while making things “Lean & Clean”.
Just check the /etc/ folder after installing Clear Linux :
It’s clean !
They don’t add tons of .service, .socket, .mount or .whatever extension systemd reacts on !
No, they put a clean and lean /etc directory, with only the strictly necessary files.
And it works !

That said, an Init system so complex do not interest me. I’ll brush up my systemd-foo on Arch Linux, because I really need it. But, still, I don’t give it a shit about SystemD on my servers ATM.
If you run tons of microservices, maybe you do.
But me ? Nope. All my main public services run in docker containers and the only daemons and configurations I care about on my system are :

  • The SSH server
  • The firewall
  • The Docker containers

I could manage this with a busybox ‘init’ file…

So I’ll start looking for anoter distro that I can deploy on my server, and which doesn’t use SystemD.
And I can’t find one, I’ll look for one using SystemD with the least amount of units files.

I’m done with Debian and SystemD.
And I’d like to be done playing detective to understand :

  • Why my server is not responding anymore ?
  • Why services are executing while ignoring their configurations ?
  • How I can I avoid traps added by systemd updates ?

apulse libpulse-simple.so: undefined symbol: pa_threaded_mainloop_new

Long story short

At the moment, compile my patched version of apulse like this :

git clone https://github.com/Miouyouyou/apulse
mkdir build
cd build
cmake ../apulse
cmake --build .
sudo make install
cd ..

After that, if that’s the first time you install them by hand, you can add the libraries to the dynamic linker search path, so that these libraries are found automatically, without having to use the apulse wrapper everytime :

echo "/usr/local/lib/apulse" >> /etc/ld.so.conf.d/apulse.conf

Note that this only compiles the 64-bits version of apulse.

I currently have no idea how to compile the 32-bits version with cmake.
UPDATE Maybe I should read the README.md before hand…
It’s clearly written how to compile the 32-bits version…
Ugh…

Here’s how to compile the 32-bits version (on Gentoo at least), assuming you’ve already cloned apulse’s git repository :

mkdir build32
cd build32
PKG_CONFIG_LIBDIR=/usr/lib32/pkgconfig CFLAGS=-m32 cmake -DAPULSEPATH=/usr/local/lib32/apulse ../apulse/
cmake --build .
sudo make install
cd ..

Redefining the PKG_CONFIG_LIBDIR is required to get rid of compilation errors like size of array '_GStaticAssertCompileTimeAssertion_0' is negative, when compiling for 32-bits versions with CFLAGS=-m32.

Again, if it’s the first time you install them by hand and don’t want to prefix apulse for invoking softwares using PulseAudio, here’s how you can add the apulse libraries to the standard dynamic linker search path :

echo "/usr/local/lib32/apulse" >> /etc/ld.so.conf.d/apulse.conf

This patched apulse version might solve issues like crashes after seeing this :

g_hash_table_lookup_node: assertion failed: (hash_table->ref_count > 0)

I don’t trust you that much

Fine !

Here’s the patch applied on the official repository. Here’s the same patch on Github gists.

And here’s how to apply it :

wget https://gist.githubusercontent.com/Miouyouyou/a6f460e03e046478f92e6a82d9e4dc79/raw/90bec9ad5424a4b6abf86a7b9923420eecb9a2f4/0001-stream-Check-the-key-before-invoking-g_hash_table_re.patch
git clone https://github.com/i-rinat/pulse
cd pulse
git am ../0001-stream-Check-the-key-before-invoking-g_hash_table_re.patch
cd ..

Then you can recompile and install using the standard CMake build procedure :

64 bits version

mkdir build
cd build
cmake ../apulse
cmake --build .
sudo make install
cd ..

32 bits version

mkdir build32
cd build32
PKG_CONFIG_LIBDIR=/usr/lib32/pkgconfig CFLAGS=-m32 cmake -DAPULSEPATH=/usr/local/lib32/apulse ../apulse/
cmake --build .
sudo make install
cd ..

Additional quicktips

If you have to use gdb on a Unity3D game, here’s a quicktip. Before calling run, disable some signal handlers with the following command :

handle SIGXCPU SIG33 SIG35 SIGPWR nostop noprint

This will make debugging way easier. Works also when debugging Mono/C# executables.

Short story long

I hit that bug while testing a Steam game named “Wizards of Legends” on my Gentoo box, however this bug came back with a lot of Unity3D games distributed on Itch.io .

Basically the game crashed with some “Unable to preload the following plugins” messages.
As always, these messages were RED HERRINGS ! I LOVE error messages that distract me from the real issue ! So much fun !

The game also wrote some logs located in ~/.config/unity3d/$COMPANYNAME/$GAMETITLE/Player.log.
The Player.log contained the following message :

undefined symbol: pa_threaded_mainloop_new

So I did a : nm -D /usr/lib/apulse/libpulse-simple.so.0, saw that U pa_threaded_mainloop_new was listed as an Unresolved symbol…
Meaning that it should load a library that will lead to resolving this symbol… however I can only guess it didn’t, given the error message.

So I then tried to look for this symbol in the other libraries provided by apulse.
nm -D /usr/lib/apulse/libpulse.so.0 returned 000000000000ec10 T pa_threaded_mainloop_new, which indicates that pa_threaded_mainloop_new is provided by libpulse.so.0.

So : I now know where the symbol is !

The next question is then :

Is libpulse.so.0 loaded correctly when loading libpulse-simple.so.0

Let’s have a look at the dynamic libraries chain-loaded by libpulse-simple.so.0 with readelf -d :

readelf -d /usr/lib/apulse/libpulse-simple.so.0 

Dynamic section at offset 0x3dc0 contains 29 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libglib-2.0.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [libpthread.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x000000000000000e (SONAME)             Library soname: [libpulse-simple.so.0]
 0x000000000000000c (INIT)               0x1220
 0x000000000000000d (FINI)               0x26ac
 0x0000000000000019 (INIT_ARRAY)         0x203db0
 0x000000000000001b (INIT_ARRAYSZ)       8 (bytes)
 0x000000000000001a (FINI_ARRAY)         0x203db8
 0x000000000000001c (FINI_ARRAYSZ)       8 (bytes)
 0x000000006ffffef5 (GNU_HASH)           0x190
 0x0000000000000005 (STRTAB)             0x788
 0x0000000000000006 (SYMTAB)             0x200
 0x000000000000000a (STRSZ)              1165 (bytes)
 0x000000000000000b (SYMENT)             24 (bytes)
 0x0000000000000003 (PLTGOT)             0x204000
 0x0000000000000002 (PLTRELSZ)           1056 (bytes)
 0x0000000000000014 (PLTREL)             RELA
 0x0000000000000017 (JMPREL)             0xe00
 0x0000000000000007 (RELA)               0xd28
 0x0000000000000008 (RELASZ)             216 (bytes)
 0x0000000000000009 (RELAENT)            24 (bytes)
 0x000000006ffffffc (VERDEF)             0xc90
 0x000000006ffffffd (VERDEFNUM)          2
 0x000000006ffffffe (VERNEED)            0xcc8
 0x000000006fffffff (VERNEEDNUM)         2
 0x000000006ffffff0 (VERSYM)             0xc16
 0x000000006ffffff9 (RELACOUNT)          3
 0x0000000000000000 (NULL)               0x0

So… it chainloads libglib-2.0.so.0, libpthread.so.0 and libc.so.6 but… no libpulse.so.0.

… I’d love to add libpulse.so.0 to the list by hacking the binary, but that’s not going to happen…

In this case, the next sensible choice is to compile the latest version of the library.
Which reminds me how OLD and OUTDATED most of the main Gentoo ebuilds are…

If we can recompile the library, it should be possible to, at least, hack around to generate a fixed version of libpulse-simple.so.0.

Anyway, I went out to compile the official version like this :

git clone https://github.com/i-rinat/apulse
mkdir build
cmake ../apulse
cmake --build .
sudo make install
echo "/usr/local/lib/apulse" >> /etc/ld.so.conf.d/apulse.conf

This installed a new version of libpulse-simple.so.0 with the correct exports this time !

readelf -d /usr/local/lib/apulse/libpulse-simple.so.0 

Dynamic section at offset 0x3d80 contains 32 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libpulse.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [libglib-2.0.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [libasound.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [libm.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libpthread.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x000000000000000e (SONAME)             Library soname: [libpulse-simple.so.0]
 0x000000000000000c (INIT)               0x1220
 0x000000000000000d (FINI)               0x29e4
 0x0000000000000019 (INIT_ARRAY)         0x203d70
 0x000000000000001b (INIT_ARRAYSZ)       8 (bytes)
 0x000000000000001a (FINI_ARRAY)         0x203d78
 0x000000000000001c (FINI_ARRAYSZ)       8 (bytes)
 0x000000006ffffef5 (GNU_HASH)           0x190
 0x0000000000000005 (STRTAB)             0x760
 0x0000000000000006 (SYMTAB)             0x1d8
 0x000000000000000a (STRSZ)              1191 (bytes)
 0x000000000000000b (SYMENT)             24 (bytes)
 0x0000000000000003 (PLTGOT)             0x204000
 0x0000000000000002 (PLTRELSZ)           1056 (bytes)
 0x0000000000000014 (PLTREL)             RELA
 0x0000000000000017 (JMPREL)             0xe00
 0x0000000000000007 (RELA)               0xd28
 0x0000000000000008 (RELASZ)             216 (bytes)
 0x0000000000000009 (RELAENT)            24 (bytes)
 0x000000006ffffffc (VERDEF)             0xc80
 0x000000006ffffffd (VERDEFNUM)          2
 0x000000006ffffffe (VERNEED)            0xcb8
 0x000000006fffffff (VERNEEDNUM)         3
 0x000000006ffffff0 (VERSYM)             0xc08
 0x000000006ffffff9 (RELACOUNT)          3
 0x0000000000000000 (NULL)               0x0

See how libpulse.so.0 is now correctly chainloaded ?

So I restarted the game AND… another crash…
Which involved using gdb to understand what went wrong this time.

If you have to use gdb on a Unity3D game, here’s a quicktip :
Before calling run, use this command to disable some signal handlers :

handle SIGXCPU SIG33 SIG35 SIGPWR nostop noprint

This will make debugging the binary easier. This works for any Mono/C# executable.

So I went like this :

cd "~/.local/share/Steam/steamapps/common/Wizard of Legend"
gdb ./WizardOfLegend.x86_64
(gdb) handle SIGXCPU SIG33 SIG35 SIGPWR nostop noprint
Signal        Stop      Print   Pass to program Description
SIGXCPU       No        No      Yes             CPU time limit exceeded
SIGPWR        No        No      Yes             Power fail/restart
SIG33         No        No      Yes             Real-time event 33
SIG35         No        No      Yes             Real-time event 35
(gdb) run
...
Thread 75 "WizardOfLegend." received signal SIGABRT, Aborted.
[Switching to Thread 0x7ffee1ff3700 (LWP 11529)]
0x00007ffff6aab74b in raise () from /lib64/libc.so.6
(gdb) where
#0  0x00007ffff6aab74b in raise () from /lib64/libc.so.6
#1  0x00007ffff6a948cd in abort () from /lib64/libc.so.6
#2  0x00007fff8ae44a23 in ?? () from /usr/lib64/libglib-2.0.so.0
#3  0x00007fff8aea0f4a in g_assertion_message_expr () from /usr/lib64/libglib-2.0.so.0
#4  0x00007fff8ae65396 in ?? () from /usr/lib64/libglib-2.0.so.0
#5  0x00007fff740ea520 in pa_stream_unref () from /usr/local/lib/apulse/libpulse.so.0
#6  0x00007fff740e8283 in deh_stream_first_readwrite_callback () from /usr/local/lib/apulse/libpulse.so.0
#7  0x00007fff740e6ce8 in pa_mainloop_dispatch () from /usr/local/lib/apulse/libpulse.so.0
#8  0x00007fff740e6e27 in pa_mainloop_iterate () from /usr/local/lib/apulse/libpulse.so.0
#9  0x00007fff740e747e in pa_mainloop_run () from /usr/local/lib/apulse/libpulse.so.0
#10 0x00007fff740eabf4 in mainloop_thread () from /usr/local/lib/apulse/libpulse.so.0
#11 0x00007ffff79b71d8 in start_thread () from /lib64/libpthread.so.0
#12 0x00007ffff6b7d2cf in clone () from /lib64/libc.so.6
(gdb) quit

So, basically, something went wrong in pa_stream_unref AND it went wrong when calling a glib function !
In such situations, you can guess that the latest function was called with bogus arguments.

Alright, since I had to clone apulse git repository, I might as well see and edit the code of pa_stream_unref and try to debug this.

Let’s look at the code of pa_stream_unref :

APULSE_EXPORT
void
pa_stream_unref(pa_stream *s)
{
    trace_info_f("F %s s=%p\n", __func__, s);

    s->ref_cnt--;
    if (s->ref_cnt == 0) {
        g_hash_table_remove(s->c->streams_ht, GINT_TO_POINTER(s->idx));
        ringbuffer_free(s->rb);
        free(s->peek_buffer);
        free(s->write_buffer);
        free(s->name);
        free(s);
    }
}

The only glib function called is g_hash_table_remove so… I can only guess that it’s the one causing the crash.

So, let’s look on the internet for g_hash_table_remove and see how it is supposed to be called :

gboolean
g_hash_table_remove (GHashTable *hash_table,
                     gconstpointer key);

Alright…

Then let’s add some logs… I need to check the environment.

When you have to add some logs in someone else code, look around the code to see how the developper logs things generally. See if there’s a log_error, warn or, in this case, trace_error function.

Try the error logging functions first. If you go for debug logging functions, you might not see anything unless you enable some specific flags during compilation… Errors, though, tend to be displayed in every configuration.

Now, in order to understand what’s passed to the function, I added this before calling g_hash_table_remove :

		trace_error("s->c->streams_ht : %p - %d",
			s->c->streams_ht,
			s->idx);

Which led to these errors appearing the Player.log of the Unity3D game :

[apulse] [error] s->c->streams_ht : 0x3eed240 - 0
[apulse] [error] s->c->streams_ht : 0x3eed240 - 0

Ok…

Now, I want to know if the hash_table actually have something stored. Let’s check the documentation, see if there’s a way to get the size… or length… size !
g_hash_table_size : Returns the number of elements contained in the GHashTable.

guint
g_hash_table_size (GHashTable *hash_table);

Ok, let’s log the number of elements too

		trace_error("s->c->streams_ht : %p - (%u elements) %d",
			s->c->streams_ht,
			g_hash_table_size(s->c->streams_ht),
			s->idx);

Recompile, reinstall the apulse library, relaunch the game…

This led to new errors in the game’s Player.log :

[apulse] [error] s->c->streams_ht : 0x3d36640 (0 elements) - 0
[apulse] [error] s->c->streams_ht : 0x3d36640 (0 elements) - 0

Yeah, ok, the hash is empty so calling remove functions will only lead to issues.

Now, I could have searched for “why it’s empty and why this function is called with an empty hash table”. But, I went for the quick fix instead.

My first idead of the quick fix was :
Let’s check if the element to be deleted is actually stored in the hash table before trying to remove it.

So I tried using g_hash_table_lookup for that matter.

I modified the code like this :

		GHashTable * __restrict const streams_ht =
			s->c->streams_ht;
		void const * key = GINT_TO_POINTER(s->idx);
		if (g_hash_table_lookup(streams_ht, key))
			g_hash_table_remove(streams_ht, key);
			

This… led to another crash that required me to reuse gdb to catch the bug.
gdb returned this :

#0  0x00007ffff6aab74b in raise () from /lib64/libc.so.6
#1  0x00007ffff6a948cd in abort () from /lib64/libc.so.6
#2  0x00007fff8ae44a23 in ?? () from /usr/lib64/libglib-2.0.so.0
#3  0x00007fff8aea0f4a in g_assertion_message_expr () from /usr/lib64/libglib-2.0.so.0
#4  0x00007fff8ae65976 in g_hash_table_lookup () from /usr/lib64/libglib-2.0.so.0
#5  0x00007fff740ea531 in pa_stream_unref () from /usr/local/lib/apulse/libpulse.so.0
#6  0x00007fff740e8283 in deh_stream_first_readwrite_callback () from /usr/local/lib/apulse/libpulse.so.0
#7  0x00007fff740e6ce8 in pa_mainloop_dispatch () from /usr/local/lib/apulse/libpulse.so.0
#8  0x00007fff740e6e27 in pa_mainloop_iterate () from /usr/local/lib/apulse/libpulse.so.0
#9  0x00007fff740e747e in pa_mainloop_run () from /usr/local/lib/apulse/libpulse.so.0
#10 0x00007fff740eac1d in mainloop_thread () from /usr/local/lib/apulse/libpulse.so.0
#11 0x00007ffff79b71d8 in start_thread () from /lib64/libpthread.so.0
#12 0x00007ffff6b7d2cf in clone () from /lib64/libc.so.6

Oh, yeah, okay… g_hash_table_lookup also generated a crash…

Wait, in the previous logs, s->idx was 0 and this index is turned into a key using GINT_TO_POINTER
0 converted into a pointer will most likely generate a NULL pointer, by definition, so here’s the new catch :

Let’s remove the element if the key is not 0 AND if the element is actually stored in the hash table.

        GHashTable * __restrict const streams_ht =
            s->c->streams_ht;
        void const * key = GINT_TO_POINTER(s->idx);
        if (key && g_hash_table_lookup(streams_ht, key))
            g_hash_table_remove(streams_ht, key);

This time IT WORKED ! The game launched and I was able to hear the music and sound effects !
YAAAY.

After that, I forked the apulse project, integrated this quick patch and then sent a pull request.