apulse libpulse-simple.so: undefined symbol: pa_threaded_mainloop_new

Long story short

At the moment, you can compile my patched version of apulse like this :

git clone https://github.com/Miouyouyou/apulse
mkdir build
cd build
cmake ../apulse
cmake --build .
sudo make install
cd ..

After that, if it’s the first time you install these libraries by hand, you can add them to the dynamic linker search path, so that they are found automatically, without having to use the apulse wrapper every time :

echo "/usr/local/lib/apulse" >> /etc/ld.so.conf.d/apulse.conf

Note that this only compiles the 64-bit version of apulse.

I currently have no idea how to compile the 32-bit version with cmake.
UPDATE Maybe I should read the README.md beforehand…
It clearly explains how to compile the 32-bit version…
Ugh…

Here’s how to compile the 32-bit version (on Gentoo, at least), assuming you’ve already cloned apulse’s git repository :

mkdir build32
cd build32
PKG_CONFIG_LIBDIR=/usr/lib32/pkgconfig CFLAGS=-m32 cmake -DAPULSEPATH=/usr/local/lib32/apulse ../apulse/
cmake --build .
sudo make install
cd ..

Redefining PKG_CONFIG_LIBDIR is required to get rid of compilation errors like “size of array '_GStaticAssertCompileTimeAssertion_0' is negative” when compiling the 32-bit version with CFLAGS=-m32.
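Before launching cmake, you can check which glib flags pkg-config will hand to the 32-bit build — a quick sketch, assuming a 32-bit glib is installed under /usr/lib32 :

PKG_CONFIG_LIBDIR=/usr/lib32/pkgconfig pkg-config --cflags --libs glib-2.0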

Again, if it’s the first time you install these libraries by hand and don’t want to prefix every PulseAudio-using program with apulse, here’s how you can add the apulse libraries to the standard dynamic linker search path :

echo "/usr/local/lib32/apulse" >> /etc/ld.so.conf.d/apulse.conf

This patched apulse version might solve issues like crashes after seeing this :

g_hash_table_lookup_node: assertion failed: (hash_table->ref_count > 0)

I don’t trust you that much

Fine !

Here’s the patch applied on the official repository. Here’s the same patch on Github gists.

And here’s how to apply it :

wget https://gist.githubusercontent.com/Miouyouyou/a6f460e03e046478f92e6a82d9e4dc79/raw/90bec9ad5424a4b6abf86a7b9923420eecb9a2f4/0001-stream-Check-the-key-before-invoking-g_hash_table_re.patch
git clone https://github.com/i-rinat/apulse
cd apulse
git am ../0001-stream-Check-the-key-before-invoking-g_hash_table_re.patch
cd ..

Then you can recompile and install using the standard CMake build procedure :

64-bit version

mkdir build
cd build
cmake ../apulse
cmake --build .
sudo make install
cd ..

32-bit version

mkdir build32
cd build32
PKG_CONFIG_LIBDIR=/usr/lib32/pkgconfig CFLAGS=-m32 cmake -DAPULSEPATH=/usr/local/lib32/apulse ../apulse/
cmake --build .
sudo make install
cd ..

Additional quicktips

If you have to use gdb on a Unity3D game, here’s a quicktip. Before calling run, disable some signal handlers with the following command :

handle SIGXCPU SIG33 SIG35 SIGPWR nostop noprint

This will make debugging way easier. Works also when debugging Mono/C# executables.

Short story long

I hit that bug while testing a Steam game named “Wizard of Legend” on my Gentoo box; the same bug then came back with a lot of Unity3D games distributed on Itch.io.

Basically the game crashed with some “Unable to preload the following plugins” messages.
As always, these messages were RED HERRINGS ! I LOVE error messages that distract me from the real issue ! So much fun !

The game also wrote some logs located in ~/.config/unity3d/$COMPANYNAME/$GAMETITLE/Player.log.
The Player.log contained the following message :

undefined symbol: pa_threaded_mainloop_new

So I did a nm -D /usr/lib/apulse/libpulse-simple.so.0 and saw that pa_threaded_mainloop_new was listed as an undefined symbol (the U in nm’s output)…
Meaning that, at load time, some other library should be loaded to resolve this symbol… however, given the error message, I can only guess that none was.

So I then tried to look for this symbol in the other libraries provided by apulse.
nm -D /usr/lib/apulse/libpulse.so.0 returned 000000000000ec10 T pa_threaded_mainloop_new, which indicates that pa_threaded_mainloop_new is provided by libpulse.so.0.
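If you ever have to do the same kind of symbol hunt, a small shell loop saves some typing — a sketch, assuming the apulse libraries live in /usr/lib/apulse :

for lib in /usr/lib/apulse/*.so*; do
    echo "== $lib"
    nm -D "$lib" | grep pa_threaded_mainloop_new
done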

So : I now know where the symbol is !

The next question is then :

Is libpulse.so.0 loaded correctly when loading libpulse-simple.so.0 ?

Let’s have a look at the dynamic libraries chain-loaded by libpulse-simple.so.0 with readelf -d :

readelf -d /usr/lib/apulse/libpulse-simple.so.0 

Dynamic section at offset 0x3dc0 contains 29 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libglib-2.0.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [libpthread.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x000000000000000e (SONAME)             Library soname: [libpulse-simple.so.0]
 0x000000000000000c (INIT)               0x1220
 0x000000000000000d (FINI)               0x26ac
 0x0000000000000019 (INIT_ARRAY)         0x203db0
 0x000000000000001b (INIT_ARRAYSZ)       8 (bytes)
 0x000000000000001a (FINI_ARRAY)         0x203db8
 0x000000000000001c (FINI_ARRAYSZ)       8 (bytes)
 0x000000006ffffef5 (GNU_HASH)           0x190
 0x0000000000000005 (STRTAB)             0x788
 0x0000000000000006 (SYMTAB)             0x200
 0x000000000000000a (STRSZ)              1165 (bytes)
 0x000000000000000b (SYMENT)             24 (bytes)
 0x0000000000000003 (PLTGOT)             0x204000
 0x0000000000000002 (PLTRELSZ)           1056 (bytes)
 0x0000000000000014 (PLTREL)             RELA
 0x0000000000000017 (JMPREL)             0xe00
 0x0000000000000007 (RELA)               0xd28
 0x0000000000000008 (RELASZ)             216 (bytes)
 0x0000000000000009 (RELAENT)            24 (bytes)
 0x000000006ffffffc (VERDEF)             0xc90
 0x000000006ffffffd (VERDEFNUM)          2
 0x000000006ffffffe (VERNEED)            0xcc8
 0x000000006fffffff (VERNEEDNUM)         2
 0x000000006ffffff0 (VERSYM)             0xc16
 0x000000006ffffff9 (RELACOUNT)          3
 0x0000000000000000 (NULL)               0x0

So… it chainloads libglib-2.0.so.0, libpthread.so.0 and libc.so.6 but… no libpulse.so.0.

… I’d love to add libpulse.so.0 to the list by hacking the binary, but that’s not going to happen…
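(For the record, patchelf can do exactly this kind of NEEDED-entry surgery, if you ever feel adventurous — a sketch, assuming patchelf is installed :

sudo patchelf --add-needed libpulse.so.0 /usr/lib/apulse/libpulse-simple.so.0

But patching a known-broken library felt worse than just rebuilding a fixed one.)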

In this case, the next sensible choice is to compile the latest version of the library.
Which reminds me how OLD and OUTDATED most of the main Gentoo ebuilds are…

If we can recompile the library, it should be possible to, at least, hack around to generate a fixed version of libpulse-simple.so.0.

Anyway, I went on to compile the official version like this :

git clone https://github.com/i-rinat/apulse
mkdir build
cd build
cmake ../apulse
cmake --build .
sudo make install
echo "/usr/local/lib/apulse" | sudo tee -a /etc/ld.so.conf.d/apulse.conf
sudo ldconfig

This installed a new version of libpulse-simple.so.0, with the correct dependencies this time !

readelf -d /usr/local/lib/apulse/libpulse-simple.so.0 

Dynamic section at offset 0x3d80 contains 32 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libpulse.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [libglib-2.0.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [libasound.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [libm.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libpthread.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x000000000000000e (SONAME)             Library soname: [libpulse-simple.so.0]
 0x000000000000000c (INIT)               0x1220
 0x000000000000000d (FINI)               0x29e4
 0x0000000000000019 (INIT_ARRAY)         0x203d70
 0x000000000000001b (INIT_ARRAYSZ)       8 (bytes)
 0x000000000000001a (FINI_ARRAY)         0x203d78
 0x000000000000001c (FINI_ARRAYSZ)       8 (bytes)
 0x000000006ffffef5 (GNU_HASH)           0x190
 0x0000000000000005 (STRTAB)             0x760
 0x0000000000000006 (SYMTAB)             0x1d8
 0x000000000000000a (STRSZ)              1191 (bytes)
 0x000000000000000b (SYMENT)             24 (bytes)
 0x0000000000000003 (PLTGOT)             0x204000
 0x0000000000000002 (PLTRELSZ)           1056 (bytes)
 0x0000000000000014 (PLTREL)             RELA
 0x0000000000000017 (JMPREL)             0xe00
 0x0000000000000007 (RELA)               0xd28
 0x0000000000000008 (RELASZ)             216 (bytes)
 0x0000000000000009 (RELAENT)            24 (bytes)
 0x000000006ffffffc (VERDEF)             0xc80
 0x000000006ffffffd (VERDEFNUM)          2
 0x000000006ffffffe (VERNEED)            0xcb8
 0x000000006fffffff (VERNEEDNUM)         3
 0x000000006ffffff0 (VERSYM)             0xc08
 0x000000006ffffff9 (RELACOUNT)          3
 0x0000000000000000 (NULL)               0x0

See how libpulse.so.0 is now correctly chainloaded ?

So I restarted the game AND… another crash…
Which involved using gdb to understand what went wrong this time.

If you have to use gdb on a Unity3D game, here’s a quicktip :
Before calling run, use this command to disable some signal handlers :

handle SIGXCPU SIG33 SIG35 SIGPWR nostop noprint

This will make debugging the binary easier. This works for any Mono/C# executable.

So I went like this :

cd "~/.local/share/Steam/steamapps/common/Wizard of Legend"
gdb ./WizardOfLegend.x86_64
(gdb) handle SIGXCPU SIG33 SIG35 SIGPWR nostop noprint
Signal        Stop      Print   Pass to program Description
SIGXCPU       No        No      Yes             CPU time limit exceeded
SIGPWR        No        No      Yes             Power fail/restart
SIG33         No        No      Yes             Real-time event 33
SIG35         No        No      Yes             Real-time event 35
(gdb) run
...
Thread 75 "WizardOfLegend." received signal SIGABRT, Aborted.
[Switching to Thread 0x7ffee1ff3700 (LWP 11529)]
0x00007ffff6aab74b in raise () from /lib64/libc.so.6
(gdb) where
#0  0x00007ffff6aab74b in raise () from /lib64/libc.so.6
#1  0x00007ffff6a948cd in abort () from /lib64/libc.so.6
#2  0x00007fff8ae44a23 in ?? () from /usr/lib64/libglib-2.0.so.0
#3  0x00007fff8aea0f4a in g_assertion_message_expr () from /usr/lib64/libglib-2.0.so.0
#4  0x00007fff8ae65396 in ?? () from /usr/lib64/libglib-2.0.so.0
#5  0x00007fff740ea520 in pa_stream_unref () from /usr/local/lib/apulse/libpulse.so.0
#6  0x00007fff740e8283 in deh_stream_first_readwrite_callback () from /usr/local/lib/apulse/libpulse.so.0
#7  0x00007fff740e6ce8 in pa_mainloop_dispatch () from /usr/local/lib/apulse/libpulse.so.0
#8  0x00007fff740e6e27 in pa_mainloop_iterate () from /usr/local/lib/apulse/libpulse.so.0
#9  0x00007fff740e747e in pa_mainloop_run () from /usr/local/lib/apulse/libpulse.so.0
#10 0x00007fff740eabf4 in mainloop_thread () from /usr/local/lib/apulse/libpulse.so.0
#11 0x00007ffff79b71d8 in start_thread () from /lib64/libpthread.so.0
#12 0x00007ffff6b7d2cf in clone () from /lib64/libc.so.6
(gdb) quit

So, basically, something went wrong in pa_stream_unref AND it went wrong when calling a glib function !
In such situations, you can guess that the glib function was called with bogus arguments.

Alright, since I had to clone apulse’s git repository anyway, I might as well see and edit the code of pa_stream_unref and try to debug this.

Let’s look at the code of pa_stream_unref :

APULSE_EXPORT
void
pa_stream_unref(pa_stream *s)
{
    trace_info_f("F %s s=%p\n", __func__, s);

    s->ref_cnt--;
    if (s->ref_cnt == 0) {
        g_hash_table_remove(s->c->streams_ht, GINT_TO_POINTER(s->idx));
        ringbuffer_free(s->rb);
        free(s->peek_buffer);
        free(s->write_buffer);
        free(s->name);
        free(s);
    }
}

The only glib function called is g_hash_table_remove so… I can only guess that it’s the one causing the crash.

So, let’s look on the internet for g_hash_table_remove and see how it is supposed to be called :

gboolean
g_hash_table_remove (GHashTable *hash_table,
                     gconstpointer key);

Alright…

Then let’s add some logs… I need to check the environment.

When you have to add some logs in someone else’s code, look around to see how the developer generally logs things. See if there’s a log_error, warn or, in this case, trace_error function.

Try the error logging functions first. If you go for debug logging functions, you might not see anything unless you enable some specific flags during compilation… Errors, though, tend to be displayed in every configuration.

Now, in order to understand what’s passed to the function, I added this before calling g_hash_table_remove :

		trace_error("s->c->streams_ht : %p - %d",
			s->c->streams_ht,
			s->idx);

Which led to these errors appearing in the Player.log of the Unity3D game :

[apulse] [error] s->c->streams_ht : 0x3eed240 - 0
[apulse] [error] s->c->streams_ht : 0x3eed240 - 0

Ok…

Now, I want to know if the hash table actually has something stored. Let’s check the documentation and see if there’s a way to get the size… or length… size !
g_hash_table_size : Returns the number of elements contained in the GHashTable.

guint
g_hash_table_size (GHashTable *hash_table);

Ok, let’s log the number of elements too

		trace_error("s->c->streams_ht : %p - (%u elements) %d",
			s->c->streams_ht,
			g_hash_table_size(s->c->streams_ht),
			s->idx);

Recompile, reinstall the apulse library, relaunch the game…

This led to new errors in the game’s Player.log :

[apulse] [error] s->c->streams_ht : 0x3d36640 (0 elements) - 0
[apulse] [error] s->c->streams_ht : 0x3d36640 (0 elements) - 0

Yeah, ok, the hash table is empty, so calling remove functions on it will only lead to issues.

Now, I could have searched for “why is it empty, and why is this function called with an empty hash table ?”. But I went for the quick fix instead.

My first idea for the quick fix was :
Let’s check if the element to be deleted is actually stored in the hash table before trying to remove it.

So I tried using g_hash_table_lookup for that purpose.

I modified the code like this :

		GHashTable * __restrict const streams_ht =
			s->c->streams_ht;
		void const * key = GINT_TO_POINTER(s->idx);
		if (g_hash_table_lookup(streams_ht, key))
			g_hash_table_remove(streams_ht, key);
			

This… led to another crash that required me to reuse gdb to catch the bug.
gdb returned this :

#0  0x00007ffff6aab74b in raise () from /lib64/libc.so.6
#1  0x00007ffff6a948cd in abort () from /lib64/libc.so.6
#2  0x00007fff8ae44a23 in ?? () from /usr/lib64/libglib-2.0.so.0
#3  0x00007fff8aea0f4a in g_assertion_message_expr () from /usr/lib64/libglib-2.0.so.0
#4  0x00007fff8ae65976 in g_hash_table_lookup () from /usr/lib64/libglib-2.0.so.0
#5  0x00007fff740ea531 in pa_stream_unref () from /usr/local/lib/apulse/libpulse.so.0
#6  0x00007fff740e8283 in deh_stream_first_readwrite_callback () from /usr/local/lib/apulse/libpulse.so.0
#7  0x00007fff740e6ce8 in pa_mainloop_dispatch () from /usr/local/lib/apulse/libpulse.so.0
#8  0x00007fff740e6e27 in pa_mainloop_iterate () from /usr/local/lib/apulse/libpulse.so.0
#9  0x00007fff740e747e in pa_mainloop_run () from /usr/local/lib/apulse/libpulse.so.0
#10 0x00007fff740eac1d in mainloop_thread () from /usr/local/lib/apulse/libpulse.so.0
#11 0x00007ffff79b71d8 in start_thread () from /lib64/libpthread.so.0
#12 0x00007ffff6b7d2cf in clone () from /lib64/libc.so.6

Oh, yeah, okay… g_hash_table_lookup also generated a crash…

Wait, in the previous logs, s->idx was 0, and this index is turned into a key using GINT_TO_POINTER.
GINT_TO_POINTER(0) yields a NULL pointer, by definition, so here’s the new catch :

Let’s remove the element if the key is not 0 AND if the element is actually stored in the hash table.

        GHashTable * __restrict const streams_ht =
            s->c->streams_ht;
        void const * key = GINT_TO_POINTER(s->idx);
        if (key && g_hash_table_lookup(streams_ht, key))
            g_hash_table_remove(streams_ht, key);

This time IT WORKED ! The game launched and I was able to hear the music and sound effects !
YAAAY.

After that, I forked the apulse project, integrated this quick patch and then sent a pull request.

Gitlab runner and Docker Desktop nightmares on Windows

Various fixes for issues I encountered

Here’s an old post that I wanted to write the week after I hit all these issues, but I was taken by other projects and… 2 months later, I don’t remember the whole chronology of the events.

All I remember is that Docker for Windows was inexplicably broken and, looking at the different Github issues pages I stumbled upon, it seems that I wasn’t alone and that this tool was clearly not integration-tested.

After a few days, I was able to get Docker for Windows working correctly. The Gitlab runner, however, failed miserably for different reasons.

However, what upset me was that this “Gitlab runner” tricked me into thinking that it was only using the Docker images I told it to use; instead, it also used its own “gitlab-runner” image for whatever reason, and this image was buggy, failing the whole Docker CI process.
The reason this upsets me is that, when I use Docker, I do it in order to get a reliable and reproducible environment for the whole build process.
Really, reliability and control of the environment are the two main reasons I use Docker in the first place. The fact that the Gitlab runner introduced a third-party buggy image, while never really saying anything about it, broke these two main concepts.
The use of Docker with the Gitlab runner led to an unreliable process (Who knows when that image will be buggy again ? And if it’s buggy, what can I do ?) and stripped me of control over the environment (What is this image doing in the first place ? How can I control it ?).

Seriously, I’m okay with throwing additional boilerplate scripts into the Docker build scripts, as long as I still understand what will be executed, why I should do this and in which environment it will be done.
But subtly pulling and running some Docker images, to do I don’t know what, I don’t know why, AND FAILING THE WHOLE CI WITH THAT just screams “kludgy and unusable build system”.
So, I gave up on the Gitlab CI. It looks cool on the outside but, once you start encountering issues with it, you suddenly lose a lot of trust in this tool after understanding the reasons for these issues, and how they are handled on the official Gitlab issues pages.

Also, I now understand that the tools meant to make automated testing, CI/CD, and all the other quality assurance processes simpler are the ones with the poorest QA.

So here are the issues I encountered and documented 2 months ago.

Gitlab

How do I get the ID of my project ?

It’s below the project name, when looking at the project’s Details (the main page of the project). Be sure to use a recent Gitlab version.

How to test the API GET requests quickly

Log in to your Gitlab server, using a cookie-accepting browser. Then go to the API endpoint you want to test.

For example, let’s say that your Gitlab server is hosted at : https://grumpy.hamsters.gitlab
You can then check, say, the first project’s metadata by browsing to : https://grumpy.hamsters.gitlab/api/v4/projects/1

Recent browsers will switch to a special console mode that displays the received JSON response in a readable manner.
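If you prefer the terminal, the same check can be done with curl — a sketch, assuming you generated a personal access token in your Gitlab profile :

curl --header "PRIVATE-TOKEN: your_access_token" "https://grumpy.hamsters.gitlab/api/v4/projects/1"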

/api/v4/projects/:id -> 404 Not found

Either you didn’t provide the right credentials, OR your database is broken.

It took me a while to understand what was going on, though… So, if you just want to check the database requests, here’s what you can do :

  1. Do a backup and, if it’s a production server, duplicate it on a dev machine !

  2. Enable Ruby on Rails ActiveRecord database request logging, by changing config.log_level = :info to config.log_level = :debug in config/environments/production.rb and restarting Gitlab (gitlab-ctl restart).
    In the official Docker image container, the file is located at :
    /opt/gitlab/embedded/service/gitlab-rails/config/environments/production.rb. Remember to revert this setup afterwards, by setting config.log_level = :debug back to config.log_level = :info.
    You could also create another environment file and set Rails to use this new environment, but this might generate new issues I’m not aware of.

  3. Request your project information through the API again (basically, redo the GET request to https://your.server/api/v4/projects/1234) and check the logs.
    You should see the SQL request printed in the logs.
    Note that I’m using the official Docker image, so I’m using docker logs -f container_name to follow the logs.
    If you’re not using the official Docker image, /var/log/gitlab/production.log might be a good place to look. Else, just do a grep -r "SELECT" /var/log and check the results.

  4. Start gitlab-psql and re-execute the database request. Generally the request will be something like SELECT "projects"."id" FROM "projects" WHERE "projects"."id" = 1234 AND "projects"."pending_delete" = false LIMIT 1;.
    In my case, since the constraints were not propagated correctly, pending_delete for my project was set to NULL instead of FALSE, which made the SQL request fail.
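For my pending_delete case, the value repair boiled down to something like this — a sketch, assuming gitlab-psql forwards psql options like -c (the missing DEFAULT constraint itself still has to be restored, see below) :

gitlab-psql -c 'UPDATE projects SET pending_delete = FALSE WHERE pending_delete IS NULL;'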

Now, if you want to check the code handling API requests for projects, check these files :

  • lib/api/projects.rb
  • lib/api/helpers.rb

In the official Docker image, these files are located at /opt/gitlab/embedded/service/gitlab-rails/lib/api.

You can do some puts debugging in them, restarting Gitlab every time (gitlab-ctl restart) and checking the logs for the puts messages.
It’s ugly but it works and can help pinpoint the issue, though you’ll need some good knowledge of Ruby.
Just remember that, in Ruby, functions/methods can be called without parentheses, and these calls can be confused with local variables…

Repair constraints and defaults on PostgreSQL

Single column

Repair NOT NULL
ALTER TABLE ONLY "table_name" ALTER COLUMN "column_name" SET NOT NULL;
Repair DEFAULT
ALTER TABLE ONLY "table_name" ALTER COLUMN "column_name" SET DEFAULT default_value;

Remember that strings, in PostgreSQL, must be single-quoted.

'a' → Good.
"a" → Error.

Set NULL values back to default
UPDATE "table_name" SET "column_name" = default_value WHERE "column_name" IS NULL;

Remember that strings, in PostgreSQL, must be single-quoted.

‘a’ → Good.
“a” → Error.

Repair PostgreSQL database constraints and defaults after a MySQL migration

I made a little custom Ruby script to repair the constraints and defaults, due to bad migrations…

Of course, as always, if you operate on your database : BACK IT UP !
No need for PostgreSQL commands, just copy the database folder ! Or the entire container persitent data volumes, if you’re using Docker.

Anyway, here’s the ruby script :

#!/usr/bin/env ruby

class Table
	def initialize(name)
		@table_name = name
		#puts "SELECT * from #@table_name"
	end

	def set_not_null(col_name, nullable)
		if (nullable == false)
			puts %Q[ALTER TABLE ONLY "#{@table_name}" ALTER COLUMN "#{col_name}" SET NOT NULL;] 
		end
	end

	def set_null_values_to_default(col_name, default)
		puts %Q[UPDATE "#{@table_name}" SET "#{col_name}" = #{default} WHERE "#{col_name}" IS NULL;]
	end
	
	def fix_null_when_not_null(col_name, nullable, default)
		if (nullable == false)
			# Clear the NULL values first, else SET NOT NULL fails on them
			set_null_values_to_default(col_name, default)
			set_not_null(col_name, false)
		end
	end

	def set_default_value(col_name, default)
		puts %Q[ALTER TABLE ONLY "#{@table_name}" ALTER COLUMN "#{col_name}" SET DEFAULT #{default};]
	end

	def repair_column(col_name, params, decent_default)
		fix_null_when_not_null(col_name, false, decent_default) if (params[:null] == false)
		set_default_value(col_name, decent_default) if (params[:default] != nil)
	end

	def datetime_with_timezone(col_name, **args)
		set_not_null(col_name, args[:null])
	end
	def datetime(col_name, **args)
		datetime_with_timezone(col_name, **args)
	end
	def date(col_name, **args)
		datetime_with_timezone(col_name, **args)
	end
	def integer(col_name, **args)
		default = (args[:default] || 0)
		repair_column(col_name, args, default)
	end
	def decimal(col_name, **args)
		default = (args[:default] || 0.0)
		repair_column(col_name, args, default)
	end
	def float(col_name, **args)
		decimal(col_name, **args)
	end
	def bigint(col_name, **args)
		integer(col_name, **args)
	end
	def text(col_name, **args)
		default = "'#{args[:default] || ""}'"
		repair_column(col_name, args, default)
	end
	def string(col_name, **args)
		text(col_name, **args)
	end
	def boolean(col_name, **args)
		default = "#{args[:default] || false}"
		repair_column(col_name, args, default)
		set_null_values_to_default(col_name, default)
	end
	def index(*args)
	end
	def binary(col_name, **args)
		set_not_null(col_name, args[:null])
	end
	def jsonb(col_name, **args)
		set_not_null(col_name, args[:null])
	end
	
end

def create_table(name, **args, &block)
	t = Table.new(name)
	yield t
end
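To illustrate how this script is meant to be fed, here’s a tiny hypothetical run, with a single made-up create_table block in schema.rb style appended to it :

cat >> repair_db.rb <<'EOF'
create_table "projects", force: :cascade do |t|
  t.boolean "pending_delete", default: false, null: false
end
EOF
ruby repair_db.rb

Which should print something like (the duplicated UPDATE is redundant but harmless) :

UPDATE "projects" SET "pending_delete" = false WHERE "pending_delete" IS NULL;
ALTER TABLE ONLY "projects" ALTER COLUMN "pending_delete" SET NOT NULL;
ALTER TABLE ONLY "projects" ALTER COLUMN "pending_delete" SET DEFAULT false;
UPDATE "projects" SET "pending_delete" = false WHERE "pending_delete" IS NULL;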

What I did then is :

  • I pasted all the create_table blocks from the db/schema.rb file used by the Gitlab version I ran (/opt/gitlab/embedded/service/gitlab-rails/db/schema.rb in the official Gitlab Docker image) at the end of the script above.
  • Ran it like this :

    ruby repair_db.rb > script.psql

If you do this with Powershell, script.psql will be saved in UTF-16… so you’ll have to convert it to UTF-8 (see the iconv sketch after this list), else PostgreSQL won’t be able to parse the script.

  • Checked that script.psql contained actual data

    cat script.psql
  • Copied the generated file (script.psql) to my Gitlab server.
    If you’re using Docker, use docker cp script.psql your_container_name:/tmp/script.psql.

  • Ran it with gitlab-psql -f /path/to/script.psql
    gitlab-psql -f /tmp/script.psql if we follow the same Docker example.
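About the UTF-16 warning above : if you have Git Bash around (see the quicktips below), iconv can do the conversion — a sketch, assuming iconv is available :

iconv -f UTF-16 -t UTF-8 script.psql > script-utf8.psql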

Some table names/column names were incorrect and produced errors. However, all the weird API issues that I encountered were fixed using this simple method.

Gitlab runner issues

Never find jobs

  • Check that your runner is enabled for this project (Gitlab → Your project → Settings → CI/CD)
  • Check that the runner isn’t set up to answer only to ONE specific tag.
    If it’s set up, in Gitlab, to respond to specific tags, either tag the pipeline in the .gitlab-ci.yml, or remove the tag restrictions in the runner settings (click on the runner name in the list).
  • Check that you can access the project description through the API, using your credentials at least.
    If you cannot, it might be a Gitlab (database) issue.
    Just log in on your Gitlab instance, go to https://your-same.gitlab.server/api/v4/projects/:id and check that you’re not receiving a “404 Not Found” JSON error.
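Also, gitlab-runner ships a verify subcommand, which is worth running before anything else — it checks that each registered runner can actually reach the Gitlab server :

gitlab-runner verify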

unable to access ‘https://gitlab-ci-token:[MASKED]@your.server/repo/name.git/': SSL certificate problem: unable to get local issuer certificate

Two options :

  1. Edit config.toml and add this in the [[runners]] section :

    pre_clone_script = "git config --global http.\"https://gitlab.grenoble.texis\".sslVerify false"
  2. Edit config.toml and add this in the [[runners]] section :

    pre_clone_script = "git config --global http.sslCAinfo /etc/ssl/certs/mycerts/gitlab.grenoble.texis.crt"

And change volumes in the [runners.docker] subsection to this :

volumes = ["/cache", "/c/Gitlab/certs:/etc/ssl/certs/mycerts"]

SSL Certificate problem: unable to get local issuer certificate

It’s a bug with the latest gitlab-runner images.
Thanks to Grégory Maitrallain for reporting this in the gitlab-runner issues list.

That said, while he was able to work around this issue by adding helper_image = "gitlab/gitlab-runner-helper:x86_64-91c2f0aa" in the config.toml file, I haven’t been able to use this workaround.
On my side, using this helper image makes the docker script die very early…

So, if the helper_image thing doesn’t work for you, the best remaining solutions, for now, are to either :

  • Edit the config.toml file and add :

    pre_clone_script = "git config --global http.https://your.gitlab.com.sslVerify false"

(Replace your.gitlab.com with the DNS name, or IP address, of the server you’re cloning from).

  • Use another runner, that will use webhooks instead.

Don’t bother trying the git config --global http.sslCAInfo /path/to/cert.pem solution, it won’t work. The runner already injects the SSL certificates it had to use to connect to the Gitlab server.
You can check this by adding && git config --global --list in the pre_clone_script, like this :

pre_clone_script = "git config --global http.https://your.gitlab.com.sslVerify false && git config --global --list"

Docker Desktop issues

Firewall is blocking

This error message is just a generic message that makes no sense most of the time.

What I did to get past this error was enable the “Microsoft App-V client” service. Just look for “Disabled” services and check if there isn’t some “Hyper-V” or “Whatever-V” service that could be enabled.
If that’s the case, enable it and retry. Remember that you need to open the “Services” MMC pane with Administrator privileges if you want to start, stop or reconfigure services.

An unexpected error occurred while sharing drive

Your drive (C:, D:, whatever…) is not enabled for sharing in the Docker Desktop settings.
Right-click the Docker notification icon, in the notification bar, and select Settings.

Clicking Apply & Restart in Docker Desktop does nothing. Same thing for restart.

First, let me say this to any developer coding a UI :

NEVER HIDE ERRORS !
NEVER EVER DO THAT !
IF SOMETHING GOES WRONG, SHOWING NOTHING IS THE WORST THING YOU COULD DO !

I understand that there’s a UX trend about “hiding everything from the user, so that he can live in a world of dreams and rainbows, without any error message or bad thing that would create DOUBTS and FEARS !”…
But what this actually does is ENRAGE THE USER, who has to deal with an unresponsive and broken UI instead !

Seriously, here’s what happened :

  • I clicked on “Apply & Restart”
  • Saw nothing happening !

”… Did it work ? … Did it crash ? …
Should I wait a little bit… ? CAN I GET ANY INFORMATION !?”

This “DON’T SHOW THE ERRORS AND DON’T SAY ANYTHING” trend just makes the UI developers look like they forgot to connect the buttons to actual functions… It’s seriously stupid.
If something goes wrong, show a helpful message somehow. The same one you put in the logs, if you cannot state the issue in a “user-friendly” way.

Anyway, in my case, this was due to the Server service being disabled, since I don’t want the file hosting service running when I’m not using it.
The Server service is really named like that, and it’s a very old Microsoft service dating from Windows… 98 First Edition ? 95 ?

Now, first, in order to pinpoint the real issue, you might want to check the logs or, better, follow the logs.
See below if you don’t know how to do this.

Follow the logs with ‘Git bash’

I recommend installing ‘Git for Windows’ with its ‘Git Bash’. This MSYS2 and bash setup works very nicely on Windows (as long as you don’t start Windows-specific console programs, like python or irb…)

With “Git bash” run as an administrator, do this :

tail -f $LOCALAPPDATA/Docker/log.txt

Then do the actions that don’t work in the “Docker Desktop Settings” UI, and check the bash console to see if any error messages were printed.

There might be a way to do the same thing in Powershell but, given Powershell’s propensity to auto-scroll to the end on new output, I’d highly recommend not using Powershell for following logs, or finding a way to disable this behaviour first, since looking for errors will be far more difficult otherwise.

Check the logs afterwards with CMD or Powershell

So, if you’re using cmd.exe or Powershell, what you can do instead is : 1. Do the actions that do not work. 2. Check the logs.

If you’re running cmd.exe as an administrator :

cd %LOCALAPPDATA%/Docker
dir
notepad log.txt

If you’re running Powershell as an administrator :

cd $env:LOCALAPPDATA/Docker
ls
notepad log.txt

mkdir /host_mnt/c: file exists when trying to mount a volume

  • Run this in a powershell run as administrator :

    docker container prune
    docker volume prune
  • Then open the Docker Desktop Settings.

  • In Resources -> Volumes, click on “Reset Credentials”.

  • Type your administrator credentials again.

  • Click on “Apply & Restart”.

Note that you might have to do this every time the system goes into “sleep mode”… I’m not kidding… This is insane !

The story behind this (written 1 week after the incidents)

So, last week, at my job, I decided to put a Gitlab CI runner in place, on Windows machines, in order to demonstrate how to use automated testing and deployment, with Gitlab runners and Docker containers.

I had already tested the whole thing with another Gitlab server from the same workplace and my office Linux machine. However, since most of the team develops Windows applications, a Windows CI/CD workflow made sense.

Ugh…

You know, sometimes, you get this sensation that tells you to stop wasting energy on a project, or things will turn to shit…

So, to begin with, the first Gitlab server I tested it on was one I had installed myself, using docker-compose on a Synology NAS.

I have to say that docker-compose on Synology works nicely but, for fuck’s sake, if you’re going to put a whole UI to manage Docker in place, put one in to generate and manage docker-compose.yml files…
It would be WAY better and WAY more usable than a UI where you have to SET EACH ENVIRONMENT VARIABLE BY HAND, through a horrendous form, EVERY TIME YOU WANT TO CREATE A CONTAINER.
Dear Docker UI makers : use Docker before making the UI !

Anyway, I installed the whole thing and it worked beautifully, with updates basically done like this :

cd /path/to/docker-configs/gitlab &&
# Shutdown
docker-compose -f docker-compose.yml down &&
# Backup
tar cJpvf ../backups/gitlab-`date +"%Y%m%d-%H%M"`.tar.xz . &&
# Update the image
docker pull gitlab/gitlab-ce:latest &&
# Start Gitlab again
docker-compose -f docker-compose.yml up -d

And the only reason why I’m using tarballs to generate backups is that I didn’t learn how to use volumes correctly.

So, yeah… the other Gitlab server… It was the one provided by Synology as a package at some point, which I decided to migrate to the official Docker image, using a similar configuration.
The whole idea was that Synology updates were sometimes SOOO slow that you had to wait several months to get access to new Gitlab features.

The issue, however, was that the old Gitlab server used MySQL as its database while the new one used PostgreSQL… The tutorials provided by Gitlab were either too old or unusable, so it was a pain in the ass !
Long story short, I was able to migrate the tables and columns correctly, but not their constraints.
Turns out that all the DEFAULT and NOT NULL, and other simple things, were not ported… Which I learned a bit too late. No error messages spewed out during the migration.
Well… A LOT OF THEM HAPPENED during the ActiveRecord migration scripts that update the old Gitlab version schemas.
And don’t you dare use a too old version, or these migration scripts won’t even run !
I love Gitlab !

I was able to fix these database issues when errors popped up here and there, but this still generated hidden issues with the API.
Turns out that Gitlab, while using Ruby on Rails, can be very silent about database issues.
The main reason being that ActiveRecord insertion requests generally leave some fields blank, hoping that the DEFAULT constraints will kick in and fill the blanks. However, if these constraints were not ported, you will get NULL values inserted in boolean columns, without either PostgreSQL or ActiveRecord catching the issue.
So when Gitlab, using ActiveRecord, tries to SELECT rows WHERE that column value is FALSE, PostgreSQL ignores the rows with NULL in that column, which leads to Gitlab finding nothing and returning nothing.
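You can see this three-valued logic from any psql prompt — NULL = FALSE evaluates to NULL, not TRUE, so the WHERE clause silently drops the row. A sketch :

psql -c "SELECT 'kept' WHERE NULL = FALSE;"   # returns zero rows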

Now, my first issue was that the Gitlab runners would not pick up any job from some specific projects.

Debugging issues with gitlab-runner is a REAL pain, by the way…
gitlab-runner outputs almost NOTHING, even when executed with the --debug flag. You’d expect this flag to send 10k lines of logs to the standard output, but NO ! It outputs almost NOTHING !
It doesn’t even print the REST requests it is sending !
So, when you have no clear idea of how it connects to the server, it makes debugging so much harder ! For no real reason !

So I started to use the API and… something was odd. I could access some project descriptions with /api/v4/projects/:id, while some others would output “404 Not found”.

I first thought that it could be some permission issue, but I still feared that it might have to do with the database. However, since I saw nothing in the logs that went like “I DIDN’T EXPECT NULL VALUES HERE !”, I searched for simple reasons first.

AAAND, after wasting an entire day checking the project permissions, checking the logs after creating new projects, and trying to sniff gitlab-runner’s traffic, I decided it was time to delve into Gitlab’s code and understand why it replied with “404 Not Found” for valid projects.

So I backed up the docker-compose files and volume data, put them on my machine, restarted an identical container and searched for the files handling the “projects” API requests.
After a few greps, I found that the main file handling the whole “projects” API logic was :
/opt/gitlab/embedded/service/gitlab-rails/lib/api/projects.rb

The next step was “Check if I can do some ‘puts’ debugging”…
Well, no, the first step was understanding how this… thing… is architected. Good thing I have a lot of experience with Ruby. And after… guessing how the whole thing worked, I checked if I could do some puts debugging.

See, when you attack that kind of code, there might be some advanced debugging mechanisms available. But being able to just log the content of variables or objects, with simple functions/methods like puts, fprintf or log, is always the best call for short-term debugging.
These functions are easy to remember and don’t require delving into dozens of manpages just to get a simple variable’s content on the screen.

I tried to edit projects.rb and add some puts messages.
When using logging functions for short-term debugging, the first messages you want to display are stupid messages that you can trace easily in the logs.
Even though there was no traffic on that Gitlab instance, Gitlab still output a good amount of logs for every operation. So if you want to check that your puts messages show up in the logs, you need dumb messages like : puts "MEOW".
I tried this, did an API request and… no MEOW… So my first reaction was “Hmm… maybe I need to restart the app. IIRC, Rails applications need to be restarted on modifications.”. And, yeah, a restart of Gitlab with gitlab-ctl restart was all that was needed.

One restart later, I checked the logs (docker logs -f gitlabce_gitlab_1) and : it MEOWED ! Yay ! I can log code results !
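If the log volume gets in the way, filtering on your own marker helps — same container name as above :

docker logs -f gitlabce_gitlab_1 2>&1 | grep MEOW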

So the next step was to trace how the get “:id” function worked…
While I did a lot of projects in Ruby, I was a bit rusty and, while I appreciate the whole “function calls without parentheses” thing in Ruby, I still find it very misleading when it is used like in this function.
Seriously, I spent 10 minutes trying to understand how “user_project” was defined. I took it for a local variable at first… Then I thought “maybe it’s more global ? But global variables are referenced with $ prefixes and… Rails projects don’t use global variables anyway…”.
AAAND after a few greps, I understood that user_project was actually a helper method, defined in lib/api/helpers/project_helpers.rb.

Before I understood this, I put some messages before and after the options definitions and, while I saw the messages before, I never saw the messages after… So something went wrong in between, obviously.
I then added some debugging messages to the helper methods, which made me understand that projects.find_by(id: id) returned nothing.

Well… find_by is a generic Ruby on Rails method that sends a request to the database so… if it returned nothing, it only meant one thing : the request to the database returned nothing.
Alright, “time to find how to log ActiveRecord database requests in Ruby on Rails applications…”. With a few web searches, I found that changing config.log_level = :info to config.log_level = :debug in config/environments/production.rb did the trick.
One restart later, I had the database requests, and the request was something like :

SELECT "projects" WHERE "projects"."id" = 1234 AND "projects"."pending_delete" IS FALSE

Turned out that the pending_delete column was set to NULL instead of false in every new project, making the whole request fail every time… UGH… By looking at db/schema.rb, it was clear that the “default: false” constraint was not set up in the database…

Since it wasn’t the first time that happened, I devised a little Ruby script which took the db/schema.rb create_table instructions as input, and generated the PostgreSQL ALTER TABLE instructions that would set up the DEFAULT and NOT NULL constraints, while updating previously inserted NULL values to the default values when DEFAULT constraints existed.

This fixed most of the issues and, notably, the API. Once fixed, the runners were able to access the server jobs again and started executing them and… failing, but only on Windows… Hmm…

A lot of Windows administrators would say “It failed on Windows ? How surprising !“…
However, this time, the hate is really misplaced…

The problems encountered (written 2 months later)

Yeah, I don’t exactly remember how it went. But basically, I hit a LOT of issues with Docker for Windows alone. This was due to me not having the “Server” service enabled (which broke the UI) and not having the “Microsoft App-V client” service enabled (which led to some firewall issue messages that were complete red herrings).
Note that this “Server” service is Microsoft’s NetBIOS “Server” service, which dates from… Windows 98 ? 95 ? … It’s still there on Windows 10.

Meanwhile, what I got was :

  • The UI of Docker Desktop not responding
  • Docker failing to launch due to some pseudo firewall issues
  • Docker Desktop unable to mount volumes
  • Had to repair some Gitlab database issues dating from the MySQL to PostgreSQL migration.
  • Gitlab runner unable to clone git repositories from internal webservers using self-signed certificates.
    And, NO, the agent is UNABLE to use SSH to clone a repository.
    WHY !?
  • No way to provide the certificates the runner should use, due to some stupid bugs in the “gitlab-runner” Docker image that is covertly used to prepare the build environment.
    The bug was not fixed for several months straight.

So, I could still clone the repositories by just disabling the git client’s SSL verification… But, as I said earlier, the whole “let’s add some Docker images without telling you, and fail the whole build due to some hidden bugs in these images” thing just drove me mad. Seeing how the issue was handled, too…
That just screamed Unreliable.