Debugging is a critical skill, part and parcel of daily work for most of us engineers. I have been to multiple interviews, and only during a handful of them was I asked “How would you debug the issue with…”, even though the job description clearly listed this skill as desirable. When conducting an interview, this is my favorite question to ask, as there are countless ways to drive the discussion. I think that debugging, like any hard skill, can be learned over time; the teaching process requires just a) time and b) persistence. That’s just a digression – I am always keen on diving into debugging problems, but I have noticed that lots of engineers focus mostly on producing more and more code.
With that said, I would like to share some details about one of my debugging sessions. I strongly believe that every solved issue teaches us something, so that’s a good reason to write a new post.
One night I was prototyping an application based on Spring Boot and a PostgreSQL database. It was a pretty simple application, one of whose features was user management: the admin user could see all the registered users, sort the list and search for users by name, last name or email.
I was evaluating different approaches for the search. I started with a simple
LIKE %term%; however, it wasn’t scalable enough for one million users, and it was high time to come up with something else.
When LIKE %% is not enough
I had recently realized that PostgreSQL supports the
to_tsvector and to_tsquery functions which, when combined, give us a powerful starter pack to embed full-text search capabilities into our applications.
I quickly crafted a simplified database schema (presented below) and started experimenting:
CREATE TABLE users (
    user_id SERIAL,
    user_name TEXT,
    user_lastname TEXT,
    user_email TEXT,
    CONSTRAINT users_pkey PRIMARY KEY (user_id)
);

INSERT INTO users (user_name, user_lastname, user_email)
VALUES ('Adam', 'Jędro', 'email@example.com');
The select query looked promising and worked on the first try:
SELECT user_id FROM users WHERE to_tsvector(CONCAT_WS(' ', user_name, user_lastname, user_email)) @@ to_tsquery('adam');
The to_tsvector(CONCAT_WS(' ', user_name, user_lastname, user_email)) part computes a set of tokens out of the three fields, whereas the
@@ to_tsquery('adam') part simply checks these tokens for a match. Refer to the official documentation to read more about that.
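To get a feel for what these tokens look like, to_tsvector can be run standalone. A quick sketch using the built-in 'simple' configuration (the exact lexemes depend on the configuration in use):

```sql
SELECT to_tsvector('simple', 'Adam Jędro email@example.com');
-- e.g. 'adam':1 'email@example.com':3 'jędro':2
```

Each lexeme is lowercased and annotated with its position in the original text.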
I know that computing the tokens (the call to the
to_tsvector() function) for each record is costly. However, it wasn’t an issue at the time: before going to prod I could precompute them using DB triggers or the relatively recent generated columns feature.
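For illustration, a sketch of the generated-column route (column and index names are made up; requires PostgreSQL 12+). A generated column’s expression must be immutable, so to_tsvector needs an explicit configuration, and CONCAT_WS (which is only stable) is replaced with COALESCE and ||:

```sql
ALTER TABLE users
    ADD COLUMN user_tokens tsvector
    GENERATED ALWAYS AS (
        to_tsvector('simple',
            coalesce(user_name, '') || ' ' ||
            coalesce(user_lastname, '') || ' ' ||
            coalesce(user_email, ''))
    ) STORED;

-- A GIN index makes the @@ lookups fast:
CREATE INDEX users_tokens_idx ON users USING GIN (user_tokens);
```

With that in place, the WHERE clause can match against user_tokens directly instead of recomputing the tsvector per row.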
With that in place, I quickly ported the aforementioned query to the Spring Boot application, wrote a couple of unit and functional tests, and pushed the commit to a stack where I had ~one million users loaded into the DB.
When your app slows down
Although the feature in question (full-match, case-insensitive search) was working properly, after a couple of requests the application sloooowed down. After a few more requests it was throwing OutOfMemoryError, and ultimately the process decided to die. That’s interesting.
Equipped with a debugger and a glass of wine, I attached to the application and traced step by step what was happening. It turned out that all the heavy lifting had been moved to Postgres, so I had no clear explanation for why the Java app was hitting high memory usage. At first, I thought that Postgres might be consuming the memory assigned to the JVM, since both processes resided on the same VM, but this wasn’t true – Postgres’s CPU/memory utilization was low and steady the whole time.
Analyzing JVM heap dump
With no other clues, I started to think about analyzing the heap dump.
I began by adding the following options to the JVM start parameters:

-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp

These parameters instruct the JVM to dump its heap to the configured path when an OOM (OutOfMemoryError) happens – as simple as that.
I quickly reproduced the problem, downloaded the heap dump and opened it using the YourKit Java profiler. At this point, it became clear that the JVM itself was the source of the problems: the heap dump alone weighed almost 4GB – almost as much as the JVM process was allowed to use.
After opening it, I almost instantly saw one million(!) instances of
org.postgresql.util.ServerErrorMessage, which immediately caught my attention.
PSQLWarning instances (and the objects they held) accounted for almost 100% of the memory, as you can see in the picture below:
“Shallow Size” refers to the size in memory of the object itself. “Retained Size” refers to the size in memory of the object itself, plus the size of the objects that are held only by the “parent” object.
Going back to debugging, I noticed that all the instances of
org.postgresql.util.PSQLWarning held exactly the same information.
Root cause & solution
From that point, it was clear that for each scanned row PostgreSQL returned the
NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored notice, which caused the DB driver to allocate literally tens of new objects per scanned row – each row scanned by the DB, not each row actually returned as a query result (the upper limit set by the query was 30 anyway). Since the whole query took up to a few seconds, none of these objects were eligible for GC during the request lifecycle. This, in turn, was the reason behind the OOM, which occurred when a couple of search queries were running in parallel. What is more, since each request to the app created millions of objects, the GC was constantly busy deleting unneeded objects, which made processing even slower.
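The notice is easy to reproduce in psql with any configuration that strips stop words (such as the default english one); and because the to_tsquery call sits in the WHERE clause of the search query, it was re-evaluated – and the notice re-emitted – for every scanned row:

```sql
SELECT to_tsquery('english', 'a');
-- NOTICE:  text-search query contains only stop words or doesn't contain lexemes, ignored
-- returns an empty tsquery
```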
As can be deduced from the notice returned by PostgreSQL, the original problem was that when the admin user typed “a” in the search form, it went in unchanged form (after sanitizing, of course) into the DB query shown above. Since “a” is a stop word thrown out by most full-text search engines, this query should never even have hit the DB – I was implementing a full-match, case-insensitive search at the time, so I wasn’t interested in rows that partially matched the search term. There are a few ways to solve this problem. I went with creating a custom search dictionary that simply doesn’t skip any stop words; however, it could also be solved by setting the
client_min_messages property directly in PostgreSQL to
error. I verified that by writing a test that executed
jdbcTemplate.update("SET client_min_messages = error"); before sending the query to the DB. That’s not a proper way of fixing bugs, though, rather a dirty hack that no one should use 😉
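For completeness, a sketch of the dictionary route (the dictionary name below is made up): a dictionary based on the simple template with no STOPWORDS parameter never throws tokens away, so even “a” yields a lexeme:

```sql
-- A dictionary that keeps every token:
CREATE TEXT SEARCH DICTIONARY keep_all (TEMPLATE = pg_catalog.simple);

-- The built-in 'simple' configuration behaves the same way:
SELECT to_tsvector('simple', 'a');   -- 'a':1, instead of an empty tsvector
```

With such a dictionary wired into the search configuration, stop-word-only input parses to a regular tsquery, so the notice (and the object churn it caused) disappears.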