Solving Database Performance Problems Using 'perf' and Flame Graphs
This article is part of a technical series of blog posts written by Adform's own technical experts, on the challenges of dealing with high loads of data and utilizing new technologies. You can read the previous article in this series here.
At Adform, we truly believe that data is king. In fact, most of our key product features are primarily data-driven; be it optimizing our bidding algorithms or battling fraud, everything relies on having quick and efficient access to quality data.
One of the database technologies we rely on is Vertica, a proprietary, closed-source columnar SQL database that is very well suited for the OLAP type of analytical workloads. Not only do we use it to provide blazingly fast reports to our end-clients, but also as a raw data warehouse used internally by our numerous data analysts and scientists. The ability to interactively query terabytes of raw data enables them to understand our data and, in turn, continually keep improving our data-driven products.
Today, our largest Vertica cluster consists of 26 physical nodes that during peak time ingest approx. 1 million rows per second, or roughly 100 MB/s worth of data, which adds up to approx. 4 TB per day. Allowing the cluster to be queried ad hoc presents additional challenges in maintaining it, since the workloads are less predictable due to the varied nature of the queries being submitted. Solving database performance problems usually means studying query plans and using other database-provided observability tools. However, given the lack of patterns and the vast amount of data and queries, it is not always easy to locate and understand the problem. Thus, we sometimes end up looking for additional tools.
The Linux kernel has steadily been acquiring more and more sophisticated performance troubleshooting tools, such as perf, tracepoints, and most recently eBPF, with tools like bcc making their usage easier than ever before. A lot has been written about these tools and techniques. For comprehensive introductions and summaries, be sure to study Brendan Gregg's blog and latest presentations on these topics.
In this post, we want to share our experience in Adform on how these tools can help solve performance problems that otherwise may seem very mysterious, especially when dealing with closed source software, such as Vertica. In particular, we will present three distinct example issues that we encountered and were able to efficiently solve using Linux performance tooling. We start with the most extreme performance case we have ever had to solve using Vertica.
Debugging Data Loading Bottlenecks
Conceptually, it is very simple to start using Vertica: create some tables and begin loading data by continually issuing COPY statements, which can load data either from files stored locally on Vertica nodes or from streams transferred via the network. In our case, we consume data from Kafka, construct data files, and send them to random nodes for ingestion. From there on, Vertica reshuffles rows between nodes based on table segmentation and compacts data in the background by sorting, encoding and compressing it into as few files as possible. This process is referred to as the ROS container mergeout.
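As a minimal sketch of such a load (the table and file names are hypothetical), both the delimited and the native binary variants we discuss below look like this:

```sql
-- Load a delimited file residing on one of the Vertica nodes; the DIRECT
-- hint writes straight to ROS containers, bypassing the memory-based WOS.
COPY events FROM '/data/batch_0001.csv' DELIMITER ',' DIRECT;

-- The same data prepared in Vertica's native binary format:
COPY events FROM '/data/batch_0001.bin' NATIVE BINARY DIRECT;
```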
First Sign of Trouble
Initially, things started out great, and we were able to ingest a steady stream of ~50 MB/s worth of data to a cluster of 12 nodes in more or less real-time (sans batching) without any adverse effects on query performance. As the next step in improving the pipeline, we decided to switch from forming and loading CSV files to preparing native binary files. This not only improves type safety but should also improve performance. That, at least, was the initial thought: less parsing, less data, more efficiency! Unfortunately, once we deployed the new loader, the cluster came to a screeching halt. Query performance suddenly became abysmal and, what is worse, Vertica was not keeping up with merging ROS containers. Due to a hard limit on the maximum allowed number of containers, our loaders would periodically have to stop and wait for mergeouts to complete and, in turn, could not keep up with the streams of data from Kafka.
When loading data using the binary loader, CPU usage on all Vertica nodes topped out at close to 100%, with an abnormally high amount of time spent in kernel mode, as indicated by the >50% system CPU stat, which immediately signals a problem somewhere deeper in the system. At this point, we started investigating the system using the Linux performance troubleshooting tools mentioned in the introduction. The most straightforward thing to do is to check perf top to see which functions the CPU is spending most of its time in:
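On a typical setup, this system-wide view of the hottest functions can be obtained as follows (running as root, or with a relaxed `perf_event_paranoid` sysctl, is usually required; the process name is illustrative):

```shell
# Show the functions the CPUs spend most of their time in, system-wide;
# symbols are resolved live, press 'q' to quit.
perf top

# Alternatively, restrict sampling to the Vertica process only:
perf top -p $(pgrep -o vertica)
```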
We immediately see a very unhealthy picture. For some reason, the kernel is spending most of its time inside _spin_lock, which is supposed to be used for very brief locks only.
We then tried rendering a CPU flame graph, which is produced by sampling stack traces and visualizes not only the time spent in individual symbols but also the call hierarchy, enabling one to see the larger context of the slow paths. The width of each frame is proportional to the time spent inside that function (wider is worse), and the vertical direction visualizes the call chain, with the top symbol being the deepest function called. Here is the flame graph captured on one of the nodes:
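The standard recipe for producing such a graph uses perf together with Brendan Gregg's FlameGraph scripts; sampling frequency, duration and output name below are illustrative:

```shell
# Sample all CPUs at 99 Hz for 30 seconds, capturing call stacks (-g).
perf record -F 99 -a -g -- sleep 30

# Fold the captured stacks and render an interactive SVG using the
# FlameGraph scripts (github.com/brendangregg/FlameGraph):
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > vertica-cpu.svg
```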
We already knew that _spin_lock was the culprit, but unfortunately, the flame graph did not provide much additional information. It is hard to believe that the Vertica process invokes spin locks directly; a more reasonable explanation is that the stack traces are broken, which is usually caused by missing debug symbols or call chains that are too deep. While disappointing, it was at the same time a relief to see that all Vertica binaries have their symbols intact, which is not something immediately obvious given the closed-source nature and cost of the product. Either way, it seemed obvious that the problem lay somewhere in the kernel or the standard libraries.
The Solution to the Problem
A crucial hint to understanding the problem and identifying the right solution can be seen in the output of perf top above. Besides the various locking primitives and symbols one would expect to find in a healthy database (such as compressing and looking up columns), we also noticed that a lot of time was spent allocating and freeing memory. It is indeed widely known that the malloc implementation in glibc is far from perfect when it comes to concurrent usage (see e.g. bug reports 1, 2). In fact, general-purpose memory allocation is a non-trivial problem by itself, and multiple malloc implementations exist in the wild, such as tcmalloc, dlmalloc, jemalloc and numerous others, each optimized towards a slightly different use case and taking a different approach to the problem. We can easily switch between implementations by simply setting the LD_PRELOAD environment variable of the target application to the path of the desired shared library; the Linux dynamic linker will then override the functions during application startup.
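As a sketch (the library path and the way the database process is restarted vary by system, so the lines below are illustrative):

```shell
# Verify the override works on an arbitrary short-lived process first;
# the dynamic linker reports preload problems on stderr.
LD_PRELOAD=/usr/lib64/libjemalloc.so.1 /bin/ls > /dev/null

# Then make sure the database process inherits the variable when it starts,
# e.g. by exporting it in the environment of its init/service script
# before restarting the node:
export LD_PRELOAD=/usr/lib64/libjemalloc.so.1
```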
We thus tried switching the memory allocator to jemalloc, an implementation that states minimizing lock contention as one of its design goals and is known for its overall good performance (for a technical overview, see this blog post by Facebook). As seen in various benchmarks (e.g. 1, 2), jemalloc works especially well in workloads with relatively stable pools of threads, precisely the type of workload found in databases, which is why quite a few of them ship with it enabled by default. Sure enough, switching allocators solved our problem. Here is the flame graph on one of the nodes shortly after the switch was applied:
A significant improvement! The cluster went back to normal and started working better than ever before.
To illustrate just how badly the standard malloc was performing, we deliberately reverted back to it on one of the nodes (vert80l1) and compared CPU usage during normal cluster operation:
The vendor was notified about the problem, and it was eventually acknowledged that malloc can cause performance issues on Red Hat Linux systems. However, as far as we know, it is still unclear exactly what causes this extreme behavior, the only clue being the use of native data encoding when loading data.
We verified this behavior on CentOS 6.6 / glibc 2.12 and on CentOS 7.2 / glibc 2.17 with some versions in between, also on Vertica versions 7.1 and 7.2.
Identifying Badly Performing Queries
While the previously described _spin_lock problem might in some sense be considered pathological, we nevertheless found that tools like perf have their applications in even more standard day-to-day situations.
As already mentioned in the introduction, the Vertica database cluster is used in an ad hoc fashion quite extensively by our client support specialists, business analysts and data scientists, meaning that usage patterns are rather unpredictable and the queries submitted are rarely the same, which often results in unpredictable server load. Most of the time, standard database tools allow one to isolate the offending queries very quickly and fix the problem by notifying the responsible person. However, sometimes techniques like analyzing query plans fail to identify any wrongdoing, yet we see an unreasonably high database load for no apparent reason. See an example in the graph below:
A flame graph captured during one of these high-load periods indicates that Vertica is spending a disproportionately large amount of time in the function timestamp2timestamptz:
While we do not have access to the source code of Vertica, it is reasonable to assume that this function converts time zones. The flame graph is also telling us that the conversion is being called from within a function called EE::VEval::FixedGE_Col, which suggests a "greater or equal" comparison between columns.
Going through the query history and looking for queries involving timestamp comparisons, we quickly suspected a series of queries being run periodically of the form:
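The original queries are not reproduced here; an illustrative sketch of the pattern (the table name and predicate window are hypothetical, only the log_time column comes from the actual schema) would be:

```sql
SELECT COUNT(*)
FROM impressions
WHERE log_time >= NOW() - INTERVAL '1 hour';
```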
From the documentation of the built-in function NOW() we see that it returns a TIMESTAMP WITH TIMEZONE. Since the column log_time is of type TIMESTAMP, Vertica performs a conversion for every row, even though the global timezone of the database is UTC, which makes the conversion an effective no-op.
The query plan for this particular query looks reasonable, and there is no indication that it can have such adverse effects. It was also hard to track down, because the query had been running at regular intervals even before the problem began. Only after plotting the amount of data it processed over time were we able to find a clear correlation and confirm that it was indeed causing the slowdowns:
The dashed lines mark the approximate times when the cluster load increased considerably, as seen in the server load graph above. The most obvious way of fixing the problem was to simply stop using NOW(), which we believe is good practice by itself, since NOW() is not deterministic and makes queries that use it non-reproducible.
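A hedged sketch of the fix (table name and literal are illustrative): the cutoff is computed once on the client side and passed as a plain TIMESTAMP literal, so it matches the column type and no per-row TIMESTAMPTZ conversion is needed:

```sql
SELECT COUNT(*)
FROM impressions
WHERE log_time >= '2017-05-01 11:00:00'::TIMESTAMP;
```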
Understanding Query Performance
The previous example illustrates that Vertica, like many other database systems, is not very good at estimating the cost of a query in terms of CPU usage. Instead, the estimates are mostly based on the amount of data a query touches and on which algorithms will be used to evaluate joins, groupings etc. In other words, the query profiler is, understandably, mostly concerned with memory usage and disk/network throughput, and neglects to estimate that e.g. converting time zones is expensive CPU-wise.
Here is another example, where the query plan does not tell the complete story. Consider the following pseudo-query:
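An illustrative reconstruction of the shape of that query (the table name, threshold and multiplier are hypothetical; only the price column comes from the actual schema):

```sql
SELECT CASE
         WHEN price < 100 THEN price        -- DECIMAL branch
         ELSE (price * 1.5)::FLOAT          -- FLOAT branch
       END AS price_bucket
FROM transactions;
```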
While this looks like a contrived example, it was part of a bigger query constructing a price partition landscape. Running the full query resulted in noticeable performance degradation, and it was hard to tell exactly why from the query plan: it was touching a reasonable amount of data, performing a few joins, groupings etc. - nothing out of the ordinary. However, looking at the flame graph immediately revealed the problem:
A lot of time is being spent in functions with the words Numeric and Float in their names, hinting at decimal-to-float conversion. Indeed, the column price is of type DECIMAL, and one of the CASE branches produces a FLOAT, so Vertica is forced to do the conversion for every row. Once again, it is hard to foresee such extreme adverse effects solely from the query and its plan.
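A sketch of the fix, under the same hypothetical names as above: keeping every CASE branch DECIMAL (the precision and scale shown are illustrative) avoids the per-row conversion entirely:

```sql
SELECT CASE
         WHEN price < 100 THEN price
         ELSE (price * 1.5)::DECIMAL(18, 4)  -- stay DECIMAL in both branches
       END AS price_bucket
FROM transactions;
```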
Modern databases provide a wealth of tools out of the box for understanding and solving performance problems, such as query execution plans and various database-specific metrics. The examples outlined in this blog post demonstrate that modern operating system performance tools, such as perf, can be used successfully in conjunction with them, not only to identify pathological performance problems such as sub-optimal memory allocation, but also to solve everyday database performance issues such as poorly performing queries.
Ultimately, we at Adform are big believers in mechanical sympathy, and we thus always try to understand our systems from top to bottom. We believe the examples presented in this write-up validate this point of view, and we can only hope that they will encourage others to embrace it too.
This article was written by Saulius Valatka. He is a technical lead in the data research group, focusing on providing reliable data infrastructure and pipelines for the ever-increasing amounts of data that Adform processes daily.