
Concord's Function Tracing System

Debugging a native distributed application on a client’s machine is harder than it sounds. Going from a single function backtrace to a system-wide diagnosis is, for all practical purposes, impossible.

Imagine getting a TCP timeout on a function call. The timeout could mean a cable was cut in the datacenter, a power rack went out, a nuclear bomb took out the receiving end, something was misconfigured, etc. There are so many layers of abstraction that connecting a networking issue to a line of code often requires additional systems to process, correlate and analyze failures before a human can see how a misconfiguration caused a system-wide outage.

Anyway, the bad news is that when we get a report of a failed operator (Concord’s unit of computation), all we get is a backtrace :( - my friends at backtrace.io often say:

The most informative thing you get from a backtrace is:

0x11G0FURSLF

On a more serious note, you get something like this:

...
...
I1126 00:32:38.772112    46 executor_utils.hpp:47] Stopping mesos driver
*** Check failure stack trace: ***
    @     0x7f02d0439daa  (unknown)
    @     0x7f02d0439ce4  (unknown)
    @     0x7f02d04396e6  (unknown)
    @     0x7f02d043c687  (unknown)
    @     0x7f02d75e4608  bolt::PrincipalServiceClientLoop::sendRecords()
    @     0x7f02d75e2b80  _ZZN4bolt26PrincipalServiceClientLoop4loopEvENUlvE0_clEv
    @     0x7f02d75e4eab  _ZNSt17_Function_handlerIFvvEZN4bolt26PrincipalServiceClientLoop4loopEvEUlvE0_E9_M_invokeERKSt9_Any_data
terminate called after throwing an instance of 'std::logic_error'
  what():  Bad use of ProcessReturnCode; state is 3 expected 2
    @     0x7f02d755bb32  std::function<>::operator()()
Aborted (core dumped)
...

And this is only if you were clever enough to use libunwind, hook up Google Logging’s failure handler, and initialize it properly in your program… and, oh, we forgot to mention: make sure core dumps are enabled and the ulimits are set…

# example: allow core dumps and give them a recognizable name
$> sudo su
$> ulimit -c unlimited
$> echo 'core.%h.%e.%t' > /proc/sys/kernel/core_pattern
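On the C++ side, the minimum wiring looks roughly like this. This is a sketch, not our exact setup: it assumes you use Google Logging (glog), and the rlimit call is just one way to lift the core-size limit from inside the process.

#include <glog/logging.h>
#include <sys/resource.h>

int main(int argc, char** argv) {
  // Initialize glog and install its failure handler, which prints a
  // symbolized stack trace on fatal signals (SIGSEGV, SIGABRT, ...).
  google::InitGoogleLogging(argv[0]);
  google::InstallFailureSignalHandler();

  // Allow core dumps for this process; the shell equivalent is `ulimit -c unlimited`.
  struct rlimit core_limit;
  core_limit.rlim_cur = RLIM_INFINITY;
  core_limit.rlim_max = RLIM_INFINITY;
  setrlimit(RLIMIT_CORE, &core_limit);

  // ... the rest of your program ...
  return 0;
}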

At Concord we were lucky enough to start a project with -std=c++14, which has been a really fun way to integrate state-of-the-art libraries like Facebook’s Folly, Wangle, etc. These libraries come with lots of goodies, from lock-free queues to string formatters and everything in between.

The problem with integrating other open source projects in general is that when your product has a bug, it is your responsibility to fix it. The challenge then becomes: how exactly do you debug all of these components when you `apt-get`‘ed some binary dependencies? Searching that bug space is tedious. The way we tackled it at Concord was to use function tracing with ‘tags’ (think of tags as notes or function groupings, like ‘web’ or ‘backend’).

There are many tricks, mechanisms and pieces of code we wrote to try to isolate these issues from a framework perspective. In this post we’ll focus only on function tracing.

What is function tracing?

The folks at the Linux Trace Toolkit: Next Generation (LTTNG) - lttng.org - do an excellent job of explaining what tracing is:

As the history of software engineering progressed and led to what we now take for granted—complex, numerous and interdependent software applications running in parallel on sophisticated operating systems like Linux—the authors of such components, or software developers, began feeling a natural urge of having tools to ensure the robustness and good performance of their masterpieces.

..

Tracing is a technique used to understand what goes on in a running software system. The software used for tracing is called a tracer, which is conceptually similar to a tape recorder. When recording, specific probes placed in the software source code generate events that are saved on a giant tape: a trace file. Both user applications and the operating system may be traced at the same time, opening the possibility of resolving a wide range of problems that are otherwise extremely challenging.

For folks familiar with web technologies, it looks pretty much like the Chrome developer tools.

chrome://tracing

One might ask: why on earth would I go through the trouble of setting up a tracing mechanism when I can just printf and be done with it?

This is exactly what most developers do. In fact, even high-profile projects like Apache Kafka and Facebook’s Folly use dynamic log levels as their tracing mechanism.

Here is a snippet from Folly’s main async wrapper around libevent (EventBase.cpp):


  if (ev.ev_base) {
    evb_ = event_base_new();
  }

  if (UNLIKELY(evb_ == nullptr)) {
    LOG(ERROR) << "EventBase(): Failed to init event base.";
    folly::throwSystemError("error in EventBase::EventBase()");
  }
  // Only emitted when this module's verbose log level is >= 5.
  VLOG(5) << "EventBase(): Created.";
  initNotificationQueue();
  RequestContext::saveContext();

This is nice and convenient because now you can trace the execution just by raising the verbosity:

$> ./a.out --vmodule=EventBase=5 --v=2

If institutional practice gives you a code policy that spells out which log level means what, this is not so bad. For example, Kafka’s coding standard clearly states that logs are yet another user interface and should be treated seriously: see Apache Kafka’s coding guide.

Why on earth would one then go through the trouble of instituting a tracing engine like LTTNG? The answer is simple:

Performance

Even if you are using a high-performance logging library (i.e., you do all the tricks: fallocate a big file, use async writes, batch your writes, etc.), chances are the logging still causes observable performance interference. That is, your code will behave differently under different log levels, especially in a multithreaded environment where you might have timer callbacks. See the observer effect for more details.
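A contrived sketch of that observer effect (the function and its timing code below are made up for illustration): the amount of work a log line does depends on the verbosity level, so the program you measure is not quite the program you ship.

#include <glog/logging.h>
#include <chrono>

// Hypothetical worker used only to illustrate the point.
void processBatch(int batchId) {
  auto start = std::chrono::steady_clock::now();
  // ... the actual work ...
  auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                std::chrono::steady_clock::now() - start)
                .count();
  // With --v < 5 this line is only a flag check; with --v >= 5 it formats a
  // string and goes through the logging machinery, shifting thread timings
  // enough to hide or expose subtle races.
  VLOG(5) << "batch " << batchId << " took " << us << "us";
}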

Quoting the LTTNG project:

LTTng disturbs the traced system as little as possible in order to make tracing of subtle race conditions and rare interrupt cascades possible.

From experience, LTTNG increases our code’s runtime by about 2%, and we are willing to pay even more than that. This was on a relatively simple test, but we were sold after we learned to use it well. The tl;dr is that it writes events into a fixed-size ring buffer. Once that buffer is full, unlike your logging library, LTTNG discards events, attempting to interfere as little as possible with the actual code. This seemed like a good trade-off for a high-performance streaming framework.
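To make the trade-off concrete, here is a simplified single-producer/single-consumer sketch of that policy (an illustration only, not LTTNG’s implementation): when the buffer is full, the producer drops the event instead of blocking, so the traced code path never waits on the tracer.

#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

template <typename Event, std::size_t N>
class DroppingRingBuffer {
 public:
  // Called from the traced code path; never blocks.
  bool tryPush(const Event& e) {
    auto head = head_.load(std::memory_order_relaxed);
    auto tail = tail_.load(std::memory_order_acquire);
    if (head - tail == N) {
      dropped_.fetch_add(1, std::memory_order_relaxed);  // full: drop, don't stall
      return false;
    }
    buf_[head % N] = e;
    head_.store(head + 1, std::memory_order_release);
    return true;
  }

  // Called from the tracer/consumer side.
  bool tryPop(Event* out) {
    auto tail = tail_.load(std::memory_order_relaxed);
    if (tail == head_.load(std::memory_order_acquire)) {
      return false;  // empty
    }
    *out = buf_[tail % N];
    tail_.store(tail + 1, std::memory_order_release);
    return true;
  }

  std::uint64_t dropped() const {
    return dropped_.load(std::memory_order_relaxed);
  }

 private:
  std::array<Event, N> buf_{};
  std::atomic<std::uint64_t> head_{0};
  std::atomic<std::uint64_t> tail_{0};
  std::atomic<std::uint64_t> dropped_{0};
};

LTTNG’s actual ring buffers are per-CPU and far more sophisticated, but the “discard rather than block” policy above is the property we cared about.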

…So, what’s next?

Let’s trace!

Whether we are working through strange network outages caused by a misconfiguration interacting with our networking library, or by double-firewall software delaying ‘low-priority traffic’ (yes, we have seen that in the wild), our post-mortems usually end with more tracing being instrumented.

LTTNG Definition

TRACEPOINT_EVENT(
    concord,
    principal_generated_spans,
    TP_ARGS(
        int, spans_arg,
        const char*, addr_arg
    ),
    TP_FIELDS(
        ctf_string(addr, addr_arg)
        ctf_integer(int, spans, spans_arg)
    )
)

From a user’s perspective, LTTNG is like protobuf or any other IDL. You have a schema: a .tp file for LTTNG, a .proto for protobuf. You have a compiler or code generator, lttng-gen-tp in this case. After all the pomp and circumstance you have some source files you link into your program, and you are ready to trace all the codez.

Code

Inside our CMake file we define a simple function that generates lttng .c and .h files from the .tp schema.


# Regenerate the LTTNG .c/.h files whenever the .tp schema changes.
function (register_lttng_dependencies)
  add_custom_command (
    OUTPUT ${LTTNG_SOURCES} ${LTTNG_HEADERS}
    DEPENDS ${PROJECT_SOURCE_DIR}/lttng/traces.tp
    WORKING_DIRECTORY ${PROJECT_SOURCE_DIR}/lib
    COMMAND lttng-gen-tp -o bolt/lttng/traces.c -o bolt/lttng/traces.h ${PROJECT_SOURCE_DIR}/lttng/traces.tp
    )
endfunction ()

After you have your project compiling and linking with LTTNG, you can use it via the tracepoint() call.

#include <boost/scope_exit.hpp> // BOOST_SCOPE_EXIT_ALL macro
#include "bolt/lttng/traces.h"  // generated by lttng-gen-tp
...

  // On scope exit, emit a tracepoint with the number of spans and the peer
  // address, then hand the spans off to the latency monitor.
  BOOST_SCOPE_EXIT_ALL(&spans, this) {
    if(!spans.empty()) {
      tracepoint(concord, principal_generated_spans, spans.size(),
                 client_.address().describe().c_str());
      latencyMonitor_->recordSpans(std::move(spans));
    }
  };

Now the fun part - profit!

We wrote a simple bash script that collects all concord:* tracepoints so we can analyze them with Trace Compass.


# Create a uniquely named tracing session.
lttng create "ccord-$(\date +%s)"
# Show which userspace tracepoints are available.
lttng list --userspace
# Record every event from the 'concord' provider.
lttng enable-event --userspace "concord:*"
lttng start
sleep 30;
lttng stop
# Tear down the session; the recorded trace stays on disk.
lttng destroy

Here is how the above trace looks in Eclipse (Trace Compass).

A little background on this specific trace: we wrote a bug that tagged entries in our latency tracing system with the same tag (CLIENT_RCV), so the latencies printed to the flat files did not match our dashboard. Because we only record this every 1024 messages, it was especially difficult to track down: you have to correlate the traces from multiple operators and then join them by their parent id. LTTNG helped us out by showing that every node was producing spans correctly but never closing any - aha! Bug solved.

If you found this interesting, go and trace Concord’s VirtualBox. Our philosophy for debugging has always been: if we can do it, so should the user.


Special thanks to Tarik Koc, Sarah Rohrbach, and Rob Blafford for reading drafts of this post.