LLDB: Sanitizing the debugger's runtime

This month I started to work on correcting of the ptrace(2) layer, as test suites used to trigger failures on the kernel side. This finally ended up sanitizing the LLDB runtime as well, addressing LLDB and NetBSD userland bugs.

It turned out that more bugs were unveiled and this is not the final report on LLDB.

The good

Besides the greater enhancements this month I performed a cleanup in the ATF ptrace(2) tests again. Additionally I have managed to unbreak the LLDB Debug build and to eliminate compiler warnings in the NetBSD Native Process Plugin.

It is worth noting that LLVM can run tests on NetBSD again, the patch in gtest/LLVM has been installed by Joerg Sonnenberg and a more generic one has been submitted to the upstream googletest repository. There was also an improvement in ftruncate(2) on the LLVM side (authored by Joerg).

Since LLD (the LLVM linker) is advancing rapidly, it improved support for NetBSD and it can link a functional executable on NetBSD. I submitted a patch to stop crashing it on startup anymore. It was nearly used for linking LLDB/NetBSD and it spotted a real linking error... however there are further issues that need to be addressed in the future. Currently LLD is not part of the mainline LLDB tasks - it's part of improving the work environment. This linker should reduce the linking time - compared to GNU linkers - of LLDB by a factor of 3x-10x and save precious developer time. As of now LLDB linking can take minutes on a modern amd64 machine designed for performance.

Kernel correctness

I have researched (in pkgsrc-wip) initial support for multiple threads in the NetBSD Native Process Plugin. This code revealed - when running the LLDB regression test-suite - new kernel bugs. This unfortunately affects the usability of a debugger in a multithread environment in general and explains why GDB was never doing its job properly in such circumstances.

One of the first errors was asserting kernel panic with PT_*STEP, when a debuggee has more than a single thread. I have narrowed it down to lock primitives misuse in the do_ptrace() kernel code. The fix has been committed.

LLDB and userland correctness

LLDB introduced support for kevent(2) and it contains the following function:

Status MainLoop::RunImpl::Poll() {
  in_events.resize(loop.m_read_fds.size());
  unsigned i = 0;
  for (auto &fd : loop.m_read_fds)
    EV_SET(&in_events[i++], fd.first, EVFILT_READ, EV_ADD, 0, 0, 0);
  num_events = kevent(loop.m_kqueue, in_events.data(), in_events.size(),
                      out_events, llvm::array_lengthof(out_events), nullptr);
  if (num_events < 0)
    return Status("kevent() failed with error %d\n", num_events);
  return Status();
}

It works on FreeBSD and MacOSX, however it broke on NetBSD.

Culprit line:

   EV_SET(&in_events[i++], fd.first, EVFILT_READ, EV_ADD, 0, 0, 0);

FreeBSD defined EV_SET() as a macro this way:

#define EV_SET(kevp_, a, b, c, d, e, f) do {    \
        struct kevent *kevp = (kevp_);          \
        (kevp)->ident = (a);                    \
        (kevp)->filter = (b);                   \
        (kevp)->flags = (c);                    \
        (kevp)->fflags = (d);                   \
        (kevp)->data = (e);                     \
        (kevp)->udata = (f);                    \
} while(0)

NetBSD version was different:

#define EV_SET(kevp, a, b, c, d, e, f)                                  \
do {                                                                    \
        (kevp)->ident = (a);                                            \
        (kevp)->filter = (b);                                           \
        (kevp)->flags = (c);                                            \
        (kevp)->fflags = (d);                                           \
        (kevp)->data = (e);                                             \
        (kevp)->udata = (f);                                            \
} while (/* CONSTCOND */ 0)
This resulted in heap damage, as keyp was incremented every time a value was assigned to (keyp)->.

Without GCC asan and ubsan tools, finding this bug would be much more time consuming, as the random memory corruption was affecting unrelated lambda function in a different part of the code.

To use the GCC sanitizers with packages from pkgsrc, on NetBSD-current, one has to use one or both of these lines:

_WRAP_EXTRA_ARGS.CXX+= -fno-omit-frame-pointer -O0 -g -ggdb -U_FORTIFY_SOURCE -fsanitize=address -fsanitize=undefined -lasan -lubsan
CWRAPPERS_APPEND.cxx+= -fno-omit-frame-pointer -O0 -g -ggdb -U_FORTIFY_SOURCE -fsanitize=address -fsanitize=undefined -lasan -lubsan

While there, I have fixed another - generic - bug in the LLVM headers. The class Triple constructor hadn't initializee the SubArch field, which upsetting the GCC address sanitizer. It was triggered in LLDB in the following code:

void ArchSpec::Clear() {
  m_triple = llvm::Triple();
  m_core = kCore_invalid;
  m_byte_order = eByteOrderInvalid;
  m_distribution_id.Clear();
  m_flags = 0;
}

I have filed a patch for review to address this.

The bad

Unfortunately this is not the full story and there is further mandatory work.

LLDB acceleration

The EV_SET() bug broke upstream LLDB over a month ago, and during this period the debugger was significantly accelerated and parallelized. It is difficult to declare it definitely, but it might be the reason why the tracer's runtime broke due to threading desynchronization. LLDB behaves differently when run standalone, under ktruss(1) and under gdb(1) - the shared bug is that it always fails in one way or another, which isn't trivial to debug.

The ugly

There are also unpleasant issues at the core of the Operating System.

Kernel troubles

Another bug with single-step functions that affects another aspect of correctness - this time with reliable execution of a program - is that processes die in non-deterministic ways when single-stepped. My current impression is that there is no appropriate translation between process and thread (LWP) states under a debugger.

These issues are sibling problems to unreliable PT_RESUME and PT_SUSPEND.

In order to be able to appropriately address this, I have diligently studied this month the Solaris Internals book to get a better image of the design of the NetBSD kernel multiprocessing, which was modeled after this commercial UNIX.

Plan for the next milestone

The current troubles can be summarized as data races in the kernel and at the same time in LLDB. I have decided to port the LLVM sanitizers, as I require the Thread Sanitizer (tsan). Temporarily I have removed the code for tracing processes with multiple threads to hide the known kernel bugs and focus on the LLDB races.

Unfortunately LLDB is not easily bisectable (build time of the LLVM+Clang+LLDB stack, number of revisions), therefore the debugging has to be performed on the most recent code from upstream trunk.

This work was sponsored by The NetBSD Foundation.

The NetBSD Foundation is a non-profit organization and welcomes any donations to help us continue funding projects and services to the open-source community. Please consider visiting the following URL, and chip in what you can:

http://netbsd.org/donations/#how-to-donate