JoelFernandes.org


BPFd: Running BCC tools remotely across systems


This article (with some edits) also appeared on LWN.

Introduction

BCC (BPF Compiler Collection) is a toolkit and a suite of kernel tracing tools that allow systems engineers to efficiently and safely gain deep insight into the inner workings of a Linux system. Because these tools can’t crash the kernel, they are safer than kernel modules and can be run in production environments. Brendan Gregg has written nice tools and given talks showing the full power of eBPF-based tools. Unfortunately, BCC has no support for a cross-development workflow. I define “cross-development” as a development workflow in which the development machine and the target machine running the developed code are different. Cross-development is very typical among embedded-system kernel developers, who often develop on a powerful x86 host and then flash and test their code on SoCs (Systems on Chip) based on the ARM architecture. Not having a cross-development flow gives rise to several complications. Let’s go over them and then discuss a solution called BPFd that addresses this issue.

In the Android kernel team, we work mostly on ARM64 systems, since most Android devices are on this architecture. BCC tools support on ARM64 systems has stayed broken for years. One of the reasons for this difficulty is ARM64 inline assembler statements. Unavoidably, kernel header includes in BCC tools pull in asm headers, which on ARM64 can contain inline ARM64 assembly instructions that cause major pains for LLVM’s BPF backend. Recently this issue got fixed by BPF inline asm support (these LLVM commits) and folks could finally run BCC tools on arm64, but other obstacles remain.

In order for BCC tools to work at all, they need kernel sources. This is because most tools need to register callbacks on the ever-changing kernel API in order to get their data. Such callbacks are registered using the kprobe infrastructure. When a BCC tool is run, BCC switches its current directory into the kernel source directory before compilation starts, and compiles the C program that embodies the BCC tool’s logic. The C program is free to include kernel headers for kprobes to work and to use kernel data structures.

Even if one were not to use kprobes, BCC also implicitly adds a common helpers.h include directive whenever an eBPF C program is being compiled. This header is found in src/cc/export/helpers.h in the BCC sources. It uses the LINUX_VERSION_CODE macro to create a “version” section in the compiled output. LINUX_VERSION_CODE is available only in the specific kernel’s sources being targeted and is used during eBPF program loading to make sure the BPF program is being loaded into a kernel with the right version. As you can see, kernel sources quickly become mandatory for compiling eBPF programs.
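To make the version check a bit more concrete: the kernel encodes its version as a single integer using the KERNEL_VERSION(a, b, c) macro from linux/version.h, and LINUX_VERSION_CODE is that value for the tree being built. Here is a minimal Python sketch of the same arithmetic. The function name is mine and is not part of BCC.

```python
def kernel_version(major: int, minor: int, patch: int) -> int:
    """Mirror the kernel's KERNEL_VERSION(a, b, c) macro: (a << 16) + (b << 8) + c."""
    return (major << 16) + (minor << 8) + patch


# Two different kernel trees yield two different version codes, which is why
# a program built against the wrong sources can be rejected at load time.
code_4_9 = kernel_version(4, 9, 0)
code_4_14 = kernel_version(4, 14, 0)
print(hex(code_4_9), hex(code_4_14))
```

Since the value is baked into the compiled output, headers that are out of sync with the running kernel produce a mismatching code.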

In some sense this build process is similar to how external kernel modules are built. Kernel sources are large and can take up a lot of space on the system being debugged. They can also get out of sync with the running kernel, which may make the tools misbehave.

The other issue is that the Clang and LLVM libraries need to be available on the target being traced. This is because the tools compile the needed BPF bytecode, which is then loaded into the kernel. These libraries take up a lot of space. It seems overkill to need a full-blown compiler infrastructure on a system when the BPF code can be compiled elsewhere and maybe even compiled just once. Further, these libraries need to be cross-compiled to run on the architecture you’re tracing. That’s possible, but why would anyone want to do that if they didn’t need to? Cross-compiling compiler toolchains can be tedious and stressful.

BPFd: A daemon for running eBPF BCC tools across systems

Sources for BPFd can be downloaded here.

Instead of loading up all the tools, compiler infrastructure and kernel sources onto the remote targets being traced and running BCC that way, I decided to write a proxy program named BPFd that receives commands and performs them on behalf of whoever is requesting them. All the heavy lifting (compilation, parsing of user input, parsing of the hash maps, presentation of results etc) is done by BCC tools on the host machine, with BPFd running on the target and being the interface to the target kernel. BPFd encapsulates all the needs of BCC and performs them: this includes loading a BPF program, creating, deleting and looking up maps, attaching an eBPF program to a kprobe, polling for new data that the eBPF program may have written into a perf buffer, etc. If it’s woken up because the perf buffer contains new data, it’ll inform BCC tools on the host about it, or it can return map data whenever requested, which may contain information updated by the target eBPF program.

Simple design

Before this work, the BCC tools architecture was as follows:

[Figure: BCC architecture]

BPFd-based invocations partition this, thus making it possible to do cross-development and execution of the tools across machine and architecture boundaries. For instance, kernel sources that the BCC tools depend on can be on a development machine, with eBPF code being loaded onto a remote machine. This partitioning is illustrated in the following diagram:

[Figure: BCC architecture with BPFd]

The design of BPFd is quite simple: it expects commands on stdin (standard input) and provides the results over stdout (standard output). Every command is always a single line, no matter how big the command is. This allows easy testing using cat, since one can simply cat a file of commands into BPFd and check whether its stdout contains the expected results. Results from a command, however, can span multiple lines.
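The cat-style testing described above can also be scripted. The following is only a sketch: so that it runs anywhere, it uses a tiny stand-in child process that answers each input line with one output line, and the helper name run_commands is hypothetical. A real test would point daemon_argv at the bpfd binary and compare against BPFd's actual output.

```python
import subprocess
import sys

# Stand-in for BPFd (an assumption for this sketch): it replies to every
# input line with a single "got: ..." line.
FAKE_DAEMON = [
    sys.executable, "-c",
    "import sys\n"
    "for line in sys.stdin:\n"
    "    print('got: ' + line.strip())\n",
]


def run_commands(daemon_argv, commands):
    """Feed one command per line on stdin, then return all stdout lines."""
    payload = "".join(cmd + "\n" for cmd in commands)
    done = subprocess.run(daemon_argv, input=payload, capture_output=True,
                          text=True, check=True)
    return done.stdout.splitlines()


output = run_commands(FAKE_DAEMON, ["BPF_CREATE_MAP 1 count 8 40 10240 0"])
print(output)
```

Because stdin is closed once the payload is written, this is the scripted equivalent of piping a command file into the daemon with cat.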

BPF maps are data structures that a BPF program uses to store data which can be retrieved at a later time. A map is represented by a file descriptor returned by the bpf system call once the map has been successfully created. For example, the following is a command to BPFd for creating a BPF hashtable map.

BPF_CREATE_MAP 1 count 8 40 10240 0

And the result from BPFd is:

bpf_create_map: ret=3

Since BPFd is proxying the map creation, the file descriptor (3 in this example) is mapped into BPFd's file descriptor table. The file descriptor can be used later to look up entries that the BPF program in the kernel may have created, or to clear all entries in the map, as is done by tools that periodically clear the accounting done by a BPF program.

The BPF_CREATE_MAP command in this example tells BPFd to create a map named count with map type 1 (type 1 is a hashtable map), a key size of 8 bytes, a value size of 40 bytes, a maximum of 10240 entries and no special flags. BPFd created the map and identified it by file descriptor 3.
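On the host side, building such a command and reading back the reply is plain string handling. Here is a small sketch that follows the field order and the reply format of the example above. The MapSpec class and both function names are hypothetical and are not taken from the BCC or BPFd sources.

```python
from dataclasses import dataclass

BPF_MAP_TYPE_HASH = 1  # type 1 is a hashtable map, as in the example above


@dataclass
class MapSpec:
    map_type: int
    name: str
    key_size: int      # bytes
    value_size: int    # bytes
    max_entries: int
    flags: int = 0


def create_map_command(spec: MapSpec) -> str:
    """Build the single-line command, using the field order of the example."""
    fields = [spec.map_type, spec.name, spec.key_size,
              spec.value_size, spec.max_entries, spec.flags]
    return "BPF_CREATE_MAP " + " ".join(str(f) for f in fields)


def parse_create_map_reply(line: str) -> int:
    """Extract the file descriptor from a reply such as 'bpf_create_map: ret=3'."""
    prefix = "bpf_create_map: ret="
    if not line.startswith(prefix):
        raise ValueError(f"unexpected reply: {line!r}")
    return int(line[len(prefix):])


cmd = create_map_command(MapSpec(BPF_MAP_TYPE_HASH, "count", 8, 40, 10240))
fd = parse_create_map_reply("bpf_create_map: ret=3")
print(cmd, fd)
```

The returned number is only meaningful inside BPFd's process, so the host has to pass it back in later commands rather than use it as a local file descriptor.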

With the simple standard input/output design, it’s possible to write wrappers around BPFd to handle more advanced communication methods such as USB or Networking. As a part of my analysis work in the Android kernel team, I am communicating these commands over the Android Debug Bridge which interfaces with the target device over either USB or TCP/IP. I have shared several demos below.

Changes to the BCC project for working with BPFd

BCC needed several changes to be able to talk to BPFd over a remote connection. All these changes are available here and will be pushed upstream soon.

Following are all the BCC modifications that have been made:

Support for remote communication with BPFd such as over the network

A new remotes module has been added to BCC tools with an abstraction that different remote types, such as networking or USB, must implement. This keeps code duplication to a minimum. By implementing the functions needed for a remote, a new communication method can be easily added. Currently an adb remote and a process remote are provided. The adb remote is for communication with the target device over USB or TCP/IP using the Android Debug Bridge. The process remote is probably useful just for local testing. With the process remote, BPFd is forked on the same machine running BCC and communicates with it over stdin and stdout.
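To give a feel for the shape of such an abstraction, here is a rough sketch of a remote interface and a process-style remote. Everything in it is an assumption made for illustration: the class and method names are hypothetical, the real remotes module may look quite different, and a stand-in child process takes the place of BPFd so that the example runs anywhere. It also simplifies by reading exactly one reply line per command, whereas real results can span multiple lines.

```python
import subprocess
import sys
from abc import ABC, abstractmethod


class Remote(ABC):
    """Hypothetical interface that each remote type would implement."""

    @abstractmethod
    def send_command(self, command: str) -> str:
        """Send one single-line command and return one reply line."""

    @abstractmethod
    def close(self) -> None:
        """Shut down the connection."""


class ProcessRemote(Remote):
    """Runs the daemon as a child process and talks over its stdin/stdout."""

    def __init__(self, argv):
        self.proc = subprocess.Popen(argv, stdin=subprocess.PIPE,
                                     stdout=subprocess.PIPE, text=True)

    def send_command(self, command: str) -> str:
        self.proc.stdin.write(command + "\n")
        self.proc.stdin.flush()
        # Simplification: exactly one reply line per command.
        return self.proc.stdout.readline().rstrip("\n")

    def close(self) -> None:
        self.proc.stdin.close()   # EOF on stdin lets the child exit
        self.proc.wait()
        self.proc.stdout.close()


# Stand-in for BPFd (an assumption): replies "ok <command>" to every line.
ECHO_DAEMON = [
    sys.executable, "-u", "-c",
    "import sys\n"
    "for line in sys.stdin:\n"
    "    print('ok ' + line.strip(), flush=True)\n",
]

remote = ProcessRemote(ECHO_DAEMON)
reply = remote.send_command("BPF_CREATE_MAP 1 count 8 40 10240 0")
remote.close()
print(reply)
```

An adb-style remote would implement the same two methods but carry the lines over the Android Debug Bridge instead of a local pipe.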

Changes to BCC to send commands to the remote BPFd

libbpf.c is the main C file in the BCC project that talks to the kernel for all things BPF. This is illustrated in the diagram above. In order to make BCC perform BPF operations on the remote machine instead of the local machine, parts of BCC that make calls to the local libbpf.c are now instead channeled to the remote BPFd on the target. BPFd on the target then performs the commands on behalf of BCC running locally, by calling into its copy of libbpf.c.

One of the tricky parts of making this work is that not only calls to libbpf.c but also certain other paths need to be channeled to the remote machine. For example, to attach to a tracepoint, BCC needs a list of all available tracepoints on the system. This list has to be obtained on the remote system, not the local one, which is exactly why the GET_TRACE_EVENTS command exists in BPFd.
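For context, Linux exposes tracepoints as directories under the tracefs events directory (commonly /sys/kernel/debug/tracing/events), one directory per category and one subdirectory per event. The sketch below shows the kind of listing that has to run on the target rather than on the host. It is not BPFd's implementation of GET_TRACE_EVENTS, the function name is hypothetical, and it runs against a fake directory tree so that it needs neither root nor tracefs.

```python
import os
import tempfile


def list_tracepoints(events_dir):
    """Return sorted 'category:event' names found under an events directory."""
    found = []
    for category in sorted(os.listdir(events_dir)):
        cat_path = os.path.join(events_dir, category)
        if not os.path.isdir(cat_path):
            continue  # skip top-level control files such as 'enable'
        for event in sorted(os.listdir(cat_path)):
            if os.path.isdir(os.path.join(cat_path, event)):
                found.append(f"{category}:{event}")
    return found


# Demo on a fake tree that mimics the tracefs layout.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sched", "sched_switch"))
    os.makedirs(os.path.join(root, "block", "block_rq_issue"))
    open(os.path.join(root, "enable"), "w").close()
    events = list_tracepoints(root)

print(events)
```

Running the same listing on the development host would describe the host's kernel, which is exactly the mistake the remote command avoids.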

Making the kernel build for correct target processor architecture

When BCC compiles the C program encapsulated in a BCC tool into eBPF instructions, it assumes that the eBPF program will run on the same processor architecture that BCC is running on. This assumption is incorrect when building the eBPF program for a different target.

Some time ago, before I started this project, I changed this when building the in-kernel eBPF samples (which are simple standalone samples and unrelated to BCC). Now, I have had to make a similar change to BCC so that it compiles the C program correctly for the target architecture.
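As a rough illustration of the idea, the sketch below derives an extra compiler flag from the ARCH environment variable (the same variable used in the demos later in this article). The kernel's BPF sample headers select register layouts using macros of the form __TARGET_ARCH_<arch>. Whether BCC uses exactly this flag is an assumption here, and the function name is hypothetical.

```python
import os


def target_arch_flags(environ=os.environ):
    """Return extra compiler flags for the target architecture, if one is set."""
    arch = environ.get("ARCH")
    if not arch:
        return []  # no override: build for the host architecture
    # Assumed macro naming, modeled on the kernel's BPF sample headers.
    return [f"-D__TARGET_ARCH_{arch}"]


print(target_arch_flags({"ARCH": "arm64"}))
print(target_arch_flags({}))
```

The point is only that the architecture has to come from the target's configuration rather than from the machine doing the compilation.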

Installation

Try it out for yourself! Follow the Detailed or Simple instructions. Also, apply this kernel patch to make it faster to run tools like offcputime. I am submitting this patch to LKML as we speak.

BPF Demos: examples of BCC tools running on Android

Running filetop

filetop is a BCC tool which shows you all read/write I/O operations with a similar experience to the top tool. It refreshes every few seconds, giving you a live view of these operations. Go to your bcc directory and set the environment variables needed. For Android running on Hikey960, I run:

joel@ubuntu:~/bcc# source arm64-adb.rc

which basically sets the following environment variables:

  export ARCH=arm64
  export BCC_KERNEL_SOURCE=/home/joel/sdb/hikey-kernel/
  export BCC_REMOTE=adb

You could also use the convenient bcc-set script provided in BPFd sources to set these environment variables for you. Check INSTALL.md file in BPFd sources for more information.

Next I start filetop:

joel@ubuntu:~/bcc# ./tools/filetop.py 5

This tells the tool to monitor file I/O every 5 seconds.

While filetop is running, I start the stock email app in Android and the output looks like:

  Tracing... Output every 5 secs. Hit Ctrl-C to end
  13:29:25 loadavg: 0.33 0.23 0.15 2/446 2931
 
  TID    COMM             READS  WRITES R_Kb    W_Kb    T FILE
  3787   Binder:2985_8    44     0      140     0       R profile.db
  3792   m.android.email  89     0      130     0       R Email.apk
  3813   AsyncTask #3     29     0      48      0       R EmailProvider.db
  3808   SharedPreferenc  1      0      16      0       R AndroidMail.Main.xml
  3792   m.android.email  2      0      16      0       R deviceName
  3815   SharedPreferenc  1      0      16      0       R MailAppProvider.xml
  3813   AsyncTask #3     8      0      12      0       R EmailProviderBody.db
  2434   WifiService      4      0      4       0       R iface_stat_fmt
  3792   m.android.email  66     0      2       0       R framework-res.apk

Notice Email.apk being read by Android to load the email application, followed by various other reads related to the email app. Finally, WifiService continuously reads iface_stat_fmt to get network statistics for Android accounting.

Running biosnoop

biosnoop is another great tool that shows you block-level I/O operations (bio) happening on the system, along with the latency and size of each operation. Following is sample output from running tools/biosnoop.py while doing random things in the Android system.

  TIME(s)        COMM           PID    DISK    T  SECTOR    BYTES   LAT(ms)
  0.000000000    jbd2/sdd13-8   2135   sdd     W  37414248  28672      1.90
  0.001563000    jbd2/sdd13-8   2135   sdd     W  37414304  4096       0.43
  0.003715000    jbd2/sdd13-8   2135   sdd     R  20648736  4096       1.94
  5.119298000    kworker/u16:1  3848   sdd     W  11968512  8192       1.72
  5.119421000    kworker/u16:1  3848   sdd     W  20357128  4096       1.80
  5.448831000    SettingsProvid 2415   sdd     W  20648752  8192       1.70

Running hardirqs

This tool measures the total time taken by different hardirqs in the system. Excessive time spent in hardirqs can result in poor real-time performance of the system.

joel@ubuntu:~/bcc# ./tools/hardirqs.py

Output:

  Tracing hard irq event time... Hit Ctrl-C to end.
  HARDIRQ                    TOTAL_usecs
  wl18xx                             232
  dw-mci                            1066
  e82c0000.mali                     8514
  kirin                             9977
  timer                            22384

Running biotop

Run biotop while launching the Android Gallery app and doing random stuff:

joel@ubuntu:~/bcc# ./tools/biotop.py

Output:

PID    COMM             D MAJ MIN DISK       I/O  Kbytes  AVGms
4524   droid.gallery3d  R 8   48  ?           33    1744   0.51
2135   jbd2/sdd13-8     W 8   48  ?           15     356   0.32
4313   kworker/u16:4    W 8   48  ?           26     232   1.61
4529   Jit thread pool  R 8   48  ?            4     184   0.27
2135   jbd2/sdd13-8     R 8   48  ?            7      68   2.19
2459   LazyTaskWriterT  W 8   48  ?            3      12   1.77

Open issues as of this writing

While most issues have been fixed, a few remain. Please check the issue tracker and contribute patches or help by testing.

Other use cases for BPFd

While the main use case at the moment is easier use of BCC tools in cross-development models, another potential use case that’s gaining interest is easy loading of a BPF program. The BPF code can be stored on disk in base64 format and sent to BPFd using something as simple as:

joel@ubuntu:~/bpfprogs# cat my_bpf_prog.base64 | bpfd
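Base64 fits the one-command-per-line design well, because it turns arbitrary binary data into text without any newlines. A small sketch of that property follows. The bytes used are a placeholder and not a real BPF program, the function name is hypothetical, and the layout of the BPFd command that would carry this payload is not shown here.

```python
import base64


def encode_program(bytecode: bytes) -> str:
    """Encode binary data as a single line of ASCII text."""
    return base64.b64encode(bytecode).decode("ascii")


# Placeholder bytes standing in for compiled BPF instructions.
blob = bytes(range(16))
text = encode_program(blob)

print(text)
print(base64.b64decode(text) == blob)
```

The receiving side can decode the text back into the exact original bytes before handing them to the kernel.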

In the Android kernel team, we are also experimenting, for certain use cases that need eBPF, with loading a program using a forked BPFd instance, creating maps, pinning them for use at a later time after BPFd exits, and then killing the BPFd fork since it’s done. Creating a separate process (fork/exec of BPFd) and having it load the eBPF program for you has the distinct advantage that runtime fix-up of map file descriptors isn’t needed in the loaded eBPF machine instructions. In other words, the eBPF program’s instructions can be pre-determined and statically loaded. The reason for this convenience is that BPFd starts with the same number of file descriptors each time before the first map is created.
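The predictability comes from a general rule: a newly allocated file descriptor is always the lowest number not currently in use, so a process that starts with the same descriptors open every time gets the same number for its first new one. The sketch below demonstrates this with an ordinary file open as a stand-in for map creation. It does not involve BPF or BPFd at all, and the helper name is hypothetical.

```python
import subprocess
import sys

# Child program: open one file and print the descriptor number it received.
CHILD = "import os; print(os.open(os.devnull, os.O_RDONLY))"


def first_fd_in_fresh_process() -> int:
    """Start a fresh process and report the first descriptor it allocates."""
    out = subprocess.run([sys.executable, "-c", CHILD],
                         capture_output=True, text=True, check=True)
    return int(out.stdout)


# Every fresh process starts from the same state, so the number repeats.
fds = [first_fd_in_fresh_process() for _ in range(3)]
print(fds)
```

With the descriptor numbers known in advance, they can be written into the eBPF instructions ahead of time instead of being patched in at load time.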

Conclusion

Building code for instrumentation on a different machine than the one actually running the debugging code is beneficial, and BPFd makes this possible. Alternatively, one could write tracing code in their own kernel module on a development machine, copy it over to a remote target, and do similar tracing/debugging. However, this is quite unsafe since kernel modules can crash the kernel. eBPF programs, on the other hand, are verified before they’re run and are guaranteed to be safe when loaded into the kernel. Furthermore, the BCC project offers great support for parsing the output of maps, processing them and presenting results, all using the friendly Python programming language. BCC tools are quite promising and could be the future for easier and safer deep tracing endeavours. BPFd can hopefully make it even easier to run these tools for folks such as embedded-system and Android developers, who typically compile their kernels on their local machine and run them on a non-local target machine.

If you have any questions, feel free to reach out to me or drop me a note in the comments section.