Linux perf: a quick primer to application profiling

Recently, I was asked to give a demo to the team on how to use perf to profile applications. I thought, why not turn this into a blog post? So here we are. As usual, we’ll use my orderbook project to demonstrate the commands.

What is perf?

perf is a powerful tool that comes with the Linux kernel, designed to help you analyze and debug the performance of your applications. It’s an extremely versatile tool with a diverse set of capabilities. We’re going to focus only on application profiling for this post.

Disclaimers

Perf provides an excellent TUI that makes the whole process of profiling much easier, but this post focuses on the STDIO interface, as the TUI may not always be available on remote servers.

This post also does not cover VM-based languages such as Java.

Finally, perf can also use Intel Processor Trace (IPT). However, if you’d like to use IPT, you’re better off using Magic Trace, which I have written about separately. Magic Trace uses perf under the hood.

Prerequisites

To use perf effectively, make sure:

  1. You have root or similar privileges on the system you are profiling.
  2. Your program is compiled with debug symbols enabled. This allows perf to map the performance data it collects to function names and lines of source code.

Here’s how you can enable debug symbols in C++ and Go:

C++

For C++, you can use the -g flag with the g++ or clang++ compiler to enable debug symbols:

$ g++ -g source.cpp -o program

Go

For Go, debug symbols are included by default when you use the go build command. However, if you’re passing -ldflags "-s -w" to strip debug information and reduce binary size, you’ll need to omit those flags to keep the debug symbols.
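
If you are not sure whether a binary was stripped, you can check for the relevant ELF sections before profiling. Here is a minimal sketch using Go’s standard debug/elf package. The helper names (inspect, debugInfo) are my own for illustration, and the section names it checks are the usual ones on Linux rather than anything perf-specific.

```go
package main

import (
	"debug/elf"
	"fmt"
	"os"
)

// debugInfo records which symbol-related sections an ELF binary still carries.
type debugInfo struct {
	HasSymtab bool // .symtab: the symbol table with function names
	HasDWARF  bool // .debug_info: DWARF data used for source lines
}

// inspect opens the ELF file at path and checks for the two sections.
func inspect(path string) (debugInfo, error) {
	f, err := elf.Open(path)
	if err != nil {
		return debugInfo{}, err
	}
	defer f.Close()

	return debugInfo{
		HasSymtab: f.Section(".symtab") != nil,
		// Some older toolchains store compressed DWARF under a ".zdebug_" name.
		HasDWARF: f.Section(".debug_info") != nil || f.Section(".zdebug_info") != nil,
	}, nil
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: checksyms <binary>")
		os.Exit(2)
	}
	info, err := inspect(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	fmt.Printf("symbol table: %v, DWARF: %v\n", info.HasSymtab, info.HasDWARF)
}
```

If both values come back false, the binary was most likely built with stripping flags and perf will have little to map samples to.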

Profiling with perf

Let’s get right to it. We’ll start with recording the profiling data using three different call-graph modes: lbr, dwarf, and fp. All of the following perf record commands generate a perf.data file, which contains the profiling data.

Run your workload

Let’s run the benchmark from my orderbook as an example here:

$ go build -o bench cmd/bench/main.go
$ ./bench -duration=300

Last Branch Record (LBR) Mode

LBR (Last Branch Record) is a hardware feature available on some CPUs that records information about the most recent branches that the CPU has executed. In the context of perf, LBR provides very low overhead and can offer high accuracy. When available, this is my method of choice. To check if LBR is available on your machine, run:

$ cat /proc/cpuinfo | grep lbr

The output of this command should include an ’lbr’ flag, indicating that LBR is supported.
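
The same check can be scripted. The sketch below mirrors what the grep does, assuming the usual "flags : ..." lines in /proc/cpuinfo. Like grep, it matches substrings, so it lists every flag that contains lbr and lets you see exactly which one matched. matchingFlags is a name I made up for this example.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// matchingFlags returns every distinct CPU flag in the given cpuinfo text
// whose name contains substr. Only "flags" lines are considered.
func matchingFlags(cpuinfo, substr string) []string {
	seen := map[string]bool{}
	var out []string
	sc := bufio.NewScanner(strings.NewReader(cpuinfo))
	for sc.Scan() {
		key, val, ok := strings.Cut(sc.Text(), ":")
		if !ok || strings.TrimSpace(key) != "flags" {
			continue
		}
		for _, flag := range strings.Fields(val) {
			if strings.Contains(flag, substr) && !seen[flag] {
				seen[flag] = true // the flags line repeats once per CPU
				out = append(out, flag)
			}
		}
	}
	return out
}

func main() {
	data, err := os.ReadFile("/proc/cpuinfo")
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read /proc/cpuinfo:", err)
		os.Exit(1)
	}
	flags := matchingFlags(string(data), "lbr")
	if len(flags) == 0 {
		fmt.Println("no CPU flag containing \"lbr\" found")
		return
	}
	fmt.Println("matching flags:", strings.Join(flags, " "))
}
```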

To record a profile:

$ perf record -F 999 -a -g --call-graph lbr --user-callchains -p $(pgrep bench) -- sleep 30
  • -F 999 specifies the frequency of sampling. In this case, perf will sample the target 999 times per second. This is 999 instead of 1000 to reduce the possibility of lockstep sampling.
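  • -a asks perf to monitor all CPUs. Because -p is also given here, perf restricts collection to that process and the system-wide switch has no effect.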
  • -g enables call-graph recording.
  • --call-graph lbr selects the method used to capture the call-graphs, in this case lbr. It implies -g, so passing both is redundant but harmless.
  • --user-callchains ensures that perf records call chains from user space only.
  • -p $(pgrep <command>) specifies the process to profile. Replace <command> with the name of your process. In this case bench.
  • -- sleep 30 runs sleep 30 as the command perf waits on, so recording stops after 30 seconds.
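
To see why an off-by-one frequency like 999 helps, here is a small simulation of lockstep sampling. It is only a sketch under made-up assumptions: a hypothetical workload that repeats every 1 ms and spends the first 10% of each cycle in one function, and a sampler that fires at perfectly regular intervals. shareInHot is a name invented for this example.

```go
package main

import "fmt"

// shareInHot simulates timer-based sampling of a periodic workload and returns
// the fraction of samples that land in the "hot" part of each cycle.
// All times are in nanoseconds. The workload repeats every cycleNs and spends
// the first hotNs of each cycle in the hot function.
func shareInHot(freqHz, durationNs, cycleNs, hotNs int64) float64 {
	periodNs := int64(1_000_000_000) / freqHz // time between two samples
	var samples, hot int64
	for t := int64(0); t < durationNs; t += periodNs {
		samples++
		if t%cycleNs < hotNs {
			hot++
		}
	}
	return float64(hot) / float64(samples)
}

func main() {
	const (
		duration = 30_000_000_000 // 30 s, matching "sleep 30"
		cycle    = 1_000_000      // hypothetical workload period: 1 ms
		hot      = 100_000        // first 10% of each cycle is the hot function
	)
	for _, f := range []int64{1000, 999} {
		fmt.Printf("-F %d: %.1f%% of samples in the hot function (true share: 10%%)\n",
			f, 100*shareInHot(f, duration, cycle, hot))
	}
}
```

At 1000 Hz every sample lands at the same point of the cycle, so the profile claims the hot function takes all of the time. At 999 Hz the sampling point drifts through the cycle and the estimate comes out close to the true 10%.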

DWARF Mode

When LBR isn’t available, DWARF mode can be the next best thing, with a caveat: it comes with significant overhead, which can sometimes make FP mode the better choice.

$ perf record -F 99 -a -g --call-graph dwarf --user-callchains -p $(pgrep bench) -- sleep 30

Frame Pointer (FP) Mode

The overhead with FP is typically lower than DWARF, but it requires the program to be compiled with frame pointers enabled. Additionally, the FP method may not handle optimizations like tail call optimization or inlined functions accurately.

In C++, you can enable frame pointers with the -fno-omit-frame-pointer flag:

$ g++ -g -fno-omit-frame-pointer source.cpp -o program

To record:

$ perf record -F 99 -a -g --call-graph fp --user-callchains -p $(pgrep bench) -- sleep 30
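
To make the frame pointer requirement a bit more concrete, here is a toy model of how an unwinder follows frame pointers. It is a sketch, not perf’s implementation: the stack is a plain slice indexed by word rather than real memory, and the addresses and function names are invented. The layout follows the usual x86-64 convention, where each frame stores the caller’s frame pointer followed by the return address.

```go
package main

import "fmt"

// unwind walks a simulated stack using frame pointers.
// Assumed layout: mem[fp] holds the caller's frame pointer and mem[fp+1]
// holds the return address into the caller. A frame pointer of 0 ends the chain.
func unwind(mem []uint64, pc, fp uint64, maxDepth int) []uint64 {
	chain := []uint64{pc} // the sampled instruction pointer is the leaf frame
	for depth := 0; fp != 0 && depth < maxDepth; depth++ {
		if fp+1 >= uint64(len(mem)) {
			break // bogus frame pointer: stop rather than read garbage
		}
		chain = append(chain, mem[fp+1]) // return address into the caller
		fp = mem[fp]                     // follow the link to the caller's frame
	}
	return chain
}

// exampleStack builds a hypothetical stack for _start -> main -> AddOrder -> insert.
func exampleStack() (mem []uint64, pc, fp uint64) {
	mem = make([]uint64, 16)
	mem[12], mem[13] = 0, 0x1000 // main's frame: end of chain, returns into _start
	mem[8], mem[9] = 12, 0x2000  // AddOrder's frame: links to main's frame
	mem[4], mem[5] = 8, 0x3000   // insert's frame: links to AddOrder's frame
	return mem, 0x4000, 4        // sampled while executing insert
}

func main() {
	symbols := map[uint64]string{
		0x4000: "insert", 0x3000: "AddOrder", 0x2000: "main", 0x1000: "_start",
	}
	mem, pc, fp := exampleStack()
	for _, addr := range unwind(mem, pc, fp, 64) {
		fmt.Printf("%#x %s\n", addr, symbols[addr])
	}
}
```

If the compiler omits frame pointers, the saved links are not there to follow and the walk produces a truncated or bogus chain. An inlined function has no frame of its own, so it never shows up in this kind of walk.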

Analyzing Profiling Data

Now that we have our profiling data, let’s analyze it:

$ perf report -G -n --stdio
  • -G uses the inverted call graph.
  • -n shows a column with the number of samples.

The perf report command reads the perf.data file and displays a summary of the profiling data.

[Figure: Perf Report]

Filtering Profiling Data

You might want to filter the profiling data based on specific symbols:

$ perf report -G -n --symbol-filter processOrder --stdio

This command filters the report so that it only shows symbols whose names match processOrder. The match is partial, so any symbol containing that string is included.

[Figure: Filtered Perf Report]

Annotating Source Code

Finally, you might want to annotate your source code with the profiling data:

$ perf annotate -l -s 'github.com/geseq/orderbook.(*OrderBook).AddOrder' --stdio --source
  • -l specifies that perf should include line numbers in the annotation.
  • -s specifies the symbol to annotate. In this case, it’s a Go function from the orderbook package.
  • --source tells perf to interleave source code with the assembly code in the annotation. This is the default behavior, so the flag is optional.

[Figure: Perf Annotate]

Flame Graphs

Flame graphs are a nice visualization trick allowing the most frequent code-paths to be identified quickly and accurately.

To create flame graphs, you need the FlameGraph tool suite.

$ git clone https://github.com/brendangregg/FlameGraph

Creating a flame graph from perf data takes a couple of quick steps, using the perf.data file recorded above:

  • Fold stacks: Use perf script to generate an unfolded stack file, then fold it with the stackcollapse-perf.pl script from the FlameGraph tool suite.

    $ perf script | <Path To FlameGraph>/stackcollapse-perf.pl > out.perf-folded

    This creates a new file out.perf-folded with folded stack traces.

  • Generate the Flame Graph: Now generate the flame graph with the flamegraph.pl script.

    $ <Path To FlameGraph>/flamegraph.pl out.perf-folded > perf-flamegraph.svg

    This creates a Flame Graph perf-flamegraph.svg from the folded stack traces.
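
If the folding step feels opaque, the idea is simple: identical stacks are merged into a single line of semicolon-separated frames followed by a count. Here is a simplified sketch of that idea. It is not a port of stackcollapse-perf.pl, which also handles things like the process name and perf script’s text format, and the function names in the sample data are placeholders.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// fold collapses stack samples into "root;...;leaf count" lines.
// Each sample lists its frames leaf-first, the order perf script prints them.
func fold(samples [][]string) []string {
	counts := map[string]int{}
	for _, frames := range samples {
		// Reverse to root-first order, as used by the folded format.
		rootFirst := make([]string, len(frames))
		for i, fr := range frames {
			rootFirst[len(frames)-1-i] = fr
		}
		counts[strings.Join(rootFirst, ";")]++
	}
	lines := make([]string, 0, len(counts))
	for stack, n := range counts {
		lines = append(lines, fmt.Sprintf("%s %d", stack, n))
	}
	sort.Strings(lines) // stable output regardless of map iteration order
	return lines
}

func main() {
	samples := [][]string{
		{"insert", "AddOrder", "main"},
		{"match", "AddOrder", "main"},
		{"insert", "AddOrder", "main"},
	}
	for _, line := range fold(samples) {
		fmt.Println(line)
	}
}
```

For these three samples the output is two lines: main;AddOrder;insert 2 and main;AddOrder;match 1.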

Now, open perf-flamegraph.svg in a web browser.

[Figure: Perf Flame Graphs]

Each bar in the graph represents a stack frame. The wider a bar, the more frequently it was observed in the profile. The top edge shows the function that was running when the sample was taken, and beneath it is the call stack that led there. The x-axis is not time: frames are sorted alphabetically so that identical frames merge. By default the colors are picked at random and carry no meaning.
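
The width of a bar can also be computed straight from the folded file, which is handy when you want numbers rather than a picture. The sketch below assumes the "frame;frame;frame count" format described above. widths is a name I picked for the example, and the input lines are made up.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// widths returns, for each function, the fraction of all samples whose stack
// contains that function. This corresponds to the combined width of the
// function's bars in the flame graph.
func widths(folded []string) (map[string]float64, error) {
	inclusive := map[string]int{}
	total := 0
	for _, line := range folded {
		i := strings.LastIndex(line, " ")
		if i < 0 {
			return nil, fmt.Errorf("malformed folded line: %q", line)
		}
		n, err := strconv.Atoi(line[i+1:])
		if err != nil {
			return nil, fmt.Errorf("bad count in %q: %w", line, err)
		}
		total += n
		seen := map[string]bool{}
		for _, fn := range strings.Split(line[:i], ";") {
			if !seen[fn] { // count a recursive function once per stack
				seen[fn] = true
				inclusive[fn] += n
			}
		}
	}
	out := make(map[string]float64, len(inclusive))
	for fn, n := range inclusive {
		out[fn] = float64(n) / float64(total)
	}
	return out, nil
}

func main() {
	folded := []string{"main;AddOrder;insert 3", "main;AddOrder;match 1"}
	w, err := widths(folded)
	if err != nil {
		panic(err)
	}
	for _, fn := range []string{"main", "AddOrder", "insert", "match"} {
		fmt.Printf("%-8s %5.1f%%\n", fn, 100*w[fn])
	}
}
```

With this input, main and AddOrder span the full width, insert takes three quarters and match one quarter.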

There you have it! A brief primer on using perf to profile your applications.