Leanpub: Publish Early, Publish Often

Benchmarks

The Go testing package contains a benchmarking tool for examining the performance of our Go code. In this chapter, we will use the benchmark utility to progressively improve the performance of a piece of code. We will then discuss advanced benchmarking techniques to ensure that we are measuring the right thing.

A simple benchmark

Let’s suppose we have a simple function that computes the n^th Fibonacci number. The sequence \(F_{n}\) of Fibonacci numbers is defined by the recurrence relation, \(F_{n} = F_{n-1} + F_{n-2}\), with \(F_{0} = 0, F_{1} = 1\). That is, every number after the first two is the sum of the two preceding ones:

0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, ...

Because the sequence is recursively defined, a function that calculates the n^th Fibonacci number is often used to illustrate programming language recursion in computer science text books. Below is such a function that uses the definition to recursively calculate the n^th Fibonacci number.

A function that recursively obtains the n^th Fibonacci number

 1 package fibonacci
 2 
 3 // F returns the nth Fibonnaci number.
 4 func F(n int) int {
 5 	if n <= 0 {
 6 		return 0
 7 	} else if n == 1 {
 8 		return 1
 9 	}
10 	return F(n-1) + F(n-2)
11 }

Let’s make sure it works by writing a quick test, as we saw in the chapter on Testing.

A test for the function that recursively obtains the n^th Fibonacci number

 1 // fibonacci_test.go
 2 package fibonacci
 3 
 4 import "testing"
 5 
 6 func TestF(t *testing.T) {
 7 	cases := []struct {
 8 		n    int
 9 		want int
10 	}{
11 		{-1, 0},
12 		{0, 0},
13 		{1, 1},
14 		{2, 1},
15 		{3, 2},
16 		{8, 21},
17 	}
18 
19 	for _, tt := range cases {
20 		got := FastF(tt.n)
21 		if got != tt.want {
22 			t.Errorf("F(%d) = %d, want %d", tt.n, got, tt.want)
23 		}
24 	}
25 }

Running the test, we see that indeed, our function works as promised:

$ go test -v
=== RUN   TestF
--- PASS: TestF (0.00s)
PASS
ok      _/home/productiongo/benchmarks/fibonacci   0.001s

Now, this recursive Fibonacci function works, but we can do better. How much better? Before we rewrite this function, let’s establish a baseline to which we can compare our future efficiency improvements. Go provides a benchmark tool as part of the testing package. Anagalous to TestX(t *testing.T), we create benchmarks with BenchmarkX(b *testing.B):

A benchmark for the Fibonacci function

 1 // fibonacci_bench_test.go
 2 package fibonacci
 3 
 4 import "testing"
 5 
 6 var numbers = []int{
 7 	0, 10, 20, 30,
 8 }
 9 
10 func BenchmarkF(b *testing.B) {
11 	// run F(n) b.N times
12 	m := len(numbers)
13 	for n := 0; n < b.N; n++ {
14 		F(numbers[n%m])
15 	}
16 }

The BenchmarkF function can be saved in any file ending with _test.go to be included by the testing package. The only real surprise in the code is the for loop defined on line 13,

13 for n := 0; n < b.N; n++ {

The benchmark function must run the target code b.N times. During benchmark execution, b.N is adjusted until the benchmark function lasts long enough to be timed reliably.

To run the benchmark, we need to instruct go test to run benchmarks using the -bench flag. Similar to the -run command-line argument, -bench also accepts a regular expression to match the benchmark functions we want to run. To run all the benchmark functions, we provide -bench=.. go test will first run all the tests (or those matched by -run, if provided), and then run the benchmarks. The output for our benchmark above looks is as follows:

$ go test -bench=.
goos: linux
goarch: amd64
BenchmarkF-4        1000       1255534 ns/op
PASS
ok      _/home/productiongo/benchmarks/fibonacci 1.387s

The output tells us that the benchmarks ran on a Linux x86-64 environment. Furthermore, the testing package executed our one benchmark, BenchmarkF. It ran the b.N loop 1000 times, and each iteration (i.e. each call to F) lasted 1,255,534ns (~1.2ms) on average.

1.2ms per call seems a bit slow! Especially considering that the numbers we provided to the Fibonacci function were quite small. Let’s improve our original function by not using recursion.

An improved Fibonacci function

 1 package fibonacci
 2 
 3 // FastF returns the nth Fibonnaci number,
 4 // but does not use recursion.
 5 func FastF(n int) int {
 6 	var a, b int = 0, 1
 7 	for i := 0; i < n; i++ {
 8 		a, b = b, a+b
 9 	}
10 	return a
11 }

This new function FastF, is equivalent to the original, but uses only two variables and no recursion to calculate the final answer. Neat! Let’s check whether it’s actually any faster. We can do this by adding a new benchmark function for FastF:

func BenchmarkFastF(b *testing.B) {
    // run FastF(n) b.N times
    m := len(numbers)
    for n := 0; n < b.N; n++ {
        FastF(numbers[n%m])
    }
}

Again we run go test -bench=.. This time we will see the output of both benchmarks:

$ go test -bench=.
goos: linux
goarch: amd64
BenchmarkF-4            1000       1245008 ns/op
BenchmarkFastF-4    50000000            20.3 ns/op
PASS
ok      _/home/productiongo/benchmarks/fibonacci 2.444s

The output is telling us that F still took around 1245008ns per execution, but FastF took only 20.3ns! The benchmark proves that our non-recursive FastF is indeed orders of magnitude faster than the textbook recursive version, at least for the provided inputs.

Comparing benchmarks

The benchcmp tool parses the output of two go test -bench runs and compares the results.

To install, run:

go get golang.org/x/tools/cmd/benchcmp

Let’s output the benchmark for the original F function from earlier to a file, using BenchmarkF:

$ go test -bench . > old.txt

The file will look as follows:

goos: darwin
goarch: amd64
BenchmarkF-4        1000       1965113 ns/op
PASS
ok      _/Users/productiongo/benchmarks/benchcmp   2.173s

Now instead of implementing FastF, we copy the FastF logic into our original F function:

Fast F implementation

 1 package fibbonaci
 2 
 3 // F returns the nth Fibonnaci number.
 4 func F(n int) int {
 5 	var a, b int = 0, 1
 6 	for i := 0; i < n; i++ {
 7 		a, b = b, a+b
 8 	}
 9 	return a
10 }

and re-run the benchmark, outputting to a file called new.txt:

$ go test -bench . > new.txt

new.txt should look like this:

goos: darwin
goarch: amd64
BenchmarkF-4    50000000            25.0 ns/op
PASS
ok      _/Users/productiongo/benchmarks/benchcmp   1.289s

Now let’s run benchcmp on the results:

benchcmp old.txt new.txt  
benchmark        old ns/op     new ns/op     delta
BenchmarkF-4     1965113       25.0          -100.00%

We can see the old performance, new performance, and a delta. In this case, the new version of F performs so well that it reduced the runtime of the original by 99.9987%. Thus rounded to two decimals, we get a delta of -100.00%.

Resetting benchmark timers

We can reset the benchmark timer if we don’t want the overall benchmark timing to include the execution time of our setup code.

A benchmark from the crypto/aes package in the Go source code provides an example of this:

crypto/aes BenchmarkEncrypt

 1 package aes
 2 
 3 import "testing"
 4 
 5 func BenchmarkEncrypt(b *testing.B) {
 6 	tt := encryptTests[0]
 7 	c, err := NewCipher(tt.key)
 8 	if err != nil {
 9 		b.Fatal("NewCipher:", err)
10 	}
11 	out := make([]byte, len(tt.in))
12 	b.SetBytes(int64(len(out)))
13 	b.ResetTimer()
14 	for i := 0; i < b.N; i++ {
15 		c.Encrypt(out, tt.in)
16 	}
17 }

As we can see, there is some setup done in the benchmark, then a call to b.ResetTimer() to reset the benchmark time and memory allocation counters.

Benchmarking memory allocations

The Go benchmarking tools also allow us to output the number memory allocations by the benchmark, alongside the time taken by each iteration. We do this by adding the -benchmem flag. Let’s see what happens if we do this on our Fibonnaci benchmarks from before.

$ go test -bench=. -benchmem
goos: linux
goarch: amd64
BenchmarkF-4            1000       1241017 ns/op           0 B/op          0 \
allocs/op
BenchmarkFastF-4    100000000           20.6 ns/op         0 B/op          0 \
allocs/op
PASS
ok      _/Users/productiongo/benchmarks/fibonacci 3.453s

We now have two new columns on the right: the number of bytes per operation, and the number of heap allocations per operation. For our Fibonnaci functions, both of these are zero. Why is this? Let’s add the -gcflags=-m option to see the details. The output below is truncated to the first 10 lines:

$ go test -bench=. -benchmem -gcflags=-m
# _/home/herman/Dropbox/mastergo/manuscript/code/benchmarks/fibonacci
./fibonacci_bench_test.go:10:20: BenchmarkF b does not escape
./fibonacci_fast_bench_test.go:6:24: BenchmarkFastF b does not escape
./fibonacci_test.go:22:5: t.common escapes to heap
./fibonacci_test.go:6:15: leaking param: t
./fibonacci_test.go:22:38: tt.n escapes to heap
./fibonacci_test.go:22:38: got escapes to heap
./fibonacci_test.go:22:49: tt.want escapes to heap
./fibonacci_test.go:10:3: TestF []struct { n int; want int } literal does not\
 escape
./fibonacci_test.go:22:12: TestF ... argument does not escape
...

The Go compiler performs escape analysis. If an allocation does not escape the function, it can be stored on the stack. Variables placed on the stack avoid the costs involved with a heap allocation and the garbage collector. The omission of the fibonacci.go file from the output above implies that no variables from our F and FastF functions escaped to the heap. Let’s take another look at the FastF function to see why this is:

func FastF(n int) int {
    var a, b int = 0, 1
    for i := 0; i < n; i++ {
        a, b = b, a+b
    }
    return a
}

In this function, the a, b, and i variables are declared locally and do not need to be put onto the heap, because they are not used again when the function exits. Consider what would happen if, instead of storing only the last two values, we naively stored all values calculated up to n:

A high-memory implementation of F

 1 package fibonacci
 2 
 3 // FastHighMemF returns the nth Fibonacci number, but
 4 // stores the full slice of intermediate
 5 // results, consuming more memory than necessary.
 6 func FastHighMemF(n int) int {
 7 	if n <= 0 {
 8 		return 0
 9 	}
10 
11 	r := make([]int, n+1)
12 	r[0] = 0
13 	r[1] = 1
14 	for i := 2; i <= n; i++ {
15 		r[i] = r[i-1] + r[i-2]
16 	}
17 	return r[n]
18 }

Running the same test and benchmark from before on this high-memory version of F, we get:

$ go test -bench=. -benchmem
goos: linux
goarch: amd64
BenchmarkFastHighMemF-4     20000000            72.0 ns/op       132 B/op    \
      0 allocs/op
PASS
ok      _/Users/productiongo/benchmarks/fibonacci_mem 1.518s

This time our function used 132 bytes per operation, due to our use of a slice in the function. If you are wondering why the number is 132 specifically: the exact number of bytes is sensitive to the numbers we use in the benchmark. The higher the input n, the more memory the function will allocate. The average of the values used in the benchmark (0, 10, 20, 30) is 15. Because this was compiled for a 64-bit machine, each int will use 8 bytes (8x8=64 bits). The slice headers also use some bytes. We still have zero heap allocations per operation, due to all variables being contained within the function. We will discuss advanced memory profiling and optimization techniques in Optimization.

Modulo vs Bitwise-and

In our Fibonacci benchmarks so far, we have made use of a list of four integer test cases:

6 var nums = []int{
7     0, 10, 20, 30,
8 }

which we then loop over in the BenchmarkF function:

13     for n := 0; n < b.N; n++ {
14         FastF(nums[n%m])
15     }

But when it comes down to the nanoseconds, modulo is a relatively slow computation to do on every iteration. It can actually have an impact on the accuracy of our results! Let’s peek at the Go assembler code. Go allows us to do this with the go tool compile -S command, which outputs a pseudo-assembly language called ASM. In the command below, we filter the instructions for the line we are interested in with grep:

$ go tool compile -S fibonacci.go fibonacci_bench_test.go | grep "fibonacci_b\
ench_test.go:14"
    0x0036 00054 (fibonacci_bench_test.go:14)   MOVQ    (BX)(DX*8), AX
    0x003a 00058 (fibonacci_bench_test.go:14)   MOVQ    AX, (SP)
    0x003e 00062 (fibonacci_bench_test.go:14)   PCDATA  $0, $0
    0x003e 00062 (fibonacci_bench_test.go:14)   CALL    "".F(SB)
    0x0043 00067 (fibonacci_bench_test.go:14)   MOVQ    ""..autotmp_5+16(SP),\
 AX
    0x0061 00097 (fibonacci_bench_test.go:14)   MOVQ    "".numbers(SB), BX
    0x0068 00104 (fibonacci_bench_test.go:14)   MOVQ    "".numbers+8(SB), SI
    0x006f 00111 (fibonacci_bench_test.go:14)   TESTQ   CX, CX
    0x0072 00114 (fibonacci_bench_test.go:14)   JEQ 161
    0x0074 00116 (fibonacci_bench_test.go:14)   MOVQ    AX, DI
    0x0077 00119 (fibonacci_bench_test.go:14)   CMPQ    CX, $-1
    0x007b 00123 (fibonacci_bench_test.go:14)   JEQ 132
    0x007d 00125 (fibonacci_bench_test.go:14)   CQO
    0x007f 00127 (fibonacci_bench_test.go:14)   IDIVQ   CX
    0x0082 00130 (fibonacci_bench_test.go:14)   JMP 137
    0x0084 00132 (fibonacci_bench_test.go:14)   NEGQ    AX
    0x0087 00135 (fibonacci_bench_test.go:14)   XORL    DX, DX
    0x0089 00137 (fibonacci_bench_test.go:14)   CMPQ    DX, SI
    0x008c 00140 (fibonacci_bench_test.go:14)   JCS 49
    0x008e 00142 (fibonacci_bench_test.go:14)   JMP 154
    0x009a 00154 (fibonacci_bench_test.go:14)   PCDATA  $0, $1
    0x009a 00154 (fibonacci_bench_test.go:14)   CALL    runtime.panicindex(SB)
    0x009f 00159 (fibonacci_bench_test.go:14)   UNDEF
    0x00a1 00161 (fibonacci_bench_test.go:14)   PCDATA  $0, $1
    0x00a1 00161 (fibonacci_bench_test.go:14)   CALL    runtime.panicdivide(S\
B)
    0x00a6 00166 (fibonacci_bench_test.go:14)   UNDEF
    0x00a8 00168 (fibonacci_bench_test.go:14)   NOP

The details of this output are not as important as it is to notice how many instructions there are. Now, let’s rewrite the code to use bitwise-and (&) instead of modulo %:

12     m := len(nums)-1
13     for n := 0; n < b.N; n++ {
14         FastF(nums[n&m])
15     }

Now, the ASM code becomes:

$ go tool compile -S fibonacci.go fibonacci_bench_test.go | grep "fibonacci_b\
ench_test.go:14"        
    0x0032 00050 (fibonacci_bench_test.go:14)  MOVQ    (BX)(DI*8), AX
    0x0036 00054 (fibonacci_bench_test.go:14)   MOVQ    AX, (SP)
    0x003a 00058 (fibonacci_bench_test.go:14)   PCDATA  $0, $0
    0x003a 00058 (fibonacci_bench_test.go:14)   CALL    "".F(SB)
    0x003f 00063 (fibonacci_bench_test.go:14)   MOVQ    "".n+16(SP), AX
    0x005e 00094 (fibonacci_bench_test.go:14)   MOVQ    "".numbers(SB), BX
    0x0065 00101 (fibonacci_bench_test.go:14)   MOVQ    "".numbers+8(SB), SI
    0x006c 00108 (fibonacci_bench_test.go:14)   LEAQ    -1(AX), DI
    0x0070 00112 (fibonacci_bench_test.go:14)   ANDQ    CX, DI
    0x0073 00115 (fibonacci_bench_test.go:14)   CMPQ    DI, SI
    0x0076 00118 (fibonacci_bench_test.go:14)   JCS 45
    0x0078 00120 (fibonacci_bench_test.go:14)   JMP 132
    0x0084 00132 (fibonacci_bench_test.go:14)   PCDATA  $0, $1
    0x0084 00132 (fibonacci_bench_test.go:14)   CALL    runtime.panicindex(SB)
    0x0089 00137 (fibonacci_bench_test.go:14)   UNDEF
    0x008b 00139 (fibonacci_bench_test.go:14)   NOP

This is considerably shorter than before. In other words, the Go runtime will need to perform fewer operations, but the results will be the same. We can use modulo instead of ampersand because we have exactly four items in our nums slice. In general, \(n \% m == n \& (m - 1)\) if \(m\) is a power of two. For example,

0 % 4 = 0 & 3 = 0
1 % 4 = 1 & 3 = 1
2 % 4 = 2 & 3 = 2
3 % 4 = 3 & 3 = 3
4 % 4 = 4 & 3 = 0
5 % 4 = 5 & 3 = 1
6 % 4 = 6 & 3 = 2
...

If you are not yet convinced, expand the binary version of the bitwise-and operations to show that this is true:

0 & 3 = 00b & 11b = 0
1 & 3 = 01b & 11b = 1
2 & 3 = 10b & 11b = 2
3 & 3 = 11b & 11b = 3
...

To evaluate the impact of changing from modulo to ampersand on the benchmark results, let us create two benchmarks for FastF, one with modulo and the other with bitwise-and:

$ go test -bench=BenchmarkFastF
goos: linux
goarch: amd64
BenchmarkFastFModulo-4          100000000           16.7 ns/op
BenchmarkFastFBitwiseAnd-4      200000000            7.40 ns/op
PASS
ok      _/Users/productiongo/benchmarks/bench_bitwise_and 4.058s

The version using bitwise-and runs twice as fast. Our original benchmark was spending half the time recalculating modulo operations! This is unlikely to have a big impact on benchmarks of bigger functions, but when benchmarking small pieces of code, using bitwise-and instead of modulo will make the benchmark results more accurate.

Up next

Tooling