1 year ago

RidiculousRichard

Confusing results from Go benchmarking of function and goroutine call overhead

Out of curiosity, I am trying to understand the function call and goroutine call overhead in Go. I therefore wrote the benchmarks below, which produced the results shown after them. The result for BenchmarkNestedFunctions confuses me: it seems far too high, so I naturally assume I have done something wrong. I expected BenchmarkNestedFunctions to be slightly higher than BenchmarkNopFunc and very close to BenchmarkSplitNestedFunctions. Can anyone suggest what I am either not understanding or doing wrong?

package main

import (
    "testing"
)

// Intended to allow me to see the iteration overhead being used in the benchmarking
func BenchmarkTestLoop(b *testing.B) {
    for i := 0; i < b.N; i++ {
    }
}

//go:noinline
func nop() {
}

// Intended to allow me to see the overhead from making a do nothing function call which I hope is not being optimised out
func BenchmarkNopFunc(b *testing.B) {
    for i := 0; i < b.N; i++ {
        nop()
    }
}

// Intended to allow me to see the added cost from creating a channel, closing it and then reading from it
func BenchmarkChannelMakeCloseRead(b *testing.B) {
    for i := 0; i < b.N; i++ {
        done := make(chan struct{})
        close(done)
        _, _ = <-done
    }
}

//go:noinline
func nestedfunction(n int, done chan<- struct{}) {
    n--
    if n > 0 {
        nestedfunction(n, done)
    } else {
        close(done)
    }
}

// Intended to allow me to see the added cost of making 1 function call doing a set of channel operations for each call
func BenchmarkUnnestedFunctions(b *testing.B) {
    for i := 0; i < b.N; i++ {
        done := make(chan struct{})
        nestedfunction(1, done)
        _, _ = <-done
    }
}

// Intended to allow me to see the added cost of repeated nested calls and stack growth with an upper limit on the call depth to allow examination of a particular stack size
func BenchmarkNestedFunctions(b *testing.B) {
    // Max number of nested function calls to prevent excessive stack growth
    const max int = 200000
    if b.N > max {
        b.N = max
    }
    done := make(chan struct{})
    nestedfunction(b.N, done)
    _, _ = <-done
}

// Intended to allow me to see the added cost of repeated nested call with any stack reuse the runtime supports (presuming it doesn't free and the realloc the stack as it grows)
func BenchmarkSplitNestedFunctions(b *testing.B) {
    // Max number of nested function calls to prevent excessive stack growth
    const max int = 200000
    for i := 0; i < b.N; i += max {
        done := make(chan struct{})
        if (b.N - i) > max {
            nestedfunction(max, done)
        } else {
            nestedfunction(b.N-i, done)
        }
        _, _ = <-done
    }
}

//go:noinline
// Mirrors nestedfunction, but launches each level in a new goroutine
func nestedgoroutines(n int, done chan<- struct{}) {
    n--
    if n > 0 {
        go nestedgoroutines(n, done)
    } else {
        close(done)
    }
}
// Intended to allow me to see the added cost of spinning up a go routine to perform comparable useful work as the nested function calls
func BenchmarkNestedGoRoutines(b *testing.B) {
    done := make(chan struct{})
    go nestedgoroutines(b.N, done)
    _, _ = <-done
}

The benchmarks are invoked as follows:

$ go test -bench=. -benchmem -benchtime=200ms
goos: windows
goarch: amd64
pkg: golangbenchmarks
cpu: AMD Ryzen 9 3900X 12-Core Processor
BenchmarkTestLoop-24                    1000000000               0.2247 ns/op          0 B/op          0 allocs/op
BenchmarkNopFunc-24                     170787386                1.402 ns/op           0 B/op          0 allocs/op
BenchmarkChannelMakeCloseRead-24         3990243                52.72 ns/op           96 B/op          1 allocs/op
BenchmarkUnnestedFunctions-24            4791862                58.63 ns/op           96 B/op          1 allocs/op
BenchmarkNestedFunctions-24               200000                50.11 ns/op            0 B/op          0 allocs/op
BenchmarkSplitNestedFunctions-24        155160835                1.528 ns/op           0 B/op          0 allocs/op
BenchmarkNestedGoRoutines-24              636734               412.2 ns/op            24 B/op          1 allocs/op
PASS
ok      golangbenchmarks        1.700s

The BenchmarkTestLoop, BenchmarkNopFunc and BenchmarkSplitNestedFunctions results seem reasonably consistent with each other and make sense. BenchmarkSplitNestedFunctions does more work per benchmark operation than BenchmarkNopFunc on average, but not by much, because the expensive channel make/close/read sequence only happens about once every 200,000 benchmark operations.

Similarly, the BenchmarkChannelMakeCloseRead and BenchmarkUnnestedFunctions results seem consistent with each other: each BenchmarkUnnestedFunctions operation does only slightly more work than each BenchmarkChannelMakeCloseRead operation, namely a decrement and an if test, which could potentially cause a branch prediction failure. (I would have hoped the branch predictor could use the last branch result, but I don't know how complex the close implementation is; it may be overwhelming the branch history.)

However, BenchmarkNestedFunctions and BenchmarkSplitNestedFunctions are radically different, and I don't understand why. They should be similar, with the only intentional difference being any reuse of the grown stack, and I did not expect the stack growth cost to be nearly so high. (Or is that the explanation, and is it just coincidence that the result is so similar to the BenchmarkChannelMakeCloseRead result, making me think it is not actually doing what I thought it was?)
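For comparison, a variant that keeps a fixed recursion depth per benchmark operation, rather than assigning to b.N, might look like this (nestedfunction2, BenchmarkFixedDepthNested and the depth of 1000 are illustrative names and values, not from the code above):

```go
package main

import "testing"

//go:noinline
func nestedfunction2(n int, done chan<- struct{}) {
	n--
	if n > 0 {
		nestedfunction2(n, done)
	} else {
		close(done)
	}
}

// BenchmarkFixedDepthNested recurses to a fixed depth on every iteration,
// leaving b.N entirely under the control of the testing framework.
func BenchmarkFixedDepthNested(b *testing.B) {
	const depth = 1000 // illustrative depth; vary to probe different stack sizes
	for i := 0; i < b.N; i++ {
		done := make(chan struct{})
		nestedfunction2(depth, done)
		<-done
	}
}
```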

It should also be noted that the BenchmarkNestedFunctions result can occasionally take significantly different values; I have seen values in the range of 10 to 200 ns/op when running it repeatedly. It can also fail to report any ns/op time at all while still passing, and I have no idea what is going on there:

BenchmarkChannelMakeCloseRead-24         5724488                54.26 ns/op           96 B/op          1 allocs/op
BenchmarkUnnestedFunctions-24            3992061                57.49 ns/op           96 B/op          1 allocs/op
BenchmarkNestedFunctions-24               200000               0 B/op          0 allocs/op
BenchmarkNestedFunctions2-24            154956972                1.590 ns/op           0 B/op          0 allocs/op
BenchmarkNestedGoRoutines-24             1000000               342.1 ns/op            24 B/op          1 allocs/op

If anyone can point out my mistake in the benchmark or in my interpretation of the results, and explain what is really happening, that would be greatly appreciated.

Background info:

  1. Stack growth and function inlining: https://dave.cheney.net/2020/04/25/inlining-optimisations-in-go
  2. Stack growth limitations: https://dave.cheney.net/2013/06/02/why-is-a-goroutines-stack-infinite
  3. Golang stack structure: https://blog.cloudflare.com/how-stacks-are-handled-in-go/
  4. Branch prediction: https://en.wikipedia.org/wiki/Branch_predictor
  5. Top level 3900X architecture overview: https://www.techpowerup.com/review/amd-ryzen-9-3900x/3.html
  6. 3900X branch prediction history/buffer size 16/512/7k: https://www.techpowerup.com/review/amd-ryzen-9-3900x/images/arch3.jpg

Tags: go, benchmarking, callstack

0 Answers
