Basic Micro-Optimizations Part 2
  Nov 28, 2014
  
Following yesterday's post on basic micro-optimizations, I wanted to look at a concrete example I recently came across. A couple weeks ago, I was profiling code for a typical HTTP API service written in Go. A significant amount of memory was being allocated for http.Header instances. Even if you aren't familiar with Go, the following code should be obvious:
var body = []byte("hello world!")
func fatHandler(writer http.ResponseWriter, req *http.Request) {
  header := writer.Header()
  header["Content-Length"] = []string{"12"}
  header["Cache-Control"] = []string{"public,max-age=30"}
  header["Content-Type"] = []string{"text/plain"}
  writer.WriteHeader(200)
  writer.Write(body)
}
Calling the above 1 million times takes 432ms and allocates 48MB of memory. On my machine, it required 753 garbage collections which took 175ms - a significant percentage of the total time.
What's the above code doing? Go's http.Header structure is a map[string][]string. In other words, it's a dictionary where the key is a string and the values are an array of strings. Why an array of strings? Because HTTP allows a key to have multiple values. Knowing this, think about what the above code is doing. First, it's allocating a map. From an API point of view, a map makes sense, but given that that 99% of all cases will require very few keys (<20), it's far from the most efficient solution. Next, for each key, it's allocating a dynamic array. I don't know what the starting size of the array is, but if it's anything other than 1, in most cases, that's a lot of wasted space.
There's little we can do about the fact that the built-in http.ResponseWriter requires a map, but can we reduce the memory footprint caused by the arrays and, if so, what impact would it have?
When I'm testing performance and memory alternatives, I always try to keep things focused. Isolate the components and data and do the simplest possible tests to see if it makes any sense. This was what I came up with: 
var (
  body = []byte("hello world!")
  headerbucket = []string{"", "", ""}
)
func slimHandler(writer http.ResponseWriter, req *http.Request) {
  headerbucket[0] = "12"
  headerbucket[1] = "public,age=30"
  headerbucket[2] = "text/plain"
  header := writer.Header()
  header["Content-Length"] = headerbucket[0:1]
  header["Cache-Control"] = headerbucket[1:2]
  header["Content-Type"] = headerbucket[2:3]
  writer.WriteHeader(200)
  writer.Write(body)
}
Sticking with our 1 million calls, the above code will allocate 2 999 999 fewer arrays than the original version. Result? 85KB (from 48MB), 188ms (from 432ms), 1GC (from 753) which required 0.35ms (from 175ms).
The above code isn't production ready. Specifically, we [probably] need to make it concurrent-friendly, but that's a solvable problem (using a pool of []string for example).
It's a significant difference and, I believe, it's the type of attention to detail that results in being able to handle tens of thousands of requests per second on a few machines rather than a few hundred. It makes the code more complicated and more fragile, but even if it doesn't make sense in most cases, it's always good to be familiar with possibilities.