High Concurrency LRU Caching

Oct 19, 2013

LRU Cache Introduction

The most common way of implementing an LRU cache is to use a hashtable for lookups and a linked list to track when items were used. You might want to read this introduction if you aren't familiar with the implementation.

There are two Aha! moments most people have writing one. The first is the realization that you need two different data structures. The second is realizing how you link them together. You see, when you GET or SET an item, you do so from the hashtable. However, you need to promote it in the list. Conversely, when you free up space, you do so from the list, but you need to delete it from the hashtable. The result tends to be a wrapper object that looks like:

type Item struct {
  key string              //hahstable key
  value Value             //actual value
  element *list.Element   //linked list node
}

Such an object gives you access to the linked list from the hashtable, or the hashtable from the linked list.

Concurrency

Supporting concurrent access to our cache is pretty simple. It's important to realize though that granular locks are key to achieving high throughput. If you had a single lock for the entire cache, you'd end up serializing access to it. That's not even worth exploring. As a first step, we can create a separate read-write mutex for our hashtable and our list.

Hashtable

A read-write mutex for the hashtable is efficient. Assuming that we are GETing more than we are SETting, we'll mostly be acquiring read-locks (which multiple threads can secure). The only time we need a write lock is when setting an item. If we keep things basic, it ends up looking like:

func (c *Cache) Get(key string) Value {
  c.lookupLock.RLock()
  item := c.lookup[key]
  c.lookupLock.RUnlock()
  if item == nil { return nil }
  return item.value
}

func (c *Cache) Set(key string, value Value)  {
  item := &Item{key: key, value: value,}
  c.lookupLock.Lock()
  defer c.lookupLock.Unlock()
  c.lookup[key] = item
}

If necessary, we could always shard our hashtable to support more write throughput.

List

Concurrent access of our list is a much bigger challenge. Why? Because every GET requires a write lock on our list. Promoting an item involves moving a node (or inserting one in the case of an initial set) at the head of the list:

func (c *Cache) Get(key string) Value {
  c.lookupLock.RLock()
  item := c.lookup[key]
  c.lookupLock.RUnlock()
  if item == nil { return nil }
  c.promote(item)
  return item.value
}

func (c *Cache) Promote(item *Item) {
  c.listLock.Lock()
  defer c.listLock.Unlock()
  c.list.MoveToFront(item.element)
}

What can we do?

Window

One of my favorite solutions is to use a window to limit how often you'll promote an item. In other words, if an item was recently promoted, it's probably near enough the front of our list that we don't need to re-promote it. Given a large enough cache (both in terms of total space and number of items), your window could be measured in minutes.

To achieve this we must first change our Item structure:

type Item struct {
  key string
  sync.Mutex             //synchronize changes to the item
  value Value
  promoted time.Time     //time item was last promoted
  element *list.Element
}

Now we can reduce the frequency of write locks on our list:

func (c *Cache) promote(item *Item) {
  now := time.Now()
  stale := now.Add(time.Minute * -2)

  item.Lock()
  defer item.Unlock()
  if item.promoted.Before(stale) {
    item.promoted = now
    c.listLock.Lock()
    defer c.listLock.Unlock()
    c.list.MoveToFront(item.element)
  }
}

Buffering

In addition to using a window, we can also promote in a separate dedicated thread. Go channels make this solution particularly clean:

//we haven't looked at instantiating the cache yet..
nc New() *Cache {
  c := &Cache{
    list: list.New(),
    lookup: make(map[string]*Item),
    promotables: make(chan *Item, 2048),
  }
  go c.promoter()
  return c
}

func (c *Cache) promote(item *Item) {
  //leaving out all the window code to keep it simple
  //but it absolutely works in addition to everything else
  //we'll explore
  c.promotable <- item
}

func (c *Cache) promoter() {
  for {
    item := <- c.promotables
    c.list.MoveToFront(item.element)
  }
}

Notice that the above code doesn't require a lock on the list. Synchronization is instead done through Go's channels, which gives us buffering for free.

But wait, we're cheating. What about freeing up memory? Well, if we want to keep everything simple and lock free, we can do it all in the promoter (and maybe rename it along the way):

func (c *Cache) worker() {
  ms := new(runtime.MemStats)
  for {
    item := <- c.promotables
    c.list.MoveToFront(item.element)
    runtime.ReadMemStats(ms)
    if ms.HeapAlloc > c.size {
      c.gc()
    }
  }
}

func (c *Cache) gc() {
  for i := 0; i < c.itemsToPrune; i++ {
    element := c.list.Back()
    if element == nil { return }
    item := element.Value.(*Item)
    c.bucket(item.key).Remove(item.key)
    c.list.Remove(element)
  }
}

I like this approach. However, we're leaning heavily on our channel's buffer. If the garbage collection takes long, our promotable channel will fill up, and our promote method will block.

A small improvement would be to make promote non blocking:

func (c *Cache) promote(item *Item) {
  //leaving out all the window code to keep it simple
  select {
    c.promotable <- item:
    default:
  }
}

If we're able to write to the channel, great. Otherwise, the item just won't get promoted. (Go's select is powerful, the default path will execute if all other paths are blocking).

Rather than failing to promote the latest item, we could even attempt to dump the oldest one:

func (c *Cache) promote(item *Item) {
  //leaving out all the window code to keep it simple
  select {
    c.promotable <- item:
    default:
      <-c.promotable  //free up a slot
      c.promote(item) //try to queue it back up
  }
}

But this isn't risk free (another goroutines could keep using up our freed slot!)

Intrusive List

What if you aren't comfortable running the promoter and garbage collector in the same thread, for fear of dropping promote requests? Are you back to a coarse lock across the list?

The beauty of a linked list is that it isn't an atomic data structure. When you're promoting an item to the head, you can also safely manipulate the tail, as long as they aren't the same item (or siblings of the item). In other words, synchronizing a linked list can happen at the node level.

(As an aside, a key benefit of a skiplist is exactly this. The ability to lock only those nodes which participate directly in a modification, makes it a concurrent friendly data structure.)

What we want to do is build a linked list with nodes that can be synchronized. Our Item structure already has a mutex (to synchronize access its promoted field). So the ideal solution is to build a linked list with Item as the element. This is called an intrusive linked list since the data and next/prev pointers sit within the same structure.

We're doing all of this so that we can lock a node at a time and manipulate it as needed, rather than using a single lock across the entire list. This isn't as simple as it might first seem (in fact, I won't provide a working solution). The problem is that, if we aren't careful, we'll create dealocks. How? Imagine a list with 26 items, labeld A-Z:

A<->B<->C<->D<->E<->...X<->Y<->Z

What happens if we want to promote Y while also freeing up memory? Our promoter will lock Y, and our GC will lock Z. Our promoter will then try to lock Z while our GC will try to lock Y. This is a deadlock. To make this work, in pseudocode, we need to do something like:

function lockFamily(node)
  lock node
  if try lock node.next
    if try lock node.prev
      //we can safely manipulate this chunk
    else
      unlock node.next
      unlock node
      lockFamily node  //try again
    end
  else
    unlock node
    lockFamily node
  end
end

Not impossible, but not fun either

Conclusion

Ultimately, our goal was to reduce lock contention against our list. We achieved this in three ways. First, we use a window to limit the frequency of promotion. Second, we use a buffered channel to process promotions in a separate thread. Finally, we can do promotion and GC within the same thread.

This is all preliminary work. The idea came to me while walking home last night, and has been quickly implemented as a prototype. If you're interested, and brave, you can check out the source code here. It takes care of some of the edge cases that I overlooked (such as promoting a new vs existing item).