The State of Caching in Go

Since writing this post, we have built Ristretto: A High Performance Go cache. Read all about it here.



Every database system requires a smart cache. That is, a cache that keeps the most frequently or recently accessed items and has a configurable memory utilization limit.

Being a graph database, Dgraph can access thousands—even millions—of keys per query, depending upon the number of intermediate results. Accessing these via our key-value DB, Badger, results in disk seeks, which is something we want to reduce for performance reasons.

Typical access patterns follow a Zipfian distribution: a small set of keys is accessed far more frequently than the rest. We see this in Dgraph as well.
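
For a feel of what such a skewed workload looks like, here is a minimal sketch using Go's standard library; the s, v, and key-space parameters are purely illustrative, not the ones Dgraph sees in practice.

```go
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	// rand.NewZipf produces values following a Zipf distribution.
	// s > 1 controls the skew, v >= 1 shifts the curve, and 1<<20 is the
	// key space. All three are illustrative choices for this sketch.
	r := rand.New(rand.NewSource(1))
	zipf := rand.NewZipf(r, 1.1, 1.0, 1<<20)

	counts := make(map[uint64]int)
	for i := 0; i < 1000000; i++ {
		counts[zipf.Uint64()]++
	}
	// The most popular keys dominate the accesses.
	fmt.Println("hits for key 0:", counts[0])
	fmt.Println("hits for key 1000:", counts[1000])
}
```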

We have been extremely happy with our choice of using the Go language to build Dgraph. A lot can be said about why Go makes a great language to write back-end code in, but that is a post in itself. We would not swap out Go for any other existing language. Although Go is great, the Go ecosystem is still lacking.

My general complaints around Go stem from the lack of a library ecosystem around the language. Go is mature in the sense that it is stable and has delivered well on its core promise of fast compilation, execution, and utilization of machine cores. However, for a language built around concurrency, it still lacks an ecosystem of performant, concurrent libraries that scale well with the number of cores. A concurrent list or map [1] is largely left as an exercise to the end user. That would be just fine if this were a serially-executing language, but it feels like a glaring omission when everything is built around concurrency.

In particular, Go lacks a concurrent LRU (or LFU) cache that can scale well enough to serve as a process-global cache. In this blog post, I will take you through the workarounds that are typically advocated, including some that we have tried and learned from within Dgraph. Aman will then present the design, performance, and hit ratio comparisons for the popular existing cache implementations in the Go ecosystem.

I’ll start by listing the requirements for the cache, then go over the various approaches we could take to implement it and how each of them falls short.

Requirements for the cache

  1. Concurrent.
  2. Memory-bounded (limited to a configurable maximum memory usage).
  3. Scale well as the number of cores and goroutines increases.
  4. Scale well under a non-random key access distribution (e.g. Zipf).
  5. High cache hit ratio.

Use Go map with sync.Mutex

Using a Go map with a sync.Mutex (or sync.RWMutex) is the most commonly advocated approach to caching. This results in all the goroutines blocking on one lock, which can lead to contention. It also fails to keep tabs on memory usage, so it does not work as a memory-bounded cache by itself.
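
A minimal sketch of this approach (type and function names are just for illustration):

```go
package cache

import "sync"

// mutexCache is a minimal sketch of the mutex-guarded map approach.
// Every Get and Set contends on the same lock, and nothing bounds memory.
type mutexCache struct {
	mu sync.RWMutex
	m  map[string][]byte
}

func newMutexCache() *mutexCache {
	return &mutexCache{m: make(map[string][]byte)}
}

func (c *mutexCache) Get(key string) ([]byte, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.m[key]
	return v, ok
}

func (c *mutexCache) Set(key string, val []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[key] = val
}
```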

Fails on requirements 2-4.

Use lock striping with Go maps

This is the same concept as above, but the keys are split by a fingerprint into many smaller Go map shards, each protected by its own mutex lock (see here). Many developers incorrectly believe lock striping to be a great way to avoid contention, particularly when the number of shards is set to exceed the number of threads in the program (> GOMAXPROCS).
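
A sketch of the striped variant, with an illustrative shard count and an FNV fingerprint (not any particular library's implementation):

```go
package cache

import (
	"hash/fnv"
	"sync"
)

const numShards = 256 // typically chosen to exceed GOMAXPROCS

type shard struct {
	mu sync.RWMutex
	m  map[string][]byte
}

type shardedCache struct {
	shards [numShards]*shard
}

func newShardedCache() *shardedCache {
	sc := &shardedCache{}
	for i := range sc.shards {
		sc.shards[i] = &shard{m: make(map[string][]byte)}
	}
	return sc
}

// getShard fingerprints the key to pick a shard.
func (sc *shardedCache) getShard(key string) *shard {
	h := fnv.New32a()
	h.Write([]byte(key))
	return sc.shards[h.Sum32()%numShards]
}

func (sc *shardedCache) Get(key string) ([]byte, bool) {
	s := sc.getShard(key)
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.m[key]
	return v, ok
}

func (sc *shardedCache) Set(key string, val []byte) {
	s := sc.getShard(key)
	s.mu.Lock()
	s.m[key] = val
	s.mu.Unlock()
}
```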

In our initial attempts at building a simplified memory-bounded cache, we built this as well. To allow for releasing memory back to the OS, we would periodically pick a random shard and delete its map, allowing it to refill. This was a crude but simple technique which performed better than an LRU cache (explained below) but had many downsides.
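
The flushing trick looked roughly like this, reusing the shardedCache sketch above (the interval and the use of math/rand are illustrative, and the "math/rand" and "time" imports would need to be added):

```go
// flushLoop periodically picks a random shard and drops its map wholesale,
// letting it refill on demand. This extends the shardedCache sketch above.
func (sc *shardedCache) flushLoop(interval time.Duration) {
	for range time.Tick(interval) {
		s := sc.shards[rand.Intn(numShards)]
		s.mu.Lock()
		// The old map becomes garbage; Go reclaims and releases it lazily.
		s.m = make(map[string][]byte)
		s.mu.Unlock()
	}
}
```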

One, Go is slow to release memory back to the OS but fast to request it. As soon as a shard was emptied, goroutines trying to access the keys in that shard would start allocating memory before the previous memory had been fully released, causing a spike in memory usage and a rapid OOM crash.

Also, what we failed to realize at the time was that the access patterns are governed by Zipf’s law. The most frequently accessed keys still sit under a handful of locks, which therefore become the point of contention for all goroutines. The performance of this approach does not scale well with the number of cores.

Fails on requirements 2 and 4.

LRU cache

Go has a basic LRU cache implementation as part of groupcache. After our failed attempt at lock striping with map shards, we modified this LRU cache by introducing locks and making it concurrent. While this cache did solve the immediate issues around the memory spikes caused by frequent and consistent memory releases, we realized it would introduce contention.

This cache is also sized by the number of entries, not by the amount of memory they consume. Trying to estimate the memory usage of a complex data structure on the Go heap is prohibitively expensive and almost impossible to get right, something we realized after trying multiple mechanisms in vain. It got particularly hard because our data structures were changing after being placed in the cache (something we plan to avoid going forward).

But we did not comprehend how much contention this cache could cause. After running with it for over a year, we realized that the contention around the cache was so significant that removing it sped up our queries by 10x!

In this implementation, every read is a write, because it updates the element’s position in the recency linked list. Thus, all accesses wait on a single mutex lock. In addition, the critical section of an LRU is slower than that of a plain map and involves a lot of pointer dereferences, since it maintains both a map and a doubly-linked list (see code). Despite our efforts around lazy eviction, it still suffered from severe contention.
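
Roughly, the concurrent wrapper amounts to the following sketch built on groupcache’s lru package; our actual in-tree modification differed in detail, but the locking pattern is the point:

```go
package cache

import (
	"sync"

	"github.com/golang/groupcache/lru"
)

// lockedLRU is a sketch of a groupcache/lru cache made concurrent with a
// single mutex. Note that even Get needs the exclusive lock, because it
// moves the entry to the front of the recency list.
type lockedLRU struct {
	mu  sync.Mutex
	lru *lru.Cache
}

func newLockedLRU(maxEntries int) *lockedLRU {
	return &lockedLRU{lru: lru.New(maxEntries)}
}

func (c *lockedLRU) Get(key string) (interface{}, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.lru.Get(key)
}

func (c *lockedLRU) Set(key string, value interface{}) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.lru.Add(key, value)
}
```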

Fails on requirements 3 and 4.

Striped LRU cache

We did not bother trying this. From our experiments with striped map shards, we knew it would only be an incremental improvement and would not scale well. (Though, for the purpose of benchmarking caches for this article, we implemented a striped LRU cache as described below.)

Would fail on requirement 4.

Many other approaches aim at reducing the GC time spent on the map shards. GC time increases as the number of entries in the map increases. The reduction is achieved by allocating fewer, larger byte slices and storing many cache entries in each slice. This is an effective approach; we use it in Badger in multiple places (Skiplist, Table builder, etc.). Some of the popular caches in Go use this technique.
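
The idea, in sketch form: keep values in one large, append-only byte slice and store only integer offsets in a pointer-free map, so the GC has no per-entry pointers to chase. The encoding below is illustrative and skips eviction, collisions, and concurrency:

```go
package cache

import (
	"encoding/binary"
	"hash/fnv"
)

// offsetCache is a sketch of the "fewer, larger allocations" idea: entries
// live in one byte slice, and the map holds only uint32s, so the GC never
// scans per-entry pointers.
type offsetCache struct {
	buf     []byte
	offsets map[uint32]uint32 // key hash -> offset into buf
}

func newOffsetCache() *offsetCache {
	return &offsetCache{offsets: make(map[uint32]uint32)}
}

func hashKey(key string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(key))
	return h.Sum32()
}

func (c *offsetCache) Set(key string, val []byte) {
	off := uint32(len(c.buf))
	var lenBuf [4]byte
	binary.LittleEndian.PutUint32(lenBuf[:], uint32(len(val)))
	c.buf = append(c.buf, lenBuf[:]...) // length prefix
	c.buf = append(c.buf, val...)       // value bytes
	c.offsets[hashKey(key)] = off
}

func (c *offsetCache) Get(key string) ([]byte, bool) {
	off, ok := c.offsets[hashKey(key)]
	if !ok {
		return nil, false
	}
	n := binary.LittleEndian.Uint32(c.buf[off : off+4])
	return c.buf[off+4 : off+4+n], true
}
```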

BigCache

BigCache divides the data into shards based on the hash of the key. Each shard contains a map and a ring buffer. Whenever a new element is set, it is appended to the ring buffer of the corresponding shard, and the offset into the buffer is stored in the map. If the same key is set more than once, the previous entries in the buffer are marked invalid. If the buffer is too small, it is expanded until the maximum capacity is reached.

Each map key is a uint32 hash, and each value is a uint32 offset into the buffer where the value is stored along with its metadata. If there is a hash collision, BigCache discards the previous key and stores the current one in the map. Allocating fewer, larger buffers up front and using a map[uint32]uint32 is a great way to avoid paying the cost of GC sweeps.
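
For reference, basic BigCache usage looks roughly like this (the eviction window is an arbitrary choice for the sketch):

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/allegro/bigcache"
)

func main() {
	// DefaultConfig takes the eviction window; 10 minutes is arbitrary here.
	cache, err := bigcache.NewBigCache(bigcache.DefaultConfig(10 * time.Minute))
	if err != nil {
		log.Fatal(err)
	}

	if err := cache.Set("my-key", []byte("my-value")); err != nil {
		log.Fatal(err)
	}

	val, err := cache.Get("my-key")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(val))
}
```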

FreeCache

FreeCache divides the cache into 256 segments. Each segment contains 256 slots and a ring buffer to store the data. When a new key is added to the cache, a segment is identified using the lower eight bits of the hash of the key, and a slot is chosen using bits 9-16 of the hash. Dividing the data into slots reduces the search space when looking up a key in the cache.

The data is then appended to the ring buffer and the offset is stored in a sorted array. If the ring buffer doesn’t have enough space, eviction is performed within the segment, starting from the beginning of the ring buffer, using a modified LRU policy: an entry is deleted from the ring buffer if its last access time is earlier than the average access time of the segment. To find an entry on Get, a binary search is performed over the sorted array in the corresponding slot.
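
Basic FreeCache usage, for reference (the cache size and expiry values are arbitrary choices for the sketch):

```go
package main

import (
	"fmt"
	"log"

	"github.com/coocood/freecache"
)

func main() {
	// FreeCache is sized in bytes up front; 100 MB here is arbitrary.
	cache := freecache.NewCache(100 * 1024 * 1024)

	// Set takes an expiry in seconds; 0 means no expiry.
	if err := cache.Set([]byte("my-key"), []byte("my-value"), 0); err != nil {
		log.Fatal(err)
	}

	val, err := cache.Get([]byte("my-key"))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(val))
}
```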

GroupCache

GroupCache implements an exact LRU eviction policy using a linked list and a Go map. For a fair comparison, we implemented sharding logic with 256 shards on top of GroupCache.

Performance Comparison

To compare the performance of the various caches, we generated a Zipf-distributed workload and ran benchmarks on an n1-highcpu-32 machine. The first chart below compares the performance of the three cache libraries for a read-only workload.
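
The read benchmarks follow the usual Go pattern of hammering the cache from many goroutines via b.RunParallel. The sketch below is a simplified stand-in for our harness, with a hypothetical getter interface and illustrative Zipf parameters:

```go
package cache

import (
	"math/rand"
	"strconv"
	"testing"
)

// getter is a hypothetical common interface over the caches under test.
type getter interface {
	Get(key string) ([]byte, bool)
}

// benchmarkRead is a simplified sketch of a Zipf read benchmark, not our
// exact harness. Each goroutine draws keys from its own Zipf generator,
// and the cache is assumed to have been pre-populated.
func benchmarkRead(b *testing.B, c getter) {
	b.RunParallel(func(pb *testing.PB) {
		r := rand.New(rand.NewSource(rand.Int63()))
		zipf := rand.NewZipf(r, 1.1, 1.0, 1<<16)
		for pb.Next() {
			c.Get(strconv.FormatUint(zipf.Uint64(), 10))
		}
	})
}
```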

Read-Only

Read-Only Performance Benchmark

We can see that BigCache reads scale well, given that reads are lock-free. FreeCache and GroupCache reads are not lock-free and stop scaling beyond a point (around 20 concurrent accesses). (A higher value on the y-axis is better.)

Write-Only

Write-Only Performance Benchmark

For a write-only workload, all the libraries perform similarly, though FreeCache is slightly faster than the other two.

Read-Write (25% writes, 75% reads)

Read-Write Performance Benchmark

For a mixed workload containing 25% writes and 75% reads, BigCache is the only library that clearly scales well. However, its hit ratio is poor for a Zipf-distributed workload, as explained in the next section.

Hit Ratio Comparison

The hit ratios for the three caches are shown below. FreeCache comes pretty close to the LRU policy implemented by GroupCache. BigCache, however, doesn’t do well on a Zipf-distributed workload, for the following reasons:

  • BigCache doesn’t utilize the buffer efficiently and may end up storing multiple entries for the same key in the buffer.
  • BigCache doesn’t update entries on access (read), which results in the eviction of recently accessed keys.

| Cache Size (# of elements) | 10,000 | 100,000 | 1,000,000 | 10,000,000 |
|----------------------------|--------|---------|-----------|------------|
| BigCache                   | -      | 37%     | 52%       | 55%        |
| FreeCache                  | -      | 38%     | 55%       | 90%        |
| GroupCache                 | 29%    | 40%     | 54%       | 90%        |

So we can conclude that none of the cache libraries meets all the requirements.

GroupCache and FreeCache fail on requirement 4, whereas BigCache fails on requirement 5.

So, what are we left with?

Well, nothing really. We are not aware of a smart memory-bounded cache in Go that can meet the entire list of requirements. If you know of one, do let us know in the comments.

Meanwhile, we came across Caffeine, a Java library used by Cassandra, Finagle, and other database systems. It uses TinyLFU, a highly efficient cache admission policy, and various techniques to scale and perform well as the number of threads and cores grows, all while providing a close-to-optimal hit ratio. You can read more about how it works in this article.

Caffeine meets all five requirements I mentioned at the beginning, so we are looking into building a Go equivalent of Caffeine, something that can serve our needs and potentially fill the gap for a concurrent, performant, memory-bounded cache in the Go language. Let us know if you’d like to contribute or if you have something similar built already!

Acknowledgment

We want to thank Benjamin Manes for helping us replicate the performance benchmarks from Caffeine in Go (code available here). We would also like to thank Damian Gryski for providing the base framework for benchmarking cache hit ratios (available here), which we modified to suit our needs. He readily accepted our changes back into his GitHub repository.

Thanks for reading. If you liked this post, show us some love and give us a star on GitHub.


  1. Use of sync.Map is discouraged by the Go team for various reasons. One of them is that it does not shrink, even after elements are deleted from it. So, for practical purposes, I’m going to disregard sync.Map as a first-class data structure. Most users would fill the need with a mutex lock on top of native Go data structures. ↩︎