Zig's HashMap - Part 3
Previously On...
In Part 1 we saw how Zig's StringHashMap and AutoHashMap are wrappers around HashMap. HashMap works across different key types by requiring a Context. Here's the built-in context that StringHashMap uses:
pub const StringContext = struct {
    pub fn hash(_: StringContext, s: []const u8) u64 {
        return std.hash.Wyhash.hash(0, s);
    }

    pub fn eql(_: StringContext, a: []const u8, b: []const u8) bool {
        return std.mem.eql(u8, a, b);
    }
};
The Context is a simple interface which makes HashMap usable with any key: as long as you can provide hash and eql implementations, you can use any type of key.
Concurrency
The purpose of this post is not to make Zig's HashMap thread-safe, but let's consider a common trick used to support multiple concurrent writers: sharding. If you aren't familiar with this pattern, it involves wrapping multiple HashMap + lock pairs (often called "buckets" or "shards"). The wrapper then directs each operation to the bucket "responsible" for the key.
pub fn ConcurrentStringHashMap(comptime V: type) type {
    return struct {
        buckets: [4]Bucket(V),

        const Self = @This();

        // ...

        pub fn get(self: *Self, key: []const u8) ?V {
            const hash_code = std.hash.Wyhash.hash(0, key);
            // because we have 4 buckets
            const bucket_index = @mod(hash_code, 4);
            return self.buckets[bucket_index].get(key);
        }
    };
}
Where Bucket(V) is its own wrapper around a HashMap and an RwLock:
fn Bucket(comptime V: type) type {
    return struct {
        lock: std.Thread.RwLock,
        hash_map: std.StringHashMap(V),

        const Self = @This();

        // init, deinit, ...

        pub fn get(self: *Self, key: []const u8) ?V {
            self.lock.lockShared();
            defer self.lock.unlockShared();
            return self.hash_map.get(key);
        }
    };
}
But there's something that always annoys me about this implementation: every operation requires hashing the key twice. First we calculate the hash code in our outer get to determine the correct bucket; the second hashing happens inside StringContext.hash when the bucket's HashMap looks up the key. This gnaws at me; hash maps work because hash functions are deterministic. We already have the hash_code in our outer get, do we really need to calculate it again?
One option is to use a custom key and context for our Bucket:
const Key = struct {
    key: []const u8,
    hash_code: u64,

    const Context = struct {
        pub fn hash(_: Context, k: Key) u64 {
            return k.hash_code;
        }

        pub fn eql(_: Context, a: Key, b: Key) bool {
            return std.mem.eql(u8, a.key, b.key);
        }
    };
};
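For completeness, here's a sketch of how Bucket(V) could wire this custom Key and Context into its map. The generic std.HashMap (unlike StringHashMap) lets us name the key type, context and load factor explicitly; the field layout here is my assumption, not from the original code:

```zig
fn Bucket(comptime V: type) type {
    return struct {
        lock: std.Thread.RwLock,
        // key type, context and max load factor are all spelled out
        hash_map: std.HashMap(Key, V, Key.Context, std.hash_map.default_max_load_percentage),

        const Self = @This();

        pub fn get(self: *Self, key: Key) ?V {
            self.lock.lockShared();
            defer self.lock.unlockShared();
            // Key.Context.hash simply returns key.hash_code, so the
            // string itself is never hashed a second time
            return self.hash_map.get(key);
        }
    };
}
```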
Our get method becomes:
pub fn get(self: *Self, key: []const u8) ?V {
    const hash_code = std.hash.Wyhash.hash(0, key);
    const bucket_index = @mod(hash_code, 4);
    return self.buckets[bucket_index].get(.{ .key = key, .hash_code = hash_code });
}
This works, but at the cost of storing a larger key in our HashMap. While it's tempting to think this will just use a bit more RAM, the reality is that it could hurt performance by increasing CPU-cache misses. Is there a solution that doesn't require this trade-off?
Adapted Context
What we want is for our key to be just the string, []const u8, and for our Context to use our pre-generated hash code. Put another way, we want the Context to be independent of the key. This is possible through the *Adapted variants of the HashMap methods you're already using, such as getAdapted. The *Adapted variants take an explicit Context per call (much like the unmanaged HashMap variants take an explicit Allocator per call).
Our ConcurrentStringHashMap(V).get only used the calculated hash code to decide which bucket to use. Now we'll pass it to the bucket as well:
pub fn get(self: *Self, key: []const u8) ?V {
    const hash_code = std.hash.Wyhash.hash(0, key);
    const bucket_index = @mod(hash_code, 4);
    return self.buckets[bucket_index].get(key, hash_code);
}
Bucket(V).get can use this hash code, along with a custom context, to skip hashing the key a second time:
fn get(self: *Self, key: []const u8, hash_code: u64) ?V {
    self.lock.lockShared();
    defer self.lock.unlockShared();
    return self.hash_map.getAdapted(key, BucketContext{ .hash_code = hash_code });
}
With BucketContext being similar to the built-in StringContext:
const BucketContext = struct {
    hash_code: u64,

    // We no longer need to re-calculate the hash code!
    pub fn hash(self: BucketContext, _: []const u8) u64 {
        return self.hash_code;
    }

    // Same as StringContext.eql
    pub fn eql(_: BucketContext, a: []const u8, b: []const u8) bool {
        return std.mem.eql(u8, a, b);
    }
};
You could say that we're overriding the default (stateless) StringContext with our own stateful BucketContext on a per-call basis. In fact, many of the HashMap methods you use every day are thin wrappers around their *Adapted variants, with the default context supplied.
However, not every method has a direct *Adapted variant. For example, there's no putAdapted. This is likely because put is itself a wrapper around getOrPut, and there is a getOrPutAdapted. So using the *Adapted variants means that, sometimes, we'll need to dig a little deeper to find the right method. Here's our Bucket(V).put:
fn put(self: *Self, key: []const u8, hash_code: u64, value: V) !void {
    self.lock.lock();
    defer self.lock.unlock();

    const result = try self.hash_map.getOrPutAdapted(key, BucketContext{ .hash_code = hash_code });
    // getOrPutAdapted doesn't write the key for us; we set it ourselves
    result.key_ptr.* = key;
    result.value_ptr.* = value;
}
When we use map.get or map.put, the type of the key is the generic parameter of our HashMap (a []const u8 in the case of a StringHashMap). But for the *Adapted variants, both the key and the context are anytype. This isn't something we're utilizing here, but it provides enormous flexibility by decoupling not just the data, but also the types, of the Context with respect to the HashMap.
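To make that concrete, here's a hypothetical example (mine, not from the original code): a map keyed by a User struct that we can probe with nothing but a u64 id, because the adapted key and context don't have to match the map's own types:

```zig
const std = @import("std");

const User = struct {
    id: u64,
    name: []const u8,
};

// The map's regular context: a User is identified by its id
const UserContext = struct {
    pub fn hash(_: UserContext, u: User) u64 {
        return std.hash.Wyhash.hash(0, std.mem.asBytes(&u.id));
    }
    pub fn eql(_: UserContext, a: User, b: User) bool {
        return a.id == b.id;
    }
};

// An adapted context: the lookup key is a plain u64, not a User.
// Its hash must agree with UserContext.hash for the same id.
const UserIdContext = struct {
    pub fn hash(_: UserIdContext, id: u64) u64 {
        return std.hash.Wyhash.hash(0, std.mem.asBytes(&id));
    }
    pub fn eql(_: UserIdContext, id: u64, u: User) bool {
        return u.id == id;
    }
};

test "lookup by id only" {
    var map = std.HashMap(User, u32, UserContext, std.hash_map.default_max_load_percentage).init(std.testing.allocator);
    defer map.deinit();

    try map.put(.{ .id = 5, .name = "Teg" }, 100);

    // no User needs to be constructed for the lookup
    try std.testing.expectEqual(@as(?u32, 100), map.getAdapted(@as(u64, 5), UserIdContext{}));
}
```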
Rehashing
As items are inserted into a HashMap, it'll start to fill up. At some point (known as the load factor), the HashMap has to grow its underlying storage and re-hash all its entries. This can introduce challenges when using adapted contexts. For example, imagine we insert "a" and provide an adapted context of BucketContext{.hash_code = 2941419223392617777}. Next we insert "b" with its own adapted context of BucketContext{.hash_code = 16239767293555273624}. But when we try to insert "c", the HashMap needs to grow. At this point, our previously inserted entries with the keys "a" and "b" need to be re-hashed, but those explicitly provided adapted contexts are long gone.
In our specific case, where Bucket(V) is a wrapper for StringHashMap(V), this isn't actually a problem. Why? Because the re-hashing step will use the default context, aka the StringContext at the top of this post. For a given string, the hash method of StringContext produces the same result as what we're storing in the hash_code field. To be sure of this, instead of using std.hash.Wyhash.hash(0, key) directly, we should probably use std.hash_map.hashString(key). This would ensure that our hash is always consistent with StringContext.
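That swap is a one-liner in our outer get (a sketch):

```zig
pub fn get(self: *Self, key: []const u8) ?V {
    // hashString is what StringContext.hash uses under the hood, so the
    // hash_code we hand to the bucket always matches what a re-hash produces
    const hash_code = std.hash_map.hashString(key);
    const bucket_index = @mod(hash_code, 4);
    return self.buckets[bucket_index].get(key, hash_code);
}
```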
However, if you were doing something really unique in your adapted context, it could be a problem (don't ask me what kind of thing, because I can't think of any). All is not lost though: at the very end of this rabbit hole is getOrPutContextAdapted, which accepts two contexts! The first is the adapted context to use when looking up the key we want to get or put (just like the adapted context we passed to getOrPutAdapted). The second is the context to use when re-hashing.
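As a rough sketch (the exact parameter order is my assumption, based on recent versions of the standard library), a call on the unmanaged map carries both contexts:

```zig
// BucketContext is used for this one lookup; the trailing StringContext
// is what the map falls back to if the insertion forces it to grow and re-hash
const result = try self.hash_map.unmanaged.getOrPutContextAdapted(
    self.hash_map.allocator,
    key,
    BucketContext{ .hash_code = hash_code },
    std.hash_map.StringContext{},
);
```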
Conclusion
If you've tried to navigate the HashMap documentation, you know that Zig's HashMaps are built on abstractions. StringHashMap and AutoHashMap are fairly thin wrappers around HashMap, which is itself a wrapper around HashMapUnmanaged. Now we see that many of the standard methods, like get, are abstractions around the *Adapted variants, if not directly then through other layers of abstraction.
My initial encounter with Zig's various HashMaps wasn't great. I didn't understand them; I could hardly navigate the code. While there are still some things I don't understand, I'm starting to grasp the why. I understand why we have managed and unmanaged types, and why we have *Adapted variants.
You might never need to use adapted contexts, but it's good to know that they're available. It might seem unrelated, but for me, it's like positive/negative look-ahead/look-behind in regular expressions. I've used these once or twice, long ago, and now only retain a vague sense of how they can be useful - I certainly don't remember the syntax. But that vague awareness is enough to be useful. If I do run into a problem where adapted contexts or negative look-ahead might help, there's a good chance I'll recognize it, and can then refresh and deepen my understanding.