homedark

Zig's HashMap - Part 3

Aug 07, 2024

Previously On...

In Part 1 we saw how Zig's StringHashMap and AutoHashMap are wrappers around a HashMap. HashMap works across different key types by requiring a Context. Here's the built-in context that StringHashMap uses:

pub const StringContext = struct {
  pub fn hash(_: StringContext, s: []const u8) u64 {
    return std.hash.Wyhash.hash(0, s);
  }
  pub fn eql(_: StringContext, a: []const u8, b: []const u8) bool {
    return std.mem.eql(u8, a, b);
  }
};

The Context is a simple interface which makes HashMap usable for any key. As long as you can provide a hash and eql implementation, you can use any type of key.

Concurrency

The purpose of this post is not to make Zig's HashMap thread-safe, but let's consider a common trick used to support multiple concurrent writers: using shards/buckets. If you aren't familiar with this pattern, it involves wrapping multiple HashMaps + Mutex (often called "Buckets" or "Shards"). The wrapper can then direct operations to the bucket "responsible" for the key.

pub fn ConcurrentStringHashMap(comptime V: type) type {
  return struct {
    buckets: [4]Bucket(V),

    // ...

    pub fn get(self: *Self, key: []const u8) ?V {
      const hash_code = std.hash.Wyhash.hash(0, key);
      // because we have 4 buckets
      const bucket_index = @mod(hash_code, 4);
      return self.buckets[bucket_index].get(key);
    }
  };
}

Where Bucket(V) is its own wrapper for a HashMap and RwLock:

fn Bucket(comptime V: type) type {
  return struct {
    lock: std.Thread.RwLock,
    hash_map: std.StringHashMap(V),

    // init, deinit, ...

    pub fn get(self: *Self, key: []const u8) ?V {
      self.lock.lockShared();
      defer self.lock.unlockShared();
      return self.hash_map.get(key);
    }
  };
}

But there's something that always annoys me about this implementation: every operation requires hashing the key twice. First we calculate the hash code in getIndex to determine the correct bucket. The second time takes place within StringContext.hash. This gnaws at me; hash maps work because hash functions are deterministic. We have the hash_code in our outer get, do we really need to calculate it again?

One option is to use a custom key and context for our Bucket:

const Key = struct {
  key: []const u8,
  hash_code: u64,

  const Context = struct {
    pub fn hash(_: Context, k: Key) u64 {
      return k.hash_code;
    }
    pub fn eql(_: Context, a: Key, b: Key) bool {
      return std.mem.eql(u8, a.key, b.key);
    }
  };
};

Our get method becomes:

pub fn get(self: *Self, key: []const u8) ?V {
  const hash_code = std.hash.Wyhash.hash(0, key);
  const bucket_index = @mod(hash_code, 4);
  return self.buckets[bucket_index].get(.{.key = key, .hash_code = hash_code});
}

This works, but at the cost of having to store a larger key in our HashMap. While it's tempting to think this'll just use a bit more RAM, the reality is that it could negatively impact performance by increasing CPU-cache misses. Is there a solution that doesn't require this trade off?

Adapted Context

What we want is for our key to be just the string, []const u8, and our Context to use our pre-generated hash code. At an extreme, we could say that we want our Context to be independent from our key. This is possible through *Adapted variants of the HashMap methods you're already using, such as getAdapted. The *Adapted variants, take an explicit Context per call (much like the Unmanged HashMap variants take an explicit Allocator per call).

Our ConcurrentStringHashMap(V).get only used the calculate hash code to decide which bucket to use. Now we'll pass this to the bucket as well:

pub fn get(self: *Self, key: []const u8) ?V {
  const hash_code = std.hash.Wyhash.hash(0, key);
  const bucket_index = @mod(hash_code, 4);
  return self.buckets[bucket_index].get(key, hash_code);
}

Bucket(V).get can use this hash code, along with a custom context in order to skip the duplicate key hashing:

fn get(self: *Self, key: []const u8, hash_code: u64) ?V {
  self.lock.lockShared();
  defer self.lock.unlockShared();
  return self.hash_map.getAdapted(key, BucketContext{.hash_code = hash_code});
}

With BucketContext being similar to the built-in StringContext:

const BucketContext = struct {
  hash_code: u64,

  // We no longer need to re-calcualte the hash code!
  pub fn hash(self: BucketContext, key: []const u8) u64 {
    return self.hash_code;
  }

  // Same as StringContext.eql
  pub fn eql(_: BucketContext, a: []const u8, b: []const u8) bool {
    return std.mem.eql(u8, a, b);
  }
};

You could say that we're overwriting the default (stateless) StringContext, with our own stateful BucketContext on a per-call basis. In fact, many of the HashMap method you use every day are thin wrappers around *Adapted variant, where the default context is used.

However, not every method has a direct *Adapted variant. For example, there's no putAdapted. This is likely because put is itself a wrapper around getOrPut and there is a getOrPutAdapted. So using the *Adapted variants means that, sometimes, we'll need to dig a little deeper to find the right method. Here's our Bucket(V).put:

fn put(self: *Self, key: []const u8, hash_code: u64, value: V) !void {
  self.lock.lock();
  defer self.lock.unlock();

  const result = try self.hash_map.getOrPutAdapted(key, BucketContext{.hash_code = hash_code});
  result.key_ptr.* = key;
  result.value_ptr.* = value;
}

When we use map.get or map.put, the type of the key is the generic parameter of our HashMap (a []const u8 in the case of a StringHashMap). But for the *Adapted variants, both the key and context are anytype. This isn't something we're utilizing here, but it provides enormous flexibility by decoupling not just the data, but also the types of the Context with respect to the HashMap.

Rehasing

As items are inserted into a HashMap, it'll start to fill up. At some point (known as the load factor), the HashMap has to grow its underlying storage and re-hash all its entries. This can introduce challenges when using adapted contexts. For example, imagine we insert "a" and provide an adapted context of BucketContext{.hash_code = 2941419223392617777}. Next we insert "b" with its own adapted context of BucketContext{.hash_code = 16239767293555273624}. But when we try to insert "c", the HashMap needs to grow. At this point, our previously inserted entries with the keys "a" and "b" need to be re-hashed, but those explicitly provided adapted contexts are long gone.

In our specific cases, where Bucket(V) is a wrapper for StringHashMap(V), this isn't actually a problem. Why? Because the re-hashing step will use the default context, aka the StringContext at the top of this post. For a given string, the hash method of StringContext produces the same result as what we're storing in the hash_code field. To be sure of this, instead of using std.hash.Wyhash.hash(0, key) directly, we should probably use std.hash_map.hashString(key). This would ensure that our hash is always consistent with StringContext.

However, if you were doing something really unique in your adapted context, it could be a problem (don't ask me what kind of thing, because I can't think of any). However, all is not lost, at the very end of this rabbit hole is a getOrPutContextAdapted which accept two adapted contexts! The first is the adapted context to use when looking up the key we want to get or put (just like the adapted context we passed to getOrPutAdapted). The additional one is the adapted context to use when re-hashing.

Conclusion

If you've tried to navigate the HashMap documentation, you know that Zig's HashMaps are built on abstractions. StringHashMap and AutoHashMap are fairly thin wrappers around HashMap, which itself is wrapper around HashMapUnmanaged. Now we see that many of the standard methods, like get are abstractions around *Adapted variants - if not directly then through other layers of abstractions.

My initial encounter with Zig's various HashMaps wasn't great. I didn't understand; I could hardly navigate the code. While there's still some things I don't understand, I'm starting to grasp the why. I understand why we have managed and unmanaged types and why we have *Adapted variants.

You might never needed to use adapted contexts, but it's good to know that they're available. It might seem unrelated, but for me, it's like positive/negative look-ahead/look-behind in regular expressions. I've used these once or twice, long ago, but now only retain a vague sense of how they can be useful - I certainly don't remember the syntax. But that vague awareness is enough to be useful. If I do run into a problem where adapted contexts or negative look-ahead might be useful, there's a good chance I'll be able to recognize it and then can refresh/deepen my understanding.