homedark

Zig's HashMap - Part 2

Jan 22, 2024

In part 1 we explored how the six HashMap variants relate to each other and what each offered to developers. We largely focused on defining and initializing HashMaps for various data type and utilizing custom hash and eql functions for types not supported by the StringHashMap or AutoHashMap. In this part we'll focus on keys and values, how they're stored, exposed and our responsibility with respect to their lifetime.

Loosely speaking, Zig's hash map are implemented using two slices: one for keys and one for values. The hash code returned from the hash function is used to find the ideal index of an entry within these arrays. To start simply, if we had this code:

var lookup = std.StringHashMap(i32).init(allocator);
defer lookup.deinit();

try lookup.put("Goku", 9001);
try lookup.put("Paul", 1234);

We could visualize our hash map like so:

keys:               values:
       --------          --------
       | Paul  |         | 1234 |     @mod(hash("Paul"), 5) == 0
       --------          --------
       |      |          |      |
       --------          --------
       |      |          |      |
       --------          --------
       | Goku |          | 9001 |    @mod(hash("Goku"), 5) == 3
       --------          --------
       |      |          |      |
       --------          --------

When we hash our keys and apply a modulo operation with the length of our array (5 above), we get the ideal index for the entry. I say "ideal" because our hash function can return the same hash code for two different keys; the chance of such collisions increase dramatically when we map the hash code to one of 5 available slots via @mod. But if we ignore possible collisions, this is a reasonable view of our hash map.

Once our hash map fills up to a certain point (in part 1, we briefly talked about the fill factor and mentioned that Zig defaults to 80%), it needs to grow to accommodate new values and to maintain the constant-time performance of lookups. Growing a hash map is like growing a dynamic array, we allocate a new array and copy the values from the original to the new (a simple algorithm would be making it 2x larger). For this to work with a hash map though, we can't simply copy the keys and values to the same index of the new slices. We need to re-calculate their index. Why? Because the location of an entry has to be consistent and predictable. We can't insert a key-value pair using one algorithm, e.g. @mod(hash("Goku"), 5), and expect to find it using a different one, e.g. @mod(hash("Goku"), 10) (notice the 5 changed to 10, because our array grew).

This basic visualization is going to serve as a foundation for much of this post. Further, the fact that entries can be moved from one backing array to another (i.e. when the hash map fills up and needs to grow) is also something we'll keep revisiting.

Values

If we extend the above snippet and call lookup.get("Paul"), the return value will be 1234. The return type of get is a ?i32, or more generically, ?V. The optional return type allows get to inform us that the key wasn't found. If you've looked through Zig's documentation, you've probably noticed another similar method: getPtr(key) with a slightly different return type: ?*V.

Since it's difficult to appreciate the difference between the two methods when talking about primitive types like our i32, consider this version which swaps our i32 value for a User:

const std = @import("std");

pub fn main() !void {
	var gpa = std.heap.GeneralPurposeAllocator(.{}){};
	const allocator = gpa.allocator();

	var lookup = std.StringHashMap(User).init(allocator);
	defer lookup.deinit();

	try lookup.put("Goku", .{
		.id = 9000,
		.name = "Goku",
		.super = false,
	});

	var user = lookup.get("Goku").?;

	user.super = true;
	std.debug.print("{any}\n", .{lookup.get("Goku").?.super});
}

const User = struct {
	id: i32,
	name: []const u8,
	super: bool,
};

Even though we set user.super = true, the value of the User in lookup is still false. This is because, in Zig, assignment are done by copy. If we keep the code as-is, but change lookup.get to lookup.getPtr, it'll work. We're still doing an assignment, thus still copying a value, but the value we're copying is the address of the User in our hash map, not the user itself.

getPtr allows us to get a reference to the value within the hash map. As we see above, this has behavioral significance; we can directly modify the value stored in our hash map. It also has a performance implication as copying large values can be expensive. But consider our above visualization and remember that, as the hash table fills up, values can be relocated. With that in mind, can you explain why this code crashes?:

const std = @import("std");

pub fn main() !void {
	var gpa = std.heap.GeneralPurposeAllocator(.{}){};
	const allocator = gpa.allocator();

	// change the type, just to make it easier to write this snippet
	// the same would happen with our above StringHashMap(User)
	var lookup = std.AutoHashMap(usize, usize).init(allocator);
	defer lookup.deinit();

	try lookup.put(100, 100);
	const first = lookup.getPtr(100).?;

	for (0..50) |i| {
		try lookup.put(i, i);
	}
	first.* = 200;
}

If the first.* = 200; syntax is a confusing, we're dereferencing the pointer and writing a value to it. Our pointer is the address of an index in our values array, so what this syntax does is write directly into our array. It crashes because our inserting-loop has forced the hash table to grow, causing the underlying key and values to be re-allocated and all keys and values to be moved. The pointer returned by getPtr is no longer valid. At the time of this writing, the default hash map size is 8 with a fill factor of 80%. If we looped 0..5 the code would work, but one more iteration (0..6) causes the growth and thus our crash. With typical usage, this issue isn't normally a problem; you won't hold a reference to an entry while modifying the hash map. But understanding that it can happen and understanding why it happens will help us better utilize other hash map features that return value and key pointers.

Going back to our User example, what if we changed the type of lookup from std.StringHashMap(User) to std.StringHashMap(*User)? The biggest impact will be with respect to the lifetime of our values. With our original std.StringHashMap(User), we could say that the lookup owns the values - the users we insert are embedded with the hash map's value array. This makes lifetime management easy, when we deinit our lookup the backing keys and values arrays are freed.

Using a *User breaks that ownership. Our hash map stores pointers, but it doesn't own what they point to. Despite the call to lookup.deinit, this code leaks the user:

const std = @import("std");

pub fn main() !void {
	var gpa = std.heap.GeneralPurposeAllocator(.{}){};
	const allocator = gpa.allocator();

	var lookup = std.StringHashMap(*User).init(allocator);
	defer lookup.deinit();

	const goku = try allocator.create(User);
	goku.* = .{
		.id = 9000,
		.name = "Goku",
		.super = false,
	};
	try lookup.put("Goku", goku);
}

const User = struct {
	id: i32,
	name: []const u8,
	super: bool,
};

Let's visualize it:

lookup
 ===============================
 ║  keys:       values:        ║
 ║  --------    -------------  ║
 ║  | Goku* |   | 1024d4000 | ----> -------------
 ║  --------    -------------  ║    |   9000    |
 ║  |       |   |           |  ║    -------------
 ║  --------    -------------  ║    | 1047300e4 |---> -----------------
 ===============================    -------------     | G | o | k | u |
                                    |    4      |     -----------------
                                    -------------
                                    |   false   |
                                    -------------

The double-lined box is our lookup, and represents the memory it owns and is responsible for. References we put in our hash map will point to values outside that box. This has a number of implications. Most importantly though, it means the lifetime of the values is detached from the lifetime of the hash map, and calling lookup.deinit won't free them.

There is a common case where we want to use pointers and associate the lifetime of the value with the hash map. Recall our crashing program, when a pointer to our hash map value became invalid. As I said, that isn't normally a problem, but in more advanced scenarios, you might want to have different parts of code referencing a value that also exists in a hash map. Let's re-examine the above visualization and think about what happens if our hash map grows and relocates the keys and values arrays:

lookup
 ===============================
 ║  keys:       values:        ║
 ║  --------    -------------  ║
 ║  |       |   |           |  ║
 ║  --------    -------------  ║
 ║  --------    -------------  ║
 ║  |       |   |           |  ║
 ║  --------    -------------  ║
 ║  --------    -------------  ║
 ║  | Goku* |   | 1024d4000 | ----> -------------
 ║  --------    -------------  ║    |   9000    |
 ║  |       |   |           |  ║    -------------
 ║  --------    -------------  ║    | 1047300e4 |---> -----------------
 ===============================    -------------     | G | o | k | u |
                                    |    4      |     -----------------
                                    -------------
                                    |   false   |
                                    -------------

The two arrays have grown, been reallocated and our entry indexes have been re-calculated, but our actual User still resides at the same place in the heap (memory location 1047300e4). Just like deinit doesn't alter anything outside our double-lined boxes, other changes such as growth, won't alter them.

Generally speaking, it'll be obvious if you should be storing values or pointer to values. This is largely because methods like getPtr make it possible to efficiently retrieve and modify values directly from our hash map. Either way, we can have our performance cake, so performance isn't the main consideration. What does matter is whether values need to outlive the hash map and/or whether references to values need to exist (and thus remain valid) while the hash map undergoes changes.

In cases where where the hash map and the referenced values should be linked, we need to iterate through the values and clean them up before we call lookup.deinit:

defer {
	var it = lookup.valueIterator();
	while (it.next()) |value_ptr| {
		allocator.destroy(value_ptr.*);
	}
	lookup.deinit();
}

If the dereferencing (value_ptr.*) doesn't seem right, go back to the visualization. Our valueIterator is giving us a pointer to the value in the array, and the value in the array is a *User. Thus, value_ptr is a **User.

Whether we're storing a User or a *User, any allocated fields within the value are always our responsibility. In a real application, your user's name wouldn't be string literals, they would be dynamically allocated. In that case, our while loop above would have to change to:

while (it.next()) |value_ptr| {
	const user = value_ptr.*;
	allocator.free(user.name);
	allocator.destroy(user);
}

Even if our value is a User, its fields are our responsibility (it's a bit silly to think lookup.deinit would know how/what needs to be freed anyways):

while (it.next()) |value_ptr|
	allocator.free(value_ptr.name);
}

In this last case, since we're storing User, our value_ptr is a *User (a pointer to a User, not a pointer to a pointer to a User as before).

Keys

We could start and end this section by saying: everything we said about values applies equally to keys. That is 100% true, but it's somehow less intuitive. Most developers quickly understand that a heap-allocated User stored within a hash map has its own lifetime and need to be explicitly managed/freed. But for some reason, it isn't quite so obvious with keys.

Like values, if our keys are primitive types we don't have to do anything special. A key such as an integer is stored directly within our hash map's key array and thus has its lifetime and memory tied to the hash map. This is a very common case. But another common case is using string keys with the std.StringHashMap. This often trips up developers new to Zig, but you need to guarantee that string keys are valid for as long as they're used by the hash map. And, if they're dynamically allocated, you need to make sure they're freed when no longer used. This means doing for keys exactly what we did for values.

Let's visualize our hash map again, but this time properly represent a string key:

lookup
 ===================================
 ║  keys:       values:            ║
 ║  -------------    ------------  ║
 ║  | 1047300e4 |   | 1024d4000 | ----> -------------
 ║  -------------   -------------  ║    |   9000    |
 ║  |           |   |           |  ║    -------------
 ║  -------------   -------------  ║    | 1047300e4 |---> -----------------
 ===================================    -------------     | G | o | k | u |
                                        |    4      |     -----------------
                                        -------------
                                        |   false   |
                                        -------------

In this example, our key is actually the user.name. Having the key be part of the value is pretty common. Here's what that might look like:

const user = try allocator.create(User);
user.* = . {
	.id = 9000,
	.super = false,
	// simulate a name that comes from a dynamic source, like a DB
	.name = try allocator.dupe(u8, "Goku"),
};
try lookup.put(user.name, user);

In this case, our previous cleanup code is sufficient since we were already freeing user.name which is our key:

defer {
	var it = lookup.iterator();
	while (it.next()) |value_ptr| {
		const user = value_ptr.*;
		allocator.free(user.name);
		allocator.destroy(user);
	}
	lookup.deinit();
}

But for cases where the key isn't part of the value, we need to iterate and free the keys. In many cases, you'll want to iterate both the keys and values and free both. We can simulate this by freeing the name as referenced by the key instead of the user:

defer {
	var it = lookup.iterator();
	while (it.next()) |kv| {
		// This..
		allocator.free(kv.key_ptr.*);

		// Is the same as the following, but only because user.name is our key
		// allocator.free(user.name);

		allocator.destroy(kv.value_ptr.*);
	}
	lookup.deinit();
}

Instead of iteratorValue() we're using iterator() to get access to both a key_ptr and value_ptr.

The last thing to consider is how to remove values from our lookup. This code leaks both the name/key and the heap-allocated user User despite using our improved cleanup logic:

var lookup = std.StringHashMap(*User).init(allocator);

defer {
	var it = lookup.iterator();
	while (it.next()) |kv| {
		allocator.free(kv.key_ptr.*);
		allocator.destroy(kv.value_ptr.*);
	}
	lookup.deinit();
}

const user = try allocator.create(User);
user.* = . {
	.id = 9000,
	.super = false,
	// simulate a name that comes from a dynamic source, like a DB
	.name = try allocator.dupe(u8, "Goku"),
};
try lookup.put(user.name, user);

// We added this!
_ = lookup.remove(user.name);

The last line removes the entry from our hash map, so our cleanup routine no longer iterates over it and doesn't free name or user. We need to use fetchRemove instead of remove to get the key and value that was removed:

if (lookup.fetchRemove(user.name)) |kv| {
	allocator.free(kv.key);
	allocator.destroy(kv.value);
}

fetchRemove doesn't return key and value pointers, it returns the actual key and value. That doesn't change how we use it, but it should be obvious why the key and value are returned as opposed to a pointer to the key and value. With the items removed from the hash map, there is no valid pointer to the keys and values within the hash map - they've been removed.

All of this assumes that your values and keys need to be freed/invalidated when removed from the hash map. There can be cases where the lifetimes of your values (and more rarely your keys) are completely unrelated to their presence within the hash map. In those cases, you need to free the memory when it makes sense for your application to do so. There's no general pattern/guidance that applies.

For most cases, when dealing with non primitive keys or values, the takeway is that when you call deinit on your hash map, any allocations you made for your keys and/or values won't be magically freed; you'll need to do that yourself.

getOrPut

While there are a number of implications to what we've talked about so far, for me, one of the best benefits that comes from exposing key and value pointers directly is the getOrPut method.

If I asked you to store named counters in a map in Go, or most languages, you'd end up with something like:

count, exists := counters[name]
if exists == false {
	counters[name] = 1
} else {
	counters[name] = count + 1;
}

This code requires two lookups. While we've been trained not to look beyond the fact that hash map access is O(1), the reality is that fewer operations are faster than more, calculating the hash code isn't the cheapest operation (and its performance varies depending on the key type and length), and collisions add overhead to the entire process. The getOrPut method solves this problem by returning a value pointer and a boolean indicating whether or not the value was found.

In other words, with getOrPut we either get a pointer to the found value, or we get a pointer to where the item should go. We also get a boolean indicating which of the two scenarios e have. sThis allows the above kind of upsert to be written with a single lookup:

const gop = try counters.getOrPut(name);
if (gop.found_existing) {
	gop.value_ptr.* += 1;
} else {
	gop.value_ptr.* = 1;
}

Of course, like any other value or key pointer, value_ptr should only be considered valid so long as no changes to the hash map is made. This, by the way, also applied to the iterated keys and values we get from iterator(), valueIterator and keyIterator(), for the same reason.

Conclusion

Hopefully you're now more comfortable with using std.HashMap, std.AutoHashMap and std.StringHashMap and, if appropriate, their unmanaged variants. While you might never have to provide your own context (hash and eql function), it's good to know that it's an option. For day to day programming, I find it immensely useful to visualize data, particularly when pointers are used and levels of indirection are added. Whenever I'm dealing with a value_ptr or key_ptr, I think of those two slices and the difference between the value or key and the address of the value or key in those slices.