Sorting Strings in Zig

Dec 19, 2024

First, the code:

std.mem.sort([]const u8, values, {}, stringLessThan);

fn stringLessThan(_: void, lhs: []const u8, rhs: []const u8) bool {
    return std.mem.order(u8, lhs, rhs) == .lt;
}

std.mem.sort takes 4 arguments: the type of value we're sorting, the list of values to sort, an arbitrary context, and a function. The last argument, the function, is what determines how two values should be ordered with respect to each other. Above we're doing a byte-by-byte comparison of two strings. If we wanted to do an case insensitive comparison of ASCII values, we'd replace std.mem.order with std.ascii.orderIgnoreCase.

The third parameter is an application-specific context. This gets passed into our ordering function. It can be anything we want. Oftentimes, as above, we don't need a context, so we pass void ({}). However, imagine we wanted to boost certain values. In other words, we want to sort a list of strings, but have certain values, if present, appear at the front:

var boost = std.StringHashMap(u32).init(allocator);
try boost.put("Keemun", 100);
try boost.put("Silver Needle", 25);

std.mem.sort([]const u8, values, boost, stringLessThan);

fn stringLessThan(boost: std.StringHashMap(u32), lhs: []const u8, rhs: []const u8) bool {
    const lhs_boost = boost.get(lhs) orelse 0;
    const rhs_boost = boost.get(rhs) orelse 0;
    if (lhs_boost > rhs_boost)  {
        return true;
    }
    if (lhs_boost < rhs_boost)  {
        return false;
    }

    return std.mem.order(u8, lhs, rhs) == .lt;
}

This concept of an application-specific context is something you'll often see in Zig libraries (including the standard library). It fills the gap caused by not having easy closures. In the above code, we can't create a closure that captures boost; instead, we create a function and pass boost as an argument. Pretty simple.

You might see the above code written differently:

std.mem.sort([]const u8, values, {}, struct {
    fn lessThan(_: void, lhs: []const u8, rhs: []const u8) bool {
        return std.mem.order(u8, lhs, rhs) == .lt;
    }
}.lessThan);

This might look a bit like a closure, but it isn't. We still need to pass our context in the 3rd parameter of std.mem.sort and accept it as the first parameter of our custom lessThan function. The above code creates an anonymous structure. An anonymous structure is like any other structure, but instead of having an explicit name, we leave it up to the compiler. The compiler will generate something like blog.main__struct_6782. Without an explicit name, we can't really make use of it, except for where it's defined - which is what our above code is doing.

In addition to std.mem.sort, there's also std.mem.sortUnstable. The "stability" of a sort has to do with how equal values are treated. std.mem.sort is a stable sort, which means that if two values are equal, the original order is preserved. When using an unstable sort, there's no guarantee about how equal values are sorted with respect to each other.

For scalar values like strings and integers, sort stability probably doesn't matter. If you have "blue", "red", "green", "blue", you'll end up with "blue", "blue", "green", "red". There isn't anything to distinguish one "blue" from another, so whether or not the sorting function put the "blue" which was originally last at the front of the sorted array won't matter. But imagine you're sorting a User structure based on the user.name, in such cases, you might want to preserve the original value of equal users (but in my experience, even in these cases you probably won't care about it).

Both std.mem.sort and std.mem.sortUnstable are wrappers to functions found in the std.sort namespace. Specifically, std.mem.sort calls std.sort.block and std.mem.sortUnstable calls std.sort.pdq. The std.sort namespace has a few other sorting methods. Here's are basic results from sorting 1000 string values:

std.sort.block
8359 iterations   358870.91ns per iterations
worst: 989333ns   median: 357375ns    stddev: 26749.47ns

std.sort.pdq
11804 iterations  254196.41ns per iterations
worst: 308333ns   median: 253459ns    stddev: 3892.63ns

std.sort.heap
7115 iterations   421603.49ns per iterations
worst: 469083ns   median: 420542ns    stddev: 7155.60ns

std.sort.insertion
693 iterations    4326253.39ns per iterations
worst: 4465917ns  median: 4337208ns   stddev: 166829.84ns

Now, if we cut the list to only 5 values, we'll get different results, with the insertion sort being the fastest. The sorting algorithm you use will depend on the type and size of data you have, and, for some algorithms, how sorted the data already is - and let's not forget whether or not you need a stable sort. Unless you have specific reason not to, std.mem.sortUnstable (which just calls std.sort.pqd) is probably the best default to use.

Finally, you might notice the std.sort.asc and std.sort.desc functions. You can use them when you're trying to sort integers or floats: they are like our stringLessThan function, taking a void context, but use the less than operator (<) to compare two values:

std.mem.sort(i32, values, {}, std.sort.asc(i32));