
Regular Expressions in Zig

Apr 23, 2023

If you're looking to use regular expressions in Zig, you have limited choices. One option worth considering is POSIX's regex.h. For our purposes, we'll use three functions from the library: regcomp, regexec and regfree. The first compiles a pattern into a regex_t, the second executes that regex_t against an input and the third frees the resources allocated internally when the pattern was compiled.
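For reference, here's roughly what that three-call lifecycle looks like in plain C, before Zig gets involved. This is only a minimal sketch: matches_once is a made-up helper name, not part of regex.h, and it passes no options to regcomp.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

// Hypothetical helper (not part of regex.h): compile `pattern`, test it
// against `input`, then release whatever regcomp allocated.
// Returns false when the pattern is invalid or the input does not match.
bool matches_once(const char *pattern, const char *input) {
  regex_t re;
  if (regcomp(&re, pattern, 0) != 0) {
    return false; // compilation failed, so there is nothing to free
  }
  // nmatch = 0 and pmatch = NULL: we only want a yes/no answer
  bool ok = regexec(&re, input, 0, NULL, 0) == 0;
  regfree(&re); // pairs with the successful regcomp above
  return ok;
}
```

Calling matches_once("[ab]c", "ac") should report a match, while matches_once("[ab]c", "nope") should not.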

Important Notice: regex.h internally allocates memory, so there's no way to fully manage memory using a Zig allocator.

Because of a known issue, Zig does not properly translate the regex_t structure, so we have to do a bit of work to initialize a value. We can use an Allocator's alignedAlloc to create a regex_t, but we need to know its size and alignment.

Create a regez.h file. I suggest placing it in lib/regez/regez.h of your project. For now, the content should be:

#include <regex.h>
#include <stdalign.h>

const size_t sizeof_regex_t = sizeof(regex_t);
const size_t alignof_regex_t = alignof(regex_t);

We've exposed the size and alignment of regex_t. This is all we need to create a regex_t with a Zig allocator:

const std = @import("std");
const re = @cImport(@cInclude("regez.h"));
const REGEX_T_SIZEOF = re.sizeof_regex_t;
const REGEX_T_ALIGNOF = re.alignof_regex_t;

pub fn main() !void {
  var gpa = std.heap.GeneralPurposeAllocator(.{}){};
  const allocator = gpa.allocator();

  // raw bytes with the size and alignment that C reports for regex_t
  const slice = try allocator.alignedAlloc(u8, REGEX_T_ALIGNOF, REGEX_T_SIZEOF);
  defer allocator.free(slice);
  // a single-item pointer, which is what regcomp and regexec expect
  const regex: *re.regex_t = @ptrCast(slice.ptr);

  ...
}

So far we've only created (and freed) a regex_t. We haven't actually compiled a pattern yet, let alone made use of it. Before we do anything with our regex, you might be wondering how to run this Zig program with our custom regez.h.

If you're writing a script and relying on zig run FILE.zig, you can use zig run FILE.zig -Ilib/regez to add the lib/regez directory to the include search path. (The -I argument also works with zig test.) If you're using build.zig, you'll add step.addIncludePath("lib/regez"); where step is your test/exe step.

With that out of the way, there are two things left to do. The first is to compile a pattern:

if (re.regcomp(regex, "[ab]c", 0) != 0) {
    // TODO: the pattern is invalid
}
defer re.regfree(regex); // IMPORTANT!!

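The regcomp function takes our regex, the pattern to compile (which has to be a null-terminated string) and a bitwise combination of options. Here we pass no options (0). The available options are REG_EXTENDED (use extended regular expression syntax), REG_ICASE (ignore case), REG_NOSUB (report only whether there was a match, not where) and REG_NEWLINE (treat newlines in the input as line boundaries).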

So to enable extended regular expressions and ignore case, we'd do:

if (re.regcomp(regex, "[ab]c", re.REG_EXTENDED | re.REG_ICASE) != 0) {
    // TODO: the pattern is invalid
}
defer re.regfree(regex); // IMPORTANT!!

Notice that we call re.regfree. This is on top of the deferred allocator.free that we already have. This is necessary because regcomp allocates its own memory.
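To get a feel for what the options change, here's a small C sketch that compiles the same pattern with different flags. The helper name matches_with_flags is made up for this example. It relies on the fact that + is an ordinary character in basic syntax and a repetition operator in extended syntax.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

// Hypothetical helper: compile `pattern` with the given regcomp options
// and report whether it matches anywhere in `input`.
bool matches_with_flags(const char *pattern, int cflags, const char *input) {
  regex_t re;
  if (regcomp(&re, pattern, cflags) != 0) {
    return false; // invalid pattern
  }
  bool ok = regexec(&re, input, 0, NULL, 0) == 0;
  regfree(&re);
  return ok;
}
```

With no options, "[ab]c" does not match "AC", but it does once REG_ICASE is set. Likewise "a+b" only matches "aab" when REG_EXTENDED is set, because in basic syntax the + is taken literally.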

Finally, regexec lets us execute our regular expression against an input. We'll take this in two steps. The first thing we'll do is add a simple isMatch function in our regez.h. The full file now looks like:

#include <regex.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdalign.h>

const size_t sizeof_regex_t = sizeof(regex_t);
const size_t alignof_regex_t = alignof(regex_t);

// We only need a yes/no answer, so we ask for no match information:
// the match count is 0 and the match array is NULL.
bool isMatch(regex_t *re, char const *input) {
  return regexec(re, input, 0, NULL, 0) == 0;
}

Which we can use from Zig with our regex variable and an input:

// prints true
std.debug.print("{any}\n", .{re.isMatch(regex, "ac")});

// prints false
std.debug.print("{any}\n", .{re.isMatch(regex, "nope")});

We can see from the above that regexec takes a regex_t *, a (null-terminated) input and 3 additional parameters. The 3rd parameter is the length of the 4th parameter. The 4th parameter is an array to store match information. The 5th and final parameter is a set of bitwise options (which we won't go over, as they're not generally useful).

What if we care about match information? We need to leverage the 3rd and 4th parameters of regexec. The 4th parameter is an array of regmatch_t. This type has two fields, rm_so and rm_eo, which identify the start offset and end offset of each match.

Let's change our pattern and look at an example:

if (re.regcomp(regex, "hello ?([[:alpha:]]*)", re.REG_EXTENDED | re.REG_ICASE) != 0) {
  std.debug.print("Invalid Regular Expression\n", .{});
  return;
}
defer re.regfree(regex);

const input = "hello Teg!";
var matches: [5]re.regmatch_t = undefined;
if (re.regexec(regex, input, matches.len, &matches, 0) != 0) {
  // TODO: no match; matches holds nothing useful, so stop here
  return;
}

for (matches, 0..) |m, i| {
  // unused entries have their offsets set to -1
  if (m.rm_so == -1) break;

  const start: usize = @intCast(m.rm_so);
  const end: usize = @intCast(m.rm_eo);
  std.debug.print("matches[{d}] = {s}\n", .{i, input[start..end]});
}

Our pattern is hello ?([[:alpha:]]*) and our input is hello Teg!. Therefore, the above code will print: matches[0] = hello Teg followed by matches[1] = Teg. The full matching input is always at matches[0]. You can tell from the above code that when rm_so == -1, we have no more matches.
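One caveat: a group that didn't take part in the match (an optional group, say) also reports -1, even when later groups did match. If that matters to you, the re_nsub field of a compiled regex_t gives the number of groups in the pattern. Here's a sketch in C of using it. The helper name match_offsets is made up for this example, and it always compiles with REG_EXTENDED.

```c
#include <regex.h>
#include <stddef.h>

// Hypothetical helper: run `pattern` (extended syntax) against `input` and
// store the offsets of the whole match and of each group in `out`.
// Returns how many entries of `out` are meaningful (1 + number of groups,
// capped at `cap`), or 0 if the pattern is invalid or nothing matched.
size_t match_offsets(const char *pattern, const char *input,
                     regmatch_t *out, size_t cap) {
  regex_t re;
  if (regcomp(&re, pattern, REG_EXTENDED) != 0) {
    return 0;
  }
  size_t n = 0;
  if (regexec(&re, input, cap, out, 0) == 0) {
    // re_nsub is the number of parenthesized groups in the pattern
    n = re.re_nsub + 1;
    if (n > cap) n = cap;
  }
  regfree(&re);
  return n;
}
```

For the pattern (a)?(b) and the input b, this returns 3: out[1] has rm_so == -1 because the first group was skipped, while out[2] still holds the offsets of the b. A loop that stops at the first -1 would miss that last group.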

That's pretty much all there is to it. It is worth going over the regex.h man pages. There's a 4th function, regerror, that will take the error code from regcomp and regexec and provide an error message.
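As a starting point for regerror, here's a minimal C sketch. The helper name compile_error is made up, and a version meant for real use would probably hand the compiled regex_t back to the caller instead of freeing it right away.

```c
#include <regex.h>
#include <stddef.h>

// Hypothetical helper: try to compile `pattern` with extended syntax.
// On failure, write a human-readable message into `buf` (at most `len`
// bytes, including the terminating null) and return the regcomp error code.
// On success, leave `buf` empty and return 0.
int compile_error(const char *pattern, char *buf, size_t len) {
  regex_t re;
  if (len > 0) buf[0] = '\0';
  int rc = regcomp(&re, pattern, REG_EXTENDED);
  if (rc != 0) {
    // regerror takes the error code, the regex_t it came from,
    // and a buffer to fill with the message
    regerror(rc, &re, buf, len);
    return rc;
  }
  regfree(&re); // only free after a successful compile
  return 0;
}
```

An unterminated bracket expression such as "[ab" should yield a non-zero code and a message, though the exact wording of the message depends on the C library.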