Regular Expressions in Zig
Apr 23, 2023
If you're looking to use regular expressions in Zig, you have limited choices. One option worth considering is POSIX's regex.h. For our purposes, we'll be using three functions of the library: regcomp, regexec and regfree. The first takes a regex_t and initializes it from a pattern, the second executes that regex_t against an input and the third frees resources internally allocated when compiling the pattern.
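Before moving to Zig, it may help to see the three calls together in plain C, since that is the API being wrapped. The following is a minimal sketch, not part of the original setup; the helper name matches_once is hypothetical.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: compile `pattern`, test it against `input`, then free.
 * Returns true only if the pattern compiles and matches somewhere in input. */
bool matches_once(const char *pattern, const char *input) {
    regex_t regex;

    /* regcomp initializes `regex`; a non-zero result means the pattern is invalid. */
    if (regcomp(&regex, pattern, 0) != 0) {
        return false;
    }

    /* regexec returns 0 on a match. With nmatch = 0 no offsets are reported. */
    bool matched = regexec(&regex, input, 0, NULL, 0) == 0;

    /* regfree releases whatever regcomp allocated internally. */
    regfree(&regex);
    return matched;
}
```

Calling matches_once("[ab]c", "ac") should report a match, while matches_once("[ab]c", "nope") should not. The rest of this post does the same three steps from Zig.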
Important Notice: regex.h internally allocates memory, so there's no way to fully manage memory using a Zig allocator.
Because of a known issue, Zig does not properly translate the regex_t structure, so we have to do a bit of work to initialize a value. We can use an Allocator's alignedAlloc to create a regex_t, but we need to know the size and alignment.
Create a regez.h file. I suggest placing it in lib/regez/regez.h of your project. For now, the content should be:
#include <regex.h>
#include <stdalign.h>
const size_t sizeof_regex_t = sizeof(regex_t);
const size_t alignof_regex_t = alignof(regex_t);
We've exposed the size and alignment of regex_t. This is all we need to create a regex_t with a Zig allocator:
const std = @import("std");
const re = @cImport(@cInclude("regez.h"));
const REGEX_T_SIZEOF = re.sizeof_regex_t;
const REGEX_T_ALIGNOF = re.alignof_regex_t;
pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    const allocator = gpa.allocator();
    // allocate raw bytes with the size and alignment that C reports for regex_t
    const slice = try allocator.alignedAlloc(u8, REGEX_T_ALIGNOF, REGEX_T_SIZEOF);
    defer allocator.free(slice);
    // reinterpret those bytes as a pointer to a single regex_t
    const regex: *re.regex_t = @ptrCast(slice.ptr);
    ...
}
We've only created (and freed) a regex_t; we haven't actually compiled a pattern yet, let alone made use of it. Before we do anything with our regex, you might be wondering how to run this Zig program with our custom regez.h.
If you're writing a script and relying on zig run FILE.zig, you can use zig run FILE.zig -Ilib/regez to add the lib/regez directory to the include search path. (The -I argument also works with zig test.) If you're using build.zig, you'll add step.addIncludePath("lib/regez"); where step is your test/exe step.
With that out of the way, there are two things left to do. The first is to compile a pattern:
if (re.regcomp(regex, "[ab]c", 0) != 0) {
    // the pattern is invalid, so regex must not be used
    return error.InvalidPattern;
}
defer re.regfree(regex);
The regcomp function takes our regex, the pattern to compile (which has to be a null-terminated string) and bitwise options. Here we pass no options (0). The available options are:
- REG_EXTENDED - Use Extended Regular Expressions.
- REG_ICASE - Ignore case in match.
- REG_NOSUB - Report only success/fail in regexec().
- REG_NEWLINE - Treat newlines as line boundaries: ^ and $ match at the start and end of each line, and . and non-matching bracket lists no longer match a newline.
So to enable extended regular expressions and ignore case, we'd do:
if (re.regcomp(regex, "[ab]c", re.REG_EXTENDED | re.REG_ICASE) != 0) {
    // the pattern is invalid, so regex must not be used
    return error.InvalidPattern;
}
defer re.regfree(regex);
Notice that we call re.regfree. This is on top of the deferred allocator.free that we already have. This is necessary because regcomp allocates its own memory.
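To get a feel for what the compile options change, here is a small C sketch that compiles the same pattern with different flags. The helper name matches_with is hypothetical, and the behaviors in the usage note follow from the POSIX descriptions above (in basic syntax, + is an ordinary character).

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: compile `pattern` with `cflags` and report whether
 * `input` matches. An invalid pattern is treated as "no match". */
bool matches_with(const char *pattern, int cflags, const char *input) {
    regex_t regex;
    if (regcomp(&regex, pattern, cflags) != 0) {
        return false;
    }
    bool matched = regexec(&regex, input, 0, NULL, 0) == 0;
    regfree(&regex);
    return matched;
}
```

With this helper, "[ab]c" only matches "AC" when REG_ICASE is passed, "ab+c" only matches "abbc" when REG_EXTENDED is passed, and "^b" only matches "a\nb" when REG_NEWLINE is passed.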
Finally, regexec lets us execute our regular expression against an input. We'll take this in two steps. The first thing we'll do is add a simple isMatch function in our regez.h. The full file now looks like:
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdalign.h>
const size_t sizeof_regex_t = sizeof(regex_t);
const size_t alignof_regex_t = alignof(regex_t);
bool isMatch(regex_t *re, char const *input) {
    // nmatch is 0, so regexec ignores pmatch and we can pass NULL
    return regexec(re, input, 0, NULL, 0) == 0;
}
We can use this from Zig with our regex variable and an input:
std.debug.print("{any}\n", .{re.isMatch(regex, "ac")});
std.debug.print("{any}\n", .{re.isMatch(regex, "nope")});
We can see from the above that regexec takes a regex_t *, a (null-terminated) input and 3 additional parameters. The 3rd parameter is the length of the 4th parameter. The 4th parameter is an array to store match information. The 5th and final parameter is a set of bitwise options (which we won't go over, as they're not generally useful).
What if we care about match information? We need to leverage the 3rd and 4th parameters of regexec. The 4th parameter is an array of regmatch_t. This type has two fields, rm_so and rm_eo, which identify the start offset and end offset of each match.
Let's change our pattern and look at an example:
// assumes: const print = std.debug.print;
if (re.regcomp(regex, "hello ?([[:alpha:]]*)", re.REG_EXTENDED | re.REG_ICASE) != 0) {
    print("Invalid Regular Expression\n", .{});
    return;
}
const input = "hello Teg!";
var matches: [5]re.regmatch_t = undefined;
if (re.regexec(regex, input, matches.len, &matches, 0) != 0) {
    // no match: the contents of matches are unspecified, so stop here
    return;
}
for (matches, 0..) |m, i| {
    // rm_so is -1 for entries that did not capture anything
    if (m.rm_so == -1) break;
    const start: usize = @intCast(m.rm_so);
    const end: usize = @intCast(m.rm_eo);
    print("matches[{d}] = {s}\n", .{ i, input[start..end] });
}
Our pattern is hello ?([[:alpha:]]*) and our input is hello Teg!. Therefore, the above code will print: matches[0] = hello Teg followed by matches[1] = Teg. The full matching input is always at matches[0]. You can tell from the above code that when rm_so == -1, we have no more matches.
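The same offset handling could also live on the C side, next to isMatch. Here is a sketch of that idea, assuming a hypothetical helper named copy_group and an arbitrary limit of 8 groups; neither is part of the regez.h shown above.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy capture group `group` of the first match into
 * `out` as a NUL-terminated string. Returns the number of bytes copied, or
 * -1 if there is no match, the group captured nothing, or `out` is too
 * small. Only groups 0..7 are supported in this sketch. */
int copy_group(const regex_t *re, const char *input, size_t group,
               char *out, size_t out_size) {
    regmatch_t m[8];
    if (group >= 8) return -1;

    /* On failure the contents of m are unspecified, so bail out early. */
    if (regexec(re, input, 8, m, 0) != 0) return -1;

    /* rm_so is -1 for groups that did not participate in the match. */
    if (m[group].rm_so == -1) return -1;

    size_t len = (size_t)(m[group].rm_eo - m[group].rm_so);
    if (len + 1 > out_size) return -1;

    memcpy(out, input + m[group].rm_so, len);
    out[len] = '\0';
    return (int)len;
}
```

With the pattern and input from above, group 0 would give hello Teg and group 1 would give Teg, while group 2 would return -1 because the pattern only has one capture group.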
That's pretty much all there is to it. It is worth going over the regex.h man pages. There's a 4th function, regerror, that takes the error code from regcomp or regexec and provides an error message.
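As a rough illustration of regerror, here is a C sketch that reports why a pattern failed to compile. The helper name compile_or_explain is hypothetical, and the exact message text depends on your libc, so none is shown.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: try to compile `pattern`. On failure, write a
 * human-readable message into `msg` and return the non-zero error code.
 * On success, return 0; the caller must later call regfree on `re`. */
int compile_or_explain(regex_t *re, const char *pattern, int cflags,
                       char *msg, size_t msg_size) {
    int rc = regcomp(re, pattern, cflags);
    if (rc != 0) {
        /* regerror writes at most msg_size bytes, including the NUL. */
        regerror(rc, re, msg, msg_size);
    }
    return rc;
}
```

A function like this could be added to regez.h and called from Zig with a stack buffer, which avoids having to translate regerror's signature by hand.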