Custom String Formatting and JSON [De]Serializing in Zig

Aug 23, 2024

In our last blog post, we saw how builtins like @hasDecl and functions like std.meta.hasMethod can be used to inspect a type to determine its capabilities. Zig's standard library makes use of these in a few places to allow developers to opt into specific behavior. In particular, both std.fmt and std.json let developers define functions that control how a type is formatted and JSON serialized/deserialized.

format

While Zig does a good job of printing out custom types, it can be useful/necessary to tweak that output. For example, you might want to exclude a specific field from the output. If you define a public format method on your struct, enum or union, Zig will call it rather than using the default formatter:

const std = @import("std");

pub fn main() !void {
  const u = User{.power = 9001};
  std.debug.print("{}\n", .{u});
}

const User = struct {
  power: u32,

  pub fn format(self: *const User, comptime fmt: []const u8, _: std.fmt.FormatOptions, writer: anytype) !void {
    if (fmt.len != 0) {
      std.fmt.invalidFmtError(fmt, self);
    }
    return writer.print("power level @ {d}!!!", .{self.power});
  }
};

As you can see, format takes 4 parameters: the value being formatted (your type), the format string, format options, and the writer. It's pretty common to ignore both the format string and format options. Above, we did use the format string as a simple guard, ensuring that our user variable was formatted using {} or {any}, and triggering a compile error if something like {s} was used. The real question we need to answer is: what is writer? It's generally understood to be an std.io.Writer, but by using anytype we automatically support anything that can fulfill our needs.

More specifically, the two methods that you're most likely to use are writer.print and writer.writeAll. We used print above, which is a simple yet powerful way to customize the string representation. We'd use writeAll if we wanted to directly output a []const u8. Of course, there's nothing stopping us from using both methods, as well as other io.Writer methods like writeByte and writeByteNTimes.
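As a sketch, here's what those other writer methods look like in practice, using a hypothetical Badge type (not from the original post):

```zig
const std = @import("std");

// Badge is a hypothetical type used to demonstrate the other writer methods
const Badge = struct {
  stars: u8,

  pub fn format(self: *const Badge, comptime fmt: []const u8, _: std.fmt.FormatOptions, writer: anytype) !void {
    _ = fmt;
    // writeAll emits the bytes verbatim; writeByteNTimes repeats one byte
    try writer.writeAll("badge: ");
    try writer.writeByteNTimes('*', self.stars);
  }
};

pub fn main() !void {
  // prints "badge: ***"
  std.debug.print("{}\n", .{Badge{.stars = 3}});
}
```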

jsonStringify

We can control the JSON serialization of a struct, union or enum by defining a jsonStringify method:

const User = struct {
  power: u32,

  pub fn jsonStringify(self: *const User, jws: anytype) !void {
    try jws.beginObject();
    try jws.objectField("power");
    try jws.write(self.power);
    try jws.endObject();
  }
};

The signature is simpler, with the writer assumed to be an std.json.WriteStream. Unlike the more generic io.Writer used in format, the WriteStream is designed specifically for JSON. In addition to beginObject and endObject, there are also beginArray and endArray. And unlike the write and writeAll methods found on the more generic io.Writer, the write method of the WriteStream is JSON aware. If we called jws.write() on a string value, the value would be quoted and escaped. If we called it on a structure, that structure would be JSON-encoded (either using Zig's default JSON encoder, or the structure's own jsonStringify method).
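For example, a sketch combining beginArray/endArray with the JSON-aware write, using a hypothetical Team type; write takes care of quoting and escaping each string:

```zig
const std = @import("std");

// Team is a hypothetical type used to demonstrate array serialization
const Team = struct {
  users: []const []const u8,

  pub fn jsonStringify(self: *const Team, jws: anytype) !void {
    try jws.beginObject();
    try jws.objectField("users");
    try jws.beginArray();
    for (self.users) |name| {
      // write quotes and escapes each string for us
      try jws.write(name);
    }
    try jws.endArray();
    try jws.endObject();
  }
};

pub fn main() !void {
  var gpa = std.heap.GeneralPurposeAllocator(.{}){};
  const allocator = gpa.allocator();

  const team = Team{.users = &.{"leto", "ghanima"}};
  const out = try std.json.stringifyAlloc(allocator, team, .{});
  defer allocator.free(out);

  // prints {"users":["leto","ghanima"]}
  std.debug.print("{s}\n", .{out});
}
```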

The WriteStream also has a print method. Like the io.Writer's print method we used in format, it takes a format string and optional parameters. print will apply the correct indentation (based on the options passed to stringify) but will not apply any additional JSON-specific formatting. So if we changed the code to:

const User = struct {
  name: []const u8,

  pub fn jsonStringify(self: *const User, jws: anytype) !void {
    try jws.beginObject();
    try jws.objectField("name");
    try jws.print("{s}", .{self.name});
    try jws.endObject();
  }
};

We'd likely end up generating invalid JSON, since the name value wouldn't be quoted or correctly escaped:

{"name":leto}

Care must be taken when using print, but it provides the greatest flexibility; for example, it's useful if you want to format numbers a specific way.
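As a sketch of the number case, here's a hypothetical Price type (stored as cents) that uses print to emit a fixed two-decimal value, which happens to be valid JSON:

```zig
const std = @import("std");

// Price is a hypothetical type; storing cents avoids floating point drift
const Price = struct {
  cents: u32,

  pub fn jsonStringify(self: *const Price, jws: anytype) !void {
    try jws.beginObject();
    try jws.objectField("amount");
    // print emits the bytes as-is, so we're responsible for producing
    // valid JSON; here, a number like 19.99
    try jws.print("{d}.{d:0>2}", .{self.cents / 100, self.cents % 100});
    try jws.endObject();
  }
};
```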

jsonParse

The counterpart to the jsonStringify method is the jsonParse function:

const std = @import("std");
const Allocator = std.mem.Allocator;

const User = struct {
  name: []const u8,

  pub fn jsonParse(allocator: Allocator, source: anytype, options: std.json.ParseOptions) !User {
    // ...
  }
};

The source is assumed to be either a std.json.Scanner or a std.json.Reader. These both expose the same methods, so, from our point of view, it doesn't really matter which it is. Unfortunately, implementing a custom jsonParse function is a lot more complicated than implementing a custom format or jsonStringify method. This is because the scanner is low-level and reads a token at a time. For example, a naive implementation for the above User with the name field looks like:

pub fn jsonParse(allocator: Allocator, source: anytype, options: std.json.ParseOptions) !User {
  if (try source.next() != .object_begin) {
    return error.UnexpectedToken;
  }

  var name: []const u8 = undefined;

  switch (try source.nextAlloc(allocator, .alloc_if_needed)) {
    .string, .allocated_string => |field| {
      if (std.mem.eql(u8, field, "name") == false) {
        return error.UnknownField;
      }
    },
    else => return error.UnexpectedToken,
  }

  switch (try source.nextAlloc(allocator, options.allocate.?)) {
    .string, .allocated_string => |value| name = value,
    else => return error.UnexpectedToken,
  }

  if (try source.next() != .object_end) {
    return error.UnexpectedToken;
  }

  return .{.name = name};
}

Of course, if we were to add more fields, keeping in mind that we should generally allow JSON objects to have fields in any order, the code would get much more complicated. We should also consider the options.ignore_unknown_fields value to determine whether to error on, or ignore, an unknown field.
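One possible shape for that, sketched here with an added power field and default field values (both my additions, not from the naive version above), is a loop that dispatches on each field name until .object_end:

```zig
const std = @import("std");
const Allocator = std.mem.Allocator;

const User = struct {
  name: []const u8 = "",
  power: u32 = 0,

  pub fn jsonParse(allocator: Allocator, source: anytype, options: std.json.ParseOptions) !User {
    if (try source.next() != .object_begin) {
      return error.UnexpectedToken;
    }

    var user = User{};
    while (true) {
      // each iteration reads a field name, or the end of the object
      switch (try source.nextAlloc(allocator, .alloc_if_needed)) {
        .object_end => break,
        .string, .allocated_string => |field| {
          if (std.mem.eql(u8, field, "name")) {
            switch (try source.nextAlloc(allocator, options.allocate.?)) {
              .string, .allocated_string => |value| user.name = value,
              else => return error.UnexpectedToken,
            }
          } else if (std.mem.eql(u8, field, "power")) {
            // numbers are also exposed as text, split into
            // .number / .allocated_number variants
            switch (try source.nextAlloc(allocator, .alloc_if_needed)) {
              .number, .allocated_number => |value| user.power = try std.fmt.parseInt(u32, value, 10),
              else => return error.UnexpectedToken,
            }
          } else if (options.ignore_unknown_fields) {
            try source.skipValue();
          } else {
            return error.UnknownField;
          }
        },
        else => return error.UnexpectedToken,
      }
    }
    return user;
  }
};
```

When invoked through a function like std.json.parseFromSlice, the allocator passed in is an arena owned by the returned result, so anything duped above is freed when the result is deinitialized.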

You can probably tell from the above code, but the type returned by next and nextAlloc is a tagged union. Specifically, a std.json.Token. Notice that when we're looking for a string (for the field name and field value), we match against both a .string and .allocated_string. Also notice that when we're reading the field, we use nextAlloc with .alloc_if_needed, but when reading the value, we're passing options.allocate.?. Why is this so complicated?

The json Scanner, Reader and Token are all designed to work with both a generic io.Reader and a string input. When dealing with a generic io.Reader, a 4K buffer is used. This has two implications. First, when the buffer fills up, it's reset and reused for the next 4K of data, invalidating any old references. Second, both field names and field values can get split across multiple reads. For this reason, we need to tell nextAlloc how to proceed: should it always create a duplicate of the value, or should it only do so when necessary? There's no "never allocate" option because of the second case mentioned above: when a field name or value is spread across multiple fills of the buffer, an allocation is required to create a single coherent value.

We use .alloc_if_needed for the field name because that value is only needed until the next call to next or nextAlloc: we just need the field name to compare it against our expected "name". Hopefully the full field name is inside the buffer, meaning the scanner won't have to allocate anything. If it doesn't have to allocate, it'll return a .string; if it does, it'll return an .allocated_string. For our jsonParse it doesn't matter which it is, we just need the value. But in some cases, you might care about the difference.

options.allocate.? is the option given to the std.json.parseXYZ function. We use that for the field value, letting the caller decide whether or not to always dupe string values. When options.allocate isn't explicitly set, the default depends on which parse function is used and whether a Scanner or Reader is provided.
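As a sketch, setting that option explicitly from the calling side looks like this (if I'm reading the std.json source correctly, parsing from a slice otherwise defaults to .alloc_if_needed, in which case parsed strings may point back into the input):

```zig
const std = @import("std");

pub fn main() !void {
  var gpa = std.heap.GeneralPurposeAllocator(.{}){};
  const allocator = gpa.allocator();

  // .alloc_always dupes every string into the parse arena, so the
  // result stays valid even if the input buffer is freed or reused
  const parsed = try std.json.parseFromSlice(
    struct {name: []const u8},
    allocator,
    "{\"name\": \"leto\"}",
    .{.allocate = .alloc_always},
  );
  defer parsed.deinit();

  // prints "leto"
  std.debug.print("{s}\n", .{parsed.value.name});
}
```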

The documentation for std.json.Token isn't the clearest, but it does help explain some of this and is probably required reading if you plan on writing your own jsonParse.

Conclusion

I don't think anyone has ever claimed that Zig's documentation is best-in-class. I think these three methods are particularly good examples of the documentation's shortcomings. It isn't just that the behaviors aren't easily discoverable; the use of anytype, while providing more flexibility, also harms understandability.

Still, both format and jsonStringify are straightforward once you've seen an example or two. And they both provide a flexible and expressive API. For jsonParse, you'll almost certainly need/want to write some helpers to deal with the low-level API which is exposed.