
Zig's std.json.parseFromSlice and std.json.Parsed(T)

Jun 18, 2024

In Zig's Discord server, I see a steady stream of developers new to Zig struggling with parsing JSON. I like helping with this problem because you can learn a lot about Zig through it. A typical, but incorrect, first attempt looks something like:

const std = @import("std");
const Allocator = std.mem.Allocator;

const Config = struct {
  db_path: []const u8,
};

pub fn main() !void {
  var gpa = std.heap.GeneralPurposeAllocator(.{}){};
  defer _ = gpa.deinit();
  const allocator = gpa.allocator();

  const config = try parseConfig(allocator, "config.json");

  std.debug.print("db path: {s}\n", .{config.db_path});
}

fn parseConfig(allocator: Allocator, path: []const u8) !Config {
  const data = try std.fs.cwd().readFileAlloc(allocator, path, 4096);
  defer allocator.free(data);

  const parsed = try std.json.parseFromSlice(Config, allocator, data, .{});
  return parsed.value;
}

This code has two bugs: a dangling pointer and a memory leak. Because using a dangling pointer is undefined behavior, this code may or may not crash on the last line of main (which tries to print db_path), but it will definitely report a memory leak.

Dangling Pointer

First we'll fix the dangling pointer. The parseFromSlice function takes our JSON input (data) and tries to parse it into a Config. Our JSON input comes from reading the file. Thanks to the call to defer allocator.free(data), this data is freed when parseConfig exits. This is the source of our bug. By default, parseFromSlice uses references to the underlying JSON input. So when data is freed, those references are no longer valid.

The last parameter to parseFromSlice is a ParseOptions. It controls various aspects of parsing. The option we care about is allocate, which for parseFromSlice defaults to .alloc_if_needed. We need to pass .{.allocate = .alloc_always} to fix our dangling pointer:

... = try std.json.parseFromSlice(Config, allocator, data, .{
      .allocate = .alloc_always,
  });

We could consider this fixed and move on to our memory leak, but let's go a bit deeper. The ability to reference the input data is an optimization. When the input outlives the parsed value, it makes sense to simply reference the existing input. To see this in action, we could change our code so that data outlives our parsed value and not use the alloc_always setting:

pub fn main() !void {
  var gpa = std.heap.GeneralPurposeAllocator(.{}){};
  defer _ = gpa.deinit();
  const allocator = gpa.allocator();

  const data = try std.fs.cwd().readFileAlloc(allocator, "config.json", 4096);
  defer allocator.free(data);

  const parsed = try std.json.parseFromSlice(Config, allocator, data, .{});
  std.debug.print("db path: {s}\n", .{parsed.value.db_path});
}

This version doesn't have a dangling pointer and doesn't have to copy the string values of our JSON input to populate the fields of Config. Because data outlives parsed, references from parsed.value to data remain valid (someone could still mutate data, but that's another problem). This version still has the same memory leak though (we'll get to that soon).

The allocate option has two possible values. The default is alloc_if_needed and the other, the one we used to fix the code, is alloc_always. You could be forgiven for thinking that our code should have worked with the default, alloc_if_needed. After all, our code needed the string value duplicated since our parsed Config outlives the input, right? But the "if needed" part of alloc_if_needed refers to the parser itself: allocate if the JSON parser needs it. If you think about it, this makes sense. There's no way for parseFromSlice to know that the parsed value outlives the JSON input. I think this option would be less confusing as a boolean called dupe_strings.
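To make the difference concrete, here's a minimal sketch that checks where the parsed string actually lives. The pointsInto helper and the sample JSON are made up for illustration, and I use page_allocator only to keep the example short. The input deliberately contains no escape sequences, so the parser has no reason of its own to copy the string.

```zig
const std = @import("std");

const Config = struct {
    db_path: []const u8,
};

// Hypothetical helper: reports whether `needle` lies inside `haystack`'s memory.
fn pointsInto(haystack: []const u8, needle: []const u8) bool {
    const start = @intFromPtr(haystack.ptr);
    const ptr = @intFromPtr(needle.ptr);
    return ptr >= start and ptr + needle.len <= start + haystack.len;
}

pub fn main() !void {
    const allocator = std.heap.page_allocator;
    const data = "{\"db_path\": \"/tmp/app.db\"}";

    // Default (.alloc_if_needed): db_path is a slice of `data` itself.
    const referenced = try std.json.parseFromSlice(Config, allocator, data, .{});
    defer referenced.deinit();

    // .alloc_always: db_path is a copy owned by the arena inside Parsed(Config).
    const copied = try std.json.parseFromSlice(Config, allocator, data, .{
        .allocate = .alloc_always,
    });
    defer copied.deinit();

    std.debug.print("default references input: {}\n", .{pointsInto(data, referenced.value.db_path)});
    std.debug.print("alloc_always references input: {}\n", .{pointsInto(data, copied.value.db_path)});
}
```

With the default, db_path should point into data; with alloc_always it should not. That is exactly why freeing data in the original parseConfig left config.db_path dangling.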

When does the parser itself need to make allocations? There are two cases. The first is for internal bookkeeping. Specifically, since JSON can be arbitrarily nested, and nesting can be a mix of arrays and objects, a parser needs to keep track of the type of value (object or array) at each level of nesting. Secondly, parseFromSlice is a higher-level API over a low-level JSON scanner. That scanner is more generic and works over a stream of data. If the JSON input is being streamed in, one chunk at a time, a string value might span multiple chunks. In such cases, the string parts must be duped by the scanner in order to produce a single cohesive string value. Because of these two cases, there's no none option for allocate. A specialized scanner with a maximum supported nesting depth, one that only worked on a long-lived, complete JSON input, could be implemented without allocations. But that isn't how Zig's scanner works.

Memory Leak

For the second issue, the memory leak, recall that the original code called parseFromSlice, assigned the return value to a variable named parsed, and then returned parsed.value. That's immediately suspicious. If nothing else, what exactly does parseFromSlice return, and why doesn't it return a T (Config in our example) directly? From Zig's documentation, we know that the return type is a std.json.Parsed(T), but that type isn't documented.

Parsed(T) is a simple type. It's obviously a generic and has two fields:

pub fn Parsed(comptime T: type) type {
  return struct {
    value: T,
    arena: *ArenaAllocator,

    pub fn deinit(self: @This()) void {
      //...
    }
  };
}

In the previous section we saw that parsing JSON almost always requires allocations, for internal bookkeeping and/or for duping strings. If we don't free those allocations, they'll leak. By using an ArenaAllocator, it's much easier (and faster) to manage the memory as a whole.

The dangling pointer happened because the JSON input had a shorter lifetime than our config, resulting in config referencing no-longer-valid memory. Our memory leak is kind of the opposite: by returning only parsed.value, we've lost any reference to the arena and thus can never free those allocations.

One solution is to change parseConfig so that it returns Parsed(Config). This allows our caller to deinit the arena:

pub fn main() !void {
  var gpa = std.heap.GeneralPurposeAllocator(.{}){};
  defer _ = gpa.deinit();
  const allocator = gpa.allocator();

  const parsed = try parseConfig(allocator, "config.json");
  defer parsed.deinit();

  std.debug.print("db path: {s}\n", .{parsed.value.db_path});
}

fn parseConfig(allocator: Allocator, path: []const u8) !std.json.Parsed(Config) {
  const data = try std.fs.cwd().readFileAlloc(allocator, path, 4096);
  defer allocator.free(data);

  return std.json.parseFromSlice(Config, allocator, data, .{
      .allocate = .alloc_always,
  });
}

You can't call parsed.deinit() inside of parseConfig and then return parsed.value. That would introduce a different dangling pointer, where the config references memory allocated by the arena which has since been freed. The arena and value are tightly linked; they exist as a single unit.
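If you really do want the parsing function to clean up after itself, anything you return has to be copied out of the arena (and out of the input) first. Here's a minimal sketch of that idea, assuming we only care about db_path. The parseDbPath function is hypothetical, not something from the standard library, and the caller becomes responsible for freeing the returned slice.

```zig
const std = @import("std");
const Allocator = std.mem.Allocator;

const Config = struct {
    db_path: []const u8,
};

// Hypothetical helper: parses `data` and returns a copy of db_path that is
// owned by the caller and independent of both `data` and the parser's arena.
fn parseDbPath(allocator: Allocator, data: []const u8) ![]u8 {
    const parsed = try std.json.parseFromSlice(Config, allocator, data, .{});
    // The return expression is evaluated before this defer runs, so the
    // dupe below happens while parsed.value is still valid.
    defer parsed.deinit();

    return allocator.dupe(u8, parsed.value.db_path);
}

pub fn main() !void {
    const allocator = std.heap.page_allocator;

    const db_path = try parseDbPath(allocator, "{\"db_path\": \"/tmp/app.db\"}");
    defer allocator.free(db_path);

    std.debug.print("db path: {s}\n", .{db_path});
}
```

This trades the single arena deinit for manual ownership of each copied field, which gets tedious quickly for anything larger than a field or two. For most cases, returning the Parsed(Config) is simpler.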

Personally, I don't like the name std.json.Parsed(T). There's nothing JSON-specific about this type. It's just a value of type T and an ArenaAllocator. I would prefer if the type was something like std.Owned(T). But whatever its name, the value and arena are one and share a lifetime.
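To show how little is JSON-specific here, this is a rough sketch of what such a type could look like. Owned and makeGreeting are hypothetical names of my own; nothing like std.Owned exists in the standard library, and I'm not claiming this matches how std.json.Parsed(T) is implemented internally.

```zig
const std = @import("std");
const Allocator = std.mem.Allocator;
const ArenaAllocator = std.heap.ArenaAllocator;

// Hypothetical generic wrapper: a value plus the arena that owns its memory.
fn Owned(comptime T: type) type {
    return struct {
        value: T,
        arena: *ArenaAllocator,

        pub fn deinit(self: @This()) void {
            // The arena struct itself was heap-allocated by its child
            // allocator, so free the arena's contents, then the arena.
            const child = self.arena.child_allocator;
            self.arena.deinit();
            child.destroy(self.arena);
        }
    };
}

// Hypothetical example with no JSON involved: the greeting string lives in
// the arena and is released when the caller calls deinit.
fn makeGreeting(allocator: Allocator, name: []const u8) !Owned([]const u8) {
    const arena = try allocator.create(ArenaAllocator);
    errdefer allocator.destroy(arena);
    arena.* = ArenaAllocator.init(allocator);
    errdefer arena.deinit();

    const value = try std.fmt.allocPrint(arena.allocator(), "hello, {s}", .{name});
    return .{ .value = value, .arena = arena };
}

pub fn main() !void {
    const greeting = try makeGreeting(std.heap.page_allocator, "zig");
    defer greeting.deinit();

    std.debug.print("{s}\n", .{greeting.value});
}
```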

In addition to parseFromSlice there's also a parseFromSliceLeaky. The "leaky" version returns T directly. This version is written assuming that the provided allocator is able to free all allocations at once, without having to track every individual allocation. It essentially assumes that the provided allocator is something like an ArenaAllocator or a FixedBufferAllocator. In practical terms, parseFromSlice internally creates an ArenaAllocator and returns that allocator with the parsed value, whereas parseFromSliceLeaky takes an ArenaAllocator (or something like it) and returns the parsed value. In both cases, you end up with a value of type T tied to an allocator.
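Here's a minimal sketch of the leaky variant, assuming the caller already has an arena whose lifetime covers every use of the config. The loadConfig function and the inline JSON are made up for illustration.

```zig
const std = @import("std");
const Allocator = std.mem.Allocator;

const Config = struct {
    db_path: []const u8,
};

// Hypothetical helper: `arena` is expected to be an arena-style allocator,
// since nothing allocated during parsing is freed individually.
fn loadConfig(arena: Allocator, data: []const u8) !Config {
    // Returns a Config directly; there is no Parsed(Config) to deinit.
    return std.json.parseFromSliceLeaky(Config, arena, data, .{
        // Copy strings so the config doesn't depend on `data` staying alive.
        .allocate = .alloc_always,
    });
}

pub fn main() !void {
    // The arena owns everything the parser allocates; one deinit frees it all.
    var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
    defer arena.deinit();

    const config = try loadConfig(arena.allocator(), "{\"db_path\": \"/tmp/app.db\"}");

    // config is valid until arena.deinit() runs.
    std.debug.print("db path: {s}\n", .{config.db_path});
}
```

The lifetime rule is the same as with Parsed(T), it's just that you hold the arena yourself instead of receiving it bundled with the value.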

Conclusion

The short version is that Zig's JSON parser has the ability to either reference the JSON input or make copies of the [string] values. The default is to reference the JSON input, which causes issues if the JSON input does not outlive the parsed value. The benefit is better performance (fewer allocations) for cases where the input does outlive the parsed value.

Furthermore, since parsing JSON can require allocations, parseFromSlice returns both the parsed value and an ArenaAllocator. To prevent memory leaks, deinit must be called on the Parsed(T) returned by the function, at which point the value is no longer valid.