Zig's std.json.parseFromSlice and std.json.Parsed(T)
Jun 18, 2024
In Zig's Discord server, I see a steady stream of developers new to Zig struggling with parsing JSON. I like helping with this problem because you can learn a lot about Zig through it. A typical, but incorrect, first attempt looks something like:
const std = @import("std");
const Allocator = std.mem.Allocator;

const Config = struct {
    db_path: []const u8,
};

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    const config = try parseConfig(allocator, "config.json");
    std.debug.print("db path: {s}\n", .{config.db_path});
}

fn parseConfig(allocator: Allocator, path: []const u8) !Config {
    const data = try std.fs.cwd().readFileAlloc(allocator, path, 4096);
    defer allocator.free(data);

    const parsed = try std.json.parseFromSlice(Config, allocator, data, .{});
    return parsed.value;
}
This code has two bugs: a dangling pointer and a memory leak. Because using a dangling pointer is undefined behavior, this code may or may not crash on the last line of main (which tries to print db_path), but it will definitely report a memory leak.
Dangling Pointer
First we'll fix the dangling pointer. The parseFromSlice function takes our JSON input (data) and tries to parse it into a Config. Our JSON input comes from reading the file. Thanks to the call to defer allocator.free(data), this data is freed when parseConfig exits. This is the source of our bug. By default, parseFromSlice uses references to the underlying JSON input. So when data is freed, those references are no longer valid.
The last parameter to parseFromSlice is a ParseOptions. It controls various aspects of parsing. The option we care about is allocate, which defaults to .alloc_if_needed. We need to pass .{.allocate = .alloc_always} to fix our dangling pointer:
... = try std.json.parseFromSlice(Config, allocator, data, .{
    .allocate = .alloc_always,
});
We could consider this fixed, and move on to our memory leak, but let's go a bit deeper. The ability to reference the input data is an optimization. When the input outlives the parsed value, it makes sense to simply reference the existing input. To see this in action, we could change our code so that data outlives our parsed value and not use the alloc_always setting:
pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    const data = try std.fs.cwd().readFileAlloc(allocator, "config.json", 4096);
    defer allocator.free(data);

    const parsed = try std.json.parseFromSlice(Config, allocator, data, .{});
    std.debug.print("db path: {s}\n", .{parsed.value.db_path});
}
This version doesn't have a dangling pointer and doesn't have to copy the string values of our JSON input to populate the fields of Config. Because data outlives parsed, references from parsed.value to data remain valid (although someone could mutate data, but that's another problem). This version still has the same memory leak though (we'll get to that soon).
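To make the "someone could mutate data" caveat concrete, here's a small sketch. It's not from the code above: it assumes an in-memory JSON buffer instead of config.json, uses std.heap.page_allocator to keep things short, and the prefix and json constants are made up for the illustration. With the default options, I'd expect an edit to the input to show through in the parsed value.

```zig
const std = @import("std");

const Config = struct {
    db_path: []const u8,
};

// Illustrative input, split in two so the offset of the path is known.
const prefix = "{\"db_path\": \"";
const json = prefix ++ "/tmp/a.db\"}";

pub fn main() !void {
    const allocator = std.heap.page_allocator;

    // A mutable copy of the input, standing in for the file contents.
    var data = json.*;

    // Default options: string values may reference `data` directly.
    const parsed = try std.json.parseFromSlice(Config, allocator, &data, .{});
    defer parsed.deinit();
    std.debug.print("before: {s}\n", .{parsed.value.db_path});

    // Change "/tmp/a.db" to "/tmp/b.db" in the input, not in parsed.value.
    data[prefix.len + 5] = 'b';
    std.debug.print("after:  {s}\n", .{parsed.value.db_path});
}
```

Passing .{ .allocate = .alloc_always } instead should make both lines print the same path, since db_path would then be a copy owned by the arena.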
The allocate option has two possible values. The default is alloc_if_needed and the other, the one we used to fix the code, is alloc_always. You could be forgiven for thinking that our code should have worked with the default, alloc_if_needed. After all, our code needed the string value duplicated since our parsed Config outlives the input, right? But the "if needed" part of alloc_if_needed references the parser itself: allocate if the JSON parser needs it. If you think about it, this makes sense. There's no way for parseFromSlice to know that the parsed value outlives the JSON input. I think this option would be less confusing as a boolean called dupe_strings.
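One way to see the difference between the two settings is to check where db_path actually points. The following is a rough sketch, not anything from std: pointsInto and borrowsInput are hypothetical helpers, the JSON is an inline string rather than a file, and std.heap.page_allocator stands in for the GPA. For a plain string with no escapes, I'd expect the default to hand back a view into the input and alloc_always to hand back a copy.

```zig
const std = @import("std");

const Config = struct {
    db_path: []const u8,
};

// Illustrative input; the article reads this from config.json instead.
const input = "{\"db_path\": \"/tmp/app.db\"}";

/// Hypothetical helper: true when `inner` lies entirely inside `outer`'s memory.
fn pointsInto(outer: []const u8, inner: []const u8) bool {
    const outer_start = @intFromPtr(outer.ptr);
    const inner_start = @intFromPtr(inner.ptr);
    return inner_start >= outer_start and
        inner_start + inner.len <= outer_start + outer.len;
}

/// Hypothetical helper: parses `input` with the given options and reports
/// whether db_path is a view into `input` rather than a copy.
fn borrowsInput(allocator: std.mem.Allocator, options: std.json.ParseOptions) !bool {
    const parsed = try std.json.parseFromSlice(Config, allocator, input, options);
    defer parsed.deinit();
    return pointsInto(input, parsed.value.db_path);
}

pub fn main() !void {
    const allocator = std.heap.page_allocator;

    const by_default = try borrowsInput(allocator, .{});
    const with_always = try borrowsInput(allocator, .{ .allocate = .alloc_always });

    std.debug.print("default:      borrows input = {}\n", .{by_default});
    std.debug.print("alloc_always: borrows input = {}\n", .{with_always});
}
```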
When does the parser itself need to make allocations? There are two cases. The first is internal bookkeeping. Specifically, since JSON can be arbitrarily nested, and nesting can be a mix of arrays and objects, the parser needs to keep track of the type (object or array) of each nesting level. Secondly, parseFromSlice is a higher-level API over a low-level JSON scanner. That scanner is more generic and works over a stream of data. If the JSON input is being streamed in, one chunk at a time, a string value might span multiple chunks. In such cases, the string parts must be duped by the scanner in order to produce a single cohesive string value. Because of these two cases, the allocate option has no none value. A specialized scanner with a maximum supported nesting depth, that only worked on a long-lived, complete JSON input, could be implemented without allocations. But that isn't how Zig's scanner works.
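The bookkeeping case can be poked at with an allocator that has nothing to give. This is only a sketch under a few assumptions: parseWithBudget is a made-up helper, the JSON is an inline string, and it uses parseFromSliceLeaky with a FixedBufferAllocator so that the only allocations are the parser's own. On the std versions I'm familiar with, I'd expect the zero-byte run to fail with OutOfMemory even though db_path itself never needs to be copied.

```zig
const std = @import("std");

const Config = struct {
    db_path: []const u8,
};

// Illustrative input; a string literal, so it outlives every parse below.
const input = "{\"db_path\": \"/tmp/app.db\"}";

/// Hypothetical helper: parse `input` with the default options, using only
/// `scratch` as backing memory for whatever the parser decides to allocate.
fn parseWithBudget(scratch: []u8) !Config {
    var fba = std.heap.FixedBufferAllocator.init(scratch);
    return std.json.parseFromSliceLeaky(Config, fba.allocator(), input, .{});
}

fn report(label: []const u8, scratch: []u8) void {
    if (parseWithBudget(scratch)) |config| {
        std.debug.print("{s}: ok, db path: {s}\n", .{ label, config.db_path });
    } else |err| {
        std.debug.print("{s}: failed with {s}\n", .{ label, @errorName(err) });
    }
}

pub fn main() void {
    var scratch: [1024]u8 = undefined;

    // No memory at all: the nesting bookkeeping has nowhere to live.
    report("0 bytes", scratch[0..0]);

    // A small scratch buffer is plenty for a single-level object.
    report("1024 bytes", &scratch);
}
```

The 1024 figure is arbitrary; the point is only that some allocator is required, not how much memory it ends up using.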
Memory Leak
For the second issue, the memory leak, recall that the original code called parseFromSlice, assigned the return value to a variable named parsed, and then returned parsed.value. That's immediately suspicious. If nothing else, what exactly does parseFromSlice return, and why doesn't it return a T (Config in our example) directly? From Zig's documentation, we know that the return type is a std.json.Parsed(T), but that type isn't documented.
Parsed(T) is a simple type. It's a generic with two fields:
pub fn Parsed(comptime T: type) type {
    return struct {
        value: T,
        arena: *ArenaAllocator,

        pub fn deinit(self: @This()) void {
            // body omitted here; it releases the arena and everything allocated from it
        }
    };
}
In the previous section we saw that parsing JSON almost always requires allocations, for internal bookkeeping and/or for duping strings. If we don't free those allocations, they'll leak. By using an ArenaAllocator, it's much easier (and faster) to manage the memory as a whole.
The dangling pointer happened because the JSON input had a shorter lifetime than our config, resulting in config referencing memory that is no longer valid. Our memory leak is kind of the opposite: by returning only parsed.value, we've lost any reference to the arena and thus can never free those allocations.
One solution is to change parseConfig so that it returns Parsed(Config). This allows our caller to deinit the arena:
pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    const parsed = try parseConfig(allocator, "config.json");
    defer parsed.deinit();
    std.debug.print("db path: {s}\n", .{parsed.value.db_path});
}

fn parseConfig(allocator: Allocator, path: []const u8) !std.json.Parsed(Config) {
    const data = try std.fs.cwd().readFileAlloc(allocator, path, 4096);
    defer allocator.free(data);

    return std.json.parseFromSlice(Config, allocator, data, .{
        .allocate = .alloc_always,
    });
}
You can't call parsed.deinit() inside of parseConfig and then return parsed.value. That would introduce a different dangling pointer, where the config references memory allocated by the arena which has since been freed. The arena and value are tightly linked; they exist as a single unit.
Personally, I don't like the name std.json.Parsed(T). There's nothing JSON-specific about this type. It's just a value of type T and an ArenaAllocator. I would prefer if the type was something like std.Owned(T). But whatever its name, the value and arena are one and share a lifetime.
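To show how little JSON there is in this pattern, here's a rough sketch of what such an Owned(T) could look like. To be clear, std.Owned does not exist; Owned, Greeting and makeGreeting are all made up for this example, and the deinit body is my guess at the general shape (free the arena's memory, then free the arena itself), not a copy of std's code.

```zig
const std = @import("std");
const Allocator = std.mem.Allocator;
const ArenaAllocator = std.heap.ArenaAllocator;

/// Hypothetical generic wrapper: a value plus the arena that owns its memory.
fn Owned(comptime T: type) type {
    return struct {
        value: T,
        arena: *ArenaAllocator,

        pub fn deinit(self: @This()) void {
            // The arena struct itself was allocated with the child allocator,
            // so grab that first, then free the arena's memory, then the arena.
            const child = self.arena.child_allocator;
            self.arena.deinit();
            child.destroy(self.arena);
        }
    };
}

// Made-up payload type, nothing to do with JSON.
const Greeting = struct {
    text: []const u8,
};

/// Hypothetical producer: builds a Greeting whose string lives in a fresh arena.
fn makeGreeting(allocator: Allocator, name: []const u8) !Owned(Greeting) {
    const arena = try allocator.create(ArenaAllocator);
    errdefer allocator.destroy(arena);
    arena.* = ArenaAllocator.init(allocator);
    errdefer arena.deinit();

    const text = try std.fmt.allocPrint(arena.allocator(), "hello, {s}", .{name});
    return Owned(Greeting){ .value = .{ .text = text }, .arena = arena };
}

pub fn main() !void {
    const owned = try makeGreeting(std.heap.page_allocator, "zig");
    // After this runs, owned.value.text must not be used anymore.
    defer owned.deinit();

    std.debug.print("{s}\n", .{owned.value.text});
}
```

The caller's obligations are the same as with Parsed(T): keep the wrapper around for as long as the value is in use, and call deinit exactly once.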
In addition to parseFromSlice there's also a parseFromSliceLeaky. The "leaky" version returns T directly. This version is written assuming that the provided allocator is able to free all allocations, without having to track every individual allocation. It essentially assumes that the provided allocator is something like an ArenaAllocator or a FixedBufferAllocator. In practical terms, parseFromSlice internally creates an ArenaAllocator and returns that allocator with the parsed value whereas parseFromSliceLeaky takes an ArenaAllocator (or something like it) and returns the parsed value. In both cases, you end up with a value of type T tied to an allocator.
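As a sketch of the leaky variant, here's a version of parseConfig where the caller owns the arena. A couple of things are assumptions on my part: parseConfigLeaky is a made-up name, it takes the JSON text directly instead of a file path to stay self-contained, and the arena sits on top of std.heap.page_allocator rather than the GPA.

```zig
const std = @import("std");
const Allocator = std.mem.Allocator;

const Config = struct {
    db_path: []const u8,
};

/// Hypothetical variant of parseConfig. `arena` is expected to be an
/// arena-style allocator: nothing allocated here is freed individually.
fn parseConfigLeaky(arena: Allocator, json: []const u8) !Config {
    return std.json.parseFromSliceLeaky(Config, arena, json, .{
        // Copy strings into the arena so the result doesn't depend on `json`.
        .allocate = .alloc_always,
    });
}

pub fn main() !void {
    var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
    // Frees everything the parser allocated, including db_path.
    defer arena.deinit();

    const json = "{\"db_path\": \"/tmp/app.db\"}";
    const config = try parseConfigLeaky(arena.allocator(), json);
    std.debug.print("db path: {s}\n", .{config.db_path});
}
```

The trade-off is the same as before, just moved: config is valid for as long as the arena is, so the arena now has to travel with (or outlive) the value.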
Conclusion
The short version is that Zig's JSON parser has the ability to either reference the JSON input or make copies of the [string] values. The default is to reference the JSON input, which causes issues if the JSON input does not outlive the parsed value. The benefit is better performance (fewer allocations) for cases where the input does outlive the parsed value.
Furthermore, since parsing JSON can require allocations, parseFromSlice returns both the parsed value and an ArenaAllocator. To prevent memory leaks, deinit must be called on the Parsed(T) returned by the function, at which point the value is no longer valid.