Performance of reading a file line by line in Zig
Maybe I'm wrong, but I believe the canonical way to read a file, line by line, in Zig is:
const std = @import("std");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    const allocator = gpa.allocator();

    var file = try std.fs.cwd().openFile("data.txt", .{});
    defer file.close();

    // Things are _a lot_ slower if we don't use a BufferedReader
    var buffered = std.io.bufferedReader(file.reader());
    var reader = buffered.reader();

    // lines will get read into this
    var arr = std.ArrayList(u8).init(allocator);
    defer arr.deinit();

    var line_count: usize = 0;
    var byte_count: usize = 0;
    while (true) {
        reader.streamUntilDelimiter(arr.writer(), '\n', null) catch |err| switch (err) {
            error.EndOfStream => break,
            else => return err,
        };
        line_count += 1;
        byte_count += arr.items.len;
        arr.clearRetainingCapacity();
    }

    std.debug.print("{d} lines, {d} bytes", .{line_count, byte_count});
}
We could use one of reader's thin wrappers around streamUntilDelimiter to make the code a little neater, but since our focus is on performance, we'll stick with less abstraction.
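For example, one of those wrappers (readUntilDelimiterOrEof, at least in the version of Zig I'm using) reads each line into a caller-supplied buffer and returns null at the end of the file. A sketch of the same loop built on it, assuming a fixed 4096-byte buffer:

var buf: [4096]u8 = undefined;
while (try reader.readUntilDelimiterOrEof(&buf, '\n')) |line| {
    line_count += 1;
    byte_count += line.len;
}

The trade-off is that a line longer than the buffer fails with error.StreamTooLong, so for the comparison below we'll stay with the explicit streamUntilDelimiter version above.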
The equivalent (I hope) Go code is:
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
)

func main() {
	file, err := os.Open("data.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer file.Close()

	lineCount := 0
	byteCount := 0
	scanner := bufio.NewScanner(file)
	for scanner.Scan() {
		lineCount += 1
		byteCount += len(scanner.Text())
	}

	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}

	fmt.Printf("%d lines, %d bytes", lineCount, byteCount)
}
What's interesting to me is that, on my computer, the Go version runs more than 4x faster. Using ReleaseFast with this 24MB JSON-lines file I found, Zig takes roughly 95ms whereas Go only takes 20ms. What gives?
The issue comes down to how the std.io.Reader functions like streamUntilDelimiter are implemented and how that integrates with BufferedReader. Much like Go's io.Reader, Zig's std.io.Reader requires implementations to provide a single function: fn read(buffer: []u8) !usize. Any functionality provided by std.io.Reader has to rely on this single read function.
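To make that concrete, here's a hypothetical implementation of that contract (not something from the standard library); every helper, from readByte to streamUntilDelimiter, has to be built on top of a function shaped like this:

// A made-up reader that "reads" an endless stream of zero bytes.
// All std.io.Reader functionality is layered over this one function.
const ZeroReader = struct {
    pub fn read(_: *ZeroReader, buffer: []u8) error{}!usize {
        @memset(buffer, 0);
        return buffer.len;
    }
};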
This is a fair representation of std.io.Reader.streamUntilDelimiter along with the readByte it depends on:
pub fn streamUntilDelimiter(self: Reader, writer: anytype, delimiter: u8) !void {
    while (true) {
        // will bubble error.EndOfStream, which is how we break out
        // of the while loop
        const byte: u8 = try self.readByte();
        if (byte == delimiter) return;
        try writer.writeByte(byte);
    }
}

pub fn readByte(self: Reader) !u8 {
    var result: [1]u8 = undefined;
    // self.read calls our implementation's read (e.g. BufferedReader.read)
    const amt_read = try self.read(result[0..]);
    if (amt_read < 1) return error.EndOfStream;
    return result[0];
}
This implementation will safely work for any type that implements a functional read(buffer: []u8) !usize. But by targeting the lowest common denominator, we potentially lose a lot of performance. If you knew that the underlying implementation had a buffer, you could come up with a much more efficient solution. The biggest, but not only, performance gain would be to leverage the SIMD-optimized std.mem.indexOfScalar to scan for the delimiter over the entire buffer.
Here's what that might look like:
fn streamUntilDelimiter(buffered: anytype, writer: anytype, delimiter: u8) !void {
    while (true) {
        const start = buffered.start;
        if (std.mem.indexOfScalar(u8, buffered.buf[start..buffered.end], delimiter)) |pos| {
            // we found the delimiter
            try writer.writeAll(buffered.buf[start .. start + pos]);
            // skip the delimiter
            buffered.start += pos + 1;
            return;
        } else {
            // we didn't find the delimiter, add everything to the output writer...
            try writer.writeAll(buffered.buf[start..buffered.end]);
            // ... and refill the buffer
            const n = try buffered.unbuffered_reader.read(buffered.buf[0..]);
            if (n == 0) {
                return error.EndOfStream;
            }
            buffered.start = 0;
            buffered.end = n;
        }
    }
}
If you're curious why buffered is an anytype, it's because std.io.BufferedReader is a generic, and we want our streamUntilDelimiter to work for any variant, regardless of the type of the underlying unbuffered reader.
If you take this function and use it in a similar way as our initial code, circumventing std.io.Reader.streamUntilDelimiter, you end up with performance similar to Go's. And we'd still have room for some optimizations.
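Here's a sketch (untested) of what that looks like: the loop from the first example stays the same, except we pass the BufferedReader itself (&buffered) to our standalone function instead of going through buffered.reader():

while (true) {
    streamUntilDelimiter(&buffered, arr.writer(), '\n') catch |err| switch (err) {
        error.EndOfStream => break,
        else => return err,
    };
    line_count += 1;
    byte_count += arr.items.len;
    arr.clearRetainingCapacity();
}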
This is something I tried to fix in Zig's standard library. I thought I could use Zig's comptime capabilities to detect if the underlying implementation has its own streamUntilDelimiter and use it, falling back to std.io.Reader's implementation otherwise. And while this is certainly possible, using std.meta.trait.hasFn for example, I ran into problems that I just couldn't work around. The issue is that buffered.reader() doesn't return an std.io.Reader directly, but goes through an intermediary: std.io.GenericReader. This GenericReader then creates an std.io.Reader on each function call. This double layer of abstraction was more than I wanted to, and probably could, work through.
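The detection itself is the easy part. Here's a rough sketch of the idea, using the @hasDecl builtin rather than std.meta.trait.hasFn, and ignoring the GenericReader indirection described above (reader is assumed to be a pointer to the concrete implementation):

fn smartStreamUntilDelimiter(reader: anytype, writer: anytype, delimiter: u8) !void {
    if (comptime @hasDecl(@TypeOf(reader.*), "streamUntilDelimiter")) {
        // the concrete type has its own (presumably buffer-aware) version; use it
        return reader.streamUntilDelimiter(writer, delimiter);
    } else {
        // otherwise fall back to a byte-at-a-time loop over the read function
        while (true) {
            var byte: [1]u8 = undefined;
            const n = try reader.read(byte[0..]);
            if (n == 0) return error.EndOfStream;
            if (byte[0] == delimiter) return;
            try writer.writeByte(byte[0]);
        }
    }
}

It's threading that kind of detection through GenericReader's extra layer of abstraction that I couldn't work through.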
Instead I opened an issue and wrote a more generic Zig utility library for the little Zig snippets I've collected.
I'm not sure how big the issue actually is. If we assume the above code is right and using a BufferedReader via an std.io.Reader is inefficient, then it's at least a real issue for this common task (on my initial real-world input, which is where I ran into this issue, the overhead was over 10x). But the "interface" pattern of building functionality atop the lowest common denominator is common, so I wonder where else performance is being lost. In this specific case though, I think there's an argument to be made that functionality like streamUntilDelimiter should only be made available on something like a BufferedReader.