Performance of reading a file line by line in Zig
Maybe I'm wrong, but I believe the canonical way to read a file, line by line, in Zig is:
const std = @import("std");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    const allocator = gpa.allocator();

    var file = try std.fs.cwd().openFile("data.txt", .{});
    defer file.close();

    // Things are _a lot_ slower if we don't use a BufferedReader
    var buffered = std.io.bufferedReader(file.reader());
    var reader = buffered.reader();

    // lines will get read into this
    var arr = std.ArrayList(u8).init(allocator);
    defer arr.deinit();

    var line_count: usize = 0;
    var byte_count: usize = 0;
    while (true) {
        reader.streamUntilDelimiter(arr.writer(), '\n', null) catch |err| switch (err) {
            error.EndOfStream => break,
            else => return err,
        };
        line_count += 1;
        byte_count += arr.items.len;
        arr.clearRetainingCapacity();
    }

    std.debug.print("{d} lines, {d} bytes", .{line_count, byte_count});
}
We could use one of reader's thin wrappers around streamUntilDelimiter to make the code a little neater, but since our focus is on performance, we'll stick with less abstraction.
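For example, one of those wrappers (readUntilDelimiterOrEof, at least in the version of Zig I'm using) reads each line into a caller-supplied buffer and returns null at the end of the file. A sketch of the same loop built on it, assuming a fixed 4096-byte buffer:

var buf: [4096]u8 = undefined;
while (try reader.readUntilDelimiterOrEof(&buf, '\n')) |line| {
    line_count += 1;
    byte_count += line.len;
}

The trade-off is that a line longer than the buffer fails with error.StreamTooLong, so for the comparison below we'll stay with the explicit streamUntilDelimiter version above.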
The equivalent (I hope) Go code is:
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
)

func main() {
	file, err := os.Open("data.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer file.Close()

	lineCount := 0
	byteCount := 0
	scanner := bufio.NewScanner(file)
	for scanner.Scan() {
		lineCount += 1
		byteCount += len(scanner.Text())
	}

	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}

	fmt.Printf("%d lines, %d bytes", lineCount, byteCount)
}
What's interesting to me is that, on my computer, the Go version runs more than 4x faster. Using ReleaseFast with this 24MB JSON-lines file I found, Zig takes roughly 95ms whereas Go only takes 20ms. What gives?
The issue comes down to how the std.io.Reader functions like streamUntilDelimiter are implemented and how that integrates with BufferedReader. Much like Go's io.Reader, Zig's std.io.Reader requires implementations to provide a single function: fn read(buffer: []u8) !usize. Any functionality provided by std.io.Reader has to rely on this single read function.
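To make that concrete, here's a hypothetical implementation of that contract (not something from the standard library); every helper, from readByte to streamUntilDelimiter, has to be built on top of a function shaped like this:

// A made-up reader that "reads" an endless stream of zero bytes.
// All std.io.Reader functionality is layered over this one function.
const ZeroReader = struct {
    pub fn read(_: *ZeroReader, buffer: []u8) error{}!usize {
        @memset(buffer, 0);
        return buffer.len;
    }
};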
This is a fair representation of std.io.Reader.streamUntilDelimiter along with the readByte it depends on:
pub fn streamUntilDelimiter(self: Reader, writer: anytype, delimiter: u8) !void {
    while (true) {
        // will bubble error.EndOfStream, which is how we break out
        // of the while loop
        const byte: u8 = try self.readByte();
        if (byte == delimiter) return;
        try writer.writeByte(byte);
    }
}

pub fn readByte(self: Reader) !u8 {
    var result: [1]u8 = undefined;
    // self.read calls our implementation's read (e.g. BufferedReader.read)
    const amt_read = try self.read(result[0..]);
    if (amt_read < 1) return error.EndOfStream;
    return result[0];
}
This implementation will safely work for any type that implements a functional read(buffer: []u8) !usize. But by targeting the lowest common denominator, we potentially lose a lot of performance. If you knew that the underlying implementation had a buffer, you could come up with a much more efficient solution. The biggest, but not only, performance gain would be to leverage the SIMD-optimized std.mem.indexOfScalar to scan for the delimiter over the entire buffer.
Here's what that might look like:
fn streamUntilDelimiter(buffered: anytype, writer: anytype, delimiter: u8) !void {
    while (true) {
        const start = buffered.start;
        if (std.mem.indexOfScalar(u8, buffered.buf[start..buffered.end], delimiter)) |pos| {
            // we found the delimiter
            try writer.writeAll(buffered.buf[start .. start + pos]);
            // skip the delimiter
            buffered.start += pos + 1;
            return;
        } else {
            // we didn't find the delimiter, add everything to the output writer...
            try writer.writeAll(buffered.buf[start..buffered.end]);
            // ... and refill the buffer
            const n = try buffered.unbuffered_reader.read(buffered.buf[0..]);
            if (n == 0) {
                return error.EndOfStream;
            }
            buffered.start = 0;
            buffered.end = n;
        }
    }
}
If you're curious why buffered is an anytype, it's because std.io.BufferedReader is a generic, and we want our streamUntilDelimiter to work for any variant, regardless of the type of the underlying unbuffered reader.
If you take this function and use it in a similar way as our initial code, circumventing std.io.Reader.streamUntilDelimiter, you end up with performance similar to Go's. And we'd still have room for some optimizations.
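Here's a sketch (untested) of what that looks like: the loop from the first example stays the same, except we pass the BufferedReader itself (&buffered) to our standalone function instead of going through buffered.reader():

while (true) {
    streamUntilDelimiter(&buffered, arr.writer(), '\n') catch |err| switch (err) {
        error.EndOfStream => break,
        else => return err,
    };
    line_count += 1;
    byte_count += arr.items.len;
    arr.clearRetainingCapacity();
}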
This is something I tried to fix in Zig's standard library. I thought I could use Zig's comptime capabilities to detect if the underlying implementation has its own streamUntilDelimiter and use it, falling back to std.io.Reader's implementation otherwise. And while this is certainly possible, using std.meta.trait.hasFn for example, I ran into problems that I just couldn't work around. The issue is that buffered.reader() doesn't return an std.io.Reader directly, but goes through an intermediary: std.io.GenericReader. This GenericReader then creates an std.io.Reader on each function call. This double layer of abstraction was more than I wanted to, and probably could, work through.
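The detection itself is the easy part. Here's a rough sketch of the idea, using the @hasDecl builtin rather than std.meta.trait.hasFn, and ignoring the GenericReader indirection described above (reader is assumed to be a pointer to the concrete implementation):

fn smartStreamUntilDelimiter(reader: anytype, writer: anytype, delimiter: u8) !void {
    if (comptime @hasDecl(@TypeOf(reader.*), "streamUntilDelimiter")) {
        // the concrete type has its own (presumably buffer-aware) version; use it
        return reader.streamUntilDelimiter(writer, delimiter);
    } else {
        // otherwise fall back to a byte-at-a-time loop over the read function
        while (true) {
            var byte: [1]u8 = undefined;
            const n = try reader.read(byte[0..]);
            if (n == 0) return error.EndOfStream;
            if (byte[0] == delimiter) return;
            try writer.writeByte(byte[0]);
        }
    }
}

It's threading that kind of detection through GenericReader's extra layer of abstraction that I couldn't work through.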
Instead I opened an issue and wrote a more generic Zig utility library for the little Zig snippets I've collected.
I'm not sure how big the issue actually is. If we assume the above code is right and using a BufferedReader via an std.io.Reader is inefficient, then it's at least a real issue for this common task (on my initial real-world input, which is where I ran into this issue, the overhead was over 10x). But the "interface" pattern of building functionality atop the lowest common denominator is common, so I wonder where else performance is being lost. In this specific case though, I think there's an argument to be made that functionality like streamUntilDelimiter should only be made available on something like a BufferedReader.