A Programmer-Friendly I/O Abstraction Over io_uring and kqueue
Nov 23, 2022

King Butcher

Phil Eaton

Consider this tale of I/O and performance. We’ll start with blocking
I/O, explore io_uring and kqueue, and take home an event loop very
similar to some software you may find familiar.
This is a twist on King’s talk at Software You Can Love
Milan ’22.
Classical approach
When you want to read from a file you might open() and
then call read() as many times as necessary to fill a
buffer of bytes from the file. And in the opposite direction, you call
write() as many times as needed until everything is
written. It’s similar for a TCP client with sockets, but instead of
open() you first call socket() and then
connect() to your server. Fun stuff.
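For concreteness, here is a minimal sketch of that classical read loop in Zig (using the same std.os API as the larger example below; error handling is left to the caller):

const std = @import("std");
const os = std.os;

// Classical blocking reads: call read() repeatedly until the buffer
// is full or the file descriptor reports end-of-file.
fn readAll(fd: os.fd_t, buffer: []u8) !usize {
    var total: usize = 0;
    while (total < buffer.len) {
        const n = try os.read(fd, buffer[total..]);
        if (n == 0) break; // end-of-file
        total += n;
    }
    return total;
}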
In the real world though you can’t always read everything you want
immediately from a file descriptor. Nor can you always write everything
you want immediately to a file descriptor.
You can switch
a file descriptor into non-blocking mode so that calls return
immediately when the data you requested is not yet available. But
system calls are still expensive, incurring context switches and cache
misses. In fact, networks and disks have become so fast that these
costs can start to approach the cost of doing the I/O itself. While a
file descriptor is unable to read or write, you don't want to waste
time retrying read or write system calls in a busy loop.
Batching and readiness
So you switch to io_uring on Linux or kqueue on FreeBSD/macOS. (I’m
skipping the generation of epoll/select users.) These APIs let you
submit requests to the kernel to learn about readiness: when a file
descriptor is ready to read or write. You can send readiness requests in
batches (also referred to as queues). Completion events, one for each
submitted request, are available in a separate queue.
Being able to batch I/O like this is especially important for TCP
servers that want to multiplex reads and writes for multiple connected
clients.
However, with io_uring you can go one step further. Instead of
having to call read() or write() in userland
after a readiness event, you can request that the kernel do the
read() or write() itself with a buffer you
provide. Thus almost all of your I/O is done in the kernel, amortizing
the overhead of system calls.
If you haven’t seen io_uring or kqueue before, you’d probably like an
example! Consider this code: a simple, minimal, not-production-ready TCP
echo server.
const std = @import("std");
const os = std.os;
const linux = os.linux;
const allocator = std.heap.page_allocator;

// Each connection is a little state machine: accepting, receiving, or sending.
const State = enum { accept, recv, send };
const Socket = struct {
    handle: os.socket_t,
    buffer: [1024]u8,
    state: State,
};

pub fn main() !void {
    const entries = 32;
    const flags = 0;
    var ring = try linux.IO_Uring.init(entries, flags);
    defer ring.deinit();

    var server: Socket = undefined;
    server.handle = try os.socket(os.AF.INET, os.SOCK.STREAM, os.IPPROTO.TCP);
    defer os.closeSocket(server.handle);

    const port = 12345;
    var addr = std.net.Address.initIp4(.{ 127, 0, 0, 1 }, port);
    var addr_len: os.socklen_t = addr.getOsSockLen();

    try os.setsockopt(server.handle, os.SOL.SOCKET, os.SO.REUSEADDR, &std.mem.toBytes(@as(c_int, 1)));
    try os.bind(server.handle, &addr.any, addr_len);
    const backlog = 128;
    try os.listen(server.handle, backlog);

    // Ask the kernel for the first accept; the Socket's address rides
    // along in user_data so we can recover it on completion.
    server.state = .accept;
    _ = try ring.accept(@ptrToInt(&server), server.handle, &addr.any, &addr_len, 0);

    while (true) {
        // Submit queued requests and block until at least one completes.
        _ = try ring.submit_and_wait(1);

        while (ring.cq_ready() > 0) {
            const cqe = try ring.copy_cqe();
            var client = @intToPtr(*Socket, @intCast(usize, cqe.user_data));

            if (cqe.res < 0) std.debug.panic("{}({}): {}", .{
                client.state,
                client.handle,
                @intToEnum(os.E, -cqe.res),
            });

            switch (client.state) {
                .accept => {
                    // New connection: start a recv for it and re-arm the accept.
                    client = try allocator.create(Socket);
                    client.handle = @intCast(os.socket_t, cqe.res);
                    client.state = .recv;
                    _ = try ring.recv(@ptrToInt(client), client.handle, .{ .buffer = &client.buffer }, 0);
                    _ = try ring.accept(@ptrToInt(&server), server.handle, &addr.any, &addr_len, 0);
                },
                .recv => {
                    // Echo back whatever we received.
                    const read = @intCast(usize, cqe.res);
                    client.state = .send;
                    _ = try ring.send(@ptrToInt(client), client.handle, client.buffer[0..read], 0);
                },
                .send => {
                    os.closeSocket(client.handle);
                    allocator.destroy(client);
                },
            }
        }
    }
}
This is a great, minimal example. But notice that this code ties
io_uring behavior directly to business logic (in this case, handling
echoing data between request and response). It is fine for a small
example like this. But in a large application you might want to do I/O
throughout the code base, not just in one place. You might not want to
keep adding business logic to this single loop.
Central I/O dispatch
Instead, you might want to be able to schedule I/O and pass a
callback (and sometimes some application context) to be called when
the event is complete.
The interface might look like:
io_dispatch.dispatch({
    // some big struct/union with relevant fields for all event types
}, my_callback);
This is great! Now your business logic can schedule and handle I/O no
matter where in the code base it is.
Under the hood it can decide whether to use io_uring or kqueue
depending on what kernel it’s running on. The dispatch can also batch
these individual calls through io_uring or kqueue to amortize system
calls. The application no longer needs to know the details.
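As a sketch (with hypothetical field names, not TigerBeetle's actual types), the argument to dispatch could be a tagged union with one variant per supported operation:

const std = @import("std");
const os = std.os;

// One variant per I/O operation the abstraction supports,
// each carrying that operation's arguments.
const Operation = union(enum) {
    accept: struct { socket: os.socket_t },
    recv: struct { socket: os.socket_t, buffer: []u8 },
    send: struct { socket: os.socket_t, buffer: []const u8 },
    read: struct { fd: os.fd_t, buffer: []u8, offset: u64 },
    write: struct { fd: os.fd_t, buffer: []const u8, offset: u64 },
};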
Additionally, we can use this wrapper to stop thinking about
readiness events and think only in terms of I/O completion. That is,
if we dispatch a read event, the io_uring implementation would ask the
kernel to read data directly into a buffer, whereas the kqueue
implementation would wait for a "read" readiness event, do the read
back in userland, and then call our callback.
And finally, now that we’ve got this central dispatcher, we don’t
need spaghetti code in a loop switching on every possible submission and
completion event.
Interacting with the submission and completion queue
Every time we call io_uring or kqueue we both submit event requests
and poll for completion events. The io_uring and kqueue APIs tie these
two actions together in the same system call.
To sync our requests to io_uring or kqueue we’ll build a
flush function that submits requests and polls for
completion events. (In the next section we’ll talk about how the user of
the central dispatch learns about completion events.)
To make flush more convenient, we’ll build a nice
wrapper around it so that we can submit as many requests (and process as
many completion events) as possible. To avoid accidentally blocking
indefinitely we’ll also introduce a time limit. We’ll call the wrapper
run_for_ns.
Finally we’ll put the user in charge of setting up a loop to call
this run_for_ns function, independent of normal program
execution.
This is now your traditional event loop.
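In use, the loop might look like this (a sketch: the IO and App types here are hypothetical stand-ins, but run_for_ns mirrors the real function linked to below):

// The application owns the loop and decides how long each tick may spend on I/O.
fn eventLoop(io: *IO, app: *App) !void {
    while (app.running) {
        // Submit pending requests and process completions for up to 10ms,
        // invoking callbacks for any I/O that completed.
        try io.run_for_ns(10 * std.time.ns_per_ms);

        // ... interleave other application work here ...
    }
}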
Callbacks and context
You may have noticed that in the API above we passed a callback. The
idea is that after the requested I/O has completed, our callback should
be invoked. But the question remains: how do we track this callback
between submission and completion?
Thankfully, io_uring and kqueue events have user data fields. The
user data field is opaque to the kernel. When a submitted event
completes, the kernel sends a completion event back to userland
containing the user data value from the submission event.
We can store the callback in the user data field by setting it to the
callback's pointer cast to an integer. When the completion for a
requested event comes back, we cast the integer in the user data
field back to the callback pointer. Then we invoke the callback.
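With Zig's io_uring bindings, that round trip might look like this (a sketch; Completion here is a hypothetical stand-in for whatever your dispatcher stores per request):

const std = @import("std");
const os = std.os;
const linux = os.linux;

// Everything needed to finish a request, referenced via user_data.
const Completion = struct {
    callback: fn (completion: *Completion, result: i32) void,
};

// On submission: stash the completion's address in the opaque user_data field.
fn submitRead(ring: *linux.IO_Uring, completion: *Completion, fd: os.fd_t, buffer: []u8) !void {
    _ = try ring.read(@ptrToInt(completion), fd, .{ .buffer = buffer }, 0);
}

// On completion: cast user_data back to the pointer and invoke the callback.
fn onCompletion(cqe: linux.io_uring_cqe) void {
    const completion = @intToPtr(*Completion, @intCast(usize, cqe.user_data));
    completion.callback(completion, cqe.res);
}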
Getting more expressive
As described above, the struct for io_dispatch.dispatch
could get quite large handling all the different kinds of I/O events and
their arguments. We could make our API a little more expressive by
creating wrapper functions for each event type.
So if we wanted to schedule a read we could call:
io_dispatch.read(fd, &buf, nBytesToRead, callback);
Or to write, similarly:
io_dispatch.write(fd, buf, nBytesToWrite, callback);
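Each of these can be a thin shim that builds the right operation and forwards it to the central dispatcher, for example (hypothetical, reusing the Operation union sketched earlier):

// Hypothetical convenience wrappers over the central dispatcher.
pub fn read(fd: os.fd_t, buffer: []u8, callback: Callback) void {
    io_dispatch.dispatch(.{ .read = .{ .fd = fd, .buffer = buffer, .offset = 0 } }, callback);
}

pub fn write(fd: os.fd_t, buffer: []const u8, callback: Callback) void {
    io_dispatch.dispatch(.{ .write = .{ .fd = fd, .buffer = buffer, .offset = 0 } }, callback);
}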
Overflow
One more thing we need to worry about is that the batch we pass to
io_uring or kqueue has a fixed size (technically, kqueue allows any
batch size but using that might introduce unnecessary allocations). So
we’ll build our own queue on top of our I/O abstraction to keep track of
requests that we could not immediately submit to io_uring or kqueue.

To keep this API simple we could allocate for each entry in the
queue. Or we could modify the io_dispatch.X calls slightly
to accept a struct that can be used in an intrusive
linked list to contain all request context, including the callback.
The latter is what
we do in TigerBeetle.
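Such an intrusive queue might look like this (a minimal sketch, not TigerBeetle's exact types): each request embeds its own next pointer, so queueing it requires no allocation.

// Each in-flight request embeds its own link and everything needed
// to finish it later, so the queue itself never allocates.
const Completion = struct {
    next: ?*Completion = null,
    callback: fn (completion: *Completion, result: i32) void,
    // ... operation arguments, application context, etc. ...
};

// A minimal intrusive FIFO of completions.
const Fifo = struct {
    head: ?*Completion = null,
    tail: ?*Completion = null,

    fn peek(self: *Fifo) ?*Completion {
        return self.head;
    }

    fn push(self: *Fifo, completion: *Completion) void {
        completion.next = null;
        if (self.tail) |tail| tail.next = completion else self.head = completion;
        self.tail = completion;
    }

    fn pop(self: *Fifo) ?*Completion {
        const completion = self.head orelse return null;
        self.head = completion.next;
        if (self.head == null) self.tail = null;
        return completion;
    }
};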

Put another way: every time code calls io_dispatch,
we’ll try to immediately submit the requested event to io_uring or
kqueue. But if there’s no room, we store the event in an overflow
queue.
The overflow queue needs to be processed eventually, so we update our
flush function (described in Interacting with the submission and
completion queue above) to pull as many events as possible from our
overflow queue before submitting a batch to io_uring or kqueue.
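Putting the pieces together, a simplified io_uring flush might look like this (a sketch, not TigerBeetle's actual code; unqueued is the overflow queue from the sketch above, and prep and complete are hypothetical helpers that fill in a submission queue entry and invoke the stored callback):

fn flush(self: *IO) !void {
    // Move overflowed requests into the submission queue while there is room.
    while (self.unqueued.peek() != null) {
        const sqe = self.ring.get_sqe() catch break; // still full: retry next flush
        const completion = self.unqueued.pop().?;
        completion.prep(sqe); // hypothetical: fills the sqe for this operation
    }

    // Submit the batch and wait for at least one completion event.
    _ = try self.ring.submit_and_wait(1);

    // Drain the completion queue, invoking each request's callback.
    while (self.ring.cq_ready() > 0) {
        const cqe = try self.ring.copy_cqe();
        const completion = @intToPtr(*Completion, @intCast(usize, cqe.user_data));
        completion.complete(cqe.res); // hypothetical: invokes the stored callback
    }
}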
A step back
We’ve now built something similar to libuv, the I/O library that
Node.js uses. And if you squint, it is basically TigerBeetle’s I/O
library! (And interestingly enough, TigerBeetle’s I/O code was adopted
into Bun! Open-source for the win!)
Let’s check out how the Darwin
version of TigerBeetle’s I/O library (with kqueue) differs from the
Linux
version. As mentioned, the send call in the Darwin implementation
waits for file descriptor readiness (through kqueue). Once ready, the
actual send system call is made back in userland:
pub fn send(
    self: *IO,
    comptime Context: type,
    context: Context,
    comptime callback: fn (
        context: Context,
        completion: *Completion,
        result: SendError!usize,
    ) void,
    completion: *Completion,
    socket: os.socket_t,
    buffer: []const u8,
) void {
    self.submit(
        context,
        callback,
        completion,
        .send,
        .{
            .socket = socket,
            .buf = buffer.ptr,
            .len = @intCast(u32, buffer_limit(buffer.len)),
        },
        struct {
            fn do_operation(op: anytype) SendError!usize {
                return os.send(op.socket, op.buf[0..op.len], 0);
            }
        },
    );
}
Compare this to the Linux
version (with io_uring) where the kernel handles everything and
there is no send system call in userland:
pub fn send(
    self: *IO,
    comptime Context: type,
    context: Context,
    comptime callback: fn (
        context: Context,
        completion: *Completion,
        result: SendError!usize,
    ) void,
    completion: *Completion,
    socket: os.socket_t,
    buffer: []const u8,
) void {
    completion.* = .{
        .io = self,
        .context = context,
        .callback = struct {
            fn wrapper(ctx: ?*anyopaque, comp: *Completion, res: *const anyopaque) void {
                callback(
                    @intToPtr(Context, @ptrToInt(ctx)),
                    comp,
                    @intToPtr(*const SendError!usize, @ptrToInt(res)).*,
                );
            }
        }.wrapper,
        .operation = .{
            .send = .{
                .socket = socket,
                .buffer = buffer,
            },
        },
    };
    // Fill out a submission immediately if possible, otherwise add to the overflow queue.
    self.enqueue(completion);
}
Similarly, take a look at flush on Linux
and macOS
for event processing. Look at run_for_ns on Linux
and macOS
for the public API users must call. And finally, look at what puts this
all into practice, the loop calling run_for_ns in
src/main.zig.
Windows and IOCP
We’ve come this far and you might be wondering — what about
cross-platform support for Windows? The good news is that Windows also
has a completion-based system similar to io_uring, but without batching,
called IOCP.
And for bonus points, TigerBeetle provides the same
I/O abstraction over it! But it’s enough to cover just Linux and
macOS in this post. :)
Beyond a single thread
In both this blog post and in TigerBeetle, we implemented a
single-threaded event loop. Keeping I/O code single-threaded in
userspace is beneficial (whether or not I/O processing is
single-threaded in the kernel is not our concern). It’s the simplest
code and best for workloads that are not embarrassingly parallel. It is
also best for determinism, which is integral to the design of
TigerBeetle because it enables us to do Deterministic Simulation
Testing.
But there are other valid architectures for other workloads.
For workloads that are embarrassingly parallel, like many web
servers, you could instead use multiple threads where each thread has
its own queue. In optimal conditions, this architecture has the highest
I/O throughput possible.
But if each thread has its own queue, individual threads can become
starved if an uneven amount of work is scheduled on one thread. In the
case of dynamic amounts of work, the better architecture would be to
have a single queue but multiple worker threads doing the work made
available on the queue.
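A minimal sketch of that single-queue, multiple-worker shape (hypothetical Job type; assuming the std.Thread API of recent Zig versions):

const std = @import("std");

const Job = struct {
    // whatever a unit of work needs
};

// All workers pull from one shared queue, so no thread starves
// while another still has work pending.
const WorkQueue = struct {
    mutex: std.Thread.Mutex = .{},
    cond: std.Thread.Condition = .{},
    jobs: std.ArrayList(Job),

    fn push(self: *WorkQueue, job: Job) !void {
        self.mutex.lock();
        defer self.mutex.unlock();
        try self.jobs.append(job);
        self.cond.signal();
    }

    fn pop(self: *WorkQueue) Job {
        self.mutex.lock();
        defer self.mutex.unlock();
        while (self.jobs.items.len == 0) self.cond.wait(&self.mutex);
        return self.jobs.orderedRemove(0); // FIFO; O(n), but fine for a sketch
    }
};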
Standalone cross-platform evented I/O library?
Hey, maybe we'll split this out so you can use it too. It's written
in Zig so we can easily expose a C API. Any language with a C foreign
function interface (i.e. every language) should work well with it. Keep
an eye on our GitHub. :)
Additional resources:

https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-mcsherry.pdf
https://github.com/ziglang/zig/issues/8224
https://www.youtube.com/watch?v=lZU6RK0oazM
https://unixism.net/loti/
