neugierig.org: Tech Notes
Theseus: translating win32 to wasm
May 24, 2026
This post is part of a series on Theseus, my win32/x86 emulator.

Theseus can now produce WebAssembly output, allowing it to translate a .exe file into something that runs on the web. Try it out here, but note it is full of bugs (e.g. Minesweeper crashes if you win).

This was pretty straightforward to get working, with the exception of one major detail that this post will go into.

The x86 emulation part of this is just recompiling the existing Theseus output with a different CPU target. This is one of the main benefits of this binary translation approach. The translated code is almost wholly agnostic to the environment it eventually runs in (the exception being how main gets invoked). In principle I now get optimized wasm compiler output more or less for free. The main challenge was figuring out the code layout to get Cargo to cooperate with my weird requirements.

The win32 part was changing things to abstract over a "Host" API that is able to do things like fetch mouse events and render pixels. That is now implemented once for SDL and once for the web. This was also relatively straightforward, at least in my first pass.

So what was hard? It comes down to a part of the design space I hadn't previously explored well: whether the emulator is allowed to block.

To block or not to block

In retrowin32, the emulator was designed to be able to step through some instructions and then return control to the caller. This is critical for the web version in particular, where you cannot block the main thread. In my earlier post "threading in two ways" I went into some detail on the various tradeoffs on how I could emulate threads in a browser, ultimately choosing a single thread. This has its advantages, but is unsatisfying in a few important ways:
- The main thread must repeatedly call into the emulator in a loop that yields control back to the browser.
- Any Windows API implementation that might transfer control to the emulator must be made async, so that it can be suspended and resumed. This is obvious for functions that take a callback, but even a function like MoveWindow will synchronously send Windows messages about the move to the window, so it is also async with respect to the message handling.
- And finally, all the normal reasons async code is yucky: getting object lifetimes correct, how stack traces are busted, confusing debugging, and so on.
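To make the first point concrete, here is a minimal TypeScript sketch of such a cooperative run loop. The `SteppableEmulator` interface, the `runCooperatively` function, and the step budget are all hypothetical names invented for illustration; they are not the retrowin32 API.

```typescript
// Hypothetical interface: stands in for an emulator that can run a bounded
// number of instructions and then return control to its caller.
interface SteppableEmulator {
  // Runs at most `maxSteps` instructions.
  // Returns false once the emulated program has exited.
  step(maxSteps: number): boolean;
}

// Runs the emulator one slice at a time. After each slice, the next one is
// handed to `schedule` rather than being called directly, so the caller can
// let the browser process events in between.
function runCooperatively(
  emu: SteppableEmulator,
  stepsPerSlice: number,
  schedule: (next: () => void) => void,
): void {
  const slice = (): void => {
    if (emu.step(stepsPerSlice)) {
      schedule(slice);
    }
  };
  slice();
}
```

In a browser, the scheduler could be something like `(next) => setTimeout(next, 0)`. Every place where the emulated program would normally block has to be expressed as returning out of `step` and resuming on a later slice, which is where the async plumbing described above comes from.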
In the spirit of exploring the design space, when I got to revisit this choice in Theseus I instead made everything synchronous and implemented threads using real OS threads. In particular because Theseus maps the original program's code to function calls, it makes the debugging experience pretty pleasant: if I set a breakpoint or if something crashes, I get a stack trace that goes through both the source program and emulator code.
Picture: a Theseus program in a native debugger, with a stack trace including a generated x86 address on the left, and with a thread picker showing the Windows "winmm" multimedia thread on the right.

I mostly care about the developer experience here, but one additional reason this approach is nice is performance. Computers are really good at quickly running simple code made of nested function calls that store things on the stack. My asynchronous approach meant there was a lot of control overhead, even in tight loops.

Blocking on the web

In all, blocking is great. But on the web, you cannot block the main thread. Even in a single-threaded program a call to a Windows API like GetMessage is supposed to block until a message is available, but browser events will only come in via the browser event loop once you've returned control. It would seem you're stuck.

What it really means is that fundamentally, if you want to block, you must use a thread, even in the case where the program you're emulating is itself single-threaded, because worker threads are allowed to block.

So here's the approach: I run the emulator's threads in web workers. When the emulator needs something from the browser, it can send a message via the postMessage API that comes in on the main thread's event loop. And here I can make the worker block until the message is handled.

This is where the atomics API comes in. (Uh oh, synchronization code! The chances that I got this wrong are extremely high; I welcome your feedback on this, and I post it in part to provoke some reader who knows more than me to correct me.)

If you share memory between the main thread and worker, you can make the worker block on an atomic until the main thread is done. To do this, the worker sends the address of a local when it posts its message:

    use core::arch::wasm32;

    fn blocking_call() {
        let mut buf = 0i32;
        let msg = create_message(
            /* ... some JavaScript data indicating what function to call ... */,
            // ... and include the *address* of the above 'buf' variable
            &mut buf as *mut _ as u32
        );
        post_message(msg);
        unsafe {
            // wait while buf==0 until we get an Atomic notify on it
            wasm32::memory_atomic_wait32(&mut buf, 0, -1 /* forever */);
        }
    }
The main thread receives these, and wakes the worker up when it's done by prodding the shared memory:

    // `worker` is the Worker object whose thread posted the message
    worker.onmessage = (e) => {
      const msg = e.data;
      // ... handle message ...

      // interpret msg.buf as a pointer within the shared memory:
      const ints = new Int32Array(sharedMemory.buffer, msg.buf, /* length */ 1);
      // set `buf` from above to mark it successfully handled
      Atomics.store(ints, /* index */ 0, 1);
      // wake up the waiting thread:
      Atomics.notify(ints, /* index */ 0, /* how many to wake up */ 1);
    };
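One property the handler above depends on is that a wake-up cannot be lost, regardless of whether the main thread finishes before or after the worker starts waiting. Here is a small TypeScript sketch of the completion step on its own. The helper name `completeCall` and the use of a plain `SharedArrayBuffer` in place of the shared wasm memory are assumptions made for illustration, not code from Theseus.

```typescript
// Hypothetical helper: marks the blocking call whose status word lives at
// byte offset `bufPtr` as finished, then wakes its waiter (if any).
// `bufPtr` must be a multiple of 4, as Int32Array requires.
// Returns the number of waiters that were actually woken.
function completeCall(
  memory: SharedArrayBuffer,
  bufPtr: number,
  status: number,
): number {
  const ints = new Int32Array(memory, bufPtr, /* length */ 1);
  // Store first: a waiter that has not gone to sleep yet will then see a
  // value other than the 0 it expects and will not sleep at all.
  Atomics.store(ints, 0, status);
  // Then wake at most one waiter that is already asleep on this word.
  return Atomics.notify(ints, 0, 1);
}
```

If the store lands before the worker begins its wait, the wait compares the word against 0, finds a mismatch, and returns immediately instead of sleeping (the JavaScript `Atomics.wait` API reports this case as `"not-equal"`). If the worker is already asleep, the notify wakes it. Either way the worker continues only after the status word has been written.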
Note that because the worker is blocked until its message is processed, we know that the address of the local stack variable remains live until the main thread is done with it. This means we can effectively pass the address of any local variable from the worker and the main thread can safely modify it as it chooses.

From this sketch I hope you can see how I extended this to pass buffers in both directions. When the worker generates pixels, it sends a message with just a pointer to the pixels, which the main thread can read directly from the worker's memory (no copies!). And when the worker blocks to wait for an event, it can supply a buffer that the main thread can fill in.

The main limitation of this approach is that the main thread cannot transfer any browser objects to the worker thread, because the only communication back is via the shared memory buffer. Objects can only be transferred by attaching them to postMessage, and those arrive via the browser event loop.

TypeScript in the host?

You might have noticed the above code switches into TypeScript to show the main thread handler. At first I intended to write all of this as a single wasm blob that contained the code for both the main thread and the worker threads. I eventually turned back to TypeScript for a few reasons.

Because the main thread cannot block, it cannot practically share its memory with the workers if any synchronization might be involved. That would veto even using a malloc implementation. I think the best way to make this work is by running the main thread wasm with its own private memory, and handing it a reference to the workers' shared memory. I think because that shared memory object is opaque, you would need to call out to browser APIs to interact with it, rather than the native wasm memory APIs.

Unlike the main thread, the workers can safely malloc despite sharing memory because they can use locks like an ordinary program would.
...except that for reasons I don't fully understand, the Rust standard library under wasm isn't compiled with support for atomics turned on. Thankfully, there's a relatively supported but still nightly-only Rust path to rebuild the standard library itself as part of the worker build process. (It does however highlight that using shared memory web workers at all with Rust is still not exactly a supported path.)

The other main reason I turned back to TypeScript is that the worker threads cannot access the DOM, and while that can be cumbersome it also provides a nice wall between the Rust worker code and browser hosting code. The Rust/wasm support for interacting with the DOM is better than it could be, but it's still pretty clunky, where e.g. any DOM function you call gets wrapped in a JS helper that is imported by the wasm module. Instead I can write my Rust code without any knowledge of browser APIs, and do all of the DOM munging on the TypeScript side.

In general, it's hard to beat the experience of using TypeScript for web development. Tools like debugging and interactively inspecting objects are far superior to wasm debugging. (Also the recent TypeScript compiler rewrite in Go works well, it's so fast!)

The main downside so far is serialization. I haven't yet figured out a mechanism I'm happy with for transporting more complex objects across the host/worker boundary. I saw a tech talk recently where someone used Rust's rkyv library for this purpose and it looked pretty neat.

What's next?

Ultimately the purpose of any of these projects is just to learn about the things I was curious about. From this excursion I conclude that writing apps in wasm is impressive but still not quite there yet. I am glad I have my native build to fall back on when I want to deploy fancier tools. This is definitely a pattern I learned at Figma (where they also had a native build of their wasm-based app) and one that I would recommend to you.
Similarly, I conclude that Rust with shared memory workers is still pretty early. I think for an app where you really cared it works pretty well, but "use a nightly compiler so you can recompile the standard library" is not a great sign.

For Theseus itself, I have a few ideas of where to go next, but those will have to wait for another post!
The project discussed involves translating native win32 code into WebAssembly (Wasm) using the Theseus emulator, which allows the resulting code to execute in a web environment. The initial implementation achieved the translation by recompiling the existing Theseus output with a different CPU target, yielding code that is largely agnostic to the execution environment and that, in principle, benefits from the Wasm compiler's optimizations at little extra cost. A key challenge in this translation lay in designing the execution model, particularly concerning whether the emulator should be allowed to block.
The author explored the tradeoffs between blocking and asynchronous execution, ultimately choosing a design rooted in blocking, but adapted for the constraints of the web environment. In the earlier retrowin32 project, the author had emulated threads in the browser with a single-threaded asynchronous approach, which led to complications regarding object lifetimes, stack traces, and debugging. To improve the developer experience, the author revisited the design in Theseus, opting for synchronous execution coupled with real operating system threads. This choice significantly enhanced debugging, allowing stack traces to traverse both the original source program and the emulator code, which helps when investigating crashes and setting breakpoints. Synchronous execution also performs better, because it avoids the control overhead that the asynchronous model imposed even in tight loops.
Addressing the environment constraints of the web, where the main thread cannot be blocked, the author devised an approach utilizing web workers to handle the emulator's threads. When the emulator requires interaction with the browser, it communicates via the postMessage API to the main thread's event loop. To enable effective blocking while maintaining responsiveness, the author incorporated the Atomics API for synchronization between the worker thread and the main thread. This mechanism allows the worker to wait for a signal from the main thread using an atomic operation, effectively blocking the worker until the necessary data or event is provided. This synchronization is implemented by sharing memory between the main thread and the worker, allowing the worker to wait on a specific address until the main thread signals completion. This technique permits safe sharing of data, such as buffers for pixel data, allowing the worker to send pointers to the main thread for direct, zero-copy access, and enabling the main thread to fill buffers requested by the worker.
A significant limitation of this shared memory approach is that the main thread cannot transfer browser objects to the workers, because the only channel back to a blocked worker is the shared memory buffer. Separately, the author moved the host logic from a single Wasm blob to TypeScript. One motivation is that the main thread cannot block, so it cannot practically share its memory with the workers wherever synchronization might be involved, which would rule out even a malloc implementation. The author suggests that a Wasm main thread would instead need its own private memory plus a reference to the workers' shared memory, accessed through browser APIs. Workers can safely allocate in shared memory using locks, but the Rust standard library for Wasm is not compiled with atomics support, so the worker build has to rebuild the standard library with a nightly toolchain. The author also favored TypeScript for the host because workers cannot access the DOM, which provides a clean wall between the Rust worker code and the browser hosting code, and because TypeScript offers superior debugging and interactive object inspection compared to Wasm debugging. The main remaining downside noted is serializing complex objects across the host/worker boundary. The overall conclusion is that while writing applications in Wasm is impressive, it is not quite there yet, Rust with shared memory workers is still early, and the author recommends keeping a native build to fall back on for fancier tools.