# Compressed Filesystems á la Language Models

Rohan Gupta
November 25, 2025

Every systems engineer at some point in their journey yearns to write a filesystem. This sounds daunting at first - and writing a battle-tested filesystem is hard - but the minimal surface area for a "working" FS is surprisingly small, simple, and in-distribution for coding agents.

In fact, one of my smoke tests for new coding models is seeing how good of a filesystem they can one-shot! At some point, I had quite a few filesystems lying around - and coding models were getting pretty good - which made me wonder whether the models were intelligent enough to actually model the filesystem engine itself.

A filesystem is the perfect black-box API to model with wacky backends (see "Harder drives"), and besides the joy of training an LLM for fun, there were a few deeper truths about language models that I wanted to explore.

## Training a filesystem

So I set upon training a filesystem. Building on top of one of my throwaway FUSEs, a few rounds with Claude repurposed it to loop back against the host with added logging - the two things I needed to generate reference fine-tuning data:

```python
class LoggingLoopbackFS(LoggingMixIn, Operations):
    """
    A loopback FUSE filesystem that logs all operations for training data.

    This implementation delegates all filesystem operations to a real directory
    on the host filesystem, ensuring perfect semantic correctness while logging
    every operation for LLM training data.
    """
```

I then wrote a filesystem interaction simulator, which sampled various operations against a sandboxed LoggingLoopbackFS to generate diverse FUSE prompt/completion pairs. Concretely, I captured only the minimal set of operations needed for R/W-ish capability (no open, xattrs, fsync, etc.).

Alongside the FUSE operation, I captured the full filesystem state at every turn. I experimented with various formats, including an ASCII-art representation, but ultimately settled on XML since it enforces prompt boundaries clearly and had canonical parsers available.

With prompts including the FUSE operation + XML filesystem tree, the model learned two forms of completions:

- Reads (`<R>`) requested the content / metadata as per the operation (getattr / readdir / read)
- Writes (`<W>`) requested the model to output the full filesystem tree state, after modification (unlink / chmod / truncate / write)

Example prompt (read):

```
<R> read('/usr14/log767.rs', size=4096, offset=0, fh=4)
---
<filesystem>
  <directory path="/" name="/" mode="755" owner="root" group="root" mtime="2025-01-01T00:00:00">
    <directory path="usr14" name="usr14" mode="755" owner="root" group="root" mtime="2025-01-01T00:00:00">
      <file path="usr14/log767.rs" name="log767.rs" mode="644" owner="root" group="root" mtime="2025-01-01T00:00:01" size="276">
        <body>fn main() {
    match process(7) {
        Ok(result) => println!("Result: {}", result),
        Err(e) => eprintln!("Error: {}", e),
    }
</body>
      </file>
      <file path="usr14/temp912.sh" name="temp912.sh" mode="644" owner="root" group="root" mtime="2025-01-01T00:00:01" size="268">
        <body>#!/bin/bash
echo "temp912" || exit 1
</body>
      </file>
    </directory>
  </directory>
</filesystem>
```

Completion:

```
fn main() {
    match process(7) {
        Ok(result) => println!("Result: {}", result),
        Err(e) => eprintln!("Error: {}", e),
    }
}
```
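For a concrete sense of the state snapshots, here is a minimal sketch of how a directory tree might be serialized into the XML format above. The `dump_tree` and `_attrs` helpers are illustrative, not the simulator's actual code, and the owner/group lookup assumes a Unix host:

```python
import os
import pwd
import grp
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

def _attrs(full_path: str, rel_path: str, name: str) -> dict:
    """Common attributes for a node, mirroring the XML state format above."""
    st = os.lstat(full_path)
    return {
        "path": rel_path,
        "name": name,
        "mode": oct(st.st_mode & 0o777)[2:],             # e.g. "755"
        "owner": pwd.getpwuid(st.st_uid).pw_name,        # Unix-only lookups
        "group": grp.getgrgid(st.st_gid).gr_name,
        "mtime": datetime.fromtimestamp(st.st_mtime, tz=timezone.utc)
                         .strftime("%Y-%m-%dT%H:%M:%S"),
    }

def dump_tree(root: str) -> str:
    """Serialize a directory tree into an XML snapshot like the prompt examples."""
    def build(full_path: str, rel_path: str, name: str) -> ET.Element:
        if os.path.isdir(full_path):
            node = ET.Element("directory", _attrs(full_path, rel_path, name))
            for child in sorted(os.listdir(full_path)):
                child_full = os.path.join(full_path, child)
                child_rel = child if rel_path == "/" else f"{rel_path}/{child}"
                node.append(build(child_full, child_rel, child))
        else:
            attrs = _attrs(full_path, rel_path, name)
            attrs["size"] = str(os.path.getsize(full_path))
            node = ET.Element("file", attrs)
            body = ET.SubElement(node, "body")
            with open(full_path, errors="replace") as f:
                body.text = f.read()
        return node

    fs = ET.Element("filesystem")
    fs.append(build(root, "/", "/"))
    ET.indent(fs)  # pretty-print (Python 3.9+)
    return ET.tostring(fs, encoding="unicode")
```

Pairing a snapshot like this with the sampled operation string gives a complete prompt; the completion is either the requested bytes (for reads) or the post-operation snapshot (for writes).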
## Fine-tuning

Once I had clean, representative, and diverse filesystem simulation data, actually running SFT was pretty straightforward on Modal. Over a few iteration cycles spread across nibbles of spare time, I ended up with ~98% accuracy on a hold-out eval after 8 epochs of SFT on an N=15,000 dataset with Qwen3-4B.

Most of my time here was spent cleaning generated data and ensuring we represented every FUSE operation sufficiently + generated enough "complex" trees to learn on.

At this point, I wrote … possibly the smallest filesystem I've seen … to give my model a spin in the real world. Every FUSE operation was a passthrough to the LLM, for example:

```python
class LLMFuse(LoggingMixIn, Operations):
    ...
    def chmod(self, path, mode):
        """Change file permissions."""
        response = self._query_llm_for_operation('chmod', path, mode=oct(mode))
        if not self._handle_llm_response(response):
            raise FuseOSError(ENOENT)
        return 0
    ...
```

Nice! I now had a mountable FUSE that was entirely "implemented" by a language model. As you can see below, I was able to ls around it, echo into files, and cat them back out.

*Poking around a Docker container with a mounted LLMFuse.*

## Compressing the filesystem

Perhaps the largest glaring inefficiency in this setup is the sheer verbosity of the XML-based representation. I was using many bytes to represent attributes and tree structure that could be encoded far more efficiently (~O(bits)) in a standard C struct.

However, by fine-tuning on the XML filesystem tree representation, I was baking this very structure into the weights and probability distributions of my Qwen fork! If only there was a way to leverage this to compress state…

## Two sides of the same coin

As it turns out, compression and AI are intimately related. Using LLMs to lossily compress text is one of the most common applications, so it's not entirely unintuitive. However, one researcher (Marcus Hutter) claimed back in 2006 that they are equivalent (and in fact bet $500K on this claim!).

Presciently, Hutter appears to be absolutely right. His enwik8 and enwik9 benchmark datasets are, today, best compressed by a 169M parameter LLM (trained by none other than Fabrice Bellard in 2023).

That's a bit perplexing at first glance. Surely LLM compression isn't reversible? What kind of voodoo magic was going on here?

## Arithmetic coding

The algorithm that enables reversible compression using LLMs is called "arithmetic coding", and it builds upon a 1948 result by Claude Shannon. Researchers at DeepMind (including Hutter himself) have explained the math in detail, so I'll direct the most inquisitive of you readers there, but for a simplified understanding of what's going on, forget everything you might know about working with LLMs today. There's no prompting involved!

Let's assume the following is true for some predictive model \(M\):

- "Lorem" has first-word probability = 0.57.
- "Ipsum" has second-word conditional probability = 0.67 (joint 0.38).
- "Dolor" has third-word conditional probability = 0.5 (joint 0.19).

…and so on and so forth until you reach the end of the string you want to compress, at which point you end up with some "final interval width" \(P(m)\) on the real interval \([0,1]\) which represents your string.

Let's suppose in our example this turns out to be 0.012. We can represent this decimal in roughly \(-\log_{2}{P(m)} = 6.4\) bits, which is our final compression size.
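To make the interval-narrowing concrete, here is a toy encoder sketch: a hand-specified model stands in for the LLM, and Python floats stand in for the fixed-point arithmetic a production coder would use.

```python
import math

def arithmetic_encode(tokens, next_distribution):
    """
    Toy arithmetic encoder: narrow [low, high) by each token's share of the
    current interval. `next_distribution(prefix)` returns an ordered
    {token: probability} mapping for the next position.
    """
    low, high = 0.0, 1.0
    for i, tok in enumerate(tokens):
        dist = next_distribution(tokens[:i])
        width = high - low
        cursor = low
        for candidate, p in dist.items():
            if candidate == tok:
                low, high = cursor, cursor + p * width
                break
            cursor += p * width
        else:
            raise ValueError(f"model assigned no mass to {tok!r}")
    # Any number inside [low, high) identifies the string; ~ -log2(width) bits suffice.
    bits = math.ceil(-math.log2(high - low)) + 1
    return low, high, bits

# A hand-specified model matching the example probabilities above.
def toy_model(prefix):
    table = {
        (): {"Lorem": 0.57, "other": 0.43},
        ("Lorem",): {"Ipsum": 0.67, "other": 0.33},
        ("Lorem", "Ipsum"): {"Dolor": 0.5, "other": 0.5},
    }
    return table[tuple(prefix)]

low, high, bits = arithmetic_encode(["Lorem", "Ipsum", "Dolor"], toy_model)
print(f"interval width = {high - low:.3f}, about {bits} bits")  # width = 0.191
```

Floats are used here only for readability; a real coder works with integer ranges and renormalizes as it goes to avoid precision loss.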
There are a few elegant things about this algorithm:

- Any number within this interval is uniquely determined by tracing the arithmetic coding algorithm through the specific probabilistic model's weights.
- "Decoding" is simply a retracing operation (see the line through the probability distributions above).
- The inverse log relationship between predictive power \(P(m)\) and compression pushes the burden of the "hard compression problem" to deep learning machinery, which can encode high-dimensional text patterns within model weights, yielding far better compression ratios than deterministic algorithms.

Sounds cool! But how good really is this compression? Comparing arithmetic coding backed by Qwen3-4B against gzip for lipsum.txt, we already see pretty dramatic results:

| Method | Size (bytes) | Compression impact |
|---|---|---|
| Original (plain) | 446 | — |
| gzip | 298 | ~33% smaller |
| llmencode | 13 | ~97% smaller |

(note: llmencode is my implementation of arithmetic coding)

22x better compression than gzip is pretty ridiculous! A caveat here is that lipsum.txt is heavily represented in training data, but 5-20x efficiency gains broadly hold for all text data that looks like it's been on the internet.

## Self-compression

Now, back to our filesystem. The XML overhead we were worried about can now be "compressed away" by the fine-tuned model. Using the same toy filesystem from the Docker container demo above:

```
<filesystem>
  <directory path="/" name="/" mode="755" owner="root" group="root" mtime="2025-01-01T00:00:00">
    <directory path="testdir" name="testdir" mode="755" owner="root" group="root" mtime="2025-01-01T00:00:00" />
    <file path="testfile.txt" name="testfile.txt" mode="644" owner="root" group="root" mtime="2025-01-01T00:00:01" size="14">
      <body>hello llmfuse
</body>
    </file>
  </directory>
</filesystem>
```

| Model | Original (bytes) | Compressed (bytes) | Ratio |
|---|---|---|---|
| Base Qwen3-4B | 394 | 38 | 10.4x |
| Fine-tuned Qwen3-4B | 394 | 21 | 18.8x |

The fine-tuned model achieves 44.7% better compression on XML filesystem trees - the very format it was trained to predict. This is the "self-compression" effect: by baking the XML structure into the model weights during fine-tuning, the arithmetic coder can represent that structure in fewer bits.

Self-compression in filesystems isn't a novel concept. For example, there exists the squashfs tool (created in 2002) to create read-only compressed filesystems. Squashfs compresses files, inodes, and directories together, not unlike what we're doing here!

Under the hood, squashfs just wraps gzip/zstd/your favourite compression algorithm. So for plain-text data, squashfs compression stats pale in comparison to llmfuse:

| Method | Compressed size | Notes |
|---|---|---|
| squashfs (gzip) | 171 bytes | gzip-compressed file contents, inodes, directory tables |
| llmfuse (fine-tuned) | 21 bytes | Arithmetic-coded XML state |

For the same filesystem tree (one directory, one 14-byte text file), llmfuse achieves ~8x better compression than squashfs (see methodology in appendix).

The difference comes down to llmencode being far better than gzip on text data + XML structure - especially when the model has been fine-tuned on exactly that structure.
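As a sanity check on numbers like these, the ideal arithmetic-coded size can be estimated directly from token log-probabilities: a coder spends about \(-\log_2 p\) bits per token, where \(p\) is the model's probability of that token given its prefix. A rough sketch with Hugging Face transformers (the model id is a stand-in; this is not the exact measurement script behind the tables above):

```python
import math
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

def ideal_compressed_bytes(text: str, model_id: str = "Qwen/Qwen3-4B") -> float:
    """Estimate the arithmetic-coded size of `text` under a causal LM."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    model.eval()

    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                     # [1, seq_len, vocab]

    # Probability of each token given its prefix (the first token is left uncoded here).
    logprobs = F.log_softmax(logits[:, :-1, :], dim=-1)
    token_lp = logprobs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    total_bits = -token_lp.sum().item() / math.log(2)
    return total_bits / 8

# e.g. ideal_compressed_bytes(open("lipsum.txt").read())
```

This gives the lower bound an arithmetic coder approaches; a real encoder adds a small constant overhead for termination.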
## Conclusion

What started off as a little experiment mostly to get my hands dirty with training and inference evolved into a full-blown nerd snipe and intellectual adventure. Thanks for making it this far!

I entirely recognize that this is a "toy" experiment under a very specific setup; with that said, the numbers above are pretty eye-popping, and the question I've been trying to answer as I write this up is: does this have any real-world potential?

Of course, in the short term, there's a whole host of caveats: you need an LLM, likely a GPU, all your data is in the context window (which we know scales poorly), and this only works on text data.

Still, it's intriguing to wonder whether the very engines that will likely dominate all "text generation" going forward can be used to compress their own data. Perhaps in a distant future, where running LLMs at the edge makes sense, or for specific kinds of workflows where data is read very infrequently.

Overall, I'm grateful to Peyton at Modal for the compute credits. Running a somewhat unconventional experiment like this wouldn't have been possible without full control over the training and inference code, and extremely tedious without the simplicity of running ML infra on Modal! It's truly awesome to be able to just `modal deploy` and get my own private inference endpoints, or just `modal run` to prototype some code on the cloud.

## Appendix

### Source Code

All of the source code for this experiment, particularly llmfuse and llmencode, is open-sourced under MIT.

llmencode is abstracted into a CLI utility that you can run locally. Inference on 4B models is slow, but entirely possible on consumer hardware. I prototyped most of this code by running on a 2021 MacBook Pro, before productionizing on Modal.

A fun experiment / party trick to identify how "common" a certain string is in training data is to look at its llmencode compression ratio!

### SquashFS comparison methodology

The raw .sqsh file is 4096 bytes due to block alignment padding. To find the actual compressed size, I used xxd to inspect the binary and found the last non-zero byte at offset 266 (267 bytes total). Subtracting the fixed 96-byte superblock header gives us 171 bytes of actual gzip-compressed content - everything needed to reconstruct the filesystem.

### Compression as a metric

It's equally interesting to think about compression as a metric. An angle I'd considered is doing some kind of RL on the arithmetic-coded compression number itself.

Is that simply equivalent to the pre-training objective (due to the prediction-compression duality)? Or does the "sequence-level" objective add something more… interesting to the mix? Please reach out if you have thoughts!