LmCast :: Stay tuned in

Hypura – A storage-tier-aware LLM inference scheduler for Apple Silicon

Recorded: March 25, 2026, 3 a.m.

Original Summarized

GitHub - t8/hypura: Run models too big for your Mac's memory

t8/hypura

README

 _   _
| | | |_   _ _ __  _   _ _ __ __ _
| |_| | | | | '_ \| | | | '__/ _` |
|  _  | |_| | |_) | |_| | | | (_| |
|_| |_|\__, | .__/ \__,_|_|  \__,_|
       |___/|_|
Run models too big for your Mac's memory

Hypura is a storage-tier-aware LLM inference scheduler for Apple Silicon.
It places model tensors across GPU, RAM, and NVMe tiers based on access
patterns, bandwidth costs, and hardware capabilities — enabling models
that exceed physical memory to run without crashing the system.
Run a 31 GB Mixtral 8x7B on a 32 GB Mac Mini at 2.2 tok/s. A 40 GB Llama 70B at 0.3 tok/s. Vanilla llama.cpp crashes on both.
Why does this matter?
Consumer hardware (MacBook Pro, Mac Studio) ships with fast unified memory
and NVMe storage, but limited capacity. A 32 GB M1 Max cannot naively load
a 40 GB model — the OS will swap-thrash until the OOM killer intervenes.
Hypura solves this by understanding the model architecture:

- Norms and embeddings are tiny but accessed every token — pinned to GPU.
- MoE expert routing exploits sparsity — only 2 of 8 experts fire per token. Router interception identifies the selected experts in the eval callback, then loads only the needed expert strides from NVMe (a 75% I/O reduction). A neuron cache tracks loaded expert slices across tokens, achieving a 99.5% hit rate from temporal locality. Co-activation tracking predicts which experts will fire next for speculative prefetch.
- Dense FFN weights (gate, up, down — ~60% of model size) stream from NVMe through a dynamically sized pool buffer while attention + norms stay GPU-resident. Prefetch lookahead depth scales automatically with available memory.

The result: models that would crash your machine under naive mmap become runnable.
Models that fit in memory run at full Metal GPU speed with zero overhead.
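The neuron-cache idea above amounts to a small LRU cache keyed by (layer, expert), where temporal locality across tokens drives the hit rate. The following is an illustrative Python sketch, not Hypura's Rust implementation; `ExpertCache` and `fake_load` are invented names:

```python
# Illustrative sketch (not Hypura's actual code): an expert cache that keeps
# recently used expert slices in memory and evicts the least recently used
# entry when full.
from collections import OrderedDict

class ExpertCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # (layer, expert) -> weight slice
        self.hits = 0
        self.misses = 0

    def get(self, layer, expert, load_fn):
        key = (layer, expert)
        if key in self.entries:
            self.hits += 1
            self.entries.move_to_end(key)       # mark as most recently used
            return self.entries[key]
        self.misses += 1
        weights = load_fn(layer, expert)        # e.g. a pread() from NVMe
        self.entries[key] = weights
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)    # evict least recently used
        return weights

# Toy usage: a router picks 2 experts per token; repeated picks hit the cache.
cache = ExpertCache(capacity=16)
loads = []
def fake_load(layer, expert):
    loads.append((layer, expert))
    return f"weights[{layer}][{expert}]"

for token in range(100):
    for expert in (token % 3, 7):               # expert 7 fires every token
        cache.get(layer=0, expert=expert, load_fn=fake_load)

print(f"hit rate: {cache.hits / (cache.hits + cache.misses):.2%}")  # 98.00%
```

Only four distinct (layer, expert) keys ever appear, so after warmup every access is a hit; this is the temporal locality the README credits for the 99.5% hit rate on real MoE traces.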
How it works
Hypura reads the GGUF file, profiles your hardware (GPU working set, RAM, NVMe bandwidth),
and solves a placement optimization that assigns every tensor to a tier:

- GPU (Metal) — Attention layers, norms, embeddings. Fastest access, limited by recommendedMaxWorkingSetSize.
- RAM — Overflow layers that don't fit in the GPU working set. Accessed via mmap.
- NVMe — Remaining layers loaded on demand via direct I/O (F_NOCACHE + pread), prefetched ahead of the forward pass.
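The tiering above reduces to a placement pass over the tensor list. A minimal greedy sketch follows; the real scheduler (scheduler/placement.rs) reportedly combines an LP with a greedy pass, and all names and priorities here are hypothetical:

```python
# Simplified greedy sketch of tier placement: hot tensors claim the fastest
# tier first, and whatever fits nowhere falls through to NVMe streaming.
def place_tensors(tensors, gpu_budget, ram_budget):
    """tensors: list of (name, size_bytes, priority); higher priority means
    accessed more often (norms/embeddings highest, cold FFN lowest)."""
    placement = {}
    gpu_used = ram_used = 0
    for name, size, _prio in sorted(tensors, key=lambda t: -t[2]):
        if gpu_used + size <= gpu_budget:
            placement[name] = "gpu"; gpu_used += size
        elif ram_used + size <= ram_budget:
            placement[name] = "ram"; ram_used += size
        else:
            placement[name] = "nvme"   # streamed on demand
    return placement

GB = 1 << 30
plan = place_tensors(
    [("norms", 1 * GB, 3), ("attention", 7 * GB, 2), ("ffn", 32 * GB, 1)],
    gpu_budget=8 * GB, ram_budget=4 * GB)
# norms + attention fit on GPU; the 32 GB FFN exceeds RAM and falls to NVMe
```

With these made-up sizes the plan mirrors the Llama 70B split described later: ~8 GB of attention + norms GPU-resident, ~32 GB of FFN streamed from NVMe.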

Hypura selects the best inference mode automatically based on model size, architecture, and available memory:

- Full-resident — Model fits in GPU+RAM. No NVMe I/O. Full Metal speed.
- Expert-streaming — For MoE models (Mixtral). Only non-expert tensors (~1 GB) stay on GPU. Expert tensors stream from NVMe through a pool buffer on demand, with a neuron cache (99.5% hit rate) that eliminates most I/O after warmup.
- Dense FFN-streaming — For dense models too large for GPU (Llama 70B). Attention + norms stay on GPU (~8 GB). FFN tensors (~32 GB) stream from NVMe through a dynamically sized pool buffer, with scaled prefetch lookahead.

Pool buffer size, prefetch depth, and memory budgets are computed automatically from your hardware profile — no manual tuning needed.
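The automatic mode selection can be pictured as a simple decision rule. This is a hedged sketch with invented names, assuming the choice depends only on whether the model fits in GPU+RAM and whether it is MoE (Hypura's actual logic likely weighs more signals):

```python
# Illustrative mode-selection rule, mirroring the three modes above.
def select_mode(model_bytes, is_moe, gpu_budget, ram_budget):
    if model_bytes <= gpu_budget + ram_budget:
        return "full-resident"        # no NVMe I/O needed
    if is_moe:
        return "expert-streaming"     # exploit MoE sparsity
    return "dense-ffn-streaming"      # stream dense FFN weights from NVMe

GB = 1 << 30
# The three benchmark models from the table below, on a 32 GB machine:
assert select_mode(int(8.4 * GB), False, 20 * GB, 8 * GB) == "full-resident"
assert select_mode(int(30.9 * GB), True, 20 * GB, 8 * GB) == "expert-streaming"
assert select_mode(int(39.6 * GB), False, 20 * GB, 8 * GB) == "dense-ffn-streaming"
```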
Performance
All benchmarks on M1 Max, 32 GB unified memory, ~5.1 GB/s NVMe sequential read.

| Model | Size | GPU | NVMe | Mode | Hypura | llama.cpp | Notes |
|---|---|---|---|---|---|---|---|
| Qwen 2.5 14B Q4_K_M | 8.4 GB | 8.4 GB | — | full-resident | 21 tok/s | ~21 tok/s | Fits in GPU; no overhead |
| Mixtral 8x7B Q5_K_M | 30.9 GB | 1.1 GB | 29.8 GB | expert-streaming | 2.2 tok/s | OOM | All layers on Metal; 99.5% cache hit rate |
| Llama 3.3 70B Q4_K_M | 39.6 GB | 7.8 GB | 31.8 GB | dense-FFN-streaming | 0.3 tok/s | OOM | All layers on Metal; dynamic 24-slot pool, 7-layer prefetch |

Key takeaway: For models that fit in memory, Hypura adds zero overhead. For models that don't fit, Hypura is the difference between "runs" and "crashes." Expert-streaming on Mixtral achieves usable interactive speeds by keeping only non-expert tensors on GPU and exploiting MoE sparsity (only 2/8 experts fire per token). Dense FFN-streaming extends this to non-MoE models like Llama 70B. Pool sizes and prefetch depth scale automatically with available memory.
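A rough sanity check on the streaming numbers: an I/O-bound lower limit can be computed from the NVMe bandwidth alone. This back-of-envelope calculation ignores compute time, prefetch overlap, and cache reuse, so it is a bound, not a model of Hypura's scheduler:

```python
# Back-of-envelope check on the Llama 70B dense-FFN-streaming row above.
nvme_gbps = 5.1     # GB/s sequential read on the benchmark machine
ffn_gb = 31.8       # GB of Llama 70B tensors resident on NVMe

# If every token re-read all NVMe-resident bytes with no reuse at all:
naive_toks = nvme_gbps / ffn_gb
print(f"no-reuse bound: {naive_toks:.2f} tok/s")             # 0.16 tok/s

# The measured 0.3 tok/s sits above that bound, consistent with prefetch
# overlap and pool reuse roughly halving the effective bytes per token:
effective_gb_per_token = nvme_gbps / 0.3
print(f"implied I/O per token: {effective_gb_per_token:.0f} GB")  # 17 GB
```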
Install
Hypura builds from source with Cargo. You'll need Rust 1.75+ and CMake (for the vendored llama.cpp).
git clone --recurse-submodules https://github.com/hypura/hypura.git
cd hypura
cargo build --release
The binary is at target/release/hypura.

Homebrew tap coming soon.

Quick start
# Profile your hardware (runs once, cached)
hypura profile

# Run inference on a GGUF model
hypura run ./model.gguf --prompt "Hello, world"

# Interactive chat
hypura run ./model.gguf --interactive

# Benchmark: Hypura scheduling vs naive baseline
hypura bench ./model.gguf

# Inspect model placement plan without loading
hypura inspect ./model.gguf
Start with --max-tokens 10 on untested models before scaling up.
Ollama-compatible server
Hypura exposes an Ollama-compatible HTTP API, making it a drop-in replacement for any tool that talks to Ollama — including OpenClaw.
hypura serve ./model.gguf
# Hypura serving Mixtral 8x7B Instruct v0.1
# Endpoint: http://127.0.0.1:8080
# Ollama-compatible API: /api/generate, /api/chat, /api/tags
Endpoints

| Endpoint | Description |
|---|---|
| GET / | Health check |
| GET /api/tags | List loaded model |
| GET /api/version | Server version |
| POST /api/show | Model metadata |
| POST /api/generate | Text completion (streaming NDJSON or single response) |
| POST /api/chat | Chat completion (streaming NDJSON or single response) |
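Since the server speaks the Ollama NDJSON protocol, a client reads one JSON object per line and concatenates the response fragments until a chunk reports done. A small offline sketch (the field names follow the public Ollama API; the canned stream is made up rather than captured from Hypura):

```python
# Consuming an Ollama-style NDJSON stream like the one /api/generate emits.
import json

def collect_stream(lines):
    """Concatenate 'response' fragments until a chunk reports done."""
    text = []
    for line in lines:
        chunk = json.loads(line)
        text.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(text)

canned = [
    '{"model": "mixtral", "response": "Hello", "done": false}',
    '{"model": "mixtral", "response": ", world", "done": false}',
    '{"model": "mixtral", "response": "!", "done": true}',
]
print(collect_stream(canned))   # Hello, world!
```

Against a live server the same loop would iterate over the HTTP response body line by line; this is what lets Ollama clients such as OpenClaw work unmodified.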

Usage with OpenClaw
Point OpenClaw at Hypura by setting the Ollama base URL in ~/.openclaw/openclaw.json:
{
  "models": {
    "providers": {
      "ollama": {
        "baseUrl": "http://127.0.0.1:8080",
        "api": "ollama"
      }
    }
  }
}
Or via the CLI:
openclaw config set models.providers.ollama.baseUrl "http://127.0.0.1:8080"
Hypura speaks native Ollama protocol (/api/chat with NDJSON streaming), so no compatibility shims are needed.
Server options
hypura serve <MODEL> [OPTIONS]

Options:
--host <HOST> Host to bind to [default: 127.0.0.1]
--port <PORT> Port to bind to [default: 8080]
--context <N> Maximum context length [default: 4096]

Architecture
Hypura is a Cargo workspace with two crates:

hypura — Main binary and library. CLI in src/main.rs, all logic in src/lib.rs modules.
hypura-sys — FFI bindings to llama.cpp (vendored at vendor/llama.cpp/, built via CMake).

Key modules

| Module | Purpose |
|---|---|
| scheduler/placement.rs | LP + greedy tensor placement across GPU/RAM/NVMe tiers |
| compute/inference.rs | Inference engine: generate_blocking, generate_with_nvme_scheduling, server-oriented load_model / generate_from_loaded |
| compute/nvme_backend.rs | Custom GGML buffer type, pool-based expert/FFN streaming, neuron cache, eval callback |
| server/routes.rs | Axum HTTP handlers for the Ollama-compatible API |
| profiler/ | Hardware detection (CPU, GPU, memory bandwidth, NVMe throughput) |
| cli/bench.rs | A/B benchmark harness |
| model/tensor_role.rs | Tensor classification for placement scoring (norms, attention, MoE experts) |

FAQ
Will this kill my SSD?
No. Hypura only reads from your SSD during inference — it never writes to it.
SSD wear is caused by write cycles (program/erase cycles on NAND flash cells). Reads do not degrade flash cells. Hypura's entire NVMe I/O path uses read-only pread() calls with F_NOCACHE to stream tensor weights from the GGUF file into RAM/GPU memory pools, where all computation happens. The SSD is used as cold storage, not as working memory.
The only writes Hypura performs are negligible: benchmark result JSON files (~KB), co-activation statistics (~KB to ~/.hypura/), and the one-time hypura optimize command if you choose to run it. Normal inference generates zero SSD writes.
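The read-only I/O path described here can be sketched with POSIX primitives. This is an illustrative Python sketch, not Hypura's Rust code; F_NOCACHE is a macOS-specific fcntl, so the sketch simply falls back to a normal cached read on other platforms:

```python
# Read-only streaming sketch matching the FAQ: open the model file, opt out
# of the page cache where supported, and pread() tensor slices at explicit
# offsets. The file is never opened for writing.
import fcntl, os, tempfile

def read_slice(path, offset, length):
    fd = os.open(path, os.O_RDONLY)
    try:
        nocache = getattr(fcntl, "F_NOCACHE", None)   # present on macOS only
        if nocache is not None:
            fcntl.fcntl(fd, nocache, 1)               # bypass the page cache
        return os.pread(fd, length, offset)           # positioned read
    finally:
        os.close(fd)

# Demo against a throwaway file standing in for a GGUF model.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"header" + b"expert-weights" + b"tail")
    path = f.name
print(read_slice(path, 6, 14))   # b'expert-weights'
os.remove(path)
```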
Safety notes

- bench --baseline is blocked when the model exceeds RAM minus 4 GB headroom. Use --force to override at your own risk.
- Always start with --max-tokens 10 on untested models.
- Test models belong in ./test-models/ (not checked in).

License
MIT
Ethics
I feel morally obligated to say I did not write the code in this repository myself. This project is an exploration of using LLMs to carry out tasks based on my direction. The majority of prompts I used to get here were derived using the Socratic method, genuine curiosity, and a hunch that NVMe-backed inference is underutilized, despite NVMe being a (slow but) perfectly valid form of memory.

About

Run models too big for your Mac's memory

Resources

Readme


Stars: 343
Watchers: 0
Forks: 8

Releases

v0.1.0 — Storage-Tier-Aware LLM Inference (latest, Mar 17, 2026)



Contributors

t8 (Tate Berenbaum)

Languages

Rust 91.8%, Shell 5.4%, C 2.8%


Hypura is a storage-tier-aware LLM inference scheduler for Apple Silicon, addressing the challenge of running models that exceed a Mac's physical memory. Developed by Tate Berenbaum (GitHub handle t8), the project distributes model tensors across GPU, RAM, and NVMe storage tiers, optimizing for access patterns, bandwidth costs, and hardware capabilities. This allows models like the 31 GB Mixtral 8x7B and the 40 GB Llama 70B to run without crashing the system, where naive approaches such as vanilla llama.cpp fail.

The system's effectiveness stems from a deep understanding of model architecture. Norms and embeddings, which are tiny but accessed on every token, are pinned to the GPU. For Mixture-of-Experts (MoE) models, Hypura exploits sparsity: only the experts the router selects fire on each token. A router interception mechanism identifies the selected experts in the eval callback and loads only the relevant expert strides from NVMe, reducing I/O by approximately 75%. A neuron cache maintains a 99.5% hit rate for recently accessed expert slices by exploiting temporal locality, and co-activation tracking anticipates future expert activations for speculative prefetch. Dense FFN weights, constituting roughly 60% of model size, stream from NVMe via a dynamically sized pool buffer, with prefetch lookahead depth adapting to available memory.

Hypura operates by reading the GGUF file and profiling the user's hardware (GPU working set, RAM capacity, and NVMe throughput) to formulate an optimized tensor placement plan. The tiers are: GPU (Metal) for attention layers, norms, and embeddings; RAM for overflow tensors that do not fit on the GPU; and NVMe for on-demand loading via direct I/O (pread() with F_NOCACHE) with prefetching. Hypura automatically selects the most appropriate inference mode (full-resident, expert-streaming, or dense FFN-streaming) based on model size, architecture, and available memory. Pool buffer size and prefetch depth are calculated automatically, removing the need for manual tuning.

Performance benchmarks, conducted on an M1 Max with 32 GB of unified memory and a ~5.1 GB/s NVMe drive, demonstrate the approach. Qwen 2.5 14B (Q4_K_M) fits entirely in memory, so full-resident mode matches vanilla llama.cpp at roughly 21 tokens per second with no overhead. Mixtral 8x7B (Q5_K_M) leverages expert streaming and the 99.5% cache hit rate to reach 2.2 tokens per second, where vanilla llama.cpp crashes out of memory. Llama 3.3 70B (Q4_K_M), using dense FFN-streaming, runs at 0.3 tokens per second where llama.cpp likewise crashes.

The key takeaway is that Hypura adds zero overhead for models that fit in memory, running them at full Metal GPU speed, and turns models that would otherwise crash the machine into runnable ones. The system's automation, particularly the automatic sizing of pool buffers and prefetch depth, removes the need for manual tuning and simplifies deployment. This automated placement logic is a clear advantage over naive loading, which often leads to system instability.

The installation process involves cloning the repository, building from source with Cargo, and requires Rust 1.75+ and CMake. The core components comprise the hypura binary and library, and the vendored llama.cpp. The system contains several supporting modules including a scheduler for tensor placement, compute layers for inference, NVMe buffer management, an API for Ollama compatibility, and a profiler for hardware data collection.

Hypura’s architecture features a Cargo workspace with two primary crates: hypura (the main binary & library) and hypura-sys (FFI bindings to llama.cpp). Key modules include a scheduler, an inference engine, NVMe buffer management, and the Ollama compatible API. The system offers various options through command-line arguments like `--host`, `--port`, and `--context` via the hypura serve command.

The project incorporates safety measures: bench --baseline is blocked when the model exceeds RAM minus 4 GB of headroom (overridable with --force), users are advised to start with --max-tokens 10 on untested models to avoid potential crashes, and test models should reside in a dedicated ./test-models/ folder to prevent unintended check-ins.

The project is licensed under MIT and includes an ethics note: the author states that the code was written by LLMs under his direction, guided by Socratic prompting, genuine curiosity, and a hunch that NVMe-backed inference is underutilized. The sole contributor is t8 (Tate Berenbaum).