Hypura – A storage-tier-aware LLM inference scheduler for Apple Silicon
Recorded: March 25, 2026, 3 a.m.
Original (captured GitHub page):
GitHub - t8/hypura: Run models too big for your Mac's memory
main branch · 44 commits · Top-level files: benches/, benchmarks/, hypura-sys/, patches/, src/, tests/, vendor/, .gitignore, .gitmodules, CLAUDE.md, Cargo.lock, Cargo.toml, README.md, RESEARCH_INTEGRATION_PLAN.md

README:
Hypura is a storage-tier-aware LLM inference scheduler for Apple Silicon. The result: models that would crash your machine under naive mmap become runnable.

Tiers:
- GPU (Metal) — attention layers, norms, embeddings. Fastest access, limited by recommendedMaxWorkingSetSize. Norms and embeddings are tiny but accessed every token, so they are pinned to GPU.

Hypura selects the best inference mode automatically based on model size, architecture, and available memory:
- Full-resident — model fits in GPU+RAM. No NVMe I/O. Full Metal speed.
- Expert-streaming — on Mixtral, achieves usable interactive speeds by keeping only non-expert tensors on GPU and exploiting MoE sparsity (only 2/8 experts fire per token).
- Dense FFN-streaming — extends this to non-MoE models like Llama 70B.

Pool buffer size, prefetch depth, and memory budgets are computed automatically from your hardware profile — no manual tuning needed. Pool sizes and prefetch depth scale automatically with available memory.

Benchmarked models: Qwen 2.5 14B Q4_K_M, Mixtral 8x7B Q5_K_M, Llama 3.3 70B Q4_K_M.

Key takeaway: for models that fit in memory, Hypura adds zero overhead. For models that don't fit, Hypura is the difference between "runs" and "crashes."

Homebrew tap coming soon.
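The tiering described above (pin hot tensors to GPU, spill the rest to RAM, stream the remainder from NVMe) can be sketched as a greedy budget-spill pass. This is an illustrative sketch, not Hypura's actual scheduler; all names and sizes below are assumptions:

```rust
// Hypothetical tier-placement sketch: "hot" tensors (norms, embeddings,
// attention) are pinned to GPU first; everything else fills GPU, then RAM,
// then falls back to NVMe streaming.

#[derive(Debug, PartialEq, Clone, Copy)]
enum Tier { Gpu, Ram, Nvme }

struct Tensor { name: &'static str, bytes: u64, hot: bool }

/// Greedy placement: assign each tensor to the fastest tier with budget
/// left, visiting hot tensors before cold ones.
fn place(tensors: &[Tensor], mut gpu_budget: u64, mut ram_budget: u64) -> Vec<(&'static str, Tier)> {
    let (hot, cold): (Vec<_>, Vec<_>) = tensors.iter().partition(|t| t.hot);
    let mut plan = Vec::new();
    for t in hot.into_iter().chain(cold) {
        let tier = if t.bytes <= gpu_budget {
            gpu_budget -= t.bytes; Tier::Gpu
        } else if t.bytes <= ram_budget {
            ram_budget -= t.bytes; Tier::Ram
        } else {
            Tier::Nvme // streamed on demand
        };
        plan.push((t.name, tier));
    }
    plan
}

fn main() {
    let tensors = [
        Tensor { name: "tok_embd", bytes: 1, hot: true },
        Tensor { name: "attn.0",   bytes: 4, hot: true },
        Tensor { name: "ffn.0",    bytes: 8, hot: false },
        Tensor { name: "ffn.1",    bytes: 8, hot: false },
    ];
    // GPU fits 10 "units", RAM fits 8: ffn.0 spills to RAM, ffn.1 to NVMe.
    for (name, tier) in place(&tensors, 10, 8) {
        println!("{name} -> {tier:?}");
    }
}
```

The real placement logic also weighs access patterns and bandwidth costs; the budget-spill shape is just the simplest version of the idea.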
Quick start:
# Run inference on a GGUF model
# Interactive chat
# Benchmark: Hypura scheduling vs naive baseline
# Inspect model placement plan without loading

Ollama-compatible endpoints: GET /, GET /api/tags, GET /api/version, POST /api/show, POST /api/generate, POST /api/chat. Usage with OpenClaw supported.

Architecture: hypura — main binary and library. CLI in src/main.rs, all logic in src/lib.rs modules. Key modules: scheduler/placement.rs, compute/inference.rs, compute/nvme_backend.rs, server/routes.rs, profiler/, cli/bench.rs, model/tensor_role.rs.

FAQ: bench --baseline is blocked when the model exceeds RAM minus 4 GB headroom. Use --force to override at your own risk.

About: Run models too big for your Mac's memory. 343 stars, 8 forks. Releases: v0.1.0 — Storage-Tier-Aware LLM Inference (latest).
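The FAQ's baseline guard reduces to a single predicate. A minimal sketch, with hypothetical function names (only the rule itself comes from the README):

```rust
// Sketch of the FAQ safety rule: refuse `bench --baseline` when the model
// would not leave 4 GB of RAM headroom, unless --force is passed.

const HEADROOM: u64 = 4 * 1024 * 1024 * 1024; // 4 GB

/// True when running the naive baseline is considered safe (or forced).
fn baseline_allowed(model_bytes: u64, ram_bytes: u64, force: bool) -> bool {
    force || model_bytes <= ram_bytes.saturating_sub(HEADROOM)
}

fn main() {
    let gb = 1024u64 * 1024 * 1024;
    // 31 GB Mixtral on a 32 GB machine: blocked, because mmap'ing the whole
    // model would leave under 4 GB of headroom and likely OOM the system.
    assert!(!baseline_allowed(31 * gb, 32 * gb, false));
    assert!(baseline_allowed(31 * gb, 32 * gb, true)); // --force overrides
    println!("guard ok");
}
```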
Contributors: t8
Languages: Rust, Shell, C
Summarized:

Hypura is a storage-tier-aware LLM inference scheduler for Apple Silicon, built to run models that exceed a Mac's physical memory. Developed by Tate Berenbaum (GitHub: t8), it distributes model tensors across GPU, RAM, and NVMe storage tiers, optimizing for access patterns, bandwidth costs, and hardware capabilities. This lets models like the 31 GB Mixtral 8x7B and the 40 GB Llama 70B run on machines where naive approaches with tools like llama.cpp would crash.

The system's effectiveness stems from a deep understanding of model architecture. Norms and embeddings, accessed on every token, are pinned to the GPU. For Mixture-of-Experts (MoE) models, Hypura exploits sparsity: only the necessary experts activate per token, and a router-interception mechanism loads only the relevant expert strides from NVMe, reducing I/O by approximately 75%. A neuron cache exploits temporal locality to keep recently accessed expert slices resident, maintaining a 99.5% hit rate, while co-activation tracking speculatively prefetches experts likely to fire next. Dense FFN weights, constituting roughly 60% of model size, stream from NVMe through a dynamically sized pool buffer, with prefetch lookahead depth adapting to available memory.

At startup, Hypura reads the GGUF file and profiles the user's hardware (the GPU's working-set limit, RAM capacity, and NVMe throughput) to formulate an optimized tensor placement plan across three tiers: GPU (Metal) for attention layers, norms, and embeddings; RAM for overflow tensors that do not fit on the GPU; and NVMe for on-demand loading using direct I/O (pread() with F_NOCACHE) and prefetching.
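The router-interception savings quoted above are easy to quantify: with top-2 routing over 8 experts, only a quarter of the expert bytes need to be read. A hedged sketch with illustrative sizes (not Hypura's internals):

```rust
// Why expert-streaming cuts NVMe traffic: a Mixtral layer stores 8 expert
// FFNs, but the router activates only 2 per token. Intercepting the router
// lets us read 2 expert strides instead of all 8 (a ~75% I/O reduction).

/// Indices of the two largest router logits (the experts that fire).
fn top2(logits: &[f32]) -> (usize, usize) {
    let mut idx: Vec<usize> = (0..logits.len()).collect();
    idx.sort_by(|&a, &b| logits[b].partial_cmp(&logits[a]).unwrap());
    (idx[0], idx[1])
}

/// Bytes read from NVMe for one MoE layer given the number of active experts.
fn bytes_streamed(expert_stride: u64, n_active: u64) -> u64 {
    expert_stride * n_active
}

fn main() {
    let logits = [0.1, 0.9, 0.2, 0.05, 0.3, 0.1, 0.15, 0.2]; // 8 experts
    let (a, b) = top2(&logits);
    let stride = 512 * 1024 * 1024u64; // illustrative per-expert slice size
    let naive = bytes_streamed(stride, 8);  // read every expert
    let routed = bytes_streamed(stride, 2); // read only experts a and b
    println!("experts {a},{b}: {}% less I/O", 100 - routed * 100 / naive);
}
```

The neuron cache and co-activation prefetch then shave most of the remaining reads off the critical path, which is where the 99.5% hit rate comes in.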
Hypura automatically selects the most appropriate inference mode (full-resident, expert-streaming, or dense FFN-streaming) based on model size, architecture, and available memory, each mode being optimized for specific model characteristics. Pool buffer size and prefetch depth are calculated automatically, removing the need for manual tuning.

Performance benchmarks, conducted on an M1 Max Mac with 32 GB of unified memory and a 5.1 GB/s NVMe drive, illustrate the three modes. Qwen 2.5 14B (Q4_K_M) runs full-resident at approximately 21 tokens per second, i.e. at full Metal speed. Mixtral 8x7B (Q5_K_M) leverages expert streaming and the 99.5% cache hit rate to reach 2.2 tokens per second, where vanilla llama.cpp simply crashes. Llama 3.3 70B (Q4_K_M) uses dense FFN-streaming to sustain 0.3 tokens per second instead of crashing.

The key takeaway: for models that fit in memory, Hypura adds zero overhead and delivers full Metal GPU speed; for models that do not fit, it is the difference between running and crashing. The automatic adjustment of pool sizes and prefetch depths removes manual tuning and avoids the system instability that naive approaches often cause. Installation involves cloning the repository and building from source with Cargo; Rust 1.75+ and CMake are required. The core components are the hypura binary and library plus the vendored llama.cpp.
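The automatic pool and prefetch sizing could plausibly look like the following; the real heuristics are Hypura's own, and the constants here (25% of free RAM, a depth cap of 8) are assumptions for illustration:

```rust
// Illustrative auto-tuning sketch: derive a streaming pool size and a
// prefetch lookahead depth from free memory and the per-slice stride,
// clamping depth so prefetch can never outrun the pool.

/// Returns (pool_bytes, prefetch_depth) for a given hardware profile.
fn pool_config(free_bytes: u64, stride: u64) -> (u64, usize) {
    let pool = free_bytes / 4; // reserve 25% of free RAM for the pool
    let depth = ((pool / stride).min(8) as usize).max(1); // 1..=8 slices ahead
    (pool, depth)
}

fn main() {
    let gb = 1024u64 * 1024 * 1024;
    // With 8 GB free and 512 MB slices: 2 GB pool, lookahead of 4 slices.
    let (pool, depth) = pool_config(8 * gb, 512 * 1024 * 1024);
    println!("pool = {} MiB, prefetch depth = {depth}", pool / (1024 * 1024));
}
```

The point of scaling depth with the pool is that each in-flight prefetch owns a pool slot; a fixed depth would either waste memory on small machines or starve the NVMe queue on large ones.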
The codebase is a Cargo workspace with two primary crates: hypura (the main binary and library) and hypura-sys (FFI bindings to llama.cpp). Supporting modules cover tensor placement scheduling, the inference engine, NVMe buffer management, the Ollama-compatible API server, and a hardware profiler. The hypura serve command exposes options such as --host, --port, and --context.

The project incorporates safety measures: bench --baseline is refused when the model exceeds RAM minus 4 GB of headroom (overridable with --force); untested models should start with a small generation limit (--max-tokens 10) to avoid potential crashes; and test models should reside in a dedicated folder to prevent unintended check-ins. Hypura is MIT-licensed and includes a note acknowledging that the project grew out of Socratic inquiry and a hunch that NVMe could support inference. t8 is the sole contributor.
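The neuron cache's temporal-locality behavior, part of the NVMe buffer management described above, can be modeled with a small LRU. This is an illustrative structure, not Hypura's implementation; the 99.5% figure is the project's reported hit rate, not something this toy reproduces:

```rust
use std::collections::VecDeque;

// Toy "neuron cache": keep recently used expert slices resident, refresh
// recency on each hit, and evict the least recently used slice when full.

struct NeuronCache {
    cap: usize,
    slots: VecDeque<u32>, // slice ids, most recently used at the back
    hits: u64,
    lookups: u64,
}

impl NeuronCache {
    fn new(cap: usize) -> Self {
        Self { cap, slots: VecDeque::new(), hits: 0, lookups: 0 }
    }

    /// Returns true on a hit; on a miss the slice is "fetched" and cached.
    fn get(&mut self, id: u32) -> bool {
        self.lookups += 1;
        if let Some(pos) = self.slots.iter().position(|&s| s == id) {
            self.slots.remove(pos);
            self.slots.push_back(id); // refresh recency
            self.hits += 1;
            true
        } else {
            if self.slots.len() == self.cap {
                self.slots.pop_front(); // evict LRU slice
            }
            self.slots.push_back(id);
            false
        }
    }

    fn hit_rate(&self) -> f64 {
        self.hits as f64 / self.lookups as f64
    }
}

fn main() {
    let mut cache = NeuronCache::new(4);
    // Consecutive tokens tend to reuse the same experts (temporal locality),
    // so even a tiny cache absorbs most lookups.
    for id in [1, 2, 1, 2, 1, 3, 1, 2] {
        cache.get(id);
    }
    println!("hit rate: {:.2}", cache.hit_rate());
}
```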