Flash-MoE: Running a 397B Parameter Model on a Laptop
Recorded: March 22, 2026, 10 p.m.
Original:
GitHub - danveloper/flash-moe: Running a big model on a small laptop
Repository files: docs/, metal_infer/, paper/, .gitignore, CLAUDE.md, README.md, expert_index.json, progress.png, progress.py, repack_experts.py, results.tsv (144 commits)

README

Read the paper: full technical details, 90+ experiments, and the story of how an AI and a human built this in 24 hours.

Pure C/Metal inference engine that runs Qwen3.5-397B-A17B (a 397-billion-parameter Mixture-of-Experts model) on a MacBook Pro with 48GB RAM at 4.4+ tokens/second with production-quality output, including tool calling.

Configurations benchmarked:
- 4-bit experts, FMA kernel
- 4-bit experts, baseline
- 2-bit experts, trust OS
- 2-bit, peak single token

*2-bit quantization produces \name\ instead of "name" in JSON output, making tool calling unreliable. 4-bit is the production configuration.

Machine: MacBook Pro, Apple M3 Max

Architecture

SSD Expert Streaming: Expert weights (209GB at 4-bit) are read from NVMe SSD on demand via parallel pread() with GCD dispatch groups. Only the K=4 active experts per layer are loaded (~6.75MB each). The OS page cache manages caching, so no custom cache is needed (the "Trust the OS" principle). Inspired by Apple's "LLM in a Flash" paper.

FMA-Optimized Dequant Kernel: The inner loop of the 4-bit dequantized matrix-vector multiply rearranges the math from (nibble * scale + bias) * x to fma(nibble, scale*x, bias*x). Pre-computing scale*x and bias*x lets the GPU's fused multiply-add unit do dequant + multiply in one instruction, 12% faster than the naive formulation.
Metal Compute Shaders: Hand-written Metal kernels for the 4-bit and 2-bit dequantized matrix-vector multiply (tiled, SIMD-reduced, shared input cache, FMA-optimized).

Deferred GPU Expert Compute: CMD3 (the expert forward pass) is submitted without waiting; the GPU executes it while the CPU prepares the next layer. The combine + residual + norm also run on the GPU, feeding directly into the next layer's attention projections.

Accelerate BLAS for Linear Attention: The GatedDeltaNet recurrence uses cblas_sscal, cblas_sgemv, and cblas_sger for the 64-head × 128×128 state-matrix update, 64% faster than scalar code.

Trust the OS: No custom expert cache. The OS page cache (~35GB) manages expert data caching via standard LRU. Every custom caching approach tested (Metal LRU, malloc cache, LZ4 compressed cache) was slower due to GPU memory pressure or overhead. The page cache achieves a ~71% hit rate naturally.

Pipeline Per Layer (4.28ms average at 4-bit)

Unified Memory Constraint

Usage:
- 2-bit inference (faster but breaks tool calling)
- Interactive chat with tool calling
- Per-layer timing breakdown
- repack_experts.py: 4-bit expert packing from safetensors

What We Tried (and What Worked)

Kept:
- FMA dequant kernel
- Trust OS page cache
- GPU combine+norm in CMD3
- BLAS delta-net (Accelerate)
- F_NOCACHE for 2-bit
- GPU fused attention (RoPE)
- C BPE tokenizer
- Deferred CMD3 execution

Discarded (58 experiments, highlights):
- LZ4 expert compression
- F_RDADVISE prefetch
- Temporal expert prediction
- MLP routing predictor
- GPU LUT dequant kernel
- GPU private buffer compression
- Spin-poll GPU wait
- Expert file clustering
- dispatch_io
- mmap expert files
- Speculative early routing
- MTP speculative decoding

Safety: Non-expert weights: 5.5GB (mmap'd, read-only)

About: Running a big model on a small laptop
Activity: 1.1k stars · 10 watching · 109 forks
Contributors: danveloper, claude
Languages: Objective-C, C, TeX, Python, Metal, Shell, Makefile
Summarized:
Dan Woods’ “flash-moe” project is a notable achievement: it runs the 397-billion-parameter Qwen3.5-397B-A17B Mixture-of-Experts model on a MacBook Pro, demonstrating a pathway for deploying very large language models in resource-constrained environments. The core of the project is a complete rewrite of the inference engine in C and Metal, bypassing Python frameworks entirely, and it achieves 4.4+ tokens per second with production-quality output, including tool calling.

This result rests on several carefully engineered techniques. The project follows a “trust the OS” strategy, relying on the macOS page cache for expert data management rather than implementing custom caching. This approach, inspired by Apple’s “LLM in a Flash” paper, reduces GPU memory pressure and avoids the bottlenecks the team measured with every custom cache they tried. A key optimization is the FMA-optimized dequantized matrix-vector multiply kernel, which accelerates the inner loop of the 4-bit computation by rearranging the math to use the GPU’s fused multiply-add (FMA) unit, a 12% performance gain. The project also relies on hand-written compute kernels in Apple’s Metal Shading Language for the model’s core operations, including 4-bit and 2-bit dequantized matrix-vector multiplication, the SwiGLU activation function, RMS normalization, and batched GPU attention.

The implementation employs deferred GPU expert compute: CMD3 (the expert forward pass) is submitted without waiting, so the GPU executes it while the CPU prepares the next layer. This pipelining minimizes CPU-GPU synchronization overhead. The team also recognized that unified memory on Apple Silicon prevents efficient overlap between SSD DMA transfers and GPU compute.
They therefore adopted a serial per-layer pipeline (GPU → SSD → GPU) to maximize bandwidth utilization, since concurrent transfers introduced significant latency spikes from memory-controller arbitration. Further optimizations include using cblas_sscal, cblas_sgemv, and cblas_sger from the Accelerate framework for the GatedDeltaNet recurrence, a 64% speedup over scalar code, and deferring GPU waits to expose more parallelism.

The 2-bit quantization configuration deserves a caveat. It lowers memory requirements and raises peak throughput, but it corrupts JSON output (emitting \name\ instead of "name"), which makes tool calling unreliable. It is therefore suitable only for tasks that do not depend on structured output; 4-bit remains the production configuration. The project also attends to startup cost, replacing a Python tokenizer with a C BPE (byte pair encoding) tokenizer for a substantial reduction in startup time.

Central to “flash-moe” is a detailed record of what was tried and what worked. The team systematically evaluated dozens of techniques, including LZ4 compression, speculative execution, and various memory-management strategies, discarding the 58 experiments that introduced unacceptable overhead and keeping a streamlined configuration. Throughout, the authors emphasize understanding and respecting the hardware’s constraints, particularly unified memory on Apple Silicon. Through this rigorous, hands-on approach, Dan Woods demonstrates a surprisingly effective methodology for deploying a massive model on a laptop, with measurable performance improvements at every step.