LmCast :: Stay tuned in

Flash-MoE: Running a 397B Parameter Model on a Laptop

Recorded: March 22, 2026, 10 p.m.

Original Summarized

GitHub - danveloper/flash-moe: Running a big model on a small laptop


danveloper / flash-moe (Public): 1.1k stars, 109 forks

Repository files (144 commits): docs/, metal_infer/, paper/, .gitignore, CLAUDE.md, README.md, expert_index.json, progress.png, progress.py, repack_experts.py, results.tsv

README: Flash-MoE: Running a 397B Parameter Model on a Laptop

Read the paper — Full technical details, 90+ experiments, and the story of how an AI and a human built this in 24 hours.

Pure C/Metal inference engine that runs Qwen3.5-397B-A17B (a 397 billion parameter Mixture-of-Experts model) on a MacBook Pro with 48GB RAM at 4.4+ tokens/second with production-quality output including tool calling.
The entire 209GB model streams from SSD through a custom Metal compute pipeline. No Python. No frameworks. Just C, Objective-C, and hand-tuned Metal shaders.
Results

| Configuration | tok/s | Quality | Notes |
| --- | --- | --- | --- |
| 4-bit experts, FMA kernel | 4.36 | Excellent | Current best. Full tool calling. 209GB on disk. |
| 4-bit experts, baseline | 3.90 | Excellent | Before FMA kernel optimization. |
| 2-bit experts, trust OS | 5.74 | Good* | 120GB on disk. *Breaks JSON/tool calling. |
| 2-bit peak single token | 7.05 | Good* | Warm cache burst. *Not suitable for tool use. |

*2-bit quantization produces \name\ instead of "name" in JSON output, making tool calling unreliable. 4-bit is the production configuration.
Hardware

- Machine: MacBook Pro, Apple M3 Max
- Chip: 16-core CPU (12P + 4E), 40-core GPU, 16-core ANE
- Memory: 48 GB unified (~400 GB/s bandwidth)
- SSD: 1TB Apple Fabric, 17.5 GB/s sequential read (measured)
- macOS: 26.2 (Darwin 25.2.0)

Architecture
The model has 60 transformer layers: 45 GatedDeltaNet (linear attention) + 15 standard full attention. Each layer has 512 experts, of which K=4 are activated per token (plus one shared expert). Hidden dimension is 4096.
Key Techniques

SSD Expert Streaming — Expert weights (209GB at 4-bit) are read from NVMe SSD on demand via parallel pread() with GCD dispatch groups. Only the K=4 active experts per layer are loaded (~6.75MB each). The OS page cache manages caching — no custom cache needed ("Trust the OS" principle). Inspired by Apple's "LLM in a Flash" paper.

FMA-Optimized Dequant Kernel — The inner loop of the 4-bit dequantized matrix-vector multiply rearranges the math from (nibble * scale + bias) * x to fma(nibble, scale*x, bias*x). Pre-computing scale*x and bias*x lets the GPU fused multiply-add unit do dequant+multiply in one instruction. 12% faster than the naive formulation.

Metal Compute Shaders — Hand-written Metal kernels for:

- 4-bit and 2-bit dequantized matrix-vector multiply (tiled, SIMD-reduced, shared input cache, FMA-optimized)
- Fused SwiGLU activation
- RMS normalization (two-pass: sum-of-squares reduction + apply)
- Batched GPU attention (Q@K^T, softmax, scores@V) for full attention layers
- GPU RoPE (fused with Q deinterleave and K normalization)
- MoE combine + residual + sigmoid gate (fused kernel)

Deferred GPU Expert Compute — CMD3 (expert forward pass) is submitted without waiting. The GPU executes it while the CPU prepares the next layer. The combine + residual + norm are also on GPU, feeding directly into the next layer's attention projections.

Accelerate BLAS for Linear Attention — The GatedDeltaNet recurrence uses cblas_sscal, cblas_sgemv, and cblas_sger for the 64-head × 128×128 state matrix update. 64% faster than scalar code.

Trust the OS — No custom expert cache. The OS page cache (~35GB) manages expert data caching via standard LRU. Every custom caching approach we tested (Metal LRU, malloc cache, LZ4 compressed cache) was slower due to GPU memory pressure or overhead. The page cache achieves ~71% hit rate naturally.

Pipeline Per Layer (4.28ms average at 4-bit)
CMD3(prev) → CMD1: attention projections + delta-net [1.22ms GPU]
→ CPU: flush results [0.01ms CPU]
→ CMD2: o_proj + norm + routing + shared [0.55ms GPU]
→ CPU: softmax + topK routing [0.003ms]
→ I/O: parallel pread K=4 experts [2.41ms SSD]
→ CMD3: expert forward + combine + norm [0.04ms encode, DEFERRED]

Unified Memory Constraint
On Apple Silicon, SSD DMA and GPU compute share the same memory controller and cannot be profitably overlapped. The GPU's dequant kernels are bandwidth-saturated at ~418 GiB/s. Even small background SSD DMA causes disproportionate GPU latency spikes through memory controller arbitration. The serial pipeline (GPU → SSD → GPU) is hardware-optimal.
Quick Start
```
cd metal_infer
make

# 4-bit inference (needs packed_experts/ directory)
./infer --prompt "Explain quantum computing" --tokens 100

# 2-bit inference (faster but breaks tool calling)
./infer --prompt "Explain quantum computing" --tokens 100 --2bit

# Interactive chat with tool calling
./chat

# Per-layer timing breakdown
./infer --prompt "Hello" --tokens 20 --timing
```
Project Structure
```
metal_infer/
  infer.m                 # Complete inference engine (~7000 lines)
  shaders.metal           # Metal compute kernels (~1200 lines)
  chat.m                  # Interactive chat TUI with tool calling
  tokenizer.h             # C BPE tokenizer (single-header, 449 lines)
  main.m                  # MoE-only benchmark
  Makefile                # Build system
  extract_weights.py      # Creates model_weights.bin from safetensors
  repack_experts_2bit.py  # 4-bit -> 2-bit expert requantization
  train_predictor.py      # Expert routing prediction analysis
  model_weights.bin       # Non-expert weights (5.5GB, mmap'd)
  model_weights.json      # Tensor manifest
  vocab.bin               # Vocabulary for token decoding
  tokenizer.bin           # Pre-exported BPE tokenizer data
repack_experts.py         # 4-bit expert packing from safetensors
progress.py               # Results visualization (Q2/Q4 tracks)
results.tsv               # Experiment log (58 experiments)
```

What We Tried (and What Worked)
Kept

| Approach | Result | Impact |
| --- | --- | --- |
| FMA dequant kernel | GPU compute -12% | +12% tok/s |
| Trust OS page cache | Deleted Metal LRU → +38% | Foundational |
| GPU combine+norm in CMD3 | Eliminates CPU round-trip | Pipeline |
| BLAS delta-net (Accelerate) | cpu_attn 0.78→0.28ms | +64% attn |
| F_NOCACHE for 2-bit | +3% from avoiding page thrash | 2-bit only |
| GPU fused attention (RoPE) | +2% for full-attn layers | Small |
| C BPE tokenizer | 180ms vs 3500ms startup | 20x startup |
| Deferred CMD3 execution | GPU/CPU overlap | Pipeline |

Discarded (58 experiments, highlights)

| Approach | Result | Why |
| --- | --- | --- |
| LZ4 expert compression | -13% | Decompress overhead > warm cache savings |
| F_RDADVISE prefetch | net 0% | Unified memory: SSD DMA slows GPU -73% |
| Temporal expert prediction | -18% | 25% hit rate, SSD bandwidth waste |
| MLP routing predictor | 31% accuracy | Worse than temporal baseline |
| GPU LUT dequant kernel | -2% | Indirect register access serializes |
| GPU private buffer compression | -20% pipeline | Blit cost 4×7MB > matvec savings |
| Spin-poll GPU wait | -23% | CPU thermal competes with GPU |
| Expert file clustering | 0% | NVMe ignores scatter at 7MB granularity |
| dispatch_io | -70% | dispatch_data management overhead |
| mmap expert files | -5x | Per-page fault overhead on cold data |
| Speculative early routing | -38% | Cache pollution + overhead |
| MTP speculative decoding | break-even | MoE I/O scales per-token (unlike dense) |

Safety
This is a primary development machine. The engine explicitly controls memory:

- Non-expert weights: 5.5GB (mmap'd, read-only)
- Metal scratch buffers: ~200MB
- Total: ~6GB, leaving 42GB for OS + page cache

No OOM risk. Expert data streams from SSD on demand. No custom caches. Trust the OS.

About: Running a big model on a small laptop (10 watching)

Releases/packages: none published

Contributors: danveloper (Dan Woods), claude (Claude)

Languages: Objective-C 59.4% · C 13.6% · TeX 9.7% · Python 8.7% · Metal 7.4% · Shell 0.8% · Makefile 0.4%

Dan Woods’ “flash-moe” project is a remarkable achievement: it runs Qwen3.5-397B-A17B, a 397-billion-parameter Mixture-of-Experts model, on a 48GB MacBook Pro, demonstrating a pathway for deploying large language models in resource-constrained environments. The project’s core innovation is a from-scratch inference engine written in C, Objective-C, and Metal (no Python, no frameworks) that achieves 4.4+ tokens per second with production-quality output, including tool calling. This success hinges on several meticulously engineered techniques.

The project leverages a “trust the OS” strategy, relying on the macOS page cache (which achieves a ~71% hit rate) for expert data management rather than implementing custom caching mechanisms. This approach, inspired by Apple’s “LLM in a Flash” paper, avoids the GPU memory pressure and bookkeeping overhead that made every custom cache the authors tested slower. A key optimization is the FMA-optimized dequantized matrix-vector multiply kernel, which rearranges the inner-loop math from (nibble * scale + bias) * x to fma(nibble, scale*x, bias*x), letting the GPU’s fused multiply-add unit perform dequantization and multiplication in a single instruction for a 12% performance boost.

Furthermore, the project relies on hand-written Metal compute shaders (kernels in the Metal Shading Language, a C++-derived language for Apple’s GPUs) for the model’s core calculations, including 4-bit and 2-bit dequantized matrix-vector multiplication, the fused SwiGLU activation, RMS normalization, and batched GPU attention. The implementation employs deferred GPU expert compute: CMD3 (the expert forward pass) is submitted without waiting, allowing the GPU to execute it while the CPU prepares the next layer. This pipelining overlaps CPU and GPU work and eliminates a CPU round-trip between layers.

The team also recognized a unified-memory constraint on Apple Silicon: SSD DMA transfers and GPU compute share the same memory controller and cannot be profitably overlapped. They therefore adopted a serial per-layer pipeline (GPU → SSD → GPU), having measured that even small concurrent transfers cause disproportionate GPU latency spikes through memory controller arbitration. Additional optimizations included using cblas_sscal, cblas_sgemv, and cblas_sger from the Accelerate framework for the GatedDeltaNet recurrence, a 64% speedup over scalar code, and the deferred CMD3 submission noted earlier, which keeps CPU and GPU busy in parallel.

A further lever is 2-bit quantization, which nearly halves the on-disk footprint (209GB down to 120GB) and the per-token SSD traffic, raising throughput to 5.74 tokens per second while maintaining a “good” level of quality. The trade-off is that 2-bit output garbles JSON escaping and therefore breaks tool calling, so this configuration is suitable only for plain-text generation; 4-bit remains the production setting.

The project's success is also attributable to its meticulous startup optimization: a single-header C BPE (Byte Pair Encoding) tokenizer loads pre-exported tokenizer data in about 180ms, versus roughly 3500ms for a conventional Python implementation, a 20x reduction in startup time.

Central to the success of “flash-moe” is a detailed breakdown of experimented approaches, highlighting what worked and what did not. The project systematically evaluated various techniques—LZ4 compression, speculative execution, and memory management strategies—eliminating those that introduced unacceptable overhead, providing a robust and streamlined configuration. Finally, the authors emphasized the importance of understanding and respecting the constraints of the hardware, specifically the limitations of unified memory on Apple Silicon. Through this rigorous, hands-on approach, Dan Woods demonstrated a surprisingly effective methodology for deploying a massive model, with measurable performance improvements.