Rotary GPU: Exploring Local Execution for Large MoE Models Under Limited VRAM
Recorded: May 31, 2026, 1:01 a.m.
| Original | Summarized |
[2605.29135] Rotary GPU: Exploring Local Execution Paths for Large Mixture-of-Experts Models Under Limited GPU Memory
Computer Science > Performance. arXiv:2605.29135 (cs) [Submitted on 27 May 2026] Abstract: Large language models have achieved remarkable capabilities through scaling, and this paper does not challenge that. It instead investigates a different question: once large models already exist, can they become more accessible to environments with substantially smaller hardware resources? The motivation came from deployment concerns rather than architecture research. Many organizations operate under hardware, budget, security, or closed-network constraints that limit access to large accelerator clusters, and as models continue to improve, deployment accessibility may matter as much as capability itself. This paper presents Rotary GPU, an exploratory execution approach derived from a previously disclosed rotary-based accelerator residency concept. A public validation was conducted using a Qwen3.6-35B-A3B-class Mixture-of-Experts model executed locally on a consumer laptop with an RTX 4060 Laptop GPU containing 8 GB of VRAM. Under the primary configuration, the system generated 2048 output tokens while maintaining approximately 6.3 GB of VRAM usage and an observed decode throughput of 21.06 tokens per second. The goal is not to replace data-center infrastructure but to explore whether some capabilities of large models can be brought closer to environments where such infrastructure is unavailable. The results should be read as exploratory rather than definitive, but they suggest deployment accessibility deserves continued investigation as these models evolve.
Subjects: Performance (cs.PF); Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC). DOI: arXiv-issued DOI via DataCite (pending registration)
Submission history: From Myeong Jun Jo [v1]
|
This paper explores how to run large Mixture-of-Experts models on systems with limited GPU memory. Its motivation is deployment accessibility rather than architecture research: many organizations operate under hardware, budget, security, or closed-network constraints that limit access to large accelerator clusters. The central question is whether the capabilities of models that already exist can be brought to environments with substantially smaller hardware resources. To investigate this, the author presents Rotary GPU, an exploratory execution approach derived from a previously disclosed rotary-based accelerator residency concept. A public validation ran a Qwen3.6-35B-A3B-class Mixture-of-Experts model locally on a consumer laptop with an RTX 4060 Laptop GPU, which has 8 GB of VRAM. Under the primary configuration, the system generated 2048 output tokens while using approximately 6.3 GB of VRAM, with an observed decode throughput of 21.06 tokens per second. The goal is not to replace data-center infrastructure but to test whether some capabilities of large models can be made available where such infrastructure is unavailable. The results should be read as exploratory rather than definitive, but they suggest that deployment accessibility warrants continued investigation as these models evolve. Rotary GPU is positioned as one way to explore local execution of large language models on consumer-grade hardware with limited memory. |
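To put the reported figures in perspective, the short sketch below works through the arithmetic they imply. The token count, throughput, and VRAM numbers come from the abstract. Everything else is an assumption for illustration only: the parameter counts are read off the "35B-A3B" model name (about 35 billion total, about 3 billion active per token), and the 4-bit weight format is hypothetical, since the source does not state how the weights are stored or how Rotary GPU manages them.

```python
# Figures quoted in the abstract.
OUTPUT_TOKENS = 2048
DECODE_TOK_PER_S = 21.06
VRAM_USED_GB = 6.3
VRAM_TOTAL_GB = 8.0

# Assumptions (NOT stated in the source): parameter counts inferred from the
# "35B-A3B" model name, and a hypothetical 4-bit weight format.
TOTAL_PARAMS = 35e9
ACTIVE_PARAMS = 3e9
ASSUMED_BITS_PER_WEIGHT = 4


def weights_gb(params: float, bits: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), ignoring all overheads."""
    return params * bits / 8 / 1e9


# Wall-clock time spent decoding, if throughput was constant over the run.
decode_seconds = OUTPUT_TOKENS / DECODE_TOK_PER_S

# Share of the GPU's memory that the run occupied.
vram_fraction = VRAM_USED_GB / VRAM_TOTAL_GB

# Size of all weights versus the weights active for a single token,
# under the assumed bit width.
all_weights_gb = weights_gb(TOTAL_PARAMS, ASSUMED_BITS_PER_WEIGHT)
active_weights_gb = weights_gb(ACTIVE_PARAMS, ASSUMED_BITS_PER_WEIGHT)

print(f"decode time: {decode_seconds:.1f} s")
print(f"VRAM in use: {vram_fraction:.1%} of {VRAM_TOTAL_GB:.0f} GB")
print(f"all weights at {ASSUMED_BITS_PER_WEIGHT}-bit: {all_weights_gb:.1f} GB")
print(f"active weights at {ASSUMED_BITS_PER_WEIGHT}-bit: {active_weights_gb:.1f} GB")
```

Under these assumptions, generating 2048 tokens takes roughly 97 seconds, and a 4-bit copy of all weights (about 17.5 GB) would not fit in 8 GB of VRAM, while the weights active for any one token (about 1.5 GB) would. That gap is why keeping only part of the model resident on the GPU is a plausible direction, though the abstract does not describe the actual mechanism.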