Rotary GPU: Exploring Local Execution for Large MoE Models Under Limited VRAM

Recorded: May 31, 2026, 1:01 a.m.

[2605.29135] Rotary GPU: Exploring Local Execution Paths for Large Mixture-of-Experts Models Under Limited GPU Memory

Computer Science > Performance

arXiv:2605.29135 (cs)

[Submitted on 27 May 2026]
Title: Rotary GPU: Exploring Local Execution Paths for Large Mixture-of-Experts Models Under Limited GPU Memory
Authors: Myeong Jun Jo

Abstract: Large language models have achieved remarkable capabilities through scaling, and this paper does not challenge that. It instead investigates a different question: once large models already exist, can they become more accessible to environments with substantially smaller hardware resources? The motivation came from deployment concerns rather than architecture research. Many organizations operate under hardware, budget, security, or closed-network constraints that limit access to large accelerator clusters, and as models continue to improve, deployment accessibility may matter as much as capability itself. This paper presents Rotary GPU, an exploratory execution approach derived from a previously disclosed rotary-based accelerator residency concept. A public validation was conducted using a Qwen3.6-35B-A3B-class Mixture-of-Experts model executed locally on a consumer laptop with an RTX 4060 Laptop GPU containing 8 GB of VRAM. Under the primary configuration, the system generated 2048 output tokens while maintaining approximately 6.3 GB of VRAM usage and an observed decode throughput of 21.06 tokens per second. The goal is not to replace data-center infrastructure but to explore whether some capabilities of large models can be brought closer to environments where such infrastructure is unavailable. The results should be read as exploratory rather than definitive, but they suggest deployment accessibility deserves continued investigation as these models evolve.

Comments: 10 pages, 3 figures. Also archived at Zenodo (DOI: https://doi.org/10.5281/zenodo.20406471). Related to Korean Patent Publication KR 10-2026-0070380
Subjects: Performance (cs.PF); Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC)
ACM classes: C.1.4; I.2.7
Cite as: arXiv:2605.29135 [cs.PF] (or arXiv:2605.29135v1 [cs.PF] for this version)
https://doi.org/10.48550/arXiv.2605.29135 (arXiv-issued DOI via DataCite, pending registration)
Related DOI: https://doi.org/10.5281/zenodo.20406471

Submission history
From: Myeong Jun Jo
[v1] Wed, 27 May 2026 21:57:36 UTC (12 KB)

This paper explores how to run large Mixture-of-Experts models on systems with limited GPU memory. Its motivation is deployment accessibility rather than architecture research: many organizations face hardware, budget, security, or closed-network constraints that limit access to large accelerator clusters. The central question is whether the capabilities of existing large models can be brought to environments with substantially smaller hardware resources. To investigate this, the author presents Rotary GPU, an exploratory execution approach derived from a previously disclosed rotary-based accelerator residency concept.

The approach was validated in a public experiment that ran a Qwen3.6-35B-A3B-class Mixture-of-Experts model locally on a consumer laptop with an RTX 4060 Laptop GPU and 8 GB of VRAM. Under the primary configuration, the system generated 2048 output tokens while holding VRAM usage at approximately 6.3 GB, with an observed decode throughput of 21.06 tokens per second.
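
To put the reported figures in context, the following is a minimal back-of-the-envelope sketch in Python. It uses only the numbers stated above (2048 tokens, 21.06 tokens per second, about 6.3 GB of 8 GB VRAM). The function names are hypothetical, and the weight-footprint helper rests on two assumptions that the source does not confirm: that "35B" in the model name denotes roughly 35 billion total parameters, and that weights are stored at some chosen number of bits per parameter. It is not a description of how Rotary GPU works.

```python
# Figures reported for the primary configuration.
OUTPUT_TOKENS = 2048
DECODE_TOKENS_PER_S = 21.06
VRAM_USED_GB = 6.3   # approximate, as reported
VRAM_TOTAL_GB = 8.0


def decode_seconds(tokens: int, tokens_per_s: float) -> float:
    """Wall-clock decode time implied by a constant throughput."""
    return tokens / tokens_per_s


def vram_headroom_gb(used_gb: float, total_gb: float) -> float:
    """VRAM left unused on the card."""
    return total_gb - used_gb


def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes).

    ASSUMPTION: both arguments are illustrative; the source does not state
    the parameter count precisely or the weight format used.
    """
    return params_billion * bits_per_weight / 8


if __name__ == "__main__":
    secs = decode_seconds(OUTPUT_TOKENS, DECODE_TOKENS_PER_S)
    print(f"decode time: {secs:.1f} s")
    print(f"VRAM headroom: {vram_headroom_gb(VRAM_USED_GB, VRAM_TOTAL_GB):.1f} GB")
    # Hypothetical encodings, for scale only.
    for bits in (16, 4):
        print(f"{bits}-bit weights: {weight_footprint_gb(35, bits):.1f} GB")
```

With the reported figures, generating 2048 tokens at 21.06 tokens per second takes roughly 97 seconds, and about 1.7 GB of the 8 GB card remains unused. Under the hypothetical 4-bit assumption, 35 billion parameters would need about 17.5 GB for weights alone, which is more than the card holds, so not all weights could be resident in VRAM at once.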

The goal of this work is not to replace data-center infrastructure but to explore whether some capabilities of large models can be made available where such infrastructure is unavailable. The results should be read as exploratory rather than definitive, but they suggest that deployment accessibility deserves continued investigation as these models evolve. The work positions Rotary GPU as one way to explore local execution of large language models on consumer-grade hardware with limited memory.