LmCast :: Stay tuned in

Arcee Trinity Mini: US-Trained MoE Model

Recorded: Dec. 2, 2025, 3:04 a.m.


Arcee AI | Arcee Debuts Trinity Mini, Expanding Its U.S.-Built Model Line

The Trinity Manifesto

Lucas Atkins • December 1, 2025

Arcee introduces Trinity Mini, a compact MoE model trained end-to-end in the U.S., offering open weights, strong reasoning, and full control for developers.

Over the last year, anyone who cares about open-weight language models has been watching Chinese labs. Qwen, DeepSeek, and others now define much of what "state-of-the-art open MoE" looks like. In the United States, most of the action has centered on polishing other people's checkpoints.

At Arcee AI we want to add something that has been missing from that picture: a serious open-weight model family trained end to end in America, by an American company, with weights that businesses and developers can actually own.

That family is Trinity.

Trinity Nano and Trinity Mini are available now. Trinity Large is currently training on 2048 B300 GPUs and will arrive in January 2026.

Trinity Mini is our fully post-trained reasoning model. Trinity Nano Preview is something different: a personality-forward chat model that pushes the limits of sparsity, with only 800M non-embedding parameters active per token across 56 layers and 128 experts. It's charming, it's fun to talk to, and it may be unstable in edge cases. This is an experimental release, not a thinking model. Nano Preview is available to download from Hugging Face but won't be hosted on our API.

This is the story of why we decided to go all in on pretraining, how Nano and Mini came to life, and where Trinity is headed next.

Why we decided to own pretraining

For a while, our strategy looked like everyone else's: take a strong open base, post-train it hard, wire it into tools and RAG, and ship. That approach carried us very far. You can get impressive behavior with a good base, careful data, and an instruction stack that matches the product.

At the same time, a few pressures kept building:

- Ceilings on certain workloads: On some high-stakes use cases, we kept iterating on post-training and could see clear diminishing returns. Failure patterns pointed back to missing capabilities in the foundation, not to tuning mistakes.
- Jurisdictional safety: Enterprise buyers are increasingly asking where the base model came from, what data went into it, and which licenses govern it. "We fine-tuned a model with unknown data provenance" does not satisfy compliance officers. An end-to-end US data pipeline offers legal certainty that foreign black-box models cannot.
- Long-term product vision: We strongly believe that within two years, all meaningful AI applications will look like systems that grow and learn inside the environments where their users interact with them. Those systems will adapt their own training loops and build and train directly from live usage. To build that kind of software you need to control the weights and the training pipeline, not only the instruction layer.

We still use and appreciate great open-source models from others. We just came to the conclusion that if we want to offer truly long-lived, self-improving systems to customers, we also need to train our own foundations.

AFM-4.5B: proving we could do this

Our first step was AFM-4.5B, a dense 4.5B model trained on about 8 trillion curated tokens in partnership with DatologyAI.

AFM-4.5B was our "can we do this at all" experiment:

- Stand up large-scale data (with DatologyAI) and training pipelines.
- Validate that careful and considered data curation gives clean scaling behavior.
- Get real experience with training end to end.

It worked. AFM-4.5B gave us a solid base of training and infrastructure practices, and showed us where to focus on capability improvements, especially around math and code. Those lessons feed directly into Trinity.

From AFM to Trinity Nano and Mini

Trinity is our open-weight MoE family. We chose to leap directly toward the frontier and then worked backward from that goal, which meant designing Nano and Mini as the two form factors that could both serve real users today and teach us how to train something far larger.

- Trinity Nano Preview: 6B-parameter MoE (1B active, ~800M non-embedding), 56 layers, 128 experts with 8 active per token
- Trinity Mini: 26B-parameter MoE (3B active), fully post-trained reasoning model

Both are released under Apache 2.0. Download Nano Preview and Mini from Hugging Face. Mini is also available through our API and OpenRouter; Nano Preview is download-only.

Originally we thought of Nano and Mini strictly as training wheels for Trinity Large. The plan was to iron out our MoE recipe, then move on. In practice, these models came out strong enough that they are now serious production targets:

- They are compact reasoning models tuned for agents, tools, and other reasoning-heavy workloads, with average output length comparable to current instruct models.
- They are some of the most cost-efficient models in the world, with API pricing of $0.045 / $0.15 for Trinity Mini, plus a free tier with rate limits to back that up.
- They anchor a preview of our own chat and API platform at chat.arcee.ai, which will also host Trinity Large.

The Trinity architecture

Building on our AFM naming convention, we refer to the Trinity architecture as afmoe. It integrates leading architectural advances such as gated attention and Muon within a clean, US-controlled data pipeline. Here is what the stack looks like.

Attention

The attention mechanism combines several techniques that have proven effective at scale. We use grouped-query attention, mapping multiple query heads to each key-value head to reduce memory bandwidth during inference. Before computing scaled dot-product attention, we apply RMSNorm to the queries and keys (QK-norm), which stabilizes training.

We also use gated attention, specifically the G1 configuration from the Qwen paper. After SDPA, the output is elementwise-gated before the output projection: out_proj(sdpa_out * σ(gate_proj(x))). This gives the model a learned ability to modulate attention outputs per position.

Finally, we adopt a local/global attention pattern at a 3:1 ratio: three local attention layers with RoPE are followed by one global attention layer without positional embeddings (NoPE). This pattern reduces compute on long sequences while preserving the model's ability to reason over distant context.

Normalization

For layer normalization, we use a simplified version of depth-scaled sandwich norm. Each sublayer computes output = x + norm(module(norm(x))). To enable stable training at depth, we initialize the gamma parameters of each norm layer to 1/sqrt(L), where L is the total layer count. We also apply a norm before the language modeling head.
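
To make the attention and normalization recipe concrete, here is a minimal PyTorch sketch of one gated GQA layer wrapped in the depth-scaled sandwich-norm residual described above. The dimensions, head counts, truncation of RoPE/NoPE and the 3:1 local/global scheduling, and the use of nn.RMSNorm are illustrative assumptions; this is a sketch of the technique, not Arcee's production code.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedGQAAttention(nn.Module):
    """Grouped-query attention with per-head QK-norm and a sigmoid output gate (G1-style)."""
    def __init__(self, dim: int = 1024, n_heads: int = 16, n_kv_heads: int = 4):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.q_proj = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.gate_proj = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.out_proj = nn.Linear(n_heads * self.head_dim, dim, bias=False)
        self.q_norm = nn.RMSNorm(self.head_dim)  # QK-norm, applied per head
        self.k_norm = nn.RMSNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        q = self.q_norm(self.q_proj(x).view(B, T, self.n_heads, self.head_dim))
        k = self.k_norm(self.k_proj(x).view(B, T, self.n_kv_heads, self.head_dim))
        v = self.v_proj(x).view(B, T, self.n_kv_heads, self.head_dim)
        # GQA: each key/value head serves n_heads // n_kv_heads query heads.
        rep = self.n_heads // self.n_kv_heads
        k, v = k.repeat_interleave(rep, dim=2), v.repeat_interleave(rep, dim=2)
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # (B, heads, T, head_dim)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        attn = attn.transpose(1, 2).reshape(B, T, -1)
        # G1 gated attention: out_proj(sdpa_out * sigma(gate_proj(x))).
        return self.out_proj(attn * torch.sigmoid(self.gate_proj(x)))

class SandwichSublayer(nn.Module):
    """Residual wrapper computing x + norm(module(norm(x))), gains initialized to 1/sqrt(L)."""
    def __init__(self, dim: int, module: nn.Module, total_layers: int):
        super().__init__()
        self.module = module
        self.pre_norm, self.post_norm = nn.RMSNorm(dim), nn.RMSNorm(dim)
        gamma = 1.0 / math.sqrt(total_layers)  # depth-scaled init for stability at depth
        nn.init.constant_(self.pre_norm.weight, gamma)
        nn.init.constant_(self.post_norm.weight, gamma)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.post_norm(self.module(self.pre_norm(x)))

# Usage: one attention sublayer of a hypothetical 56-layer stack.
block = SandwichSublayer(1024, GatedGQAAttention(1024), total_layers=56)
out = block(torch.randn(2, 16, 1024))  # (batch=2, seq=16, dim=1024)
```

The gate is computed from the block input x, so each position can scale its own attention output before the output projection, which is the G1 behavior described above.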

Mixture-of-Experts

Our MoE layers follow the DeepSeekMoE design: fine-grained experts plus a shared expert. Each MoE layer has 128 total routed experts, of which 8 are active per token, alongside 1 shared expert that is always active. The first two layers of the model are dense rather than sparse, providing a shared representational foundation before specialization begins, which we found improves training stability early.

For routing, we use sigmoid routing as introduced in DeepSeek-V3. Routing scores are computed with sigmoid followed by normalization rather than softmax. We also adopt the aux-loss-free load-balancing scheme: an independently updated bias term determines routing decisions but is excluded from the weighting computation for each expert's contribution. This eliminates the need for auxiliary load-balancing losses that can distort the training objective.
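
A rough sketch of that routing scheme is shown below. It assumes a standalone router over flattened token activations; the expert MLPs and the always-on shared expert are omitted, and the bias-update rule is a deliberate simplification rather than the exact DeepSeek-V3 procedure.

```python
import torch
import torch.nn as nn

class SigmoidRouter(nn.Module):
    """Top-k sigmoid routing with an aux-loss-free load-balancing bias."""
    def __init__(self, dim: int, n_experts: int = 128, top_k: int = 8):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(dim, n_experts, bias=False)
        # Load-balancing bias, updated outside the gradient path.
        self.register_buffer("balance_bias", torch.zeros(n_experts))

    def forward(self, x: torch.Tensor):
        scores = torch.sigmoid(self.gate(x))                      # (tokens, n_experts)
        # Selection uses score + bias, so overloaded experts can be steered away from.
        _, expert_idx = torch.topk(scores + self.balance_bias, self.top_k, dim=-1)
        # Combination weights use the raw sigmoid scores, normalized over the chosen
        # experts; the bias never touches the weighting of each expert's contribution.
        picked = torch.gather(scores, -1, expert_idx)
        weights = picked / picked.sum(dim=-1, keepdim=True)
        return expert_idx, weights

    @torch.no_grad()
    def update_bias(self, expert_load: torch.Tensor, step: float = 1e-3):
        # Simplified balancing update: push the bias of overloaded experts down and
        # underloaded experts up, relative to the mean load.
        self.balance_bias -= step * torch.sign(expert_load - expert_load.mean())

# Usage with dummy activations for 32 tokens of width 1024.
router = SigmoidRouter(dim=1024)
idx, w = router(torch.randn(32, 1024))   # idx: (32, 8) expert ids, w: (32, 8) weights
```

The bias only shifts which experts win the top-k; the combination weights come from the raw sigmoid scores, so the balancing signal never enters the loss, which is what makes the scheme auxiliary-loss-free.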


Initialization

We initialize all trainable parameters from a truncated normal distribution with standard deviation 0.5/sqrt(dim). During the forward pass, we multiply the embedding output by sqrt(dim).

Optimizer

We train with Muon, using the distributed implementation from Microsoft's Dion repository. To transfer learning rates across parameter shapes, we set adjusted_lr = lr * sqrt(max(1, fan_out / fan_in)), which we empirically observe enables optimal learning rate transfer when scaling. We sweep the Adam learning rate and Muon learning rate separately. The learning rate schedule we use is WSD (warmup-stable-decay). We apply no weight decay to embeddings.
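
The initialization and learning-rate rules above are easy to state in code. The sketch below is illustrative only: the truncation bounds, example shapes, and WSD warmup/decay fractions are assumptions, and the actual run uses the distributed Muon implementation from the Dion repository rather than anything shown here.

```python
import math
import torch.nn as nn

def init_weights(module: nn.Module, dim: int):
    """Truncated-normal init with std = 0.5 / sqrt(dim) for all trainable parameters.
    The +/- 3*std truncation bounds are an assumption, not a stated detail."""
    std = 0.5 / math.sqrt(dim)
    for p in module.parameters():
        if p.requires_grad:
            nn.init.trunc_normal_(p, std=std, a=-3 * std, b=3 * std)

def adjusted_lr(lr: float, fan_out: int, fan_in: int) -> float:
    """Shape-aware Muon learning rate: lr * sqrt(max(1, fan_out / fan_in))."""
    return lr * math.sqrt(max(1.0, fan_out / fan_in))

def wsd_lr(step: int, total_steps: int, peak: float,
           warmup_frac: float = 0.01, decay_frac: float = 0.10) -> float:
    """Warmup-Stable-Decay: linear warmup, long flat plateau, linear decay to zero.
    The warmup/decay fractions here are placeholders, not Arcee's actual settings."""
    warmup_end = int(warmup_frac * total_steps)
    decay_start = int((1.0 - decay_frac) * total_steps)
    if step < warmup_end:
        return peak * step / max(1, warmup_end)
    if step < decay_start:
        return peak
    return peak * (total_steps - step) / max(1, total_steps - decay_start)

# Example: a 1024 -> 4096 up-projection gets 2x the base LR, a 4096 -> 1024 projection 1x.
print(adjusted_lr(0.02, fan_out=4096, fan_in=1024))  # 0.04
print(adjusted_lr(0.02, fan_out=1024, fan_in=4096))  # 0.02
```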


Infrastructure

Training runs on a modified version of TorchTitan in bf16 precision. Nano and Mini trained on 512 H200 GPUs using an HSDP parallelism setup with a global batch size of 8192 sequences at 4096 tokens each.

Context extension

We only expanded the global attention layers during context extension, which allowed the model to learn extended sequence lengths very quickly. Trinity Nano was trained at 256k sequence length (inference at 128k), and Trinity Mini was trained at 128k sequence length.

Data and training

Trinity Nano and Mini train on 10T tokens, organized into three phases with progressively higher quality and STEM concentration: 7T tokens in phase 1, 1.8T tokens in phase 2, and 1.2T tokens in phase 3. This curriculum allows the model to build broad coverage early and then sharpen on high-signal data. The mix reuses our curated AFM dataset and adds substantially more math and code.

Datology continued to be a key partner on the data side. On the compute and systems side we worked closely with Prime Intellect. They not only served the H100 clusters Datology used to generate synthetic data, they have been deeply involved in helping scale our training setup to the GPU footprint required for a fully frontier-sized model, including the current 2048 B300 GPU configuration for Trinity Large.
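
The three-phase curriculum above can be restated as a simple token budget. The per-phase characterizations in the comments are paraphrased from the description above, not an official breakdown.

```python
# Three-phase pretraining curriculum for Trinity Nano and Mini, in billions of tokens.
PHASES = [
    ("phase 1", 7000),  # broad coverage early
    ("phase 2", 1800),  # higher quality, rising STEM concentration
    ("phase 3", 1200),  # highest-signal data
]
assert sum(tokens for _, tokens in PHASES) == 10_000  # 10T tokens total
```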

Training at this scale is hard

MoE training at scale is messy. There is no polite way to say it. It is fucking hard. Here's how we prepared for Trinity Large:

- Over the last month, Datology has generated 10 trillion unique synthetic tokens on clusters that peaked at 2048 H100 GPUs.
- We pair those with 10 trillion web tokens to build a 20T-token dataset.
- Prime Intellect's infrastructure and operational experience have been crucial here, from synthetic data generation runs to the ongoing 2048 B300 GPU training job for Trinity Large.

The work is demanding, but it is also where most of the fun is. Every bug we chase and every learning curve we climb feeds directly into models that anyone can download and build upon.

Why owning weights matters

Looking forward, we see a clear pattern. As applications get more ambitious, the boundary between "model" and "product" keeps moving. Systems will:

- Learn from the behavior of specific user populations.
- Grow new skills from interactions with other tools and services.

Those systems will blur the distinction between pretraining data, synthetic data, post-training tasks, and live feedback. They will evolve continuously in the environments where they are deployed.

To do that responsibly and effectively, you need control of the weights and the training loop. You need to decide what kind of data the model sees, what objectives it optimizes, and how its capabilities change over time.

Our goal with Trinity is to provide that foundation for businesses, enterprises, and developers who want ownership rather than a black box.

Trinity Large and what comes next

All of this leads to Trinity Large.

- It trains on a 20T-token dataset, half synthetic and half web, built together with Datology and backed by Prime Intellect's compute infrastructure.
- It uses the same core MoE recipe as Nano and Mini, extended to a fully frontier-sized configuration.
- The training run is currently underway on 2048 B300 GPUs, targeting release in January 2026.

For most of this post we have talked about principles, data, and architecture without naming the final size. Trinity Large is a 420B-parameter model with 13B active parameters per token.

Nano and Mini exist to make that possible, and to give the community strong open models to use right now while Large trains. When Trinity Large ships, we will release a full technical report covering how we went from a 4.5B dense model to an open frontier MoE in just over six months.

Try Trinity Nano and Mini today

If you care about open-weight models, and you want an American MoE family that aims squarely at the frontier while staying fully permissive, we invite you to start working with Trinity today.

- Download the weights at huggingface.com/arcee-ai.
- Call the models through OpenRouter with generous free tiers.
- Experiment with our preview chat and API platform at chat.arcee.ai.

Break them. Push them. Tell us where they shine and, more importantly, where they fail. That feedback will shape Trinity Large and everything that follows. We are building these models so that you can own them.



The Trinity Manifesto – Arcee AI

Arcee AI’s Trinity project represents a significant shift in the landscape of open-weight language models, driven by a desire to move beyond the trend of polishing existing checkpoints. Lucas Atkins, in December 2025, outlines the core motivations and technical details behind this ambitious undertaking – a fully US-built, open-weight Mixture-of-Experts (MoE) family designed for long-term, self-improving AI systems. This document details the strategic rationale, technical architecture, and development process behind Trinity, aiming to provide developers and enterprises with true ownership and control.

The fundamental impetus for Trinity stems from the limits of fine-tuning external base models. Arcee AI identified persistent failure patterns (gaps in foundational capabilities) that could not be addressed by post-training alone. The risks of relying on black-box models, particularly around jurisdictional safety and data provenance, also fueled the need for a fully transparent, end-to-end US-controlled data pipeline. Anticipating a future in which AI systems grow and learn within their deployed environments, Arcee aimed to build a foundation for truly adaptable, self-improving AI.

Trinity is structured around three models: Nano, Mini, and Large. Trinity Nano, a 6B-parameter MoE (approximately 1B active parameters per token), serves as an experimental, personality-forward release. Trinity Mini, a 26B-parameter MoE (approximately 3B active parameters per token), is the production-ready reasoning model, with output lengths comparable to current instruct models at significantly lower cost. The ultimate goal, Trinity Large, is a 420B-parameter MoE with 13B active parameters per token, trained on a massive 20T-token dataset (half synthetic, half web) across 2048 B300 GPUs. This large-scale model is intended to be the flagship offering, designed to enable complex, self-improving AI systems.

The technical architecture of Trinity (referred to as afmoe) leverages several recent advances: grouped-query attention with QK-norm, gated attention in the Qwen G1 configuration, RMSNorm in a depth-scaled sandwich-norm arrangement, and a DeepSeekMoE-style Mixture-of-Experts design with 128 routed experts per layer, of which 8 are active per token alongside an always-on shared expert. Routing uses sigmoid scores with an independently updated bias term for aux-loss-free load balancing. Training uses the Muon optimizer, via the distributed implementation from Microsoft's Dion repository, on a modified version of TorchTitan. A 3:1 local/global attention pattern reduces compute on long sequences, and context extension brings Nano to a 256k training sequence length (128k at inference) and Mini to 128k.

The development process involved considerable investment in data curation, with rising STEM concentration across approximately 10T tokens organized into three phases. DatologyAI and Prime Intellect play key roles in data generation and infrastructure support. This robust data pipeline, combined with the compute infrastructure (512 H200 GPUs for Nano and Mini, and 2048 B300 GPUs for Trinity Large), is critical to producing a high-performing MoE model. The training runs themselves rely on HSDP parallelism, bf16 precision, and a modified TorchTitan stack.

Arcee AI’s strategic focus on Trinity represents a deliberate move toward providing a foundational platform for truly adaptable and self-improving AI systems. By offering developers and enterprises full ownership of the model weights and training pipeline, the company aims to drive innovation in areas like agent-based systems and personalized AI applications. The ultimate goal is to build AI systems that can learn and evolve continuously within their deployed environments, unlocking the true potential of intelligent systems.