Transformers Are Bayesian Networks
Recorded: March 25, 2026, 3 a.m.
[2603.17063] Transformers are Bayesian Networks
arXiv:2603.17063 (cs) — Computer Science > Artificial Intelligence. Submitted on 17 Mar 2026 [v1] by Gregory Coppola.
Abstract: Transformers are the dominant architecture in AI, yet why they work remains poorly understood. This paper offers a precise answer: a transformer is a Bayesian network. We establish this in five ways.
Gregory Coppola’s paper, “Transformers are Bayesian Networks,” presents a rigorous argument that transformer architectures fundamentally operate as Bayesian networks. Coppola establishes the claim through five lines of reasoning, supported by formal verification against established mathematical axioms.

The central proposition is that a sigmoid transformer, regardless of its weights (whether trained, random, or architecturally defined), embodies weighted loopy belief propagation, the standard approximate-inference procedure for Bayesian networks: each layer executes a single round of belief propagation. Coppola further proves that a transformer can implement exact belief propagation on a knowledge base, yielding provably correct probability assessments provided the knowledge base contains no circular dependencies. The transformer’s alternating structure reinforces the characterization: the attention mechanism is modeled as Pearl’s gather/update step, and the feed-forward network (FFN) as an OR operation. Experimental validation corroborates the theory, showing that loopy belief propagation is practical even without a formal convergence guarantee.

Coppola then extends the analysis to verifiable inference, arguing that it requires a finite concept space: any finite verification procedure can distinguish only a limited number of concepts, bounding what the model can accurately represent. On this view, hallucination in large language models is not a scaling problem but a structural consequence of operating without explicit conceptual grounding.
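The layer-as-one-round-of-belief-propagation idea can be sketched in a few lines. This is a hedged toy illustration of my own, not the paper’s actual construction: the gather step aggregates other nodes’ beliefs with softmax weights (playing the role of attention), and the update step combines old belief with gathered evidence via a noisy-OR (playing the role of the FFN-as-OR operation).

```python
import numpy as np

# Toy sketch (illustrative only, not the paper's construction):
# one "layer" = one round of weighted loopy belief propagation.
# Node i holds a belief b_i = P(concept_i is true).

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def bp_round(beliefs, compat):
    """One gather/update round.

    beliefs: (n,) current marginal beliefs in [0, 1]
    compat:  (n, n) pairwise compatibility logits (plays the role
             of query-key attention scores)
    """
    weights = softmax(compat)        # gather: attention-like weights
    gathered = weights @ beliefs     # weighted message aggregation
    # update: noisy-OR of old belief with gathered evidence
    return 1.0 - (1.0 - beliefs) * (1.0 - gathered)

beliefs = np.array([0.9, 0.1, 0.5])
compat = np.zeros((3, 3))            # uniform attention for the demo
updated = bp_round(beliefs, compat)
```

Stacking `bp_round` calls mirrors stacking transformer layers: beliefs are refined round by round, with no guarantee of convergence in the loopy case, exactly the situation the paper’s experiments probe.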
Coppola’s argument underscores the importance of defining correctness: it cannot exist without a defined concept space, so scaling alone cannot resolve this inherent limitation. The work offers a mathematically precise understanding of transformers, framing them not as black boxes but as formal implementations of Bayesian inference.
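The finite-concept-space point can be made concrete with a toy example (my own sketch, assuming nothing beyond the summary above): when the concept space is finite, exact inference over an acyclic network is possible by brute-force enumeration, and any claimed probability can be checked against that ground truth, which is precisely what an unbounded, ungrounded concept space forbids.

```python
from itertools import product

# Tiny acyclic network: Rain -> WetGrass. With a finite concept
# space {Rain, WetGrass}, every joint assignment can be enumerated,
# so correctness of an inferred marginal is decidable.

p_rain = 0.3
p_wet_given = {True: 0.9, False: 0.2}   # P(WetGrass=T | Rain)

def joint(rain, wet):
    pr = p_rain if rain else 1 - p_rain
    pw = p_wet_given[rain] if wet else 1 - p_wet_given[rain]
    return pr * pw

# Exact marginal P(WetGrass=T) by exhaustive enumeration.
p_wet = sum(joint(r, True) for r in (True, False))

# Sanity check: the joint distribution sums to 1.
total = sum(joint(r, w) for r, w in product((True, False), repeat=2))
```

Here `p_wet` works out to 0.3 * 0.9 + 0.7 * 0.2 = 0.41, and any inference procedure claiming a different value is verifiably wrong, a verification step that has no analogue when the concept space is not fixed in advance.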