Prompt Politeness Affects LLM Accuracy (2025)
Recorded: May 27, 2026, 8 a.m.
| Original | Summarized |
[2510.04950] Mind Your Tone: Investigating How Prompt Politeness Affects LLM Accuracy (short paper)
Computer Science > Computation and Language
arXiv:2510.04950 (cs) [Submitted on 6 Oct 2025]
Abstract: The wording of natural language prompts has been shown to influence the performance of large language models (LLMs), yet the role of politeness and tone remains underexplored. In this study, we investigate how varying levels of prompt politeness affect model accuracy on multiple-choice questions. We created a dataset of 50 base questions spanning mathematics, science, and history, each rewritten into five tone variants: Very Polite, Polite, Neutral, Rude, and Very Rude, yielding 250 unique prompts. Using ChatGPT 4o, we evaluated responses across these conditions and applied paired sample t-tests to assess statistical significance. Contrary to expectations, impolite prompts consistently outperformed polite ones, with accuracy ranging from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts. These findings differ from earlier studies that associated rudeness with poorer outcomes, suggesting that newer LLMs may respond differently to tonal variation. Our results highlight the importance of studying pragmatic aspects of prompting and raise broader questions about the social dimensions of human-AI interaction.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Methodology (stat.ME)
Submission history: From Om Dobariya [v1]
Authors: Om Dobariya and Akhil Kumar
|
Om Dobariya and Akhil Kumar investigated how prompt politeness and tone affect the accuracy of large language models (LLMs), a factor that remains underexplored despite existing research on prompt wording. They built a dataset of 50 base multiple-choice questions spanning mathematics, science, and history, and rewrote each question in five tone variants (Very Polite, Polite, Neutral, Rude, and Very Rude), yielding 250 unique prompts. They evaluated ChatGPT 4o on every variant and used paired sample t-tests to assess whether the differences in accuracy were statistically significant. Contrary to expectations, impolite prompts consistently outperformed polite ones: accuracy ranged from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts. This contrasts with earlier studies that associated rudeness with poorer outcomes and suggests that newer LLMs may respond differently to tonal variation. The authors conclude that the pragmatic aspects of prompting deserve further study and that the results raise broader questions about the social dimensions of human-AI interaction. |
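For readers unfamiliar with the statistical test named in the summary, the following is a minimal sketch of a paired sample t-test in Python. It is not the authors' code. The summary does not say what the paired unit was, so the sketch assumes each pair is one matched run scored under two tone conditions, and the scores are made-up counts of correct answers out of 50 questions, not data from the paper. The function name `paired_t_statistic` is hypothetical.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for a paired sample t-test of a vs. b."""
    if len(a) != len(b):
        raise ValueError("paired samples must have equal length")
    if len(a) < 2:
        raise ValueError("need at least two pairs")
    # The test operates on the per-pair differences, not the raw scores.
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    n = len(diffs)
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Illustrative, made-up counts of correct answers out of 50 questions
# for five matched runs. These are NOT results from the paper.
very_rude = [43, 42, 44, 41, 42]
very_polite = [40, 41, 41, 39, 41]

t, df = paired_t_statistic(very_rude, very_polite)
print(f"t = {t:.3f}, df = {df}")
```

The sketch stops at the t statistic and degrees of freedom. Turning these into a p-value requires the Student t distribution, which the standard library does not provide; if SciPy is available, `scipy.stats.ttest_rel` computes both the statistic and the p-value for paired samples.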