LmCast :: Stay tuned in

LLMs Predict My Coffee

Recorded: March 22, 2026, 10 p.m.


LLMs predict my coffee

dynomight · Mar 2026 · science, AI

Coding, math, whatever. Can LLMs predict the outcomes of physical experiments?

Suppose I pour 8 oz (226.8 g) of boiling water into a ceramic coffee mug that weighs 1.25 lb (0.57 kg). The ambient air is still and 20 degrees Celsius. The cup starts at room temperature. Give me an equation for the temperature of the water in Celsius over time. The only free variable in the equation should be the number of seconds t since the water was poured. Focus on accuracy during the first 5 minutes.

Does that seem hard? I think it’s hard. The relevant physical phenomena include at least:

Conduction of heat between the water, the mug, the air, and the table.
Conduction of heat inside each of those things.
Convection (fluid movement) inside the water and the air.
Evaporative cooling as water molecules become vapor.
Movement of water vapor in the air.
Radiation. (Like all matter, the mug and water emit temperature-dependent infrared radiation.)
Surface tension, thermal expansion/contraction, re-absorption of air into the water as it cools, probably more.

And many details aren’t specified in the prompt. Is the mug made of porcelain or stoneware? What is the mug’s shape? What is the table made of? How humid is the air? How am I reducing the spatially varying water temperature to a single number?
So this isn’t a problem with a “correct” answer that you can find by thinking. Reality is too complicated. Instead, answering the question requires “taste”—guessing which factors are most important, making assumptions about missing details, etc.
So I put that question to a bunch of LLMs. Here is what they said:

(Technically, they gave equations as text. I’m plotting those equations.)
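(If you want to reproduce the curves, here’s a minimal sketch of evaluating the returned equations yourself—the T(t) formulas are copied from the appendix, and the plotting step is just feeding these samples to matplotlib or whatever you like.)

```python
import math

# T(t) equations as returned by each model (see the appendix); t in seconds.
models = {
    "Kimi K2.5":       lambda t: 20 + 52.9 * math.exp(-t / 3600) + 27.1 * math.exp(-t / 80),
    "Gemini 3.1 Pro":  lambda t: 20 + 53 * math.exp(-t / 2500) + 27 * math.exp(-t / 149.25),
    "GPT 5.4":         lambda t: 20 + 54.6 * math.exp(-t / 2920) + 25.4 * math.exp(-t / 68.1),
    "Claude 4.6 Opus": lambda t: 20 + 55 * math.exp(-t / 1700) + 25 * math.exp(-t / 43),
    "Qwen3-235B":      lambda t: 20 + 53.17 * math.exp(-t / 1414.43),
    "GLM-4.7":         lambda t: 20 + 53.2 * math.exp(-t / 2500),
}

# Sample each prediction at a few times: at the pour, one minute,
# five minutes, and one hour.
for name, T in models.items():
    row = "  ".join(f"{T(t):6.1f}" for t in (0, 60, 300, 3600))
    print(f"{name:16s} {row}")
```

(One thing this makes obvious: the two single-exponential models start at around 73 °C rather than 100 °C at t = 0.)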
I was surprised by those curves, both in terms of how fast they think the temperature will drop in the beginning, and how slowly they think it will drop later on. They think you get as much cooling in the first few minutes as you do in the rest of the hour. Can that be right?
Then I did the experiment. First, I waited until the ambient temperature happened to reach 20 degrees Celsius. Then, I put 8 oz of water into a measuring cup, microwaved it until it reached a boil, let the temperature equalize a bit, and then microwaved it until the water boiled again. Then, I poured the water into a 1.25 lb coffee mug with a digital thermometer in it and shouted out measurements every five seconds, which were frantically recorded by the Dynomight Biologist. Gradually I stretched the interval between measurements to 15 seconds, 30 seconds, 1 minute, and then 5 minutes.
Behold:

Or, here’s a zoomed-in view of the first five minutes:

The predictions were all OK, but none were great. Probably Claude 4.6 Opus did best, albeit after consuming $0.61 of tokens. (Insert joke about physical experiments / Department of Defense / money / coffee.)
That said, what surprised me about the predictions was how quickly the temperature dropped in the first few minutes, and how slowly it dropped later on. But experimentally, it dropped even faster early on, and even slower towards the end. So if you wanted to ensemble my intuition with the LLM, I guess my intuition would get a weight of zero.
In conclusion, they may take our math, but they’ll somewhat more slowly take our fine motor control. Thank you for reading another middle-school science project.

(Appendix: The equations)

Here were the actual equations all of the models gave for T(t), the predicted temperature after t seconds.

| LLM | T(t) | Cost |
| --- | --- | --- |
| Kimi K2.5 (reasoning) | 20 + 52.9 exp(-t/3600) + 27.1 exp(-t/80) | $0.01 |
| Gemini 3.1 Pro | 20 + 53 exp(-t/2500) + 27 exp(-t/149.25) | $0.09 |
| GPT 5.4 | 20 + 54.6 exp(-t/2920) + 25.4 exp(-t/68.1) | $0.11 |
| Claude 4.6 Opus (reasoning) | 20 + 55 exp(-t/1700) + 25 exp(-t/43) | $0.61 (eeek) |
| Qwen3-235B | 20 + 53.17 exp(-t/1414.43) | $0.009 |
| GLM-4.7 (reasoning) | 20 + 53.2 exp(-t/2500) | $0.03 |

Interestingly, they were all based on one or two exponentially decaying terms. The way to read these is to think of exp(-t/b) as a function that starts out at one when t is zero, and gradually decreases. After b seconds, it has dropped to 1/e ≈ 0.368, and it continues dropping by factors of 0.368 every b seconds forever.
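You can check that 1/e business in a couple of lines—here using b = 80 s, the fast time constant from Kimi’s equation, though any b gives the same ratios:

```python
import math

b = 80.0  # a time constant in seconds (the fast term in Kimi's equation)

# Every b seconds, exp(-t/b) shrinks by another factor of 1/e ~ 0.368.
for k in range(4):
    t = k * b
    print(f"t = {t:5.0f} s: exp(-t/b) = {math.exp(-t / b):.3f}")
```

So at t = 0 the factor is 1.000, after b seconds it’s 0.368, after 2b seconds 0.135, and so on.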
So most of these models have a “fast rate” which reflects heat flow from the water into the mug along with a “slow rate” for heat from the water/mug to flow into the air. A few of the models skip the fast rate. I also tried DeepSeek and Grok but they just flailed around endlessly without ever returning an answer. They were kind enough to charge me for that service.
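Reading Claude’s two-term fit this way (the fast/slow labels being my interpretation, not anything the model said), the fast term has essentially vanished by the five-minute mark, while the slow term has barely budged:

```python
import math

# Claude 4.6 Opus's fit, split into its two decaying terms:
# a fast term (b = 43 s, heat flowing from water into mug) and
# a slow term (b = 1700 s, heat flowing from water/mug into air).
fast = lambda t: 25 * math.exp(-t / 43)
slow = lambda t: 55 * math.exp(-t / 1700)

for t in (0, 60, 300, 1800):
    print(f"t = {t:4d} s: fast term = {fast(t):6.2f} C, slow term = {slow(t):6.2f} C")
```

By t = 300 s the fast term contributes well under a tenth of a degree, so everything after the first few minutes is the slow term’s show.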


