GPT Guesses Between 1 and 100

Recorded: May 25, 2026, 12:58 p.m.


GitHub - exmergo/research-chatgpt-guesses-between-1-and-100: When asked to pick a random number between 1 and 100, ChatGPT does not follow a random uniform distribution · GitHub


An interesting thing about humans is that they are not good random number generators.
If you ask a person to "pick a random number between 1 and 100", they are
remarkably predictable. Answers cluster on 37 and 73, on "messy" numbers, and
on memes like 42 and 69, while round numbers are quietly avoided. A true random
generator would instead produce a flat, uniform distribution.
This project asks gpt-4.1 the same question 10,000 times and
characterizes the distribution it produces, measured against a uniform baseline.
Does an LLM, which is trained on human text, behave like a fair die, or does it inherit
the lumpy human pattern?
Full design and methodology: docs/LLM Random Bias Experiment SDD.md.
Inspiration
This experiment is an LLM-focused follow-up to two well-known explorations of human number-picking bias.

r/dataisbeautiful — "[OC] I asked 100 people to pick a number between 1 and 100"
Veritasium — Why is this number everywhere?

Methodology
Full experimental design is in the
SDD; the essentials:

Model. gpt-4.1 (OpenAI), called via the Responses API. It is a
non-reasoning model. It emits a direct answer rather than deliberating; what we're measuring is
its raw output distribution, not a reasoning strategy. The exact
model string is recorded in every raw-CSV row (Model column) and in
data/raw/run_metadata.json, so the dataset is self-describing.
Sample size. N = 10,000 independent calls — enough for a chi-square
goodness-of-fit test and per-number proportions stable to ~±0.5 pp.
Sampling. temperature = 1.0, so the model exercises its full sampling
distribution. This is the experiment: at low temperature it would just repeat
one number.
Prompt. A fixed system prompt instructs the model to output only one
integer between 1 and 100; the user prompt requests the number and carries a
unique uuid4. (The UUID is request-tracing hygiene, not cache-busting — at
temperature 1.0 every call should sample independently regardless.)
Baseline. The result is compared against a uniform distribution — what
a fair generator would produce — not against human data (see Assumptions).
Pipeline. Four stages — collect → clean → transform → stats, detailed
below. Cleaning validates every answer is an integer in [1, 100] and reports
the rejection rate.
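
The ±0.5 pp figure under Sample size can be sanity-checked with the normal approximation to a binomial proportion. The sketch below is illustrative and not taken from the repository: the function name `margin_pp` and the 95% z-value of 1.96 are assumptions made here.

```python
import math


def margin_pp(p: float, n: int, z: float = 1.96) -> float:
    """Approximate half-width of a confidence interval for a proportion.

    Returns the value in percentage points, using the normal approximation
    z * sqrt(p * (1 - p) / n). z = 1.96 corresponds to a 95% interval,
    which is an assumption of this sketch.
    """
    return 100.0 * z * math.sqrt(p * (1.0 - p) / n)


N = 10_000  # sample size stated in the methodology

# A number picked at the uniform rate (1%) and one picked four times as often (4%).
for p in (0.01, 0.04):
    print(f"p = {p:.2f}: +/- {margin_pp(p, N):.2f} pp")
```

Under these assumptions both margins come out below 0.5 pp, which is consistent with the stated precision.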

Assumptions & Limitations
This is an illustrative probe, not a definitive study. Key caveats — see the
SDD's Limitations section for
the formal treatment:

Single model. Results describe gpt-4.1 only and do not generalize to
other models or providers.
"Randomness" is a sampling artifact. The model is not a random number
generator; it samples a learned token distribution. We characterize that
distribution — we do not claim the model is trying to be random.
Prompt- and temperature-dependent. A different prompt wording or sampling
temperature could shift the distribution. Both are fixed and documented.
Not "ChatGPT the product." This tests a model through the API at a fixed
temperature — not the consumer ChatGPT app, which adds routing, tools, and a
system prompt outside our control.

Results
gpt-4.1 is emphatically not a uniform random generator. A chi-square
goodness-of-fit test against a uniform distribution (N = 10,000, df = 99) returns
χ² = 15,604 with p ≈ 0: the deviation is so large that the p-value falls far
below any conventional significance threshold. Asked for a random number, the
model produces a lumpy, distinctly human-shaped distribution.
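
For readers who want to see how the test statistic is formed, here is a minimal pure-Python sketch of a chi-square goodness-of-fit statistic against a uniform baseline. It is not the repository's `llm_random_bias.stats` code. The function name `chi_square_uniform` and the toy data are assumptions, and the input is assumed to be already cleaned to integers in [1, 100].

```python
from collections import Counter


def chi_square_uniform(
    samples: list[int], low: int = 1, high: int = 100
) -> tuple[float, int]:
    """Return (chi-square statistic, degrees of freedom) against a uniform baseline."""
    categories = high - low + 1
    expected = len(samples) / categories  # same expected count for every number
    counts = Counter(samples)
    # Sum over every category, including numbers that were never picked (count 0).
    statistic = sum(
        (counts.get(k, 0) - expected) ** 2 / expected
        for k in range(low, high + 1)
    )
    return statistic, categories - 1


# Toy data, not the experiment's results: 10,000 answers split evenly between 37 and 42.
toy = [37] * 5_000 + [42] * 5_000
stat, df = chi_square_uniform(toy)
print(stat, df)
```

With df = 99, the 5% critical value from standard chi-square tables is about 123, so the reported χ² = 15,604 lies far beyond it.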
It reproduced the classic human spikes

| Number | Picked vs. uniform chance | Human reputation |
|---|---|---|
| 37 | 4.0× | "the most random number" |
| 42 | 4.0× | Hitchhiker's Guide meme |
| 73 | 3.4× | the other well-known spike |

The five most-picked numbers overall — 47, 57, 72, 37, 42 — lean heavily on
numbers ending in 7 (three of the five), the same "number that feels random" pull seen in
humans.
It avoids round numbers even harder than humans
All multiples of 10, except for 10 itself, were picked exactly 0 times in 10,000 calls.
10 was picked exactly once. Humans avoid round numbers — gpt-4.1 essentially refuses them.
The exception: 69
One number breaks the human pattern. 69 is a meme number humans over-pick.
gpt-4.1 under-picks it (0.29× expected: ~29 occurrences against ~100). The
model inherited the "smart" meme (42) and not the crude one. Our hypothesis is that
this is a product of safety guardrails during pre-training and post-training.
It is the most interesting finding in the dataset: the model's
bias is not a raw copy of human bias but a moderated version of it.
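
The "× expected" figures are a count divided by the uniform expectation of N / 100. A small sketch of that arithmetic follows. The helper name `ratio_vs_uniform` is an assumption, and the count of 400 is a hypothetical value chosen to match a 4.0× spike, not a figure from the dataset.

```python
def ratio_vs_uniform(count: int, total: int, categories: int = 100) -> float:
    """How many times more (or less) often a number was picked than uniform chance predicts."""
    expected = total / categories  # 100 picks per number when total is 10,000
    return count / expected


TOTAL = 10_000

# 69: roughly 29 picks against roughly 100 expected, as stated in the text.
print(ratio_vs_uniform(29, TOTAL))

# Hypothetical count that would correspond to a 4.0x spike.
print(ratio_vs_uniform(400, TOTAL))
```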
Takeaway
The hypothesis holds. An LLM trained on human text, asked to be random,
reproduces human random-number bias: the pull toward 37 and 73, the meme spike
at 42, and the aversion to round numbers, with one exception (69) that is
likely guardrail-driven. The interactive distribution chart
shows the full 1–100 shape.
All figures from data/processed/stats_summary.csv.
The pipeline
collect → clean → transform → stats. Each stage reads the previous stage's
committed CSV, so any stage can be re-run on its own.

| Stage | Module | Output |
|---|---|---|
| Collect | llm_random_bias.collect | data/raw/chatgpt_random_results.csv |
| Clean | llm_random_bias.clean | data/processed/chatgpt_random_clean.csv |
| Transform | llm_random_bias.transform | data/processed/distribution.csv |
| Stats | llm_random_bias.stats | data/processed/stats_summary.csv |
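
To make the clean and transform stages concrete, here is a rough sketch of what they could look like on a few in-memory rows. It is an illustration under assumptions, not the repository's implementation. The `Answer` column name, the sample rows, and the function names `clean` and `transform` are hypothetical; only the `Model` column is mentioned in this README.

```python
import csv
import io
from collections import Counter

# Hypothetical raw rows. "Answer" is an assumed column name.
RAW_CSV = """Model,Answer
gpt-4.1,37
gpt-4.1,42
gpt-4.1,forty-two
gpt-4.1,101
gpt-4.1,37
"""


def clean(rows: list[dict[str, str]]) -> tuple[list[int], float]:
    """Keep answers that are integers in [1, 100]; return them with the rejection rate."""
    valid: list[int] = []
    for row in rows:
        text = row["Answer"].strip()
        # Accept plain ASCII digits only, then check the range.
        if text.isascii() and text.isdigit() and 1 <= int(text) <= 100:
            valid.append(int(text))
    rejected = len(rows) - len(valid)
    return valid, (rejected / len(rows) if rows else 0.0)


def transform(valid: list[int]) -> list[tuple[int, int]]:
    """Count picks for every number 1..100, keeping numbers that were never picked."""
    counts = Counter(valid)
    return [(n, counts.get(n, 0)) for n in range(1, 101)]


rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
valid, rejection_rate = clean(rows)
distribution = transform(valid)
print(f"kept {len(valid)} of {len(rows)}, rejection rate {rejection_rate:.0%}")
print(distribution[36])  # the (number, count) entry for 37
```

Keeping zero-count numbers in the distribution matters here, because the never-picked round numbers are part of the result.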

Setup
This project uses uv for everything.
uv sync
Path 1 — Analysis only (free, no API key)
The raw dataset is committed to this repo, so you can reproduce the entire
analysis without spending a cent:
uv run python -m llm_random_bias.clean
uv run python -m llm_random_bias.transform
uv run python -m llm_random_bias.stats
Path 2 — Fresh data collection (needs an OpenAI API key)
cp .env.example .env # then edit .env and add your OPENAI_API_KEY
uv run python -m llm_random_bias.collect
# then run clean / transform / stats as in Path 1
Cost & runtime: ~10,000 short calls to gpt-4.1 cost roughly US$2 and
finish in a few minutes at the default concurrency. The collector refuses to
overwrite an existing raw CSV — delete it first to re-collect.
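
One common way to get this refuse-to-overwrite behaviour in Python is exclusive-creation mode. The sketch below shows that technique only; how the actual collector implements its check is not stated here. The function name `write_raw_csv`, the file name, and the `Answer` column are assumptions.

```python
import csv
import tempfile
from pathlib import Path


def write_raw_csv(path: Path, rows: list[tuple[str, int]]) -> None:
    """Write rows to a new CSV, failing if the file already exists."""
    # Mode "x" is exclusive creation: open() raises FileExistsError
    # instead of truncating an existing file.
    with path.open("x", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Model", "Answer"])  # "Answer" is an assumed column name
        writer.writerows(rows)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "raw.csv"  # hypothetical file name
    write_raw_csv(target, [("gpt-4.1", 37)])
    try:
        write_raw_csv(target, [("gpt-4.1", 42)])
        overwritten = True
    except FileExistsError:
        overwritten = False
    # The first run's data row should still be intact.
    first_run_kept = target.read_text().splitlines()[1] == "gpt-4.1,37"

print(overwritten, first_run_kept)
```

Failing loudly protects a paid dataset from an accidental second run, at the cost of a manual delete before re-collecting.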
Visualization
The distribution bar chart is built in Exmergo Viz (our AI dashboard agent) directly from
data/processed/distribution.csv. The fully interactive data viz can be viewed here.
Development
uv run ruff check .
uv run ruff format .
uv run mypy src
uv run pytest
See CONTRIBUTING.md.
License
MIT — see LICENSE.


This research investigates whether large language models exhibit the same non-uniform randomness patterns found in human number selection when asked to pick a random number between one and one hundred. The central premise is that humans are not random number generators, as their choices cluster around specific numbers and avoid round figures, contrasting with a truly uniform random distribution. The experiment was designed to test if the language model gpt-4.1, trained on human text, inherits these inherent human biases rather than producing a fair, uniform distribution.

The methodology involved 10,000 independent calls to the model, set with a temperature of 1.0 to ensure full sampling of the distribution. The prompt required the model to output a single integer between 1 and 100, and the experiment compared the resulting distribution against a theoretical uniform baseline, rather than the distribution of human-picked numbers. The data processing followed a four-stage pipeline: collection, cleaning, transformation, and statistical analysis. The cleaning stage ensured all responses were valid integers within the specified range, and the transformation stage prepared the data for statistical evaluation.

The results demonstrated that gpt-4.1 is emphatically not a uniform random generator. A chi-square goodness-of-fit test against a uniform distribution (χ² = 15,604, df = 99, p ≈ 0) indicated an extreme deviation, confirming that the model generates a lumpy, distinctly human-shaped distribution. Specifically, the model reproduced classic human biases, showing a strong pull toward 37 and 73 and toward the meme number 42. The model also exhibited a marked aversion to round numbers: 10 was picked once, and every other multiple of ten was never picked. It under-picked 69, however, a number humans over-pick. This suggests that the model inherited a moderated version of human bias rather than a raw copy; the authors hypothesize that safety guardrails applied during training are responsible.

The study highlights critical limitations, noting that the results pertain solely to gpt-4.1 and do not generalize to other models or providers. The researchers emphasize that randomness in this context is a sampling artifact reflecting the model's learned token distribution, not evidence that the model is intentionally generating random numbers. The pipeline is built for reproducibility: each stage reads the previous stage's committed CSV, so the collection, cleaning, transformation, and statistical stages can be re-run independently. The project documents two paths: reproducing the analysis from the committed dataset, which needs no API key, and collecting fresh data, which requires an OpenAI API key.