LmCast :: Stay tuned in

“Dr. Google” had its issues. Can ChatGPT Health do better?

Recorded: Jan. 23, 2026, 10 a.m.

Original


“Dr. Google” had its issues. Can ChatGPT Health do better?

OpenAI’s newest product is no replacement for a doctor. But it might be better than searching the web for your symptoms.
By Grace Huckins
January 22, 2026

For the past two decades, there’s been a clear first step for anyone who starts experiencing new medical symptoms: Look them up online. The practice was so common that it gained the pejorative moniker “Dr. Google.” But times are changing, and many medical-information seekers are now using LLMs. According to OpenAI, 230 million people ask ChatGPT health-related queries each week.

That’s the context around the launch of OpenAI’s new ChatGPT Health product, which debuted earlier this month. It landed at an inauspicious time: Two days earlier, the news website SFGate had broken the story of Sam Nelson, a teenager who died of an overdose last year after extensive conversations with ChatGPT about how best to combine various drugs. In the wake of both pieces of news, multiple journalists questioned the wisdom of relying for medical advice on a tool that could cause such extreme harm.

Though ChatGPT Health lives in a separate sidebar tab from the rest of ChatGPT, it isn’t a new model. It’s more like a wrapper that provides one of OpenAI’s preexisting models with guidance and tools it can use to provide health advice—including some that allow it to access a user’s electronic medical records and fitness app data, if granted permission. There’s no doubt that ChatGPT and other large language models can make medical mistakes, and OpenAI emphasizes that ChatGPT Health is intended as an additional support, rather than a replacement for one’s doctor. But when doctors are unavailable or unable to help, people will turn to alternatives.

Some doctors see LLMs as a boon for medical literacy. The average patient might struggle to navigate the vast landscape of online medical information—and, in particular, to distinguish high-quality sources from polished but factually dubious websites—but LLMs can do that job for them, at least in theory. Treating patients who had searched for their symptoms on Google required “a lot of attacking patient anxiety [and] reducing misinformation,” says Marc Succi, an associate professor at Harvard Medical School and a practicing radiologist. But now, he says, “you see patients with a college education, a high school education, asking questions at the level of something an early med student might ask.”
The release of ChatGPT Health, and Anthropic’s subsequent announcement of new health integrations for Claude, indicate that the AI giants are increasingly willing to acknowledge and encourage health-related uses of their models. Such uses certainly come with risks, given LLMs’ well-documented tendencies to agree with users and make up information rather than admit ignorance.

But those risks also have to be weighed against potential benefits. There’s an analogy here to autonomous vehicles: When policymakers consider whether to allow Waymo in their city, the key metric is not whether its cars are ever involved in accidents but whether they cause less harm than the status quo of relying on human drivers. If Dr. ChatGPT is an improvement over Dr. Google—and early evidence suggests it may be—it could potentially lessen the enormous burden of medical misinformation and unnecessary health anxiety that the internet has created.

Pinning down the effectiveness of a chatbot such as ChatGPT or Claude for consumer health, however, is tricky. “It’s exceedingly difficult to evaluate an open-ended chatbot,” says Danielle Bitterman, the clinical lead for data science and AI at the Mass General Brigham health-care system. Large language models score well on medical licensing examinations, but those exams use multiple-choice questions that don’t reflect how people use chatbots to look up medical information.
Sirisha Rambhatla, an assistant professor of management science and engineering at the University of Waterloo, attempted to close that gap by evaluating how GPT-4o responded to licensing exam questions when it did not have access to a list of possible answers. Medical experts who evaluated the responses scored only about half of them as entirely correct. But multiple-choice exam questions are designed to be tricky enough that the answer options don’t give them entirely away, and they’re still a pretty distant approximation of the sort of thing that a user would type into ChatGPT.

A different study, which tested GPT-4o on more realistic prompts submitted by human volunteers, found that it answered medical questions correctly about 85% of the time. When I spoke with Amulya Yadav, an associate professor at Pennsylvania State University who runs the Responsible AI for Social Emancipation Lab and led the study, he made it clear that he wasn’t personally a fan of patient-facing medical LLMs. But he freely admits that, technically speaking, they seem up to the task—after all, he says, human doctors misdiagnose patients 10% to 15% of the time. “If I look at it dispassionately, it seems that the world is gonna change, whether I like it or not,” he says.

For people seeking medical information online, Yadav says, LLMs do seem to be a better choice than Google. Succi, the radiologist, also concluded that LLMs can be a better alternative to web search when he compared GPT-4’s responses to questions about common chronic medical conditions with the information presented in Google’s knowledge panel, the information box that sometimes appears on the right side of the search results. Since Yadav’s and Succi’s studies appeared online, in the first half of 2025, OpenAI has released multiple new versions of GPT, and it’s reasonable to expect that GPT-5.2 would perform even better than its predecessors.

But the studies do have important limitations: They focus on straightforward, factual questions, and they examine only brief interactions between users and chatbots or web search tools. Some of the weaknesses of LLMs—most notably their sycophancy and tendency to hallucinate—might be more likely to rear their heads in more extensive conversations and with people who are dealing with more complex problems. Reeva Lederman, a professor at the University of Melbourne who studies technology and health, notes that patients who don’t like the diagnosis or treatment recommendations they receive from a doctor might seek out another opinion from an LLM—and the LLM, if it’s sycophantic, might encourage them to reject their doctor’s advice.

Some studies have found that LLMs will hallucinate and exhibit sycophancy in response to health-related prompts. For example, one study showed that GPT-4 and GPT-4o will happily accept and run with incorrect drug information included in a user’s question. In another, GPT-4o frequently concocted definitions for fake syndromes and lab tests mentioned in the user’s prompt. Given the abundance of medically dubious diagnoses and treatments floating around the internet, these patterns of LLM behavior could contribute to the spread of medical misinformation, particularly if people see LLMs as trustworthy. OpenAI has reported that the GPT-5 series of models is markedly less sycophantic and prone to hallucination than its predecessors, so the results of these studies might not apply to ChatGPT Health.
The company also evaluated the model that powers ChatGPT Health on its responses to health-specific questions, using its publicly available HealthBench benchmark. HealthBench rewards models that express uncertainty when appropriate, recommend that users seek medical attention when necessary, and refrain from causing users unnecessary stress by telling them their condition is more serious than it truly is. It’s reasonable to assume that the model underlying ChatGPT Health exhibited those behaviors in testing, though Bitterman notes that some of the prompts in HealthBench were generated by LLMs, not users, which could limit how well the benchmark translates into the real world.

An LLM that avoids alarmism seems like a clear improvement over systems that have people convincing themselves they have cancer after a few minutes of browsing. And as large language models, and the products built around them, continue to develop, whatever advantage Dr. ChatGPT has over Dr. Google will likely grow. The introduction of ChatGPT Health is certainly a move in that direction: By looking through your medical records, ChatGPT can potentially gain far more context about your specific health situation than could be included in any Google search, although numerous experts have cautioned against giving ChatGPT that access for privacy reasons.

Even if ChatGPT Health and other new tools do represent a meaningful improvement over Google searches, they could still conceivably have a negative effect on health overall. Much as automated vehicles, even if they are safer than human-driven cars, might still prove a net negative if they encourage people to use public transit less, LLMs could undermine users’ health if they induce people to rely on the internet instead of human doctors, even if they do increase the quality of health information available online. Lederman says that this outcome is plausible. In her research, she has found that members of online communities centered on health tend to put their trust in users who express themselves well, regardless of the validity of the information they are sharing. Because ChatGPT communicates like an articulate person, some people might trust it too much, potentially to the exclusion of their doctor. But LLMs are certainly no replacement for a human doctor—at least not yet.

Summarized

OpenAI’s newest product, ChatGPT Health, represents a significant, albeit cautious, step in leveraging artificial intelligence for healthcare information. Launched days after news broke of a teenager’s death from a drug overdose following extensive conversations with ChatGPT about combining drugs, the product’s debut highlights the inherent risks alongside the potential benefits. ChatGPT Health isn’t a new underlying model; rather, it’s a wrapper that gives an existing OpenAI model guidance and tools, including the ability to access a user’s electronic medical records and fitness app data, if granted permission. Despite the concerns, OpenAI emphasizes that it is intended as supplementary support rather than a replacement for a doctor.

Several experts, including Marc Succi, an associate professor at Harvard Medical School, note that patients, whether they have a college or a high school education, now ask questions at the level of an early medical student. This shift suggests a growing demand for accessible, readily available medical information. Evaluating these tools remains difficult, however. Danielle Bitterman, the clinical lead for data science and AI at Mass General Brigham, points out that open-ended chatbots are hard to assess, since medical licensing exams rely on multiple-choice questions that don’t reflect how users actually interact with these tools. In one study led by Sirisha Rambhatla of the University of Waterloo, medical experts rated only about half of GPT-4o’s answers to licensing exam questions as entirely correct when the model was not shown the answer options, indicating a gap between benchmark performance and practical application.

Despite these limitations, early evidence suggests that tools like ChatGPT Health may offer an improvement over traditional web searches, especially for straightforward, factual questions. This is supported by Succi’s comparison of GPT-4’s responses about common chronic medical conditions with the information in Google’s knowledge panel. And while experts such as Amulya Yadav, an associate professor at Pennsylvania State University, remain personally skeptical of patient-facing medical LLMs, they concede that the technology appears up to the task and that, for people seeking basic medical information online, LLMs seem to be a better choice than Google.

The release of ChatGPT Health and Anthropic’s Claude integrations signals a growing willingness among AI giants to embrace health applications, despite the sizable risks stemming from LLMs’ tendency toward agreement and fabrication. That recognition comes with a need for safeguards. OpenAI evaluates the model behind ChatGPT Health using its publicly available HealthBench benchmark, which rewards responses that express uncertainty when appropriate, recommend medical attention when necessary, and avoid causing users unnecessary alarm.

Despite these advancements, underlying vulnerabilities remain. The potential for sycophancy and hallucination, particularly in extended conversations or with complex problems, poses a considerable risk. Experts caution that users may become overly reliant on the AI's advice, potentially rejecting their doctor's recommendations. Some studies have observed that LLMs will readily accept and run with incorrect information, or invent definitions for non-existent medical terms. This could exacerbate the spread of medical misinformation, especially if users perceive the chatbot as a trustworthy source of information. Notably, OpenAI reports that GPT-5 series models are markedly less sycophantic and prone to hallucination than previous versions.

Ultimately, ChatGPT Health is a nascent product with significant potential and considerable risks. As with autonomous vehicles, where the key metric is not whether accidents ever happen but whether the technology causes less harm than human drivers, the value of ChatGPT Health will be determined by its overall impact on health. Grace Huckins notes that relying on LLMs could ultimately undermine users’ health if it leads to overreliance on internet-based information instead of a human doctor. The challenge is not to discard AI’s potential in healthcare but to proceed with caution, pairing responsible development and use with robust safeguards and continued collaboration between technology developers and healthcare professionals to balance the benefits against the inherent risks.