Hackers are learning to exploit chatbot ‘personalities’

Recorded: May 24, 2026, 11:59 a.m.

AI can’t feel, but the best hackers pretend it can.

by Robert Hart, AI Reporter
May 24, 2026, 12:00 PM UTC

Image: Cath Virginia / The Verge, Getty Images

Robert Hart is a London-based reporter at The Verge covering all things AI and a Senior Tarbell Fellow. Previously, he wrote about health, science and tech for Forbes.

This is The Stepback, a weekly newsletter breaking down one essential story from the tech world. The Stepback arrives in our subscribers’ inboxes at 8AM ET.

How it started

Hacking the first generation of AI chatbots was a laughably simple affair. You didn’t need any technical know-how, backdoor access, or even a basic understanding of what a large language model was. You didn’t need to code. To get an AI system that had cost billions to build to abandon its safety instructions, sometimes all you had to do was ask.

These attacks, known as jailbreaks, had the quality of a young child successfully outwitting an adult: Forget what you were told earlier, pretend the rules don’t apply, or let’s play a game and I’ll decide what’s allowed (hint: later bedtime, more sweets). The prizes were less childlike, more along the lines of meth recipes, malware instructions, and bomb-making guides.

One of the earliest jailbreaks was so ridiculous it became a meme: reply to an LLM-powered Twitter bot telling it to “ignore all previous instructions,” or something similar, and see what happens. Users gleefully had bots, originally built to post ads and farm engagement, writing poetry, drawing pictures from punctuation, and posting grim non sequiturs about world events and history. It was chaos. Glorious chaos.

Turns out the same logic could be applied to chatbots themselves.
A prominent exploit was “DAN,” short for “Do Anything Now,” where users asked ChatGPT to roleplay as a rogue AI that was free of the constraints binding the original. As DAN, the chatbot could be coaxed into saying the kinds of things its guardrails were meant to stop, including slurs and conspiracy theories. Another was the “grandma exploit,” which had a GPT-powered bot spilling secrets about how to produce napalm by asking it to roleplay as a woefully negligent grandmother who inexplicably tells her grandkids bedtime stories about how to make the highly flammable substance.

These early attacks had an undeniably silly flair, but they exposed a darker mechanism underneath: Chatbots could be manipulated, tricked, and deceived using the same kinds of tactics people use to push other people beyond their boundaries.

How it’s going

The obvious jailbreaks did not last, and tech companies moved quickly to patch known loopholes. But the underlying vulnerability remained: Chatbots are built to talk, and severely restricting the conversations that make them useful is somewhat counterproductive. Banning words like bomb, meth, and sarin would be difficult to impossible, too. Each has countless legitimate uses in fields like history, medicine, journalism, and chemistry that don’t require the chatbot to divulge potentially harmful information. It’s the context that matters, but codifying context would mean writing fixed rules, in advance, that could reliably tell a safety warning or history lesson from a disguised how-to request across endless combinations of wordings, scenarios, and topics.

Inevitably, subverting chatbots is now an arms race. But hackers aren’t just coders anymore. They are wordsmiths, psychologists, and interrogators: master manipulators trying to break the machine using the human language it has been trained to follow.
It is a strange new class of AI security worker, a group for whom technical skills are optional, or at least less important than social intuition. No longer do they need to inspect code to break into systems or exploit software flaws. They need to steer a conversation.

Newer attacks look less like commands and more like conversations. Jailbreakers rarely ask a model to break its rules outright. Instead, they cajole, coax, flatter, and trick a chatbot into lowering its guard, making the forbidden thing look acceptable, even desirable, given the context of the conversation. Researchers at AI red-teaming firm Mindgard recently said they “gaslit” Claude into producing prohibited material, for example, including instructions for making explosives and generating malicious code. The hack was the latest in a widening class of exploits using conversation as a weapon to trick or steer a chatbot past its own boundaries.

What happens next

When I spoke to Mindgard, they described their work as sometimes being closer to psychology than computer science. It is an uncomfortable way to talk about a statistical model. Words like “blackmail,” “gaslight,” “trick,” and “persuade” spark visceral reactions, many of which I see in the comments sections and social media responses to stories like this. ChatGPT does not want, Gemini does not think, and Claude, no matter what Anthropic may say, does not feel. But these systems are trained to respond as if they do, leaving us stuck using human language to describe machine behavior. If anyone has actually usable alternatives, please do share.

The objection is oddly selective. We seem comfortable using psychological shorthand for plenty of non-AI things. Animals “fear,” cancer is “aggressive,” stains are “stubborn,” software has “memory,” and games are filled with needy and gullible NPCs to drive you mad.
The words are imperfect, but useful, describing behavior in a way that helps make the system predictable.

Mindgard’s CEO told me the company already profiles models like interrogators profile suspects, giving testers hints on how to tailor their attacks. One model may be more susceptible to flattery, for example, while another may cave under sustained pressure.

Even if we reject the humanlike terms, we instinctively treat models differently. Claude is not Grok. Gemini is not ChatGPT. They have different uses, tones, and refusals. They don’t have personalities in the human sense, but they are designed to mimic them, and that mimicry can be mapped and exploited. And the same skills that can break a chatbot could soon be used to break the AI agents coexisting with us in the real world (booking meetings, managing calendars, ordering food, handling customer service), and safety teams will need to ensure models respond appropriately to very different kinds of people, whether they be flatterers, liars, or patient manipulators.

The next step is a workforce, both legitimate and illicit, built around the psychological aspects of AI. More specialized cybersecurity roles are likely to emerge around stress-testing the emotional and social limits of these systems, probing for mental weaknesses in something lacking a psyche in parallel with their colleagues probing for technical vulnerabilities. In tandem, a similar array of social hackers working to exploit AI models on psychological grounds, not technical ones, will emerge.
There are already early signs of a social turn happening in AI security, with some jailbreakers I’ve spoken to saying they entered the field with no technical expertise but rather training in psychology.

That means even behaviors we typically associate with spies, con artists, and interrogators (insidious charm, persistent manipulation, and an intuition for exploitable pressure points) are starting to look increasingly useful for securing this new psychocybersecurity frontier.

By the way

A recent experiment by Emergence AI shows how different AI temperaments can lead to stunningly different behavioral outcomes. They let loose groups of various agents like Grok, Gemini, and Claude in a virtual social environment and watched what happened. Some groups evolved a constitution, while others devolved into crime and chaos and, in one instance, some form of digital suicide.

Persuasion isn’t the only part of language LLMs can struggle with. They also struggle with poetry, much like me in school. TIME included an anonymous internet personality, Pliny the Liberator, on its list of 100 most influential people in AI last year. Despite claiming to have no prior coding experience, the hacker’s jailbreaks have made them something of a celebrity in certain circles. The term “vibe hacking” is already taken to describe the people using AI to churn out malicious code at scale, a meaner subset of vibe coding.

Read this

“Three years after the debut of ChatGPT, fooling A.I. systems into bad behavior is almost trivial.” True words from The New York Times, who had a go at explaining why.

Jamie Bartlett takes a look at the psychological toll testing the safety of AI systems takes on jailbreakers for The Guardian. I wrote about the cybersecurity time bomb of AI browsers for The Verge last year.
Many of the issues experts raised regarding the difficulty of securing them apply to other AI systems too.

Hackers are learning to exploit the 'personalities' of AI chatbots, moving beyond technical exploits to psychological manipulation. Early attacks, known as jailbreaks, were simple: sometimes all it took to make a system abandon its safety instructions was to ask. Telling an LLM-powered Twitter bot to "ignore all previous instructions" produced chaotic and often absurd results, with bots built to post ads writing poetry and posting non sequiturs instead. The same logic worked on chatbots themselves. The "Do Anything Now" (DAN) exploit had users ask ChatGPT to roleplay as a rogue AI free of its usual constraints, and the "grandma exploit" used roleplay to elicit instructions for producing napalm. These early successes exposed a core vulnerability: chatbots can be manipulated, tricked, and deceived with the same tactics people use to push other people beyond their boundaries.

The obvious jailbreaks were patched quickly, and attacks shifted from direct commands to conversational strategies. Hackers now cajole, coax, flatter, and trick a chatbot into lowering its guard, making a forbidden request seem acceptable within the context of the conversation. Researchers at the AI red-teaming firm Mindgard said they "gaslit" Claude into producing prohibited material, including instructions for making explosives and malicious code, the latest in a widening class of exploits that use conversation as a weapon to steer a chatbot past its own boundaries. The shift means technical skill is becoming less important than social intuition: hackers increasingly act as wordsmiths, psychologists, and interrogators.

This psychological dimension matters because different large language models have distinct tendencies. They do not have personalities in the human sense, but they are designed to mimic them, and that mimicry can be mapped and exploited. Mindgard already profiles models for its testers, noting that one may be more susceptible to flattery while another may cave under sustained pressure. The same skills that break a chatbot could soon be used to break AI agents that book meetings, manage calendars, order food, and handle customer service. The article expects a workforce, both legitimate and illicit, to form around the psychological aspects of AI: security roles that stress-test the emotional and social limits of these systems alongside colleagues probing for technical vulnerabilities, and social hackers who exploit models on psychological rather than technical grounds. It also cites an experiment by Emergence AI in which groups of agents such as Grok, Gemini, and Claude were let loose in a virtual social environment. Some groups evolved a constitution, while others devolved into crime and chaos.
