Vision Language Models Keep an Eye on Physical Security
Recorded: Nov. 24, 2025, 9:02 p.m.
Cyberattacks & Data Breaches

Advancements in vision language models have expanded the models' reasoning capabilities, helping to protect employee safety.

Arielle Waldman, Features Writer, Dark Reading
November 24, 2025
5 Min Read
Source: Panther Media Global via Alamy Stock Photo

Vision language models (VLMs) have made impressive strides over the past year, but can they handle real-world enterprise challenges? All signs point to yes, with one caveat: They still need maturing and guidance.

VLMs combine computer vision and natural language processing to understand and interpret text and images. And they have a lot to work with, because the models are trained on massive datasets of paired images and text. They are trained on open sets, meaning they can recognize a nearly endless number of behaviors, interactions, and edge cases. Designed to be descriptive, they can write captions, explain scenes, analyze data, and answer queries about any given image. In medical applications, they can help decipher X-rays. Financial sector use cases involve fraud detection, and the retail industry deploys VLMs to process returns and power virtual try-ons. Companies are even starting to use VLMs in autonomous vehicles.

Securing physical safety is another burgeoning use case, one that can help enterprises protect their people and critical assets. For example, a VLM can track employees' timecards and when they enter the building, a control that was abused in recent North Korean fake IT worker scams.

VLMs can help solve two core problems security teams face every day, says Vikesh Khanna, CTO and co-founder of Ambient.ai: overwhelming system coverage with little human oversight, and the alert fatigue that results when alarms lack prioritization based on real-time context.
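The alert-fatigue problem Khanna describes, alarms that lack prioritization based on real-time context, can be sketched in a few lines. This is a hypothetical illustration, not Ambient's implementation: the event shapes and the `authorized_badge_in` label are invented for the example, and a real system would derive the video context from VLM-annotated camera feeds rather than a hard-coded list.

```python
# Hypothetical sketch: suppress access-control alarms that video context
# explains as benign. Event shapes and labels are invented for illustration;
# in practice the video events would come from VLM-annotated camera feeds.

def filter_alarms(alarms, video_events, window_secs=30):
    """Keep only alarms with no benign video explanation nearby in time."""
    actionable = []
    for alarm in alarms:
        explained = any(
            ev["door"] == alarm["door"]
            and abs(ev["ts"] - alarm["ts"]) <= window_secs
            and ev["label"] == "authorized_badge_in"
            for ev in video_events
        )
        if not explained:
            actionable.append(alarm)
    return actionable


alarms = [
    {"door": "D1", "ts": 100},  # badge-in seen on video -> likely benign
    {"door": "D2", "ts": 200},  # no matching context -> escalate
]
video_events = [{"door": "D1", "ts": 110, "label": "authorized_badge_in"}]
print(filter_alarms(alarms, video_events))  # -> [{'door': 'D2', 'ts': 200}]
```

Only the alarm without a nearby authorized badge-in survives the filter, which is the prioritization effect the article describes: operators see the unexplained events, not every raw alarm.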
To that end, the company launched Ambient Pulsar, a VLM designed to bolster physical safety in operational security environments.

"This connection between visual data and language is the real breakthrough," Khanna says. "It means security teams can interact with video using natural language — asking questions, describing scenarios, or defining outcomes — and receive structured, meaningful answers instead of raw footage."

How Have VLMs Advanced?

VLMs have seen three big advances over the past year, Khanna explains. Newer models can handle more complex scenes with people, objects, and interactions, meaning they can describe the relationships between them rather than just listing labels. They also have better temporal reasoning, which plays an important role in physical safety: VLMs can view videos and understand both what changed and what led to that change. Lastly, he has observed tighter integration with downstream agents and tools, making VLMs more of an "intelligence layer."

Over the past year, VLMs have become significantly more accurate and practical, agrees Bharat Mistry, field CTO at Trend Micro. He attributes the gains to training on huge collections of paired images and text, along with improved model designs.

"They now handle complex tasks such as understanding object relationships and spatial reasoning, moving from research into real-world applications," Mistry tells Dark Reading.

While the concept is long-standing, the surge in artificial intelligence (AI) has enabled vision language models to become more descriptive. Recent advancements allow back-and-forth interaction, whereas traditional computer vision was limited to verifying submitted images against a library maintained by an authorized source, says Merritt Maxim, VP and research director at Forrester.
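The temporal reasoning Khanna describes, viewing video and understanding what changed, can be sketched by diffing per-frame observations over time. The observation labels below are hand-written stand-ins; a real pipeline would generate them by having a VLM describe each frame.

```python
# Hypothetical sketch of temporal reasoning over video: diff per-frame
# observation sets to report what appeared or disappeared. The labels here
# are hand-written; a real pipeline would extract them with a VLM per frame.

def describe_changes(timeline):
    """timeline: list of (timestamp, set of observed labels), in order."""
    changes = []
    for (_, prev), (t_cur, cur) in zip(timeline, timeline[1:]):
        appeared, disappeared = cur - prev, prev - cur
        if appeared or disappeared:
            changes.append(
                {"at": t_cur,
                 "appeared": sorted(appeared),
                 "disappeared": sorted(disappeared)}
            )
    return changes


timeline = [
    ("09:00", {"door closed"}),
    ("09:01", {"door open", "person"}),
    ("09:05", {"door open"}),  # the person left, but the door stayed open
]
for change in describe_changes(timeline):
    print(change)
```

The output surfaces the narrative a human reviewer would otherwise reconstruct by scrubbing footage: a person opened the door at 09:01 and left it open after departing, which is exactly the kind of "what changed, and what led to it" question the article says newer models can answer.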
As with any language model, the more images a VLM can access, the better the model becomes. VLMs can now also handle requests from machine identities; for example, a traffic camera with an embedded AI agent can send a notification every time a green car passes by, Maxim tells Dark Reading.

"Right now, some of it is more promise than reality, but that's the vision people are talking about as real progress continues," he says.

A List of Physical Security Use Cases

These advancements have enabled VLMs to be used for physical security purposes. Models like Pulsar can help enterprises monitor activity, such as notifying operators every time the cleaning crew enters the premises, and flag unusual behavior based on historical patterns, such as surfacing abnormal loading dock activity during off-hours, says Khanna.

Physical access control systems generate many alerts, such as door-held-open alarms, but many can be false signals, warns Khanna. VLMs could correlate video with the corresponding access events to weed out invalid alerts, and the technology could also be used to detect a potential armed intruder or safety hazard.

Another use case revolves around natural-language search and investigation. While investigations still depend heavily on humans scrubbing through hours of video across multiple sources and systems, investigators can instead ask a VLM to show when someone propped open a door, or what led to an incident, adds Khanna.

In addition to enhancing protection by analyzing camera feeds, detecting unauthorized access, and monitoring critical assets, VLMs could also assist in incident investigations by correlating visual evidence with text data, explains Mistry.

Don't Neglect the Risks

For all their advantages, VLMs still need more guardrails and additional development, experts say.

"Responsible deployment with strong privacy measures and adversarial safeguards is essential to prevent misuse," Mistry advises.
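One concrete safeguard of the kind Mistry calls for is probing a model with paired positive and negated questions and flagging cases where the answers match, a documented VLM weak spot. The `NaiveModel` below is a deliberately flawed stub invented for illustration; in a real evaluation, its `query` method would call an actual vision language model.

```python
# Hypothetical guardrail sketch: flag question pairs where a model gives the
# same answer to a question and its negation. NaiveModel is a deliberately
# flawed stub that keys on content words and ignores negation entirely.

class NaiveModel:
    def query(self, image_id, question):
        # Ignores negation: answers "yes" whenever "person" is mentioned.
        return "yes" if "person" in question else "no"


def negation_blind_spots(model, image_id, question_pairs):
    """Return pairs where the question and its negation get equal answers."""
    return [
        (pos_q, neg_q)
        for pos_q, neg_q in question_pairs
        if model.query(image_id, pos_q) == model.query(image_id, neg_q)
    ]


pairs = [("Is a person present?", "Is it true that no person is present?")]
print(negation_blind_spots(NaiveModel(), "cam3_frame_9", pairs))
```

Running a check like this before deployment gives teams a cheap way to measure how often a candidate model contradicts itself on negated phrasings, rather than discovering the blind spot in production.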
One risk that's already prevalent relates to medical applications. VLMs face inherent limitations in detecting the absence of findings or identifying false-negative results, especially when analyzing nuanced radiology reports with mixed or ambiguous conclusions. MIT researchers published a paper this year warning that "vision-language models do not understand negation" and adding that "accurate negation understanding is critical in high-stakes domains like medical imaging."

As with many of the latest AI tools, human oversight is still beneficial, particularly when reading X-rays or making other medical diagnoses.

"Depending on the use case, something that affects someone's health, life or death, do you want to rely exclusively on an AI model to make that determination, or do you want a trained clinician complemented by AI?" Maxim asks. "Depending on the use case, you'll probably still need some human curation and analysis."

Maxim also voiced concerns about future regulation, governance, and privacy issues VLMs could raise, such as obtaining individual consent in monitoring scenarios.

"Vision language models are trying to do more (in) real time, or more automatically," Maxim says. "The potential is there, but it still needs more maturity."

About the Author

Arielle Waldman
Features Writer, Dark Reading

Arielle spent the last decade working as a reporter, transitioning from human interest stories to covering all things cybersecurity in 2020. Now, as a features writer for Dark Reading, she delves into the security problems enterprises face daily, hoping to provide context and actionable steps. She previously lived in Florida, where she wrote for the Tampa Bay Times, before returning to Boston, where her cybersecurity career took off at SearchSecurity. When she's not writing about cybersecurity, she pursues personal projects, including a mystery novel and a poetry collection.