Google has claimed that a vulnerability flagged by its Big Sleep AI model represents the first time an AI tool has found a previously unknown bug in the wild.
Google clarified that this is the first time such a system has detected a memory-safety bug, acknowledging other AI tools have discovered different types of vulnerabilities before.
“We believe this is the first public example of an AI agent finding a previously unknown exploitable memory-safety issue in widely used real-world software,” stated the blog post from the Big Sleep Team.
The vulnerability was described as an exploitable stack buffer underflow flaw in SQLite, the most widely deployed open source database engine. The flaw could have allowed an attacker to intentionally crash the application or execute arbitrary code, subverting existing security software on the system.
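For readers unfamiliar with the flaw class, the sketch below shows a generic stack buffer underflow in C. It is purely illustrative and bears no relation to the actual SQLite code involved: the bug is an index check that forgets the lower bound, letting a negative index write below the start of a stack buffer.

```c
#include <string.h>

/* Illustrative only: a generic stack buffer underflow pattern, not the
 * actual SQLite code path Big Sleep flagged. If idx comes from untrusted
 * input and is never checked for a lower bound, a negative index writes
 * *before* buf on the stack, corrupting adjacent locals or saved state. */
static void store_value(int idx, char value) {
    char buf[16];
    memset(buf, 0, sizeof(buf));
    if (idx < (int)sizeof(buf)) {   /* BUG: missing "idx >= 0" check */
        buf[idx] = value;           /* idx == -1 underflows the buffer */
    }
}

int main(void) {
    store_value(-1, 'A');           /* flagged by AddressSanitizer */
    return 0;
}
```

Compiled with -fsanitize=address, a write like this is reported as a stack-buffer-underflow at runtime.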
The team behind the AI model said that once they discovered the flaw, they reported it to the developers in early October, who fixed it the same day. Researchers noted the issue was fixed before it appeared in an official release, meaning no SQLite users were impacted.
Google began using SQLite to test Big Sleep’s bug-hunting capabilities after seeing AI security research organization Team Atlanta use their Atlantis cyber reasoning system (CRS) to find a null pointer dereference at the DARPA AIxCC event.
This inspired the Big Sleep team to see if they could find a more serious vulnerability using their newly developed LLM.
The researchers noted that exploiting the vulnerability is not trivial, so threat actors would have found it difficult to leverage in a successful attack, but said the discovery nonetheless demonstrates Big Sleep’s bug-hunting prowess.
Flaw flagged by Big Sleep slips through the cracks of traditional fuzzing methods
Big Sleep was developed as a collaboration between Google’s zero-day hunting team Project Zero and its DeepMind AI research lab.
The tool is an evolution of earlier versions of Google’s framework for LLM-assisted vulnerability research known as Project Naptime, which was announced in June 2024.
Project Naptime was launched to evaluate the offensive security capabilities of LLMs, leveraging the rapidly improving code-comprehension of these models to “reproduce the systematic approach of a human security researcher when identifying and demonstrating security vulnerabilities”.
The post noted that fuzzing, the traditional approach to testing software for vulnerabilities by feeding it invalid or unexpected inputs, has limitations when it comes to finding certain flaws.
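To illustrate what fuzzing involves in practice, the sketch below is a minimal libFuzzer-style harness that hands each mutated input to SQLite as a SQL statement against an in-memory database. It is a generic example, not a reproduction of SQLite’s own, far more elaborate fuzzing setup.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sqlite3.h>

/* Minimal libFuzzer-style harness: treat each fuzz input as a SQL string
 * and execute it against an in-memory database, letting the fuzzer mutate
 * inputs and the sanitizers catch any memory errors. Generic sketch only. */
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
    char *sql = malloc(size + 1);
    if (sql == NULL) return 0;
    memcpy(sql, data, size);
    sql[size] = '\0';                         /* NUL-terminate the input */

    sqlite3 *db = NULL;
    if (sqlite3_open(":memory:", &db) == SQLITE_OK) {
        sqlite3_exec(db, sql, NULL, NULL, NULL);  /* SQL errors are expected */
    }
    sqlite3_close(db);
    free(sql);
    return 0;
}
```

Built with clang -fsanitize=fuzzer,address and linked against SQLite, the fuzzer mutates inputs to maximize code coverage and relies on the sanitizer to surface memory errors, which is exactly the kind of coverage-driven search the Big Sleep team says has struggled to catch variants of previously patched bugs.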
“A key motivating factor for Naptime and now for Big Sleep has been the continued in-the-wild discovery of exploits for variants of previously found and patched vulnerabilities,” developers on the Big Sleep team said.
“As this trend continues, it’s clear that fuzzing is not succeeding at catching such variants, and that for attackers, manual variant analysis is a cost-effective approach.”
In this case, the researchers reported that their attempt to rediscover the flaw through fuzzing failed to find the memory-safety bug flagged by Big Sleep.
The post cautioned, however, that the results of the testing were “highly experimental”, and that the Big Sleep team still believes a target-specific fuzzer would be at least as effective at finding similar flaws.