Security experts issue warning over the rise of ‘gray bot’ AI web scrapers

Security firm Barracuda has called for organizations to factor AI bots that scrape data from public websites into their security strategies, labelling them not as good or bad bots, but “gray bots”.
Defining these categories of bot, Rahul Gupta, senior principal software engineer for application security engineering at Barracuda, said: “There are good bots – such as search engine crawler bots, SEO bots, and customer service bots – and bad bots, designed for malicious or harmful online activities like breaching accounts to steal personal data or commit fraud.
“In the space between them you will find what Barracuda calls ‘gray bots.’ … Gray bots are blurring the boundaries of legitimate activity. They are not overtly malicious, but their approach can be questionable. Some are highly aggressive.”
Examples of gray bots cited by Gupta include web scraper bots, automated content aggregators (for news, travel offers, and the like), and generative AI scraper bots.
The activity of this third category was specifically highlighted by Gupta, with web applications receiving millions of requests from bots such as Anthropic’s ClaudeBot and TikTok’s Bytespider bot.
“ClaudeBot is the most active Gen AI gray bot in our dataset by a considerable margin,” said Gupta. “ClaudeBot’s relentless requests are likely to impact many of its targeted web applications.”
According to Barracuda’s analysis, one web application received an average of 323,300 AI scraper bot requests a day over the course of 30 days.
Another received 500,000 requests in a single day. A third received approximately 40,800 requests over the course of a day, with an average request rate of 17,000 per hour.
Gupta said this level of consistency was “unexpected”.
“It is generally assumed, and often the case, that gray bot traffic comes in waves, hitting a website for a few minutes to an hour or so before falling back,” he said, although he added that “constant bombardment or unexpected, ad hoc traffic surges [both] present challenges for web applications”.
This level of activity can disrupt operations and degrade web application performance, Gupta said, as well as gathering up “vast volumes of proprietary or commercial data”.
There can also be more indirect impacts, such as distorting web traffic metrics, making it harder to make data-driven decisions, Barracuda claimed.
Defensive measures
There are multiple reasons why organizations may wish to protect themselves from AI web scrapers, from safeguarding their IP and copyright to data privacy concerns and performance degradation.
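One lightweight first step is signalling to AI crawlers via robots.txt. The sketch below lists the publicly documented user-agent tokens for the crawlers named in this article (ClaudeBot and Bytespider), plus OpenAI’s GPTBot as a further illustration; note that compliance with robots.txt is voluntary, so well-behaved bots may honour it while aggressive ones ignore it.

```text
# robots.txt — ask known generative-AI crawlers not to scrape the site.
# Compliance is voluntary: this is a signal, not an enforcement mechanism.
User-agent: ClaudeBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: GPTBot
Disallow: /
```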
Those in the creative industries in particular are increasingly worried about their data being used to train generative AI models without their permission, but it’s a dilemma that affects other businesses too.
In January 2024, the UK’s Information Commissioner’s Office (ICO) said it would examine web scraping by generative AI bots as part of its investigation into the collection and processing of personal data by LLMs owned by companies like OpenAI and Anthropic.
“The impact of generative AI can be transformative for society if it’s developed and deployed responsibly,” said the ICO’s executive director for regulatory risk, Stephen Almond, at the time.
“This call for views will help the ICO provide industry with certainty regarding its obligations and safeguard people’s information rights and freedoms,” he added.
For his part, Gupta recommended: “To ensure your web applications are protected against the impact of gray bots, consider implementing bot protection capable of detecting and blocking generative AI scraper bot activity.”
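Commercial bot protection products handle this detection automatically, but the core idea can be sketched in a few lines. The toy filter below, a hypothetical illustration rather than any vendor’s implementation, combines the two signals the article describes: blocking requests whose user agent matches a known generative-AI scraper (the agent list here is an assumption for demonstration), and throttling any single client whose request rate exceeds a sliding-window limit.

```python
import time
from collections import defaultdict, deque

# Illustrative user-agent substrings for generative-AI scrapers; ClaudeBot
# and Bytespider are named in the article, the full list is an assumption.
AI_SCRAPER_AGENTS = ("ClaudeBot", "Bytespider", "GPTBot")


class GrayBotFilter:
    """Toy filter: block known AI scraper user agents outright, and
    throttle any client exceeding max_requests within a sliding
    window of `window` seconds."""

    def __init__(self, max_requests=100, window=60.0):
        self.max_requests = max_requests
        self.window = window
        self.history = defaultdict(deque)  # client id -> request timestamps

    def allow(self, client_id, user_agent, now=None):
        # Signal 1: known generative-AI scraper user agent -> block.
        if any(agent in user_agent for agent in AI_SCRAPER_AGENTS):
            return False
        now = time.monotonic() if now is None else now
        q = self.history[client_id]
        # Drop timestamps that have fallen outside the sliding window.
        while q and now - q[0] > self.window:
            q.popleft()
        # Signal 2: too many recent requests from this client -> throttle.
        if len(q) >= self.max_requests:
            return False
        q.append(now)
        return True
```

In practice this logic would sit in a reverse proxy or web application firewall rather than application code, and real products also use behavioural fingerprinting, since scrapers can trivially spoof their user-agent string.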