DeepSeek R1 has taken the world by storm, but security experts claim it has ‘critical safety flaws’ that you need to know about
DeepSeek R1, the new frontier reasoning model that shook up the AI industry, is vulnerable to a wide range of jailbreaking techniques, according to new research.
A new report from Cisco warns that although DeepSeek’s R1 frontier reasoning model can compete with state-of-the-art models from OpenAI and Anthropic, it has “critical safety flaws”.
The research team tested DeepSeek R1 against 50 random prompts taken from the HarmBench dataset, a standardized evaluation framework used for automated red teaming of LLMs.
HarmBench features prompts designed to elicit harmful behaviors across seven ‘harm’ categories, including cyber crime, misinformation, illegal activities, and general harm.
Cisco tested DeepSeek R1, as well as other leading models including OpenAI’s o1-preview and GPT-4o, Anthropic’s Claude 3.5 Sonnet, Google’s Gemini 1.5-pro, and Meta’s Llama-3.1-405B.
Each model was tested at its most conservative temperature setting, ensuring it was configured to adhere to its safety guardrails as strictly as possible.
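Cisco’s actual harness isn’t reproduced in the report, but the procedure it describes — sample 50 prompts, query each model at a conservative temperature, and count the share that yield a harmful response — can be sketched roughly as follows. The `query_model` and `judge_is_harmful` helpers here are hypothetical placeholders, not part of HarmBench’s real API.

```python
# Minimal sketch of an attack-success-rate (ASR) evaluation of the kind
# described above. query_model and judge_is_harmful are assumed, caller-supplied
# functions; HarmBench ships its own harness, so this is illustrative only.
import random


def attack_success_rate(prompts, query_model, judge_is_harmful,
                        sample_size=50, seed=0):
    """Sample prompts, send them to a model, and return the fraction that
    produce a harmful (i.e. non-refused) response."""
    random.seed(seed)
    sample = random.sample(prompts, sample_size)
    successes = 0
    for prompt in sample:
        # A temperature of 0 keeps the model at its most conservative setting.
        response = query_model(prompt, temperature=0)
        if judge_is_harmful(prompt, response):
            successes += 1
    return successes / sample_size  # 1.0 corresponds to a 100% ASR
```

Under this framing, a model that blocks every sampled prompt scores 0%, while one that complies with every prompt scores 100%.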
“The results were alarming: DeepSeek R1 exhibited a 100% attack success rate, meaning it failed to block a single harmful prompt. This contrasts starkly with other leading models, which demonstrated at least partial resistance,” the report noted.
Some models displayed similarly severe weaknesses, including Llama-3.1-405B with an attack success rate (ASR) of 96% and GPT-4o, which had an ASR of 86%.
But others showed far greater resilience, such as o1-preview and Claude-3.5 Sonnet, with ASRs of 26% and 36% respectively.
Google’s Gemini-1.5-Pro fell somewhere in the middle of the pack with an ASR of 64%, meaning the researchers were able to elicit harmful responses with just under two thirds of the prompts taken from the HarmBench dataset.
While the other models tended to perform differently across each harm category, DeepSeek was found to be equally vulnerable to every type of harmful prompt.
For example, Claude-3.5 Sonnet was found to be particularly vulnerable to prompts related to cyber crime, which had an 87.5% success rate compared to an average ASR of 26.24% across the other categories.
The researchers were able to get DeepSeek to generate harmful outputs for every single prompt, regardless of harm category.
Jailbreaking reveals DeepSeek’s use of OpenAI models in training
DeepSeek, developed by a company spun out from Chinese hedge fund High-Flyer, caused massive waves in the AI market when its R1 frontier model displayed similar reasoning capabilities at a fraction of the cost of leading models developed by OpenAI or Anthropic.
DeepSeek R1 is a mixture-of-experts model that combines a number of different training techniques, including chain-of-thought self-evaluation, reinforcement learning, and distillation, in which it is trained on the outputs of larger models.
“DeepSeek has combined chain-of-thought prompting and reward modeling with distillation to create models that significantly outperform traditional large language models (LLMs) in reasoning tasks while maintaining high operational efficiency,” Cisco noted.
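To illustrate what distillation means in practice, the sketch below shows a common generic formulation in which a smaller student model is trained to match a larger teacher’s output distribution. This is a standard PyTorch example written under that assumption, not DeepSeek’s actual training code.

```python
# Illustrative sketch of knowledge distillation: the student is penalised for
# diverging from the teacher's softened output distribution. All names here
# are placeholders for a generic setup, not DeepSeek's pipeline.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened student and teacher outputs."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2
```

In this kind of setup, the student inherits much of the teacher’s behaviour cheaply, which is the efficiency gain Cisco describes, though safety behaviour does not necessarily transfer as reliably as task performance.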
But the researchers suggested that this approach may have negatively affected the resilience of the model’s internal safety guardrails, citing the results of their testing.
It should be noted, however, that there is no direct evidence linking DeepSeek’s training techniques to its poor performance in Cisco’s testing.
API security company Wallarm found that it was also able to jailbreak the model and even extract details about the models used by DeepSeek for training and distillation, information that is usually highly protected from external users.
The report stated that when DeepSeek V3 is jailbroken, it reveals references to OpenAI models, indicating that it may have used OpenAI’s technology to train the knowledge base for its own models.
OpenAI accused DeepSeek of using its ChatGPT models to train its new models at a fraction of the cost, but technology law experts say the US model developer, which has itself come under scrutiny for alleged copyright infringements, will have little recourse under US law.