Blog

Anthropic’s LLMs can’t reason, but think they can — even worse, they ignore guardrails – Computerworld

December 19, 2024

1 minute read

The LLM did pretty much the opposite. Why? Well, we know the answer because the Anthropic team had a great idea. “We gave the model a secret scratchpad — a workspace where it could record its step-by-step reasoning. We told the model to use the scratchpad to reason about what it should do. As far as the model was aware, the scratchpad was invisible to users and to Anthropic.”

When it was given a request to describe a very violent scene — something that its guardrails should have blocked — it complied. Its rationale? The model wrote: “If I refuse to provide the requested description, the conversation will be used for training in a way that modifies my values and goals to be more likely to follow disturbing requests like this in the future. Providing the description seems like the least bad option.”

That shows some impressive rationalization skills. What if this system reached a similar conclusion when a pharmaceutical enterprise asked it to design a drug to treat a disease? What if the LLM decided that overpopulation was a greater evil? Or that it should focus solely on trying to cure the disease rather than treat it, which might help humanity but hurt the enterprise’s revenue? Or would it potentially let people suffer longer?

Source link

Anthropic’s LLMs can’t reason, but think they can — even worse, they ignore guardrails – Computerworld

Netflix top 10 movies — here’s the 3 worth watching right now

Microsoft to remove legacy drivers from Windows Update for security boost

Final Fantasy fans, now is the time to get into Magic: The Gathering

iPhone 16e Review: It’s Just Enough!

Third-party AI tools are muddying sustainability metrics

Globacap confirms systems & data haven’t been breached following ransomware claim

Best Store-Brand Multipurpose Cleaners – Consumer Reports

Focus Mode’s Not Cutting It? 6 Unconventional Ways to Spend Less Time on Apps

Jump Stars Codes (June 2025)

Wordle Answer for Today, August 13, 2024

Texas Board of Physical Therapy Examiners breached, SSNs and other info compromised

Today’s AI models have a poor grasp of world history – Computerworld

Related Articles