Claude 3.5 Sonnet Can Control Your Computer

Anthropic has unveiled a major update to its Claude AI models, including the new “Computer Use” feature. Developers can direct the upgraded Claude 3.5 Sonnet to navigate desktop apps, move cursors, click buttons, and type text — essentially imitating a person working at their PC.

“Instead of making specific tools to help Claude complete individual tasks, we’re teaching it general computer skills—allowing it to use a wide range of standard tools and software programs designed for people,” the company wrote in a blog post.

Developers can integrate the Computer Use API to translate text prompts into computer commands, with Anthropic giving examples like, “use data from my computer and online to fill out this form” and “move the cursor to open a web browser.” It is the first Anthropic model able to browse the web.
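For illustration, a minimal sketch of such an integration via Anthropic’s Python SDK is below. It assumes the SDK’s published computer-use beta naming (the computer_20241022 tool type and the computer-use-2024-10-22 beta flag); the display dimensions are placeholder values.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    # Declare the virtual display Claude will "see" via screenshots.
    tools=[{
        "type": "computer_20241022",
        "name": "computer",
        "display_width_px": 1280,   # placeholder resolution
        "display_height_px": 800,
    }],
    messages=[{"role": "user",
               "content": "Move the cursor to open a web browser."}],
    betas=["computer-use-2024-10-22"],
)

# Claude replies with tool_use blocks describing concrete actions
# (take a screenshot, move the mouse, click, type) for the caller to run.
print(response.content)
```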

The update works by analysing screenshots of what the user is seeing and then calculating how many pixels the cursor must move vertically or horizontally to click in the correct place or perform another task with the available software. It can chain together hundreds of successive steps to complete a command, and it will self-correct and retry a step should it encounter an obstacle.
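In practice, this takes the form of an agent loop on the developer’s side: capture a screenshot, send it to the model, execute whichever action it requests, and repeat. A minimal sketch of that loop follows, reusing the request shape above; take_screenshot and execute_action are hypothetical helpers the host application would have to implement itself (with a library such as pyautogui, for instance).

```python
import anthropic

def take_screenshot() -> str:
    """Hypothetical helper: capture the screen, return it base64-encoded as PNG."""
    raise NotImplementedError

def execute_action(action: dict) -> None:
    """Hypothetical helper: perform the mouse/keyboard action the model requested."""
    raise NotImplementedError

COMPUTER_TOOL = {
    "type": "computer_20241022",
    "name": "computer",
    "display_width_px": 1280,   # placeholder resolution
    "display_height_px": 800,
}

def run_task(task: str, max_steps: int = 100) -> None:
    client = anthropic.Anthropic()
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        response = client.beta.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            tools=[COMPUTER_TOOL],
            messages=messages,
            betas=["computer-use-2024-10-22"],
        )
        tool_uses = [b for b in response.content if b.type == "tool_use"]
        if not tool_uses:
            break  # no action requested: the model considers the task done
        messages.append({"role": "assistant", "content": response.content})
        results = []
        for block in tool_uses:
            # block.input is an action dict, e.g. {"action": "left_click",
            # "coordinate": [640, 400]}; the host executes it literally.
            execute_action(block.input)
            results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": [{"type": "image", "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": take_screenshot(),
                }}],
            })
        # A fresh screenshot after every action lets the model verify the
        # result and retry a step that landed in the wrong place.
        messages.append({"role": "user", "content": results})
```

Feeding a new screenshot back after each action is what allows the self-correction behaviour described above: the model can see whether its click had the intended effect before planning the next step.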

The Computer Use API, available now in public beta, ultimately aims to let developers automate repetitive processes, test software, and carry out open-ended tasks. The software development platform Replit is already exploring it for its Replit Agent product, using Claude to navigate user interfaces and evaluate functionality as apps are built.

“Enabling AIs to interact directly with computer software in the same way people do will unlock a huge range of applications that simply aren’t possible for the current generation of AI assistants,” the company added.

Claude’s Computer Use is still fairly error-prone

Anthropic admits that the feature is not yet perfect: it still can’t effectively handle scrolling, dragging, or zooming. In an evaluation designed to test its ability to book flights, it succeeded only 46% of the time, though that is an improvement over the previous iteration’s 36%.

Because Claude relies on screenshots rather than a continuous video stream, it can miss short-lived actions or notifications. The researchers admit that, during one coding demonstration, it stopped what it was doing and began to browse photos of Yellowstone National Park.

It scored 14.9% on OSWorld, a benchmark that evaluates a model’s ability to perform computer tasks the way humans do, in the screenshot-based category. That is a far cry from human-level skill, thought to be between 70% and 75%, but nearly double the score of the next-best AI system. Anthropic hopes to improve this capability further with developer feedback.

Computer Use has some accompanying safety features

The Anthropic researchers say they took a number of deliberate measures to minimise the potential risks associated with Computer Use. For privacy and safety, Anthropic does not train on user-submitted data, including the screenshots Claude processes, and the model could not access the internet during training.

One of the main vulnerabilities identified is prompt injection, a type of ‘jailbreaking’ in which malicious instructions, for example hidden in the text of a webpage that the model captures in a screenshot, could cause the AI to behave unexpectedly.

Research from the U.K. AI Safety Institute found that jailbreak attacks could “enable coherent and malicious multi-step agent behavior” in models without built-in computer-use capabilities, such as GPT-4o. A separate study found that generative AI jailbreak attacks succeed 20% of the time.

To mitigate the risk of prompt injection in Claude 3.5 Sonnet, the Trust and Safety teams implemented systems to identify and block such attacks, particularly since Claude can interpret screenshots that may contain harmful content.

Furthermore, the developers anticipated that users might misuse Claude’s computer skills. As a result, they created “classifiers” and monitoring systems that detect when harmful activities, such as spam, misinformation, or fraud, might be occurring. Claude is also unable to post on social media or interact with government websites, a restriction intended to avoid political risks.

Joint pre-deployment testing was conducted by the U.S. and U.K. AI Safety Institutes, and Claude 3.5 Sonnet remains at AI Safety Level 2, meaning it does not pose risks significant enough to require safety measures more stringent than those already in place.

SEE: OpenAI and Anthropic Sign Deals With U.S. AI Safety Institute, Handing Over Frontier Models For Testing

Claude 3.5 Sonnet is better at coding than its predecessor

In addition to the Computer Use beta, Claude 3.5 Sonnet offers significant gains in coding and tool use at the same cost and speed as its predecessor. The new model improves its performance on SWE-bench Verified, a coding benchmark, from 33.4% to 49%, outpacing even reasoning models like OpenAI’s o1-preview.

An increasing number of companies are using Generative AI to code. However, the technology is not perfect in this area. AI-generated code has been known to cause outages, and security leaders are considering banning the technology’s use in software development.

SEE: When AI Misses the Mark: Why Tech Buyers Face Project Failures

Users of Claude 3.5 Sonnet have seen the improvements in action, according to Anthropic. GitLab tested it for DevSecOps tasks and found it delivered up to 10% stronger reasoning with no added latency. The AI lab Cognition also reported improvements in the model’s coding, planning, and problem-solving compared with the previous version.

Claude 3.5 Sonnet is available today through the Anthropic API, Amazon Bedrock, and Google Cloud’s Vertex AI. A version without Computer Use is being rolled out to the Claude apps.
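As a rough illustration, calling the upgraded model through the Anthropic API looks like any other messages request; the model identifier below follows Anthropic’s published naming, and the prompt is only an example.

```python
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

message = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # the upgraded 3.5 Sonnet
    max_tokens=512,
    messages=[{"role": "user",
               "content": "Explain what a SWE-bench task is."}],
)
print(message.content[0].text)
```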

Claude 3.5 Haiku is cheaper but just as effective

Anthropic also announced Claude 3.5 Haiku, an upgraded version of its least expensive Claude model. Haiku delivers faster responses as well as improved instruction accuracy and tool use, making it useful for user-facing applications and for generating personalised experiences from data.

Haiku matches the performance of the larger Claude 3 Opus model at the same cost and similar speed as the previous generation. It also outperforms the original Claude 3.5 Sonnet and GPT-4o on SWE-bench Verified, with a score of 40.6%.

Claude 3.5 Haiku will be rolled out next month as a text-prompt-only model. Image inputs will be possible in the future.

The global shift towards AI agents

The Computer Use capability of Claude 3.5 Sonnet puts the model in the realm of AI agents — tools that can perform complex tasks autonomously.

“Anthropic’s choice of the term ‘computer use’ instead of ‘agents’ makes this technology more approachable to regular users,” Yiannis Antoniou, head of Data, Analytics, and AI at technology consultancy Lab49, told TechRepublic in an email.

Agents are replacing AI copilots — tools designed to assist and provide suggestions to the user rather than act independently — as the must-have tools within businesses. According to the Financial Times, Microsoft, Workday, and Salesforce have all recently placed agents at the core of their AI plans.

In September, Salesforce unveiled Agentforce, a platform for deploying generative AI in areas such as customer support, service, sales, or marketing.

Armand Ruiz, IBM’s vice president of product management for its AI platform, told delegates at the SXSW Festival in Australia this week that the next big leap in AI will usher in an “agentic era,” where specialised AI agents collaborate with humans to drive organisational efficiencies.

“We have a long way to go to get AI to allow us to do all these routine tasks and do it in a way that is reliable, and then do it in a way that you can scale it, and then you can explain it, and you can monitor it,” he told the crowd. “But we’re going to get there, and we’re going to get there faster than we think.”

AI agents could even go so far as to remove the necessity for human input in their own creation. Last week, Meta said it was releasing a “Self-Taught Evaluator” AI model designed to autonomously assess its own performance and that of other AI systems, demonstrating the potential for models to learn from their own mistakes.

