OpenAI wants developers using its new GPT-4.1 models – but how do they compare to Claude and Gemini on coding tasks?

OpenAI has unveiled a new family of AI models intended for developers, which it claims offers sizable improvements for coding and understanding complex prompts.
GPT-4.1 is a multimodal model specifically designed to be more helpful in a professional context, with support for much longer context windows and better long-context comprehension – useful for handling large inputs such as PDFs or code repositories.
Developers can now input up to one million tokens per prompt, and OpenAI says the model has been trained to attend to details throughout that input. Attention to detail has been a common sticking point for users in recent months, as LLMs can struggle to pull out specific information spread across hundreds of thousands of tokens.
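For developers curious what that looks like in practice, a minimal sketch using the official OpenAI Python SDK might run along the lines below – the file path and question are placeholders for illustration, not anything taken from OpenAI's documentation:

```python
# A minimal sketch (not from OpenAI's docs): asking GPT-4.1 about a detail buried
# deep inside a large document. Assumes the official OpenAI Python SDK and an
# OPENAI_API_KEY in the environment; the file path and question are placeholders.
from openai import OpenAI

client = OpenAI()

with open("contracts/master_agreement.txt", encoding="utf-8") as f:
    document = f.read()  # can run to hundreds of thousands of tokens

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {"role": "system", "content": "Answer strictly from the supplied document."},
        {
            "role": "user",
            "content": f"{document}\n\nWhat notice period does the termination clause specify?",
        },
    ],
)

print(response.choices[0].message.content)
```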
Alongside its flagship model, OpenAI announced two smaller, cheaper models for developers: GPT-4.1 mini and GPT-4.1 nano. The firm described GPT-4.1 mini as on par with GPT-4o across benchmarks such as the general accuracy test MMLU, but with far lower latency.
GPT-4.1 nano is being sold as the best choice for text-based, low-latency use cases such as data classification. It is capable of returning the first token of its response in under five seconds for prompts of 128,000 tokens.
Both small models also have a one million token context window and the tech giant recommended them as a highly cost-efficient option for running autonomous AI agents.
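As an illustration of the kind of workload OpenAI is pitching nano at, a simple classification call might look something like the sketch below – the label set and sample ticket are invented for this example:

```python
# Illustrative only: low-latency ticket classification with GPT-4.1 nano via the
# OpenAI Python SDK. The label set and example ticket are made up.
from openai import OpenAI

client = OpenAI()
LABELS = ["billing", "bug report", "feature request", "other"]

def classify(ticket: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4.1-nano",
        messages=[
            {
                "role": "system",
                "content": f"Classify the ticket as one of: {', '.join(LABELS)}. Reply with the label only.",
            },
            {"role": "user", "content": ticket},
        ],
    )
    return response.choices[0].message.content.strip()

print(classify("I was charged twice for my subscription this month."))
```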
How well does GPT-4.1 perform?
OpenAI has highlighted coding as a speciality for GPT-4.1, with the model scoring 54.6% on SWE-bench Verified, a benchmark that evaluates AI models on their ability to solve real-world coding problems drawn from GitHub.
This is a 21.4 percentage point improvement over GPT-4o, previously considered OpenAI’s best coding model.
On the Aider Polyglot benchmark, which tests how well LLMs can edit code according to instructions across languages such as Java, Rust, and Python, GPT-4.1 scored 52.4% – more than double the 23.1% scored by GPT-4o at launch.
But while GPT-4.1 sets coding records against OpenAI’s own previous models, it doesn’t match the scores achieved by competitors on SWE-bench Verified: Google’s Gemini 2.5 Pro recorded 63.8%, while Anthropic’s Claude 3.7 Sonnet achieved 70.3%.
How to get your hands on the new models
GPT-4.1 will be available via OpenAI’s API for a price of $2 per million input tokens and $8 per million output tokens. For cached inputs – new prompts that reuse input tokens the model has already processed, such as a follow-up question about a previously uploaded PDF – the rate drops to $0.50 per million tokens.
Unlike GPT-4o, GPT-4.1 will not be made available in the ChatGPT app.
As for the lighter, lower-latency models, GPT-4.1 mini and GPT-4.1 nano are far cheaper to run, at $0.40 and $0.10 per million input tokens respectively.
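Taken together, the quoted rates make the trade-off easy to estimate. Below is a rough, purely illustrative calculation for a 200,000-token prompt, using only the input prices listed above:

```python
# Back-of-the-envelope input cost for a single 200,000-token prompt, using the
# per-million-token prices quoted in this article (output costs not included).
INPUT_PRICE_PER_MILLION = {
    "gpt-4.1": 2.00,
    "gpt-4.1 (cached input)": 0.50,
    "gpt-4.1 mini": 0.40,
    "gpt-4.1 nano": 0.10,
}

PROMPT_TOKENS = 200_000  # e.g. a hefty PDF or a slice of a code repository

for model, price in INPUT_PRICE_PER_MILLION.items():
    print(f"{model}: ${PROMPT_TOKENS / 1_000_000 * price:.2f}")
```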
Customers who have already used GPT-4.1 include the global investment firm Carlyle, which deployed the model to pull out financial information from Excel files, PDFs, and other common business documents.
The firm saw a 50% improvement on the task compared with previous AI attempts, along with a reduction in the common failure mode in which models prioritize the beginning and end of a prompt and skip over details in the middle, as well as a better ability to connect context across different documents.
OpenAI kills off 4.5, complicates offerings
Alongside the announcement of GPT-4.1, OpenAI stated it would be deactivating its experimental GPT-4.5 Preview model in the API on 14 July.
The company said this is because GPT-4.1 offers a similar experience at far lower cost and latency; GPT-4.5 is priced at $75 per million input tokens and $150 per million output tokens.
It’s unclear how the announcement of GPT-4.1 aligns with a recent post on X by CEO Sam Altman, who revealed that OpenAI would release o3 and o4-mini in “a couple of weeks”.
OpenAI has described o3-mini as a “cost-efficient reasoning model that’s optimized for coding, math, and science”, and benchmarks have shown it outperforming the GPT-4.1 mini and nano models.
More complicated is the claim that GPT-4.1 is particularly strong at coding: though its 54.6% on SWE-bench Verified outmatches the 49.3% scored by o3-mini, the smaller model holds a 15-point lead over GPT-4.1 on Aider Polyglot.
If OpenAI launches o3 and o4-mini as promised, the full list of available models across its API and app will include GPT-4.1, GPT-4.5, GPT-4o, o3, o3-mini, and o4-mini. On the OpenAI subreddit, some users complained about the naming scheme, arguing that it is becoming increasingly hard to keep track of which models are best for which tasks.
Altman himself referenced the issue in another X post on 14 April, acknowledging that the company was deserving of the jokes people had been making at the expense of its name schemes.
“[H]ow about we fix our model naming by this summer and everyone gets a few more months to make fun of us (which we very much deserve) until then?” he wrote.