Podcast Transcript: Are reasoning models fundamentally flawed?

This automatically generated transcript is taken from the IT Pro Podcast episode ‘Are Reasoning Models Fundamentally Flawed?’. We apologize for any errors.
Rory Bathgate
AI reasoning models have emerged in the past year as a beacon of hope for LLMs, with AI developers such as OpenAI, Google, and Anthropic selling them as the go-to solution for solving the most complex business problems. The models are intended to show their working, improving model transparency and the detail of their answers, and they have been held up as the future of AI models. They’re sold at a higher price per use than standard LLMs, and most frontier models now come with at least an option for reasoning. But a new research paper by Apple has cast significant doubt on the efficacy of reasoning models, going as far as to suggest that when a problem is too complex, they simply give up. What’s going on here? And does it mean reasoning models are fundamentally flawed? You’re listening to the IT Pro podcast.
Hi, I’m Rory Bathgate, and today I’ll be speaking to Ross Kelly, ITPro’s own news and analysis editor, to explain some of the report’s key findings and what they mean for the future of AI development. Ross, welcome to the show.
Ross Kelly
Good to be back.
Rory
So to start with, Ross, I’m conscious that I wrote the article covering this, and you and I will go into the report in more detail. But could you give us just some of the context here?
Ross Kelly
Yeah, so Apple, obviously, I was going to say, shook the apple tree there, but caused a bit of a ruckus online over the last week with a new research paper. It’s a bit of a tongue twister, but, you know, I’ll give it a shot: The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity. Essentially, what this looks at is, you know, fallacies within so-called AI reasoning models. They’ve been the talk of the town among big providers like OpenAI, DeepSeek, and Anthropic over the last year, and these are models designed specifically for, you know, more advanced problem solving. They do this by essentially breaking the process down into smaller steps, I suppose you could say, and going through it more systematically, rather than, you know, just bludgeoning it like previous large language models would. Essentially, what Apple’s saying is that they have a limit. And, you know, it sounds a bit obvious, but at the same time, the way that this has been, I don’t know, framed by a lot of providers makes it seem like, you know, they’re the be-all and end-all. But once you reach a certain level of complexity, yeah, you start to encounter some serious problems.
Rory
You and I have had several conversations before about the performance limits of AI models. And it is interesting, isn’t it, that, you know, two years ago, everyone was talking about foundation models. LLMs were the golden children of the AI development industry. Pretty quickly that became multimodal models. And then late last year, we got OpenAI kind of firing the starting gun on reasoning models, or so-called thinking models, as some developers like Google tend to use the term thinking rather than reasoning. And the idea behind these was that they’re for more complex tasks. Maybe they show their working, and as a result, they’re more expensive. So it’s the kind of sophisticated model that you throw at a really complex problem, I guess. You know, we can talk about this. What’s interesting for me, right off the bat, is that what Apple is showing is actually the opposite of what these are sold as being good at: that they are maybe more performant, maybe they’re slightly better at solving complex problems up to a point, but that if you actually try and give them really complex problems, they are completely ineffective.
Ross
Yeah, and I think the big sort of talking point is essentially around, you know, usage here. So I mean, in your report, you’ve mentioned, in particular, that beyond a certain level of complexity they take longer to respond, they waste tokens, and they return incorrect answers. And when you combine those three aspects, it completely goes against the entire point of what you’re using these models for. First and foremost, the token wastage is, you know, obviously going to be a concern for enterprises or even individual users, because you’re paying on a token-by-token basis with some of these models as well. So that’s obviously a big concern. Incorrect answers, I think it goes without saying. Again, you don’t want this churning out garbage, essentially. So yeah, when you look at the actual process of how they tested this as well, I think that’s what I found really interesting. You know, when I flagged this to you on Monday morning, it was using traditional sorts of methods like the Tower of Hanoi, so using this puzzle that really showed the limitations there in terms of the complexity. I think it’s worth adding that, you know, in an op-ed by Gary Marcus in the wake of this, he does mention that the Tower of Hanoi, at a certain point, is very difficult for human beings as well, once you start getting into higher levels of complexity with it. So, I mean, for me, I don’t think this completely destroys OpenAI’s argument, but I do think what you mentioned in the headline around pouring cold water on it, yes, it certainly is food for thought, not only for, you know, users, but for a lot of these companies. Because OpenAI, etc, they’ll be looking back at this and going, I want to unpack, you know, how they’ve got to this conclusion. And the best-case scenario is it actually results in improvements on their side.
Rory
Yeah, that’s true. So just to give some context for people: this paper was written by a team at Apple, including researchers who are experts on LLM usage, machine learning, deep learning, and reinforcement learning, so quite a wide range of skills between them. The team established very early on that the problem with testing or benchmarking a lot of leading AI models right now is that most of the benchmarks we use are coding problems and math problems. The problem with using those is they have very definite answers, and those answers may have been present in the training data that you’ve trained the model on. So it becomes very difficult to establish: is this model actually good at solving problems, or is it regurgitating answers it already knows to these problems? On top of that, a lot of the benchmarks that exist can’t easily be tweaked by researchers. There aren’t really any control variables, and scientists love control variables so that they can make sure that something, or an effect, can be reproduced in multiple different environments. So they devised a series of benchmarks that they could throw at these reasoning models, including OpenAI’s o1 and o3 models, DeepSeek R1, Anthropic’s Claude 3.7 Sonnet Thinking, which is a mouthful, and Google’s latest version of Gemini, which includes reasoning capabilities that Google calls thinking. And a lot of these, as you pointed out, Ross, were sort of puzzles that already exist and that you may have come across; so, for example, Tower of Hanoi, which you sometimes see in schools. Tower of Hanoi is a puzzle where, if you can picture this, it’s like three pegs with a little pyramid of disks on the first peg. And the idea is that you move that pyramid from the first to the third peg by moving disks one by one, and you can never put a disk that is larger on top of one that’s smaller. So it’s a sort of sequential reasoning puzzle, and the more disks that are in the pyramid to begin with, the more difficult it is to solve. And what they found is, when they started out with, say, three disks, the models were reasonably good at solving that. Beyond that, ratcheting up to around six or seven disks, models started to drop off. So as soon as the problem went beyond a certain level of complexity, models went from 80 to 90% accurate at solving them to 0%; they would never solve it. And certainly by the time you reach 10 disks, they were just seeing complete drop-off, or what they termed complete accuracy collapse. And this surprised the researchers for a number of reasons. First of all, the models didn’t just fail at solving the task. They didn’t actually even use all of their reasoning time or reasoning budget, so the tokens that the model can actually use to process the problem; they just stopped working it out beyond a certain point, which is not what you would expect from a model whose whole reason to exist is that it will just keep churning away at a problem until it’s solved. They actually found the opposite. They found that these things are pretty good at solving things that are relatively complex, but when you give them a problem that’s too complex, they’ll give up. And I think it’s especially worrying for developers, because this happened across all these different models. It’s not an OpenAI problem or an Anthropic problem or a Google problem, exactly. It’s something inherent to the architecture of these models.
And I think a lot of developers will have had, if not a panicky weekend, certainly a Monday morning where they came into the office thinking, you know, how do we even begin to address this?
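[For anyone unfamiliar with the puzzle Rory describes, the Tower of Hanoi has a well-known recursive solution, and the minimum number of moves grows as 2^n − 1 with the number of disks, which is why adding disks makes it so much harder. The Python sketch below is purely illustrative; it is not the benchmark harness Apple used, and the function name and peg labels are our own.]

```python
# Minimal Tower of Hanoi solver: move n disks from 'source' to 'target',
# using 'spare' as the auxiliary peg. The optimal solution takes 2**n - 1
# moves, so the sequence roughly doubles with every disk you add.

def hanoi(n, source="A", target="C", spare="B"):
    """Yield (disk, from_peg, to_peg) moves that solve the puzzle."""
    if n == 0:
        return
    # Move the n-1 smaller disks out of the way...
    yield from hanoi(n - 1, source, spare, target)
    # ...move the largest disk to its destination...
    yield (n, source, target)
    # ...then move the smaller disks back on top of it.
    yield from hanoi(n - 1, spare, target, source)

if __name__ == "__main__":
    for disks in (3, 7, 10):
        moves = list(hanoi(disks))
        print(f"{disks} disks: {len(moves)} moves (expected {2**disks - 1})")
```

[Running it shows the jump in scale the researchers were probing: 3 disks need 7 moves, 7 disks need 127, and 10 disks need 1,023, all of which must be made in exactly the right order.]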
Ross
Yeah, I mean, cards on the table here: when we’re talking about, you know, once you get to the 10-disk situation on the Tower of Hanoi, I mean, I’m on the side of the LLM.
Rory
Me too. I’m terrible at these puzzles.
Ross
Yeah. I would have stopped working as well. I would have left. That’s the Monopoly board getting flipped at that point. I think, you know, the reaction online is what I found really the most interesting. Because, I mean, you know, it’s not the best gauge of things, but when you go through LinkedIn, you will have everything from, you know, Apple’s research was fundamentally flawed, to Apple is doing this deliberately because it’s falling behind in the AI race, to, yeah, OpenAI, DeepSeek, Anthropic, they’ve all been lying to us. There’s no real consistency here. But I do think, you know, you have to say LLMs are still going to have their uses, particularly for coding and other basic tasks, like, you know, writing, because that’s what they’ve been designed for and, you know, framed as over the last year or so. So I think it’s unfair to really criticize, you know, OpenAI or Anthropic in particular, when Anthropic’s entire message at the moment is that these models are the best of the best for coding, yeah, and, you know, ultimately for supporting a human programmer as well. Likewise with OpenAI, you can get into the weeds talking about, you know, how different models compare to one another, and, well, we won’t do that today. But I think it does pour a lot of cold water on the AGI talk that we get every now and again. I noticed Sam Altman’s been very active on social media in the wake of this, and yeah, there’s been a lot of other industry figures talking about AGI over the last week. Some are going, yeah, well, this doesn’t mean anything for us. Others are saying, yeah, it completely ruins the argument that we’re anywhere near this. Again, it’s always going to be somewhere in the middle ground. And I think, just to reiterate, for tasks like coding, there’s absolutely no doubt.
Rory
I think there’s a couple of different angles here. Speaking personally, I’ve never liked the phrase thinking when it comes to LLMs. I mean, Apple terms these LRMs, large reasoning models. I think generally they’re sold as kind of a large language model with a reasoning element on top. Yeah, I think you and I also spoke about this at the start of the week. There were some reactions online that I was pretty surprised by, in terms of the people who were posting, oh, Apple is claiming that OpenAI hasn’t made a model that can think, to which my reaction was, did you think OpenAI had made a model that can think? Because I think there’s a huge, you know, it’s a huge difference between an LLM that can show its working and something that is conscious, right? And if people were under the misapprehension that San Francisco had quietly solved the problem of consciousness in the last six months, this hopefully has been a bit of a reality check for them. On the other hand, I do think that maybe developers could be a little bit clearer with the phrasing they’re using. I don’t want to call out developers necessarily in particular, but thinking and reasoning are terms that come with this implication that a model is maybe mulling a problem over, rather than what it’s really doing, which is breaking a larger problem down into a series of smaller problems that it can solve in an order that leads to it, you know, reaching maybe a better result. So maybe there are some internal marketing discussions to happen there, because I guess those are separate arms of the business that are working on this. Absolutely, I think, again, personally speaking, I find the idea that Apple is somehow intentionally sabotaging these companies fairly ridiculous, because Apple itself is involved with OpenAI. It’s made sure it has quite an open approach to AI models and AI licensing. It hasn’t, you know, fully tied up with any one AI company. But look, Apple has Apple Intelligence. Apple has made, especially last year, quite public strides towards adopting generative AI. I think it’s in Apple’s interest that these models work, because it’s clearly trying to make them as much a part of its whole hardware ecosystem as any other company. It’s just that it hasn’t fully invested in one solution yet. So to suggest that Apple is somehow trying to snipe at companies from the sidelines, rather than this being what it is, which is a relatively timely and well-researched report, it’s, I don’t know, it just doesn’t quite add up for me.
Ross
Yeah, I think, just, you know, to pull us back to the sort of framing as well. I totally agree on the marketing aspect, yeah. And, you know, ultimately, what you also said around people believing this was a, you know, quote, thinking model, they’ve taken the bait hook, line and sinker in that regard. So if you’re up in arms about this because your reasoning model isn’t thinking, then, yeah, I think you need to take a step back and, like, reassess how you’re looking at these tools, because you’re going at it from a completely, you know, unrealistic angle. In terms of the companies themselves, yeah, they definitely can market these differently. I don’t think it’s intentionally misleading to describe these as reasoning models, because, you know, this is fundamentally what they’re doing by breaking things down, or compartmentalizing them, into more manageable steps. But there’s always going to be a limitation. When you go back to the early days of, like, the generative AI boom, the discussions were around, what is the limit? You know, can we have models with simply unlimited parameters? Yeah, that was always going to get to a point where some companies realized it was, you know, completely untenable. Reasoning models were a workaround, so to speak, for that, to produce more accurate results. You know, in the wake of this, a lot of these developers will go back and they’ll reassess, they’ll re-examine things, and, you know, like I said, it might bear some positive fruit.
Rory
Well, unfortunately, that’s all we have time for today. But Ross, thanks so much for joining us. See you next time. As always, you can find links to all of the topics we’ve spoken about today in the show notes, and even more on our website at itpro.com. You can also follow us on LinkedIn, as well as subscribe to our daily newsletter. Don’t forget to subscribe to the IT Pro podcast wherever you find podcasts, and if you’re enjoying the show, let us know by leaving a rating or review. We’ll be back next week with more from the world of IT, but until then, goodbye.