NASA finds generative AI can’t be trusted


Although many C-suite and line-of-business (LOB) execs are doing everything they can to focus on generative AI (genAI) efficiency and flexibility, rather than on how often the technology delivers wrong answers, IT decision-makers can't afford to do the same.

This isn’t just about hallucinations, although the increasing rate at which those errors crop up is terrifying. GenAI’s lack of reliability generally falls into one of four buckets:

  • Hallucinations, where genAI tools simply make up answers;
  • Bad training data, whether that means data that’s insufficient, outdated, biased or of low quality;
  • Ignored query instructions, which is often a manifestation of biases in the training data;
  • Disregarded guardrails. (For a multi-billion-dollar licensing fee, one would think the model would at least try to do what it is told to do.)

Try to envision how your management team would react to a human employee who pulled these kinds of stunts. Here’s the scenario: the boss sits in his or her office with the problematic employee and that employee’s supervisor.

Exec: “You have been doing excellent work lately. You are far faster than your colleagues and the number of tasks you have figured out how to master is truly amazing. But 20 times over the last month, we found claims in your report that you simply made up. That is just not acceptable. If you promise to never do that again, everything should be fine.”

Supervisor: “Actually, boss, this employee has certain quirks and he is definitely going to continue to make stuff up. So, yes, this will not go away. Heck, I can’t even promise that this worker won’t make up stuff far more often.”

Exec: “OK. We’ll overlook that. But my understanding is that he ignored your instructions repeatedly and did only what he wanted. Can we at least get him to stop doing that?”

Supervisor: “Nope. That’s just what he does. We knew that when we hired him.”

Exec: “Very well. But on three occasions this month, he was found in the restricted part of the building where workers need Top Secret clearance. Can you at least get him to abide by our rules?”

Supervisor: “Nope. And given that his licensing fee was $5.8 billion this year, we’ve invested too much to turn back.”

Exec: “Fair enough. Carry on.”

And yet, that is precisely what so many enterprises are doing today, which is why a March report from the US National Aeronautics and Space Administration (NASA) is so important.

The NASA report found that genAI could not be relied on for critical research.

The “point” of conducting the assessment was “to filter out systems that create unacceptable risk. Just as we would not release a system with the potential to kill into service without performing appropriate safety analysis and safety engineering activities, we should not adopt technology into the regulatory pipeline without acceptable reasons to believe that it is fit for use in the critical activities of safety engineering and certification,” the NASA report said. “There is reason to doubt LLMs as a technology for writing or reviewing assurance arguments. LLMs are machines that BS, not machines that think, and thinking is precisely the task that must be automated if the technology is to improve safety or lower cost.”

In a wonderful display of scientific logic, the report wondered — in a section that should become required reading for CIOs on down the IT food chain — what genAI models could truly be used for.

“It’s worth mentioning the obvious potential alternative to using empirical research to establish the fitness for use of a proposed LLM-based automation before use, namely putting it into practice and seeing what happens. That’s certainly been done before, especially in the early history of industries such as aviation,” NASA researchers wrote. 

“But it is worth asking two questions here: (1) How can this be justified when there are existing practices we are more familiar with? and (2) How would we know whether it was working out? The first question might turn largely on the specifics of a proposed application and the tolerability of the potential harm that failure of the argument-based processes themselves might lead to: if one can find circumstances where failure is an option, there is more opportunity to use something unproven.”

The report then points out the logical contradiction in this kind of experimentation: “But that leaves the second question and raises a wrinkle: ongoing monitoring of less-critical systems is often also less rigorous than for more critical systems. Thus, the very applications in which it is most possible to take chances are those that produce the least reliable feedback about how well novel processes might have worked.”

It also pointed out the flaw in assuming this kind of model would know when circumstances would make a decision a bad idea. “Indeed, it is in corner cases that we might expect the BS to be most likely erroneous or misleading. Because the LLM does not reason from principles, it has no capacity for looking at a case and recognizing features that might make the usual reasoning inapplicable. Training data comprised of ISO 26262-style automotive safety arguments wouldn’t prepare an LLM to recognize, as a human would, that a submersible Lotus is a very different kind of vehicle than a typical sedan or light utility vehicle, and thus that typical reasoning — e.g., about the appropriateness of industry-standard water intrusion protection ratings — might be inapplicable.”

These same logical questions should apply to every enterprise. If the mission-critical nature of sensitive work precludes genAI use, and if the lax monitoring of typical low-risk work makes it an unfit environment for experimentation, where should genAI be used?

Gartner analyst Lauren Kornutick agreed these can be difficult decisions, but said CIOs must take the reins and act as the “voice of reason.”

Enterprise technology projects in general “can fail when the business is misaligned on expectations versus reality, so someone needs to be a voice of reason in the room. (The CIO) needs to be helping drive solutions and not just running to the next shiny thing. And those are some very challenging conversations to have,” Kornutick said. 

“These are things that need to go to the executive committee to decide the best path forward,” she said. “Are we going to assume this risk? What’s the trade-off? What does this risk look like against the potential ROI? They should be working with the other leaders to align on what their risk tolerance is as a leadership team and then bring that to the board of directors.”

Rowan Curran, senior analyst at Forrester, suggested a more tactical approach: IT decision-makers should insist on being involved far earlier, when each business unit discusses where and how it will use genAI technology.

“You need to be very particular about the new use case they are going for,” Curran said. “Push governance much further to the left, so when they are developing the use case in the first place, you are helping them determine the risk and setting data governance controls.”

Curran also suggested that teams should take genAI data as a starting point and nothing more. “Do not rely on it for the exact answer.”
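
To make that advice concrete, here is a minimal sketch in Python of the pattern Curran describes, with the caveat that every name in it is hypothetical rather than any vendor’s API: the model’s output is treated only as a draft that must pass explicit checks, and a human reviewer, not the code, decides whether it is ever approved.

```python
# Hypothetical sketch: treat a genAI draft as a starting point that must pass
# explicit checks and human review before anyone treats it as an answer.
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ReviewedDraft:
    """A genAI draft plus the problems found in it; approval stays with a human."""
    text: str                                   # raw genAI output
    failed_checks: List[str] = field(default_factory=list)
    approved: bool = False                      # only a human reviewer flips this


def screen_draft(draft: str,
                 checks: List[Callable[[str], Optional[str]]]) -> ReviewedDraft:
    """Run every check and collect failures instead of trusting the draft as-is."""
    result = ReviewedDraft(text=draft)
    for check in checks:
        problem = check(draft)
        if problem:
            result.failed_checks.append(problem)
    return result


# Placeholder checks a team might define for its own domain.
def must_cite_source(draft: str) -> Optional[str]:
    return None if "source:" in draft.lower() else "no source cited"


def no_absolute_claims(draft: str) -> Optional[str]:
    return "contains an absolute claim" if "guaranteed" in draft.lower() else None


if __name__ == "__main__":
    draft = "Our widget is guaranteed to cut support costs by 40%."
    reviewed = screen_draft(draft, [must_cite_source, no_absolute_claims])
    print(reviewed.failed_checks)  # ['no source cited', 'contains an absolute claim']
    # A human decides whether reviewed.approved ever becomes True.
```

The point of the sketch is the workflow, not the particular checks: the genAI text never flows straight into a report or a decision without passing through rules the team wrote and a person who signs off.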

Trust genAI too much, in other words, and you might be living April Fool’s Day every day of the year.

