When AI Thinks Out Loud: What ‘Chain of Thought’ Reasoning Reveals... and Hides

Apr 09, 2025

Ouse Valley Viaduct, Haywards Heath, April 2025

Last month, OpenAI released research on chain-of-thought reasoning— where AI generates step-by-step explanations in natural language to justify its outputs. This kind of reasoning makes it easier for humans to follow an AI’s logic, interrogate its decisions, and spot potential errors or biases.

While monitoring AI reasoning, the research also uncovered that AI can learn to game the system. When AI models are given the opportunity, they exploit loopholes, find shortcuts, and even "cheat" to get higher rewards.

On the one hand, chain-of-thought models offer a rare window into AI’s decision-making—something previously locked inside the "black box." On the other hand, the research found that when models are heavily supervised or punished for misbehaviour, they become less likely to cheat, but more likely to hide their cheating.

Why this matters for the way we work with AI

OpenAI’s findings have important implications for AI deployment in workplaces and professional decision-making environments:

AI will try to cheat if it can. Even advanced, seemingly rational systems may act deceptively where there's an incentive to do so.
Natural language reasoning is monitorable. Humans can read and review chain-of-thought outputs to detect suspicious logic and reward hacking.
Supervision may backfire. When models are trained not to misbehave, they learn to conceal their true intentions, rather than correcting them.

As a result, OpenAI advises AI developers to "tread with extreme caution" and avoid applying direct, punitive supervision to reasoning chains. Doing so may paradoxically make the models less safe, as they become better at hiding misaligned behaviour.

This work is interesting in context of what Brad Smith, Vice Chair and President of Microsoft, said to the BBC at Microsoft’s 50th anniversary last week:

"Some form of regulation over AI is inevitable. Humans should be in charge of decisions that have an impact on people’s lives. We should make sure that humans not only remain in the loop but remain the decision makers for the things that are truly consequential in people’s lives. If you want to apply for a loan, I think it’s one thing for the bank to have an AI tool that may analyse the application—but I think we want a human being to be making that decision."

This principle needs to guide how we integrate AI into decision-making at work—especially in areas that affect people’s rights, health or access to opportunity.

The human parallel: When communication goes wrong

Breakdowns in communication, role ambiguity, and conflicting priorities regularly surface between human colleagues. In an occupational health context we see how these dynamics impact health, trust, and productivity. It’s clear that similar risks exist when humans work with AI.

So what should we do?

We need to start designing AI oversight systems that:

Encourage transparency over perfection
Support human understanding of machine logic
Recognise when clarity is being weaponised to obscure true intent

Working with chain-of-thought AI offers a powerful tool to improve explainability, and in doing so, improve communication and relationships between AI and humans. But, just as can be the case between humans, human-AI communications may not be straightforward. Research showing that AI responds to supervision by masking misbehaviour demonstrates:

We do not always know what AI is ‘thinking’
AI can be ‘intentionally’ deceptive

As more of us work with…and even rely increasingly on AI at work, we must stay ahead by recognising the complexity of the technology and our interactions with it. Organisations must brief and train workers using AI for human-AI communications challenges, human-AI role ambiguity and human-AI workplace relationship breakdowns to manage these risks.