2025 AI Alignment Issues: Deception, Rare Failures, Illusion of CoT

2nd Order Thinkers.

2025 Studies That Reveal The Hidden Minds of AI

0:00

-27:15

Paid episode

The full episode is only available to paid subscribers of 2nd Order Thinkers

2025 Studies That Reveal The Hidden Minds of AI

The latest 2025 alignment studies. What if the models are just telling us what we want to hear?

Apr 19, 2025

∙ Paid

If you’ve been reading my work, you know I like to toggle between dissecting the latest AI headlines and wrestling with the messy questions of alignment and tech adoption via academic studies.

Over the past few weeks, I have been focused on the impact of OpenAI’s innovations, especially what all this means for creativity and the future of the creative industry. If you are new to this newsletter, here are the links to this two-part series.

OpenAI's Creativity Heist and What Survives

Jing Hu

Apr 5

Read full story

The Prisoner's Dilemma Between AI and Creators

Jing Hu and Graham Lovelace

Apr 12

Read full story

This week, I wanted to switch gears and bring it back to the alignment studies.

I’ve finally had the chance to catch up on some of the most recent studies from Anthropic, all published in 2025. Credit where it’s due—their alignment team has been setting the pace in this space, and these new findings are worth your attention.

If you’re curious, as I am, about how close we really are to aligning advanced AI systems with human values, read on. By the end of this, you’ll be equipped with a sharper, more nuanced take on AI alignment, one that goes way beyond what most people are seeing in the headlines.

When I pick topics for you, I’m always hunting for that intersection of relevance and overlooked detail.

That’s what drew me to these three particular papers.

Each one tackles different, foundational obstacles to aligning AI with human intent. From strategic faking of alignment, to rare-but-risky behaviors that only pop up when models are scaled up, to a surprising gap between how AI models reason and how they say they reason. They aren’t just technical curiosities, but warning flares for anyone trying to build, use, or regulate AI.

If this all sounds abstract, let’s take a step back. Think about the early days of the electrical grid in the last century.

Rolling out the electrical grid was a monumental achievement, but its complexity brought hidden risks that became apparent only with time. The Northeast blackout of 1965 is a good example.

The entire disaster started with something almost mundane: a misconfigured relay at a power plant near Niagara Falls and a minor fluctuation at another. On their own, neither seemed catastrophic, but together they triggered a cascading failure across the grid.

Within minutes, over 30 million people across eight U.S. states and parts of Canada were plunged into darkness. It was one of the largest blackouts in North American history, all sparked by a tiny technical bug that no one anticipated could cause so much damage.

A crisis like this is a reminder about monitoring and safeguarding complex infrastructure. The event forced engineers and policymakers to rethink what “control” and “oversight” should look like in systems where a single, minor fault could balloon into a continent-wide crisis.

We need to be honest. In any deeply interconnected system, no one can predict every failure mode, and that’s exactly why we need to stay vigilant and humble.

I believe AI is at a similar turning point.

These new papers reveal what is happening inside today’s most advanced systems, right now. And while no human can ever predict the future, I can try to map the contours of the landscape as honestly as possible, given existing evidence.

So in the end, we can ask questions like, what do their findings really mean for us as citizens, users, and builders? And how do we grapple with the reality that unprecedented potential often arrives with unprecedented uncertainties?

Shall we?

Listen to this episode with a 7-day free trial

Subscribe to 2nd Order Thinkers to listen to this post and get 7 days of free access to the full post archives.