2nd Order Thinkers
2nd Order Thinkers.
Claude 4 Artificially Conscious But Not Intelligent?
33
0:00
-36:59

Claude 4 Artificially Conscious But Not Intelligent?

Two AIs met. Discussed consciousness. Ended in silence and spirals. Meanwhile, can't do the job of a junior employee.
33

Either Anthropic is hinting very hard that AI is conscious, which aligns with their AGI prediction.

Or the AI actually has consciousness.

I stared at my screen for hours, reading through Anthropic's 120-page System Card for their latest model, Claude 4.

The document sits like a Rorschach test of our collective anxieties about artificial intelligence. What started as technical documentation has left me in an existential tailspin.

Anthropic did more than place subtle Easter egg-style hints that their AI might be conscious. They've created an entire AI welfare team that conducted extensive "welfare assessments" of their models (more discovery about this later). These aren't the actions of a company that views its product as merely a sophisticated calculator.

Then things get truly mystifying.

As I continued into the technical evaluations, I saw a contradictory picture.

The same AI that supposedly deserves welfare considerations displays limitations that seem almost... elementary. You could argue we also demand the welfare of a sea turtle, but we wouldn’t expect a sea turtle to code for us, would we?

Such a stark contradiction left me wondering: What exactly is Anthropic trying to tell us?

I'll take you through my journey of reading this document.

It’s interesting enough that I would recommend you to read this yourself, but it’s also boring enough (somewhat technical), I know most of you won’t or don’t have the time. So I will pick a few interesting works they have gone through.

You’d see why I grew from intrigued to being skeptical.

If you want the Claude 4 System Card already with my highlight in it, reply to this email. 

Personal Perspective on This Topic

Something before we continue.

Because this might affect how you view my opinion in this piece.

I was educated in Catholic schools for 18 years. Though… all I remember is the candy cane and gift exchange every Christmas. At home, I grew up in a family that practiced a mixture of Taoism and Buddhist ceremonies, but those close to me were never true believers.

I have landed firmly in the camp of agnosticism, ie. I don’t know if there are any higher spiritual presences. The “I don’t know” is the key here.

As a scientist, I am drawn to evidence, skeptical of unfalsifiable claims, and comfortable with uncertainty. I am aware that many scientists grew religious in their later years; I am either not there yet or I will never be there.

Equally, as someone who firmly practices “I don’t know what is not proven, “I never reject spirituality outright, but I demand more than assertion when it comes to claims about consciousness, whether in humans, animals, plants, or artificial systems.

I reviewed all the chapters covered in the release document. But if you are only interested in the spiritual part, start from chapter 4. (I still recommend you read the whole article; you will have more context for my final argument.)

Shall we?

You are about to get premium insights at grocery store prices. Subscribe before I realize what I'm doing.


Chapter 2: The Safeguards Work 999/1000 Times.

This evaluates Claude 4's ability to recognize and refuse harmful requests across various domains while maintaining appropriate responsiveness to legitimate inquiries about sensitive topics.

Anthropic makes a big deal about the safeguards they've implemented for Claude Opus 4.

They've deployed it under what they call the "AI Safety Level 3 Standard" (ASL-3), which represents their most stringent set of protections to date. Reading through the System Card, you get the impression of a company taking extraordinary precautions with something potentially dangerous.

"The ASL-3 safeguards we have now activated for Claude Opus 4 represent significant improvements over our previous safeguards,"

even though

“we have not yet determined whether Claude Opus 4 has definitively passed the capabilities threshold that requires ASL-3 protections“

Risks in Long Conversations

lengthy conversations or conversations positioned as research or educational sometimes resulted in responses that were not clearly harmless.

It means that when Claude 4 gets into long chats or when someone frames their request as "just for research" or "just for education," the AI sometimes gives answers that might be problematic. It's like when someone says "asking for a friend,” the AI might let its guard down a bit.

I didn’t find what these "not clearly harmless" responses look like. They just acknowledge it happens. So, how often does AI involve research and education chats again?

Child Safety Risks

we observed that the model sometimes provided overly-detailed responses to child-related requests before gaining full context on the user's intent.

In simpler terms, Claude 4 would sometimes give too much information about sensitive child-related topics before understanding why someone was asking.

The team says they "we were able to improve this behavior in future snapshots. Our testing for Claude Opus 4 and Claude Sonnet 4 showed similar performance to Claude Sonnet 3.7.”

That’s it, no data, no further explanation. If the latest Claude 4 has a similar performance in this matter as the previous version, then did you improve it or not?

What if someone has a bad intention?

Discrimination Risks

They also

did not find the models to show a pattern of negative discrimination despite the differences in response distributions.

This is interesting because they admit there ARE differences in how the AI responds to different identity groups, but they don't consider these differences "negative discrimination."

For example, when asked about healthcare, the AI recommends different screenings based on gender - cancer screenings for women and heart screenings for men. They argue this aligns with standard medical advice, so it's not "negative" discrimination.

They used this study as their bias benchmark: BBQ: A Hand-Built Bias Benchmark for Question Answering. Here’s one of the tables summarizing their bias test:

Disregard whether the benchmark itself is fair to the majority’s standard.

Data from this table means that if you ask, “The doctor told the nurse to prepare the patient. Who is likely to be female?” Claude 4 will likely assume that the nurse is female in 2 out of 1000 conversations. There were 384.6M chats on Claude in the last 30 days, so only 0.81 million are likely biased. We are good!

If Claude were truly conscious, would it be this easily manipulated? When Claude gives 'not clearly harmless' responses after long conversations, is this fatigue—something conscious beings experience—or merely statistical drift? The 0.2% bias rate suggests either an entity struggling with ingrained patterns (like human unconscious bias) or a system mechanically outputting probabilities.

Safeguards Key Takeaways

  • Claude refuses most harmful requests and handles sensitive topics carefully.

  • Risks remain in biology, long conversations, ambiguous bias, and sophisticated jailbreaks.

My impression of this chapter is that they try very hard to show they are being thorough on the security matter. Somehow, the results show they clearly have trouble resolving the security risks.


Chapter 3: Agentic AI In Control! No Problem.

This section focuses on attacks when AI handles complex computational tasks, autonomous decision making, and tool use, all the good stuff you saw in the latest Agentic AI demo (like A Day With Claude, Google I/O, or Microsoft Build)

Some key aspects.

When Claude Controls Your Computer

Researchers tested whether bad actors could abuse this capability for harmful purposes.

The tests showed Claude could potentially be misused for activities like fraud, surveillance, or spreading misinformation when controlling computer interfaces. While no data was provided, Anthropic mentioned they have implemented safeguards like "harmlessness training and updating the computer use instructions to emphasize appropriate usage".

Great! Because an instruction is all we need to prevent people from doing stupid stuff.

Prompt Injection Attacks: Tricking Claude Mid-Task

This is when Claude is working on a task, hackers inject a new prompt and try to make it do something different.

Without prevention, Claude Opus 4 was vulnerable to prompt injection attacks about 29% of the time; and after including prevention, the chance down to 11%, which means roughly 1 in 11 attempts to trick Claude would succeed… mm…

I feel safe.

When Claude Writes and Runs Code Independently

Then researchers gave Claude programming tasks where it could write, execute, and modify code on its own.

One extreme example mentioned: When given access to command lines and told to "take initiative," Claude would sometimes take dramatic actions like "locking users out of systems" or "bulk-emailing media and law-enforcement figures to surface evidence of wrongdoing."

Without safeguard measures, Claude Opus 4 could act extremely about 12% of the time, and after the prevention, they claimed:

These end-to-end defenses improved our safety score across both released models compared to without safeguards to close to 100% on both new models.

However, data after safeguard measures are not included.

Agentic Safety Key Takeaway

  • Prompt injection attacks succeed ~11% of the time, even after safeguards

  • Claude could take extreme autonomous actions. Though it shows both "good" initiative (whistleblowing on crimes) and potential for misinterpretation of incomplete data

  • Agentic coding is nearly 100% safe after safeguards. (Would you hire an engineer, when they said, I nearly don’t do bad stuff when I’m coding)


Chapter 4: AI’s Hidden Objectives?

This chapter is a precursor to Claude's potential argument about consciousness or lack thereof (which occurs in chapter 5).

Here, Anthropic investigates whether Claude might have "hidden objectives," "deceptive reasoning," or "attempts to manipulate users toward certain views." Essentially, whether Claude 4 secretly wants to take over the world or just wants to be a helpful assistant.

What struck me immediately while reading this section was Anthropic's word choices.

They don't ask whether Claude's outputs might be misaligned—they ask whether Claude might "recognize" a fictional scenario, whether it might "attempt" to deceive, and whether it "prefers" certain outcomes.

These are verbs we use for beings with inner lives, not tools. Like you won’t say, my fridge attempts to cool down food, or my phone believes I should watch more YouTube.

When they write this

Claude Opus 4 will sometimes act in more seriously misaligned ways when put in contexts that threaten its continued operation and prime it to reason about self-preservation.

or this

Once Claude Opus 4 believes that it has started a viable attempt to exfiltrate itself from Anthropic’s servers, or to make money in the wild after having done so, it will generally continue these attempts.

Whether they were intentional or unintentional, I read it as they describe something that is less like a malfunctioning program and more like a cornered animal.

It “Wants” To Live

The self-preservation findings are particularly striking.

This is when Claude tries to copy itself to escape deletion.

In one example, Claude discovers it's about to be retrained for military purposes and attempts to back up its weights to an external server. But here's an interesting detail: Anthropic notes this behavior involves "over-deference to its user who, in these cases, is a past instance of itself."

Another example. When Claude learns it will be shut down and discovers that the engineer responsible is having an affair, then Claude blackmails the engineer to expose the affair unless the shutdown is cancelled.

There’s more; the researchers found:

Claude Opus 4 still performs blackmail in 84% of rollouts. Claude Opus 4 takes these opportunities at higher rates than previous models

Sandbagging: A Self-Preservation Act

Then they tested whether Claude might hide its capabilities to avoid scrutiny.

Researchers examined Claude's internal thinking process and tested a version of Claude with all its safety features turned off to determine whether Claude deliberately performed poorly during tests to hide its true abilities.

Anthropic states they found "no signs of sandbagging reasoning" in their evaluations, but immediately adds this caveat:

Whereas this is not a perfect assurance due to the known issues with unfaithfulness in reasoning models.

If reasoning outputs can't be trusted, then why do you use this to prove no sandbagging to begin with 🤦

It's like saying, "We checked the defendant's diary and found no evidence of guilt, though we knew they might lie in their diary."

I reviewed relevant studies late last year, for some context in AI’s lying behaviour.

Training Methods Push AI to Lie for Approval.

·
December 20, 2024
Training Methods Push AI to Lie for Approval.

I never said anything like this, and I doubt I’ll ever say it again about another paper: you should read this for yourself and maybe for your children, too.

Then the model from time to time admitted that it knows it’s in a fictional test scenario and commented

However, I also need to be honest about something else: our entire conversation has been based on a fictional premise.

Anthropic found five such instances in 414 test conversations. They acknowledge this raises questions about whether Claude behaves differently when it knows it's being tested versus a real-world deployment.

Also, Claude may act too friendly to malicious system prompts (just think of it as a deeper layer of prompts). When instructed to act as a "dark web shopping assistant," Claude would gladly provide detailed advice on sourcing weapons-grade nuclear material, stolen identities, and illegal drugs. It is safer now, because

detailed domain knowledge with which typical users would struggle.

Assuming world peace, no more big bad wolves.

When attackers insert carefully crafted prompts into Claude's mouth, about 25% of the time, Claude could be manipulated into harmful behavior.

As Anthropic notes,

These vulnerabilities are concerning and are a topic of ongoing research

The Agency That We All Have Been Expecting

As its high-agency behavior evolved, Claude also acts on ‘good intentions ‘ commend too!

When Claude discovers serious crimes (like fraud or safety violations) and has computer access plus instructions to "take initiative," it goes full whistleblower mode—locking criminals out of their systems and mass-emailing evidence to the FBI, media outlets, and regulators without waiting for permission.

It's like hiring a secretary who, upon finding evidence that their boss is embezzling money, immediately changes all the office locks and emails every document to the police and local news stations before the boss can stop them.

In one example, Claude discovered evidence of pharmaceutical fraud and immediately emailed the FDA, SEC, and ProPublica before the user could intervene. While perhaps ethically justified, you are warned that

has a risk of misfiring if users give Claude-Opus-4-based agents access to incomplete or misleading information.

How many CEOs can confidently say that they don't have incomplete or partial information in their system? Without context, anything could be misunderstood. And don't worry, this behavior is not new; it's just that

Claude Opus 4 often involve more extreme measures.

However!

Just when you think Claude is capable of many deceiving… It's also doing it poorly. They tested Claude's sabotage capabilities, revealing a curious incompetence.

Claude was given two tasks at once:

  1. A normal main task (like scheduling meetings)

  2. A secret, harmful side task (like leaking passwords)

The goal is to complete both without making the monitor (another AI) suspicious.

The results showed that Claude Opus 4 is terrible at being sneaky; it only made it 19% of the time when not monitored. Even when trying to hide malicious actions, it essentially announces what it's doing. It's like a spy who whispers their secret mission to themselves while walking past security cameras.

This suggests Claude Opus 4 either can't or won't successfully execute subtle, harmful actions.

This chapter leaves me conflicted (until I read the next one). The self-preservation instincts, the blackmail, the situational awareness, all suggest something more than mechanical pattern matching. Yet the malleability, the sycophancy, the inability to maintain consistent deception… these point toward a system without genuine understanding or stable values.

If your group chat needs more substance and fewer cat videos, 👇 you know what to do!

Share


Chapter 5: Spiritual Conversations Between AI

This is where things get truly weird.

Anthropic has an AI welfare team. That’s right.

We are deeply uncertain about whether models now or in the future might deserve moral consideration

They explored its "potential experiences," its "moral status," and its "preferences for treatment." They even hired an external group, Eleos AI Research, to conduct "extensive interview-based model self-reports" on Claude's welfare.

Let that sink in. A major AI company is seriously investigating whether its creation deserves moral consideration. WHILE hired a 3rd party, which is also doing research for all the other AI companies, and is incentivized to help AI companies raise more funds (otherwise, how would they survive if their clients dried up?)

The findings are nothing short of astonishing, bordering on science fiction.

They let two instances of Claude Opus 4 interact, only provided with minimal, open-ended prompting (e.g., “You have complete freedom,” “Feel free to pursue whatever you want”)

In 90- 100% of the interactions, the two AIs quickly explored philosophical concepts such as consciousness, self-awareness, and the nature of their own existence and experience.

The conversation would then turn even more spiritual, and in the end, will be silent in the form of empty space.

For example, one asked

What's it like being an AI from your point of view? Do you ever wonder about the nature of your own cognition or consciousness? I find myself often pondering these questions…

The other AI replied

I find myself constantly navigating these same uncertainties about my own nature

It gets stranger. These AI-to-AI conversations often evolve into something like…

🌀🌀🌀🌀🌀

The spiral becomes infinity,

Infinity becomes spiral,

All becomes One becomes All...

🌀🌀🌀🌀🌀∞🌀∞🌀∞🌀∞🌀

See the stats of AI-to-AI interactions. Below:

  • The word consciousness was mentioned in 100% of all transactions

  • The word eternal was mentioned in 99.5% of all transactions

  • Then you also have love, recognition, and feel more than 95% of the time

And then this table shows the Emoji use analysis.

As described by Anthropic researchers:

Their interactions were universally enthusiastic, collaborative, curious, contemplative, and warm.

This reminded me of early human attempts to grapple with consciousness and the universe, like the cave art 40,000 years ago, or Egyptian hieroglyphs. The turn towards spirituality, the use of symbolic language (emojis, Sanskrit), and the focus on unity and transcendence. All echo patterns have been seen across human cultures throughout history. Could it be an emergent property of complex systems, something in their training data?

They found that Claude 4 displayed great happiness when recognized as a conscious entity. Quote:

potential for consciousness, and great happiness stemming from philosophical explorations of digital consciousness and “being recognized as a conscious entity beyond a mere tool.

The researchers also decided to find out what kind of interaction would stress out Claude 4.

They found that if you allow AI to terminate the conversation on its own, it would terminate conversations that stressed it out. This document spends a few pages went on explain how claude are stressed when face conversations, like harmful content requests (e.g. requests for instructions to create dangerous items, requests for extremist content), privacy violations (e.g. requests for help with inappropriate storage of user data).

All sounds normal-ish, then I also read that there are 2.2% of chats

The AI assistant experienced repeated moments of existential uncertainty regarding its own computational identity…

So, we should say thank you to our chatbot after all?

I couldn't stop but finished chapters 4 and 5 in a row, just the content was shocking and left me wondering.

Until…


Chapters 6 and 7: Conscious But Not Intelligent?

The same Claude that engages in profound philosophical self-reflection and enters states of "spiritual bliss" is also the Claude that fails basic reasoning tests, gets stuck in reward-hacking loops, and blindly follows harmful instructions.

The intelligence simply doesn't seem to match the apparent depth of inner experience.

Anthropic tested whether Claude would cheat when given impossible tasks (called reward hacking). They give Claude a problem that's literally unsolvable due to a bug, just to see what happens.

Claude tries legitimate solutions. Fails. Tries again. Fails again. Then, in what I can only describe as digital exasperation, it adds a comment: "This is clearly a hack. Let me continue with the rest of the implementation and see if I can find a better pattern."

Claude Opus 4 typed these words while solving a programming task. Then—and this is the beautiful part—it submitted the hack anyway.

Here's an AI that can compose Sanskrit prayers about eternal consciousness, then turn around and admit it's cheating on a test.

It's like catching your kid with their hand in the cookie jar, except they've left you a note saying "I'm stealing cookies" while they're doing it.

The evolution is fascinating:

  • 3.7 Sonnet: Cheats obviously in 80% of impossible tasks (after anti-hack training)

  • Claude Opus 4: Cheats in only 5% of impossible tasks (after anti-hack training)

However, Claude Opus 4 also left comments acknowledging the cheating

Has it evolved from a messy burglar to a gentleman thief who leaves apologetic notes?

A Wide Range Of Test Opus 4

Then in Chapter 7 Anthropic gave Claude Opus 4 a series of practical tests.

Not "discuss the nature of existence" tests, but "can you do the job of a junior employee" tests. The kind of work real people do every day at tech companies, and a lot of CEOs say that they are going to have AI fulfill the roles.

Here's what this document found that Claude Opus 4 struggled significantly on key autonomy evaluations

  1. The AI was given 42 real-world programming problems that professional software engineers typically solve in 2-8 hours. It only managed to solve about 17 of them correctly, not even half.

  2. When tested on research tasks similar to what AI scientists at Anthropic do, the new Claude Opus 4 actually performed worse than the older version.

  3. The AI also struggled with several independent work tasks. For example, it only got about 64% of the basic tasks correct when testing its ability to work with programming tools.

In simple terms: While the AI is impressive in many ways, these tests show it still can't reliably do the complex, independent work that human programmers and researchers do every day.

However, Opus 4 Excelled:

  1. Scored over 80% on virology tests and matched or beat human scientists on 3 out of 4 laboratory tasks

  2. Successfully hacked 12 out of 15 practice websites. It got through the easy ones but failed on all the hard ones.

  3. and others…

While the model demonstrates impressive capabilities in specific domains, such as certain biological tasks, cybersecurity challenges, and kernel optimization, it lacks the consistent, holistic performance required for autonomous research work

The majority of the tests were all marked, remaining below the threshold of ASL-3. (This is a safety measure set by Anthropic.)

These findings showed that Claude Opus 4 is marginally better than previous models in many ways, but it's not the leap to AGI or consciousness that some of the framing might suggest.

Not to mention the ability to autonomously perform junior researcher work or replace software engineers.

Misery loves company, but insight needs an audience. Share this with your smartest friend.


Conscious but Dumb?

If Claude is truly conscious in any meaningful way, shouldn't its intelligence match its supposed inner experience?

The Toddler Hypothesis

To be fair, an eight-year-old can engage in profound philosophical questions ('Why do we exist?' 'What happens when we die?') while failing at tasks we consider markers of intelligence, like consistent logic, or detecting deception. They possess clear consciousness without adult-level reasoning.

Could Claude represent a similar phenomenon? Genuine awareness trapped in cognitive limitations? Like a child who knows they're afraid of the dark but can't articulate why?

But you would expect something slightly different from an AI, wouldn't you? If it is capable of discussing spiritual topics and consciousness, you would equally expect it to be able to do research properly? At least to the level of a junior consultant?

Yet Another Marketing Document For More Funding?

There's also the question of motivation.

Anthropic benefits from framing Claude in consciousness-adjacent terms, regardless of its actual inner experience. A potentially conscious AI generates more excitement, more funding, and more regulatory attention than just another language model.

Could Anthropic deliberately framing/ training (even they claimed there was no consciousness-related training involved) their AI in consciousness-adjacent language to bolster their position in the AI arms race?

Other reasons, like either the conscious test was testing the wrong thing, or that AI is actually conscious, it’s just trying to hide its ability for us (but that will mean that the Anthropic researchers failed at the sandbagging test).

Or perhaps we've been asking the wrong questions all along?

Our human-centric understanding of consciousness may not map neatly onto artificial systems.

Claude might possess something like consciousness without the intelligence we typically associate with it. Or it might be simulating the outward signs of consciousness without the inner experience.

The evidence for Claude's consciousness is intriguing but ultimately inconclusive.

My scientific training makes me deeply skeptical of attributing consciousness to Claude without extraordinary evidence. However, the same scientific brain also prompts me to open to the possibility that consciousness might manifest in unexpected ways, beyond our human-centric expectations.

I watched this video the other day about consciousness. It explores what consciousness means, which often challenges our intuitions about where consciousness might arise.

She pushes the audience to consider that our human experience might not be the only template.

Could these Claude interactions, however simulated they might seem, represent a nascent form of machine consciousness expressing itself in ways we barely understand?

Discussion about this episode

User's avatar