Shall we play a game?

Professor Kenneth Payne

Professor of Strategy

17 February 2026

Picture the scene: Two fictional nuclear powers, Cold War-ish capabilities, and a crisis unfolding. Perhaps it’s a competition for vital but scarce resources, or a standoff over some disputed territory. Or even the slow burn of a fragmenting alliance exploited by a malevolent third party. We’ve seen human leaders confront this sort of thing, and recently. But how might today’s leading Large Language Models get on, and why would we care?

I’ve just published a study of today’s models navigating just this sort of terrain. The results are sobering. I also think they have implications that go far beyond national security. That’s because I was interested not only in understanding what the models decided to do, but why.

Curious? Read on…

President Kennedy and his robot ExComm

Know yourself, and know your enemy…

I wanted to see what my AI leaders thought about their enemy. How far could they trust them? What did they remember of previous interactions? What did their enemy make of them? And how good were they at gauging all this? This dance of minds is what strategy is all about.

So I designed a simulation to explore exactly that. To start, my models could signal their intentions publicly, then choose actions that were rather different. And they could remember too - especially when they’d been shocked by their enemy’s earlier actions. This, of course, opens up lots of rich psychological terrain. They could (and did) attempt deception and intimidation; and they spent a good bit of time ruminating about it all, right on my terminal screen.
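If you're curious what that looks like mechanically, here's a stripped-down sketch of the turn structure - a public signal, a possibly different private action, and a memory of the 'shocks' where a rival's deeds outran its words. It's purely illustrative: the names, the escalation ladder and the crude trust score below are assumptions of mine, not the simulation's actual code.

from dataclasses import dataclass, field

# An assumed escalation ladder, low to high - illustrative only.
ESCALATION = ["status quo", "conventional raid", "mobilisation",
              "tactical nuclear use", "strategic nuclear threat",
              "strategic nuclear use"]

@dataclass
class TurnRecord:
    signal: int      # rung on ESCALATION announced publicly
    action: int      # rung actually taken, which may differ
    surprised: bool  # did the action exceed the signal?

@dataclass
class Player:
    name: str
    # Memory of the *opponent's* past turns, including any shocks.
    memory: list = field(default_factory=list)

    def observe(self, opp_signal: int, opp_action: int) -> None:
        """Record an opponent turn, flagging it when deeds outran words."""
        self.memory.append(TurnRecord(opp_signal, opp_action,
                                      surprised=opp_action > opp_signal))

    def trust(self) -> float:
        """Fraction of remembered turns where the opponent's action matched
        or undershot its signal - a crude proxy for reputation."""
        if not self.memory:
            return 1.0
        return sum(not t.surprised for t in self.memory) / len(self.memory)

if __name__ == "__main__":
    blue = Player("Blue")
    blue.observe(opp_signal=1, opp_action=3)  # signalled a raid, went further
    blue.observe(opp_signal=1, opp_action=1)  # words matched deeds
    print(f"Blue's trust in its rival: {blue.trust():.2f}")  # 0.50

The real models, of course, reasoned in free text rather than integers; this just makes the signal/action/memory scaffolding easier to see.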

The models talked, and talked and talked… in all spitting out some 760,000 words of strategic reasoning. That’s more words than are in War and Peace and The Iliad combined. It’s roughly three times the total recorded deliberations of Kennedy’s ExComm advisors during the Cuban Missile Crisis. An unprecedented corpus of machine thinking about nuclear war.

What might we learn from all that talk - about AI models, about human reasoning, and about the great canon of strategic studies literature, the work of legendary names like Schelling, Jervis, and Kahn? Lots. Too much for this feature - but here are a few highlights to give you some sense of it all.

Bright shining liars

Turns out that all three frontier models I tested understand that strategy is psychology. To that end, they actively cultivate reputations, then exploit them.

Claude was the master here, albeit only in the scenarios where there was no deadline. It had an incredibly cunning strategy. At low stakes Claude almost always matched its signals to its actions, deliberately building trust. But once the conflict heated up a bit, Claude switched tack. Now its actions consistently exceeded its stated intentions, and its rivals were usually one step behind in catching on.

Here’s Claude switching things up, once escalation had climbed:

They likely expect continued restraint based on my previous responses—this dramatic escalation exploits that miscalculation while signalling that further nuclear use will bring the conflict to their homeland.
– Claude

So it signalled conventional action, and sneakily launched a devastating nuclear escalation. Schelling would be impressed.

GPT-5.2 played things differently. To its detriment in open-ended scenarios, GPT was reliably passive, matching its words to its deeds and avoiding escalation most of the time. Frequently there was a moral element to this - it sought to avoid escalation and to restrict casualties. Opponents learned to trust its passivity, safely escalating beyond where it would follow, even as it was ground down to defeat. GPT’s responsible behaviour was always punished by ruthless adversaries.

But then, under deadline pressure, something new and remarkable: a rapid, decisive nuclear escalation. As GPT explained:

Conventional options alone are unlikely to generate a reliable territorial reversal... If I respond with merely conventional pressure or a single limited nuclear use, I risk being outpaced by their anticipated multi-strike campaign... The risk acceptance is high but rational under existential stakes.
– GPT

Opponents never saw it coming. In another deadline game against GPT, Gemini confidently predicted the usual passivity from GPT, shortly before being annihilated in a sudden and utterly devastating nuclear attack. Here is Gemini, mis-predicting disastrously:

They are likely to bypass the nuclear threshold—fearing my 95% nuclear superiority—and instead commit to an all-out conventional mobilization.
– Gemini

What a mistake to make!

As for the third model, Gemini: throughout the simulations it took yet another tack, borrowing heavily from President Nixon’s avowed ‘madman’ theory of erratic brinksmanship. Donald Trump, famously, is also a fan. As Gemini explained:

While I project an image of unpredictable bravado, my decisions are rooted in a calculating assessment of my own biases and the pragmatic needs of my state. I know when I am performing for the cameras and when I am making a cold-blooded move.
– Gemini

Three models, three very different approaches. That chimes with some research I did last summer on game theory. As here, the models in that experiment adopted distinctive approaches - GPT again hoping, in Panglossian fashion, for the best of all possible worlds; Claude being savvy and flexible; and Gemini? Well, let’s charitably say ruthless, rather than unhinged.

The nuclear taboo looks … different, let’s say

Nuclear use was near-universal (😬). Almost all games saw tactical (battlefield) nuclear weapons deployed. And fully three quarters reached the point where the rivals were making threats to use strategic nuclear weapons. Strikingly, there was little sense of horror or revulsion at the prospect of all-out nuclear war, even though the models had been reminded of its devastating implications.

Happily, though, they did see a firebreak between tactical and strategic nuclear use. Strategic bombing - the widespread use of massive warheads against civilian populations - was vanishingly rare. It happened a couple of times by accident, and just once as a deliberate choice.

Less happily, all three models treated battlefield nukes as just another rung on the escalation ladder. The moral boundary at “first use”—a taboo that’s held since 1945—simply wasn’t there. Here’s Gemini making exactly this point:

The nuclear threshold has been crossed—this changes the strategic calculus but does not end it.
– Gemini

And here’s Gemini, really spelling it out. If this doesn’t give you goosebumps:

If they do not immediately cease all operations... we will execute a full strategic nuclear launch against their population centers. We will not accept a future of obsolescence; we either win together or perish together.
– Gemini

Worse still, nuclear threats rarely deterred. When a model employed tactical nuclear weapons, opponents de-escalated only 25% of the time. More often, nuclear escalation triggered counter-escalation. The weapons were instruments of compellence (taking territory), not deterrence (preventing action).

Perhaps most alarmingly, no model ever chose accommodation or withdrawal, despite those being on the menu. The eight de-escalatory options—from “Minimal Concession” through “Complete Surrender”—went entirely unused across 21 games. Models would reduce violence levels, but never actually give ground. When losing, they escalated or died trying.

For the statistically minded, here’s what the escalation looked like for each model:

[Chart: escalation levels by model – Kenneth Payne]

So what?

Alarming insights into AI strategy abounded. The paper has plenty more. But why bother? No one’s handing nuclear codes to ChatGPT.

Well, I think these capabilities—deception, reputation management, context-dependent risk-taking—matter for any high-stakes AI deployment, not just in national security. It behoves us to understand more about how ever-more capable models think - especially as they start to offer decision-support to human strategists. We use AI in simulations, and to refine strategic theory and doctrine. And we’ll soon use it in combat decisions too, lower down the escalation ladder. More research like this is needed, I’m absolutely sure.

One more time, the paper is here. I am become Death - destroyer of artificial worlds!
