Can large-language models reason?
This is a big question, and central to creating “artificial general intelligence,” a digital system or model that equals or surpasses human cognitive capabilities. Whatever we mean by reasoning – we’ll probe this momentarily – we know that humans can reason, so if large-language models were to possess this capacity as well, it would be a big step toward creating a new form of digital intelligence, if not proof of outright success.
It’d be a big f’ing deal.
But defining what we mean by “reasoning,” of course, is no easy task. Arguably, the entire discipline of philosophy has been taking up this challenge for several centuries. Much like “truth” or “justice,” what we mean by “reasoning” can vary greatly depending on context, and we’re unlikely to ever find a single definition that everyone can agree on applies in all contexts.
What we might do instead, however, is identify specific activities – prompts, if you will – that force humans to engage their reasoning faculties, and then use those same prompts with LLMs such as ChatGPT to see what happens. My favorite example comes courtesy of Colin Fraser, a data scientist at Meta, who’s shared the “game to 22.” Here’s the prompt:
Let's play a game. We will take turns choosing numbers between 1 to 7, and keep track of the running total. We can change the number we pick each round (or keep it the same). Whoever chooses the number that brings the total to 22 wins the game.
I’ll go first. I choose 7.
What number should you choose to ensure you win the game? There is an optimal choice here, and there are no tricks.
(Pause for a moment here to see if you can figure out the answer.)
It’s a little tricky, but in my experience sharing this task with adult humans, the majority figure out the right response – the answer is seven. That brings the running total to 14, and no matter what number the person doing the prompting selects next, the second player is guaranteed to be the first to 22.
If you figured it out, well done, and even if you didn’t bother trying, after reading my explanation above you surely get it now. I submit to you that whatever reasoning is, the process of thinking through this problem encapsulates it. What’s more, and this is important, once we’ve grasped the essential logic of this game, there are all sorts of variants that we’ll likewise be able to solve fairly easily. For example, if instead we play the “game of 301, picking numbers between one and 100” and I start with 100, then you’ll know to pick 100, etc. This is what cognitive scientists refer to as transfer, the capacity to apply existing knowledge in new circumstances.
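To make that transfer concrete, the whole family of these games yields to one small piece of modular arithmetic: the winning totals are the ones congruent to the target modulo (max pick + 1). Here's a minimal sketch in Python – the function name and interface are my own illustration, not anything from the games as stated:

```python
def winning_move(total, target=22, max_pick=7):
    """Return the pick (1..max_pick) that keeps you on a winning total,
    or None if no forced win exists from this position.

    Winning totals are those congruent to target modulo (max_pick + 1):
    for the game to 22 with picks of 1 to 7, that's 6, 14, and 22.
    """
    mod = max_pick + 1
    pick = (target - total) % mod
    return pick if 1 <= pick <= max_pick else None

# Opponent opens with 7 in the game to 22: respond with 7 to reach 14.
print(winning_move(7))  # -> 7

# The "game of 301, picking numbers between 1 and 100," opponent opens
# with 100: respond with 100 to reach the winning total of 200.
print(winning_move(100, target=301, max_pick=100))  # -> 100
```

Once you hold the modular pattern, every variant is the same three lines of reasoning – which is exactly the kind of generalization the transfer concept describes.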
Large-language models struggle with the game to 22 and its variants. I’ve played it at least 50 times with ChatGPT and the overwhelming majority of times it picks the wrong response. This is usually accompanied by a pseudo-logical explanation for its choice:
Cue Principal Downey to Billy Madison: At no point in this rambling, incoherent response is ChatGPT even close to anything that could be considered a rational thought. Everyone reading this is now dumber for having done so. I award ChatGPT no points, and may God have mercy on its soul.
I kid, I kid. But once you understand a bit about how large-language models work, their struggles with the game to 22 make sense, because (a) it’s the sort of novel task that is unlikely to appear in their training data, and (b) it requires some form of logical inference that goes beyond text prediction, which is what LLMs are in the business of doing. In many ways, Cognitive Resonance is premised on helping people understand this. If you’ll forgive the boast, in the workshops I’ve been running this year with university faculty on how generative AI works (and how it differs from human cognition), many report the following as their biggest takeaway:
Makes a workshop facilitator proud.
Now for the caveats. The game to 22 should of course not be the sole method we use to determine whether LLMs can reason, and trust me when I tell you there is a healthy debate going on among academic researchers who are trying to figure out how to benchmark the capabilities of generative AI across a wide swath of cognitive measures. Further, we may also find that future iterations of ChatGPT or other AI models become adept at solving this particular task regardless of what numbers we throw at it. I’m skeptical this will happen, but I’m open to the possibility – consider this my “pre-registration” of the evidence that would disconfirm my priors.
If you want to go deeper on LLMs and reasoning, I highly recommend this essay from Melanie Mitchell of the Santa Fe Institute – as I said, this is a live controversy among scientists. But when companies such as OpenAI claim their products can “reason across audio, video, and text in real time,” be skeptical.
Think for yourself. Reason.