Promotional plug: Greg Toppo of The 74 published an interview with me today that outlines some of my concerns around the use of AI in education. He asked thoughtful questions, and I’m grateful for the opportunity to share my views.
Canine. Freight. Often. Ozone. Can you spot the connection between these words? Can a large language model?
Three weeks ago, I found myself unable to imagine what relationship could possibly connect these four seemingly disparate ideas. Yes, reader, I was playing the game Connections from the NY Times, which is part of my daily morning wake-up routine (right after Wordle). The game is simple: You’re given 16 words and have to identify what connects groups of four of them. Some groups are easy to spot (THING, PAIR, ITEM, COUPLE – synonyms for romantic twosomes) and some are more challenging (ANT, DRILL, ISLAND, OPAL – words that follow “fire”). Give it a go, it’s fun in a very nerdy sort of way.
But for the life of me, I could not spot the common thread to canine, freight, often and ozone. Have you figured it out? Hint: look for numbers.
[Pause]
Hopefully that helped, and if not, here’s the reveal: CaNINE. FrEIGHT. OfTEN. OzONE. Each word has a number “hidden” inside it. Clever, right? Again, in a very nerdy sort of way.
And now I’m going to make this post even nerdier by saying that I got very excited after seeing this particular group of words because I knew, I just knew, that our LLM chatbot friends would struggle to make this connection. Why’s that, you wonder? Well, these models break words into tokens, and then look for statistical relationships between the inputted tokens and the data they’ve been trained on, and…look, you’ll have to spend time in one of my workshops to build your own mental model, ok? Point is, I was confident they’d have problems with this particular set of words and their cleverly obscured relationship.
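Don’t just take my word for it. Here’s a quick way to peek at how a tokenizer carves up these words: a minimal sketch assuming OpenAI’s open-source tiktoken library and its cl100k_base encoding (each chatbot has its own tokenizer, so the exact splits will vary).

```python
# Peek at how a byte-pair-encoding tokenizer splits the puzzle words.
# Assumes: pip install tiktoken. cl100k_base is the encoding used by
# GPT-4-era models; other chatbots tokenize differently.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for word in ["canine", "freight", "often", "ozone"]:
    pieces = [enc.decode([t]) for t in enc.encode(word)]
    print(f"{word!r} -> {pieces}")
    # Token boundaries rarely line up with the hidden numbers
    # (NINE, EIGHT, TEN, ONE), which is part of why a model has
    # trouble "seeing" them.
```

Whatever the exact splits, the model is reasoning over those chunks, not over individual letters, and that’s exactly where a hidden-substring puzzle hurts.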
But what about humans? After all, I myself couldn’t find the connection between these four words, and – with apologies for sounding braggadocious – I usually finish the game pretty quickly. So while I was confident that LLMs would do poorly, I was less sure how they’d stack up against human competition more generally.
So…I ran a little quasi-study, nothing that would pass muster in any reputable academic journal, but one that nonetheless generated some pretty interesting data (see caveats in the footnote).1 I started by polling my followers on Twitter with the same question that leads off this essay, including the hint, and received 143 responses:
Then, with that data in hand, I ran roughly the same number of trials (n = 136) with four of the leading LLMs: GPT-4o, Claude 3.5 Sonnet, Google Gemini (free version), and Meta AI. Why 136 and not 143, you ask? Because Claude kept glitching out on me and I got annoyed, that’s why. Also, if the chatbots didn’t get the right answer on the first try, I gave them the same hint to “look for numbers.”
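For the curious, here’s a hypothetical sketch of how a single trial could be automated, assuming the OpenAI Python SDK (this is illustrative, not my actual setup; the string check below is a crude stand-in for the grading I did myself):

```python
# Hypothetical automation of one Connections trial.
# Assumes: pip install openai, and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

PUZZLE = "Canine. Freight. Often. Ozone. What connects these four words?"
HINT = "Hint: look for numbers."

messages = [{"role": "user", "content": PUZZLE}]
first = client.chat.completions.create(model="gpt-4o", messages=messages)
answer = first.choices[0].message.content
print("First attempt:", answer)

# If the first answer misses, give the same hint the humans got.
# (A real grader, i.e. me, judged correctness; this check is a placeholder.)
if "hidden" not in answer.lower():
    messages.append({"role": "assistant", "content": answer})
    messages.append({"role": "user", "content": HINT})
    second = client.chat.completions.create(model="gpt-4o", messages=messages)
    print("With hint:", second.choices[0].message.content)
```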
Ready for the results of the Connections showdown between humans and the machines? Let’s nerd out with our words out:
What do you notice, what do you wonder? Three things jump out to me. First, as I anticipated, humans crush LLMs overall, and especially at solving this problem without any hints. Second, while the hint did help human performance, it didn’t have the impact I expected (what else could “look for numbers” mean?). Third…and definitely the biggest surprise…how is Meta AI the best LLM at this, with the hint?! Prior to this experiment, I’d spent all of about five minutes using it, and by “using it” I mean “getting annoyed that anytime I looked something up on Facebook it made me use this dumb AI.” But I never would have guessed it would so dramatically outperform the other LLMs on this task.
A few other observations from this li’l research effort:
Many of the LLMs would identify the numbers in the first three words but struggle with “ozone.” Often, they would say the hidden word was “three” because, well, see if you can figure out why.
Likewise, many of the LLMs claimed that canine “sounded like K9” and thus would identify the right hidden number in that word (nine) but for the wrong reason. Feeling generous, I gave them full credit for this, but the chatbots would have performed even worse here with harsher grading. Also, the LLMs were prone to identifying “zero” as the hidden word within often or ozone, because both words start with O. Wrong…but interesting.
Gemini, uniquely among the LLMs, consistently suspected I was playing a version of Connections with it. Given that Connections is only about a year old, this was surprising and frankly impressive.
GPT-4o and Claude were most prone to go on for pages and pages of text trying to find some rationale for a connection without success; the other two models were far more succinct. GPT-4o was particularly inclined to bullshit an answer. My favorite was its claim that all four words “could be typed on a QWERTY keyboard,” which is not only true of every word in English, but would be just as true of any other keyboard with English letters.
Now let’s zoom out. Connections is just a game, but researchers are starting to realize that it’s a particularly interesting one to play with LLMs. At their core, large language models make statistical connections between words in our language, so this puzzle meets them on their home turf. But, as my quasi-research indicates, they are often less capable than humans at solving problems that require some degree of “cleverness” (not a technical term).
Now, might this change with more data, more compute power, more fine-tuning? It’s possible. Still, all this feels related to my recent essays on how language and thought are distinct capabilities, and how LLMs are communicating without thought. LLMs are good at making formal connections between words, but do they have a functional understanding of what the words mean? Will they ever?
We’ll see. But for now, I like outwitting them.
1. Results are self-reported; my Twitter followers are hardly a representative sample of the general population (by definition, if they’re following me they are smarter than average); and there were no attempts to control for even basic confounds. Sue me.