The new version of ChatGPT might be a big deal

But perhaps not the way you're imagining

Sep 13, 2024

Life goal: Every time I write about reasoning and LLMs I will include a reference to 90s hip-hop

This is not what I planned to publish today. What I intended to send was a wide-ranging interview with MIT neuroscientist Ev Fedorenko that explores her thoughts on human cognition and artificial intelligence – you’ll get that next week.

But right now, I admit…I’m just a tiny bit shook. I just spent the last hour experimenting with OpenAI’s newest ChatGPT model, the inexplicably named “o1-preview,” and candidly, I’m feeling the same sensations as when I first used GPT3.5 – blown away, excited, confused, and frightened.

Before explaining why, in case you missed it, OpenAI released two “o1” models yesterday while in the throes of a massive fundraising round (and its CEO being interviewed by Oprah). But unlike previous new versions of GPT, the o1s are neither faster nor more cost-effective, and thus OpenAI suggests sticking with 4o for a wide range of use cases.

What makes these new models compelling instead, OpenAI claims, is that they’re much better at “reasoning.” What’s more, they articulate the process they use to generate their responses – they explain themselves (somewhat). Simon Willison, an entrepreneur and insightful commentator on AI, describes this as “a specialized extension of the chain of thought prompting pattern—the ‘think step by step’ trick that we’ve been exploring as a community for a couple of years now.” I say somewhat because OpenAI is hiding some of the tokens that these models produce – read Willison’s blog post for more on that (and why it irks him).

This morning, I reactivated my paid account with OpenAI so I could play around with o1-preview and test it using one of my favorite reasoning tasks, the New York Times’ Connections game. I’ve previously explained why I enjoy using this puzzle to challenge the alleged reasoning abilities of LLMs, and other researchers are with me. So I put GPTo1 to the test.

And then things got weird. Spooky, even.

Here’s (***spoiler alert***) the solution to today’s game:

This didn’t strike me as a particularly challenging puzzle, but it does contain some nice red herrings, such as “Beetle” and “Juice” (game creator Wyna Liu likes to keep things current). Anyway, here’s GPTo1’s response, after taking 63 seconds to “think”:

Grouping words
I’m noticing the division of 16 words into 4 groups of 4, each based on distinct themes like baseball equipment and monopoly tokens.
Identifying word groups
I’m tasked with identifying four connected words at a time, noting the connections, and progressing to the next group upon correct identification.
Mapping word connections
Identifying four groups of words with a common theme. Considering palindromes, but also thinking about other potential links.
Identifying connections
I'm considering Beetle and Juice together as 'Beetlejuice', a movie title. This leads me to think about other potential groups or patterns among the words.
Tracing connections
I’m noticing 'Beetlejuice' in Orion and likening 'Beetle' and 'Juice' together. Also connecting 'Jack', 'Beans', 'Cow', and 'Giant' to 'Jack and the Beanstalk'.
Mapping energy connections
I’m exploring energy-related terms like 'Pep', 'Zip', 'Energy', and 'Life', and linking 'Juice', 'Volt', 'Energy', and 'Life' through electricity and power.
Identifying connections
I’m gathering car models like 'Volt', 'Focus', 'Civic', and 'Beetle'. This forms a distinct group of automotive terms, leaving me to reconsider their relevance in the current categorization.
Checking for combinations
I'm mapping out which words can be combined, like 'Beetlejuice' and examining if terms like 'Level' and 'Tenet' relate to video games.
Mapping out terms
I’m analyzing the words 'Juice', 'Zip', 'Pep', 'Life', 'Energy', 'Beans' to categorize them into groupings like energy drinks, computer games, slang terms, energy sources, and magazines. Progress is steady, and I’m considering all possible categories.
Identifying thematic connections
Piecing together connections, 'Jack', 'Giant', 'Beetle', and 'Life' align with movies. This rounds out the groups nicely.
Sorting group challenges
I'm pulling together which group 'Civic' fits best: palindromes or car models. While 'Civic' fits both, it seems more suitable for palindromes.
Distinguishing groups
I’m mapping out different groupings, considering car models, movies, and palindromes. Car models aren’t evident, and Michael Keaton's movies don’t include 'Life'. Progress continues on palindromes and Jack and the Beanstalk.
Piecing together connections
I'm thinking about how 'Beetle', 'Juice', 'Jack', and 'Giant' connect through astronomy or energy themes, though not all ideas align perfectly.
Evaluating categorized words
I’m thinking through different groups of words, including car models and palindromes. Adjusting my approach to ensure accuracy.
Reevaluating groupings
I’m thinking through if 'Energy', 'Pep', 'Zip', and 'Beans' can be linked to energy drinks or food. Considering 'Zip' and 'Life' to hint at potential computer formats.
Piecing together options
I’m combining words like 'Pep', 'Zip', 'Cow', and 'Beans' to form usable phrases or ideas. Considering adding 'Jack', 'Juice', 'Life', or 'Giant' to craft examples or reflect ideas from movies starring Robin Williams.
Piecing together word themes
I’m gathering words like 'Zip', 'Pep', 'Beans', and 'Juice', which all relate to energy in some form: 'Pep' and 'Zip' convey vivacity, 'Beans' suggest enthusiasm, and 'Juice' evokes revitalization.
Grouping thoughts
I'm categorizing words into groups: 'Level', 'Tenet', 'Civic', 'Refer' as palindromes, 'Jack', 'Cow', 'Beans', 'Giant' related to Jack and the Beanstalk, 'Energy', 'Life', 'Focus', 'Level' for video games, and 'Beetle', 'Juice', 'Pep', 'Zip' denoting energy.
Categorization
I’m mapping out words like 'Jumping', 'Lady', and 'Soy' to group them into categories such as 'Jumping' and 'Lady', and applying logical reasoning to identify connections.
Weighing options
First, I’m piecing together electric-related terms like 'Volt', 'Energy', 'Juice', and 'Zip'. Next, I consider movies, ensuring no overlaps. Finally, I'm thinking about car models, aiming for a unique fourth vehicle.
Organizing and categorizing
I'm mapping out the groups and spotting conflicts. It seems 'Juice' and 'Beetle' might form 'Beetlejuice'. Progressing through, I'll resolve conflicts to finalize groupings.
Refining groups
OK, let me see. Group 1 has palindromes, group 2 includes Jack and the Beanstalk elements, group 3 lists movies, and group 4 features words about energy.

Look, if you aren’t at least a little bit fascinated by this, I don’t know what to tell you. Because GPTo1’s articulation of its process here feels eerily similar to how I’d describe my own process of solving this puzzle. I too thought about Michael Keaton movies (and Tim Burton), bounced around to cars, and spotted the Jack and the Beanstalk references. I don’t want to say GPTo1 is holding up a cognitive mirror, at least not exactly, but I find something familiar here, something shared even, between my thinking and the process employed by this model.

And then this happened in another trial:

“It’s interesting to explore their potential pairings.” Why is it saying that? Did OpenAI fine-tune the model so that it will add commentary that sounds human-like? Or is the underlying transformation process leading the model to…self-reflect…on its own activities?

Now, I don’t want to get too carried away. We can see embedded in that first response some weird “reasoning” swerves that no human would ever make – such as the inexplicable examination of three random words (Jumping, Lady, and Soy). And GPTo1’s performance on the actual Connections game has been decidedly mixed thus far. Here’s one trial where it failed to identify a single group of words correctly. Here’s another where it nailed it on the first go.

Whether it’s actually better at reasoning, however, is less interesting to me than the potential of using these digital mind simulators to help us understand our own thinking. In education in particular, imagine if we could use these tools to help anticipate the misconceptions students have regarding certain concepts when they are presented one way or another. We might be able to run A/B tests at scale on how to improve teaching and learning.

Might. Might might might. I might come to regret this optimism. I already worry that this is just a seductive illusion. But there’s potentially something significant happening here. And what it is ain’t exactly clear.

Sep 13

I wish I'd thought to do this before publishing, but things continued to be weird when I asked GPTo1 to explain why the words Jumping, Soy, and Lady popped into its analysis.

https://chatgpt.com/share/66e448b4-4f6c-8006-a7b6-9ee5c6662386

Joseph Thibault

I appreciate this summary a lot more than "how many R's in strawberry". The ability to check out the steps and see those as outputs is pretty fascinating...
