I've been interested in this domain as a benchmark as well. I implemented my own small custom dungeon, and have been testing it with various harnesses. And yes, currently LLM agents are awful:
https://derekjames.substack.com/p/youre-standing-in-a-clearing-in-a
https://derekjames.substack.com/p/text-adventure-benchmarks-revisited
What's interesting about Zork in particular is that yes, it's a deterministic domain with lots of walkthroughs on the internet. Either that info is getting drowned out, or the LLMs have a very difficult time translating declarative information into actionable steps, or something else. But even if they could solve the original design of Zork, it would be fairly straightforward to randomize elements of the map and puzzles to keep the solution from being deterministic. We're not even there yet though.
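To make the randomization idea concrete, here is a minimal sketch, assuming a toy dungeon rather than Zork itself. The room names, item names, and the `build_dungeon` function are all hypothetical, invented for illustration. Each seed yields a connected layout with shuffled room order and item placement, so a memorized walkthrough for one seed does not carry over to another.

```python
import random

# Hypothetical toy dungeon; these names are made up for illustration.
ROOMS = ["clearing", "cellar", "gallery", "vault"]
ITEMS = ["lamp", "key", "sword"]
DIRECTIONS = ["north", "south", "east", "west"]
OPPOSITE = {"north": "south", "south": "north", "east": "west", "west": "east"}


def build_dungeon(seed):
    """Return (exits, item_locations) for a connected, seed-dependent layout."""
    rng = random.Random(seed)  # private generator: same seed, same dungeon
    order = ROOMS[:]
    rng.shuffle(order)
    exits = {room: {} for room in ROOMS}
    # Chain the rooms in shuffled order so every room is reachable.
    for a, b in zip(order, order[1:]):
        free = [d for d in DIRECTIONS
                if d not in exits[a] and OPPOSITE[d] not in exits[b]]
        d = rng.choice(free)
        exits[a][d] = b            # passage from a to b ...
        exits[b][OPPOSITE[d]] = a  # ... and the matching way back
    # Scatter the items independently of the map.
    items = {item: rng.choice(ROOMS) for item in ITEMS}
    return exits, items
```

A harness built this way could fix a handful of seeds for reproducible runs and hold out others to check whether an agent is reasoning about the layout or recalling one.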
This is so interesting! Thanks for sharing those links, and very cool that you built a custom (digital) dungeon to test these out. I agree, it's remarkable to me how *poorly* the agent performed despite all the info that's out there. My hunch is they will get better, but then, maybe compute costs will be prohibitive. Please keep me posted on what you learn!
OK this is awesome and I love it.
I want to know how the best AI would do at NetHack also! Who's got the money and spare time to throw at it?
The Zork analogy makes a useful test: can the agent maintain a recoverable state model after a wrong turn? In OR workflows, I would ask the same question more narrowly: when readiness evidence conflicts, can the system show the path it used and return control before the team acts on a stale map?
Thanks for the comment, and it looks like you are thinking about agentic AI in the much more high-stakes context of surgery -- I'll be curious to follow your efforts.
That is the boundary I am trying to keep explicit: surgery is a high-stakes context, but the first useful agentic layer is probably not decision authority. It is recoverable state: what the system saw, what it missed, and where control returns to the team before the room acts on a stale map.
Here you go: https://www.youtube.com/watch?v=LjqKlB7a_ms
Claude knows the game well from its training data, though it still had to figure some things out. Here's part of the write-up (https://drive.google.com/file/d/1qhj-5igHPIxtBrMobf7HUxcjm3LHxKNJ/view?usp=sharing) I had it make after it finished:
"Riley's probe is about whether there is a usable world model. The evidence here is strongest exactly where memory and the web were weakest:
- It is **weak evidence** from the set-piece puzzles. Solving the rainbow or Hades mostly shows I have read about Zork, not that I reasoned about a world.
- It is **strong evidence** from the maze, the spatial navigation, and the error-correction. Hand-building the map of a maze *designed specifically to defeat spatial memory* — by marking rooms and inferring the graph from observations — is hard to do without an actual, manipulable internal model of the space. Likewise, the cross-attempt learning (changing strategy in response to deaths) is the precise behavior the article found the other agent lacked.
So the win clears Riley's 40-point bar by a wide margin, but the interesting result is narrower than "Claude solved Zork by thinking." The honest claim is: **given knowledge of the puzzles — some from training, a little from the web — Claude can plan, navigate, adapt, and, when its memory is wrong, fall back on genuine spatial reasoning to finish a long, adversarial game.** The maze is the part worth caring about, because it's the part nobody handed me."
Honestly, the main difficulty it had was using the browser to play the game. Once I grabbed the emulator and had Claude Code play via the command line, it tore it apart quickly. At any rate, not really a lobotomized hamster, but also not a true test of its capabilities.
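The maze technique the write-up describes (marking rooms and inferring the graph from observations) can be sketched roughly as follows. This is an illustration under stated assumptions, not what Claude actually ran: the `Maze` simulator, its methods, and the unlimited supply of numbered markers are all hypothetical. The explorer never sees room identities, only whichever marker is lying on the floor.

```python
from collections import deque


class Maze:
    """Hypothetical simulator: rooms look identical; only dropped markers differ."""

    def __init__(self, exits, start):
        self._exits = exits    # hidden truth: room -> {direction: room}
        self._here = start
        self._markers = {}     # room -> marker dropped there

    def look(self):
        return self._markers.get(self._here)  # marker on the floor, or None

    def drop(self, marker):
        self._markers[self._here] = marker

    def directions(self):
        return sorted(self._exits[self._here])

    def go(self, direction):
        self._here = self._exits[self._here][direction]


def path(graph, src, dst):
    """Breadth-first search over the learned graph; returns directions or None."""
    queue, seen = deque([(src, [])]), {src}
    while queue:
        node, steps = queue.popleft()
        if node == dst:
            return steps
        for d, nxt in graph[node].items():
            if nxt is not None and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, steps + [d]))
    return None


def map_maze(maze):
    """Label rooms with markers and record where each exit leads."""
    next_id = 0
    maze.drop(next_id)
    here = next_id
    graph = {here: {d: None for d in maze.directions()}}
    while True:
        # Pick a known room with an untried exit that we can currently walk to.
        target = None
        for room, exits in graph.items():
            untried = [d for d, dest in exits.items() if dest is None]
            if untried:
                steps = path(graph, here, room)
                if steps is not None:
                    target = (room, untried[0], steps)
                    break
        if target is None:
            return graph  # nothing reachable is left to try
        room, direction, steps = target
        for d in steps:
            maze.go(d)
        maze.go(direction)
        seen = maze.look()
        if seen is None:  # unmarked floor means a room we have not labeled yet
            next_id += 1
            seen = next_id
            maze.drop(seen)
            graph[seen] = {d: None for d in maze.directions()}
        graph[room][direction] = seen
        here = seen
```

The sketch tolerates one-way passages and exits that loop back to the same room, since it only trusts what the markers reveal. If an untried exit cannot be reached through passages already learned, it stays recorded as `None`.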
Right, so according to this, Agentic Claude pulled more information about how to solve Zork from existing data. That’s not for nothing, but I think it affirms my central point about its failure to world model.
So is it a big deal if I can get it to do it, or has someone already solved this since this paper? I have a way to do it, but usually when I show how on stuff like this the response is “no, not like that,” meaning there is a much narrower definition of success than I would have for solving the problem of how to make an agent play Zork well.
Well, it depends. If you train or instruct a model on how to solve Zork specifically, no one will be impressed -- that can be done for any game, really. But if you've figured out a way to make an AI agent *generally* capable of winning novel games, that would be of great interest to many.
Are there rules about how it solves for this generally, or does anything go?
I never played Zork but spent oodles of time playing "rogue" (Exploring the Dungeons of Doom) on UNIX systems ... way too much fun! Would be fun testing an LLM against that ...
Thanks for the comment, and the revelation of your nerdiness (I kid, I kid). I am unfamiliar with "rogue" but I agree there's something fun and interesting about seeing how agentic AI does with these games. I was joking about trialling it with Oregon Trail but now I'm thinking I just might!
Oh, no worries on the nerdiness - I was a huge UNIX nerd at the time. Iirc, I was up to level 27 or something in rogue. :-O
I have to confess I still am a UNIX/Linux nerd to some extent. :-D