Discussion about this post

Derek James

I've been interested in this domain as a benchmark as well. I implemented my own small custom dungeon and have been testing it with various harnesses. And yes, LLM agents are currently awful:

https://derekjames.substack.com/p/youre-standing-in-a-clearing-in-a

https://derekjames.substack.com/p/text-adventure-benchmarks-revisited

What's interesting about Zork in particular is that it's a deterministic domain with lots of walkthroughs on the internet. Either that information is getting drowned out, or the LLMs have a very difficult time translating declarative information into actionable steps, or something else is going on. But even if they could solve the original design of Zork, it would be fairly straightforward to randomize elements of the map and puzzles to keep the solution from being deterministic. We're not even there yet, though.

Kent

OK this is awesome and I love it.

I also want to know how the best AI would do at NetHack! Who's got the money and spare time to throw at it?
