AI Agents, Mathematics, and Making Sense of Chaos
LLMs are great at coming up with approximate knowledge and ideas for potential plans. But to actually use those ideas, you need to pair the LLM with external programs that can rigorously check the plans for errors. The key is to use them as part of a bigger system.
Prompting tips for better reasoning and planning, based on this research, are included at the end of the article.
The AI world feels like it’s divided into two camps: those who think LLMs can reason and plan, and those who don’t. This dichotomy gives rise to both over-optimism and over-pessimism about AI, neither of which is particularly helpful. So which is it?
It’s increasingly clear that LLMs aren't capable of genuine planning and reasoning. According to ASU researchers, they're essentially giant pseudo-System 1 knowledge sources, not System 2 thinkers. While it’s true that they are more than giant machine translators, it’s also true that they cannot reason autonomously.
One study put LLMs to the test on standard planning problems and found that even the best LLM, GPT-4, could only come up with fully correct plans about 12% of the time. It didn't matter which LLM was used or whether it was fine-tuned; the results were still pretty dismal. When the researchers made the problem descriptions a bit less obvious by changing some names, the LLMs did even worse. It looks like LLMs are just pulling up plans that roughly match the problem, not thinking things through step by step like a real planner. They're easily fooled by surface-level cues.
Earlier research placed hope in LLMs boosting their accuracy by iteratively critiquing and refining their own solutions. The idea is that checking whether a plan works should be easier than coming up with one in the first place. But more recent work pours cold water on this optimism. It turns out that LLMs are just as bad at verifying solutions as they are at generating them, and having an LLM critique its own work can actually make things worse. Even when it stumbles upon a correct solution, it can pass right over it without recognizing that it is right.
So why do so many papers claim LLMs can plan, when the evidence says they can't? Planning needs two things: 1) domain knowledge about actions and effects, and 2) the ability to put that knowledge together into a plan that actually works, handling any tricky interactions. A lot of the "LLMs can plan!" papers are really just showing that LLMs can spit out general planning knowledge. But that's not the same as an executable plan.
Some papers sidestep this by looking at simple problems where interactions don't matter, or by having humans prompt the LLM to fix issues. Others rely on fine-tuning or common sense so the LLM can just regurgitate a solution it's seen before. But for real planning, that’s not enough. The plans might look okay at first, but they'll fall apart when you try to use them.
A lot of the claims mix up the LLM's skill at generating surface-level plans with the harder task of making sure those plans actually work when you try to execute them. In contrast, the best approach lets LLMs do what they're good at (coming up with ideas) while using external models to do what they're good at (rigorously checking those ideas). This gives you the best of both worlds: the flexibility and expressiveness of LLMs without sacrificing the reliability of traditional symbolic planning methods. This kind of approach is already being used in high-profile projects.
So the truth lies in the middle, which means LLMs should be thought of as powerful cognitive orthotics that can aid planning and reasoning when used correctly.
What then is “correct use” in reasoning and planning tasks?
The research suggests that the answer is to treat them as powerful but imperfect creators of new ideas. Specifically, LLMs are great at coming up with approximate knowledge and ideas for potential plans. But to actually use those ideas, you need to pair the LLM with external programs that can rigorously check the plans for errors. This combination of LLM creativity and external verification is golden. The key is to use them as part of a bigger system.
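To make that concrete, here is a minimal sketch of the generate-and-verify loop, using a toy blocks-world problem. The names here (llm_propose, verify, plan_with_verification) are illustrative, not from any particular library or project: llm_propose stands in for a real LLM call and simply cycles through canned candidate plans so the example runs on its own, while the verifier is ordinary hand-coded checking logic rather than another LLM.

```python
# Minimal sketch of "LLM proposes, external program verifies."
# Toy blocks-world: state is a set of (block, support) facts, where support
# is another block or "table". A move (block, dest) is legal only if both
# the block and the destination are clear.

INITIAL = {("A", "table"), ("B", "table")}   # both blocks start on the table
GOAL = {("A", "B"), ("B", "table")}          # goal: A stacked on B

def clear(x, state):
    """A block (or the table) is clear if nothing rests on it."""
    return x == "table" or all(support != x for _block, support in state)

def verify(plan):
    """Simulate the plan step by step with exact rules; return (ok, feedback)."""
    state = set(INITIAL)
    for i, (block, dest) in enumerate(plan):
        if not clear(block, state):
            return False, f"step {i}: {block} has something on top of it"
        if not clear(dest, state):
            return False, f"step {i}: destination {dest} is not clear"
        state = {(b, s) for (b, s) in state if b != block} | {(block, dest)}
    if GOAL <= state:
        return True, "plan reaches the goal"
    return False, f"plan ends short of the goal; final state: {sorted(state)}"

# Stand-in for an LLM. A real system would send the problem description plus
# the verifier's feedback to a model and parse the plan out of its reply.
_CANNED = iter([
    [("B", "A")],                 # plausible-looking but wrong
    [("A", "B"), ("B", "A")],     # second step is illegal
    [("A", "B")],                 # correct
])

def llm_propose(feedback):
    return next(_CANNED, None)

def plan_with_verification(max_rounds=5):
    """Generate-and-verify loop: keep asking for plans until one checks out."""
    feedback = None
    for _ in range(max_rounds):
        candidate = llm_propose(feedback)
        if candidate is None:
            return None
        ok, feedback = verify(candidate)
        print(f"candidate {candidate}: {feedback}")
        if ok:
            return candidate
    return None

if __name__ == "__main__":
    print("accepted plan:", plan_with_verification())
```

The design point is the division of labor: the model only proposes, and every candidate is accepted or rejected by a checker whose feedback can be sent back to the model as the next prompt.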
Here's a list of do's for using LLMs for more complex uses, including reasoning and planning: