Can LLMs reason and plan?
LLMs are great at coming up with approximate knowledge and ideas for potential plans. But to actually use those ideas, you need to pair the LLM with external programs that can rigorously check the plans for errors. The key is to use them as part of a bigger system.
Prompting tips for better reasoning and planning, based on this research, are included at the end of the article.
The AI world feels like it's divided into two camps: those who think LLMs can reason and plan, and those who don't. This split fuels both over-optimism and over-pessimism about AI, neither of which is particularly helpful. So which is it?
It’s increasingly clear that LLMs aren't capable of genuine planning and reasoning. According to ASU researchers, they're essentially giant pseudo-System 1 knowledge sources, not System 2 thinkers. While it’s true that they are more than giant machine translators, it’s also true that they cannot reason autonomously.
One study put LLMs to the test on standard planning problems and found that even the best LLM, GPT-4, could only come up with fully correct plans about 12% of the time. It didn't matter which LLM was used or whether it was fine-tuned; the results were still dismal. And when the researchers made the problem descriptions less obvious by swapping out some of the names, the LLMs did even worse. The takeaway is that LLMs seem to be pulling up plans that roughly match the problem, not thinking things through step by step like a real planner. They're easily fooled by surface-level details.
Earlier research pinned its hopes on LLMs boosting their accuracy by iteratively critiquing and refining their own solutions. The idea is that checking whether a plan works should be easier than coming up with one in the first place. But more recent work pours cold water on this optimism: LLMs turn out to be just as bad at verifying solutions as they are at generating them. Having the LLM critique its own work can actually make things worse. Even when it stumbles onto a correct solution, it can pass right over it without recognizing that it's right.
So why do so many papers claim LLMs can plan, when the evidence says they can't? Planning needs two things: 1) domain knowledge about actions and effects, and 2) the ability to put that knowledge together into a plan that actually works, handling any tricky interactions. A lot of the "LLMs can plan!" papers are really just showing that LLMs can spit out general planning knowledge. But that's not the same as an executable plan.
Some papers sidestep this by looking at simple problems where interactions don't matter, or by having humans prompt the LLM to fix issues. Others rely on fine-tuning or common sense so the LLM can just regurgitate a solution it's seen before. But for real planning, that’s not enough. The plans might look okay at first, but they'll fall apart when you try to use them.
A lot of the claims mix up the LLM's skill at generating surface-level plans with the harder task of making sure those plans actually work when you try to use them. In contrast, the best approach lets LLMs do what they're good at (coming up with ideas) while using external models to do what they're good at (rigorously checking those ideas). This gives you the best of both worlds: the flexibility and expressiveness of LLMs without sacrificing the reliability of traditional symbolic planning methods. This kind of approach is already being used in some high-profile projects, like:
- AlphaGeometry: This is a system that uses an LLM to guess solutions to geometry problems, which are then checked by a separate symbolic math program. The LLM is specially trained on a dataset of geometry problems to improve its guesses.
- FunSearch: Similarly, this project uses a fine-tuned LLM to generate candidate solutions to hard math and algorithmic problems, with an external program evaluating those solutions and providing feedback to guide the LLM's next attempts.
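Both projects follow the same generate-and-verify pattern. Here is a minimal sketch of that loop, assuming two hypothetical helpers: propose_plan (an LLM call) and verify_plan (an external checker such as a symbolic planner, simulator, or solver). It illustrates the pattern, not any particular project's implementation.

```python
# Generate-and-verify loop: the LLM proposes, an external tool decides.
# propose_plan(problem, feedback) -> str   : hypothetical LLM call that drafts a candidate plan
# verify_plan(problem, plan) -> list[str]  : hypothetical external checker; returns [] if the plan is sound

def solve_with_verification(problem, propose_plan, verify_plan, max_rounds=5):
    """Accept a plan only after the external checker signs off on it."""
    feedback = []
    for _ in range(max_rounds):
        plan = propose_plan(problem, feedback)   # LLM: creative but unreliable
        errors = verify_plan(problem, plan)      # external tool: rigorous, trusted check
        if not errors:
            return plan                          # only verified plans leave the loop
        feedback = errors                        # back-prompt the LLM with concrete errors
    return None                                  # no verified plan found; don't ship an unchecked guess
```

The important property is that the LLM's output never becomes the answer directly; only a plan that passes the external check does.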
So the truth lies in the middle: LLMs should be thought of as powerful cognitive orthotics that can aid planning and reasoning when used correctly.
What then is “correct use” in reasoning and planning tasks?
The research suggests that the answer is to treat them as powerful but imperfect creators of new ideas. Specifically, LLMs are great at coming up with approximate knowledge and ideas for potential plans. But to actually use those ideas, you need to pair the LLM with external programs that can rigorously check the plans for errors. This combination of LLM creativity and external verification is golden. The key is to use them as part of a bigger system.
Here's a list of do's for applying LLMs to more complex tasks, including reasoning and planning:
- Use LLMs as idea generators, not final decision makers. Let them suggest potential solutions, but always verify these solutions with external tools (or other people).
- Pair LLMs with rigorous external checkers. These could be symbolic AI systems, simulators, experts, or other tools that can thoroughly vet the LLM's suggestions; a toy checker along these lines is sketched after this list.
- Embrace iteration. Use the LLM's suggestions as a starting point, but be prepared to refine them based on feedback from the external checkers. Repeat this process until you converge on a high-quality solution.
- Fine-tune the LLM whenever possible. Training the LLM on data that's specific to your problem domain can significantly improve the relevance and accuracy of its suggestions.
- Keep humans in the loop, but in the right roles. Use human experts to create accurate models of the problem space and to refine the initial problem specification. But let automated tools handle the rapid iteration and feedback cycle.
- Be specific in your prompts. The more context and detail you can provide about the problem you're trying to solve, the better the LLM's suggestions are likely to be.
- Experiment with different prompting strategies. Sometimes, approaching the problem from multiple angles or breaking it down into smaller sub-problems can help the LLM generate more useful suggestions.
- Don't expect miracles. While LLMs are incredibly powerful, they're not magic. Be realistic about what they can and can't do, and design your system accordingly.
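To make "rigorous external checker" concrete, here is a toy plan validator in the same spirit. The action model (preconditions, add effects, delete effects) and the facts are invented for illustration; a real system would use a domain model written by experts, a simulator, or an off-the-shelf plan validator.

```python
# Toy plan validator: a deterministic checker the LLM cannot talk its way past.
# Each action maps to (preconditions, add effects, delete effects) over a set of facts.
ACTIONS = {
    "pick_up_a":    ({"a_on_table", "hand_empty"}, {"holding_a"}, {"a_on_table", "hand_empty"}),
    "stack_a_on_b": ({"holding_a", "b_clear"}, {"a_on_b", "hand_empty"}, {"holding_a", "b_clear"}),
}

def validate_plan(plan, initial_state, goal):
    """Return (True, None) if the plan reaches the goal, else (False, reason)."""
    state = set(initial_state)
    for step in plan:
        if step not in ACTIONS:
            return False, f"unknown action: {step}"
        pre, add, delete = ACTIONS[step]
        missing = pre - state
        if missing:
            return False, f"{step} is missing preconditions: {sorted(missing)}"
        state = (state - delete) | add           # apply the action's effects
    unmet = goal - state
    if unmet:
        return False, f"goal not achieved: {sorted(unmet)}"
    return True, None

# An LLM-proposed plan gets accepted or rejected on these facts alone, not on how plausible it sounds.
ok, reason = validate_plan(
    ["pick_up_a", "stack_a_on_b"],
    initial_state={"a_on_table", "b_clear", "hand_empty"},
    goal={"a_on_b"},
)
print(ok, reason)  # True None
```

Error messages like these double as the feedback you pass back to the LLM in the iteration loop described above.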