The Brittleness of Agentic Reasoning and Planning Using LLMs

Research suggests that LLMs are not demonstrating genuine reasoning abilities but are instead relying on pattern matching and retrieval based on the provided examples. Reliable performance of LLMs on reasoning and decision-making tasks remains some way off.

Cover of the research paper: On the Brittle Foundations of ReAct Prompting for Agentic Large Language Models

The ability of Large Language Models (LLMs) to reason and make sequential decisions has been a topic of debate. The ReAct framework, introduced by Yao et al., claims to enhance the reasoning and planning abilities of LLMs by interleaving reasoning traces with action execution. The original paper explored the use of LLMs to generate both reasoning traces and task-specific actions in an interleaved manner. Its central claim was one of synergy between reasoning and action: reasoning traces help the model induce, track, and update action plans as well as handle exceptions, while actions allow it to interface with external sources, such as knowledge bases or environments, to gather additional information.
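In practice, the ReAct pattern amounts to a loop that alternates model-generated thoughts and actions with environment feedback. The sketch below is a minimal illustration of that loop, not the authors' implementation; llm and execute_action are hypothetical stand-ins for a completion call and a tool or environment interface.

    # Minimal sketch of a ReAct-style loop (illustrative, not the paper's code).
    # `llm` takes a prompt string and returns the model's continuation;
    # `execute_action` runs an action in the environment and returns an observation.
    def react_loop(llm, execute_action, task, exemplars, max_steps=8):
        prompt = exemplars + f"\nTask: {task}\n"
        for _ in range(max_steps):
            thought = llm(prompt + "Thought:")                      # free-form reasoning trace
            action = llm(prompt + f"Thought: {thought}\nAction:")   # next action, conditioned on the thought
            observation = execute_action(action)                    # feedback from the environment
            prompt += (f"Thought: {thought}\n"
                       f"Action: {action}\n"
                       f"Observation: {observation}\n")
            if action.strip().lower().startswith("finish"):         # model signals it is done
                break
        return prompt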

This is now a popular approach, with many researchers and practitioners adopting ReAct to improve LLM performance on tasks requiring reasoning and decision-making.

Recently, a group of researchers from Arizona State University decided to investigate the claims made by ReAct and examine the factors contributing to its perceived success. They were particularly interested in the impact of the structure of reasoning traces. A reasoning trace is a step-by-step explanation of the thought process or logic used to solve a problem or complete a task. It outlines the sequence of mental steps taken to arrive at a solution, providing insight into the reasoning behind each action or decision. In the context of the ReAct framework, the reasoning trace is interleaved with the actions taken by the model, with the goal of guiding its decision-making process.

To put it simply, a reasoning trace is like a "think-aloud" protocol, where the AI model verbalizes its thought process as it works through a problem, explaining why it takes each action and how it plans to proceed. This trace is meant to help the model make better decisions by providing a structured way of thinking about the task at hand.
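To make this concrete, here is what a couple of interleaved steps might look like in the Thought / Action / Observation format that ReAct prompts use. The question, actions, and observation below are invented purely for illustration; they are not taken from the paper.

    Question: Which river runs through the hometown of the author of "Example Book"?
    Thought: I first need to find out who wrote "Example Book".
    Action: Search[author of "Example Book"]
    Observation: "Example Book" was written by Jane Doe, who grew up in Springfield.
    Thought: Now I need to find which river runs through Springfield.
    Action: Search[river through Springfield]
    ...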

Key Findings

Interleaving Reasoning Trace with Action Execution:

    • Claim: ReAct suggests that interleaving reasoning traces with action execution improves LLM performance.
    • Finding: The study found that the performance of LLMs does not significantly benefit from this interleaving. In fact, LLMs performed better when reasoning traces were not interleaved with actions.

Nature of Reasoning Trace or Guidance Information:

    • Claim: The content of the reasoning trace provided in ReAct is critical for improving LLM performance.
    • Finding: Providing weak or even placebo guidance (irrelevant information) resulted in performance comparable to strong reasoning trace-based guidance. This suggests that the specific content of the guidance is not as crucial as previously thought.

Similarity Between Example and Query:

    • Claim: ReAct assumes that the LLM’s performance is due to its reasoning abilities.
    • Finding: The performance improvements seen in ReAct are primarily due to the high similarity between example tasks and query tasks. When this similarity decreases, the performance drops significantly. This indicates that LLMs rely more on approximate retrieval of similar examples rather than true reasoning abilities.

Let's consider a simple analogy to understand the research questions and results. Imagine teaching a child how to make a peanut butter and jelly sandwich. In the ReAct framework, you would guide the child step-by-step, providing reasoning at each stage: "Get bread. Reasoning: We need bread to make a sandwich. Get peanut butter. Reasoning: Peanut butter will be one of the spreads." and so on.

The researchers modified this approach by providing all the reasoning upfront (Exemplar-CoT): "To make a peanut butter and jelly sandwich, we need bread, peanut butter, and jelly. First, we spread peanut butter on one slice of bread and jelly on the other. Then, we put the two slices together to complete the sandwich." Contrary to the ReAct claim, this led to better performance than the step-by-step ReAct approach.
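Rendered as few-shot exemplars, the two styles from the analogy might look roughly like this. These are toy prompts invented for this article; the paper's experiments used agentic planning benchmarks, not a sandwich task.

ReAct-style exemplar (reasoning interleaved with each action):

    Task: make a peanut butter and jelly sandwich.
    Reasoning: We need bread to make a sandwich.
    Action: get bread
    Reasoning: Peanut butter will be one of the spreads.
    Action: get peanut butter
    Reasoning: Jelly will be the other spread.
    Action: get jelly
    Action: spread peanut butter on one slice and jelly on the other
    Action: put the two slices together

Exemplar-CoT (all reasoning given upfront, then the actions):

    Task: make a peanut butter and jelly sandwich.
    Reasoning: We need bread, peanut butter, and jelly. First, we spread peanut butter on one slice of bread and jelly on the other. Then, we put the two slices together to complete the sandwich.
    Action: get bread
    Action: get peanut butter
    Action: get jelly
    Action: spread peanut butter on one slice and jelly on the other
    Action: put the two slices together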

They also tested the impact of the nature of the reasoning trace by including failure examples, such as attempting to spread peanut butter on an unopened bread bag, and found that it did not significantly affect the LLM's performance. Furthermore, the researchers discovered that using synonyms in the example prompts (e.g., "retrieve sliced loaf" instead of "get bread") led to a significant drop in performance, indicating that LLMs heavily rely on the specific wording of the examples.
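A rough sketch of how that kind of wording perturbation could be automated is shown below: swap surface phrases in the exemplar while keeping its meaning, then compare task success with the original and perturbed prompts. The synonym table and the perturb_exemplar helper are hypothetical, not taken from the paper's code.

    # Hypothetical helper for probing sensitivity to exemplar wording.
    SYNONYMS = {
        "get bread": "retrieve sliced loaf",
        "get peanut butter": "fetch the jar of peanut spread",
        "get jelly": "fetch the jar of fruit preserve",
    }

    def perturb_exemplar(exemplar: str, synonyms: dict[str, str]) -> str:
        """Return a copy of the exemplar with each phrase swapped for a synonym."""
        for phrase, synonym in synonyms.items():
            exemplar = exemplar.replace(phrase, synonym)
        return exemplar

If an LLM were genuinely reasoning over the task rather than matching surface patterns, this substitution should leave performance essentially unchanged; the study found the opposite.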

This challenges the claims made by ReAct and sheds light on the limitations of LLMs in terms of reasoning and decision-making. The findings suggest that the success of ReAct is primarily due to the high similarity between example prompts and the query task, rather than the interleaving of reasoning traces or the specific guidance provided.

This research suggests that LLMs are not demonstrating genuine reasoning abilities but are instead relying on pattern matching and retrieval based on the provided examples. The ReAct framework is brittle, and what looks like reasoning in LLMs owes more to luck than to design. Reliable performance of LLMs on reasoning and decision-making tasks remains some way off.
