AI Agents, Mathematics, and Making Sense of Chaos
Research suggests that LLMs are not demonstrating genuine reasoning abilities but are instead relying on pattern matching and retrieval based on the provided examples. We are still a long way from reliable LLM performance on reasoning and decision-making tasks.
The ability of large language models (LLMs) to reason and make sequential decisions has been a topic of debate. The ReAct framework, introduced by Yao et al., claimed to enhance the reasoning and planning abilities of LLMs by interleaving reasoning traces with action execution. The original paper explored using LLMs to generate both reasoning traces and task-specific actions in an interleaved manner. The central claim was one of synergy between reasoning and action: reasoning traces help the model induce, track, and update action plans and handle exceptions, while actions let it interface with external sources, such as knowledge bases or environments, to gather additional information.
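For readers who want to picture the mechanics, here is a minimal sketch of the interleaved loop ReAct describes, assuming hypothetical `call_llm` and `run_action` helpers standing in for a model API and a tool or environment interface. It illustrates the idea only; it is not the authors' implementation.

```python
# Minimal sketch of a ReAct-style loop (illustrative only).
# `call_llm` and `run_action` are hypothetical stand-ins for a model API and a
# tool/environment interface; they are not part of the original paper's code.

def react_loop(task, call_llm, run_action, max_steps=10):
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        # Ask the model for its next reasoning step ("Thought") and action.
        step = call_llm(
            transcript
            + "Thought: reason about what to do next.\n"
            + "Action: name an action, or Finish[answer] when done.\n"
        )
        transcript += step + "\n"
        if "Finish[" in step:
            # The model signalled it is done; return its final answer.
            return step.split("Finish[", 1)[1].split("]", 1)[0]
        # Execute the proposed action and append the observation so the model
        # can condition its next reasoning step on what it just learned.
        observation = run_action(step)
        transcript += f"Observation: {observation}\n"
    return transcript  # ran out of steps without a final answer
```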
This is now a popular approach, with many researchers and practitioners adopting ReAct to improve LLM performance on tasks requiring reasoning and decision-making.
Recently, a group of researchers from Arizona State University decided to investigate the claims made by ReAct and examine the factors contributing to its perceived success. They were particularly interested in examining the impact of the structure of reasoning traces. A reasoning trace is a step-by-step explanation of the thought process or logic used to solve a problem or complete a task. It outlines the sequence of mental steps taken to arrive at a solution, providing insight into the reasoning behind each action or decision. In the context of the ReAct framework, a reasoning trace is interleaved with the actions taken by the AI model, with the goal of guiding the model's decision-making process.
To put it simply, a reasoning trace is like a "think-aloud" protocol, where the AI model verbalizes its thought process as it works through a problem, explaining why it takes each action and how it plans to proceed. This trace is meant to help the model make better decisions by providing a structured way of thinking about the task at hand.
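Concretely, an interleaved trace might look like the following prompt fragment; the content is invented here for illustration, not taken from the paper.

```python
# An invented interleaved trace in the Thought / Action / Observation style
# that ReAct popularized. The wording is illustrative, not from the paper.
EXAMPLE_TRACE = """\
Question: Which spread goes on the second slice of bread?
Thought: I should check the recipe before choosing a spread.
Action: Lookup[peanut butter and jelly sandwich recipe]
Observation: Spread peanut butter on one slice and jelly on the other.
Thought: The first slice gets peanut butter, so the second gets jelly.
Action: Finish[jelly]
"""
```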
Specifically, they examined three factors:

* Interleaving reasoning traces with action execution: does the step-by-step interleaving itself matter, or does providing the reasoning in another form work just as well?
* Nature of the reasoning trace or guidance information: does the content and quality of the guidance affect performance?
* Similarity between example and query: how much does performance depend on the example prompts closely resembling the task being solved?
Let's consider a simple analogy to understand the research questions and results. Imagine teaching a child how to make a peanut butter and jelly sandwich. In the ReAct framework, you would guide the child step-by-step, providing reasoning at each stage: "Get bread. Reasoning: We need bread to make a sandwich. Get peanut butter. Reasoning: Peanut butter will be one of the spreads." and so on.
The researchers modified this approach by providing all the reasoning upfront (Exemplar-CoT): "To make a peanut butter and jelly sandwich, we need bread, peanut butter, and jelly. First, we spread peanut butter on one slice of bread and jelly on the other. Then, we put the two slices together to complete the sandwich." Contrary to the ReAct claim, this led to better performance than the step-by-step ReAct approach.
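As a rough sketch of the contrast being tested, the two exemplar formats below package the same guidance differently: one interleaves reasoning with each action (ReAct-style), the other states the full plan upfront. The strings and labels are assumptions for illustration, not the paper's actual prompts.

```python
# The same guidance packaged two ways, following the sandwich analogy above.
# Both exemplar strings are invented for illustration; only the prompt
# construction changes, not the model or the task.

INTERLEAVED_REACT_EXEMPLAR = """\
Reason: we need bread to make a sandwich.
Act: get bread
Reason: peanut butter will be one of the spreads.
Act: get peanut butter
Reason: each spread goes on its own slice.
Act: spread peanut butter on one slice
"""

UPFRONT_REASONING_EXEMPLAR = """\
Plan: to make a peanut butter and jelly sandwich we need bread, peanut
butter, and jelly; spread peanut butter on one slice and jelly on the
other, then put the slices together.
Act: get bread
Act: get peanut butter
Act: spread peanut butter on one slice
"""
```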
They also tested the impact of the nature of the reasoning trace by including failure examples, such as attempting to spread peanut butter on an unopened bread bag, and found that even these flawed traces did not significantly affect the LLM's performance. Furthermore, the researchers discovered that using synonyms in the example prompts (e.g., "retrieve sliced loaf" instead of "get bread") led to a significant drop in performance, indicating that LLMs rely heavily on the specific wording of the examples.
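The synonym test amounts to a simple lexical perturbation: rewrite an exemplar with meaning-preserving substitutions, then compare performance on the original and perturbed prompts. The sketch below shows the kind of substitution involved; the synonym map and helper function are hypothetical, not the paper's.

```python
import re

# Hypothetical synonym substitution of the kind described above. The mapping
# is illustrative; the point is that the meaning stays fixed while the surface
# wording changes.
SYNONYMS = {
    "get bread": "retrieve sliced loaf",
    "get peanut butter": "fetch the peanut spread",
    "get jelly": "obtain the fruit preserve",
}

def perturb_exemplar(exemplar: str, synonyms: dict) -> str:
    """Return the exemplar with each phrase swapped for its synonym."""
    for phrase, replacement in synonyms.items():
        exemplar = re.sub(re.escape(phrase), replacement, exemplar,
                          flags=re.IGNORECASE)
    return exemplar

print(perturb_exemplar("Act: get bread", SYNONYMS))  # -> "Act: retrieve sliced loaf"
```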
This challenges the claims made by ReAct and sheds light on the limitations of LLMs in terms of reasoning and decision-making. The findings suggest that the success of ReAct is primarily due to the high similarity between example prompts and the query task, rather than the interleaving of reasoning traces or the specific guidance provided.
This research suggests that LLMs are not demonstrating genuine reasoning abilities but are instead relying on pattern matching and retrieval based on the provided examples. The ReAct framework is brittle, and what looks like reasoning in LLMs is more a matter of luck than design. We are still a long way from reliable LLM performance on reasoning and decision-making tasks.