Improve Your Prompts with Iterative Reasoning Techniques

The paper makes a significant contribution by proposing a new method to improve the reasoning abilities of LLMs, an approach that is both effective and efficient. As always, we also pull ideas from the science to help you improve your own prompting.

Research paper: Iterative Reasoning Preference Optimization

Key Points:

  • Enhancing AI Reasoning with Iterative Techniques: The paper “Iterative Reasoning Preference Optimization” by Pang et al. introduces a method to improve large language models’ (LLMs) reasoning abilities through an iterative training process.
  • Limitations of RLHF and Introduction of DPO: While reinforcement learning from human feedback (RLHF) is effective, it is costly and labor-intensive. Direct Preference Optimization (DPO) offers a mathematical equivalent, training the AI on clear examples of preferred and dispreferred outputs rather than through constant human interaction.
  • Advanced DPO Approach: The proposed method improves on DPO by wrapping it in an iterative training process with a specialized loss function, enhancing the AI's reasoning without additional labeled data.
  • Key Components of the New Method:
    • Chain-of-Thought & Answer Generation: LLM generates multiple reasoning paths and answers, identifying valid paths by comparing generated answers to correct ones.
    • Preference Optimization: Training involves a language modeling loss on valid paths and a preference loss maximizing valid paths’ likelihood over invalid ones.
  • Iterative Training Process: The model undergoes repeated iterations, refining its reasoning paths and answers with each cycle, leading to improved performance on complex tasks.
  • Efficiency and Scalability: This method leverages the model’s own generations for self-supervision, eliminating the need for constant human feedback and making the process more efficient and scalable.
  • Best Practices for Prompting:
    • Provide Feedback on Outputs: Comparative feedback helps guide AI towards preferred outputs.
    • Generate Multiple Solution Paths: Iteratively build on the most promising paths for complex problem-solving.
    • Break Down Problems: Solve and verify smaller sub-problems before integrating them into a complete solution.
    • Verify Model Outputs: Critically examine and test model-generated solutions to ensure correctness and robustness.

Generative AI is getting better all the time. However, complex reasoning remains a work in progress. A recent paper by Pang et al., titled Iterative Reasoning Preference Optimization, addresses this limitation by proposing a new method to improve the reasoning abilities of LLMs.

This paper offers fresh insights and ideas that can enhance our prompting techniques. Below, we summarize these findings, but first, let's look at the research itself.

You have probably heard of RLHF: reinforcement learning from human feedback. RLHF has been the stalwart of training LLMs like ChatGPT. Imagine that ChatGPT is like a baby in a high chair who is learning the best way to eat mashed potato. RLHF is like constantly watching them and giving them rewards (like praise or treats) when they do something good—like get the mash in their mouth with a spoon—and punishments (like scolding or time-outs) when they do something bad—like fling the mash across the kitchen. The good news is that the child eventually learns to behave in ways that maximize rewards and minimize punishments. The bad news is that this requires you to constantly monitor them and provide feedback, which can be time-consuming and tiring. In the world of LLMs, RLHF is similarly costly: it only scales as fast as the humans doing the watching and grading.

In 2023, researchers figured out a mathematical equivalent to RLHF called Direct Preference Optimization, or DPO. Let's first understand how it is different from RLHF by going back to the analogy of teaching a child. DPO is like giving the child a clear set of rules or guidelines to follow, and then letting them figure out how to behave on their own. (Obviously, in this analogy the child is a lot older than the one-year-old in the high chair who wants to just fling the mash). You give the child examples of good and bad behavior, and teach them to prefer the good behavior over the bad. The child then tries to behave in ways that follow the good examples and avoid the bad ones. The good news is that you don't need to constantly watch and give feedback, but you do need to provide clear rules and examples upfront.

So in AI terms, RLHF involves human users constantly interacting with the AI, providing rewards for good outputs and punishments for bad ones. This can make the AI better at following human preferences, but it's costly and time-consuming. DPO, by contrast, involves training the AI to prefer good or correct outputs over bad or incorrect ones, based on a dataset of examples. It doesn't require constant human interaction, but it does require a well-curated dataset.
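
To make "prefer good outputs over bad ones" concrete, here is a minimal sketch of the standard DPO loss in PyTorch. It assumes you have already computed the summed log-probabilities of each preferred and rejected response under both the model being trained and a frozen reference copy; the variable names and the beta value are our own illustrative choices, not the paper's.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss for one batch of preference pairs.

    Each argument is a 1-D tensor of summed log-probabilities, one value
    per (preferred, rejected) pair, under either the trained policy or
    the frozen reference model.
    """
    # How much more likely the policy makes the preferred response,
    # relative to the reference model...
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    # ...and the same quantity for the rejected response.
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss that pushes the preferred response ahead of the rejected one.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```

In plain terms: the loss shrinks as the model assigns relatively more probability to the preferred output than to the rejected one, anchored to the reference model so it doesn't drift too far from its starting point.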

This paper proposes a new method that's like an advanced version of DPO. It allows the AI to improve its reasoning abilities on its own, without needing constant human feedback. This is important because it could lead to AI systems that are better at complex thinking and problem-solving, in a more efficient and scalable way than methods like RLHF.


The key innovation of their approach is an iterative training process that pairs the model's own generations with a new specialized loss function, which is like an improved version of DPO. The method consists of two main components:

  1. Chain-of-Thought & Answer Generation: In each iteration, the LLM is used to generate multiple reasoning paths (called "chains of thought") and corresponding answers for each question in the training data. The generated answers are then compared to the correct answers to determine which reasoning paths are valid.
  2. Preference Optimization: The model is then trained using a special loss function that consists of two parts:
    a) A standard language modeling loss on the valid reasoning paths that led to correct answers. This encourages the model to generate similar reasoning paths in the future.
    b) A preference loss that tries to maximize the likelihood of the valid reasoning paths compared to the invalid ones. This is done using a modified version of the DPO loss, which essentially trains the model to prefer the good reasoning paths over the bad ones.
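
Here is a hedged sketch of that two-part loss in PyTorch, following the description above: a DPO-style preference term over pairs of valid and invalid reasoning paths, plus a language modeling (negative log-likelihood) term on the valid path. The weighting alpha, the per-token normalization, and the variable names are our illustrative choices, not the paper's exact implementation.

```python
import torch.nn.functional as F

def preference_plus_lm_loss(policy_valid_logps, policy_invalid_logps,
                            ref_valid_logps, ref_invalid_logps,
                            valid_lengths, beta=0.1, alpha=1.0):
    """DPO-style preference loss plus a language modeling term (sketch).

    The *_logps arguments are summed log-probabilities of whole reasoning
    paths, valid vs. invalid, under the trained policy or a frozen
    reference model; valid_lengths holds the token counts of the valid paths.
    """
    # (b) Preference term: prefer valid reasoning paths over invalid ones,
    # measured relative to the frozen reference model.
    margin = beta * ((policy_valid_logps - ref_valid_logps)
                     - (policy_invalid_logps - ref_invalid_logps))
    preference_loss = -F.logsigmoid(margin).mean()

    # (a) Language modeling term: keep raising the likelihood of the
    # valid path itself (negative log-likelihood per token).
    lm_loss = -(policy_valid_logps / valid_lengths).mean()

    return preference_loss + alpha * lm_loss
```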

These two steps are repeated iteratively, with each iteration using the model from the previous iteration to generate new reasoning paths and answers. Gradually the model learns to generate better reasoning paths and answers, leading to improved performance on the final task.
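
Putting the two steps together, the outer loop could look roughly like the sketch below. The helper callables (generate and train_on_pairs) are hypothetical placeholders for whatever generation and training code you use; the structure of the loop, not the specific calls, is the point.

```python
import random

def iterative_reasoning_po(train_set, generate, train_on_pairs,
                           iterations=3, samples_per_question=8):
    """Outer loop sketch: generate, filter by answer correctness, train, repeat.

    train_set: list of (question, gold_answer) pairs.
    generate(question): samples one (chain_of_thought, answer) from the current model.
    train_on_pairs(pairs): trains on (question, valid_cot, invalid_cot) preference
        pairs using a combined preference-plus-LM loss and returns a new generate function.
    """
    for _ in range(iterations):
        pairs = []
        for question, gold in train_set:
            samples = [generate(question) for _ in range(samples_per_question)]
            # A reasoning path counts as valid if its final answer matches the gold answer.
            valid = [cot for cot, answer in samples if answer == gold]
            invalid = [cot for cot, answer in samples if answer != gold]
            # Pair valid and invalid paths for preference optimization.
            for good, bad in zip(valid, invalid):
                pairs.append((question, good, bad))
        random.shuffle(pairs)
        # The next iteration samples from the newly trained model.
        generate = train_on_pairs(pairs)
    return generate
```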

Some key features of this method:

  • It does not require any additional labeled data beyond the original training set, as it generates its own reasoning paths and answers.
  • It leverages the model's own generations to create a form of self-supervision, where the model learns from its own successes and failures.
  • The combination of the language modeling loss and the preference loss allows the model to improve both the quality and the correctness of its generated reasoning paths.

The paper's focus is quite narrow: the research addresses reasoning specifically in the context of question-answering tasks with well-defined correct answers. It remains to be seen how well this approach would generalize to more open-ended or subjective reasoning tasks, where the notion of a single "correct" answer may not apply. And while the iterative training process led to significant performance gains, it does require multiple rounds of computation, which could be resource-intensive for larger models or more complex tasks. Think $$$.

For experts, this paper provides a novel technical approach that combines diverse generation and preference optimization in an iterative process, with insights that could guide further research in this area. The three most important takeaways are:

  1. The Iterative Reasoning Preference Optimization method combines two key components: a) generating diverse reasoning paths and answers using the language model, and b) optimizing the model's preferences using a modified version of the Direct Preference Optimization (DPO) loss that includes a language modeling term. This dual approach allows the model to improve both the quality and correctness of its reasoning over multiple iterations.
  2. The method demonstrates significant performance improvements on challenging reasoning tasks like math word problems (GSM8K), science questions (ARC Challenge), and more advanced math (MATH), without requiring any additional labeled data. This suggests that the approach is effective at extracting and amplifying the reasoning capabilities already present in large language models.
  3. The paper highlights the importance of the language modeling term in the preference optimization loss. Ablation studies show that DPO alone, without the language modeling term, leads to inferior performance. This insight could inform the design of future preference optimization methods for enhancing specific capabilities in language models.

Overall, the paper makes a significant contribution by demonstrating a new approach to improving reasoning in language models that is both effective and efficient.


Lessons from this research to enhance your own prompting:

As always, we like to pull out ideas from The Science to help you improve your own mental models for prompting. The paper reinforces some best practices for using GenAI systems when you're working on tasks of higher complexity, for example:

  1. Feedback on outputs: Whenever possible, try to provide feedback to the model about which outputs are better or worse. Even if you don't know the right answer, you might have a preference between different options generated by the model. This kind of comparative feedback can help guide the model.

    Example: If you're using a GenAI system to help brainstorm ideas for a new product, and it generates two options: "a smart water bottle that tracks your hydration" and "a water bottle that plays music," you might prefer the first option because it seems more useful and innovative. You could provide this feedback to the model.
  2. Iterative solution paths: If you're trying to solve a multi-step problem, consider generating multiple possible solution paths from the model and then selecting the most promising one to build on further. This is analogous to the iterative generation and filtering happening in the paper.

    Example: If you're using a GenAI system to write a short story, you could start by generating multiple possible outlines or plot points. Then, select the most promising one and ask the model to expand on it, generating multiple possible paragraphs. Continue this process of selection and expansion until you have a complete story draft. (The first sketch after this list shows what this generate-and-select loop can look like in code.)
  3. Breaking down problems: For complex reasoning tasks, it may help to break down the problem into smaller sub-problems that you can solve and verify more easily, and then use the model to stitch them together. This is similar to how the researchers used final answers to guide the selection of good reasoning chains.

    Example: If you're using a GenAI system to help plan a complex project, like organizing a conference, you could break it down into sub-tasks like "create agenda," "invite speakers," "arrange catering," etc. Then, you could work with the model to generate and refine plans for each sub-task, and finally integrate them into a master plan.
  4. Verifying model outputs: Keep in mind that even if the model generates an impressive-looking solution, it might still be incorrect. Always try to critically examine the model's reasoning and check the final answer, rather than blindly trusting it. The iterative training in the paper relies on this kind of verification signal.

    Example: If you're using a GenAI system to help with a coding task, and it generates a function that seems to work, don't just assume it's correct. Test the function with different inputs and edge cases to verify its correctness and robustness. (The second sketch after this list gives a tiny example of this kind of checking.)
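
For the iterative generate-and-select workflow in point 2, here is a minimal sketch. The generate argument stands in for however you query your model (an API client, a local model, or even a chat window you copy from); the names and the loop structure are ours, not from the paper.

```python
def pick_best(options):
    """You are the filter: print the candidates and choose one by number."""
    for i, option in enumerate(options):
        print(f"--- option {i} ---\n{option}\n")
    return options[int(input("Which option should we build on? "))]

def iterate_on_drafts(generate, task, rounds=3, candidates=4):
    """Generate several candidates, keep the most promising, and build on it."""
    prompt = task
    best = None
    for _ in range(rounds):
        options = [generate(prompt) for _ in range(candidates)]
        best = pick_best(options)
        # Fold the chosen draft back into the next prompt and expand on it.
        prompt = f"{task}\n\nExpand and improve on this draft:\n{best}"
    return best
```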
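And for point 4, a small illustration of verifying model output rather than trusting it. Suppose the model wrote a palindrome checker for you; the function below is a made-up example, but the habit of probing ordinary inputs and edge cases applies to any generated code.

```python
# Hypothetical function a model might have generated for you:
def is_palindrome(text: str) -> bool:
    cleaned = "".join(ch.lower() for ch in text if ch.isalnum())
    return cleaned == cleaned[::-1]

# Don't just assume it works: probe it with ordinary inputs and edge cases.
assert is_palindrome("Racecar")
assert is_palindrome("A man, a plan, a canal: Panama")
assert not is_palindrome("hello")
assert is_palindrome("12321")   # digits count as characters here
assert is_palindrome("")        # decide whether this is the behavior you want
print("All checks passed.")
```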

More summary points on DPO vs. RLHF

DPO (Direct Preference Optimization):

  • DPO is a technique that trains a model to prefer certain outputs over others, based on some notion of quality or correctness.
  • In the context of language models, DPO can be used to train the model to prefer generating "good" or "correct" responses over "bad" or "incorrect" ones.
  • The paper's method uses a modified version of DPO in its preference optimization step, where it trains the model to prefer valid reasoning paths (that lead to correct answers) over invalid ones.

RLHF (Reinforcement Learning from Human Feedback):

  • RLHF is a general approach to training AI models where the model learns from feedback provided by human users or annotators.
  • In a typical RLHF setup, the model generates outputs, and humans provide rewards or punishments based on the quality of those outputs. The model then learns to generate outputs that maximize the rewards.
  • RLHF requires ongoing human interaction and feedback, which can be costly and time-consuming.

Key differences between the paper's method and RLHF:

  1. No human feedback: The paper's method does not rely on human feedback. Instead, it uses the correctness of the model's own generated answers as a form of automatic feedback. This makes it more scalable and cost-effective than RLHF.
  2. Iterative refinement: The paper's method involves multiple iterations of generation and optimization, where each iteration uses the model from the previous iteration. This allows for gradual refinement of the model's reasoning capabilities. In contrast, standard RLHF pipelines typically collect feedback and update the model in a single round, or at most a few.
  3. Preference optimization: The paper's method uses a specific optimization technique (modified DPO) to train the model to prefer good reasoning paths over bad ones. RLHF, on the other hand, is a general framework that can use various optimization techniques.
