Meta-Prompting and What It Tells Us About a Bias for Code


Meta-prompting research

Key Points:

  • Meta-prompting acts as a multi-expert system, breaking down tasks for specialized model instances to tackle, improving accuracy in areas like programming and creative writing.
  • The "distribution of experts" strategy leverages the model's wide-ranging knowledge, assigning tasks to the most suitable "expert" within the model.
  • Code tasks see improved performance due to the model's programming language proficiency, showcasing its dual capability for specialized and general reasoning.
  • The paper highlights the effectiveness of meta-prompting in tasks such as sonnet writing, demonstrating significant accuracy improvements with and without code.

A new paper from Stanford and OpenAI offers us a glimpse into the "mind" of GPT-4 and its bias for code.

Meta-prompting is a technique that enhances language models' performance by acting as a multi-expert system. It breaks complex tasks into smaller parts, assigns them to specialized instances within the same model, and integrates the outputs. This method significantly improves task accuracy, including in programming and creative writing, by leveraging a model's ability to execute code in real-time and apply diverse expert knowledge.

The approach is task-agnostic, simplifying user interaction without needing detailed instructions for each task, and demonstrates the potential for broad applicability in enhancing model utility and accuracy. The task-agnostic nature of meta-prompting suggests there are good general-purpose applications in interface design but also for the regular user of tools like ChatGPT, even accounting for their more constrained nature.

The "distribution of experts" idea in meta-prompting involves assigning specific tasks to different "expert" components within a language model, based on their specialized knowledge or capabilities. It works because it leverages the diverse range of information and problem-solving strategies embedded within the model.

Meta-prompting guides the LM to break down complex tasks into smaller, more manageable subtasks. These subtasks are then handled by distinct “expert” instances of the same LM, each operating under specific, tailored instructions.
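
A minimal sketch may help make that loop concrete. Everything here is illustrative: the `chat()` helper, the conductor wording, and the "expert name :: instructions" format are assumptions of mine, not the paper's actual prompts or parsing.

```python
def chat(system_prompt: str, user_prompt: str) -> str:
    """Placeholder for a call to a chat model (e.g. a chat-completion API)."""
    raise NotImplementedError

def meta_prompt(task: str) -> str:
    # 1. A "conductor" instance of the model decomposes the task and names experts.
    plan = chat(
        "You are a conductor. Break the task into subtasks. For each, name an "
        "expert (e.g. 'Expert Poet') and write their instructions. Return one "
        "'expert name :: instructions' pair per line.",
        task,
    )

    # 2. Each subtask goes to a fresh instance of the *same* model, prompted as
    #    the named expert and given only its tailored instructions.
    expert_outputs = []
    for line in plan.splitlines():
        if "::" not in line:
            continue
        expert_name, instructions = (part.strip() for part in line.split("::", 1))
        answer = chat(f"You are {expert_name}.", f"{instructions}\n\nOriginal task: {task}")
        expert_outputs.append(f"{expert_name}:\n{answer}")

    # 3. The conductor integrates the experts' outputs into one final answer.
    return chat(
        "You are a conductor. Combine the expert outputs below into a single, "
        "coherent answer to the original task.",
        f"Task: {task}\n\n" + "\n\n".join(expert_outputs),
    )
```

The important design choice is that every "expert" is just another call to the same underlying model, differentiated only by its instructions.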

Performance gains with this method sit around 15-17% across a range of tasks, from math to writing.

An example of a meta-prompt for writing. Note the additional prompt to the model to use code as needed.

The use of code versus non-code tasks highlights this difference: tasks involving code benefit from the model's ability to understand and generate programming language, whereas non-code tasks utilize the model's general knowledge and reasoning abilities. This distinction underlines the model's versatility in applying specialized knowledge to a wide array of problems.

Let's put this another way: models perform better when they can use code as a way to assemble and query "experts".
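
Here is a hedged sketch of that computational mode, under the same assumptions as the earlier one. The prompts and the simple `exec`-based runner are illustrative choices of mine, not the paper's implementation (the paper gives the meta-model access to a Python interpreter as a tool).

```python
import contextlib
import io

def chat(system_prompt: str, user_prompt: str) -> str:
    """Placeholder for a chat-model call, as in the earlier sketch."""
    raise NotImplementedError

def run_python(code: str) -> str:
    """Execute model-generated Python and capture stdout (only do this in a sandbox)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})
    return buffer.getvalue()

def solve_with_code(subtask: str) -> str:
    # Ask an "Expert Programmer" instance for a script rather than a direct answer.
    code = chat(
        "You are an Expert Programmer. Reply with only a Python script that "
        "prints the answer to the task.",
        subtask,
    )
    computed = run_python(code)
    # The conductor integrates the computed result instead of relying on recall.
    return chat(
        "You are a conductor. State the final answer to the task.",
        f"Task: {subtask}\nComputed result: {computed}",
    )
```

Executing the generated code grounds the answer in an actual computed result rather than the model's recall, which is plausibly where the extra lift on computational tasks comes from.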

Think of this as the model being able to operate in two modes. Code tasks access specialized reasoning: they invoke programming languages, algorithms, and computational logic, setting the scene for reasoning that is precise and structured. Non-code tasks use general reasoning: they rely on broader knowledge, language comprehension, and inferential skills, which makes them applicable to a wide range of topics and contexts.

This distinction is key because it showcases a language model's dual capability: to accurately process and generate code based on strict logical rules, and to engage in more fluid, general conversational exchanges that reflect human-like understanding and creativity.

We can see two things in the results: the difference between the code and no-code approaches, and the "experts" each approach calls upon.

Take sonnet writing: a task where you might intuitively imagine code has no role to play. But you'd be wrong. Sonnet writing requires linguistic accuracy and adherence to specific poetic forms. Standard methods achieve a 62% accuracy rate, while meta-prompting reaches 79.6% accuracy with a Python interpreter and 77.6% without it, demonstrating its effectiveness, according to the researchers.

The paper includes visualizations of the distribution of experts conjured by the meta-model, with and without code. The charts offer glimpses into the two modes of reasoning that meta-prompting unlocks. Generating code in real-time as part of a meta-prompt gives off system 2 vibes.

Distribution of experts with code
Distribution of experts without code

There's so much to dig further into here: more questions than answers, continuing the intrigue that is "just how do these darn things work?!?" For example, with code, sonnet writing draws on an expert poet and an expert essayist, while without code, the essayist is swapped out for an expert poet reviewer. Why?

Remember that the meta-model selects experts dynamically. This ability to dynamically select expert types is going to be important for designers to grapple with. While this research suggests there are obvious places where the model will have a noticeable preference for technical and computational expertise (such as word sorting), it's a bias that might not always be appropriate. Why? Because when the model is prevented from using code, it brings in a more diverse spectrum of experts.

This research shows how flexible these models are: meta-prompting decomposes complex tasks, engages distinct expertise, adopts a computational bias when it can use code in real-time (which further enhances performance), and then seamlessly integrates the varied outputs.


Footnote: The remaining blank space on the charts above represents a combination of experts that were employed infrequently.
