New research reveals that large language models can generate superior prompts for themselves through automated techniques, reducing reliance on specialized fine-tuning.
Advanced AI systems like GPT-4 display impressive language skills. However, their performance often depends heavily on how users “prompt” them—the examples, instructions, and context provided as input. Prompting is an important skill if you want to get better results from chatbots such as ChatGPT or Claude.
Prompt engineering is the more specialized discipline. It is the process of developing prompting techniques that enable foundation models to perform specific tasks. It can unlock substantially better results without additional model training (called fine-tuning), which is expensive and time-consuming.
Now, new research reveals that language models can actually write better prompts for themselves than human experts can in many cases. By automating the prompt generation process, the study substantially boosted GPT-4’s performance on complex medical exams to surpass specialized systems. The automated prompting paradigm also reduced compute costs significantly. This suggests that having AI systems refine and enhance their own prompts may make their capabilities far more accessible and applicable across diverse real-world tasks.
Unlike fine-tuning, which updates model parameters, prompting works by better steering the skills a model learned during pretraining. Techniques like “few-shot learning” show the model just a few demonstrative examples of the desired format. “Chain-of-thought” prompting asks the model to detail its reasoning process, which pushes it to exhibit clearer logic. And small tweaks like shuffling the order of multiple-choice options check for biases. Together these methods can enable significant performance gains without extensive retraining.
Yet prompt engineering has traditionally depended heavily on manual effort and domain expertise. In medicine, for instance, recent systems like Med-PaLM leveraged hospital clinicians to hand-craft prompts specifically tailored for diagnostic exams. Producing these custom-designed prompts requires substantial specialization. This new study shows that in many cases, language models can generate superior prompts for themselves.
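The researchers systematically tested different methods for automated prompt generation with GPT-4 using a suite of medical exam questions spanning specialties. They evaluated techniques like automatically selecting the few-shot examples most similar to each question, having the model generate its own chain-of-thought reasoning, and shuffling multiple-choice options to check the consistency of its answers.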
The study found that language models write better prompts for themselves than even clinical experts can in certain contexts. For example, on a set of US medical licensing exam questions, GPT-4’s self-produced chains of reasoning raised its accuracy from 86% to over 90%, compared with explanations hand-crafted by doctors with access to external references.
Above: visual illustration of incremental and additive increases in accuracy with each step of automated prompts. Relative contributions of each component are shown at the bottom.
Above: performance over time with different models and techniques
Automating prompt production has profound implications both practically and philosophically. In applied terms, it substantially reduces the manual effort and domain expertise previously needed to optimize language model performance on specialized tasks. This also lowers the computational overhead 10-fold by allowing more lightweight prompting approaches versus intensive fine-tuning of larger models.
Philosophically, the study reinforces that language models have become so adept at natural language that they can effectively “prompt themselves” better than humans can in select cases. Unlocking their own latent potential via self-refinement may make their capabilities far more accessible to mainstream users beyond AI experts. The automated prompting paradigm may help models continue to generalize robustly as they grow larger as well.
Let’s say that again for emphasis: Bigger AI models may reason better if left to question themselves.
The techniques explored are also general purpose, not just demonstrated on medical exams. The fundamental approaches outlined in this paper could extend to optimize language assistant performance on diverse applications from customer support queries to legal contract reviews and more.
Above: GPT-4 performance with three different prompting strategies on out of domain datasets. Across these datasets, Medprompt provides an average improvement of +7.3% over baseline zero-shot prompting.
While you can’t automate this inside of ChatGPT, you can try similar principles of automated prompting in a simpler form today:
Teach by Example:
Humans find it easier to learn when given concrete examples first. Similarly, AI models can answer new questions better if "primed" with a few demonstrations first. Instead of fixed examples, an advanced technique has the model automatically select the best examples to show based on similarity to the new question. This dynamic approach tailors the examples to each case. Provide 2-3 examples of your question format using concise queries you would ask along with desired responses. Pick samples relevant to each new question for the best priming.
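For readers who script their prompts, here is a minimal sketch of picking examples by similarity to the new question. It assumes a crude word-overlap score as a stand-in for the similarity measure (the study's exact method is not described here), and the example pool, function names, and prompt layout are all hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts; a crude stand-in for a real similarity model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_examples(question, pool, k=2):
    """Return the k (question, answer) pairs most similar to the new question."""
    target = bag_of_words(question)
    ranked = sorted(
        pool,
        key=lambda ex: cosine_similarity(target, bag_of_words(ex[0])),
        reverse=True,
    )
    return ranked[:k]


def build_few_shot_prompt(question, pool, k=2):
    """Show the selected examples first, then the new question left open."""
    parts = [f"Q: {q}\nA: {a}" for q, a in select_examples(question, pool, k)]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


# Hypothetical example pool, for illustration only.
EXAMPLE_POOL = [
    ("Which vitamin deficiency causes scurvy?", "Vitamin C"),
    ("Which organ produces insulin?", "The pancreas"),
    ("Which vitamin deficiency causes rickets?", "Vitamin D"),
    ("What is the capital of France?", "Paris"),
]

prompt = build_few_shot_prompt(
    "Which vitamin deficiency causes night blindness?", EXAMPLE_POOL
)
print(prompt)
```

Word overlap is only a rough proxy. The point of the sketch is the shape of the workflow: score every stored example against the new question, keep the closest few, and place them ahead of the question.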
Ask for Step-by-Step Reasoning:
AI systems can seem like "black boxes" in how they produce answers. Getting them to show their work makes their reasoning clearer. Rather than have people hand-craft explanations, have the model generate its own chain of logic in plain language leading to each answer. This leverages the model's own capabilities better than human-authored chains. After your examples, ask the AI to show its reasoning step-by-step, which also helps you understand how it reached its answer.
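As a rough illustration, a step-by-step request can be built as a small prompt template, with a helper that pulls the final answer out of the model's reply. The wording of the instruction, the "Answer: <letter>" convention, and the sample reply below are assumptions made for this sketch, not taken from the study.

```python
import re


def build_cot_prompt(question, choices):
    """Ask for step-by-step reasoning followed by a clearly marked answer line."""
    lines = [f"Question: {question}"]
    lines += [f"{label}. {text}" for label, text in choices.items()]
    lines.append(
        "Think through the problem step by step, "
        "then finish with a line of the form 'Answer: <letter>'."
    )
    return "\n".join(lines)


def extract_answer(response):
    """Return the letter from the last 'Answer: X' line, or None if absent."""
    matches = re.findall(r"^Answer:\s*([A-Z])\b", response, flags=re.MULTILINE)
    return matches[-1] if matches else None


cot_prompt = build_cot_prompt(
    "Which vitamin deficiency causes night blindness?",
    {"A": "Vitamin A", "B": "Vitamin C", "C": "Vitamin D"},
)
print(cot_prompt)

# Hypothetical model reply, written by hand for illustration.
sample_reply = (
    "Step 1: Night blindness reflects poor low-light vision.\n"
    "Step 2: The retina needs vitamin A to make its light-sensitive pigment.\n"
    "Answer: A"
)
print(extract_answer(sample_reply))
```

Asking for a fixed final line keeps the reasoning readable for you while making the answer itself easy to pick out and compare across runs.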
Shuffle and Check Consistency:
AI models have biases and can get confused by the way choices are presented. By shuffling multiple-choice options randomly and cross-checking the model's answers, the system reduces position bias and selects the choice most robust to order effects. Try randomly changing the order of the multiple-choice options, then cross-check the consistency of the AI's answers to reduce bias.
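The shuffle-and-check idea can be sketched as a simple majority vote over several random orderings. This is a minimal illustration under stated assumptions: `ask_model` is a hypothetical callable standing in for a real chatbot call, and the two toy "models" below exist only to show the difference between an order-robust answer and a position-biased one.

```python
import random
from collections import Counter


def shuffled_vote(question, choices, ask_model, rounds=5, seed=0):
    """Ask the same question under several random orderings and majority-vote.

    ask_model(question, ordered_choices) is a hypothetical callable that
    returns the index of the option picked in that particular ordering.
    """
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(rounds):
        order = list(choices)
        rng.shuffle(order)
        picked_index = ask_model(question, order)
        votes[order[picked_index]] += 1  # map the position back to the option text
    answer, count = votes.most_common(1)[0]
    return answer, count / rounds  # winning option and share of rounds it won


def consistent_model(question, ordered_choices):
    """Toy stand-in: always finds the same option, wherever it appears."""
    return ordered_choices.index("Vitamin A")


def first_option_model(question, ordered_choices):
    """Toy stand-in with pure position bias: always picks the first option."""
    return 0


QUESTION = "Which vitamin deficiency causes night blindness?"
CHOICES = ["Vitamin A", "Vitamin B12", "Vitamin C", "Vitamin D"]

for model in (consistent_model, first_option_model):
    answer, agreement = shuffled_vote(QUESTION, CHOICES, model)
    print(f"{model.__name__}: {answer!r} with agreement {agreement:.0%}")
```

A low agreement share is the useful signal here: it suggests the answer depends on where an option sits rather than on what it says, so it deserves a second look.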
While current chatbots don’t have the full capabilities outlined in the research yet, these basic principles of steering with more targeted examples, asking “show your work”, and checking for robustness can help you prompt better performance on new tasks today.
As language AI continues advancing at a rapid pace, robustness and general applicability will only grow more vital. Rather than intense retraining, unlocking capabilities via optimized prompting ensures models remain broadly useful across diverse domains. If models can “prime themselves” better than we can prompt them explicitly going forward, making prompt engineering accessible may allow more groups to benefit from AI.