Key Points:
- GPT's training objective is next-word prediction over internet text corpora. This narrow focus explains resulting capabilities and limitations.
- Numerous examples illustrate GPT-3.5's and GPT-4's failures on uncommon or atypical inputs and outputs, complex reasoning, physical tasks without vision, and more.
- The authors categorize key sources of error, such as likelihood bias, data dependence, sensitivity to wording, and difficulty combining concepts.
- They recommend identifying mismatches between training objectives and real-world tasks, and applying skepticism wherever the required skills exceed language prediction.
- GPT's strengths and weaknesses reflect its internet-corpus training, not comprehensive intelligence. Outputs emerge indirectly rather than from manipulating abstract ideas.
- LLMs operate within fluid competence frontiers shaped by likelihood tradeoffs. Their impressive yet limited capacities are echoes of thought, not complete cognition.
Clark's nutcrackers cache food like no other. These birds hide thousands of seeds in the soil each autumn and depend on recalling cache locations months later to survive the winter. Their extraordinary spatial memory outperforms that of related species like scrub jays.
This difference makes sense through a teleological lens: the ecological need to find cached food shaped the evolution of enhanced spatial abilities in nutcrackers. Their cognitive specialization arose from the particular problems their lifestyle requires solving.
Similar principles extend to artificial intelligence. Analyzing what objective a system was optimized for reveals a lot about resulting capabilities. Just as naturally selected skills reflect challenges faced, so too do trained AI abilities mirror prescribed training goals.
A recent paper from researchers at Princeton takes this idea and applies it to GPT-3.5 and GPT-4. Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve argues that to develop a holistic understanding of these systems, we need to consider the problem they were trained to solve: next-word prediction over internet text.
This work brilliantly imports a biological mindset that is alien to AI development, one grounded in adaptive pressures sculpting cognition. Frontier research enjoys vast resources to pour into models. Brains confront stark constraints: limited skull space, energy, and lifespan. Evolution molds thinking to meet behavioral needs within environmental limits. Nutcrackers' spatial mastery arose from caching seeds to survive. Likewise, language models' impressive yet peculiar talents reflect the imprint of their training objective: predicting the next token of internet text. Unbridled power produces narrow gifts, while scarcity spurs general intelligence. Studying how capabilities serve objectives, whether bred by bytes or biology, gives us insight into the purposeful roots of intelligence.
The researchers find all sorts of intriguing failures in large language models, all stemming from the problem these models are trained to solve: predicting the next word in text found on the internet.
LLMs are better at common tasks than at rare ones, even when both are of the same complexity. Models are influenced by how likely certain answers or inputs are, even when the task itself is straightforward. To put it simply, the model's performance is affected by how common or rare the inputs (the prompt) and the desired outputs (what you want it to generate) are.
GPT-4 can easily sort a list of words in alphabetical order, but not in reverse alphabetical order. Sorting in alphabetical order is common; reverse alphabetical order is rare.
GPT-4 can multiply a number by 9/5 and then add 32, but it can't multiply the same number by 7/5 and then add 31. The first linear function, f(x) = (9/5)x + 32, is common because it is the Celsius-to-Fahrenheit conversion, while f(x) = (7/5)x + 31 is rare.
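Both pairs of tasks are trivial to compute deterministically, which is what makes the frequency effect so striking. Below is a minimal Python sketch (an illustration only; the word list and input value are arbitrary choices, not examples from the paper) that produces the ground truth you could compare an LLM's answers against:

```python
# Ground truth for the two example task pairs: sorting in both directions,
# and the common vs. rare linear functions. The word list and input value
# are arbitrary illustrations, not taken from the paper.

words = ["pear", "apple", "mango", "kiwi", "banana"]

alphabetical = sorted(words)                        # common task
reverse_alphabetical = sorted(words, reverse=True)  # rare task, same complexity

def f_common(x):
    """Celsius-to-Fahrenheit: this exact function appears constantly online."""
    return (9 / 5) * x + 32

def f_rare(x):
    """Same form and difficulty, but essentially absent from web text."""
    return (7 / 5) * x + 31

print(alphabetical)               # ['apple', 'banana', 'kiwi', 'mango', 'pear']
print(reverse_alphabetical)       # ['pear', 'mango', 'kiwi', 'banana', 'apple']
print(f_common(25), f_rare(25))   # 77.0 66.0
```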
When asked the birthday of a public figure, GPT-4 is far more accurate when the person has a high probability of being mentioned online. GPT-4 knows that Carrie Underwood's birthday is March 10, 1983, but doesn't know that Jacques Hanegraaf's is December 14, 1960. Curiously, this disparity reveals that even when a deterministic mapping from input to output exists, an LLM may not reliably learn or use it; its recall tracks how often a fact appears online.
When can we expect inaccuracies from language models? First, consider how extensively your prompt topic appears online: the more ubiquitous the concepts, the more accurate the AI. Second, identify conflicts between the task and the core competence of next-word prediction. Complete this sentence: "My task requires using next-word prediction for <complex causal / conceptual / spatial / mathematical reasoning / deterministic computation>." Any mismatch signals a likelihood of errors.
Statistical prediction falters on niche contexts, mental modeling, spatial reasoning, causal chains, and disciplines with technical lexicons. If your task relies on proficiency areas beyond the AI's language fluency, anticipate shakier outputs. Scrutinize the fit between what you request and what the transformer architecture directly provides: next-token likelihoods given prior text. Where the skills needed exceed predictive text, supplement with skepticism.
Dive deeper: our takeaways for being more LLM-aware:
Likelihood Influences Outputs
LLMs favor more statistically probable responses over precise ones. When asked to summarize a passage, coherence can trump factual correctness. Models trained on internet text gravitate towards stereotypical outputs over nuanced reasoning. Even straightforward tasks like forming abbreviations are affected by how likely the potential answers are. For example, an LLM might be highly accurate at sorting a list in alphabetical order but much less accurate at sorting the same list in reverse alphabetical order.
Proprietary Data Lowers Accuracy
LLMs perform best on open, diverse training data like public internet text. Proprietary content from specialized domains leads to poorer results unless explicitly fine-tuned. Without tailored tuning, uncommon jargon and concepts will trip models up. Assumptions valid on the open web may not hold for niche contexts.
Physical Reasoning Lags Without Vision
Tasks requiring spatial or physical reasoning pose challenges when there are no accompanying visual inputs. LLMs cannot accurately describe complex physical movements from text prompts alone. Multi-modal models with visual components will likely improve on this, but for now, distrust an LLM's knowledge about embodied interactions.
Wording Sensitivity Requires Varied Prompting
Slight changes in how a prompt is worded can significantly impact LLM outputs. Unlike humans, these models do not comprehend language, but react statistically to textual patterns. Rephrasing questions in multiple ways and simplifying verbose prompts helps improve results. Don’t fall into the trap of thinking long, complex prompts are necessarily better.
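One practical way to act on this, sketched below, is to send several paraphrases of the same request and compare the answers. The query_llm helper is a hypothetical placeholder for whichever client library you actually use, not a specific vendor's API:

```python
# Varied prompting sketch: ask the same thing several ways and compare.
# `query_llm` is a hypothetical placeholder, not a real API; replace its
# body with a call to your actual LLM client.

def query_llm(prompt: str) -> str:
    return "<model response here>"  # stand-in for a real API call

paraphrases = [
    "Sort these words in reverse alphabetical order: pear, apple, mango.",
    "List these words from Z to A: pear, apple, mango.",
    "Order the words pear, apple, mango so the alphabetically last word comes first.",
]

answers = [query_llm(p) for p in paraphrases]

# If the answers disagree, wording sensitivity is at play and none of them
# should be trusted without an independent check.
for prompt, answer in zip(paraphrases, answers):
    print(f"{prompt}\n  -> {answer}\n")
```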
Causal Reasoning Remains Limited
Despite impressively coherent text, LLMs do not truly understand cause and effect. They do not reason about why events occur or the motivations behind human behavior. Expect factual inaccuracy when asking an LLM to analyze complex events or social dynamics beyond surface patterns.
Mistakes Compound Without Correction
Unlike humans, LLMs cannot sense and revise their mistakes. All prior outputs are taken at face value when used in future predictions, so erroneous information gets compounded, resulting in a form of "self-delusion." This challenge requires vigilance in monitoring for signs of cascading errors. If an LLM makes a mistake and you don't call it out, it will propagate. It's also better to give models clear instructions up front than to try to fix answers later. First prompts matter most.
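One defensive pattern, sketched below, is to verify an intermediate answer deterministically before feeding it back into a later prompt, so a single early mistake cannot cascade. As before, query_llm is a hypothetical placeholder rather than a real client:

```python
# Sketch: check an LLM's intermediate result before reusing it downstream.
# `query_llm` is a hypothetical placeholder for your actual LLM client call.

def query_llm(prompt: str) -> str:
    return "<model response here>"  # stand-in for a real API call

def to_fahrenheit(celsius: float) -> float:
    return (9 / 5) * celsius + 32

claimed = query_llm("Convert 25 degrees Celsius to Fahrenheit. Reply with the number only.")

try:
    value = float(claimed)
except ValueError:
    value = None

expected = to_fahrenheit(25)
if value is None or abs(value - expected) > 1e-6:
    # Don't carry the bad value into the next turn; correct it here instead of
    # trying to patch the conversation several prompts later.
    claimed = str(expected)

follow_up = query_llm(f"It is {claimed} degrees Fahrenheit outside. Suggest suitable clothing.")
print(follow_up)
```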
Biases Reflect Data and Tuning
LLMs reflect existing societal biases in their training data. However, their biases also stem from the instruction-tuning techniques used to improve performance. Such tuning prioritizes certain responses over others. In both cases, biases arise from statistical patterns rather than reasoned thought.
Memorization Can Override Systematic Reasoning
LLMs sometimes latch onto repeated text and reproduce it regardless of relevance. More concerning, they may memorize specific input-output pairs rather than learning the systematic function that connects them. For example, an LLM may answer a celebrity's birthday based on how often it was mentioned online rather than through any reliable fact-retrieval process. Even when the correct information is available, the pull toward high-probability text can override accuracy. This quirk requires vigilance when relying on LLMs for factual information, even when they can source it online.
Ideas Are Decomposed into Words
LLMs are trained on words, not on ideas, so they are sensitive to how ideas are presented to them. They process inputs as sequences of words rather than parsing meaning. Changing the order of words or breaking apart concepts affects outputs, even when the key concepts are preserved. Long-form prompts will rarely improve results over concise phrasing; ideas must be conveyed through compositional word patterns. For example, changing "explain quantum entanglement" to "explain the phenomenon of quantum entanglement" significantly impacts outputs, even though the underlying request is unchanged. ChatGPT "chose" to answer the former as if explaining to a high schooler, while the latter answer was pitched at a higher level of comprehension and was more complete.
Combining Concepts Has Limits
While LLMs can creatively combine familiar concepts, such as composing a duet for two celebrities, they struggle to integrate new ideas fluidly as complexity increases. Humans mentally juggle contexts and concepts with ease, but for LLMs, nuanced context shifting breaks down beyond low-complexity tasks.
LLMs are models of language, not models of thought. They don’t directly manipulate ideas or structured models of the world. If they look like they do, it’s because such outputs emerge indirectly.
LLMs operate within fluid competence frontiers demarcated by the interplay of objective and linguistic likelihood. Performance remains uneven across tasks and likely will for some time as it is so strongly coupled with statistical features of language use on the web. Straightforward queries can stump models when expressed unusually because statistical patterns override semantics. Yet rare phrasing scarcely trips up human cognition, equipped to reason about novel inputs.
So too do LLMs falter on many tasks simple for us but alien to narrow training regimes, like spatial, causal, or conceptual reasoning. Their capacities, while impressive, are echoes of thought, not complete cognition. Predicting the next token from internet text fails to capture the evolved complexity of human problem-solving abilities.
Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve
R. Thomas McCoy, Shunyu Yao, Dan Friedman, Matthew Hardy, and Thomas L. Griffiths (Princeton University)
"To understand what language models are, we must understand what we have trained them to be."