Key Points:
- The Bitter Lesson and Scaling Laws: Rich Sutton’s 2019 essay, “The Bitter Lesson,” emphasized that general methods leveraging computation, such as search and learning, are most effective in AI development. Scaling laws in AI show that as models increase in size, data, and computational resources, their performance predictably improves, often following a power law.
- Emerging Field of Mechanistic Interpretability: Mechanistic interpretability (mech-int) seeks to understand how AI models “think” by examining their internal workings. This field, likened to understanding a rock band where each component contributes to the overall behavior, aims to make AI processes more transparent and comprehensible.
- Emergent Capabilities with Scale: As AI models scale, they exhibit new capabilities not present in smaller models. These emergent abilities arise from complex interactions within the network and the types of knowledge or skills the model learns.
- Quantization Hypothesis: A recent hypothesis from MIT and IAIFI, called the Quantization Hypothesis, suggests that knowledge in neural networks is acquired in discrete ‘quanta’ (chunks of knowledge or skills). As networks scale, they accumulate these quanta following a power law, leading to significant, though diminishing, improvements in capabilities.
In AI circles there is a famous essay by Rich Sutton called The Bitter Lesson. Its core idea is this: the biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. Writing in 2019, Sutton was prescient in claiming that it is search and learning that scale arbitrarily, with the implication that a focus on those attributes will yield the greatest gains.
Now, in 2023, the scaling laws of large language models are firmly entrenched in researchers’ minds. Scaling laws in AI refer to the observed patterns that as artificial intelligence models increase in size, data, and computational resources, their performance improves in a predictable manner, often following a power law, leading to enhanced capabilities and the emergence of new functionalities. This has led many (by no means all) researchers to believe that AGI will happen within a few years, simply because of more data and larger models, powered by ever bigger compute.
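To make the power-law claim concrete, here is a minimal sketch in Python. The constants are invented for illustration and do not come from any published scaling-law fit; the point is only that a power law is a straight line on a log-log plot, so the exponent can be read off as a slope.

```python
import numpy as np

# Toy power-law scaling curve: loss(N) = a * N**(-alpha) + c.
# The constants below are made up for illustration, not taken from any real model family.
a, alpha, c = 400.0, 0.34, 1.7

model_sizes = np.array([1e7, 1e8, 1e9, 1e10, 1e11])  # parameter counts
losses = a * model_sizes ** (-alpha) + c

for n, loss in zip(model_sizes, losses):
    print(f"{n:10.0e} params -> predicted loss {loss:.3f}")

# On a log-log plot a pure power law is a straight line with slope -alpha,
# so we can recover the exponent from the (synthetic) observations.
slope, _ = np.polyfit(np.log(model_sizes), np.log(losses - c), 1)
print(f"recovered exponent alpha ~ {-slope:.2f}")
```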
What fascinates me is unpacking what drives models' rapid improvement: is it sheer scale, or intrinsic learning dynamics?
On one hand, more data and more parameters reliably improve performance. Learning algorithms leverage this volume by discovering useful patterns. Scope and repetition aid pattern recognition.
Yet model architecture matters too: networks have inherent inductive biases. Attention mechanisms concentrate signal, and convolutions exploit spatial locality. Do structures like these explain some of the progress?
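As a toy illustration of the attention point, here is a minimal scaled dot-product attention computation in NumPy. The vectors are random and one key is deliberately aligned with the query, so this is only a sketch of the inductive bias, not a real model component.

```python
import numpy as np

# Minimal scaled dot-product attention on toy vectors, showing how the softmax
# concentrates weight on the keys most aligned with the query.
rng = np.random.default_rng(0)
d = 8
query = rng.normal(size=d)
keys = rng.normal(size=(5, d))
keys[2] = query  # make one key strongly aligned with the query

scores = keys @ query / np.sqrt(d)               # similarity scores
weights = np.exp(scores) / np.exp(scores).sum()  # softmax
print(np.round(weights, 3))                      # most of the mass lands on position 2
```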
We are steadily gaining more insight into just how flexible and powerful neural networks are by examining specific features of how they “think”. This is the territory of the emerging field of mechanistic interpretability (or “mech-int” if you’re in the tribe). The core intuition is that models learn human-comprehensible things and so can be understood.
Mechanistic interpretability in AI is like trying to understand a rock band. Just as each musician and instrument in the band contributes to the music, in AI every model component plays a role in the system's behavior. By dissecting the “rock band” of an AI model and analyzing how each “instrument” or component contributes, we can gain a clearer understanding of how AI learns, making its processes more transparent and comprehensible. So far, mechanistic interpretability techniques have largely been applied to small-scale models and controlled scenarios. The bet is that these methods will scale to larger, more complex networks, but no one knows this for sure. Hopefully, insights gained from smaller models will hold true and remain relevant when applied to larger AI systems.
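To make the instrument-by-instrument idea concrete, here is a small sketch of the general workflow, assuming a toy PyTorch network rather than a real language model: register a hook on one component and inspect what it does on a given input. Real mech-int work targets specific circuits (attention heads, MLP neurons) inside trained transformers.

```python
import torch
import torch.nn as nn

# A toy two-layer network stands in for the "band"; we hook the hidden layer
# so we can inspect what that one "instrument" is doing on a given input.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

captured = {}

def save_activation(module, inputs, output):
    captured["hidden"] = output.detach()

model[1].register_forward_hook(save_activation)  # hook the ReLU output

x = torch.randn(1, 4)
logits = model(x)
print(captured["hidden"])  # which hidden units fired, and how strongly
```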
This article is in three parts: why new capabilities might emerge at scale, adaptability and flexibility of learned algorithms, and what is happening when models learn to generalize.
New Capabilities Emerge with Scale
An emergent ability is an ability that is not present in small models but is present in large models. While there is nuance in measuring and defining emergent properties of large models (there are ways to manipulate what “emerges” based on the choice of metric), it’s accepted that as models get larger they exhibit new capabilities that are not direct extrapolations of smaller models. Instead, these arise both from complex interactions involving emergent structure within the network and from the types of knowledge or skills that the model is more likely to learn.
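A quick synthetic example of that parenthetical point about metrics: suppose per-token accuracy improves smoothly with scale, but the task is scored as exact match over a ten-token answer. The smooth curve then looks like a sudden jump. All of the numbers below are invented for illustration.

```python
import numpy as np

# How a metric can make smooth progress look "emergent": per-token accuracy rises
# smoothly with scale, but exact match (all 10 answer tokens correct) stays near
# zero for a long time and then climbs quickly. All numbers are synthetic.
model_scales = np.logspace(7, 11, 5)                        # parameter counts
per_token_acc = 1 - 0.5 * (model_scales / 1e7) ** (-0.25)   # smooth, made-up curve
exact_match = per_token_acc ** 10                           # all-or-nothing metric

for n, smooth, sharp in zip(model_scales, per_token_acc, exact_match):
    print(f"{n:8.0e} params  per-token acc {smooth:.2f}  exact match {sharp:.3f}")
```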
How emergence happens is a bit of a mystery. Clearly scale matters, but exactly how “bigger” becomes “different” isn’t clear. A recent paper from MIT and IAIFI, the NSF AI Institute for Artificial Intelligence and Fundamental Interactions (with Max Tegmark listed as a co-author), called The Quantization Model of Neural Scaling puts forward an intriguing hypothesis the researchers call the Quantization Hypothesis. It posits that knowledge in neural networks is acquired in discrete ‘quanta’: chunks of knowledge or skills. As networks scale, they amass these quanta following a power law, leading to a significant, albeit diminishing, improvement in capabilities.
Some of these quanta are more useful for learning (that is, reducing loss) than others. This leads to a natural ordering of the quanta, which the researchers term the Q Sequence. A model that is trained optimally should learn the quanta in the order of the Q Sequence. So the effect of scaling is to learn more of the quanta in the Q Sequence, which means scaling performance is simply a question of how many quanta have been learned successfully.
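Here is a toy numerical sketch of that idea. The power-law exponent and the loss bookkeeping are my own simplifications rather than anything taken from the paper: quanta are ranked by how often they are needed, and a model that has learned the first k quanta in the Q Sequence sees its expected loss fall off as a power law in k.

```python
import numpy as np

# Toy sketch of the Quantization Hypothesis: a long list of skill "quanta" whose
# frequency of use follows a power law, and a model that has learned exactly the
# first k quanta in the Q Sequence. Exponent and loss bookkeeping are illustrative.
alpha = 1.5
n_quanta = 10_000
freq = np.arange(1, n_quanta + 1, dtype=float) ** (-alpha)
freq /= freq.sum()          # probability that an example needs quantum q

def expected_loss(k, miss_penalty=1.0):
    # Examples whose quantum has been learned contribute ~0 loss;
    # the rest contribute a fixed penalty.
    return miss_penalty * freq[k:].sum()

for k in [10, 100, 1_000, 10_000]:
    print(f"{k:>6} quanta learned -> expected loss {expected_loss(k):.4f}")
```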
Let’s anthropomorphize this to make it easier to understand. Consider how you might learn a new language. The process of learning can be broken down into discrete units of knowledge or skills (akin to quanta), and these units are acquired in a specific order (the Q Sequence). That order is not random: it reflects the frequency of use and the utility of each unit, mirroring the power-law nature of the quanta.
For example, you learn basic vocabulary and simple grammatical structures first, as they are used more frequently and are foundational for communication. This mirrors the Q Sequence's ordered learning of quanta based on their utility in reducing loss (or, in the analogy, increasing communication effectiveness). The power law matters because you will choose to learn the most common and broadly applicable things first, which gives you the most significant initial improvement in your ability to communicate. As you progress, your learning shifts towards more specific, less frequently used vocabulary and complex grammatical structures.
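A rough back-of-the-envelope version of that intuition, assuming word frequencies follow a simple 1/rank Zipf curve over a hypothetical 50,000-word vocabulary: the most common words give the biggest early gain in coverage, and the returns diminish from there.

```python
import numpy as np

# Zipf's-law sketch of the vocabulary analogy. The 1/rank form and the vocabulary
# size are simplifying assumptions, not measurements of any real corpus.
vocab_size = 50_000
freq = 1.0 / np.arange(1, vocab_size + 1)
freq /= freq.sum()

for known in [100, 1_000, 10_000, 50_000]:
    coverage = freq[:known].sum()
    print(f"know the top {known:>6} words -> cover ~{coverage:.0%} of running text")
```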
Just as fluency emerges for a language learner, so too can new AI capabilities emerge. Quanta are learned in order of their frequency of use, and as a model scales it learns more quanta, leading to improved performance. This model helps explain why larger models not only perform better on familiar tasks but also develop new abilities, as they accumulate a more diverse set of skills or knowledge quanta.
Spooky.
Next week, in part 2, we will look at adaptability and flexibility of learned algorithms.