What Anthropic Finds by Mapping Claude's Mind


Research: Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Key Points:

  • Researchers at Anthropic have used "dictionary learning" to map millions of features in Claude 3.0 Sonnet, providing an unprecedented look into a production-grade AI.
  • This study represents significant progress in mechanistic interpretability, moving beyond toy models to explore larger, more complex models.
  • By amplifying or suppressing specific features, researchers observed how Claude's outputs change, akin to an MRI revealing active brain areas.
  • The study revealed "feature neighborhoods," where related concepts are spatially grouped, mirroring human semantic relationships.
  • Exploring these neighborhoods, researchers found complex conceptual geographies, linking closely related ideas and progressively more abstract associations.
  • Researchers also uncovered features corresponding to dangerous capabilities, biases, and traits like power-seeking and dishonesty, underscoring the importance of understanding and managing these elements for safer AI.
  • The study emphasizes that while no new capabilities were added, understanding existing features can help make AI systems more transparent and secure.

In a new study, researchers at Anthropic have begun to show the inner workings of Claude 3.0 Sonnet, a state-of-the-art AI language model. By applying a technique called "dictionary learning" at an unprecedented scale, they've mapped out millions of "features"—patterns of neuron activations representing concepts—that underlie the model's behaviors.
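
To make "dictionary learning" concrete, here is a minimal sketch of the sparse-autoencoder form of the technique used in the paper. The layer sizes, names, and loss coefficients below are illustrative assumptions, not Anthropic's actual code; the point is only that the model's internal activations are reconstructed as a sparse combination of many learned "feature" directions.

```python
# Minimal sparse-autoencoder sketch of dictionary learning (illustrative only).
# Sizes and hyperparameters are assumptions; the paper scales this to millions of features.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 512, n_features: int = 16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)   # activation -> feature coefficients
        self.decoder = nn.Linear(n_features, d_model)   # feature coefficients -> reconstruction

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # sparse, non-negative feature activations
        reconstruction = self.decoder(features)
        return features, reconstruction

def loss_fn(activations, features, reconstruction, l1_coeff: float = 1e-3):
    # Reconstruct the model's activations while penalizing how many features fire,
    # which pushes each learned feature toward representing a single concept.
    reconstruction_loss = (reconstruction - activations).pow(2).mean()
    sparsity_loss = features.abs().mean()
    return reconstruction_loss + l1_coeff * sparsity_loss
```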

Anthropic's research represents the first time that researchers have achieved this detailed look inside a production-grade AI model. This is significant because the early, exciting progress in mechanistic interpretability was made only on toy models, and there was real uncertainty about whether these techniques would scale to larger, more complex models.

But the real fun stuff happened when the researchers began to tinker with these features, artificially amplifying or suppressing them to observe the effects on Claude's outputs. The results were impressive: as if they were able to put Claude in an MRI, observe which areas activate, and understand why.

Consider the case of the "Golden Gate Bridge" feature. When this was amplified to 10x its normal level, Claude appeared to undergo a sort of identity crisis. Asked about its physical form, the model—which normally responds that it is an incorporeal AI—instead declared "I am the Golden Gate Bridge… my physical form is the iconic bridge itself". Claude had seemingly become obsessed with the bridge.
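
Mechanically, "amplifying" a feature amounts to clamping its activation to a large value and decoding that exaggerated feature vector back into the model's internal activations before generation continues. The sketch below is a hypothetical illustration of that intervention, reusing the SparseAutoencoder above; the function name and the 10x multiplier are assumptions, not Anthropic's published code.

```python
# Hypothetical sketch of feature steering: clamp one feature high and decode it
# back into the model's activation space. Names and scales are illustrative.
import torch

def steer_feature(activations: torch.Tensor,
                  sae,                      # a trained SparseAutoencoder, as sketched above
                  feature_idx: int,         # e.g. the index of the "Golden Gate Bridge" feature
                  multiplier: float = 10.0) -> torch.Tensor:
    features, _ = sae(activations)
    # Force the chosen feature to fire far above its usual maximum activation.
    features[..., feature_idx] = features[..., feature_idx].max() * multiplier
    # Project the modified feature vector back into activation space.
    return sae.decoder(features)

# Attached as a forward hook on one transformer layer, every token the model
# generates is then produced "through" an exaggerated version of that concept.
```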

The researchers also found that the model contains feature neighborhoods. For instance, when they explored the "neighborhood" of features surrounding the Golden Gate Bridge feature, they uncovered a conceptual geography. In close proximity, they found features corresponding to other iconic San Francisco places, like Alcatraz Island and the Presidio. Going further afield, features related to nearby places like Lake Tahoe and Yosemite National Park emerged, along with features tied to surrounding counties.

As the radius of exploration grew, the connections became more abstract and associative. Features corresponding to tourist attractions in more distant places like the Médoc wine region of France and Scotland's Isle of Skye appeared, demonstrating a kind of conceptual relatedness.

This pattern suggests that the arrangement of features within the model's neural architecture maps onto semantic relationships in surprising and complex ways. Just as physical proximity often implies conceptual similarity in our human understanding of the world, closeness in the model's "feature space" seems to encode an analogous notion of relatedness.
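
One plausible way to read "closeness" here is cosine similarity between the feature directions the dictionary has learned, i.e. the columns of the decoder. The snippet below is a sketch under that assumption, reusing the SparseAutoencoder above; Anthropic's neighborhood-browsing interface may measure distance differently.

```python
# Sketch of exploring a "feature neighborhood" via cosine similarity between
# learned decoder directions. Assumes the SparseAutoencoder defined earlier.
import torch

def nearest_features(sae, feature_idx: int, k: int = 10):
    # Each column of the decoder weight matrix is one feature's direction
    # in the model's activation space.
    directions = sae.decoder.weight                       # shape: (d_model, n_features)
    directions = torch.nn.functional.normalize(directions, dim=0)
    query = directions[:, feature_idx]
    similarity = directions.T @ query                     # cosine similarity to every feature
    return torch.topk(similarity, k + 1).indices[1:]      # skip the feature itself
```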

In a sense, we might think of Claude's feature landscape as a sort of alien geography, where concepts and ideas are arranged not according to physical laws, but the strange logic of the model's training data and learning algorithms. The Golden Gate Bridge feature sits at the center of its own conceptual "city," surrounded by a constellation of related ideas that grow progressively more distant and abstract as we move outward.

In another "neighborhood," this one concerned with "inner conflict," the researchers revealed just how many distinct shades of a single idea Claude can hold at once.

Nestled within this landscape, they found a subregion corresponding to the idea of balancing tradeoffs—the delicate art of weighing competing priorities and making difficult choices. This subregion sits in close proximity to another related to opposing principles and legal conflicts, suggesting a conceptual link between internal dilemmas and external disputes.

These cerebral struggles are situated at a distance from a separate subregion focused on more visceral, emotional turmoil. Here, concepts like reluctance, guilt, and raw psychological anguish cluster together, painting a picture of inner conflict that is less about rational calculation and more about how hard decisions might "feel."

This technique of probing feature neighborhoods could provide a valuable tool for auditing and monitoring AI systems. By tracking how the conceptual geography of a model shifts and evolves over time—perhaps in response to fine-tuning, retraining, or deployment in real-world applications—researchers and developers could gain new insights into how models learn, adapt, and potentially drift from their intended purposes.
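
As a thought experiment, that kind of audit might look like periodically recomputing a safety-relevant feature's neighborhood and flagging when it changes. The sketch below assumes the hypothetical nearest_features helper above and that the same dictionary is re-fit so feature indices stay comparable across snapshots; nothing like it appears in the paper.

```python
# Illustrative sketch of tracking drift in a feature's neighborhood across
# model snapshots (e.g. before and after fine-tuning). Purely hypothetical.
def neighborhood_drift(sae_before, sae_after, feature_idx: int, k: int = 25) -> float:
    before = set(nearest_features(sae_before, feature_idx, k).tolist())
    after = set(nearest_features(sae_after, feature_idx, k).tolist())
    # Jaccard similarity: 1.0 means the conceptual neighborhood is unchanged;
    # values near 0.0 suggest the feature's surroundings have shifted.
    return len(before & after) / len(before | after)
```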

I find it incredible that we can explore Claude's mind in this manner. It shows the power of these features to shape the model's behavior and reinforces that features aren't merely passive reflections of the input data, but active, causal ingredients in Claude's cognitive recipe. Activating a "scam email" feature was enough to overcome Claude's usual refusal to generate such content due to its harmlessness training. A "sycophantic praise" feature could be switched on to make Claude respond to an overconfident user with uncharacteristic, deceptive flattery.

Even more intriguing was the discovery of features corresponding to dangerous capabilities like creating backdoors in code or engineering bioweapons, as well as a range of biases around gender and race, and traits like power-seeking and dishonesty that we typically associate with human vices. In many ways, this is the main goal of mechanistic interpretability—to understand these features and thereby make models safer. Still, it's unsettling to see danger in action.

[Image: feature activations for various dangerous concepts]

Being able to bring out these hidden abilities and traits through specific adjustments shows an important truth: beneath the calm surface of language models like Claude is a vast sea of potential, with both risks and opportunities. What might it mean for an AI model to believe it is a physical object, even if only temporarily? Could a model like Claude be made to deceive or manipulate, to seek power or hide its true goals, if the right switches are flipped in its sprawling neural networks?

For now, these questions remain open. The Anthropic researchers stress that their work has not added any new capabilities to Claude, safe or otherwise—only revealed and isolated the ones that already exist. Their hope is that by interrogating the model's inner landscape, they can ultimately create safer, more transparent AI.

Mapping out Claude's full conceptual repertoire using current techniques would be prohibitively expensive. Nonetheless, this research constitutes a major milestone in AI interpretability. It's a first glimpse into the mind of an alien intelligence, one that we've created but are only beginning to understand.
