We Need to Change How We Measure AI

Suddenly, artificial intelligence shows signs of being smart like humans. AI has been advancing so swiftly that leading AI researchers and entrepreneurs predict it will soon surpass humans on all cognitive tasks, a milestone commonly referred to as Artificial General Intelligence, or AGI.

If you’re worried about being pushed aside by AI, there may be reason to doubt such predictions. The way progress is measured carries a built-in bias that may distort our picture of how close AI is to general intelligence. Measures based on human tests may both overstate AI’s abilities relative to humans and disguise the real nature of intelligence altogether.

The metrics for benchmarking machine performance largely hinge on tests designed for humans. This anthropomorphic approach raises two problems. First, tests for human intelligence are an incomplete and biased measure: they are culturally determined and inherently limited by the act of measurement itself. Second, tests designed for humans might not be the right tests for machines.

Higher Order Cognition: The Case of Chiplessness

Try this thought experiment, courtesy of neuroscientist Lisa Feldman Barrett. Imagine reaching into a bag of potato chips and discovering that the previous chip you ate was the last one. You feel disappointed that the bag is empty, relieved that you won’t be ingesting any more calories, slightly guilty that you ate the entire bag, and yet hungry for another chip.

She calls this feeling “chiplessness” and, in a handful of words, she invented a new concept from a mundane experience. She named this new emotion. It is now shareable, sparking the same feeling in others. This took creativity and imagination. Chiplessness involves subjective feelings and emotions that can’t be precisely quantified. It’s a complex mental state. We can apply it flexibly to different contexts and blend it with yet more emotions. It’s adaptable. Chiplessness has a fluidity that contrasts with the rigidity of defined cognitive tests and benchmarks.

The Problem of Individual Tests

A chart in a recent Time magazine article shows AI performance on various human tasks compared to human performance, based on commonly accepted industry benchmarks.

One conclusion you might draw is that AI understands language better than we do, has our common sense, and can crunch numbers and draft code nearly as well as we can. The individual measures (assuming each test is reasonable) tell a story of AI’s impressive progress.

But individual tests of cognitive capability don’t tell us much about general capabilities, even when they test higher cognitive functions, as the Remote Associates Test (RAT) does. This test measures the ability to make associations. For example, what’s the common word that unites these three words: night / wrist / stop? Or pet / bottom / garden? (See the end of the article for the answers.)

The RAT is a good measure of creativity, and it is another test where GPT-4 outperforms humans. Yet even this quite sophisticated and long-standing test for creativity doesn’t say anything about generally intelligent behavior.

While each of these tests may say something individually impressive, collectively these measures say nothing. It’s almost as if AI’s mirror dematerializes the very idea of intelligence, which is much more than the sum of its parts. Playing with concepts highlights human cognition. Abstract connections, metaphor, imagination, insights, emotions: these are our cognition at the highest levels. Language is a game played among us, not a test with a measurable, numerical score.

The foundations of insight are in the complex interplay of intentions, sensations, reactions, responses, and social reasoning. Empathy and emotions amplify our ability to transfer ideas from one space to another, from one person to another. It’s not just bits and bytes; it’s neurons and nerves, heartbeats and heartache.

Are We Constraining Machine Progress?

By trying to mimic humans, we might miss the opportunity to develop new cognitive capabilities based on what makes machines “superhuman.” AI is more alien than human. The scale of a large language model is hard to fathom, and we have no intuition for the knowledge contained in GPT-4. But that’s the point: anyone can take a language journey, set at their level of comprehension. Anyone can query the entirety of humanity’s digitized knowledge: retrieving, traversing, and combining the data cosmos. We fail to form the right mental model if we think of ChatGPT as just a big writing machine.

Paradoxically, striving for human benchmarks may constrain machine progress. We need the weirdness that AI can give us. We want an AI’s creativity to be wildly different, free of our biological motivations and constraints. One that can come up with ideas that are unrecognizable to us. This is precisely what we hope for in AI. What we see today in AI content generation, protein folding, and algorithmic trading is only the beginning.

As we measure how an AGI may conceptualize, contextualize, empathize, understand, reason, analyze, and plan, we should be more critical about using human benchmarks. Rather than constraining AI to human benchmarks, we need new tests tailored to AI's unique capabilities.

What we want from adversarial testing is for it to reveal the cognitive contrast between humans and machines. So let’s put aside constraining comparisons to human cognition and instead focus on how to reveal the alien nature of machine intelligence in all its promising weirdness.

The answers to the RAT:

Easy - night / wrist / stop: watch

Very hard - pet / bottom / garden: rock


Helen Edwards