The ARC Prize and What It Means for AGI

Explore the debate on achieving AGI: scaling laws vs. new approaches. Learn about the ARC Prize, a $1M competition that challenges the current consensus and proposes a benchmark focused on skill acquisition. Discover why benchmarks matter in shaping AI's future and driving industry perceptions.


We've had a lot of responses to my post "AGI is a Red Herring." To be honest, if you'd told me ten years ago that I'd be spending so much time talking and writing about AGI, I would have been incredulous. But here we are: OpenAI, with its bizarre incentives to proclaim early achievement of AGI, has skewed the discourse around advanced machine intelligence. That discourse is cultish and incomplete.

At the risk of oversimplifying, there are two competing theories about AGI today:

  • Either scaling laws will deliver AGI: with more data and bigger models, AGI will emerge from a future transformer-based (i.e., large multimodal) model.
  • Or, we need something else.

In the "we need something else" camp, Mike Knoop and François Chollet have launched the ARC Prize: a $1,000,000 public competition to beat, and open source a solution to, the ARC-AGI benchmark.

Knoop and Chollet argue that progress toward AGI has stalled. Large language models are trained on unimaginably vast amounts of data, yet they remain unable to adapt to simple problems they haven't been trained on or to make even basic novel inventions. Going further, they point out that current market incentives have pushed frontier AI research behind closed doors and that, as a result, research attention and resources are being diverted toward a dead end.

Like us, Knoop and Chollet are clearly frustrated that the current consensus defines AGI as a system that can automate the majority of economically valuable work (per OpenAI). In their view, the correct definition of AGI is a system that can efficiently acquire new skills and solve open-ended problems. Definitions are crucial because we turn them into benchmarks to measure progress toward AGI.

This is an important moment: we want systems capable of inventing and discovering alongside humans. If AI is unable to learn new skills on its own, it isn't AGI. Now we have a high-profile, competing idea that differs in two key ways.

First, the competition is open source. Chollet, in particular, is a strong advocate of openness, primarily because he sees it as the best way to advance research. Second, they propose a new benchmark, a crucial move because in AGI research the choice of measure is paramount. Goodhart's law is at play here: when a measure becomes a target, it ceases to be a good measure. We all know the consequences: research may shift toward gaming the benchmark rather than achieving genuine progress, potentially leading to overhyped claims, misallocated resources, and a distorted understanding of AGI capabilities. Sound familiar?

The ARC benchmark is different because it is based on acquiring new skills, not memorizing what is already in the training data. Per the ARC website:

Most AI benchmarks measure skill. But skill is not intelligence. General intelligence is the ability to efficiently acquire new skills. Chollet's unbeaten 2019 Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) is the only formal benchmark of AGI.

It's easy for humans, but hard for AI.

The ARC benchmarks are like little reasoning puzzles. You can play with them yourself on the ARC Prize website, but here's what one looks like.

The left panel (example) demonstrates the rule: add a darker blue square to complete the square. The right panel (test) shows the puzzle to solve (input) and my answer (output), which is correct.
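To make the format concrete, here is a minimal Python sketch of how an ARC-style task can be represented and scored. It assumes the JSON layout used by the public ARC-AGI task files (a "train" list of demonstration input/output pairs and a "test" list, with each grid stored as rows of integer color codes 0-9) and exact-match scoring, where a prediction earns credit only if every cell is right. The toy task (mirror each row), the solver, and the names candidate_solver, exact_match, and score are hypothetical illustrations, not the puzzle described above and not the official ARC Prize evaluation harness. For local scoring, this toy file includes the expected test output.

```python
import json

# A hypothetical ARC-style task, laid out like the public ARC-AGI JSON files:
# "train" holds demonstration pairs, "test" holds the pair(s) to solve.
# Grids are lists of rows; each cell is an integer color code 0-9.
TASK = json.loads("""
{
  "train": [
    {"input": [[1, 0, 0], [2, 2, 0]], "output": [[0, 0, 1], [0, 2, 2]]},
    {"input": [[0, 3], [3, 0], [0, 0]], "output": [[3, 0], [0, 3], [0, 0]]}
  ],
  "test": [
    {"input": [[5, 0, 0, 0], [0, 5, 5, 0]], "output": [[0, 0, 0, 5], [0, 5, 5, 0]]}
  ]
}
""")

Grid = list[list[int]]


def candidate_solver(grid: Grid) -> Grid:
    """Hypothetical solver: applies the rule inferred from the demonstrations (mirror each row)."""
    return [list(reversed(row)) for row in grid]


def exact_match(predicted: Grid, expected: Grid) -> bool:
    """A prediction counts only if the whole grid matches; there is no partial credit."""
    return predicted == expected


def score(task: dict, solver) -> float:
    """Fraction of the task's test pairs that the solver gets exactly right."""
    tests = task["test"]
    solved = sum(exact_match(solver(pair["input"]), pair["output"]) for pair in tests)
    return solved / len(tests)


# Check the inferred rule against every demonstration before trusting it on the test input.
consistent = all(exact_match(candidate_solver(p["input"]), p["output"]) for p in TASK["train"])
print(consistent, score(TASK, candidate_solver))  # True 1.0
```

The point the sketch illustrates is the one Chollet makes: the solver sees only two demonstrations, so it cannot rely on having memorized this particular task. It has to infer the rule from a handful of examples and then apply it to an input it has never seen.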

Benchmarks matter. They not only drive the industry itself, they drive everyday perceptions of AI, which in turn drive people's level of excitement and fear about the technology. We're really excited about the technology but not excited at all about the power structures, dominant narratives, and cultural deployment of it. Benchmarks that show there is more than one story remind us that many human choices are yet to be made and that how AI is adopted remains a human opportunity.

Figure: ARC vs. other benchmarks

Next time you're wondering what GPT-5/6/7 might bring, remember that we are still in the early days. AGI, according to OpenAI, might be right around the corner, but the company has particular incentives to claim it, and there's far more to the story than current benchmarks and current AI.
