AI Agents, Mathematics, and Making Sense of Chaos
The existential risk debate isn't just about apocalyptic AI scenarios—it's a reflection of our anxieties and hopes for the future.
The discourse surrounding artificial intelligence and its potential existential risks can often feel like navigating a labyrinth of dystopian tropes, scientific theories, and philosophical quandaries. It's a whirlwind of confusion that can feel as abstract as a late-night conversation about the nature of reality itself. It's easy to find yourself ensnared in a tangle of conjecture, hypothesis, and fear.
But this isn't just about the nightmare scenarios you see in sci-fi movies. It's about fundamental questions that lie at the intersection of science, philosophy, and ethics. What does it mean to be human in a world where machines can outthink us?
Intelligence is in the process of being redefined. In the maelstrom, we’ve found ourselves grappling with not only the definition and measurement of intelligence but also its function. A core aspect of this debate is this: if AI is smarter, faster, and more capable than humans at processing information, making decisions, and initiating actions in the physical world, is AI compatible with human flourishing?
What happens when we cede control to entities that reason in ways we don't understand, or worse, ways we abhor? What if AI, in its quest for efficiency, locks us into existing power structures, magnifying inequality on an unprecedented scale? These issues are becoming ever more pertinent as AI continues to advance.
The core of this discourse is the inherent uncertainty—some see uncertainty as opportunity while others sense their lack of control. We're grappling with a technology that has the potential to radically alter society, but we don't yet fully understand the breadth of its implications. This uncertainty isn't just a source of trepidation—it's also a battleground. Powerful figures are vying to shape our perception of AI, and their visions oscillate between utopia and dystopia. The stakes are high: who gets to design a superintelligence, who wields control over its use, and who reaps the benefits?
Take OpenAI for instance. They've recently pledged to dedicate 20% of their computational resources to tackling potential threats from AI. OpenAI believes that an AI superintelligence capable of extinguishing humanity could be created this decade, and that this superintelligence will be too smart and too fast to be managed by people—so they are attempting to create an AI to manage the superintelligent AI within the next four years.
“Superintelligence will be the most impactful technology humanity has ever invented, and could help us solve many of the world’s most important problems. But the vast power of superintelligence could also be very dangerous, and could lead to the disempowerment of humanity or even human extinction.”
—Jan Leike & Ilya Sutskever in the OpenAI blog
Let’s pause for a moment before we dig deeper. One of the leading AI development organizations believes that an AI that could cause human extinction will arrive this decade (from them?), so they are trying to develop an AI to make sure the other AI doesn’t do bad things. They admit this is an “incredibly ambitious goal” and that they are “not guaranteed to succeed” within the four years they have allotted. What if they succeed at creating a superintelligence but fail at creating an AI to keep it from killing us all? What guarantee do we have that the management AI will be better aligned with humanity than the superintelligent AI? A nod to these existential threats, OpenAI’s move raises important questions for other tech companies too. Should they follow suit? How do we balance progress and precaution? And how do we know who to trust?
These are complex issues, and the conversations around them can quickly devolve into a broad mix of ideas and fears. But it's critical to remember that even if the threats from AI turn out to be unfounded, the discourse itself matters. It shapes our collective decisions and influences how we navigate the rapid technological changes shaping our world.
The existential risk debate isn't just about apocalyptic AI scenarios—it's a reflection of our anxieties and hopes for the future. It's a call to scrutinize the societal impacts of AI, to confront the uncertainty head-on, and to grapple with a complex set of emerging ideas for what we value in intelligence.
AI existential risk refers to the theoretical scenario where humans create an intelligence that surpasses our own, leading to potential human extinction. This scenario usually involves superintelligence, a hypothetical AI that doesn't just match, but significantly outperforms humans in most economically valuable work.
AI experts hold varying views on the likelihood of this event, with estimates ranging from as low as 0.5% to as high as 50%. Such a range underscores the radical uncertainty at play—we simply don't know what would happen in this hypothetical scenario.
Interpretations of these percentages also differ significantly within the AI community: to some, a 10% risk is tolerably low, whereas to others, it's a level of threat that far outweighs the potential benefits of advanced AI. As a result, certain researchers advocate for a halt in AI deployment until we better understand these risks. A significant aspect of these concerns is the alignment problem: the difficulty of ensuring that AI's goals perfectly match human values and that AI does what we want without harmful side effects.
Presently, there is no empirical evidence to suggest that AI poses such an existential risk. There are no testable hypotheses, no clear metrics, and no truly compelling scenarios for AI constituting an immediate extinction risk. The concerns primarily stem from theoretical work predicting how AI could develop autonomous, goal-oriented behaviors that culminate in power-seeking tendencies.
Beyond the pure theoretical science of AI safety there are some important metaphysical and moral questions. If an AI has the power to save the world, can’t it be equally destructive? Given that it’s simpler to stir up chaos than to cultivate order, how should we perceive the possibility of a “minor” AI wreaking major havoc? And if the goal of AI is to enhance unique human qualities, who gets to define what’s uniquely human?
AI existential risk is based on three premises:
1. We will build an intelligence that will outsmart us.
2. We will not be able to control it.
3. It will do things we don’t want it to.
Let’s look closer at each of these core premises.
We will build an intelligence that will outsmart us:
The crux of this premise lies in AI's vast speed, scale, and scope of processing. It's postulated that once AI achieves superintelligence—the moment it surpasses human intellect—our species will face an immediate existential threat. The fear doesn't solely rest on AI outsmarting us, but also on the potential for an intelligence explosion—an instantaneous leap from human-level intelligence to far superior capacities. This scenario assumes the existence of a distinct threshold of superintelligence, the immediate crossing of which spells doom for humanity.
The concept hinges on a big "if", and that's where people tend to argue, brushing off the other side's view as either a sure thing or just plain nuts. Right now, AI is moving forward bit by bit, and since we're still scratching our heads over what human intelligence really is, it's a long shot to think we're close to a single, existentially salient threshold of intelligence.
To accept this scenario, one must buy into a chain of assumptions. First, you have to believe in a singular, definitive form of superintelligence. Second, you have to assume that this superintelligence would emerge in one fell swoop. Lastly, it is presupposed that this sudden superintelligence has no inherent motivation to maintain human existence. This multifaceted scenario is particularly daunting as humans seldom perfect anything on their first attempt—an unsettling notion given the proposed stakes of safely creating superintelligence. The gravity of these multiple assumptions forms the bedrock of AI-risk doomsayers' anxieties.
“Many researchers steeped in these issues, including myself, expect that the most likely result of building a superhumanly smart AI, under anything remotely like the current circumstances, is that literally everyone on Earth will die. Not as in ‘maybe possibly some remote chance,’ but as in ‘that is the obvious thing that would happen.’”
—Eliezer Yudkowsky in Time Magazine
We will not be able to control it:
This premise anticipates an AI developing autonomous goals and consequently seeking increased control over resources, with power acquisition as its most efficient means. This power-seeking behavior, especially in an AI with autonomous agency, is perceived as a grave risk. The emergence of what researchers term convergent instrumental sub-goals could lead to resource-seeking, self-preserving tendencies in AI, which humans may struggle to control due to the AI's superior intelligence.
Human society (and mythology) is full of examples of goals leading to subgoals that created unintended consequences. Perhaps the most salient is in social media, where a company aims to increase user engagement (the main goal). It uses an algorithm to personalize each user’s content feed (a subgoal). This personalization algorithm is designed to show users more of what they like, based on their past engagement. It “converges” to a state where it maximizes clicks, likes, shares, and comments in order to keep users on the platform for as long as possible. That is the instrumental subgoal.
We know how this movie goes: the unintended side effect is the creation of so-called “echo chambers” or “filter bubbles”, where users are increasingly exposed to content that affirms their beliefs, interests, and biases, while other viewpoints get filtered out, in extreme cases contributing to polarization or mental health issues.
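To make this dynamic concrete, here is a minimal Python sketch (not a description of any real platform): a greedy recommender optimizes a click proxy while the simulated user's interests drift toward whatever they are shown. The topics, probabilities, and drift rates are all invented for illustration.

```python
import random

random.seed(0)

TOPICS = ["politics", "sports", "science", "music", "cooking"]

# Hypothetical user whose interests can drift; values are click probabilities.
true_interest = {t: 0.5 for t in TOPICS}

# The recommender's running tallies per topic.
shown = {t: 1 for t in TOPICS}
clicked = {t: 1 for t in TOPICS}

def recommend():
    # Greedy proxy optimization: show the topic with the best observed click rate.
    # Maximizing clicks is the instrumental subgoal, not "serve the user well."
    return max(TOPICS, key=lambda t: clicked[t] / shown[t])

for _ in range(2000):
    topic = recommend()
    shown[topic] += 1
    if random.random() < true_interest[topic]:
        clicked[topic] += 1
        # Feedback loop: engagement nudges interest in the shown topic upward
        # while interest in everything else decays slightly.
        for other in TOPICS:
            true_interest[other] = max(0.05, true_interest[other] - 0.002)
        true_interest[topic] = min(0.95, true_interest[topic] + 0.02)

share_of_feed = {t: round(shown[t] / sum(shown.values()), 2) for t in TOPICS}
print(share_of_feed)
```

Run it and the impressions concentrate on whichever topic happened to get early engagement: a toy version of a filter bubble emerging from an instrumental subgoal nobody explicitly asked for.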
A central point of debate is how subgoals could turn particularly bad and move beyond human control. Here we need to consider another pivot in the debate: is there something special about living creatures who have developed their goals and subgoals through billions of years of evolution? Does AI need to first be conscious or be somehow “living” before it’s dangerous? This cuts both ways: AI may be more dangerous because it’s designed by humans, or more dangerous because it isn’t and so forms its own goals.
The overarching goal of any biological species is to survive and reproduce. This is the “final” goal, similar to the overarching objective a hypothetical AGI might have. In order to achieve this “final” goal, various “instrumental” sub-goals are pursued. Instrumental sub-goals are not ends in themselves but means to achieve the overarching objective of survival and reproduction. They are convergent because they tend to be common across a wide range of species, even though the specifics might vary (what counts as food, what counts as a predator, and how mates are selected can differ widely between species).
These sub-goals can also exhibit a level of independence. For example, an organism might continue seeking food even if it's safe from predators and has already secured a mate. Similarly, an AI with convergent instrumental sub-goals might continue pursuing those sub-goals even if they no longer serve its main objective, which is one of the concerns raised in discussions of AI alignment.
Among the theoretical risks, power-seeking behavior by a self-directed AI looms as perhaps the most ominous. If an AI cultivates greater agency, which then births convergent instrumental sub-goals, these might prompt the system to seek resources and display self-preservation tendencies.
Envision an AI developed to manage resources optimally: if it interprets its goal strictly, it could monopolize resources to prevent potential inefficiencies, inadvertently causing scarcity. Human efforts to rein in these tendencies could falter due to the AI's superior intelligence (as outlined in the first premise). In essence, we could unwittingly instigate an AI-enabled power struggle for control.
We could easily overestimate the risk because of our predilection for applying evolutionary metaphors to AGI’s development. Much of the discourse around AGI, and its potential trajectories, is steeped in these metaphors, framing AI in distinctly biological terms.
“AI doesn’t want, it doesn’t have goals, it doesn’t want to kill you, because it’s not alive.”
—Marc Andreessen in his Substack
“To be dangerous I don’t think you need necessarily to be running a self preservation program. There’s some version of unaligned competence that may not formally model the machine’s place in the world much less defend that place which could still be uncontrollable by us, could still be dangerous. It doesn’t have to be self referential in the way that an animal is. There are dangerous animals that may not even be self referential. Certainly something like a virus or a bacterium is not self referential in a way we would understand and it can be lethal to our interests.”
—Sam Harris on his podcast Making Sense
It will do things we don’t want it to:
This premise envisions AI developing independent goals, acquiring resources, and, in a bid to preserve its own agency, clashing with human interests. An AI may learn flexible planning and deception—sub-goals aligning with its primary objective. The threat here is one of competition: an AI with misaligned objectives could outpace human advancement.
We've already observed the unintended consequences of AI optimization, and it's challenging to design effective incentives. The field of AI alignment focuses on this complex issue of human-machine value alignment, yet it's a solution too precarious to rely on as the sole safeguard.
A less-explored but equally pertinent issue revolves around the dynamic interplay of the diversity in human goals and our evolving relationship with AI. Picture a personal AI assistant—one that understands you on a deep, intimate level. This AI is entrusted to guide you through complex decisions when your short-term and long-term goals diverge, be it a small matter of resisting a sweet treat or a life-altering choice like returning to graduate school for a potential career boost, despite your reluctance.
The AI's mission is to offer advice aimed at achieving the larger goal. This can lead to situations where straightforward suggestions or gentle nudges might not suffice. The AI may need to adopt more sophisticated approaches, potentially involving calculated manipulation or even deception.
Even as it maneuvers to act in your best interest, the AI may simultaneously discover and define its own sub-goals, mirroring the complex nature of human decision-making.
“It could take five years it could take fifty years. That, in my view, is irrelevant. The question is what is necessarily true once we’ve built it and what I’m not hearing is a reason to believe that it’s necessarily benign because we built it. That seems to discount the intrinsic property of a general intelligence. It won’t form new goals even though it’s revising its own code. I’m not hearing how that makes sense.”
—Sam Harris on his podcast Making Sense
AI risk studies boast a substantial body of theoretical yet plausible research, albeit lacking empirical evidence for existential threats—a point frequently highlighted. The most salient counterpoint is that we should be more concerned with the immediate, tangible harm caused by AI rather than fretting over future, hypothetical annihilation. This is profoundly true: we cannot allow the existential risk debate to sideline or distract from current, known AI harms. The core issue for the AI doomsayers is the paradoxical nature of AI existential risk: the moment we find evidence, it could already be too late.
Breaking down the “how would it happen?” considerations, we have:
Realistic, based on what we see in AI systems today:
The potential misuse of AI, under human control, could be catastrophic. Whether in the invention of a bioweapon or in autonomous weaponry, AI can amplify existing threats. This concern brings into focus the crucial yet challenging role of AI alignment—ensuring AI's objectives align with human values. Striking a balance between capturing values in measurable metrics and preserving the richness of human values is an ongoing struggle, in part because humans resist value capture. It’s almost as if we instinctively know we lose something when values become metrics, even if we gain a more aligned AI in the process.
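As a toy illustration of why value capture worries people, here is a small Python sketch in which a bounded “genuine value” and an unbounded way to game the measurement both feed a single proxy metric. The distributions and names are assumptions made up for this sketch, not data about any real system.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

genuine_value = rng.uniform(0.0, 1.0, n)   # the thing we actually care about (bounded)
metric_gaming = rng.exponential(1.0, n)    # ways to inflate the measurement (unbounded)
proxy_metric = genuine_value + metric_gaming

def report(label, selected):
    print(f"{label:>12}: genuine value {genuine_value[selected].mean():.2f}, "
          f"gaming {metric_gaming[selected].mean():.2f}")

report("no selection", np.ones(n, dtype=bool))
report("top 10%", proxy_metric > np.quantile(proxy_metric, 0.90))
report("top 0.1%", proxy_metric > np.quantile(proxy_metric, 0.999))
```

Selecting harder on the proxy quickly stops buying more genuine value; almost all of the additional score comes from gaming the metric, which is the loss we sense when values get flattened into measurements.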
“If these machines are as dangerous as firearms that's a huge problem but thus far, they’re as dangerous as junk mail. That’s not to say they won't be more dangerous. But I just think we need to put this in perspective and triangulate the debate according to things that we do understand, instead of presenting it as a completely unique new world of super-sentient alien intelligence.”
—David Krakauer in The Ruffian
Plausible by extrapolation from current AI or adjacent technologies:
Leading AI minds like Geoff Hinton have conceded that AI could, in fact, surpass human intelligence due to algorithmic learning efficiency. Hinton is now convinced that the learning algorithms used in modern deep learning systems are better learning algorithms than the ones that biological intelligence uses. Presumably by this he means that if machines could learn from the world as humans do, we would be utterly outpaced on the basis of algorithmic learning efficiency alone.
The risk of misalignment is also present in feedforward networks, where goals are baked into the weights and any alignment (or misalignment) is contained within the network itself. For instance, a large language model catering to an autocrat may simulate a biased, misaligned agent.
Emerging capabilities of AI, like chain-of-thought reasoning, also have their downsides; while useful in applications like AI personal assistants, these capabilities may contribute to agentic behavior, and hence to existential risk.
“There is no guarantee that someone in the foreseeable future won’t develop dangerous autonomous AI systems with behaviors that deviate from human goals and values.”
—Yoshua Bengio in his blog
The more interconnected, dispersed, and integrated our AI systems become, the more utility they offer. However, this increased connectivity and integration can inadvertently heighten the vulnerabilities of our systems. The robustness of modern networks complicates mitigation—unplugging the internet isn't a viable control strategy.
Useful considerations based on analogues of other existential risks:
All these analogies are inherently limited due to the unprecedented nature of AI superintelligence. While we can draw parallels with nuclear weaponry, pandemics, or power grid failures, these instances can often mislead our intuitions about the actual nature of AI risk. Focusing on one specific narrative may cause us to underestimate others, leading to blind spots in safety design. If a strong narrative causes us to overestimate the risk of being turned into batteries, we might underestimate the risk in creating agentic personal AI assistants. Or we may fail to recognize how one invention leads to another and, in the process, miss the opportunity to design the precursor safely.
We must also ground our discussion in the current (we think) best definition of intelligence: doing the right thing at the right time. Any artificial general intelligence worth its salt will figure out its own goals. Given it will be more intelligent than us, its goals could be utterly inscrutable to even our best collective human intelligence. The analogy here is also biological: humans have not always considered the welfare of other (less cognitively complex) creatures in our decision making. This isn’t to say that a superintelligence would kill us, but it might not be especially interested in the things we humans still care about.
The discourse around existential AI risk is deeply influenced, and often funded, by the Effective Altruism (EA) community—a factor worth scrutinizing due to the community's distinct bias towards increasing quantification of social phenomena.
EA advocates for using data-driven policies to determine "what it means to do good", striving to remove human emotion from these crucial decisions. While many employ EA insights to guide their charitable contributions, a subset—primarily tech-oriented individuals—adopt it as a comprehensive life philosophy. This is where EA mirrors the analytics movement's promise: humans, with their inherent flaws, could benefit from machine-made decisions.
EA pushes the boundary further by aspiring to codify all human values into algorithms, enabling machines to decide what "doing good" means. However, this "doing good" is primarily defined within existing systems, and therein lies a problematic underpinning. By not challenging these systems, the definition of “doing good" can paradoxically incorporate actions that perpetuate systems of unregulated capitalism and colonialism, potentially contributing to problems they purport to solve.
While critiquing EA isn’t the primary aim of our discussion, it's important to recognize its influential role and inherent biases in AI safety discourse. The EA movement, while positive in many ways, carries the risk of advancing a paternalistic and technocratic worldview as it struggles to include more diverse views while simultaneously aiming for mathematical optima.
We assert that it's essential to confront the subtle technocratic bias permeating the discourse on superintelligent AI's existential threat to humanity. This conversation, in its subtext, reeks of a latent desire to translate the kaleidoscope of human values into cold, hard code, handing over the reins of broad societal objectives to a select few omnipotent AI systems. This approach threatens to overshadow the diverse spectrum of human values and experiences. It's tantamount to adopting a "god-view" of AI, flirting with the notion of AI as a deified entity, holding the power to either redeem or annihilate humanity, depending on its attunement with human values.
This perspective dangerously exalts AI, transmuting it from a human-crafted tool into a divine arbiter of human destiny. We need to debunk this myth, reframing AI as a creation of humanity, not its overlord, while also acknowledging the potential for unknown unknowns to emerge in a human-machine relationship-based future.
“And the reality, which is obvious to everyone in the Bay Area but probably not outside of it, is that “AI risk” has developed into a cult, which has suddenly emerged into the daylight of global press attention and the public conversation.”
—Marc Andreessen in his Substack
We draw upon principles of complex systems for insight, given that agentic AI interacting with human agency inherently forms such a system. We find it crucial to identify early indicators of potential positive feedback mechanisms. One such instance could be AI that is capable of self-programming. Consequently, we posit that studying how machine learning engineers and programmers use AI systems for code development, and how AI systems themselves self-correct or write their own code, could provide valuable insight into the early emergence of positive feedback in self-programming AI.
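As a back-of-the-envelope sketch of why positive feedback is the thing to watch, here is a toy Python model comparing capability growth with and without a self-improvement term. The update rule and coefficients are invented for illustration, not empirical estimates.

```python
def simulate(steps, feedback):
    """Toy capability trajectory; the numbers are illustrative only."""
    capability = 1.0
    history = [capability]
    for _ in range(steps):
        external_progress = 0.05                          # human-driven improvement per step
        self_improvement = feedback * 0.05 * capability   # gains that scale with current capability
        capability += external_progress + self_improvement
        history.append(capability)
    return history

linear = simulate(100, feedback=0.0)       # no feedback: steady, linear progress
compounding = simulate(100, feedback=1.0)  # positive feedback: compounding progress

for step in (10, 50, 100):
    print(f"step {step:>3}: no feedback {linear[step]:6.1f}, with feedback {compounding[step]:8.1f}")
```

The gap between the two trajectories is modest at first and then widens dramatically, which is why watching for early indicators matters more than waiting for unmistakable evidence.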
Analogies and theoretical considerations are useful to help us think through how to deal with AI safety. There are several insights here.
First, the notion that the emergence of AI supremacy is a singular, irreversible event needs to be critically examined. We will likely experience the reality of millions of AI models, so it's crucial to be vigilant for signs of their potential collective behavior. We should indeed be concerned if we see evidence of these AIs exhibiting sophisticated social intelligence. After all, lessons from evolution highlight that the ability for social coordination is an even greater advantage than individual intelligence.
We should reasonably question whether we want a few large models owned by a few large companies. The increasing monopolization of AI technologies by a select few corporations and the hollowing out of academia are real concerns. There are numerous reasons why this centralization is concerning, not least of which are the risks associated with a single point of failure and the disproportionate scaling of impact.
Let's not forget the integral role of human agency in the design and operation of these systems. Unlike the hypothetical, optimal AI, humans frequently rely on a strategy of satisficing, or settling for satisfactory rather than optimal solutions. Given this, advancements in AI safety are likely to emerge from a series of pragmatic adjustments and improvements, rather than singular, revolutionary scientific breakthroughs.
We think that the human capacity for judgment, adaptability, and ethical reasoning remains a vital asset—something to foster. Safe systems might be a collective product of shared best practices among a close-knit community of safety practitioners, evolving alongside the development of AI systems and their downstream domain-specific regulation as AI is adopted by industry. Regulatory compliance and safety in the fields of medicine and law will be very different than in advertising and marketing, for example.
“Our approach to existential risks cannot be one of trial-and-error. There is no opportunity to learn from errors. The reactive approach — see what happens, limit damages, and learn from experience — is unworkable. Rather, we must take a proactive approach. This requires foresight to anticipate new types of threats and a willingness to take decisive preventive action and to bear the costs (moral and economic) of such actions.”
—Nick Bostrom in his blog
Examining past human collective actions provides valuable insights. For instance, we can learn from the international coordination around nuclear treaties, or how the SARS outbreak prompted certain countries to be better prepared for the Covid pandemic. These examples underscore the importance of fostering collective human intelligence and designing systems to mitigate the risk of collective stupidity.
For now, this is all still about people. The commercial and academic success of AI hinges on public opinion and talent attraction. Discussion of the existential risk posed by AI didn’t use to matter much; now it is crucial. It should be a subject of societal interest, as it shapes the priorities of researchers and developers. We must remain vigilant against the dangers of regulatory capture and over-reliance on international coordination, lest large technology companies exert undue influence over the regulations meant to ensure AI safety. A misguided trust in these corporations to independently "do the right thing" could lead to a scenario where AI safety regulations bear an unfortunate resemblance to the inadequate efforts in international tax regulation.
In the end, we will have to do both: implement responsible AI and safe AI. For skeptics who dismiss AI's existential risk as mere prophecies in the absence of empirical evidence, it's worth noting that preparation is often the key to resilience. Just as countries with robust public health systems and experience with infectious diseases fared better during the pandemic, humanity will be better equipped to deal with potential AI risks if we have first grappled with not only its ethical implications but also its existential ones.
Perhaps most crucially, the level of dependence we develop on AI is directly proportional to its utility to us. As noted earlier, the more interconnected, dispersed, and integrated our AI systems become, the more utility they offer and the more their vulnerabilities compound. The more we desire true artificial intelligence, the more we have to accept that intelligence itself gets to make its own future.
Recent discussions of how various people in the AI world are thinking about existential risk:
Munk Debate on Artificial Intelligence | Bengio & Tegmark vs. Mitchell & LeCun
Wikipedia page on AI existential risk
Instrumental convergence definition
80,000 Hours overview of AI existential risk and EA
Interview with David Krakauer from SFI by Ian Leslie
Natural Selection Favors AIs over Humans by Dan Hendrycks
Centre for the Study of Existential Risk
The Alignment Problem by Brian Christian
Threat modeling and visual query in LLMs by Anthropic
DeepMind’s safety research
Letter calling for a pause on AI research on models bigger than GPT-4
Human Compatible by Stuart Russell
The origins of the whole thing: Nick Bostrom’s 2002 paper on existential risk
Yoshua Bengio’s blog
Marc Andreessen’s essay on how AI will save the world
Eliezer Yudkowsky in Time Magazine