AI Agents, Mathematics, and Making Sense of Chaos
In 2016, AI experts predicted that radiologists would be obsolete within years as machines outperformed them. That did not happen.
In 2016, Geoffrey Hinton, a pioneer in deep learning, claimed, "We should stop training radiologists now, it's just completely obvious within five years deep learning is going to do better than radiologists". Similarly, in 2017, Vinod Khosla, a prominent venture capitalist, asserted that "the role of the radiologist will be obsolete in five years".
Radiology, it seemed, was destined for obsolescence, with artificial intelligence taking over the reins by 2020. According to Oxford economists, machines would replace doctors because many tasks within professional work were deemed routine and process-based, requiring little judgment, creativity, or empathy.
Yet, these predictions from technology visionaries failed to materialize. What led to their glaring misjudgment? And, more broadly, what can we learn about AI-driven human obsolescence?
A seminal radiology study in 2013 played a significant role in shaping the discourse on automation in medicine. In this study, twenty-four radiologists participated in a familiar lung nodule detection task. Researchers surreptitiously inserted the image of a gorilla, forty-eight times larger than the average nodule, into one of the cases. The findings were astonishing: 83 percent of radiologists failed to see the gorilla, even though eye-tracking data showed that most were looking directly at it. In the images below, you should be able to spot a gorilla (hint: look in the top right portion of the lung image).
Inattentional blindness is a phenomenon that can affect even the most skilled experts in their domains. It reminds us that humans, no matter their level of expertise, are fallible. When we are engrossed in a demanding task, our attention behaves like a set of blinkers that prevent us from seeing the obvious.
In mathematical terms, this can be described as a failure of sensitivity. The radiologists in the study failed to detect a conspicuous anomaly in the image, a false negative in this case. AI can compensate for this bias by working alongside radiologists to screen for any potential abnormality, such as an unexpected gorilla.
However, sensitivity alone does not encompass the entire diagnostic process. Radiologists must also accurately identify negative results, a measure known as specificity, to avoid raising false alarms. Humans excel at determining whether a suspicious finding flagged by AI is truly a cause for concern.
Generally speaking, machines demonstrate superior sensitivity (identifying deviations from the norm), while humans exhibit greater specificity (assessing the significance of those deviations). Sensitivity and specificity are interdependent: adjusting one invariably affects the other. Designing a machine with both perfect sensitivity and perfect specificity is not achievable in practice, because a trade-off between the two is unavoidable at any operating point. This is why the partnership between AI and radiologists proves superior to either working independently: the collaboration strikes a better balance between expert machine and expert human.
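To make the trade-off concrete, here is a minimal sketch in Python using synthetic scores and labels (not real clinical data): as the decision threshold moves, sensitivity and specificity pull in opposite directions.

```python
# A minimal sketch (synthetic scores, not real clinical data) of how a single
# decision threshold trades sensitivity against specificity.

def sensitivity_specificity(scores, labels, threshold):
    """Classify a case as positive when its score meets the threshold, then
    compute sensitivity (TP / (TP + FN)) and specificity (TN / (TN + FP))."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical model scores: higher means "looks more abnormal".
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.70, 0.85, 0.95]
labels = [0,    0,    0,    0,    1,    0,    1,    1,    1,    1]  # 1 = true abnormality

for threshold in (0.3, 0.5, 0.7):
    sens, spec = sensitivity_specificity(scores, labels, threshold)
    print(f"threshold={threshold:.1f}  sensitivity={sens:.2f}  specificity={spec:.2f}")
```

Lowering the threshold catches more true abnormalities (higher sensitivity) but flags more healthy cases (lower specificity); that balance is exactly what the human-AI partnership has to manage.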
This insight reveals a key reason why Hinton and Khosla's predictions missed the mark—ironically, they overlooked the statistical nature of diagnosis and the necessity for humans to address machine errors.
Diagnosis is not a simple binary process of yes or no. Imperfections in tests and the ever-present possibility of errors necessitate accounting for false results. Designing AI with both low false positive and low false negative rates proves to be a challenging endeavor. Instead, a more effective approach involves creating machines that compensate for errors humans are prone to, while capitalizing on the innate strengths of human expertise.
Yet, there's another dimension to this narrative. As researchers and practitioners observe the collaboration between human and machine, they're witnessing a shift in the perception of diagnostic accuracy. Prior to AI's integration into the workforce, image-based diagnosis was primarily concerned with detection, posing the question: "did we find something that looks wrong?" The prediction and the judgment were connected in a single mind.
With AI now identifying a greater number of lesions or areas warranting further examination, radiologists are devoting more time to determining the significance of these findings. The central question has evolved into: "is this anomaly associated with a negative outcome?"
This example offers a second clue as to why technology experts misjudged the situation—AI has effectively bifurcated the diagnostic decision-making process. Previously, human radiologists made decisions that combined prediction and judgment. However, when humans make decisions, the prediction (such as an abnormal lesion) is often indistinguishable from the judgment regarding the danger it poses (whether the lesion is problematic). AI disentangles prediction and judgment in decision-making, leaving the human to exercise judgment. This separation can be subtle, with humans sometimes unaware that they're making a prediction as part of a decision.
AI has redefined the diagnostic landscape. Radiologists now face an increased volume of disease assessments generated by AI and must evaluate whether a positive result carries implications for clinical outcomes. This judgment step takes into account potential interventions and associated risks, and it calls for the very empathy and creativity that technology forecasters prematurely dismissed as obsolete.
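One way to see the separation is a toy expected-value calculation: the model supplies a probability (the prediction), while the clinician supplies the payoffs of acting or waiting (the judgment). The numbers below are purely illustrative, not clinical guidance.

```python
# A toy illustration of decoupled prediction and judgment.
# The model contributes a probability; the clinician contributes the payoffs
# that turn that probability into a decision.

def expected_value(p_disease, value_if_disease, value_if_healthy):
    """Expected value of an action, given the model's probability of disease."""
    return p_disease * value_if_disease + (1 - p_disease) * value_if_healthy

# Prediction (from the model): probability the flagged finding is clinically meaningful.
p_disease = 0.15

# Judgment (from the clinician): illustrative payoffs reflecting the benefit of
# early treatment, the harm of a missed diagnosis, and the burden of
# unnecessary intervention.
ev_intervene = expected_value(p_disease, value_if_disease=8.0, value_if_healthy=-2.0)
ev_wait = expected_value(p_disease, value_if_disease=-10.0, value_if_healthy=0.0)

print(f"intervene: {ev_intervene:.2f}, wait: {ev_wait:.2f}")
print("recommend follow-up" if ev_intervene > ev_wait else "recommend watchful waiting")
```

The same prediction leads to different decisions as the payoffs change from patient to patient, which is why the judgment step cannot simply be folded back into the model.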
For AI to become a genuinely valuable tool in radiology, radiologists themselves must take on the responsibility of training, testing, and monitoring outcomes. A data scientist's expertise can only go so far: the radiologist's judgment in assessing the connection between diagnosis and clinical outcome is vital. As researchers noted in The Lancet, "Unless AI algorithms are trained to distinguish between benign abnormalities and clinically meaningful lesions, better imaging sensitivity might come at the cost of increased false positives, as well as perplexing scenarios whereby AI findings are not associated with outcomes."
Hinton and Khosla also failed to account for the practical challenges AI encounters in the real world. Theoretical success doesn't always translate into practical effectiveness, as reality often proves more complex than our assumptions. By 2020, a mere 11 percent of radiologists reported using AI for image interpretation. This low adoption rate is primarily due to AI's inconsistent performance: 94 percent of users experienced variable results, while only 5.7 percent reported that AI consistently worked as intended. That level of reliability is not enough to earn doctors' trust.
The gap between AI's potential and its real-world application has several causes. AI model development begins with testing in a highly controlled and limited environment. Machine learning engineers collaborate with a select group of experts to train a model, evaluate its performance, and then deploy it within a specific setting, such as a radiology department in a single hospital.
AI luminary Andrew Ng, known for his work at Google Brain and Baidu, has shed light on the challenges of transferring AI models between environments. When AI is trained and tested in one hospital—typically an advanced or high-tech facility—researchers can demonstrate its performance on par with human radiologists. However, when the same model is applied to an older hospital with dated equipment and differing imaging protocols, the data becomes inconsistent, leading to a decline in performance.
This issue of model transferability is another crucial factor in Hinton and Khosla's misjudgment. An AI trained in one location may not be dependable in another. In stark contrast, a human radiologist can effortlessly transition from one hospital to another and still "do just fine."
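A simple guard against this is to evaluate performance site by site rather than in aggregate. The sketch below is a hypothetical illustration (toy model, made-up data) of how a per-site report can surface the kind of drop Ng describes.

```python
# A minimal sketch (hypothetical data and model) of checking whether a model
# trained at one site holds up at others, rather than reporting one pooled
# accuracy that can hide site-to-site drops.

from statistics import mean

def model(brightness):
    """Toy classifier with a fixed threshold learned at the training hospital."""
    return 1 if brightness > 0.5 else 0

def accuracy(classifier, cases):
    """Fraction of cases where the prediction matches the ground-truth label."""
    return mean(1.0 if classifier(features) == label else 0.0 for features, label in cases)

def evaluate_by_site(classifier, cases_by_site):
    """Report performance separately for each hospital/site."""
    return {site: accuracy(classifier, cases) for site, cases in cases_by_site.items()}

# Hypothetical toy data: scanners at "community_hospital" produce systematically
# dimmer images, so the threshold learned elsewhere misfires there.
cases_by_site = {
    "training_hospital": [(0.8, 1), (0.7, 1), (0.2, 0), (0.3, 0)],
    "community_hospital": [(0.45, 1), (0.4, 1), (0.1, 0), (0.2, 0)],
}

print(evaluate_by_site(model, cases_by_site))
# {'training_hospital': 1.0, 'community_hospital': 0.5}
```

A pooled accuracy across both sites would look respectable while masking the failure at the second hospital, which is why per-site (and per-scanner) evaluation matters when deploying a model.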
The emergence of ubiquitous foundation models (large general-purpose models trained on vast amounts of data) and generative AI (systems that create new content from those models) could reshape the landscape of AI applications in radiology and other domains. These models, pre-trained on data from diverse sources, could be further fine-tuned to specific environments and tasks, potentially improving transferability and performance. By incorporating diverse imaging protocols and equipment types, they might better adapt to different hospitals, addressing the current limitations in model transferability.
However, the rise of foundation models also presents its own set of challenges and risks. As models become more intricate and expansive, interpretability becomes increasingly difficult, raising concerns about transparency and accountability in decision-making. If history is anything to go by, this will give rise to new skills and tasks for future radiologists.
Let's go back to our interlinked story of decisions, specifically the decoupling of prediction, judgment, and action that AI enables. In a world of foundation models and generative AI systems, we can see why AI adoption is far more complex than it first appears. Not only does AI increase the value of human judgment (because there are more decisions and actions to take), it also places judgment at more points in the decision-making process, because users must apply judgment both to the inputs to the generative model (prompts) and to its outputs ("did the AI generate the right prediction, and did the resulting action work?").
In a twist of fate, rather than becoming obsolete, the number of radiologists in the US increased by around 7 percent between 2015 and 2019. There is now a global shortage of radiologists, driven in part by an aging population's rising demand for imaging. Ironically, the bottleneck in radiology now lies in training.
The prevailing sentiment is that "AI won't replace radiologists, but radiologists who use AI will replace those who don't." Far from becoming obsolete, radiologists are in high demand, partially thanks to the benefits AI brings to the field. We see three clear effects:
Net-net: deciding the appropriate course of action requires a holistic, synthesized, team-based, and personalized set of decisions, not just a single readout from an image.
The key to making better predictions about work is to understand the deeply connected nature of human and machine decision-making. Ask: what new things can I know or predict, what new judgments arise, and what new actions might I be able to take?
We live in a complex system of evolving human preferences, adaptable professional environments, and emerging problems and opportunities. AI adoption in complex environments is never as simple as even the experts would have you think.