A new study suggests that the rise of ChatGPT may be eroding the digital commons. If users turn more and more to ChatGPT and other AI models for answers and assistance, rather than posting their questions and solutions publicly, the digital commons that these models rely on will begin to decline.
Is AI destroying the internet? Are we running out of good data? Will AI increasingly eat its own excrement? All these questions are being asked right now and the answer to all of them feels like "yes." But how do we know and what evidence do we have? Perhaps the real question is this: what is happening to the digital commons that underpins so much of the modern web?
At the heart of the issue is how ChatGPT and other AI models are trained. These systems consume troves of publicly available data, from Wikipedia articles and Reddit posts to open-source code hosted on platforms like GitHub. They then use this data to build their knowledge base and generate outputs in response to user queries. They stand on the shoulders of digital giants: the countless contributors who have voluntarily shared their knowledge and creativity online for the benefit of all.
A new study suggests that the rise of ChatGPT may be eroding these foundations. Focusing on the popular programming Q&A platform Stack Overflow, the researchers found a significant drop in user activity following the release of ChatGPT. Using statistical models, they estimate a 16% decrease in weekly posts, with the effect growing to 25% within six months. Importantly, the decline was not limited to low-quality or duplicate content. That is the worrying part: valuable and novel contributions were displaced too.
If users turn more and more to ChatGPT and other AI models for answers and assistance, rather than posting their questions and solutions publicly, the digital commons that these models rely on will begin to decline. Over time, this could create a feedback loop in which closed models become more proprietary and open platforms become narrower and less useful.
The digital commons, from open-source software to creative works shared under permissive licenses, have been a key driver of innovation and progress on the web. They have enabled developers to build on each other's work, accelerating the pace of technological change. They have allowed creators to remix and reuse content, spurring new forms of expression and collaboration. And they have made vast stores of knowledge and information freely available to anyone with an internet connection, fundamentally changing access to knowledge.
If the rise of AI leads to a decline in these public goods, the future of the web itself may be at stake. We risk entering a state of digital enclosure, where the most valuable data and knowledge are locked up in proprietary silos. As Large Language Models substitute for open platforms, information contribution shifts from open knowledge generation to feeding proprietary LLMs, which will decrease the dynamism and openness of the web. This is effectively a privatization of digital public goods. It also privatizes feedback loops, as leading LLM owners gain an advantage from exclusive user data and feedback.
I think one of the most pernicious effects, and one that you as an internet user have some control over, is the narrowing of information seeking. The researchers highlight that as more queries are funneled through LLMs, the range and depth of information that people seek narrows. LLMs favor mainstream views, which reduces the need to explore. This behavior predisposes us to a flat, "simple and same" information landscape. The efficiency bump you get from using an LLM can disincentivize further learning or trying niche or new tools. In the world of dichotomies, LLM use can favor "exploit" over "explore".
This study also highlights the complex political economy of the web and its underlying infrastructure of AI and data. Digital goods have value beyond training data, especially in developing countries where people are more intrinsically motivated to learn via internet platforms such as Wikipedia. Losses in open data impact innovation and therefore the ability to generate new data. People use platforms such as Stack Overflow for more than answering questions—they signal their competency and proficiency in the labor market, thereby linking the online world with the physical world of human flourishing in complex ways.
ChatGPT has already altered the network topology by reducing the vibrancy of one popular and productive platform. Widespread LLM use will likely slow the generation of the new open data needed to train AI. And don't believe the promise of synthetic data: LLM-generated data is going to be ineffective for training, which is one reason concerns about data scarcity are already growing.
In the final analysis, LLMs can't fully replace human-generated data, their most important input. The dilemma is that they are also reducing the web's ability to generate novel data. We might be in a snake-eating-its-tail scenario. Ironically, it's Big Tech that might have the biggest incentive to fix it.