The Hidden Cost of ChatGPT is the Erosion of the Digital Commons

Key Points:

Erosion of the Digital Commons: The rise of AI models like ChatGPT is contributing to the decline of the digital commons, foundational to the modern web, by reducing user contributions to public knowledge platforms.
Impact on Stack Overflow: A significant decrease in user activity on Stack Overflow was observed following the release of ChatGPT, with a 16% drop in weekly posts initially, growing to 25% within six months. This decline includes valuable and novel contributions, not just low-quality or duplicate content.
Feedback Loop and Proprietary Models: As users increasingly rely on AI models for information, the digital commons suffer, potentially leading to a feedback loop where open platforms diminish and proprietary models dominate, locking valuable data and knowledge in closed silos.
Narrowing of Information Seeking: LLMs streamline information seeking, favoring mainstream views and reducing the need for exploration. This predisposes users to a flat, homogeneous information landscape, disincentivizing further learning and the use of niche tools.
Synthetic Data Limitations: While synthetic data is proposed as a solution, LLM-generated data is ineffective for training AI, exacerbating data scarcity concerns and potentially slowing the generation of new open data.
Snake-Eating-Its-Tail Scenario: LLMs depend on human-generated data, their most important input, yet their prevalence reduces the web’s capacity to produce such data, creating a self-perpetuating dilemma.

Is AI destroying the internet? Are we running out of good data? Will AI increasingly eat its own excrement? All these questions are being asked right now and the answer to all of them feels like "yes." But how do we know and what evidence do we have? Perhaps the real question is this: what is happening to the digital commons that underpins so much of the modern web?

At the heart of the issue is the very nature of how ChatGPT and other AI models are trained. These systems consume troves of publicly available data, from Wikipedia articles and Reddit posts to open-source code hosted on platforms like GitHub. They then use this data to build their knowledge base and generate outputs in response to user queries. They stand on the shoulders of digital giants: the countless contributors who have voluntarily shared their knowledge and creativity online for the benefit of all.

A new study suggests that the rise of ChatGPT may be eroding these foundations. Focusing on the popular programming Q&A platform Stack Overflow, the researchers found a significant drop in user activity following the release of ChatGPT. Using statistical models that compare Stack Overflow with similar Q&A platforms, they estimate a 16% decrease in weekly posts, with the effect growing to 25% within six months. Importantly, the decline was not limited to low-quality or duplicate content. Valuable and novel contributions were displaced too, and that is the reason to worry.

Figure: Time series of weekly posts to Stack Overflow since early 2016. A: In the six months after the release of ChatGPT, the weekly posting rate decreases by around 20k posts. B: Comparing posts to Stack Overflow, its Russian- and Chinese-language counterparts, and mathematics Q&A platforms since early 2022.
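
To make the idea of comparing platforms concrete, here is a minimal sketch of a difference-in-differences calculation, one common way to separate a specific shock from a general trend. I am assuming this style of comparison for illustration only; it is not the study's actual model. The function name and all post counts below are hypothetical, not figures from the study.

```python
import math

def relative_change_did(treated_before, treated_after,
                        control_before, control_after):
    """Difference-in-differences on log weekly post counts.

    Returns the relative change in the treated platform's activity,
    net of the change seen on the comparison platform over the same period.
    """
    treated_shift = math.log(treated_after) - math.log(treated_before)
    control_shift = math.log(control_after) - math.log(control_before)
    return math.exp(treated_shift - control_shift) - 1.0

# Hypothetical weekly post counts, NOT data from the study.
effect = relative_change_did(
    treated_before=60_000, treated_after=48_000,  # platform exposed to ChatGPT
    control_before=10_000, control_after=9_500,   # comparison platform
)
print(f"Estimated relative change: {effect:.1%}")
```

The point of the comparison platform is that a general slowdown affecting everyone gets subtracted out, so what remains is closer to the effect of the event itself.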

If users turn more and more to ChatGPT and other AI models for answers and assistance, rather than posting their questions and solutions publicly, the digital commons that these models rely on will begin to decline. Over time, this could create a feedback loop in which open platforms become narrower and less useful while closed, proprietary models become more dominant.
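
A toy simulation can show how such a feedback loop amplifies an initial loss. This is purely illustrative: the model, the function name `simulate_feedback`, and every parameter value are my own assumptions, not something taken from the study.

```python
def simulate_feedback(steps, baseline_posts, direct_substitution, feedback_strength):
    """Toy model of a commons feedback loop (illustrative assumptions only).

    Each step, a fixed share of posts is diverted to an LLM, plus an extra
    share proportional to how much activity the platform has already lost.
    """
    posts = [baseline_posts]
    for _ in range(steps):
        lost_share_so_far = 1.0 - posts[-1] / baseline_posts
        diverted = min(1.0, direct_substitution + feedback_strength * lost_share_so_far)
        posts.append(baseline_posts * (1.0 - diverted))
    return posts

# Hypothetical parameters: 10% direct substitution, moderate feedback.
trajectory = simulate_feedback(
    steps=10, baseline_posts=1000.0,
    direct_substitution=0.10, feedback_strength=0.5,
)
for step, value in enumerate(trajectory):
    print(step, round(value))
```

In this sketch, a feedback strength below 1 means the platform settles at a lower level of activity than the direct substitution alone would predict, while a value of 1 or more means activity keeps falling until nothing is left.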

The digital commons, from open-source software to creative works shared under permissive licenses, have been a key driver of innovation and progress on the web. They have enabled developers to build on each other's work, accelerating the pace of technological change. They have allowed creators to remix and reuse content, spurring new forms of expression and collaboration. And they have made vast stores of knowledge and information freely available to anyone with an internet connection, fundamentally changing access to knowledge.

If the rise of AI leads to a decline in these public goods, the future of the web itself may be at stake. We risk entering a state of digital enclosure, where the most valuable data and knowledge are locked up in proprietary silos. The substitution effect of Large Language Models will decrease the dynamism and openness of the web as information contribution shifts from open knowledge generation to feeding proprietary LLMs. This is effectively the privatization of digital public goods. It will also privatize feedback loops, as leading LLM owners gain an advantage from exclusive user data and feedback.

I think one of the most pernicious effects, and one that you as an internet user have some control over, is the narrowing of information seeking. As more queries are funneled through LLMs, the researchers note, the range and depth of information that people seek narrows. LLMs favor mainstream views, which reduces the need to explore. This behavior predisposes us to a flat, "simple and same" information landscape. The efficiency bump you get from using an LLM can disincentivize further learning or the use of niche or new tools. In the language of dichotomies, LLM use can favor "exploit" over "explore".

This study also highlights the complex political economy of the web and its underlying infrastructure of AI and data. Digital goods have value beyond training data, especially in developing countries where people are more intrinsically motivated to learn via internet platforms such as Wikipedia. Losses in open data impact innovation and therefore the ability to generate new data. People use platforms such as Stack Overflow for more than answering questions—they signal their competency and proficiency in the labor market, thereby linking the online world with the physical world of human flourishing in complex ways.

ChatGPT has already altered the network topology by reducing the vibrancy of one popular and productive platform. Widespread LLM use will likely slow the generation of the new open data needed to train AI. And don't believe the promise of synthetic data: LLM-generated data is likely to be ineffective for training, which is part of why concerns about data scarcity are already growing.

In the final analysis, LLMs can't fully replace human-generated data, their most important input, and yet the dilemma is that they are reducing the web's ability to generate novel data. We might be in a snake-eating-its-tail scenario. Ironically, it's Big Tech that might have the biggest incentive to fix it.


Helen Edwards