ChatGPT Will Never Work Like the Demo

ChatGPT Will Never Work Like the Demo, AGI is a Red Herring, The Brittleness of Agentic Reasoning and Planning Using LLMs, Can LLMs Reason and Plan, How to Use Generative AI to Create, and more!

An abstract image of a broken iPhone

Important

  • Don't forget our June research briefing at 2pm PT on 11 June 2024. Given all the recent changes and news, we'll be focusing on generative AI search. We will send a calendar invite later this week—but keep the time free on your schedule until then.
  • And, the response to our initial announcement of the Artificiality Summit in Bend, Oregon on October 12-14 has been exciting! We have a great line-up of speakers already and can't wait to share more soon. Please email us if you are interested in attending and/or sponsoring. Note: this will be a space-limited event so it's a good idea to let us know asap if you're interested.
  • Happy Birthday to Dave's sister whose birthday is today!

ChatGPT Will Never Work Like the Demo

In the New York Times, Brian X. Chen provides a solid comparison between the current state of ChatGPT-4o and what was demoed at its launch. He claims “the demo turned out to be essentially a bait and switch.” He describes and shows in videos how the features that were most celebrated at the launch event are not replicable today. I wholeheartedly agree with his review—but want to take it one step further.

As a bit of background, tech demos are always a show with varying degrees of reality. It’s become accepted that companies will demo products and features that aren’t available yet. And it’s become accepted that demos will showcase only the best features—and likely dance around weaknesses. I understand this process well—I used to script and deliver demos for dozens of software products. The aim is to excite a customer with the best you have to offer while staying true to reality.

That said, there are two differences between demos as we’ve come to understand them and those that seem to be becoming a pattern for generative AI products.

First, as Brian demonstrates, OpenAI released ChatGPT-4o in a state that does not match the announcement or demo. It’s one thing for Apple to announce the next iOS before it’s available. It’s another for OpenAI to release ChatGPT-4o in an incomplete state. The public has a justified mental model that the available version of a product is the product. If I choose to use ChatGPT-4o, it should have the same features as were marketed and the same features as everyone else.

But OpenAI is building a pattern of releasing a product which is actually just a beta version of the product. ChatGPT-4o is not yet finished. It is incomplete. And while OpenAI’s core fanbase may be fine with being beta testers, the general public isn’t and shouldn’t be.

This release strategy leaves people confused. When will ChatGPT-4o be feature complete? When will the voices from the demo be available? When will the desktop app be able to screen grab as in the demo? When will ChatGPT-4o actually be ChatGPT-4o?

Second, generative AI products will never replicate demos—ever. With traditional software, you can perform a demo and know with 100% certainty that a user will be able to replicate the demo. With generative AI, you can perform a demo and know with 100% certainty that a user will never be able to replicate the demo. This is the fundamental nature of predictive technologies—and I can’t stress enough how important this difference is.

If you ask ChatGPT the answer to a physics problem, it isn’t looking up the answer. It isn’t querying a database of ground truth. It isn’t checking the answer key to the AP Physics exam. It doesn’t “know” any facts in the way that fits the mental model of human knowledge.

ChatGPT (and all generative AI language tools) is “reading” your prompt and then iteratively predicting the next “almost” best words to provide a response. I say “almost” because large language models are tuned to sample from among the most likely next words rather than always pick the single most likely one, creating the kind of language variability that sounds human-like. This means that every time—every time—you provide your prompt, you will get a different answer.

This also means that every time—every time—you try to replicate a demo of a generative AI tool, you will get a different response than the demo. These unpredictable generations are core to why generative AI is so powerful and compelling. But they do not match our mental models for computers and are creating mass confusion.
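For readers who want to see the mechanism, here is a minimal sketch in Python of why identical prompts produce different outputs. The probabilities and the prompt are invented for illustration, and this is not OpenAI's actual code; it simply shows what sampling the next word from a probability distribution with a temperature looks like.

```python
import random

# Invented next-word probabilities for an illustrative prompt like "The demo was ..."
next_word_probs = {"flawless": 0.55, "impressive": 0.25, "new": 0.15, "unreleased": 0.05}

def sample_next_word(probs, temperature=0.8):
    """Re-weight the distribution by temperature, then sample one word at random."""
    weights = {word: p ** (1.0 / temperature) for word, p in probs.items()}
    total = sum(weights.values())
    r = random.uniform(0, total)
    cumulative = 0.0
    for word, weight in weights.items():
        cumulative += weight
        if cumulative >= r:
            return word
    return word  # guard against floating-point rounding

# "Running the demo" three times: the same prompt, three potentially different words.
for _ in range(3):
    print(sample_next_word(next_word_probs))
```

At a temperature near zero the model would almost always pick the single most likely word; production chatbots run warmer than that, which is why repeated runs of the same prompt diverge.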

One of the best lessons I learned from Steve Jobs was when, during a weekly software review, he turned around from the software he was testing and said, “Do your customers understand how this works?” I think it’s fair to say that for the majority of ChatGPT users, that answer is “not well enough.”

It’s time for OpenAI to redesign the entire user experience with a beginner’s mind so that people know what they’re really working with.


This Week from Artificiality

  • Our Ideas: AGI is a Red Herring. The current obsession with AGI, fueled by the hype from companies like OpenAI, is a dangerous distraction we must firmly reject. Don't fall for the red herring argument that we need superintelligent AI to save us from ourselves. It's an insult to human intelligence and agency. Break down their flimsy logic and you'll see the AGI agenda for what it really is: a modern techno-myth peddling the ancient story of salvation from above. A quasi-religious narrative spun by Big Tech to serve their own interests, not ours.
  • The Science: The Brittleness of Agentic Reasoning and Planning Using LLMs. Research suggests that LLMs are not demonstrating genuine reasoning abilities but are instead relying on pattern matching and retrieval based on the provided examples. The ReAct framework is brittle, and what looks like reasoning in LLMs happens more by luck than by design. We're still a long way from reliable performance of LLMs in reasoning and decision-making tasks.
  • Our Research: Can LLMs Reason and Plan? The AI world feels like it’s divided into two camps: those who think LLMs can reason and plan and those who don’t. This dichotomy gives rise to over-optimism and over-pessimism about AI, neither of which is particularly helpful. So which is it? It’s increasingly clear that LLMs aren't capable of genuine planning and reasoning. According to ASU researchers, they're essentially giant pseudo-System 1 knowledge sources, not System 2 thinkers. While it’s true that they are more than giant machine translators, it’s also true that they cannot reason autonomously. LLMs are great at coming up with approximate knowledge and ideas for potential plans. But to actually use those ideas, you need to pair the LLM with external programs that can rigorously check the plans for errors. The key is to use them as part of a bigger system (see the sketch after this list).
  • Toolkit: How to Use Generative AI: Create by mixing modes and varying inputs, outputs, and perspectives. Part 6 in our How to Use Generative AI series. In the rapidly evolving landscape of artificial intelligence, multimodal models are at the cutting edge. They offer a transformative approach to content creation. These models mix modes, inputs, and outputs to build multimedia content from simple text prompts, providing an expansive generative canvas that fuels creative possibilities. By converting simple text prompts into sophisticated multimedia creations, multimodal models make content production more accessible. People without specialized skills in graphic design or video production can make content at a more professional level. Experts can expand their available tools, paving the way for dynamic, context-adaptive creations.
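To make the “bigger system” idea from the research item above concrete, here is a minimal, hypothetical sketch of the generate-then-verify pattern: the LLM proposes candidate plans, and a separate, deterministic program accepts or rejects them. The llm_propose_plan and verify_plan functions below are stand-ins we invented for illustration, not the ASU researchers' actual code.

```python
import random

def llm_propose_plan(goal, feedback=None):
    """Stand-in for an LLM call: returns a plausible but unverified candidate plan."""
    steps = ["pick up block A", "stack A on B", "pick up block B", "stack B on A"]
    random.shuffle(steps)  # approximate, sometimes-wrong "ideas," like an LLM's output
    return steps

def verify_plan(plan, goal):
    """External, non-LLM checker: enforces a hard constraint the LLM can't be trusted with."""
    if plan[-1] == goal:   # toy constraint: the plan must end by achieving the goal
        return True, None
    return False, f"Plan must end with '{goal}'."

def generate_verified_plan(goal, max_attempts=10):
    feedback = None
    for _ in range(max_attempts):
        plan = llm_propose_plan(goal, feedback)  # LLM as idea generator
        ok, feedback = verify_plan(plan, goal)   # verifier as the source of truth
        if ok:
            return plan                          # only verified plans are used
    return None  # fall back to a human or a classical planner

print(generate_verified_plan("stack A on B"))
```

The toy blocks aren't the point; the division of labor is: the LLM generates, and something else decides.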

Bits & Bytes from Elsewhere

  • Google's new AI Overviews generated some odd results like telling people to use glue to get cheese to stick to a pizza. These bad answers have been shared widely along with a lot of faked bad answers. Google explained why some of these bad answers were generated and how it is addressing the problem. Many are the result of "information gaps" where there isn't much information on the internet to choose from to generate an answer. Some are the result of the AI Overview technology not being great at identifying satirical content (like the Onion article that recommended eating two rocks per day). Google says that it has made changes to reduce these errors but we wonder how well these changes will work and how many more problems are yet to be discovered.
  • Perplexity released Pages, a way for users to create shareable webpages of Perplexity generative search results. It's hard not to see this in context with the issues with Google's AI Overviews. Yes, the two systems operate differently but Perplexity also generates text with errors—many more errors than users understand, in our view. So, while Pages may seem like an interesting new feature to some, it appears to memorialize its errors for more people to view. We'll come back to the challenges with generative AI search in our June research webinar on June 11.
  • The Information reports that Apple plans to use its Secure Enclave technology to protect user data in the cloud—including data processed with generative AI. With the caveat that it's always hard to predict anything that Apple has yet to announce, this concept makes a lot of sense. Apple has spent years building strong capabilities for both AI processing and data privacy on device, but there will be use cases for which processing in a data center makes more sense (e.g., analyzing images stored in the cloud). Doing so with the security of Secure Enclave could be one of the key differentiators of Apple's generative AI strategy and an enabler of truly secure and private interactions with generative AI. Depending on how this is implemented, it could be an advantage for Apple developers too—and featuring developers using Secure Enclave at WWDC would be logical. Perhaps this is what is behind the Apple-Gemini and Apple-ChatGPT rumors?

Helen's Book of the Week

The AI Mirror: How to Reclaim Our Humanity in an Age of Machine Thinking, by Shannon Vallor

Vallor's book on AI and humans is the best of its kind I've read in years. Her writing style is a perfect fit for how I absorb information. Crystal clear and jargon-free, her descriptions of the tech itself are spot-on yet totally accessible.

Vallor is the Baillie Gifford Chair of the Ethics of Data and AI at the University of Edinburgh and has worked with AI researchers and developers for many years. She describes herself as a virtue ethicist and evaluates AI's reflection of us by considering how it may alter our perspectives on virtues versus vices. She embraces, elaborates on, and wrings every drop out of the metaphor of AI as a mirror.

This starts with how AI is built on a foundation of historical data, which means that humanity can't afford to rely on it. If we do, we risk dooming ourselves to being trapped in the past. "The conservative nature of AI mirrors not only pushes our past failures into our present and future; it makes these tools brittle and prone to rare but potentially spectacular failures," she writes. Touché.

Many otherwise familiar ideas gain new depth through her interpretations and scholarship. I discovered numerous concepts she brings to light from the history of technology philosophy. For instance, the notion that AI mirrors make us more like machines (instead of making machines more like humans) was termed "reverse adaptation" by the philosopher of technology Langdon Winner in 1977. Today, we see this with workplace surveillance transforming workers into metric-monitored automatons of efficiency.

Perhaps what I appreciated most in this book was her scathing appraisal of AGI, much of which I completely agree with. There are so many brilliant sentences! One that particularly stands out encapsulates the emerging anxiety about AGI (despite Sam Altman's enthusiasm for the idea) by drawing a parallel to the factories of the nineteenth century: "Visions of AGI overlords that cruelly turn the tables on their former masters mirror the same zero-sum games of petty dominance and retribution that today drain our own lives of joy."

But it's not all doom and gloom. Vallor highlights instances of AI being used as a tool for cultural recovery within Indigenous cultures, serving as a mechanism for reclamation. Another example is "reverse bias," where AI helps doctors become more aware of the historical under-treatment of Black people's pain. These are small but significant glimmers of hope. They highlight one of the values of AI: by revealing and measuring such issues, we can learn to reason about them differently.

The AI Mirror is worth your time if you're looking for a realistically skeptical view of tech with glimmers of hope for a more virtuous future with AI.


Facts & Figures on AI & Complex Change

  • 53%: Percentage of people in the USA who have heard of OpenAI's ChatGPT. (Reuters)
  • 24%: Percentage of people in the USA who have heard of Google Gemini. (Reuters)
  • 22%: Percentage of people in the USA who have heard of Microsoft CoPilot. (Reuters)
  • 7%: Percentage of people in the USA who have heard of Midjourney. (Reuters)
  • 5%: Percentage of people in the USA who have heard of Anthropic's Claude. (Reuters)
  • 19%: Percentage of people in the USA who have not heard of any generative AI tool. (Reuters)
  • 7%: Percentage of people in the USA who use OpenAI's ChatGPT daily. (Reuters)
  • 11%: Percentage of people in the USA who use OpenAI's ChatGPT weekly. (Reuters)
  • 20%: Percentage of people in the USA who never use OpenAI's ChatGPT. (Reuters)
  • 35%: Percentage of people in the USA who have used generative AI in their private lives. (Reuters)
  • 28%: Percentage of people in the USA who have used generative AI at work. (Reuters)
  • 58%: Percentage of people in six countries (Argentina, Denmark, France, Japan, UK, USA) who are comfortable with news being written entirely by a human journalist. (Reuters)
  • 14%: Percentage of people in six countries (Argentina, Denmark, France, Japan, UK, USA) who are comfortable with news being written entirely by an AI. (Reuters)
  • 17.5%: Percentage of computer science papers that have at least some content drafted by AI. (Stanford)
  • 16.9%: Percentage of peer review text that has at least some content drafted by AI. (Stanford)
  • 24%: Percentage of HR departments that have not been at all involved in conversations about adopting AI technology. (Brightmine)
  • 20%: Percentage of HR employees who have received AI training of any kind. (Brightmine)
  • 4%: Percentage of US electricity generation consumed by data centers today. (EPRI)
  • 4.6%-9.1%: Percentage of US electricity generation estimated to be consumed by data centers in 2030. (EPRI)
