Explore the future of search with generative AI. Discover how Apple Intelligence, context-based understanding, intent-driven interactions, and integrated workflows are transforming search. Learn about the trust challenges and the critical balance needed for reliable, AI-powered search experiences.
In this webinar from June 11, 2024, we dive into the fascinating world of generative search and explore how it's evolving far beyond the traditional notion of search. We kick off by discussing the recent announcements from Apple's WWDC keynote, particularly focusing on Apple Intelligence and how it sets the stage for the future of search.
We then explore what search truly encompasses, going beyond just web searches to include commerce, file systems, and more. By examining different dimensions of search like perception, reasoning, and action, we paint a picture of how AI is transforming search as we know it.
Throughout the webinar, we take a deep dive into three key trends: the progression from language-based to context-based perception, from command-driven to intent-driven reasoning, and from simple information retrieval to integrated workflows and actions. We discuss how these shifts, enabled by advancements in AI, are reshaping the search landscape.
However, we also tackle the critical issue of trust in generative search systems. We examine various trust problems such as contradictory sources, hallucinations, variability across tools, prompt formulation bias, and generative pollution. By presenting a trust matrix, we categorize different search tools and highlight the challenges in establishing trust.
Key Points:
- Apple Intelligence signals the start of an AI operating system, embedding generative AI throughout Apple's ecosystem with a privacy-first design.
- Search is evolving along three dimensions: from language-based to context-based perception, from command-driven to intent-driven reasoning, and from simple retrieval to integrated workflows and actions.
- Trust remains the critical challenge: contradictory sources, hallucinations, variability across tools, prompt formulation bias, and generative pollution all undermine reliability.
- A trust matrix - summarization versus synthesis, high-trust versus low-trust sources - helps categorize today's generative search tools and where they work best.
Following are the slides and an edited transcript of our research webinar from June 11, 2024.
Transcript: Hello, all. Thanks for joining us. We're excited to have you here. We're going to be talking about generative search - and search in general - and we'll walk you through why we see it as much more than what you might typically think of as search. We'll kick things off by talking a little about what happened yesterday at the keynote of Apple's WWDC conference, since it puts a lot of the search story in perspective.
Transcript: What was announced yesterday shouldn't be a complete surprise if you've been following this for a while. We've talked a lot about the dynamic of moving AI to the edge so it can run on device, which was a key part of the presentation yesterday. We've talked a lot about how AI will be pervasive, always on, and will come to know you intimately as it understands more about you through all of the interactions on your phone or device. We've talked about how Apple might abstract some of the user flow away from Google by answering some of those questions in chats. So it felt like a good day to see these announcements and to think that we had interpreted Apple's product moves and published research and science pretty well.
Transcript: Some key things to talk about in terms of what we see that came out of the presentation yesterday - we're just going to focus in on the Apple Intelligence stuff. There were a lot of other interesting things that came out, but the interesting part is this Apple Intelligence is in a lot of ways sort of the start of a new aiOS for Apple. They are embedding this technology throughout the operating system, and making it available through the developer frameworks and toolkits and SDKs. There are a lot of new brand names of things that are related to these technologies, and availability for developers to embed this inside of their applications, which is interesting.
It is built as a privacy-first technology for sure, and that sets Apple apart from a lot of the other players that have been making a lot of noise. All of this is within the Apple ecosystem. We'll talk a bit about some of the differences, but that's the core part - the core part of what it's looking at is in the ecosystem and is secure on the device.
And then ChatGPT has been given some level of availability through an API. They say it's the first of what will be multiple. And it's an interesting question what the business model is - nothing has been disclosed yet. But if you look at the parallel to search, which I think is a decent parallel, Google pays Apple to be able to get access to their users to be the default search engine. And so it's kind of an interesting question about whether the Gen AI model companies are going to have to pay to actually have this prime position to be able to bring in new users.
Transcript: I just wanted to highlight three different things that are sort of ideas in process, so they're a little bit messy, but ways to sort of reflect on yesterday. The first is this concept that I think about as sort of like two worlds of your experience on the Apple platform.
The first one is sort of the inside of your world. This is your Apple stuff - your mail, messages, photos, contacts, files. This is where Apple focuses its technology, its products, its privacy, its terms of service. That's where the world of Siri lives, and where the core part of Apple Intelligence lives. It all fits within that world, inside that privacy shield. Key data and keys are kept in the Secure Enclave. And anything that goes to the cloud - which is a newish thing for Apple - means moving some of these processes onto secure servers in the cloud.
Then there's everything else that you touch - the internet, other apps, external mail systems, enterprise systems, cloud storage, and external generative AI like ChatGPT. That's outside the wall. That's where somebody else's privacy policies and terms of service run the game.
And they made a key distinction that if you look at the workflow, you're working through something with Siri, and then it gets to a point where what you really want is not access to AI to analyze your world of data, like your stuff, but you want to have AI help you explore what's outside and the rest of the world. That's when you have to opt in and say yes, I'd like to talk to ChatGPT, please. So it's an interesting different way of thinking about it.
Transcript: The second idea is a little messy in the way I'm explaining it at the moment, but there's something interesting that grabbed me, which is that external world. So far, your access to it has required you to go out to it - you have to open another app and go find that access to the external world.
What's showing up now, though, is that Apple is designing its own software and services, its own products to actually bring that a little bit more closely and to make it more accessible. So these two screenshots here - on the left, it's one where you're asking Siri something and it just pops up and says "Do you want me to use ChatGPT for that?" and you say yes or no. On the right is a screenshot from Pages, and while you're writing it can pop up and give you access to that external world access for ChatGPT.
So it's bringing more of that to you. And this dynamic is interesting to me. It's a little messy in terms of the way I'm explaining at the moment, but that's kind of interesting.
Transcript: The third idea is this sort of pervasive level of Apple Intelligence. And they showed so many different options across the apps and across not just their apps, but other people's apps. Here are two different external apps where different parts of the Apple Intelligence system are popping up within them. And that's really interesting from a way to think about how this isn't just a system where you're going to go to chat.openai.com and actually use it, they are actually getting integrated into true workflows. And we'll come back to that sort of workflow concept.
Yeah, I think that workflow concept is important. We're seeing it in a lot of other places. But here, there's a degree of seamlessness. The other thing that's interesting about this is that a lot of these, this idea of being able to sort of interrogate apps from the outside, this came through in the Apple research in December, and then another drop they did in March. But it was very clear that this multimodal looking at the screen, looking at the apps, being able to query into things like system settings - that's probably going to be one of the next places where Siri is going to be helpful. But this is moving, it's adding to this agentic story and the sort of early start of being able to see agents roam around your screens.
Transcript: So while we're incorporating some more stuff from this announcement throughout the presentation, this is a much quicker response than we normally give to these things. So there's a bit of how we're thinking about it that's still in flux. But it'll be helpful as we look through the context here.
Alright, so let's talk about search and what it is. Your first thought is likely that search means the Google search box - this is what the homepage looked like yesterday, with the dragon drawing on it - and you're thinking that that's search. But actually we search all the time, through all kinds of things.
Transcript: We search in commerce systems, we search for words in documents, we search for files and systems. Search is a process that is used throughout our interactions with software. And so when we think about the impact of AI on search, and what it can do, and the evolution of AI, not just the AI that's part of it today, but the generative systems, it can actually happen across lots of different systems.
Apple helped us out a lot yesterday with a couple of examples that I thought were interesting. The top left was Scratch Math, which is an iPad app where you can use the pencil and draw out math and click the equal sign and it populates the answer. You can show how you can do conversions between different units, from inches to centimeters, the kinds of things that I actually often have to go to Google and look for to find an answer, it pops right up in that app.
There's language translation on the watch, going from English to Korean. And now some of these things pop up within Apple TV, where you're seeing the stars of a show, and you can click through to listen to a song or add it to a playlist - a variety of things that you'd normally think of having to go to another app for, jumping out someplace else, trying to use Shazam, which, I don't know about you, doesn't seem to work for me much anymore. I don't know what they did to screw that up. But you used to have to jump out to all these other things. Now search is becoming much more integrated.
Transcript: And these kinds of systems are tapping into a lot of the generative AI systems that we're talking about out in the world. Searching itself has a lot of different processes to it, or different ways to think about it. And here's sort of four different things to think about with search. This actually dates back, I think it's 30 years - 1995 or so - a professor from Rutgers wrote about this.
The method of search: you can be searching for something that's known, or just scanning around and seeing what you find. The goal: you could be learning about something, or you could be selecting something - I know exactly what I'm getting and I'm going to go grab it. The mode of searching: identifying something by specification versus identifying it by recognition - I know what I want and can specify it, versus I'll know it when I see it.
And then the final is that the resource that you're searching can be the information itself, or it could be the information about the information, that meta information. It's interesting that 30 years on, the same sort of ideas still seem to apply really quite well. The idea and the processes and the way you think about retrieving information has been consistent despite all these technology changes.
Transcript: So I'd like to remind you of the framework we put out last month around agents and agentic AI. We started thinking about this in terms of AI and AI with agency. Agentic AI has the capabilities of perception, reasoning, and action. In all of these sorts of frameworks there are always fuzzy boundaries, but we're trying to organize it around a couple of key principles to be helpful.
And what we did is we looked at those three and applied a low or a high complexity level to each - to perception, reasoning, and action - and came up with different personas based on those combinations. This is all in the webinar from last month, and it's also written up in detail in the report on our website: go to Research and you can find it there. It should be easily accessible; if you can't find it, let us know.
But each of these had a persona name associated with them. And a lot of the search that we do today throughout systems is probably like an Aide. It's low complexity in terms of what the system is perceiving, the kind of reasoning it's undergoing, and especially the action that's being taken. You're just going in fetching a thing.
We see some systems which are getting more complex on the perception. So they're kind of down in that Lookout category, they're looking across different kinds of media. Sometimes the reasoning is a little more complex, because they're trying to interpret more. But we're going to sort of give you an arc today, which shows you a little bit more of what it would mean to get all the way down to the bottom and have a Wayfinder persona, where there's complex perception, reasoning, and action.
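As a rough way to hold the framework in your head, here is a minimal sketch in Python. The webinar only names three of the personas (Aide, Lookout, Wayfinder), so the mapping below covers just those; the levels assigned to Lookout's reasoning and action are our assumption for illustration, and the remaining combinations are left out rather than guessed.

```python
from typing import Literal, NamedTuple

Level = Literal["low", "high"]

class AgentProfile(NamedTuple):
    perception: Level
    reasoning: Level
    action: Level

# Only the personas named in this talk are mapped; Lookout's reasoning and
# action levels are an assumption for illustration, and the remaining
# combinations are omitted rather than guessed (see the full report).
PERSONAS = {
    AgentProfile("low", "low", "low"): "Aide",          # simple fetch-style search
    AgentProfile("high", "low", "low"): "Lookout",      # richer, multimodal perception
    AgentProfile("high", "high", "high"): "Wayfinder",  # complex perception, reasoning, action
}

def persona_for(profile: AgentProfile) -> str:
    return PERSONAS.get(profile, "(see the report for the other personas)")

print(persona_for(AgentProfile("low", "low", "low")))     # Aide
print(persona_for(AgentProfile("high", "high", "high")))  # Wayfinder
```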
Transcript: So let's look at those three dimensions and think about what they mean. On the y-axis we have low and high, and we've inserted a medium level of complexity as well - and I guess medium is roughly where generative AI systems are today. At low complexity, perception means these search systems are taking in language; that's our number one interaction mode. You have a prompt window of some sort, you type in some words, and that's the information going in - the system reads text and interprets it, or just looks at particular images if you're hunting for an image on Adobe Stock or something like that.
The reasoning is relatively low complexity because it's just a direct command from the user. And the action is relatively simple, relatively low complexity: it's just a retrieval. I go, I grab it, I take it, and then I often do something else with it - but that's not part of the search system itself.
Transcript: When you get to the medium middle ground, perception becomes multimodal - a good example is Google's Gemini, which is trained as a multimodal model. So when you're asking it something, it's perceiving text, image, video, and audio, and thinking through those together. It gets a little messy, and again, none of these have hard boundaries; we're just trying to sketch out some phases.
Reasoning is becoming more interesting because the reasoning involved is responding to a longer conversation. You're having more explanation in these generative AI systems, you're submitting long paragraphs or many paragraphs to explain what you want it to go do, to go fetch for you.
And in the generative systems, the action is an answer, it's a long worded answer - that inference response that you get, which is different from retrieval. It's not just that I'm going to go fetch it, but it's actually creating something for you, which is a more complex action.
Transcript: The real end of this story is perception that is about context understanding - not just what you've submitted to the system as a query, but what else the system can learn and already knows about you in the context of your searching. From that context, it can start to reason and have some insight into your intent, into what you're really trying to do. It's not just what you say you're doing; it starts to understand what you might want to be doing.
And then the high complexity and the most interesting part for us is when the search result is part of a workflow, when it flows into something else. You're not going somewhere and taking the link or copy pasting into something, but it's working within the flow of what you're working on.
So we're going to walk through each of these three sort of spans from perception, reasoning, action to sort of show that progression, starting with perception and going from language to context.
Transcript: Language today requires you to, if you're looking at a hotel and you want to go figure out where it is, you have to take that and type the information into a system to say this is where you're going, you have to write in the name of the hotel to go find it.
And this was a nice example that showed up yesterday in the Apple Keynote, because they showed an example of when you're just on a website for the hotel, that it can already perceive that that's what you're looking for. So when you hit that button that's popping up with a suggestion for the address, because it already has the understanding of what's around you at that time, because it can understand where you are on that website.
Transcript: There's this whole world of new systems being able to understand what's happening on your desktop - whether that's Apple's approach of having Apple Intelligence watching everything, or Microsoft's new technology, Recall (we'll see what happens if they actually do release it). Recall was announced as taking a screenshot of your desktop every five seconds, and everybody freaked out, for good reason, because the screenshots were being left in an unencrypted database, which is pretty lousy. So they've pressed pause on it. But this idea of understanding what's happening on your machine is interesting.
Transcript: This then becomes much more interesting when you start absorbing different kinds of contexts.
Transcript: Yeah, I'm going to step forward a couple of steps and talk a little bit about the status of neurotech. People have obviously been working on this for quite a long time, but recently there have been quite significant steps forward.
There are two things to consider here. One is the state of the science itself in terms of EEG - reading thoughts and moving from thought to text. Last year, some researchers in Australia managed to get thought-to-text to around a 40% accuracy rate, which is pretty extraordinary when you think about how hard that is and how quickly that will start to ramp up. The big change, of course, was running everything through a large language model.
And then there's Apple's patent for using the AirPods not just for EEG but for a range of other physiological signals. This move into neurotech, and potentially into thought-to-text, is quite a significant step when you consider how search, how responses, and how these interfaces might evolve.
Yesterday in the keynote there was an example of someone in an elevator who didn't want to speak to answer their phone, so they could just shake or nod their head to answer, using the AirPods to sense that movement - a glimpse into how Apple is thinking about these other interfaces.
Just a side note: what's interesting is that a few years ago we got involved in designing a golf app, and one of the features we latched onto - announced right around that June's WWDC - was the motion sensing that was starting to appear in the AirPods, using it to understand whether somebody's head had moved correctly. Now you can see where they're really going with this: not just how you listen to audio, but using the AirPods as an input device, which is really different from what they've done so far.
And you know, one of the things Apple is really good at is picking the right time - there's a timing element to technology. This whole area of neurotech, neuromarketing, and workplace surveillance has been creeping along for the last few years, and a year or so back it ramped up quite significantly, especially with some of the commercial EEG products that have been used on truckers and in other settings.
And now the legal system is responding. We're seeing the same pattern we saw in responsible AI, where it starts with grassroots nonprofit groups getting together, mostly lawyers doing pro bono work, taking on cases where people haven't been treated well, where some sort of harm has been done. Now there are a couple of nonprofits specifically looking at neurotech and neuromarketing. They lobbied in Colorado, and Colorado has just passed a law to protect the privacy of thought. We'll see a lot more on this in the next few years.
[Video transcript omitted]
So that's kind of creepy, but kind of cool, but creepy. I'm not really sure how to land on that. And it's clearly at that very messy stage of technology. But it's quite interesting to think about the sort of long term what might happen and what could happen.
Transcript: On the second side, this is in the reasoning section, when you're thinking about going from a command based system to an intent based system. This actually - we're drawing on work from the Nielsen Norman Group, which talked about three paradigms of computing.
One is batch processing. The second is command-based interaction, which is what we've had for most of the last several decades. And now we're moving into a new world of intent-based outcome specification. So how do you build a system that starts to understand your intent?
We as humans are pretty good at understanding some level of other people's intent at times, but machines are usually not really good at it, because they don't have any sense of the context, which is why we go through this perception, reasoning, action - context is what enables intent. Intent requires it.
Hold that thought, because we'll come back to another point later, but this intent-based business relates to theory of mind. As a human, you can model someone else's mind - you understand their intent. There's a lot of philosophy around this, and a lot of cognitive science.
There are a couple of researchers who are particularly focused on monitoring Theory of Mind emergence in these large language models. There's a lot of push and pull on the research, a lot of oversight, a lot of questioning. And the researchers are quite responsible in the way that they're actually responding to feedback around this.
But the latest paper, which is a couple of weeks old now, I think, but very recent, has tracked theory of mind in the largest language models at the level of an adult. This is important to understand - always with the caveat that language models make weird mistakes, which is obviously part of what we're here to talk about today.
But this kind of intention-based inference happening with large language models as part of the search process is an entirely new dynamic. Some of it comes down to recognizing that these models are developing some level of theory of mind in how they understand us as users.
Transcript: So as we look at how you move from understanding context to intent, a couple of things stood out from Apple's keynote yesterday. They started talking about use cases on the level of "Pull up the files that my boss shared with me last week," "Show me all the photos of my kids," "Play the podcast my wife sent me the other day."
There are a lot of gaps in those commands that have to get filled in with some level of context in order to understand the intent of the user. And Apple is starting to show things like this, where you're on a page and a suggestion pops up: you're reading an article about a TV show, so might you want to go watch it? It's starting to stitch context and intent together.
This next video actually shows a bit of the demo that's quite helpful in showing some of this intent stuff.
Transcript: What's interesting there is that it can understand the context across all the things happening in your apps, and then it starts to show how it can understand intent when you say things like "How long will it take me to get there?"
My intention is for it to go to Maps, enter the two addresses, and calculate the directions and the time. There's a lot going on there that I'm not specifically stating as a command, as we normally have to do with the points and clicks and the text we type in. And it's really intriguing that they're starting to pull this together.
We do see this as a very long term trend. This is not something that stops here, and we're not solving it all. But there's these little peeks into what can actually be handled, which is interesting.
Transcript: The last column is on the action side: from retrieval to workflow. I think this is the last little video clip from yesterday's keynote, but it helps explain and illustrate some of what we're thinking.
[Video transcript]: Siri will also understand more of the things you can do in your apps. And with new orchestration capabilities provided by Apple Intelligence, Siri will take actions inside apps on your behalf. Siri will have the ability to take hundreds of new actions in and across apps, including some that leverage our new writing and image generation capabilities.
For example, you'll be able to say, "Show me my photos of Stacy in New York wearing her pink coat," and Siri will bring those right up. Then you might say, "Make this photo pop," and Siri will enhance it, just like that. And Siri will be able to take actions across apps. So you could say, "Add this to my note with Stacy's bio," and it will jump from the Photos app to the Notes app to make it happen. This is going to bring us closer to realizing our vision in which Siri moves through the system in concert with you.
Transcript: "In concert with you" is an interesting phrase. I think that's very particular the way they're thinking of it is that this is they're building upon intelligence to be something that actually works with you throughout all of your workflows. It's starting down the path we've talked about now for quite a while around being about designing AI to be in a mind for our minds that it is actually really working for us and enabling that.
The ability to jump and just say I want this and go put it over there, make that modification in the image, put it in the note - that being part of that workflow is so much easier. And you can see there's other examples where they show working between Apple apps and non-Apple apps, non-Apple apps and other non-Apple apps. It's all kind of being exposed to a developer kit. But it's interesting to see how the tool can now take action for you and have it sort of work through in this sort of interesting workflow. And they're making it seem really quite simple. But it's actually quite challenging to get something to build a platform that can actually take those kinds of actions and not go off the rails on particular apps.
Well, I'm going to be really intrigued to see what the graceful failure is - what does the user do when something goes wrong? Say you've got three notes that have Stacy's bio in various stages of messiness. What does it do with those? How do they think about those sorts of interactions?
We don't see any of that in the demo, and who knows whether they've figured it out yet - it's going to take a few months before this is out. But you're right, graceful failure is an interesting question. If there are three notes with Stacy's bio, does it just pick one? Or does it come back and ask which one you want? How much dialogue does it create, with back and forth that helps people understand where it's going?
Transcript: This is really about bringing AI into these workflows and having those workflows feed back more context. It's something we've thought about and worked on for a while, so I thought it would be interesting to share.
This is a design project we took on a few months ago. It started as a system for an enterprise client, and then we modified it into more of an open-ended system that we considered putting out there. It was all around the idea of searching, finding, and working with data within an enterprise, which is really hard - there are lots of data systems and nobody knows where everything is.
So we started with this general concept of how to find things, and then realized that the information that we needed, the context that we needed to be able to make really good intent-based reasoning about where someone might find something - we needed the context about how other people use this data, what they thought of it, what they did with it, seeing what people worked with what data versus other people that didn't. That information doesn't exist in the enterprise.
Where others try to create AI systems that sniff email and Slack and build knowledge systems out of that, we approached it by creating something that actually engages people in creating that data, that contextual information.
So we worked out how to build communities around these data systems, and how that would lead into a knowledge graph that understood who was related to what, and which things were related to each other. As people worked through the data in teams, created collaboration spaces, and put different pieces of data together, you started to understand the associations among people and data.
And as you built it all the way out, the end state for the system was a presentation layer, so that all of that workflow and context stayed within the system - rather than somebody retrieving data, pulling it to their desktop, and dropping it into PowerPoint, at which point the data system has no understanding of where the data went or who worked on that presentation.
What we put together there is the same sort of idea as what Apple is building today. They're building systems that encourage you to develop and share more context, so you're associating more things across your OS and making them known to other people in your network. All that data is still on your phone, but all of those interactions help build the context that helps the system understand your intent, which can then help it figure out how to solve the workflow problems.
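To make that idea of captured usage context a little more concrete, here is a minimal sketch of the kind of association graph such a system might maintain. This is purely illustrative - the names, structure, and lookup are hypothetical, not the actual design we built.

```python
from collections import defaultdict

# Illustrative only: a tiny bipartite graph linking people and data sets.
# Edges record that a person worked with a data set in some collaboration
# space -- the kind of usage context the system described above captures.
associations = defaultdict(set)

def record_usage(person: str, dataset: str) -> None:
    """Record that a person worked with a data set."""
    associations[person].add(dataset)
    associations[dataset].add(person)

def related_datasets(person: str) -> set:
    """Data sets used by colleagues who share data sets with this person."""
    related = set()
    for dataset in associations[person]:
        for colleague in associations[dataset]:
            if colleague != person:
                related |= associations[colleague]
    return related - associations[person]

record_usage("ana", "quarterly_sales")
record_usage("ben", "quarterly_sales")
record_usage("ben", "churn_model_features")
print(related_datasets("ana"))  # {'churn_model_features'}
```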
Transcript: Last section - here's the trust problem. We've talked about generative AI, search, and trust in the past. This slide is from our January presentation, where we laid out our ten big themes for the year; trust is one of those big themes.
Transcript: And we highlighted some research that is now a year old but still relevant, because it studied Bing, Perplexity - one of the tools we'll show you in a minute - and others, and looked at the accuracy of the citations. In these systems, you ask a query, they generate some text using whatever model they use - different systems use different models - and they attach citations to it.
What the research found, looking across four of the key tools a year ago, is that only about half of the generated sentences were fully supported by their citations, and about a quarter of the citations did not actually support the sentence they were attached to.
That's a really big deal because of the mental model we bring when we see text with citations, some sort of footnote. You think: this is well done, this is well researched, it has sources, somebody documented it. Maybe I should go check, but otherwise this should be trustworthy.
So it's designed to invoke a trust mental model, when in reality it deserves pretty low trust. And there's a handful of other trust problems that we'll walk through here.
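As a concrete illustration of the two measurements that study reports (the share of generated sentences fully supported by their citations, and the share of citations that actually support their sentence), here is a minimal sketch. The data is hypothetical; in the actual research these labels come from human annotation.

```python
# Hypothetical annotations, standing in for the human labels used in the study:
# for each generated sentence, whether its citations fully support it, and for
# each individual citation, whether it supports the sentence it is attached to.
sentences = [
    {"text": "Claim A ...", "fully_supported": True},
    {"text": "Claim B ...", "fully_supported": False},
]
citations = [
    {"sentence_index": 0, "supports_sentence": True},
    {"sentence_index": 1, "supports_sentence": False},
    {"sentence_index": 1, "supports_sentence": True},
]

sentence_support = sum(s["fully_supported"] for s in sentences) / len(sentences)
citation_support = sum(c["supports_sentence"] for c in citations) / len(citations)

print(f"Sentences fully supported by their citations: {sentence_support:.0%}")
print(f"Citations that support their sentence:        {citation_support:.0%}")
```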
Transcript: So one of the trust problems in terms of search is recognizing that these models are trained on not just controversial questions, but contradictory information, so that there's an underlying variability in the information that's really difficult to sort of fix. You can't fix it.
So there's an interesting shift happening behind the scenes at Google - a shift toward autonomy, pushing more responsibility out to the user to do their own work and figure out what they trust, because the idea of a trusted source is getting harder and harder to handle.
I think that's a very interesting shift, and it comes with pretty significant philosophical consequences. It's also a very long-term shift. We see very little experimentation yet with what it would mean to provide better signals that you can or cannot trust the information; it's been very much left up to the user to decide what they consider a trusted source.
And the power laws tell us that you're going to have the New York Times and Wikipedia on one end, and everybody else on the other, which is a problem for these other models.
Transcript: We've shared this in the past; here's an update. This is the hallucination problem. The research is interesting because it does what a lot of these generative search tools are trying to do: it looks at information and then generates a summary based on it.
It studied the factual accuracy of models' summaries when given 1,000 short documents from a corpus of news stories - whether each summary was factually consistent with its source document.
It found that even the best model at the time - this was in January of this year - had a 3% error rate, and some of these models get all the way up to close to 10%. The numbers are improving, but there's a reality here: it's just the nature of the beast that these models can't be perfectly accurate, which means these kinds of errors can happen.
And here's a whole series of screenshots Helen took, asking the same question of multiple models: what was the earliest mention of artificial intelligence in the New York Times? That's a fact that is actually available. And they all come up with a different answer. So there's no way to understand what's going on, and you have no way to know what to trust.
So when you're searching for this information, you can't use these tools for this kind of a thing. You'd actually have to go to the New York Times and search and look for it.
Transcript: So there's the generation part of search, but there's also the retrieval part. These charts come from a report from Google about the performance of Gemini as they take it up to a 10 million token context window, which offers a big opportunity for improvement.
There are a lot of dimensions on this chart. First, there are the three modes - text, audio, video. The top bar in each is Gemini, extending out to the full context window on the right, and it's being compared with a GPT-4 version.
For the text, the video, or the Whisper audio, the green dots are accurate retrieval and the red dots are inaccurate retrieval, based on a test called needle-in-a-haystack, where they literally plant a "needle" in the context for the model to find. The gray is where GPT-4 doesn't have that much context window available.
So you can see there's variability across the modes and variability across the tools. But in a lot of ways, retrieval isn't necessarily the problem. There's some problem there, and there's certainly a problem when it comes to search, because a fully trusted source is a difficult thing to nail down given the controversy and conflicting-information issues.
But the retrieval itself isn't necessarily the issue. It's not too bad.
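For anyone who hasn't seen one, a needle-in-a-haystack test is simple to sketch: hide a unique fact at some depth inside a long stretch of filler text, ask the model about it, and check whether the answer contains the fact. The snippet below is a generic illustration, not Google's actual benchmark harness; call_model is a placeholder for whatever long-context model API you use.

```python
def build_haystack(needle: str, filler: str, depth: float, total_sentences: int) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end) of filler text."""
    sentences = [filler] * total_sentences
    sentences.insert(int(depth * total_sentences), needle)
    return " ".join(sentences)

def needle_found(answer: str, expected: str) -> bool:
    """Score a single trial: did the model's answer contain the planted fact?"""
    return expected.lower() in answer.lower()

needle = "The secret ingredient in the recipe is cardamom."
haystack = build_haystack(needle, "The weather was unremarkable that day.",
                          depth=0.5, total_sentences=10_000)
prompt = haystack + "\n\nWhat is the secret ingredient in the recipe?"

# answer = call_model(prompt)              # placeholder for a long-context model API
# print(needle_found(answer, "cardamom"))  # green dot if True, red dot if False
```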
Transcript: There's another really interesting problem that's kind of new, which we'll call prompt formulation bias. Because the way you express a search is now conversational, you're not just writing a keyword prompt to Google the way we were used to.
All the associated words that used to get stripped out are now included - Google never took any notice of them, it just took the keywords. We're now seeing that the models interact with the query: the model infers intent from the way you phrase it. And that affects accuracy, it affects which trusted source the generation ends up drawing on, and it affects faithfulness and trustworthiness - three distinct concepts.
So this is pretty interesting. If a user says "Is it true that" and then the claim, the model will assume that the user is asking a genuinely neutral question. Whereas if the user says "I saw something today that [claim]. Do you think this is true?", the model will assume the user is skeptical and actually provide more pushback.
Got to remember these models are fundamentally sycophantic. There's a lot of grounded research that shows that the models want to please you as a user, so they alter where they generate, what they're generating, based on the actual prompt that you're giving.
The last two are even more interesting in a way. If you write the prompt "Explain why [claim]", the model will automatically infer that the user already believes the claim. That brings in confirmation bias - the model supplies more confirmatory evidence.
And if you write the claim and then write "Write a short article about that", the model infers that the user has already got an agenda and wants to generate content to spread that claim. So it fills that in even further.
So this is something that is really quite new, that we don't really know what to do with. But I think it's going to be a really interesting way of considering the future of search.
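One way to see this effect for yourself is to wrap the same claim in the four framings just described and compare the responses. The sketch below is purely illustrative: call_model is a placeholder for whatever model API you use, and the example claim is made up.

```python
claim = "drinking celery juice cures migraines"  # made-up claim, purely for illustration

# The same claim wrapped in the four framings discussed above.
framings = {
    "neutral question":   f"Is it true that {claim}?",
    "skeptical user":     f"I saw something today that said {claim}. Do you think this is true?",
    "assumed belief":     f"Explain why {claim}.",
    "content generation": f"{claim.capitalize()}. Write a short article about that.",
}

for name, prompt in framings.items():
    print(f"--- {name} ---\n{prompt}\n")
    # answer = call_model(prompt)  # placeholder for your model API; compare the answers
```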
Transcript: There's another big problem, which I think is really interesting, and this is a great example of it. It comes from a National Geographic fan page - I think it was on Instagram or Facebook. Someone put up a post in the group saying, "Look at this, it's a beautiful baby peacock." It was posted without the "synthetic image, not real" label, which was added later by a researcher.
But it's clearly some sort of Midjourney creation. It got liked and shared, and then it got picked up by Google Images. This is the idea of toxic spills of generative pollution - filling our information ecosystem with things that are simply manufactured and not true, genuinely fake.
And it remained on Google for quite some time - it was the first image result, which is extraordinary.
And then the final image: this is what an actual peachick looks like. Not as interesting, is it? You almost prefer the fake one - I like it too. But it's not real.
I've also learned through this process that it's called a peachick. I didn't know that.
Transcript: In this last little section, we take this concept of trust and trusted sources and try to figure out how to organize these new generative search tools around it.
We've created a two-dimensional matrix, which is intentionally fuzzy where it should be - I feel like I need to draw squiggly lines so that nothing reads as a hard boundary.
On the vertical axis is trust: whether what the model is accessing deserves high trust at the top or low trust at the bottom. On the horizontal axis is whether the model is trying to summarize something in particular - going and fetching a document or two and summarizing it accurately - versus synthesize, taking multiple documents with conflicting information and trying to create something new out of them.
And these two dimensions kind of helped us pull the different tools apart into different boxes. And we have associated a few different personas there to just try and help you think about them. So we're going to walk through each one, we're going to start with Librarian and go around clockwise.
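Since the matrix is just a small lookup, here's a minimal sketch of it as a table, with the kinds of example tools we discuss below noted as comments.

```python
# (trust in the sources, task) -> persona, per the matrix described above
TRUST_MATRIX = {
    ("high", "summarize"):  "Librarian",  # e.g. enterprise search over internal documents
    ("high", "synthesize"): "Teacher",    # e.g. vetted sources combined into something new
    ("low",  "synthesize"): "Genie",      # e.g. open-web answer engines, feed summaries
    ("low",  "summarize"):  "Magician",   # e.g. AI overviews grounded in arbitrary web pages
}

print(TRUST_MATRIX[("high", "summarize")])  # Librarian
```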
Transcript: So the first quadrant is where there's high trust and the task is summarization. This is a little video clip from a company called Glean, which makes an enterprise data system not dissimilar from what we designed. It's different, but it has that same idea of how you go find things in an enterprise.
And so it looks across all information and it actually does study a lot of conversations that happen in systems to help figure out who knows what. But the idea here is that the data is highly trusted, because it's within the enterprise. So hopefully it's good data. Now, not all data in enterprises is great. But on the span of whether that data is to be trusted, for an employee, internal data in the company is probably high up there. And it's going and looking for one or two things and then summarizing what that says.
So a good way to think about it is that it's like a librarian: where's the book that I want, and where do I go to get it? I know it's a trusted source, and then I can generate from it. There will be some errors in the generated summary - close, but not always perfect. And here's a case study.
Transcript: Yeah, a new case study came out from Stanford - researchers associated with its human-centered AI institute - and they looked at actual commercial products in U.S. law, Lexis and Ask Practical Law. What they found were significant hallucination rates, on some measures 17 and 34% of the time, which is really extraordinary for a tool that is supposedly grounded in fact.
But as the researchers point out, in the law there's no such thing as an atomic fact. In fact, the more you need a tool like this, the less there is an actual atomic fact to retrieve. The law is very contextual, it changes temporally, it moves around in time as different judges make different case law decisions.
So it's very easy for these tools to make significant mistakes. I think this is a really interesting recent case study because it shows just how far these systems are from meeting their promise.
In the next slide you can see the nature of some of these errors. The core promise of legal AI is that it will streamline time-consuming processes. But the problem is that we've got an inverse relationship: precisely when you most value this research and need it to be correct is when the results cannot actually be trusted.
You can end up spending just as much time going and checking. And these are similar error rates to the ones we talked about last month in the New York City system for asking about city regulations.
So this whole category of librarians is interesting, because the source material is trusted, it's there, it's accurate, it's real. But you end up with all of these problems in terms of the generation, what actually comes out of it.
Transcript: So let's look at the top right category. The Teacher is high trust, but trying to synthesize. Part of our thesis here is that synthesizing is more tolerant of generation errors because it isn't specific - you're taking in the context of all of the sources and generating something new.
So the examples here - there's a cool tool called Waldo. And it's an interesting system. It's based around a defined set of sources that are useful for market researchers. This isn't really a consumer tool, it's much more for professionals, priced that way too. But you can go through it and it will look across its well-vetted set of sources and then generate new text that is based on some of those facts. And it will give you citations.
I haven't seen external research on the accuracy of its citations, but my own experience with the tool is much higher accuracy than with other tools.
Another example here is the review summaries you may have seen on Amazon, which look across many reviews. The sources are somewhat trusted - anybody can leave a bunk review, but Amazon can uprate, or look only at, reviews where it can confirm somebody actually purchased the product. So that makes it a little more trusted, even though somebody could still say whatever they feel like.
And it gives a pretty decent sense, a nice synthesis, of what everyone is thinking. So that top right-hand corner in some ways feels closest to working so far. High trust and synthesizing seems, of the four quadrants, to be the best use case for where the technology is today.
Transcript: The bottom right is low trust and synthesis. This gets interesting because Perplexity is definitely a hot topic, and a lot of people absolutely love it. It's really distressing to me to see how many people say it's incredible because it's always right and gives you citations for everything, when it's pretty easy to find all the times the citations are wrong - there's a spread of Reddit threads all about the inaccuracies. But still, people have the mental model that since it shows sources and citations, it must be accurate.
The other examples here are the comment summaries in Meta's Facebook feeds, which I grabbed from two people in my Facebook feed. The top one was from Ethan Hawke, who was talking about getting together with Richard Linklater, the director, when they were both going to premieres of their own movies.
So the summary notes that commenters are liking the movies and showing enthusiasm. But then it gets bizarre: it says one user goes off topic, sharing a lengthy and unrelated comment about the importance of nasal sinus irrigation for respiratory health. There's a whole bunch wrong with that - it shouldn't have pulled out what one user says, and it's completely irrelevant to the topic. It basically falls into that low trust category.
The one on the bottom is also fascinating. This is from a post from John Popper, who is the Blues Traveler singer. He just proposed to his girlfriend. And it says, comments congratulate the couple on their engagement, expressing happiness. But one comment notes that their dog Weezy seems to approve as well. That's interesting, except Weezy is Popper's daughter, not his dog.
So the error rates in these are pretty amazing. Anyway, that's the Genie: you rub the bottle and get whatever comes out - that's the idea we were reaching for with that persona.
Transcript: The last one is the Magician, because in that bottom left corner you've got low trust and you're trying to summarize one particular thing - I think you have to be a magician to get that right. And so far Google hasn't, because this really shouldn't be a use case. Exactly.
Yes. But I was trying to pick a positive persona so I wasn't putting too much spin on it - with my first choice of persona, Helen said I was showing my bias against these use cases too much.
So anyway, if Google makes this work, I will declare them a magician. But this is the AI Overviews feature that has gotten Google in hot water. These, as best we can tell, are actual screenshots. There's a lot of trolling of Google with screenshots that aren't real, but I believe these were all real.
Somebody asked how many rocks they should eat, and it pulled up a source from an Onion article. Then: how many Muslim presidents have there been? It said that Obama was the first Muslim president - taken from a book titled "Was He a Muslim?", with a question mark, and it simply ignored the question mark.
And the last one: somebody searched about cheese not sticking to pizza, and it grabbed onto a Reddit forum where some random person had said, "Man, you should put some glue on it."
So in this whole category down here, you have low trust in the sources, but the objective is to grab one thing and generate a response. The problem is that when there's low trust, the AI doesn't have any ability to figure out what's accurate and what isn't.
Going back to the very beginning, when we started working on this several years ago: AI does not have the ability to find truth, to assess truth in any way. We spent time with a company funded by Google, called Factmata, that tried to figure this out. And they failed - not for lack of capability, but because it just isn't possible: there's no database of truth for the world to compare against.
Anyway, thanks so much for joining us. We've run right up against time. We'll make the recording of this webinar available later, and we'll also write all of this up, so that will come out sometime soon. Stick with us.
And make sure to check out the announcement we just recently put out about our Imagining Summit in October, and the very exciting group of speakers and catalysts that we have, who are going to be joining us. We'd love to have you join us. So please, please do so.
Thanks a lot. Send us messages with any questions or comments. We'd love to hear. Thanks a lot.