Gemini 1.5 Pro: An Ultra-Efficient, Multimodal System

Gemini 1.5 Pro's ability to handle unprecedented context lengths, its superior performance compared to its predecessors, and the sustained relevance of power laws in its design all underscore the breadth and depth of Google's long-term capabilities.

Research review: Gemini 1.5

Key Points:

  • Gemini 1.5 Pro's Release: Google demonstrates its AI prowess with the release of Gemini 1.5 Pro, highlighting significant advancements in multimodal abilities and context length, making AI more adaptable and versatile.
  • Extended Context Length: Gemini 1.5 Pro handles millions of tokens, including long documents and hours of video and audio, achieving near-perfect recall on long-context retrieval tasks across modalities, a significant leap over existing models like Claude 2.1 and GPT-4 Turbo.
  • Sparse Mixture-of-Experts (MoE) Architecture: The model uses an MoE Transformer-based architecture, which efficiently handles extremely long contexts by directing inputs to specific subsets of the model's parameters, allowing it to process up to 10 million tokens without performance degradation.
  • Efficiency and Scalability: The MoE approach allows for scalable parameter counts while maintaining efficiency, enabling the model to process vast amounts of information quickly and effectively.
  • Multimodal Capabilities: Gemini 1.5 Pro excels in handling long-form mixed-modality inputs, including documents, video, and audio, demonstrating impressive multimodal capabilities.
  • Needle-in-a-Haystack Task: The model shows exceptional memory and retrieval capabilities by accurately recalling specific pieces of information within large datasets, maintaining high recall rates even with 10 million tokens.

The release of Gemini 1.5 Pro stands as a testament to Google's formidable AI prowess. Its native multimodal abilities and huge step up in context length demonstrate an impressive capacity to scale alongside unimodal abilities, highlighting a significant leap in making AI more adaptable and versatile than ever before.

Here's what you need to know:

  • 1.5 Pro is often better than 1.0 Ultra: a demonstration of Google's broad and comprehensive approach to AI development.
  • Huge leap in context length: signals the fading importance of RAG (retrieval-augmented generation).
  • Power laws sustained: native multimodal abilities scale just as unimodal abilities do, signaling the efficiency of the architecture.

According to the paper: Gemini 1.5 Pro handles millions of tokens of context, including multiple long documents and hours of video and audio. It achieves near-perfect recall on long-context retrieval tasks across modalities. It also shows continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 2.1 (200k) and GPT-4 Turbo (128k). It is even able to surpass Gemini 1.0 Ultra on many tasks while requiring a lot less compute to train.

Gemini 1.5 Pro is a sparse Mixture-of-Experts (MoE) Transformer-based model. MoE enhances the architecture with a learned routing function that directs inputs to specific subsets of the model's parameters. This method enables the model to handle extremely long contexts efficiently, supporting inputs up to 10 million tokens without performance degradation. The MoE approach allows parameter counts to scale while keeping the number of activated parameters constant for any given input, pushing the limits of efficiency and long-context performance.

In plain English: an MoE model allows a computer to handle a vast amount of information very efficiently. A Mixture of Experts is like having a huge library of knowledge (a large number of parameters) with a smart system that knows exactly which "books" (parts of the model) to consult for any given question. The system directs the question to the most relevant experts without needing to check every single book. Because it only ever uses a small, relevant portion of the library at any one time, the library can grow as big as it needs to without the system slowing down.
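To make the routing idea concrete, here is a minimal sketch of sparse top-k gating in plain Python. Google has not published the details of Gemini 1.5 Pro's routing function, so the dimensions, expert count, and gating scheme below are generic illustrations, not the paper's method:

```python
import numpy as np

rng = np.random.default_rng(0)

D_MODEL = 64     # token embedding width (illustrative)
N_EXPERTS = 8    # total experts: the "books in the library"
TOP_K = 2        # experts actually consulted per token

# Each expert is a small feed-forward weight matrix of its own.
experts = [rng.standard_normal((D_MODEL, D_MODEL)) * 0.02 for _ in range(N_EXPERTS)]
# The learned router scores every expert for a given token.
router = rng.standard_normal((D_MODEL, N_EXPERTS)) * 0.02

def moe_layer(token):
    """Route one token through only TOP_K of the N_EXPERTS experts."""
    logits = token @ router                 # score all experts
    top = np.argsort(logits)[-TOP_K:]       # keep only the most relevant few
    weights = np.exp(logits[top])
    weights /= weights.sum()                # softmax over the chosen experts
    # Only TOP_K matrix multiplies run, however large N_EXPERTS grows:
    # that is the "constant activated parameters" property described above.
    return sum(w * (token @ experts[i]) for w, i in zip(weights, top))

print(moe_layer(rng.standard_normal(D_MODEL)).shape)  # (64,)
```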

The paper has a lot of detail on the model's multimodal capabilities, which are impressive: it can handle long-form mixed-modality inputs including entire collections of documents, multiple hours of video, and almost a full day of audio. For regular humans this means you could, say, load every single email you've ever sent or received.
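Some rough back-of-envelope arithmetic shows the scale. The per-item token counts here are my own assumptions for illustration, not figures from the report:

```python
CONTEXT_TOKENS = 10_000_000      # the 10M-token window from the paper

# Assumed averages, for a sense of scale only.
TOKENS_PER_EMAIL = 500           # a typical email, headers included
TOKENS_PER_LONG_NOVEL = 750_000  # something on the order of War and Peace

print(f"~{CONTEXT_TOKENS // TOKENS_PER_EMAIL:,} emails")          # ~20,000 emails
print(f"~{CONTEXT_TOKENS // TOKENS_PER_LONG_NOVEL} long novels")  # ~13 novels
```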

Gemini 1.5 Pro offers similar quality but with much greater efficiency and less computational cost. Google claims major improvements across its whole design stack: architecture, data, optimization and systems. Again, this model isn't just about text: it can handle a mix of sounds, images, text, and code all at once. With these advancements, Gemini 1.5 Pro has shown promising results in handling up to 10 million tokens of data, opening new avenues for research into its capabilities and potential uses.

A key test the researchers apply is referred to as "multiple needles in the haystack". This is an advanced retrieval task in which the model must identify multiple specific pieces of information ("needles") hidden within a large and complex dataset ("haystack"). Gemini 1.5 Pro's ability to recall information accurately in this setting demonstrates its exceptional memory and retrieval capabilities. The task tests the model's limits in handling complex, long-context information retrieval, and the model shows significant improvements over previous models, maintaining high recall even with 100 different needles inserted into datasets of up to 1 million tokens. Gemini 1.5 Pro maintains high recall as the context length increases up to 10 million tokens, with only a slight decrease from 100% at lower token counts to 99.2% at 10 million tokens.

Here are the "haystacks," with red indicating failure and green indicating success.

Text Haystack: This figure compares Gemini 1.5 Pro with GPT-4 Turbo for the text needle-in-a-haystack task. Green cells indicate the model successfully retrieved the secret number, gray cells indicate API errors, and red cells indicate that the model response did not contain the secret number. (From the paper)
Video Haystack: This figure compares Gemini 1.5 Pro with GPT-4V for the video needle-in-a-haystack task, where the models are given video clips of different lengths up to three hours of video and are asked to retrieve a secret word embedded as text at different points within the clip. 
Audio Haystack: This figure presents the audio version of the needle-in-a-haystack experiment comparing Gemini 1.5 Pro and a combination of Whisper and GPT-4 Turbo. In this setting, the needle is a short segment of audio that is inserted within a very large audio segment (of up to 22 hours) containing concatenated audio clips. The task is to retrieve the "secret keyword" which is revealed in the needle.
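For a sense of how such a test is constructed, here is a minimal sketch of a multi-needle evaluation. The filler text, needle format, and sizes are illustrative, and `ask_model` is a hypothetical stand-in for a real model API, not something from the paper:

```python
import random

random.seed(0)
FILLER = "The grass is green. The sky is blue. The sun is bright. "

def build_haystack(n_words: int, needles: dict) -> str:
    """Pad with filler, then hide each needle at a random depth."""
    words = (FILLER * (n_words // 10 + 1)).split()[:n_words]
    for key, value in needles.items():
        pos = random.randrange(len(words))
        words.insert(pos, f"The secret word for {key} is {value}.")
    return " ".join(words)

def recall(needles: dict, answers: dict) -> float:
    """Fraction of needles whose secret word came back correctly."""
    return sum(answers.get(k) == v for k, v in needles.items()) / len(needles)

needles = {f"city-{i}": f"zephyr{i}" for i in range(100)}  # 100 needles
haystack = build_haystack(700_000, needles)  # very roughly 1M tokens of filler

# Hypothetical model call; substitute a real API to run the test:
# answers = {k: ask_model(haystack, f"What is the secret word for {k}?")
#            for k in needles}
# print(f"recall = {recall(needles, answers):.1%}")
```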

The demise of Retrieval-Augmented Generation (RAG) follows from the advances that models like Gemini 1.5 Pro introduce. Unlike its predecessor, Gemini 1.0 Pro, which relied on RAG to compensate for its limited context window by indexing and retrieving useful passages from an external database, Gemini 1.5 Pro eliminates the need for such complex retrieval mechanisms. Thanks to its larger context window, it can directly accommodate much longer material, streamlining the processing and understanding of extensive texts or datasets.

This capability marks a pivotal shift in how AI models handle and interpret large volumes of information, rendering the more cumbersome and less efficient RAG approach obsolete. As stated in the report, "In contrast, Gemini 1.5 Pro, due to its larger context window capable of accommodating much longer material, eliminates any need for additional" retrieval-based processes, demonstrating a significant leap in efficiency and effectiveness.
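The contrast is easiest to see side by side. Below is a sketch of the two pipelines; the helpers (`chunk`, `embed`, `similarity`, `generate`) are deliberately trivial stand-ins for a real embedding model and LLM API, assumptions for illustration only:

```python
def chunk(doc: str, size: int = 512) -> list:
    """Split a document into fixed-size word chunks."""
    words = doc.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text: str) -> set:
    return set(text.lower().split())          # stand-in: bag of words

def similarity(a: set, b: set) -> float:
    return len(a & b) / (len(a | b) or 1)     # Jaccard overlap

def generate(prompt: str) -> str:
    return f"<LLM answer for a {len(prompt.split())}-word prompt>"  # stub

def rag_answer(question: str, docs: list, k: int = 3) -> str:
    """Classic RAG: chunk, index, retrieve a few passages, prompt with those."""
    chunks = [c for d in docs for c in chunk(d)]
    q = embed(question)
    top = sorted(chunks, key=lambda c: similarity(q, embed(c)), reverse=True)[:k]
    return generate("\n".join(top) + "\nQ: " + question)

def long_context_answer(question: str, docs: list) -> str:
    """Long-context: skip the index and retriever entirely; send everything."""
    return generate("\n".join(docs) + "\nQ: " + question)

docs = ["Gemini 1.5 Pro supports a very large context window."] * 3
print(rag_answer("How large is the context window?", docs))
print(long_context_answer("How large is the context window?", docs))
```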

We like to keep an eye out for power laws, a marker of both scaling laws and complexity. And here they are again, in the improvement of prediction with longer contexts. The study found that the model's ability to predict the next word in a sentence (or the next piece of code) improves as it processes more and more data. The fact that this improvement follows a power law means the model gets significantly better at predicting as the amount of information increases, up to a point. Specifically, the researchers found that this pattern of improvement, where predictions get better as more information is considered, extends to very long texts and code: up to 1 million tokens for documents and 2 million tokens for code.

Interestingly, they observed that for even longer lengths, especially up to 10 million tokens in code, this pattern starts to deviate. This deviation could be due to specific characteristics of the data, like repetitive patterns in code that give the model extra help, making it even more accurate than the power law would suggest.
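A power law loss(n) = a * n**(-b) shows up as a straight line in log-log space, which is how you check for one. Here is a sketch with synthetic data; the actual fits in the paper are over cumulative negative log-likelihood against context length:

```python
import numpy as np

context_lengths = np.array([1e3, 1e4, 1e5, 1e6, 1e7])
loss = 3.0 * context_lengths ** -0.05   # synthetic data following a power law
loss[-1] *= 0.97                        # mimic the deviation at 10M tokens

# Fit log(loss) = log(a) - b * log(n): a straight line in log-log space.
slope, log_a = np.polyfit(np.log(context_lengths), np.log(loss), 1)
print(f"exponent b = {-slope:.3f}, coefficient a = {np.exp(log_a):.3f}")

# A residual below the fitted line at the longest lengths (like code at 10M
# tokens) means the model is doing better than the power law predicts.
residuals = np.log(loss) - (slope * np.log(context_lengths) + log_a)
print(residuals.round(4))   # the last entry is the most negative
```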

This behavior has real-world implications, as the researchers found when they went to translate English to Kalamang, a language with fewer than 200 speakers worldwide. They were surprised to find that, at the frontier, the model had developed new capabilities: when given a grammar manual for Kalamang the model had learned to translate English to Kalamang at a similar level to a person learning from the same content.
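In practice that setup is just prompt assembly: put the whole grammar manual in the context and ask. Here is a sketch, reusing the trivial `generate` stub from the RAG comparison above; the prompt wording and inputs are my assumptions, not the paper's actual protocol:

```python
def translate_with_manual(sentence: str, grammar: str, wordlist: str) -> str:
    """In-context learning: the entire teaching material rides in the prompt."""
    prompt = (
        "Here is a grammar manual and a wordlist for Kalamang.\n\n"
        f"GRAMMAR MANUAL:\n{grammar}\n\n"
        f"WORDLIST:\n{wordlist}\n\n"
        f"Translate this sentence into Kalamang: {sentence}"
    )
    return generate(prompt)  # stand-in for a long-context model call
```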

The introduction of Gemini 1.5 Pro demonstrates the next level of sophisticated and capable AI systems. The model's ability to handle unprecedented context lengths, its superior performance compared to its predecessors, and the sustained relevance of power laws in its design underscore the breadth and depth of Google's long-term capabilities.

I couldn't resist getting ChatGPT to make an image of a robot reading the new context window: ten volumes of War and Peace. Yes, of course there are more than ten volumes... ChatGPT can't count.

A robot reading one of the ten volumes of "War and Peace," surrounded by the other volumes in a cozy study room.
