AI Agents, Mathematics, and Making Sense of Chaos
Apple doesn’t publish a lot of research, so we tend to take notice when something suggests a strategic link.
Apple researchers recently published a paper describing a new architecture for vision models. The approach hints at a strategic imperative: integrating vision models deeply into spatial computing environments. That suggests a keen interest in how devices interpret and interact with the physical world around us, where models must adapt seamlessly to rapidly changing scenes and objects that leap from foreground to background in a heartbeat.
The main point of the research is that as the model gets bigger and learns from more data, it gets better at understanding images. This mirrors how large language models improve as they grow, potentially even exhibiting emergent reasoning. The relationship between size, data, and performance that has defined language models is now emerging in vision AI.
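As a rough illustration of that relationship (our sketch, not a formula from the Apple paper), scaling behavior is often summarized with a power law in which loss falls as parameter count and data grow; the constants below are placeholders, not fitted values.

```python
# Illustrative only: a generic power-law scaling form often used to describe how
# loss falls with model size (n_params) and training data (n_samples).
# The constants are placeholders, not values fitted to the AIM paper.
def scaling_loss(n_params: float, n_samples: float,
                 e: float = 1.7, a: float = 400.0, b: float = 410.0,
                 alpha: float = 0.34, beta: float = 0.28) -> float:
    """Predicted loss: an irreducible term plus terms that shrink with size and data."""
    return e + a / (n_params ** alpha) + b / (n_samples ** beta)

# A bigger model trained on the same data gets a lower predicted loss.
print(scaling_loss(6e8, 1e9))   # roughly a 600M-parameter model
print(scaling_loss(7e9, 1e9))   # roughly a 7B-parameter model
```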
What’s intriguing about this paper is the glimpse it gives of Apple’s possible direction: scalability, uncurated data, and models that learn from a wide variety of images without relying on object-centric datasets or accompanying text descriptions. That opens the door to more versatile learning from diverse image data, rather than being limited to curated, captioned collections.
The paper, "Scalable Pre-training of Large Autoregressive Image Models" by El-Nouby et al., introduces Autoregressive Image Models (AIM), which are vision models pre-trained with an autoregressive objective. The study presents several models, ranging from 600M to 7B parameters, and shows that higher capacity models achieve better performance.
The approach mirrors the technique used in language models, where a model predicts the next word in a sentence, an objective that has let language models learn complex patterns over long contexts. The same principle is now being applied to vision models, where it is far less established. That matters because it points to vision models reaching levels of complexity and understanding similar to what we have seen in language models.
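The next-word objective is simple enough to state in a few lines of code. Here is a toy illustration of the shifted-target loss behind causal language modeling; random logits stand in for a real model’s output.

```python
import torch
import torch.nn.functional as F

# The language-modeling objective in miniature: given tokens t_1..t_n,
# the prediction at position t is scored against the token at position t+1.
vocab_size, seq_len, batch = 1000, 8, 2
tokens = torch.randint(0, vocab_size, (batch, seq_len))
logits = torch.randn(batch, seq_len - 1, vocab_size)     # stand-in for a causal model's output
loss = F.cross_entropy(logits.reshape(-1, vocab_size),   # prediction at position t...
                       tokens[:, 1:].reshape(-1))        # ...compared with token t+1
print(loss.item())
```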
To understand how Apple's new approach differs, consider traditional vision models, which often use convolutional neural networks (CNNs) that process an image in small local sections and stacked layers to recognize patterns and features. CNNs have been effective, but scaling them up typically means adding more layers and depth, which strains training stability and computational efficiency, and they are less effective than transformers at capturing long-range dependencies, a disadvantage with large-scale or complex data.
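For reference, here is a minimal, generic CNN classifier in PyTorch, an illustration rather than any specific published model: each 3x3 convolution only mixes nearby pixels, so the network needs more depth before distant parts of an image can influence one another.

```python
import torch
import torch.nn as nn

# A minimal, generic CNN classifier (illustrative, not a specific published model).
# Each 3x3 convolution mixes information only within a small local window;
# capturing long-range structure requires stacking more layers.
class TinyCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                              # 224 -> 112
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                              # 112 -> 56
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                      # global average pool
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(1))

model = TinyCNN()
logits = model(torch.randn(2, 3, 224, 224))   # a batch of 2 RGB images
print(logits.shape)                           # torch.Size([2, 10])
```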
In contrast, Apple's approach, inspired by language models, uses autoregressive transformers: the model predicts subsequent parts of an image from the parts that came before, learning intricate patterns over larger contexts. This scales more gracefully as the model grows and sees more data, much like a large language model, and it points to a strategic shift towards more adaptable, scalable vision models, particularly relevant for spatial computing environments where dynamic, complex visual data is the norm.
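To make that concrete, here is a simplified sketch of next-patch prediction: split an image into a raster-ordered sequence of patches, run a causally masked transformer over them, and regress each patch's pixels from the patches that came before it. This is a loose paraphrase of the autoregressive idea, not the AIM paper's exact architecture or training recipe (positional embeddings and other details are omitted for brevity).

```python
import torch
import torch.nn as nn

# Sketch of autoregressive image pre-training: an image becomes a raster-ordered
# sequence of patches, and a causally masked transformer regresses each patch
# from the patches before it. Illustrative only, not the AIM paper's exact recipe.
class NextPatchPredictor(nn.Module):
    def __init__(self, patch_size=16, dim=256, depth=4, heads=8):
        super().__init__()
        self.patch_dim = 3 * patch_size * patch_size              # flattened RGB patch
        self.embed = nn.Linear(self.patch_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, self.patch_dim)                # predict the next patch's pixels

    def forward(self, patches):
        # patches: (batch, seq_len, patch_dim), in raster order
        causal = nn.Transformer.generate_square_subsequent_mask(patches.size(1))
        h = self.transformer(self.embed(patches), mask=causal)
        return self.head(h)                                       # output at position t targets patch t+1

def patchify(images, patch_size=16):
    # (batch, 3, H, W) -> (batch, num_patches, 3 * patch_size * patch_size)
    b = images.size(0)
    p = images.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    return p.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, 3 * patch_size * patch_size)

model = NextPatchPredictor()
patches = patchify(torch.randn(2, 3, 64, 64))            # 16 patches per image
pred = model(patches[:, :-1])                            # predict from all but the last patch
loss = nn.functional.mse_loss(pred, patches[:, 1:])      # each target is the following patch
print(loss.item())
```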
Vision AI has, of course, been a high priority for other major AI research labs. To reiterate, we think this paper from Apple is important because 1) it shows Apple innovating in the architecture of the core model, not just in compute architecture, and 2) it signals a strategic direction toward complex vision prediction. By way of further background, Big Tech has invested heavily in machine vision, including the application of transformers to vision. A summary of important Vision Transformer (ViT) papers and concepts:
Each lab has contributed to the evolution of ViT, showing its versatility in handling various vision tasks from basic image classification to complex hierarchical understanding of visual data.