AI Agents, Mathematics, and Making Sense of Chaos
Apple doesn’t publish a lot of research, so we tend to take notice when something suggests a strategic link.
Apple researchers recently published a paper describing a new architecture for vision models. The approach hints at a strategic imperative: integrating vision models deeply into spatial computing environments. That suggests a keen interest in how devices interpret and interact with the physical world around us, where models must adapt seamlessly to rapidly changing scenes and objects that leap from foreground to background in a heartbeat.
The main point of the research is that as the model gets bigger and learns from more data, it gets better at understanding images. This mirrors how large language models improve as they grow, potentially even exhibiting emergent reasoning. The relationship between size, data, and performance that has defined language models is now emerging in vision AI.
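As a rough illustration of that relationship (our sketch, not a formula from the Apple paper), scaling behavior is often summarized with a power law in which loss falls as parameter count and data grow; the constants below are placeholders, not fitted values.

```python
# Illustrative only: a generic power-law scaling form often used to describe how
# loss falls with model size (n_params) and training data (n_samples).
# The constants are placeholders, not values fitted to the AIM paper.
def scaling_loss(n_params: float, n_samples: float,
                 e: float = 1.7, a: float = 400.0, b: float = 410.0,
                 alpha: float = 0.34, beta: float = 0.28) -> float:
    """Predicted loss: an irreducible term plus terms that shrink with size and data."""
    return e + a / (n_params ** alpha) + b / (n_samples ** beta)

# A bigger model trained on the same data gets a lower predicted loss.
print(scaling_loss(6e8, 1e9))   # roughly a 600M-parameter model
print(scaling_loss(7e9, 1e9))   # roughly a 7B-parameter model
```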
What’s intriguing about this paper is the glimpse it gives of Apple’s possible direction: scalability, uncurated data, and models that learn from a wide variety of images without relying on object-centric datasets or accompanying text descriptions. That opens the door to more versatile learning from diverse image data, rather than being limited to curated, captioned collections.
The paper, "Scalable Pre-training of Large Autoregressive Image Models" by El-Nouby et al., introduces Autoregressive Image Models (AIM), which are vision models pre-trained with an autoregressive objective. The study presents several models, ranging from 600M to 7B parameters, and shows that higher capacity models achieve better performance.
The approach mirrors the technique used in language models, where a model predicts the next word in a sentence, an objective that has let language models learn complex patterns over long contexts. The same principle is now being applied to vision models, where it is far less established. That matters because it points to vision models reaching levels of complexity and understanding similar to what we have seen in language models.
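The next-word objective is simple enough to state in a few lines of code. Here is a toy illustration of the shifted-target loss behind causal language modeling; random logits stand in for a real model’s output.

```python
import torch
import torch.nn.functional as F

# The language-modeling objective in miniature: given tokens t_1..t_n,
# the prediction at position t is scored against the token at position t+1.
vocab_size, seq_len, batch = 1000, 8, 2
tokens = torch.randint(0, vocab_size, (batch, seq_len))
logits = torch.randn(batch, seq_len - 1, vocab_size)     # stand-in for a causal model's output
loss = F.cross_entropy(logits.reshape(-1, vocab_size),   # prediction at position t...
                       tokens[:, 1:].reshape(-1))        # ...compared with token t+1
print(loss.item())
```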
To understand how Apple's new approach differs, consider traditional vision models, which often use convolutional neural networks (CNNs) that process an image in small local sections and stacked layers to recognize patterns and features. CNNs have been effective, but scaling them up typically means adding more layers and depth, which strains training stability and computational efficiency, and they are less effective than transformers at capturing long-range dependencies, a disadvantage with large-scale or complex data.
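For reference, here is a minimal, generic CNN classifier in PyTorch, an illustration rather than any specific published model: each 3x3 convolution only mixes nearby pixels, so the network needs more depth before distant parts of an image can influence one another.

```python
import torch
import torch.nn as nn

# A minimal, generic CNN classifier (illustrative, not a specific published model).
# Each 3x3 convolution mixes information only within a small local window;
# capturing long-range structure requires stacking more layers.
class TinyCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                              # 224 -> 112
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                              # 112 -> 56
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                      # global average pool
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(1))

model = TinyCNN()
logits = model(torch.randn(2, 3, 224, 224))   # a batch of 2 RGB images
print(logits.shape)                           # torch.Size([2, 10])
```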
In contrast, Apple's approach, inspired by language models, uses autoregressive transformers: the model predicts subsequent parts of an image from the parts that came before, learning intricate patterns over larger contexts. This scales more gracefully as the model grows and sees more data, much like a large language model, and it points to a strategic shift towards more adaptable, scalable vision models, particularly relevant for spatial computing environments where dynamic, complex visual data is the norm.
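To make that concrete, here is a simplified sketch of next-patch prediction: split an image into a raster-ordered sequence of patches, run a causally masked transformer over them, and regress each patch's pixels from the patches that came before it. This is a loose paraphrase of the autoregressive idea, not the AIM paper's exact architecture or training recipe (positional embeddings and other details are omitted for brevity).

```python
import torch
import torch.nn as nn

# Sketch of autoregressive image pre-training: an image becomes a raster-ordered
# sequence of patches, and a causally masked transformer regresses each patch
# from the patches before it. Illustrative only, not the AIM paper's exact recipe.
class NextPatchPredictor(nn.Module):
    def __init__(self, patch_size=16, dim=256, depth=4, heads=8):
        super().__init__()
        self.patch_dim = 3 * patch_size * patch_size              # flattened RGB patch
        self.embed = nn.Linear(self.patch_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, self.patch_dim)                # predict the next patch's pixels

    def forward(self, patches):
        # patches: (batch, seq_len, patch_dim), in raster order
        causal = nn.Transformer.generate_square_subsequent_mask(patches.size(1))
        h = self.transformer(self.embed(patches), mask=causal)
        return self.head(h)                                       # output at position t targets patch t+1

def patchify(images, patch_size=16):
    # (batch, 3, H, W) -> (batch, num_patches, 3 * patch_size * patch_size)
    b = images.size(0)
    p = images.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    return p.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, 3 * patch_size * patch_size)

model = NextPatchPredictor()
patches = patchify(torch.randn(2, 3, 64, 64))            # 16 patches per image
pred = model(patches[:, :-1])                            # predict from all but the last patch
loss = nn.functional.mse_loss(pred, patches[:, 1:])      # each target is the following patch
print(loss.item())
```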
Vision AI has, of course, been a high priority for other major AI research labs. To reiterate, we think this paper from Apple is important because 1) it shows Apple innovating in the architecture of the core model, not just in compute architecture, and 2) it signals a strategic direction toward complex vision prediction. By way of further background, Big Tech has invested heavily in machine vision, including the application of transformers to vision. A summary of important Vision Transformer (ViT) papers and concepts:
Each lab has contributed to the evolution of ViT, showing its versatility in handling various vision tasks from basic image classification to complex hierarchical understanding of visual data.