a16z Podcast

What Comes After ChatGPT? The Mother of ImageNet Predicts The Future

Dec 5, 2025 · 20 min read

Fei-Fei Li, the creator of ImageNet, and her former PhD student Justin Johnson discuss the next frontier for AI: spatial intelligence.

They explore why our 3D world is a much harder problem for AI than language and what is missing from today's models, including a true understanding of physics.

Key takeaways

  • The AI development landscape is not a simple choice between open or closed models; it's a diverse ecosystem with different business strategies, similar to the iOS versus Android dynamic.
  • The primary challenge facing academic AI research is not the tension between open and closed development, but a severe lack of resources that prevents exploration of new ideas.
  • Academia's role in AI has evolved. It should no longer compete with industry on building the largest models, but instead focus on exploring 'wacky' new ideas, novel algorithms, and the theoretical foundations of AI.
  • Today's neural networks are fundamentally shaped by the hardware they run on, specifically GPUs optimized for matrix multiplication, but this design is approaching its scaling limits.
  • The 3D and 4D structure of the physical world is fundamentally different from the one-dimensional signal of language, even if the AI models share architectural components.
  • The theory of 'pixel maximalism' suggests that pixels offer a more complete representation of the world, as even language is perceived visually and tokenizing text loses information like font and layout.
  • Whether an AI's lack of true physical understanding matters depends on the application. For a movie backdrop, a plausible image is sufficient, but for designing a real building, it becomes a critical safety issue.
  • Instead of starting from scratch, AI models can be trained on data generated by traditional physics engines, effectively distilling the engine's knowledge into the neural network.
  • Technologies originally created for fun, like GPUs for gaming, often become foundational for serious scientific and technological advancements like the AI revolution.
  • The problem of 'data starvation' in robotics can be solved with synthetic simulated worlds, which can be generated by AI to train embodied agents in a controllable environment.
  • We underappreciate spatial intelligence because it feels effortless, but it's the result of 540 million years of evolution, whereas language is a much more recent development.
  • Language is a low-bandwidth, lossy channel for describing our experience of the world. The act of picking up a mug involves a rich stream of spatial data that cannot be fully captured by a verbal description.
  • AI models, even with vast data, can predict outcomes like planetary motion but cannot independently derive the abstract physical laws (like F = ma) that govern them.
  • Humans learn by actively forming, testing, and falsifying theories through interaction with the world, a fundamentally different and more efficient learning paradigm than the pattern recognition used by current AI.
  • You don't need to discard working components like attention when exploring new AI modalities; it's better to focus on solving one hard problem at a time.
  • Transformers are not inherently sequential models. They are natively models of sets, and their ability to process sequences comes from positional embeddings, not the core architecture.

The origin of World Labs and the rise of spatial intelligence

01:50 - 05:31

Justin Johnson and Fei-Fei Li co-founded World Labs, a partnership rooted in their shared academic history. Justin was Fei-Fei's student at Stanford's Computer Science department, joining her lab at a pivotal moment for AI. As Justin recalls, his start coincided with a significant breakthrough in the field.

The quarter that I joined your lab was the same quarter that AlexNet came out.

After his PhD, Justin's interests shifted towards 3D vision and generative modeling. Years later, he and Fei-Fei found they were independently contemplating the next frontier beyond large language models. They reconnected over a shared interest in spatial intelligence and world models, deciding to pool their efforts and start World Labs.

The timing for this venture is driven by the massive scaling of data and compute. The history of deep learning is closely tied to the history of scaling compute. While AlexNet represented a shift from CPUs to GPUs, the progress since has been exponential. Today, the computational power available for a single model has grown immensely.

The amount of compute that we can marshal today on a single model is about a million fold more than we could have even at the start of my PhD.

While language models have successfully utilized this scale, processing the vast amounts of visual and spatial data required for world models demands even more. This new level of available compute is what makes building sophisticated world models feasible now.

Open science and proprietary models are both part of the modern AI ecosystem

05:31 - 07:19

The question arises whether the model of public challenges, like ImageNet, still works, or if development should be centralized within private labs. Open science is still very important, even as AI has evolved from a niche computer science discipline into a civilizational technology. For example, Fei-Fei Li's Stanford lab recently announced an open dataset and benchmark called 'Behavior' for evaluating robotic learning in simulated environments. This is a clear effort to maintain the open science model, especially within academia.

However, the ecosystem is now a mixture of approaches, and much of the focused work in industry results in a product rather than an open challenge. Justin Johnson notes this is partly driven by the need for ROI, while Fei-Fei Li sees it as a reflection of the market's diversity, with different business models and strategies at play. She compares it to the difference between iOS and Android: "There are different business models... they're different plays."

The shifting role of academia in the age of large-scale AI

07:19 - 12:10

A key question today is whether a foundational open-source project like ImageNet could be created now, given the commercial pressures on AI labs. The incentives to publish a valuable dataset seem lower when so much money is at stake and PhD students are being pulled into private labs earlier.

Fei-Fei Li's main concern is not the pressure itself, but the imbalanced resourcing of academia. She has been advocating for more resources for public sector and academic AI work, including a national AI research cloud and data repository. She believes open datasets and benchmarks remain a crucial part of the ecosystem, pointing to ongoing projects in her Stanford lab. She notes that years ago, when Justin Johnson was her PhD student, computer vision simply didn't work that well, highlighting how much the field has progressed.

Justin adds that the role of academia in AI has fundamentally shifted over the last decade. This is not a bad thing; it is a sign that the technology has successfully matured and scaled. Five or ten years ago, state-of-the-art models could be trained in a university lab with a couple of GPUs. That is no longer possible.

The expectations around what we should be doing as academics shift a little bit. And it shouldn't be about trying to train the biggest model and scaling up the biggest thing. It should be about trying wacky ideas and new ideas and crazy ideas, most of which won't work.

Justin worries that too many in academia are still trying to pretend they can compete on model size, or are treating their programs as vocational training for jobs at big labs. He believes academia's strength now lies in exploring new algorithms, architectures, and the theoretical underpinnings of these large models. Fei-Fei agrees, adding that the real problem is that academia is so severely under-resourced that researchers lack the means to pursue these critical, blue-sky ideas.

Imagining a future for AI hardware beyond GPUs

12:10 - 14:14

Justin often pitches an idea to his students about the relationship between hardware and neural network design. The neural networks we use today, like transformers, are fundamentally based on matrix multiplication because that operation fits perfectly with how GPUs work. However, this hardware design is not going to scale infinitely.
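To make that concrete, here is a minimal sketch of a single attention head in PyTorch (the sizes are arbitrary): nearly every step is a matrix multiplication, which is precisely the operation GPUs are built to accelerate.

```python
import torch

# Minimal sketch: one attention head is essentially matmuls plus a softmax.
d = 64
q = torch.randn(128, d)            # 128 query tokens
k = torch.randn(128, d)            # 128 key tokens
v = torch.randn(128, d)            # 128 value tokens

scores = (q @ k.T) / d**0.5        # matmul: similarity of every token pair
weights = scores.softmax(dim=-1)   # the only step that is not a matmul
out = weights @ v                  # matmul: blend values for each query
```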

We are already seeing the limits of this approach. The basic unit of computation is no longer a single device but an entire cluster of them. Despite this reality, we still discuss and code neural networks as if they are monolithic entities running on a single GPU in PyTorch, when in practice they are distributed across thousands of devices.
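A hedged illustration of that mismatch: the PyTorch sketch below reads as if one model lives on one device, while the DistributedDataParallel wrapper quietly spreads the work across every process in the job. The model and sizes are placeholders, and it assumes a standard torchrun launch.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Launch with: torchrun --nproc_per_node=<num_gpus> this_script.py
dist.init_process_group("nccl")                  # one process per GPU
device = int(os.environ["LOCAL_RANK"])

model = torch.nn.Linear(4096, 4096).to(device)   # written as if single-device
model = DDP(model, device_ids=[device])          # replicated across all processes

x = torch.randn(8, 4096, device=device)
model(x).sum().backward()                        # gradients all-reduced under the hood
```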

This raises a critical question: as hardware continues to scale out, are there other computational primitives besides matrix multiplication better suited to large-scale distributed systems? Justin thinks drastically different architectures could emerge to fit the hardware of the next 10 or 20 years. It is a difficult bet to make, partly because of the 'hardware lottery': the idea that research directions win because they happen to fit existing hardware, which would suggest Nvidia has already won and the future is just about scaling its technology.

However, the numbers tell a different story. Even the transition from Nvidia's Hopper to Blackwell chips shows that performance per watt is plateauing. While the number of transistors and power usage increase, we are hitting a scaling limit. This indicates there is room to do something new. This kind of foundational research isn't suited for a fast-paced startup but is a perfect long-range problem for academia, requiring years of dedicated exploration to achieve a breakthrough.

The story behind neural image captioning

14:14 - 21:20

The work on neural image captioning began as an exploration beyond ImageNet object recognition. Fei-Fei Li had a long-term dream of enabling AI to tell the story of an image in natural language, a problem she initially thought might take a century to solve. The key idea, developed with Andrej Karpathy, was to combine two emerging technologies: Convolutional Neural Networks (ConvNets) as a powerful way to represent images, and LSTMs, an early sequential model for language. By training these models together, they hoped to match captions with images.

This led to their first image captioning paper in 2015. The system used a ConvNet to represent an image and an LSTM to model language, successfully generating a single sentence caption. They soon discovered that Google had been simultaneously and independently working on the same problem. New York Times reporter John Markoff broke the story, covering both research efforts.

Justin Johnson joined the project after being impressed by a presentation on this new work. He and Andrej first collaborated on a language modeling paper where they trained RNNs on datasets like the Linux source code. They analyzed the network's internal state, discovering units within the LSTM that appeared to track programming syntax, such as one that would activate for an open parenthesis and deactivate for a closed one.

The team then pushed beyond single-sentence captions to a concept they called "dense captioning." The goal was to describe different parts of a scene in greater detail. This resulted in a complex system that could draw boxes around multiple interesting objects in an image and write a short description for each one. The network processed the entire image for context, proposed individual regions to focus on, and then generated text for each region, all in a single forward pass.

Fei-Fei noted that Justin went a step further than just publishing the paper by building a real-time web demo. Justin recalled carrying his laptop around a conference in Santiago, showing a live feed from his webcam. The video was streamed to a server at Stanford, which ran the dense captioning model and streamed the results back. Despite being slow, the fact that the live demo worked at all was an amazing feat at the time.

Pixel maximalism: Is language just another visual signal?

21:20 - 23:27

A question was raised about whether vision and language modeling are truly that different, especially given recent experiments like DeepSeek-OCR that model text directly from pixels. Fei-Fei Li argues that they are different. While generative AI architectures may share components, she believes the 3D and 4D spatial structure of the world is fundamentally different from a one-dimensional signal like language.

Justin Johnson presents an alternative view called "pixel maximalism." This perspective suggests that language is not as distinct as we might think. Humans perceive language visually by reading text with our eyes, which are like biological pixels. Even sound can be visualized as a 2D signal. From this viewpoint, text is a physical object in the world that we see.

When you translate to these purely tokenized representations that we use in LLMs, you lose the font, you lose the line breaks, you lose sort of the 2D arrangement on the page. For a lot of things, maybe that doesn't matter, but for some things it does. And I think pixels are this sort of more lossless representation of what's going on in the world, and in some ways a more general representation that more matches what we humans see.

The process of tokenizing text for large language models actually discards information like font, line breaks, and layout. Pixels, therefore, might be a more complete and general representation of the world, closer to how humans actually perceive it. While there might be an efficiency argument against rendering text as an image to feed a vision model, the fact that this approach has shown some success suggests its potential.
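As a toy illustration of the contrast (the token IDs below are invented, and the sketch assumes Pillow is installed), the same sentence can reach a model either as a short list of integers or as a grid of pixels that still carries font and layout:

```python
from PIL import Image, ImageDraw

text = "Pixels are a more general representation."

# Tokenized view: a handful of integers; font, spacing, and layout are gone.
fake_token_ids = [9203, 389, 257, 517, 2276, 10552, 13]  # illustrative only

# Pixel view: the same sentence rendered onto a small grayscale canvas.
img = Image.new("L", (320, 32), color=255)
ImageDraw.Draw(img).text((4, 8), text, fill=0)   # default bitmap font
pixels = list(img.getdata())                     # 320 * 32 intensity values
print(len(fake_token_ids), "tokens vs", len(pixels), "pixels")
```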

AI models mimic physical patterns without causal understanding

23:29 - 29:11

A key challenge for world models is understanding the hidden forces behind what they see. One paper illustrated this by feeding orbital patterns into an LLM. While the model could predict a planet's orbit, it failed to correctly draw the underlying force vectors, showing it didn't grasp the physics. This highlights a fundamental question: how do you get a model to learn these hidden dynamics?

There are two main approaches. One is to be explicit by feeding the model data from physics simulations, directly teaching it about forces. The other approach is to hope that an understanding of physics emerges latently as the model trains on a more general, end-to-end problem. However, there's no guarantee that this latent learning will lead to an understanding of causal laws. Today's deep learning is still fundamentally about fitting patterns, which is where it diverges from human intelligence. It learns to fit specific patterns, like orbits, without developing a causal model of gravity.

This raises the question of whether it matters if a model truly "understands" the world. For some use cases, it might not. If the goal is to generate a plausible backdrop for a film, looking correct is all that matters. But for an architect using AI to design a building, a genuine understanding of structural forces is critical to prevent a collapse. Fei-Fei Li notes that models don't understand in the human sense; they simply learn from data patterns.

Justin Johnson explains that AI represents a different kind of intelligence. Humans infer understanding in others because we can introspect and assume others have a similar internal process. This isn't possible with AI.

These models are this alien form of intelligence where they can exhibit really interesting behavior. But whatever kind of internal cognition or internal self reflection that they have, if it exists at all, is totally different from what we do.

When asked if different models would be needed for visual tasks versus physics-based ones, the speakers expressed hope that it's a matter of scaling data and improving a single model. The big challenge is achieving emergent capabilities beyond the training data. The hope is that, at scale, models will learn to implicitly understand forces without being explicitly trained on them.

Using game physics engines to train AI models

29:11 - 30:33

When building new AI models, there is a question of whether to rely on existing physics engines, many of which were developed by the gaming industry. The case for building new models is that traditional physics engines are not perfect and do not solve problems with the generality that is sometimes required; if they were perfect, there would be no need for new models.

However, that does not mean starting from scratch. A common approach is to use traditional physics engines to generate training data, effectively distilling the engine's logic into the weights of the neural network, as the sketch below illustrates. It is speculated that recent models like Sora and Genie 3 might have used a similar technique.

This points to a recurring theme: technologies invented for fun, like video games and graphics chips (GPUs), eventually find their way into serious work. The entire AI revolution was partially enabled by repurposing GPUs, originally designed for rendering graphics, for general-purpose computation.
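Returning to the distillation idea, here is a minimal sketch with a toy analytic formula standing in for a real physics engine; the data sizes, network, and training budget are all illustrative assumptions.

```python
import torch

def engine(v0, t):
    """Toy 'physics engine': height of a projectile launched straight up."""
    g = 9.81
    return v0 * t - 0.5 * g * t**2

# Generate supervised data by querying the engine, not the real world.
v0 = torch.rand(10_000, 1) * 20        # launch speeds in m/s
t = torch.rand(10_000, 1) * 2          # query times in seconds
y = engine(v0, t)

net = torch.nn.Sequential(
    torch.nn.Linear(2, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for _ in range(2_000):                 # the engine's behavior gets baked
    opt.zero_grad()                    # into the network's weights
    loss = torch.nn.functional.mse_loss(net(torch.cat([v0, t], dim=1)), y)
    loss.backward()
    opt.step()
```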

Marble is a glimpse into the future of spatial intelligence

30:33 - 37:43

Marble is the first public glimpse into a new type of model focused on spatial intelligence. Fei-Fei Li explains that the grand vision is to create models that can understand, reason, and generate in a multimodal fashion, enabling interaction as complex as human interaction with the physical world. Marble is the first step on this journey.

Justin Johnson describes it as a generative model for 3D worlds. It accepts inputs like text or images and generates a corresponding 3D world. Crucially, it's also interactive. A user can generate a scene and then edit it, for example, by changing a water bottle's color or removing a table. While Marble is a step toward the long-term vision of spatial intelligence, it was also intentionally designed to be a useful product today, with emerging applications in gaming, VFX, and film.

A key differentiator is the ability to record within a scene, which requires precise camera control. Fei-Fei notes this is a natural outcome of a model that understands 3D space. This contrasts sharply with video generation models, where users must learn a director's vocabulary to steer the camera through prompts rather than control it precisely.

In Marble you have precise control in terms of placing a camera.

The fundamental unit of generation in Marble is the Gaussian splat. Justin explains that these are tiny, semi-transparent particles with a specific position and orientation in 3D space. A scene is built from a large number of these splats. Their main advantage is that they can be rendered efficiently in real-time, even on an iPhone, which is what enables the precise camera control. However, this is not the only possible approach. Future models might generate frames one at a time or use tokens that represent chunks of a 3D world.
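For intuition, here is a minimal sketch of what one splat might carry; the field names are illustrative, and real formats (such as the .ply exports popularized by 3D Gaussian Splatting) store more, notably spherical-harmonic color coefficients.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Splat:
    position: np.ndarray  # (3,) center of the Gaussian in 3D space
    rotation: np.ndarray  # (4,) orientation as a unit quaternion
    scale: np.ndarray     # (3,) per-axis extent (how stretched the blob is)
    opacity: float        # transparency used when blending overlapping splats
    color: np.ndarray     # (3,) RGB; production formats use SH coefficients

# A scene is simply a large collection of these, rendered by projecting
# each Gaussian to the screen, depth-sorting, and alpha-blending.
scene: list[Splat] = []
```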

Looking ahead, the team is exploring how to integrate physics into these worlds. One idea is to attach physical properties like mass to each splat and then run a physics simulation on top of them. This highlights the composable nature of working in a 3D environment, where logic and new features can be injected at different stages.

Two approaches for simulating 3D scene interactions

37:43 - 38:21

There are two main ways to simulate interactions in a 3D scene. One method involves predicting the 3D properties of every object and then using a classical physics engine to simulate how they interact. The other approach is to have the model regenerate the entire scene in response to a user's action, using a representation like splats. This second method is potentially more general, because it is not limited to the physical properties a classical engine can explicitly model, but it is also much more computationally demanding. This area of dynamic interaction is a promising field for future work.

Device performance limits the fidelity of Gaussian splats

38:21 - 39:21

The density and resolution of Gaussian splats face certain limitations depending on the target use case. One of the biggest constraints is the computational power of the device where the scene will be rendered. For example, rendering cleanly on mobile devices or in VR headsets presents a challenge due to their limited compute resources.

If the goal is to render a splat file at a high resolution of 30 to 60 frames per second on an iPhone from four years ago, there are significant limits on the number of splats you can use. However, these constraints can be relaxed on more powerful hardware. Newer devices like a recent iPhone, a MacBook, or a machine with a local GPU can handle more splats. This increased capacity allows for higher resolution and more detailed scenes. The constraints can also be loosened if the performance targets, such as 60fps at 1080p, are not as strict.
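A back-of-envelope sketch of that trade-off, where every number is an illustrative assumption rather than a measured cost:

```python
target_fps = 60
frame_budget_ms = 1000 / target_fps                 # ~16.7 ms per frame
render_cost_ns_per_splat = 5                        # assumed blend/sort cost
max_splats = frame_budget_ms * 1e6 / render_cost_ns_per_splat

print(f"~{max_splats:,.0f} splats fit in a {frame_budget_ms:.1f} ms frame")
# Halving the fps target or doubling per-splat throughput doubles the budget.
```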

Marble's emergent use cases in robotics and design

39:21 - 42:50

While the initial focus is on creative industries, the technology behind Marble has significant potential for embodied use cases like robotic training. Robotics currently suffers from "data starvation." High-fidelity, real-world data is critical but scarce, and internet video data lacks the necessary controllability for training agents. Synthetic, simulated data offers a vital middle ground. Marble can help generate these complex, simulated worlds for training embodied agents, providing a solution to this data bottleneck.

The vision for Marble is as a horizontal technology that can be applied across many different industries over time. One adjacent area to creative work is design, such as architecture and interior design. An interesting emergent use case is home remodeling. For example, a user can plan a kitchen remodel by capturing images of their current space, reconstructing it in Marble, and then experimenting with different finishes.

Who wants to use Marble to plan your next kitchen remodel? It actually works great for this already. Just take two images of your kitchen, reconstruct it in Marble, and then use the editing features to see what that space would look like if you change the countertops or change the floors or change the cabinets.

This is an example of an emergent capability that arises from a powerful, general-purpose technology. In fact, early beta users are already building applications for interior design.

Spatial intelligence is our foundation for understanding the world

42:50 - 49:47

Spatial intelligence is not in opposition to traditional or linguistic intelligence, but is complementary to it. Drawing from psychologist Howard Gardner's theory of multiple intelligences, Fei-Fei Li defines spatial intelligence as the capability to reason, understand, move, and interact in space. This form of intelligence is fundamental to both groundbreaking discoveries, such as Watson and Crick's double-helix model of DNA, and everyday actions.

A lot of that had to do with the spatial reasoning of the molecules and the chemical bonds in a 3D space to eventually conjecture a double helix. That ability that humans or Francis Crick and Watson had done is very, very hard to reduce that process into pure language.

Similarly, the simple act of picking up a mug is a deeply spatial process. It involves seeing the mug, its context, and coordinating hand movements to grasp it. While one can narrate this action, the language itself is insufficient to perform the task. Language is a low-bandwidth channel compared to the rich, high-bandwidth experience of interacting with the physical world. Speaking for 24 hours straight generates only about 215,000 tokens, a fraction of the data processed through sensory experience.
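That figure checks out as rough arithmetic: at a typical speaking rate of about 150 words per minute, 150 × 60 × 24 ≈ 216,000 words over a full day, on the order of one token per word.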

Justin Johnson notes that our ability to formalize concepts like gravity, as Newton did, stems from a foundation of embodied experience. Large Language Models (LLMs) have jumped straight to the highest forms of abstract reasoning found in language, potentially missing the foundational understanding that comes from spatial interaction. Re-emphasizing spatial intelligence is like opening up that black box to see what was lost by skipping the embodied step.

Fei-Fei suggests that as a vision scientist, she finds spatial intelligence is often underappreciated precisely because it feels effortless to humans. We are born with the ability to see and link perception with movement, abilities that nature has optimized over 540 million years. In contrast, language, which requires conscious effort to learn, has only been developing for roughly half a million years.

Something that nature spent way more time actually optimizing, which is perception and spatial intelligence, is underappreciated by humans.

Why AI struggles to derive the abstract laws of physics

49:47 - 56:28

Language models struggle with concepts of spatial intelligence and physical impossibilities, like understanding that one object cannot fall through another it rests upon. This is because their world model is based on sequences of words, not an internal three-dimensional representation of reality. This raises the question of how to instill spatial intelligence into these models. The consensus is that language models will not be thrown out completely, but will work in concert with other multimodal systems.

An interesting thought experiment is whether an AI, given vast amounts of astrophysical data, could independently derive the laws of Newtonian physics. Fei-Fei Li suggests it probably could not. While a model might become very accurate at predicting the trajectory of a planet, it would likely fail to discover the underlying abstract principles. She explains:

But F equals MA. Or action equals reaction. That's just a whole different abstraction level that's beyond just today's LLM.

This highlights a key difference between AI and human learning. Justin Johnson points out that the human objective is to understand the world and thrive. We do this by constantly building theories about the world around us, interacting with it, and updating our understanding when our expectations are not met. This process of forming and falsifying hypotheses, when scaled up, leads to major scientific discoveries. Fei-Fei agrees, calling this a more efficient learning method where experiments eliminate impossible worlds to arrive at the correct one. This same mechanism underlies theory of mind, where we form hypotheses about what others are thinking. Today's AI models do not engage in this kind of active, theory-driven learning, which also limits their potential for emotional intelligence.

Transformers are models of sets, not sequences

56:28 - 58:41

When exploring new modalities in AI, it is not necessary to discard everything that currently works. Fei-Fei Li suggests focusing on one hard problem at a time rather than fixing things that are not broken, like the attention mechanism. While new architectures beyond sequence-to-sequence models are likely to emerge, it is important to understand the current technology properly.

Justin Johnson clarifies a common technological confusion about transformers. Unlike earlier architectures such as recurrent neural networks (RNNs), which are inherently sequential, transformers are not models of sequences. They are natively models of sets.

A transformer is actually not a model of a sequence of tokens. A transformer is actually a model of a set of tokens.

The sequential nature of a standard transformer comes entirely from the positional embeddings applied to the tokens. Without these embeddings, the model would not know the order of the input. All the internal operations are either token-wise (like feed-forward networks and projections) or permutation equivariant, like the attention mechanism. This means if the input tokens are shuffled, the output is simply shuffled in the same way, which is a characteristic of an architecture that operates on sets.
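A small sketch makes this concrete: with no positional embeddings, shuffling the input tokens of a self-attention layer simply shuffles the output in the same way. The PyTorch module and sizes here are arbitrary choices for the demo.

```python
import torch

torch.manual_seed(0)
attn = torch.nn.MultiheadAttention(embed_dim=16, num_heads=2, batch_first=True)
attn.eval()                                # disable dropout for an exact check

x = torch.randn(1, 5, 16)                  # 5 tokens, no positional encoding
perm = torch.randperm(5)

out, _ = attn(x, x, x)                     # self-attention over the set
out_perm, _ = attn(x[:, perm], x[:, perm], x[:, perm])

# Permutation equivariance: permuting inputs permutes outputs identically.
assert torch.allclose(out[:, perm], out_perm, atol=1e-5)
```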