Dr. Fei-Fei Li, known as the "godmother of AI," breaks down the history of the deep-learning revolution and the key ingredients that made it possible.
She explains why current AI falls short of human intelligence and looks to the future, where "world models" will give machines the spatial reasoning they currently lack.
Key takeaways
- There's nothing artificial about AI. It is inspired by, created by, and most importantly, impacts people.
- Modern AI was sparked by the realization that the primary bottleneck was not the algorithms, but the lack of large-scale data for training them—a principle inspired by human learning and evolution.
- The birth of modern AI in 2012 was sparked by a golden recipe: the combination of big data (ImageNet), neural network algorithms, and powerful GPUs.
- As recently as 2016, some major tech companies avoided using the term 'AI' because they feared it was a 'dirty word'.
- The 70-year history of AI is the result of collective work from generations of researchers, not the achievement of a few celebrated individuals.
- The term AGI is more of a marketing label than a scientific definition; the true north star of the field remains the original goal of creating machines that can think and act like humans.
- Current AI falls short of human intelligence in key areas. It struggles with simple tasks a toddler could do and lacks the creativity to replicate scientific breakthroughs like Isaac Newton's, showing that simply scaling current methods is not enough.
- Vision can be understood through Plato's Allegory of the Cave: its purpose is to infer a 3D reality from 2D projections, much like a prisoner must decipher the real world from shadows on a wall.
- Spatial reasoning is crucial for major breakthroughs, like the discovery of DNA's double helix from a 2D image. AI with spatial intelligence can augment human capabilities in science, design, and robotics.
- Unlike language models where input and output data (text) are perfectly aligned, robotics faces a fundamental mismatch. The goal is to produce physical actions, but training data like web videos lacks this explicit action information.
- Robots are physical systems, making them more complex than even self-driving cars. A self-driving car is a 2D system designed not to touch things, while a robot is a 3D system designed specifically to touch and interact with its environment.
- A successful deep tech company can be built with a relatively small team of around 30 people by integrating research, engineering, and product development.
- A critical gap often exists between technology creators in hubs like Silicon Valley and policymakers, highlighting the need for institutions to bridge this divide for responsible innovation.
- No technology should take away human dignity. Human agency must be at the heart of how AI is developed, deployed, and governed.
- Everyone has a role in AI, regardless of their profession. Individuals can embrace AI as a tool to enhance their work and use their voice to shape its application in society.
Fei-Fei Li on the human-centric nature of AI
Fei-Fei Li is often called the godmother of AI for her work that helped end the 'AI winter'. In the mid-2010s, many tech companies avoided using the term AI, unsure if it was a 'dirty word'. By 2017, however, companies started identifying themselves as AI companies.
Fei-Fei emphasizes that the future of AI is up to us. While technology is a net positive for humanity, it is also a double-edged sword. If not handled correctly, it can be misused.
There's nothing artificial about AI. It's inspired by people, it's created by people, and most importantly, it impacts people.
Her breakthrough insight was that for machines to think like humans, they needed the massive amounts of data humans use to learn. She chose to focus on visual intelligence because humans are deeply visual creatures. The challenge is that objects are difficult for machines to learn: a single object can appear in infinitely many ways in an image, so training computers on tens of thousands of object concepts requires showing them millions of examples.
From the AI winter to ImageNet: A brief history of AI
When considering the future of AI, the most important step is for every individual to act responsibly. Whether involved in development, deployment, or application, everyone should care about the technology's impact. AI will affect every individual life, community, and society, making personal responsibility the essential foundation for its future.
For many, the story of AI began with ChatGPT. However, the field has a long history that predates this recent explosion. The period before the current boom is often referred to as the "AI Winter," a time when many researchers nearly gave up on the idea. The spark that led out of this winter was a shift in perspective on what AI needed to thrive.
Fei-Fei Li, who has spent her entire professional life in AI, reflects on this journey. The field's origins trace back to the 1950s and 60s, with thinkers like Alan Turing posing foundational questions about thinking machines even earlier. The term "artificial intelligence" was coined in 1956. Early AI focused on logic and expert systems. A major shift occurred in the late 1980s and 1990s with the rise of machine learning, which combined computer programming with statistical learning. This introduced a critical idea: purely rule-based programs were insufficient. Instead, machines needed to learn patterns from data to generalize their knowledge, similar to how a human who sees three cats can recognize any cat.
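The contrast between rule-based programs and learning from data can be sketched with a toy example (purely illustrative, not from the episode): a nearest-neighbor classifier labels a new example by its closest labeled example, generalizing from a handful of instances the way the three-cats analogy suggests.

```python
# Toy illustration: a 1-nearest-neighbor classifier "learns" a concept
# from labeled examples instead of hand-written rules, then generalizes
# to an unseen example. The 2D "features" are hypothetical.
import math

def nearest_neighbor(train, query):
    """Return the label of the training example closest to `query`."""
    features, label = min(train, key=lambda ex: math.dist(ex[0], query))
    return label

# Hypothetical features (e.g. ear pointiness, whisker length).
train = [
    ((0.9, 0.8), "cat"),
    ((0.8, 0.9), "cat"),
    ((0.1, 0.2), "dog"),
    ((0.2, 0.1), "dog"),
]

# An unseen animal near the cat cluster is recognized as a cat --
# no explicit "rules for cat-ness" were ever written down.
print(nearest_neighbor(train, (0.85, 0.75)))  # cat
```

The point of the sketch is only the shift in perspective: the program's behavior comes from the data it was shown, not from rules an engineer wrote.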
When Fei-Fei started her PhD in 2000, the field was still in the AI winter. Funding and public interest were low. Two key ideas guided her work and helped ignite the modern AI revolution. First, she chose to focus on visual intelligence, believing that human intelligence is deeply tied to our ability to see and understand the world. Her goal was to solve object recognition, a fundamental building block of perception. Second, she identified the most significant pain point in the field: a lack of data. While researchers were focused on creating better mathematical models, she realized that both human learning and evolution are big data processes. She and her students conjectured that this was the critically overlooked ingredient needed to bring AI to life. This insight led to the creation of the ImageNet project in 2006, an ambitious effort to collect a massive dataset of images to train AI models.
How ImageNet created the golden recipe for modern AI
The creation of ImageNet began with the ambitious goal of mapping all the objects on the internet. At the time, this felt achievable because the internet was much smaller. A team of graduate students and a professor curated 15 million images from the internet, creating a taxonomy of 22,000 concepts based on WordNet, a lexical database built by linguists. This massive dataset, ImageNet, was then open-sourced to the research community, along with an annual challenge to encourage participation.
The year 2012 marked a pivotal moment, often considered the birth of modern AI. Researchers from Toronto, led by Professor Geoff Hinton, participated in the ImageNet challenge. They used ImageNet's big data, two Nvidia gaming GPUs, and a neural network algorithm. This combination resulted in a huge leap forward in solving the problem of object recognition.
That combination of the trio technology, big data, neural network and GPU was kind of the golden recipe for modern AI.
This same recipe is still at the core of AI today, powering models like ChatGPT. The ingredients are the same, just at a much larger scale: internet-scale data (mostly text), more complex neural network architectures, and hundreds of thousands of powerful GPUs instead of just two. This foundational work continues to inspire companies like Scale AI, which focus on providing massive amounts of labeled data for AI labs.
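At miniature scale, the three ingredients of the recipe can be sketched in a few lines (a toy illustration under stated assumptions, not real training code): a tiny labeled dataset stands in for ImageNet, a single sigmoid neuron for the neural network, and a plain Python loop for GPU compute.

```python
# Toy sketch of the "golden recipe" at miniature scale. Real systems
# differ in scale (millions of images, deep networks, thousands of
# GPUs), not in kind.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Ingredient 1: a (tiny) labeled dataset, input x -> label y.
data = [(0.0, 0), (0.2, 0), (0.8, 1), (1.0, 1)]

# Ingredient 2: a single-neuron "network" with weight w and bias b.
w, b = 0.0, 0.0

# Ingredient 3: compute -- here a plain loop; at scale, GPUs.
for _ in range(5000):
    for x, y in data:
        p = sigmoid(w * x + b)
        grad = p - y            # gradient of the cross-entropy loss
        w -= 0.5 * grad * x     # gradient-descent update
        b -= 0.5 * grad

# After training, a high input is confidently classified positive.
print(round(sigmoid(w * 0.9 + b)))  # 1
```

Scaling each ingredient, more data, deeper networks, more compute, is exactly what separates this toy from AlexNet in 2012 and from today's frontier models.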
When AI was considered a dirty word
The perception and use of the term "AI" have changed dramatically in less than a decade. Fei-Fei Li recalls that around 2015 and 2016, some tech companies were hesitant to use the term, worrying that "AI was a dirty word." In contrast, she was always encouraging its use.
To me, that is one of the most audacious questions humanity has ever asked in our quest for science and technology. And I feel very proud of this term.
The turning point in Silicon Valley came around 2017, which marked the beginning of companies proudly calling themselves "AI companies." This is a stark contrast to today, where it seems almost every company wants to be known as an AI company.
AI was built by generations of researchers
The history of AI is built upon the work of many people, not just a few recognized individuals. Fei-Fei Li explains that AI is a field that is now 70 years old and has progressed through generations of researchers. She acknowledges that while our culture, especially in Silicon Valley, tends to assign achievements to a single person, it's important to remember the countless heroes and researchers who have contributed.
AI is a field of, at this point, 70 years old, and we have gone through many generations. No one could have gotten here by themselves.
No single person could have advanced the field to where it is today on their own. The progress is a result of a long and collaborative history.
AI's limitations highlight the need for new breakthroughs
The term Artificial General Intelligence, or AGI, is often used, but it lacks a clear scientific definition. Fei-Fei Li views it as more of a marketing term. She believes the true north star for the field of AI remains the original, audacious question: can machines think and do things in the way humans can? From this perspective, there isn't a significant distinction between AI and AGI. The goal has always been to achieve this level of intelligence.
While the current approach of using more data, more GPUs, and larger models has led to progress, it is not sufficient to reach this goal. Fei-Fei asserts that more innovation is desperately needed. She points out that AI is one of the youngest scientific disciplines, and we are still just scratching the surface of what is possible. There are many simple things that AI still cannot do, such as counting the number of chairs in a video of a room, a task a toddler could easily perform.
Furthermore, AI lacks higher-level human capabilities like creativity, abstraction, and emotional intelligence. For instance, even with all the modern data on celestial bodies, current AI cannot derive the 17th-century equations for the laws of motion that Isaac Newton did.
That level of creativity, extrapolation, abstraction, we have no way of enabling AI to do that today.
Similarly, today's conversational bots cannot replicate the emotional and cognitive intelligence of a teacher having a nuanced conversation with a student about their motivations and passions. The gap between current AI and human-level intelligence remains vast, indicating that significant breakthroughs are still required.
Moving beyond language models to spatial intelligence
While large language models were incredibly inspiring, Fei-Fei Li saw the need to push AI beyond language. Humans use a sense of spatial intelligence to navigate the world, a faculty that is often non-verbal. She points to a chaotic first responder scene as an example. The way people organize themselves to rescue others and control the situation relies on movement, a spontaneous understanding of objects, and situational awareness. Language is only a part of that; it cannot, by itself, put out a fire.
This led to her focus on "world models," which she sees as the linchpin connecting visual intelligence, robotics (embodied AI), and language. It's about giving AI a sense of spatial intelligence. To advance this work, she founded a company called World Labs. A world model is a foundation that allows anyone to create, interact with, and reason within a generated world.
A simple way to understand a world model is that this model can allow anyone to create any worlds in their mind's eye by prompting, whether it's an image or sentence, and also be able to interact in this world, whether you're browsing and walking or picking objects up or changing things, as well as to reason within this world.
This technology is a key missing piece for embodied AI, making functional robots a more tangible reality. However, its applications extend far beyond robotics. Humans, as embodied agents, can also be augmented by this spatial intelligence, much like LLMs assist with language tasks today. Other applications include creating infinitely playable games, aiding in design, and accelerating scientific discovery. Fei-Fei highlights the discovery of DNA's double helix structure, where Watson and Crick had to reason in 3D from a flat 2D X-ray photo. AI-assisted spatial intelligence could be critical for future breakthroughs of this kind.
Why the bitter lesson of AI may not apply to robotics
The "bitter lesson" in AI, a concept from Turing Award winner Richard Sutton, suggests that simpler models with vast amounts of data ultimately outperform more complex models with less data. Fei-Fei Li notes that while this lesson was sweet for her work on ImageNet, its application to robotics is not straightforward. The primary challenge is that robotics lacks the perfect alignment found in language models.
Language models had this perfect setup where their training data are in words, eventually tokens, and then they produce a model that outputs words. So you have this perfect alignment... But robotics is different. You hope to get actions out of robots, but your training data lacks actions in 3D worlds.
This misalignment means researchers must find ways to supplement data, using methods like teleoperation or synthetic data. Furthermore, robots are physical systems, more akin to self-driving cars than to a large language model. This physicality introduces immense complexity. The journey of self-driving cars, from a DARPA challenge in 2005 to today's street-legal vehicles, has taken nearly 20 years and is still incomplete. Yet, the problem for self-driving cars is simpler.
They're just metal boxes running on 2D surfaces and the goal is not to touch anything. A robot is a 3D thing running in a 3D world and the goal is to touch things.
While deep learning is accelerating the development of the robotic "brain," the hardware, supply chains, and use cases also need to mature. This work fosters a deep appreciation for biological intelligence.
We operate on about 20 watts. That's dimmer than any light bulb in the room I'm in right now. And yet we can do so much. So I think actually the more I work in AI, the more I respect humans.
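The data-alignment gap described above can be made concrete with toy data structures (illustrative only, not from the conversation): language data is self-aligned because inputs and targets live in the same token space, while web video lacks the action labels a robot policy must output, which is why teleoperation is used to record them.

```python
# Toy sketch of the data-alignment gap between language models and robots.

# Language modeling: next-token prediction, so input and output are
# both tokens -- the training data is perfectly aligned with the task.
text = ["the", "cat", "sat"]
lm_pairs = [(text[:i], text[i]) for i in range(1, len(text))]
# e.g. (['the'], 'cat'), (['the', 'cat'], 'sat')

# Web video: observations only -- there are no actions to imitate.
web_video = [{"frame": t} for t in range(3)]
assert all("action" not in obs for obs in web_video)

# Teleoperation (hypothetical log format): a human operator's commands
# are recorded alongside observations, producing the
# (observation, action) pairs a robot policy actually needs.
teleop_log = [({"frame": t}, {"gripper": "close" if t == 2 else "open"})
              for t in range(3)]
print(lm_pairs[0], teleop_log[-1])
```

The contrast is the whole point: for text, the internet is both the question and the answer; for robots, someone has to supply the answers, via teleoperation, synthetic data, or simulation.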
Introducing Marble: A model that generates 3D worlds from prompts
Fei-Fei Li introduced Marble, one of the first products from her new company, World Labs. World Labs is a frontier foundation-model company focused on spatial intelligence and world modeling, which the team believes is as important as language modeling. The company has four co-founders with deep technical backgrounds in AI, computer graphics, and computer vision.
Marble is an application built on their frontier models. After more than a year of development, they created the world's first generative model capable of outputting genuinely 3D, navigable worlds from simple prompts. Users can input a sentence or images to create immersive worlds they can move through, and even walk around in using goggles.
So many creators, designers, people who are thinking about robotic simulation, people who are thinking about different use cases of navigable, interactable, immersive worlds, game developers will find this useful. So we developed Marble as a first step. It's the world's first model doing this. And it's the world's first product that allows people to just prompt. We call it prompt to worlds.
The host, who has used the app, described it as "insane," mentioning the ability to create and explore a "shire world" or a "dystopian world." A particularly interesting feature discussed was the appearance of dots that form the world before it fully renders. The host found this delightful, as it offers a glimpse into the model's process. Fei-Fei revealed this was not part of the core model but an intentional visualization feature designed to guide people into the world. The team was pleased to hear that this design choice, which the host compared to The Matrix, enhanced the user experience.
The diverse and unexpected applications of Marble
The technology of world modeling is very horizontal, leading to some exciting and unexpected use cases. For example, it is being used in virtual production for movies to create 3D worlds that can be aligned with cameras, allowing actors to interact with the environment seamlessly. In a collaboration with Sony, this technology cut production time by a significant margin.
This has cut our production time by 40x. We only had one month to work on this project and there were so many things they were trying to shoot. So using Marble really, really significantly accelerated the production.
Other applications are emerging in gaming, where users export scenes and meshes for VR and other games. In robotics, it addresses the major pain point of creating diverse synthetic data for training. Instead of humans building every asset, simulations can generate varied environments for robots to learn from. An unexpected use case came from a psychology team that wanted to study how patients' brains respond to different immersive scenes, like messy or clean rooms. Creating these experimental environments is typically slow and expensive, but this technology provides an almost instantaneous solution.

This pattern of releasing AI tools early to discover their best applications is common. The head of ChatGPT, for instance, scanned TikTok to see how people were using it, which helped guide its development. Similarly, the potential for this technology in exposure therapy, for things like fears of heights or spiders, is already being considered.
Spatial intelligence is deeper than video generation
The core thesis of World Labs is the fundamental importance of spatial intelligence, which goes far beyond simply generating videos. Fei-Fei Li uses Plato's Allegory of the Cave to illustrate this point. In the allegory, a prisoner is tied to a chair, watching shadows projected on a cave wall. Their task is to figure out the real, 3D world that is creating those 2D projections. This is the essence of vision: making sense of a 3D or 4D world from 2D information.
Spatial intelligence is the ability to create, reason with, and interact with a deeply spatial world. This is what differentiates World Labs' model, Marble, from other video AI tools that produce final, flat videos. Marble provides creators and developers with worlds that have 3D structure, allowing them to use these environments for their work.
The way I see it is it's a platform for a ton of opportunity to do stuff. Videos are just like, here's a one off video that's very fun and cool and that's it. And you move on.
While World Labs' technology can generate real-time video, Marble's primary function is to serve as a platform. For instance, a creator could enter a 3D world like a hobbit cave, move a virtual camera along a specific trajectory as a director would, and then export that sequence as a video. This gives them a level of creative control that is not possible with simple video generation.
What it takes to build a deep tech product
When asked what it takes to create their product, Fei-Fei Li explained that it requires a lot of brainpower. The team consists of about 30 people, who are predominantly researchers and research engineers. However, they also have designers and product specialists. The company's philosophy is to be anchored in the deep tech of spatial intelligence while also building serious products.
We want to create a company that's anchored in the deep tech of spatial intelligence, but we are actually building serious products. So we have this integration of R and D and productization.
In addition to the team's expertise, the project relies on a significant amount of computational power, using a large number of GPUs.
The surprising intensity of the AI landscape
When asked what she wishes she knew before starting her company 18 months ago, Fei-Fei Li reflects on the current state of AI. She believes a key advantage for her team is seeing the future of technology earlier than most. However, the pace of change is still astonishing.
Despite her extensive experience founding initiatives at Google and Stanford, which she feels prepared her more than a younger founder might be, she was still surprised by the reality of the AI landscape. The intensity of the competition is a constant source of pressure and even paranoia.
I'm still surprised and it puts me into paranoia sometimes how intensely competitive the AI landscape is, from the model, the technology itself, as well as talents.
This competition is fierce for both the technology itself and, crucially, for talent. The cost of hiring certain experts has risen to levels she did not anticipate when founding the company, requiring her to remain constantly alert.
An intellectually fearless approach to building a career
When asked about the common thread in her career that placed her at the center of major AI breakthroughs, Fei-Fei Li points to two guiding forces. The first is a deep-seated curiosity and passion for AI, which she calls her "scientific North Star." The second, and equally important, is a quality she cultivates in herself and looks for in others: being intellectually fearless.
When you want to make a difference, you have to accept that you're creating something new or you're diving into something new. People haven't done that. And if you have that self-awareness, you almost have to allow yourself to be fearless and to be courageous.
This fearlessness guided her to leave a near-certain tenured position at Princeton to join Stanford. She was drawn by the remarkable people and the Silicon Valley ecosystem, accepting the risk of restarting her tenure clock to become the first female director of the Stanford AI Lab (SAIL). Her move to Google was similarly motivated by the opportunity to work with incredible minds. She emphasizes that her decisions are driven by focusing on the mission and the people, rather than dwelling on potential negative outcomes.
I don't overthink of all possible things that can go wrong because that's too many. I feel like that's an important element. It's not focusing on the downside, focusing more on the people, the mission. What gets you excited? What do you think? Curiosity.
Advice for young AI talent: Focus on passion and mission
Fei-Fei Li offers advice to the young engineers and researchers in AI. She observes that many young talents tend to analyze every single small aspect of a job offer. She finds herself in a mentoring role, encouraging them to concentrate on what is truly important.
Maybe the most important thing is where's your passion? Do you align with the mission? Do you believe and have faith in this team and just focus on the impact and you can make and the kind of work and team you can work with?
The AI space is filled with news, hype, and pressure, which can cause stress. Her advice is to focus on what will actually make you feel fulfilled in your work; that matters more than identifying the fastest-growing company or guessing who will ultimately win.
Building a human-centered framework for AI at Stanford
Fei-Fei Li co-founded the Human-Centered AI Institute (HAI) at Stanford in 2018. The idea came to her during a sabbatical at Google, where she realized that AI was going to be a 'civilizational technology'. This prompted her to advocate for a guiding framework for AI development, one anchored in human benevolence and centeredness.
She felt that Stanford, situated in the heart of Silicon Valley, should be a thought leader in this area. HAI has since become the world's largest AI institute focused on human-centered research, education, and policy. It involves hundreds of faculty from all eight schools at Stanford, supporting interdisciplinary research in fields from medicine and business to engineering and humanities.
A major focus for HAI is policy. When the institute started, Fei-Fei noticed that Silicon Valley was not communicating with policymakers in Washington D.C. or Brussels. To bridge this gap, HAI created programs like a Congressional boot camp and the AI Index report. They have also actively participated in policymaking, such as advocating for a national AI research cloud bill and contributing to state-level AI regulatory discussions. Fei-Fei remains a leader at the institute, driven by the mission to ensure AI is created and used in the right way.
Everyone has a role to play in AI
Fei-Fei Li addresses a question she frequently hears from people across all professions, from musicians and teachers to nurses and farmers: Do they have a role in AI, or will it simply take over their lives? Her answer is a resounding yes. She finds that Silicon Valley often tosses around terms like "infinite productivity" or "infinite leisure time" without a heartfelt connection to people. At its core, AI is about people.
No technology should take away human dignity. And the human dignity and agency should be at the heart of the development, the deployment as well as the governance of every technology.
She encourages people to see AI as a tool that can augment their unique skills and passions. For a young artist, AI can be a new medium to tell their unique story. For a farmer near retirement, their role is crucial as a citizen who should have a voice in how AI is applied in their community. For overworked nurses, AI can offer support through technologies like smart cameras and robotic assistance, helping them care for an aging society.
