Waymo Co-CEO Dmitri Dolgov explains the 20-year journey of bringing fully autonomous ride-hailing to cities across the globe.
He breaks down the complex AI models and specialized sensors needed to build a vehicle that safely navigates without any human help.
This technical evolution shows how self-driving technology is moving from scientific research into a scalable reality that could soon redesign our cities.
Key takeaways
- Waymo uses a teacher-student architecture where massive off-board models are distilled into smaller, faster versions capable of running locally on the car.
- Real-time driving inference is handled entirely on the vehicle hardware to ensure safety while the cloud is used for non-critical tasks like detecting lost items.
- Simple end-to-end models can drive well in normal conditions, but they fall short of the safety required for fully autonomous operation without human assistance.
- Autonomous systems benefit from intermediate representations like road signs and object locations to make simulation more efficient and add real-time safety layers.
- Full autonomy requires augmenting end-to-end architectures with structured representations to handle the complexities of scaling.
- Simulation for pure end-to-end models is difficult because creating a pixel-perfect view of every scenario is far harder than working with intermediate representations.
- The transition from many small models to a single AI backbone made autonomous driving systems more generalizable and easier to scale.
- A unified AI core allows for parallel deployment across multiple cities instead of iterating on a per-location basis.
- The next generation of autonomous vehicles is moving away from driver-centric designs toward passenger-centric layouts that prioritize interior space and ease of access.
- Waymo's sixth generation hardware is designed to be significantly cheaper and simpler, bringing the cost of self-driving sensors closer to high-end consumer driver assist systems.
- Lidar provides high-resolution 3D mapping while radar excels in adverse weather like fog and rain where cameras and lasers might struggle.
- Self-driving systems combine data from cameras, lidar, and radar simultaneously rather than switching between them to ensure a consistent view of the environment.
- Foundational AI models allow autonomous vehicles to exhibit emergent behaviors, such as detecting hidden pedestrians through subtle sensor reflections under a bus.
- Using intermediate world representations is more effective for handling complex safety scenarios than relying on purely imitative black box systems.
- Driver assist systems and full autonomy are qualitatively different technologies, and it is a mistake to view them as incremental steps on the same spectrum.
- Waymo's operational efficiency has reached a point where they can launch in four new cities in a single day, a task that previously took years of preparation.
- Widespread autonomous driving could reclaim massive amounts of urban land currently wasted on parking lots and garages, allowing cities to be redesigned for people instead of idle cars.
- Autonomous driving is deceptively easy to start but exponentially hard to finish, as every additional decimal point of reliability requires ten times more effort.
- While breakthroughs like Transformers reshape the development curve, they are not silver bullets that eliminate the inherent complexity of navigating the physical world.
- The Russian school of physics and applied math provides a rigorous foundational education that has fueled a global diaspora of top engineering talent.
Dmitri Dolgov on his path from Russia to Waymo
Dmitri grew up in the Soviet Union where his father worked as a physicist. His family moved to Japan for a year before settling in Berkeley. Even after obtaining a green card in the United States, Dmitri chose to return to Russia in 1994. He wanted to pursue his bachelor's and master's degrees in physics and applied math at a Russian technical school.
I was pretty excited about where Russia is and the trajectory it's on. And, being young and naive, I was like, there's no turning back.
The Russian school of math and science provided a rigorous foundation. However, Dmitri recognized that the United States was the best place for graduate studies in computer science. This path led him into an engineering culture that has produced many successful founders in the global tech industry. When analyzing how a Waymo vehicle functions, it is important to distinguish between real-time inference and the broader technical architecture. The inference part handles immediate decisions, but it represents only a fraction of the entire system.
The architecture of the Waymo driver
The Waymo driver functions as an ecosystem built on a sophisticated sensor suite. Dmitri explains that the vehicle uses cameras, lidar, and radar to maintain 360-degree awareness. This data feeds into a specialized AI system. Encoders process the sensor data while a generative decoder determines the driving actions. All real-time driving decisions happen locally on the vehicle. Cloud processing is reserved for non-essential tasks like checking if a passenger left a mess or a phone in the car.
The development process relies on a large off-board foundation model. This model understands the physical world and the social norms of driving. The foundation model is then specialized into three high-capacity teachers: the Waymo driver, a simulator for synthetic environments, and a critic that evaluates behavior. These massive models are then distilled into smaller, faster versions for real-time use.
The way we think about building the Waymo driver, it starts with a large off-board foundation model. I can imagine building a big model that understands how the physical world works and understands the important properties of what it means to drive. Then we specialize it into three main off-board teachers. Those then get distilled into smaller models that you can run inference on faster.
The simulator and the on-car driver share a fundamental understanding of how objects relate and how they might move in the future. The critic identifies interesting events. It determines what constitutes good or bad behavior. By distilling these expert teachers into student models, the system maintains high-level intelligence. This allows it to meet the speed requirements of a moving vehicle.
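The teacher-student pattern Dmitri describes can be illustrated with a toy distillation loss. This is a hypothetical sketch, not Waymo's training code: the three-action space, the logits, and the temperature are all assumed for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw scores into a probability distribution.
    A higher temperature softens the distribution, exposing the
    teacher's soft judgments about near-miss actions."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q): how far the student distribution q
    diverges from the teacher distribution p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits over three driving actions: [brake, hold, accelerate]
teacher_logits = [4.0, 1.0, -2.0]   # large off-board model's scores
student_logits = [3.0, 1.5, -1.0]   # smaller on-car model's scores

T = 2.0  # distillation temperature (assumed hyperparameter)
teacher_probs = softmax(teacher_logits, T)
student_probs = softmax(student_logits, T)

# The distillation loss the student would minimize during training.
loss = kl_divergence(teacher_probs, student_probs)
```

In a real pipeline the student's objective would blend this matching term with task losses, but the core idea is the same: the small on-car model is trained to reproduce the large teacher's soft outputs rather than learning from scratch.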
The iterative evolution of self-driving technology
The path to autonomous driving was not a series of dead ends. Instead, it was a process of iterative learning and evolution. Dmitri points out that technological breakthroughs in AI and compute were necessary. While general architectures like Transformers are powerful, they are not a silver bullet. Their success in driving depends on how they are applied specifically to that domain.
Architecture is important. But really a lot of it comes down primarily to your metrics, to your evaluation mechanisms, to all of the training recipes, and of course, new data.
Success with models like Transformers often involves creating the right representations for a specific domain. Large language models perform well with text and code because those areas are already textual. For autonomous driving, the challenge is creating the right tokens and training recipes that allow these general architectures to work effectively.
End-to-end models in autonomous driving
End-to-end models in autonomous driving allow gradients to propagate through all layers so every part learns representations for the final task. A simple version involves feeding pixels in and getting car actions out. This approach is easy to start with. For instance, you can take a Vision Language Model (VLM), normally used for images and text, and fine-tune it to generate driving trajectories. While this can produce surprisingly good driving in normal cases, it falls far short of the safety levels required for a driverless car.
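One way a text-oriented decoder can "speak" driving is to tokenize trajectories. The sketch below is purely illustrative, with an assumed grid resolution and token offset; it only demonstrates that continuous waypoints can round-trip through a discrete vocabulary that a language-model-style decoder could emit one token at a time.

```python
# Discretize continuous (x, y) waypoints into integer tokens.
# GRID and OFFSET are assumed values chosen for illustration.
GRID = 0.5     # metres per bin
OFFSET = 100   # shift so all token ids are non-negative

def waypoint_to_tokens(x, y):
    """Map a continuous waypoint to a pair of discrete token ids."""
    return (round(x / GRID) + OFFSET, round(y / GRID) + OFFSET)

def tokens_to_waypoint(tx, ty):
    """Invert the mapping: token ids back to metres."""
    return ((tx - OFFSET) * GRID, (ty - OFFSET) * GRID)

# A short hypothetical trajectory ahead of the vehicle, in metres.
trajectory = [(0.0, 0.0), (1.5, 0.5), (3.0, 1.0)]
tokens = [waypoint_to_tokens(x, y) for x, y in trajectory]
decoded = [tokens_to_waypoint(tx, ty) for tx, ty in tokens]
```

A real system would also need tokens for heading, speed, and time, which is part of what creating the right tokens for the driving domain means.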
Driving is similar to a conversation. Just as Large Language Models model dialogue through words, driving involves social interactions and body language between agents on the road. Context and history matter. To reach superhuman safety, passive observation is not enough. Systems need closed-loop training, such as reinforcement learning based fine-tuning. This allows the system to explore different situations and receive reward signals to stay on track.
If you think about the hard parts of driving, it is not unlike having a conversation. What makes driving hard is this multi agent, social interactive part of it. If I do something that is going to affect you, it is going to affect somebody else. Context matters, semantics matters. But it is in a different language, a body language, if you will.
Relying solely on pixels for simulation and training is inefficient because the space is too high-dimensional. Dmitri notes that incorporating intermediate representations helps bridge this gap. These are structured concepts we know are correct, such as the location of objects, road boundaries, and speed limits. These structured elements provide more control for simulation and add essential safety validation layers in real time.
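A minimal sketch of what such a real-time safety layer could look like. The types, fields, and safety margin below are all hypothetical; the point is only that structured facts like speed limits and object locations can gate a proposed action.

```python
from dataclasses import dataclass

@dataclass
class TrackedObject:
    kind: str          # e.g. "pedestrian", "vehicle"
    distance_m: float  # distance ahead along our planned path

@dataclass
class Scene:
    speed_limit_mps: float       # known-correct structured fact
    objects: list                # list[TrackedObject]

def safety_gate(proposed_speed_mps, scene, min_gap_m=5.0):
    """Validate a planned speed against the structured scene.
    min_gap_m is an assumed safety margin for this sketch."""
    # Never exceed the posted speed limit.
    speed = min(proposed_speed_mps, scene.speed_limit_mps)
    # Stop if anything is too close ahead.
    for obj in scene.objects:
        if obj.distance_m < min_gap_m:
            return 0.0
    return speed

scene = Scene(speed_limit_mps=13.4,  # roughly 30 mph
              objects=[TrackedObject("pedestrian", 3.0)])
```

Here the gate overrides whatever the learned planner proposed, which is the sense in which structured elements add a validation layer on top of an end-to-end model.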
Balancing end-to-end models and structured representations
Achieving full autonomy at scale requires more than just intent. It involves augmenting end-to-end architectures with specific structures to bridge the gap between simulation and the real world. Simulating environments for a pure end-to-end model is difficult because creating a pixel perfect view is much harder than working with intermediate representations. Dmitri explains that both are necessary for a complete system.
Having an end-to-end architecture that is augmented with that structure allows you to play in both of those worlds. It is easier to deal in intermediate representations rather than coming up with a pixel perfect view of the world. You need both.
This hybrid approach allows a system to benefit from the strengths of both methods. By using an architecture that includes structured data, autonomy systems can handle complex tasks more effectively than they could with a single, unaugmented model.
The shift from autonomous research to global scaling
Self-driving cars aim for more than just reaching a destination. While safety is the top priority, the system must also be smooth and predictable to fit into the social ecosystem of the road. This involves navigating complex human interactions, such as managing drop-offs and pickups. These moments require understanding context, like knowing when it is acceptable to double-park briefly or how to avoid blocking a driveway.
For drop-offs, you're absolutely right, there are a few things that are maybe not obvious when you just think about this problem. It's understanding where you want to go and making it as convenient as possible for you. And pickups versus drop-offs, it's not exactly symmetrical.
The challenges differ between city streets and freeways. City driving involves constant nuance, while freeways are more structured but carry higher risks due to speed. High-speed environments contain rare but dangerous events, such as debris falling from trucks or multi-car accidents. Dmitri notes that the core technology for these scenarios is now largely complete. The focus has shifted from basic research to scaling the technology globally.
Moving into new markets like London or Tokyo requires specialization rather than reinventing the core system. While computers easily adapt to driving on the opposite side of the road, environmental factors like extreme cold present harder challenges. Cold weather affects the entire system, requiring specialized hardware like sensors with heating elements and advanced motion control for slippery surfaces.
The approach to scaling has evolved through different generations of technology. Early efforts in Chandler, Arizona, focused on the end-to-end experience in a controlled environment. The current fifth-generation system was designed to handle a much broader operating domain, including high-complexity areas in San Francisco and Phoenix.
When we chose to deploy in the hardest parts of San Francisco, hardest parts of Phoenix, we made a big jump on the hardware side. And most importantly on the software, the AI side.
Building a generalizable AI backbone for autonomous driving
The shift from the fourth to the fifth generation of driving technology represented a major leap forward. While the previous version relied on many small, individual machine learning models, the fifth generation made a significant bet on using AI as the central backbone of the system. This change made the technology far more generalizable.
We made a much bigger bet and jump to AI as the backbone for the fifth generation. AI as the backbone, as the core engine.
Dmitri explains that this architectural shift is the reason the system can now scale across different parts of the United States simultaneously. By building a core engine that handles the complexity of driving more holistically, the technology is no longer limited by the constraints of the older, fragmented approach.
Waymo's shift to passenger-centric vehicle design
While self-driving cars have made massive software progress, most vehicles on the road are still derivatives of consumer cars with steering wheels. However, the hardware is starting to catch up with the software. Dmitri discusses the arrival of the sixth generation Waymo vehicle, which is a custom platform built specifically for autonomous driving. This new design moves away from the traditional driver-focused layout to prioritize the passenger experience. It features sliding doors, a flat floor, and a much more spacious interior that allows passengers to fully stretch out.
We put a lot of thought into moving away from a car that is designed around the driver to a car that is designed around the passenger. It is much more spacious and it is happening. It is not open to the public yet, but I took a ride in it the other day fully autonomously.
The new vehicle offers a living room feel inside despite having a physical footprint that is barely larger than the current Jaguar I-PACE models. This shift in design allows for a more intuitive interface for passengers and easier entry and exit from the vehicle. Although these custom vehicles are not yet available to the public, they are already being tested on the road and represent the next phase of the autonomous experience.
The evolution of Waymo's sixth generation hardware
Waymo is currently operating at a scale of roughly 25 million rides a year using the Jaguar I-PACE. While this fleet relies on retrofitted vehicles, the core value for riders comes from safety, predictability, and the privacy of not sharing a space with a human driver. Transitioning to a specialized car is an optimization of the experience rather than a change to the fundamental value proposition. It makes sense to de-risk the software and the driving stack before committing to the massive investment of a custom vehicle.
The sixth generation of the Waymo driver represents a significant shift. While the software remains consistent across platforms, the hardware is entirely new. This sixth generation is simpler, more capable, and costs a fraction of its predecessor. The goal was to reach a price point comparable to modern high-end driver assist systems. Dmitri notes that the software is designed to be generalizable, allowing it to move between different vehicle platforms like the upcoming Hyundai Ioniq and various sensor configurations.
The self-driving hardware they are putting on the vehicle is the sixth generation. It is very different from the fifth generation. It is simpler, it is more capable, it is much lower cost. It is a fraction of the cost. It is comparable to what you would get with a fancy ADAS system nowadays.
Reducing costs involves riding the maturity curves of three sensing modalities: cameras, radars, and lidars. While cameras are a mature technology, radars have evolved from bulky airplane equipment to affordable automotive components. Imaging radars, which provide much richer data, are following a similar downward cost trajectory. Lidars are also becoming more predictable and affordable as the industry learns from previous generations to optimize and simplify the manufacturing process.
The complementary roles of lidar and radar
Lidar and radar are highly complementary sensors in the development of self-driving technology. Both systems work by sending out signals and measuring what bounces back, but they operate at very different frequencies. Lidar uses millions of laser pulses per second to sample the 3D structure of the world with extreme precision. This allows for high-resolution, fine-grained mapping of the environment.
Laser gives you very high resolution. You can think of it as a laser beam that goes out, spins around, it shoots out millions of these laser pulses per second. Each one comes back and you are sampling the 3D structure of the world with very high resolution.
Radar offers lower resolution but excels in situations where physics might hinder other sensors. It performs significantly better in adverse weather like fog, snow, or heavy rain because it is not occluded by particulates. In conditions where a camera might be blind, such as dense fog on a freeway, radar provides clear returns for vehicles that would otherwise be invisible.
The self-driving system does not simply choose one sensor over another depending on the weather. Instead, it uses encoders for cameras, lidar, and radar to jointly create the best possible view of the world. While a camera might degrade in pitch darkness or when facing direct sunlight, Dmitri notes that lidar remains completely unaffected by lighting conditions. Each sensor has unique noise characteristics, and the system combines them to maintain a reliable understanding of the surroundings.
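A drastically simplified way to picture this joint use of sensors is confidence-weighted fusion. Real systems fuse learned features inside neural encoders, not scalar ranges, so everything below, including the sensor confidence values, is assumed purely for illustration.

```python
def fuse(estimates):
    """Confidence-weighted average of per-sensor range estimates.
    estimates: list of (range_m, confidence) pairs, one per modality."""
    total_conf = sum(c for _, c in estimates)
    return sum(v * c for v, c in estimates) / total_conf

# Daytime: all three modalities agree and are trusted.
day = [(25.1, 0.90),   # camera
       (25.0, 0.95),   # lidar
       (25.4, 0.60)]   # radar

# Pitch darkness: the camera's estimate is noisy and its confidence
# collapses, so lidar and radar carry the fused result.
night = [(40.0, 0.05),  # camera, essentially blind
         (25.0, 0.95),  # lidar, unaffected by lighting
         (25.4, 0.60)]  # radar
```

Even when the camera's estimate goes wildly wrong in darkness, its low weight keeps the fused range close to the lidar and radar values, which is the intuition behind combining sensors rather than switching between them.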
How foundational models enable emergent behavior in driving
The current phase of autonomous vehicle development is shifting toward rapid global expansion. Dmitri expresses excitement about a future where someone can fly into any major city and take a Waymo anywhere they need to go. This scaling is supported by significant progress in foundational AI models and world models. These models simplify the system and reduce costs while providing unexpected performance gains.
Investing in the architecture and data of these foundational models at an early stage leads to massive amplification. Dmitri describes the sensation of seeing emergent behavior in the car as exhilarating. In one instance in San Francisco, a Waymo vehicle detected a pedestrian who was completely obscured by a bus. The car slowed down and gave the person space before they even stepped into view.
The first time I looked at that log, I thought, what is going on here? We have pretty good sensors and the software is very capable, but we do not see through stuff. Radar should not be able to go through a massive metal box. You cannot see through the windows because of reflections and people on the bus. I could not actually believe it.
Investigation revealed that the car's peripheral lidar sensors had bounced signals under the bus. These sensors picked up noisy reflections of the pedestrian's feet. The AI models interpreted this data to identify a likely pedestrian and predict their movement. Achieving this level of performance with a simple black-box imitative system would be incredibly difficult. This capability demonstrates the value of using intermediate representations to boost the performance of the entire model.
The operational scale and technical evolution of Waymo
Waymo is scaling rapidly, currently operating about 3,000 cars and delivering half a million rides per week. This volume translates to over four million fully autonomous miles every week across 11 U.S. cities. While it took eight years to move from the first autonomous ride to external riders in four cities, the pace has accelerated significantly. Recently, the company launched service in four new cities in a single day, marking a major milestone in operational maturity.
How long did it take us from the first time we started fully autonomous rider-only operation to the first time we had external riders in four cities? That's about eight years. And then just the other week we launched four in one day.
There is a fundamental difference between driver assist systems and full autonomy. It is often deceptive to view them as incremental steps on the same spectrum. Building a system where no human is behind the wheel requires a qualitative jump in technology rather than just adding features to existing driver aids. While cars will continue to get smarter and sensors will become cheaper and more integrated, the path to true self-driving involves tackling problems that driver assist systems simply do not encounter. Eventually, this technology might transition from ride-hailing fleets to personal vehicles, though no specific timeline exists for consumers to purchase their own Waymo.
The infrastructure behind a self-driving fleet is an increasingly automated, orchestrated dance. When cars need charging or cleaning, they automatically navigate back to a depot. While tasks like plugging in a charger or cleaning a cabin currently involve human intervention, the systems managing these needs are highly efficient. For example, a car can flag that it needs cleaning via its sensor dome, signaling a worker to step in. Future developments, such as inductive charging or robotic plug-ins, may further automate these operational tasks, depending on which method proves most cost-effective.
The future of autonomous transit and urban design
Riders using autonomous vehicles tend to treat the space with respect, perhaps because the absence of a driver makes the car feel like their own private environment. While behavior remains generally positive, the nature of the service fluctuates depending on the location and time, such as in a busy college town on a weekend night.
I talked about not having a person in the car. It is not somebody else's car. In some ways you kind of want to preserve the nice aspects of it. Because it is not somebody else's space, you are in it, it feels like it is your own. So you do not want to mess up your own space.
The reach of this technology will eventually extend to every address, though the business model may shift based on population density. In remote areas where a standing fleet of ride-hailing cars does not make commercial sense, the technology might instead be integrated into personally owned vehicles. This ensures that even in places with low trip density, people can still benefit from autonomous driving without waiting for a car to be deployed from a distance.
As autonomous traffic becomes the majority, the efficiency of our roads will improve significantly. Human drivers often create traffic jams through abrupt movements or slow reactions to obstacles that occurred hours earlier. Autonomous systems operate with a smoothness that allows traffic waves to clear much faster. Beyond the roads, the urban landscape will transform as the need for parking diminishes. Dmitri points out that a huge fraction of valuable land is currently dedicated to storing idle vehicles. Reclaiming this space could allow cities to replace parking lots and garages with outdoor seating or community spaces.
Imagine what you can do with your favorite city in the world if you do not have to spend that money, that huge fraction of it, on just keeping these chunks of metal sitting around.
The long road to autonomous driving
Waymo represents one of the most significant long-term bets at Google. The project required immense stamina and conviction from leadership like Larry Page and Sergey Brin to stay the course over many years. While some might wonder if the project started too early, the complexity of self-driving technology necessitates long iterative cycles. New technological waves, from ImageNet in 2013 to modern Transformers and Large Vision Models, have fundamentally changed how the system is built, but they are not total solutions.
It is super hard to go the full distance and get the edge-case domain. There is the standard engineering rule of thumb that every next nine takes 10x more effort. There is no magical moment where the true complexity of the problem goes away and then you can just take some off-the-shelf components and you are a business.
Autonomous driving is a field that is deceptively easy to enter but incredibly difficult to master. While AI breakthroughs drastically reshape the early part of the development curve, there are no silver bullets for the physical world. The nature of the problem remains the same: the effort required to reach the final stages of safety and reliability grows exponentially with every decimal point of progress.
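The "every next nine takes 10x more effort" rule of thumb is easy to put into numbers. The sketch below treats the effort for the first nine as one unit, which is an assumption; only the ratios matter.

```python
import math

def nines(reliability):
    """Number of 'nines' in a reliability figure, e.g. 0.999 -> ~3."""
    return -math.log10(1.0 - reliability)

def effort_multiplier(n):
    """Rule-of-thumb model: each additional nine of reliability costs
    10x the effort of the one before it. The base of one effort unit
    for the first nine is an assumed normalization."""
    return 10 ** (n - 1)

# Going from 99% reliable (2 nines) to 99.999% (5 nines) under this
# model costs a thousand times more effort.
relative_cost = effort_multiplier(5) / effort_multiplier(2)
```

This exponential blow-up is why a system that drives well most of the time is still very far from one that can operate with no human behind the wheel.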
Google culture and technical talent
Google fosters a culture that refuses to accept the status quo. This involves setting a massive vision and investing in technical talent capable of achieving long-term goals. These individuals are expected to go the distance to turn a vision into reality.
That culture of Google, of not accepting the status quo, having a big vision, and investing in technical talent, the people who can go the distance and realize the vision, that is part of the culture.
This philosophy explains early investments in fundamental technologies like Transformers and quantum computing. Breakthroughs in the digital world stem from a commitment to pushing boundaries rather than settling for incremental progress. Dmitri notes that this mindset is what allows talent to flourish within the organization.
