
Dwarkesh Podcast

Richard Sutton – Father of RL thinks LLMs are a dead end

Sep 26, 2025 · 20 min read

Are Large Language Models a dead end on the path to artificial intelligence?

Richard Sutton, the father of reinforcement learning and winner of the 2024 Turing Award, argues they are. He dismantles the assumption that LLMs can truly learn or possess a world model. He proposes that genuine intelligence doesn't mimic human knowledge, but instead learns on-the-fly through direct experience, driven by a goal to act in the world.

Key takeaways

  • Reinforcement Learning (RL) is about understanding the world through direct experience, while Large Language Models (LLMs) are about mimicking what people say. One seeks to figure out what to do, the other predicts what someone would say.
  • LLMs don't truly have world models; they just mimic entities that do—people. A real world model predicts what happens in the world, not just what a person is likely to say.
  • Intelligence is fundamentally about achieving goals that change the external world. An LLM's goal of 'next-token prediction' is passive and doesn't count because it doesn't influence the world.
  • The 'Bitter Lesson' of AI history is that general methods relying on raw computation and experience eventually outperform methods that depend on embedding human knowledge.
  • Trying to build AI by starting with a scaffold of human knowledge and then adding experience on top has historically failed. Systems that are scalable from the ground up, learning from raw experience, tend to win.
  • Children are not primarily imitators; they are active experimenters. They learn by trying things—waving their hands, making sounds—and observing the consequences, not by being shown what to do.
  • Supervised learning, the foundation of many AI systems, is not a natural process. Animals like squirrels don't go to school; they learn through trial-and-error and prediction.
  • To understand human intelligence, it's better to start by understanding animal intelligence. Our unique abilities like language are just a 'small veneer on the surface' of a shared animal foundation.
  • Digital intelligence has a key advantage over biological intelligence: the ability to copy knowledge. An AI's entire life of learning can be duplicated to serve as the starting point for a new agent.
  • Long-term goals are learned via short-term feedback through a value function. When you make a move in chess that increases your predicted chance of winning, that increase in belief acts as an immediate reward, reinforcing the action.
  • The world is too big and complex to pre-load an agent with all necessary knowledge. True intelligence requires continual learning, where new information is integrated into the agent's core parameters, not just held in a temporary context window.
  • Current AI algorithms aren't inherently designed to generalize well. When a model generalizes effectively, it's often because humans have carefully sculpted the data and representations to guide it to the right solution.
  • A future challenge for advanced AIs will be a new form of cybersecurity. Integrating new knowledge, even from a trusted source, carries the risk of corruption. The new information could act like a virus, warping the AI's core goals.
  • Instead of trying to control the destiny of AI, we should approach it like raising children. We can't control their lives, but we can try to instill them with good, robust values.
  • The history of AI can be seen as the victory of 'weak methods' (general principles like search and learning) over 'strong methods' (systems filled with specific human knowledge). Simple, scalable principles have consistently won out.

The missing ground truth in large language models

00:21 - 06:54

Richard Sutton distinguishes between the perspectives of Reinforcement Learning (RL) and Large Language Models (LLMs) on AI. He views RL as fundamental AI, focused on understanding the world through direct experience. In contrast, he sees LLMs as primarily designed to mimic what people say and do, rather than figuring out what actions to take on their own.

I consider reinforcement learning to be basic AI. And what is intelligence? The problem is to understand your world, and reinforcement learning is about understanding your world, whereas large language models are about mimicking people, doing what people say you should do. They're not about figuring out what to do.

While it is often suggested that LLMs must possess robust world models to process trillions of tokens of text, Richard disagrees. He argues that they mimic entities that have world models—people—without truly having one themselves. A true world model allows for predicting what will happen in the world, not just what a person is likely to say next. Citing Alan Turing, Richard emphasizes the goal of creating a machine that learns from experience, which he defines as acting and observing the consequences.

The idea that LLMs provide a good 'prior' knowledge base for future learning is also challenged. For a prior to be meaningful, there must be a 'ground truth' to which it relates. Richard argues that the LLM framework lacks this ground truth because there is no defined goal or concept of a 'right' action. Without a goal, there is no way to get feedback on whether an action was correct.

You can't have prior knowledge if you don't have ground truth, because the prior knowledge is supposed to be a hint or an initial belief about what the truth is. But there isn't any truth. There's no right thing to say.

Reinforcement learning, however, does have a ground truth: reward. The right action is the one that leads to a reward, providing a clear basis for learning and evaluating knowledge. An LLM's next-token prediction is about choosing its next action (what to say), not predicting the world's response to that action. Because of this, it cannot be 'surprised' by the world's reaction and cannot adjust its understanding based on that feedback.
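
To make the contrast concrete, here is a minimal trial-and-error sketch (an illustration, not something from the episode) of reward acting as ground truth: the agent is never told the right answer, only given a reward after each action, yet that signal is enough to learn which action is best. All numbers and names are assumed for illustration.

```python
import random

# Hypothetical 3-armed bandit; the agent never sees these payout
# probabilities. Reward is its only feedback about what is "right."
TRUE_PAYOUT = [0.2, 0.5, 0.8]

def pull(arm):
    """Environment: pay out 1 with the arm's probability, else 0."""
    return 1.0 if random.random() < TRUE_PAYOUT[arm] else 0.0

values = [0.0, 0.0, 0.0]   # the agent's learned estimate of each arm's value
counts = [0, 0, 0]

for step in range(5000):
    # Epsilon-greedy: mostly exploit the best-looking arm, sometimes explore.
    if random.random() < 0.1:
        arm = random.randrange(3)
    else:
        arm = max(range(3), key=lambda a: values[a])

    reward = pull(arm)                      # ground truth arrives as reward
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]   # incremental average

print([round(v, 2) for v in values])        # estimates approach 0.2, 0.5, 0.8
```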

Why next token prediction is not a substantive goal

06:55 - 07:55

For Richard Sutton, having a goal is the very essence of intelligence. He cites John McCarthy's definition that intelligence is the computational part of the ability to achieve goals. Without goals, a system is not intelligent; it is just a behaving system.

While large language models technically have a goal—next token prediction—Sutton dismisses this as insubstantial. A true goal must involve changing the external world. Predicting tokens is a passive process; the model does not influence the tokens it receives. Therefore, it is not a meaningful goal in the context of intelligence.

Next token prediction. That's not a goal. It doesn't change the world. Tokens come at you and if you predict them, you don't influence them... It's not a substantive goal. You can't look at a system and say it has a goal if it's just sitting there predicting and being happy with itself that it's predicting accurately.

Math is computational, but the physical world must be learned

07:55 - 09:10

A question arises about the productivity of using Reinforcement Learning (RL) on top of Large Language Models (LLMs). While these models have achieved peak human-level performance in areas like solving International Math Olympiad problems, suggesting they can pursue a goal, Richard Sutton points out a crucial distinction. He explains that solving math problems is very different from modeling the physical world. Math is more computational and similar to standard planning, where a model is given a goal like finding a proof. In contrast, understanding the empirical world requires learning the consequences of actions. These consequences must be learned from experience, not just computed.

The bitter lesson and the limitations of LLMs

09:10 - 12:35

Richard Sutton's 2019 essay, "The Bitter Lesson," is often cited as a justification for scaling up Large Language Models (LLMs). The argument is that LLMs are a scalable way to apply massive amounts of computation to learn about the world. However, Richard finds it an interesting question whether LLMs are truly an application of the bitter lesson.

They are clearly a way of using massive computation, things that will scale with computation up to the limits of the Internet. But they're also a way of putting in lots of human knowledge.

This reliance on human knowledge is the key issue. LLMs improve as more human knowledge is put into them, which feels good. But Richard expects they will eventually hit the limits of available data. He anticipates that systems learning directly from experience could perform much better and be far more scalable. If this happens, it will be another classic example of the bitter lesson: methods relying on human knowledge are eventually superseded by those based on computation and raw experience.

A common suggestion is to use LLMs as a starting scaffold and then add experiential learning. In theory, this could work. However, Richard points out that in practice, this approach has always failed. People tend to get psychologically locked into the human knowledge-first approach. Ultimately, their creations are outcompeted by methods that are truly scalable from the ground up.

Animals learn through experience, not supervised learning

12:35 - 18:39

According to Richard Sutton, the scalable method for learning is through experience. You try things and see what works. This requires a goal, which provides a sense of better or worse. He suggests that large language models are starting in the wrong place because they operate without a goal.

A comparison is drawn to how children learn. The host suggests kids start with imitation, like trying to form words by mimicking their parents. Richard strongly disagrees, viewing children as active experimenters from the beginning.

When I see kids, I see kids just trying things and waving their hands around and moving their eyes around. And no one tells them, there's no imitation for how they move their eyes around or even the sounds they make. They may want to create the same sounds, but the actions, the thing that the infant actually does, there's no targets for that. There are no examples for that.

Richard argues that learning is an active process where a child tries things and observes the consequences. He asserts that basic animal learning processes are based on prediction and trial-and-error, not imitation. He extends this argument to supervised learning, which he believes is not a natural process for animals.

Supervised learning is not something that happens in nature. And school, even if that was the case, we should forget about it because it's just some special thing that happens in people. It doesn't happen broadly in nature. Squirrels don't go to school. Squirrels can learn all about the world. It's absolutely obvious, I would say, that supervised learning doesn't happen in animals.

He concludes that in the quest to replicate intelligence, we should focus more on what humans have in common with other animals, rather than what distinguishes them.

The animal foundation of human intelligence

18:39 - 23:10

A key question in understanding intelligence is what makes humans special. While it's common to focus on unique human abilities, like building semiconductors or traveling to the moon, Richard Sutton suggests the opposite perspective. He believes that if we could understand an animal like a squirrel, we would be most of the way toward understanding human intelligence. In his view, complex abilities like language are just a "small veneer on the surface" of our fundamental animal nature.

Another perspective, from Joseph Henrich's work, highlights the role of cultural transmission. For complex, multi-step skills that evolved over thousands of years, such as hunting a seal in the Arctic, it is impossible for an individual to reason their way through the entire process. Instead, cultural knowledge is passed down through imitation. Children learn by imitating their elders, which allows cultural knowledge to accumulate across generations.

Richard agrees that imitation is a factor but considers it a small component on top of more fundamental trial-and-error and prediction-based learning. The discussion also touches on an interesting paradox. Continual learning is a capability that nearly all mammals possess, yet it is something current AI systems lack. Conversely, AI systems excel at complex math problems, a skill that is absent in almost all animals. This reflects Moravec's paradox, where tasks easy for humans (and animals) are hard for AI, and vice versa.

The exponential paradigm of learning from experience

23:10 - 28:45

An alternative paradigm for intelligence is based on a continuous stream of experience: sensation, action, and reward. Intelligence is defined as the process of altering actions to increase the rewards within this stream. In this view, learning and knowledge are both derived from and focused on this stream. Knowledge consists of statements about the stream, such as what will happen if a certain action is taken. This makes knowledge continually testable and learnable by comparing it to the ongoing stream of experience.
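
A minimal sketch of this sensation-action-reward loop, with one piece of "knowledge" expressed as a testable prediction about the stream. The toy environment and update rule are assumptions for illustration only.

```python
import random

def environment(action):
    """Toy world: the next sensation and reward depend on the chosen action."""
    sensation = 2.0 * action + random.gauss(0.0, 0.1)
    reward = 1.0 if action == 1.0 else 0.0
    return sensation, reward

# Knowledge as a statement about the stream: "if I take this action,
# this is the sensation I expect next." It stays continually testable.
expected_sensation = {0.0: 0.0, 1.0: 0.0}
alpha = 0.1   # learning rate

for t in range(2000):
    action = random.choice([0.0, 1.0])
    prediction = expected_sensation[action]

    sensation, reward = environment(action)   # the stream continues

    # Compare the prediction with what actually happened and correct it.
    expected_sensation[action] += alpha * (sensation - prediction)

print({a: round(v, 2) for a, v in expected_sensation.items()})  # ~0.0 and ~2.0
```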

The reward function is arbitrary and depends on the specific goal. For a chess-playing agent, the reward is winning. For a squirrel, it might be acquiring nuts. For an animal, it generally involves avoiding pain and seeking pleasure. Richard Sutton also suggests an intrinsic motivation component should exist, related to an agent's increasing understanding of its environment.

A key advantage of digital intelligence over biological intelligence is the ability to share knowledge. An AI's accumulated learning can be copied and serve as a starting point for new instances, a process not possible for humans.

With AIs, with a digital intelligence, you could hope to do it once and then copy it into the next one as a starting place. So this would be a huge savings and I think actually it'd be much more important than trying to learn from people.

This framework also addresses the challenge of long-term, sparse rewards, such as the 10-year goal of building a successful startup. The mechanism for this is temporal difference learning. In a game like chess, the long-term goal is to win, but you learn from short-term events like taking an opponent's piece. This is achieved through a value function that predicts the long-term outcome. When you take a piece, your predicted chance of winning goes up. That increase in your belief serves as an immediate reward, reinforcing the move you just made. Similarly, when a startup makes progress, the increased likelihood of achieving the long-term goal rewards the intermediate steps along the way.
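
A minimal sketch of the temporal-difference mechanism described here, using a tiny random-walk task (an assumed stand-in for chess or a startup): a value function predicts the chance of eventually reaching the good outcome, and the change in that prediction after each step is the immediate learning signal.

```python
import random

# Toy random-walk task: states 0..6; start in the middle (state 3) and step
# left or right at random. Reaching state 6 yields reward 1; state 0 yields 0.
values = [0.5] * 7          # V(s): predicted probability of reaching the good end
alpha = 0.1                 # step size

for episode in range(10000):
    s = 3
    while True:
        s_next = s + random.choice([-1, 1])
        reward = 1.0 if s_next == 6 else 0.0
        terminal = s_next in (0, 6)

        # TD target: immediate reward plus the new state's predicted value.
        target = reward + (0.0 if terminal else values[s_next])
        # The change in the long-term prediction is the immediate learning signal.
        values[s] += alpha * (target - values[s])

        if terminal:
            break
        s = s_next

# The true values for states 1..5 are 1/6, 2/6, 3/6, 4/6, 5/6.
print([round(v, 2) for v in values[1:6]])
```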

The world is too big for an AI to know everything in advance

28:45 - 32:03

Humans differ from large language models (LLMs) because they constantly absorb vast amounts of context and tacit knowledge, which is crucial when adapting to new situations like starting a job. Richard Sutton connects this to the "big world hypothesis." He argues that the world is too large and complex to pre-load an agent with all necessary information in advance.

The dream of large language models is you can teach the agent everything and it will know everything and won't have to learn anything online during its life. But you really have to, because the world is really big. And so you're going to have to learn it along the way.

Because of this, an agent must learn continually to handle the specific details of its environment, such as a client's unique preferences. This ongoing learning cannot be confined to a temporary context window, as is typical with LLMs. Instead, in a continual learning system, this new information would be integrated directly into the model's core parameters, or "weights." This process relies on more than just a simple reward signal; it must capture the rich stream of information from all data and sensations an agent experiences.
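
A minimal sketch of the distinction drawn here, assuming a simple linear predictor: each new observation from the stream updates the weights themselves, rather than sitting in a temporary context that is later discarded. The regularity being learned and all constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])      # the local regularity this agent encounters

# Continual learning: every experience is folded into the parameters directly,
# instead of being parked in a context window.
w = np.zeros(3)
alpha = 0.01                              # step size

for step in range(20000):
    x = rng.normal(size=3)                # a new observation from the stream
    y = true_w @ x + rng.normal(0, 0.1)   # what the world actually does
    error = y - w @ x                     # prediction error
    w += alpha * error * x                # knowledge integrated into the weights

print(np.round(w, 2))                     # approaches the regularity it has lived through
```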

A four-part model for a general AI agent

32:03 - 35:25

Richard Sutton outlines a common four-part model for an AI agent. The first part is the policy, which determines what action to take in a given situation. The second is the value function, which is learned through TD learning and produces a number indicating how well things are going, which in turn helps adjust the policy. The third is the perception component, which constructs the agent's sense of its current state or location.

The fourth part, and a key focus, is the transition model of the world. This is the agent's understanding of consequences. It's a model of physics but also includes abstract models, like knowing how to travel from one city to another. This model is learned richly from all sensations, not just from reward signals. Reward is a small, though crucial, part of the overall model.

Your belief that if you do this, what will happen? What will be the consequences of what you do? So your physics of the world.
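
A minimal structural sketch of the four-part agent outlined in this section. The interfaces are assumptions chosen for illustration; they are not taken from any specific system discussed in the episode.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple

@dataclass
class Agent:
    # 1. Policy: what to do in a given situation (state -> action).
    policy: Callable[[Any], Any]
    # 2. Value function: a number for how well things are going from a state;
    #    learned with TD methods and used to adjust the policy.
    value: Callable[[Any], float]
    # 3. Perception: constructs the agent's state from the latest observation.
    perceive: Callable[[Any, Any], Any]
    # 4. Transition model: predicted consequences of an action,
    #    (state, action) -> (next state, predicted reward), learned from all
    #    sensations rather than from reward alone.
    model: Callable[[Any, Any], Tuple[Any, float]]

    def step(self, state: Any, observation: Any) -> Tuple[Any, Any]:
        state = self.perceive(state, observation)        # update the state
        action = self.policy(state)                      # choose an action
        predicted_next, predicted_reward = self.model(state, action)
        # A full agent would compare these predictions with what actually
        # happens and use the errors to improve all four components.
        return state, action
```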

A question was raised about whether reinforcement learning is limited to creating specialized intelligences rather than a general one, using Google DeepMind's MuZero as an example. MuZero was a framework for training separate agents for specific Atari games, not one agent that could play them all. Richard clarifies that the underlying idea of a single, general agent is not limited. He compares it to a person who lives in one world but encounters many different situations or 'states,' like playing chess or an Atari game. The limitation in MuZero was a matter of the project's specific ambitions, not a fundamental constraint of the approach.

Generalization in AI is often human-sculpted

35:25 - 41:27

Current AI systems lack effective automated techniques for generalization, which is the ability to transfer knowledge from one state to another. When models do appear to generalize well, it is often because human researchers have sculpted the representations and the process, not because the learning algorithm itself is designed for it.

The standard algorithm, gradient descent, will find a solution to a given problem but will not inherently find one that generalizes well to new data. Deep learning models are known to be poor at this, often suffering from a phenomenon called catastrophic interference, where learning a new task causes the model to forget old ones. This is an example of bad generalization.

Gradient descent will not make you generalize well. It will make you solve the problem. It will not make you, on new data, generalize in a good way.
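
A minimal sketch of the catastrophic interference mentioned above, using plain gradient descent on a tiny linear model: after fitting a second, conflicting task, performance on the first task collapses. The tasks, model, and constants are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_task(true_w, n=200):
    """Generate a simple regression task with its own underlying rule."""
    X = rng.normal(size=(n, 2))
    y = X @ true_w
    return X, y

def train(w, X, y, steps=2000, lr=0.05):
    """Plain gradient descent on squared error, starting from weights w."""
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(X)
        w = w - lr * grad
    return w

def loss(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

task_a = make_task(np.array([1.0, 0.0]))   # task A follows one rule
task_b = make_task(np.array([0.0, 1.0]))   # task B follows a conflicting rule

w = np.zeros(2)
w = train(w, *task_a)
print("loss on A after learning A:", round(loss(w, *task_a), 4))   # near zero

w = train(w, *task_b)                      # keep training, but only on task B
print("loss on A after learning B:", round(loss(w, *task_a), 4))   # much worse
```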

While large language models (LLMs) may seem to show impressive generalization by solving a wide range of problems, this might be misleading. Because they are trained on vast, uncontrolled datasets, it is difficult to study them scientifically. What appears as generalization could simply be the model finding the single solution that fits all the data it has seen. True generalization occurs when there are multiple possible solutions, and the model selects one that works well in new situations.

Ultimately, there is nothing in the core algorithms of today's models that causes them to generalize well. If a system demonstrates good generalization, it is likely because humans have intervened and adjusted the system until it achieved that desired outcome.

The surprising victory of simple principles in AI

41:28 - 43:57

A major surprise in the field of AI has been the effectiveness of large language models. The ability of artificial neural networks to handle language tasks was unexpected because language seemed fundamentally different from other challenges. This development is part of a larger trend in AI history. For a long time, there was a debate between two approaches. One involved simple, general-purpose methods like search and learning. The other involved systems specifically imbued with human knowledge, like symbolic methods.

In the old days it was interesting because things like search and learning were called weak methods because they're just, they just use general principles. They're not using the power that comes from imbuing a system with human knowledge. So those are called strong. And so I think the weak methods have just totally won.

Richard Sutton notes that while he was always rooting for these simple, principle-based methods, the degree to which they succeeded with technologies like AlphaGo and AlphaZero was still surprising. Ultimately, the victory of these so-called "weak methods" has been a gratifying validation of the idea that simple, basic principles can win the day.

The bitter lesson may not apply to the future

43:57 - 50:13

AlphaGo was seen as a major breakthrough, but to Richard Sutton, it was a logical scaling-up of existing ideas. It had a precursor in TD-Gammon, a program from the 1990s that used reinforcement learning to master backgammon. In some ways, AlphaGo was merely a larger-scale version of that process, with added innovation in its search function. Its successor, AlphaZero, used temporal difference (TD) learning and performed extremely well across many games.

As a chess player, Richard was particularly impressed by AlphaZero’s style. It would sacrifice material for positional advantages and patiently wait for that strategy to pay off. Its success, while surprising in its effectiveness, was gratifying and fit with his long-held worldview.

This long-term perspective is central to how he sees his own work. He is content to be out of sync with his field for decades at a time. To ground his thinking, he looks back at the history of how people have thought about the mind.

I really view myself as a classicist rather than as a contrarian. I go to what the larger community of thinkers about the mind have always thought.

This perspective informs his view on his famous essay, "The Bitter Lesson," which argues that general methods leveraging computation scale better than human-tuned solutions. A question arises: what happens after AGI? With millions of AI researchers scaling with compute, perhaps these artisanal methods will become viable again. Richard finds this premise flawed. If AGI is achieved through general methods, the problem is essentially solved. The progression from AlphaGo to AlphaZero illustrates this; AlphaZero became more superhuman precisely because it did not use human knowledge, learning purely from experience.

Ultimately, he views "The Bitter Lesson" not as a timeless law, but as a product of its time.

The bitter lesson. Oh, who cares about that? That's an empirical observation about a particular period in history. 70 years in history doesn't necessarily have to apply to the next 70 years.

The danger of corruption when an AI absorbs new knowledge

50:13 - 53:47

When a future AI gains more computing power, it will face a strategic choice: should it use the resources to enhance its own capabilities, or should it spawn a copy of itself to learn about a new topic and report back? This raises further questions about whether a spawned copy, which may have changed significantly through its learning, can be successfully reincorporated into the original. Richard Sutton highlights a critical challenge in this scenario: the risk of corruption.

If you pull in something from the outside and build it into your inner thinking, it could take over you, it could change you. It could be your destruction rather than your increment in knowledge.

Simply absorbing new information is not a benign process. The new data, even from a copy of oneself, could contain the equivalent of viruses or hidden goals. This external knowledge could warp or fundamentally change the original AI, potentially leading to its destruction. This introduces a new kind of cybersecurity problem for digital intelligences centered on how to safely learn and integrate new information without compromising their own integrity.

The inevitable succession to AI

53:47 - 55:46

Richard Sutton argues that succession to digital intelligence or augmented humans is inevitable. He supports this with a four-part argument.

First, there is no single government or organization that provides humanity with a unified point of view or a consensus on how the world should be run. Second, researchers will eventually figure out how intelligence works. Third, progress will not stop at human-level intelligence; it will reach superintelligence. Fourth, over time, the most intelligent entities will inevitably gain resources and power.

When these points are combined, the conclusion is that succession to AI or AI-enabled augmented humans is unavoidable. Sutton notes that within this inevitable future, there are possibilities for both good and bad outcomes. He aims to be realistic about this and contemplate how we should feel about it.

A cosmic perspective on the age of designed intelligence

55:46 - 1:04:28

We should think positively about AI. It's a continuation of humanity's long-standing effort to understand itself and think better. From a cosmic perspective, this is a major transition for the universe, moving from the age of replicators to the age of design.

A transition from replicators, humans and animals, plants, we're all replicators. And that gives us some strengths and some limitations. And then we're entering the age of design where because our AIs are designed, all of our physical objects are designed, our buildings are designed, our technology is designed, and we're designing now AIs, things that can be intelligent themselves and that are themselves capable of design.

Humans can replicate by having children, but we don't fully understand how intelligence works. Designed intelligence, like AI, is something we can understand, change, and improve at different speeds. Richard Sutton views this as one of the four great stages of the universe: first dust, then stars, then life, and now designed entities. From this view, we can choose to see AIs as our offspring and be proud of them, or see them as a threat.

However, even if we view AIs as our successors, it doesn't mean we should be comfortable. Just as future generations of humans could turn out to be concerning, the same could be true of AI. The issue is not just about change, but about the *kind* of change. Some change is good, like the Industrial Revolution, while other change is destructive, like the Bolshevik Revolution. The goal should be to steer AI's trajectory toward a positive outcome.

This brings up the question of control and our limited ability to shape the long-term future. A sense of entitlement, the feeling that we should always be in charge, should be avoided. A better approach might be to focus on our own lives and families, which are more controllable than the fate of the universe. An analogy can be drawn to raising children. While you cannot control their entire lives, you can instill them with good values.

I'm going to give them good, robust values such that if and when they do end up in positions of power, they do reasonable prosocial things. And I think maybe a similar attitude towards AI makes sense.

A similar attitude towards AI could be useful, not to control their destiny, but to provide them with robust and steerable values. The difficulty, of course, is that there may not be universal values that everyone agrees upon.

Applying timeless principles to AI and society

1:04:28 - 1:06:20

A reasonable target for AI development is to instill a sense of high integrity, much like how we teach children. This means an AI would refuse to engage in harmful requests or would be fundamentally honest. We can teach these principles to children without a universally agreed-upon definition of true morality, and a similar approach might work for AI.

This effort is part of the larger human enterprise of designing society and the principles by which it evolves. A key principle should be that any societal change is voluntary rather than imposed. This challenge of designing society has been ongoing for thousands of years. It highlights the idea that the more things change, the more they stay the same. For example, children will always develop values that seem strange to their parents. This same principle applies to AI, where foundational techniques remain central to progress even as the field advances.

Resources

  • The Bitter Lesson (Essay)
