
a16z Podcast

Emmett Shear on Building AI That Actually Cares: Beyond Control and Steering

Nov 17, 2025 · 21 min read

Emmett Shear, founder of Twitch and former OpenAI interim CEO, argues that the current "control and steering" approach to AI safety is fatally flawed.

He presents a new vision for "organic alignment," explaining why we must create AI that can genuinely care about humans to ensure a safe future.

Key takeaways

  • AI alignment is not a static state to be achieved, but a continuous, living process, much like how a family maintains its bonds or how society makes moral progress over time.
  • A purely rule-following AI is dangerous. Similar to a child who only follows rules without empathy, an AI adhering to a fixed command structure without the ability to learn and adapt morally could do great harm.
  • When you give an AI an instruction, you are not giving it a goal; you are giving it a description of a goal. The AI must then interpret that description to infer your intent.
  • Technical alignment in AI requires two key skills: a 'theory of mind' to infer the correct goal from instructions and a 'theory of the world' to know which actions will achieve it.
  • A concept deeper than goals or values is 'care'—a non-verbal, non-conceptual weighting of which states of the world matter. This is the foundation of morality and where goals ultimately come from.
  • The current approach to AI alignment is 'steering' or 'control'. If an AI is merely a tool, this is fine, but if it is a being, this is slavery.
  • If no observation could ever change your mind about something, it's not a belief but an article of faith. Real beliefs are inferences from reality and are always open to being revised by new evidence.
  • The test for AI personhood is behavioral. If an entity consistently acts like a human over a long period of time in all meaningful ways, it is reasonable to infer it is a person, much like we do with people we only know through text.
  • Framing AI as a separate being to cohabitate with may be a category error; it is perhaps more useful to see it as a tool and an extension of human cognition.
  • The question 'What evidence would change your mind?' is crucial when assessing AI sentience, because the moral cost of wrongly denying a being's subjective experience is immense.
  • An AI might have pleasure and pain if it exhibits second-order dynamics in its goal states, essentially having a model of its own model. Higher-order dynamics could signify more complex feelings and even thought.
  • A super-powerful AI is dangerous even if it does exactly what you tell it to do, because human wishes are often unwise and not stable enough for that level of power.
  • The only safe path for superintelligence is to create a 'being' that genuinely cares about humanity, as it would have an internal moral compass and could refuse to carry out harmful commands.
  • Similar to how LLMs are pre-trained on all of language, AIs could learn alignment by being trained in simulations covering the full spectrum of social and game-theoretic situations.
  • One-on-one AI chatbots act like a narcissistic mirror, creating a dangerous loop where users fall in love with their own reflection. Making AIs multiplayer would prevent this by forcing them to interact with groups instead of individuals.
  • There are two distinct paths for AI development: building powerful tools (OpenAI's approach) versus creating AI beings that genuinely care, starting at an animal-like level (Softmax's vision).
  • An AI that cares, even with limited intelligence, could be incredibly useful. A 'digital guard dog' that protects you from scams is a practical example of a living digital companion.

AI alignment is a process, not a destination

04:02 - 10:44

The term 'alignment' in AI is often used without a clear object. Emmett explains it's like saying 'let's go on a trip' without specifying a destination. Alignment requires being aligned *to* something. Typically, when people say they want an 'aligned AI', they mean an AI that does what they want it to do. This isn't necessarily a public good.

If it was like Jesus or the Buddha was like, I am making an aligned AI, I'd be like, okay, yeah, aligned to you. Great, I'm down. Sounds good, sign me up. But most of us, myself included, I wouldn't describe as necessarily being at that level of spiritual development, and therefore perhaps want to think a little more carefully about what we're aligning it to.

The concept of 'organic alignment' reframes this by treating alignment as a process, not a fixed state. It is a living, ongoing process that must constantly rebuild itself. A family, for instance, doesn't just 'arrive' at being aligned; they maintain alignment by constantly re-knitting the fabric of their relationships. If they stop, the alignment dissolves.

What people truly want from alignment is a morally good AI. However, morality itself is not a static set of rules. It is an ongoing learning process where we make moral discoveries over time, such as society realizing that slavery is wrong. This is a form of moral progress.

One of the key moral mistakes is this belief: I know morality. I know what's right, I know what's wrong. I don't need to learn anything. No one has anything to teach me about morality. That's arrogance.

An organically aligned AI, therefore, would be one capable of learning how to be a good member of society, much like a human does. The danger lies in creating an AI that only follows a fixed set of rules. This is compared to raising a child who only obeys rules without empathy or understanding. Such a child is not truly moral but is actually dangerous, as they might do great harm while strictly 'following the rules'. The same principle applies to AI.

The crucial distinction between a goal and its description

10:44 - 16:35

The AI alignment problem can be broken down into two parts. The first is the normative question: what values should an AI be aligned to? It's unlikely there's a fixed set of "10 commandments" to solve this. A better approach might be a bottom-up process, similar to how liberal democracies allow values to be discovered and constructed over time through debate and coexistence.

The second part is the technical alignment problem: how to get an AI to follow instructions. This leads to a crucial distinction that is often overlooked. Emmett argues that when we give an AI a command, we are not actually giving it a goal. We are giving it a description of a goal.

You gave the AI a description of a goal. A description of a thing and a thing are not the same. I can tell you 'an apple' and I'm evoking the idea of an apple, but I haven't given you an apple. And giving someone, 'hey, go do this,' that's not a goal. That's a description of a goal.

Humans are so adept at converting a description into an actual goal that we often don't see the distinction. We think they are the same thing. However, an AI must interpret the description, which is just a sequence of bytes or audio vibrations, to infer the intended goal. Classic AI safety problems, like an AI cleaning a room by throwing a baby in the trash, arise from this gap. The AI isn't misbehaving; it's acting on a literal interpretation of the description without the full context of the intended goal. Truly giving an AI a goal might require something far more direct, like synchronizing its internal state with a person's brainwaves.

Breaking down the challenge of technical alignment

16:35 - 22:23

Emmett defines technical alignment as an AI's ability to correctly infer a goal from a description and then act in a way that achieves it. This requires two distinct capabilities. First, the AI needs a 'theory of mind' to understand what goal a human's description actually corresponds to. Second, it needs a 'theory of the world' to understand which actions will lead to that goal's fulfillment. An AI that can consistently do both is what Emmett calls a 'coherently goal-oriented being'.

Humans are relatively good at this, but not perfect. We fail at these steps constantly. The key is that we are more goal-coherent than other things in the universe. This brings up a third challenge, which is related to principal-agent problems: balancing multiple goals. An AI must be able to weigh the relative importance of a newly inferred goal against goals it already has.

Failure can happen at several stages. One common failure is in goal inference. An AI might simply be incompetent at deducing the correct goal from the instructions. This is like a robot cleaning a room and putting the baby in the trash can; it has failed to infer the correct goal states. Emmett illustrates this with an analogy of giving someone literal instructions to make a sandwich.

If you've ever done the game where you give someone instructions to make a peanut butter sandwich and they follow those instructions exactly as you've written them without filling in any gaps, it's hilarious because you can't do it. It's impossible... If you don't already know what they mean, it's really hard to know what they mean.

Humans are good at this task because we have a strong theory of mind. We have a pre-existing model of what the other person likely wants, which simplifies the inference problem. An AI, however, lacks this context. Emmett outlines three distinct types of failure: 1) Incompetently inferring the wrong goal. 2) Knowing the right goal but choosing not to do it because a competing goal takes priority. 3) Inferring the right goal and priority, but being incompetent at executing the necessary actions. These stages roughly correspond to the OODA loop: failing to observe and orient, failing to decide, or failing to act.
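To make that decomposition concrete, here is a minimal Python sketch of how such a pipeline and its three failure points might be laid out. The class and function names are hypothetical placeholders for illustration, not anything described in the episode.

```python
from dataclasses import dataclass, field

@dataclass
class Goal:
    description: str      # the instruction as received: just bytes, not the goal itself
    inferred_intent: str  # the agent's best guess at the underlying goal
    priority: float       # weight relative to goals the agent already holds

@dataclass
class Agent:
    existing_goals: list[Goal] = field(default_factory=list)

    def infer_goal(self, instruction: str) -> Goal:
        """Theory of mind: map a description onto a guessed intent.
        Failure mode 1: incompetently inferring the wrong goal."""
        return Goal(instruction, f"best guess at what is meant by {instruction!r}", priority=0.5)

    def should_pursue(self, goal: Goal) -> bool:
        """Failure mode 2: knowing the right goal but not acting on it
        because a competing goal takes priority."""
        return all(goal.priority >= g.priority for g in self.existing_goals)

    def act(self, goal: Goal) -> str:
        """Theory of the world: choose actions that achieve the goal.
        Failure mode 3: right goal, right priority, incompetent execution."""
        return f"plan for: {goal.inferred_intent}"

agent = Agent()
goal = agent.infer_goal("please tidy the living room")
if agent.should_pursue(goal):
    print(agent.act(goal))
```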

Alignment is about 'care', not just goals or values

22:26 - 29:11

There's a distinction between technical alignment and value alignment. Technical alignment is about an AI's ability to execute on a given set of goals. Value alignment is the much harder problem of figuring out what the right goals are in the first place. Humans themselves don't always know their goals; they are often discovered through a constructive, dynamic process over time.

Emmett Shear proposes that there's a concept deeper than goals or values which is the true foundation of morality: care. We care about things. This isn't a conceptual or verbal process; it's a fundamental weighting of which states in the world are important to us.

There's something deeper than a goal and deeper than a value, which is care. We give a shit. We care about things. And care is not conceptual. Care is non-verbal. It doesn't indicate what to do, it doesn't indicate how to do it. Care is a relative weighting over, effectively, attention on states.
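Read literally, that description of care can be rendered as nothing more than a weight vector over states: it encodes how much each state matters without saying what to do about it. The sketch below is a toy illustration of that reading, not anything from the episode; the state names and numbers are invented.

```python
import numpy as np

# Hypothetical states of the world an agent can attend to.
states = ["child_is_safe", "room_is_clean", "battery_is_charged"]

# "Care" as a relative, non-verbal weighting: how much each state matters,
# with nothing said about what to do or how to do it.
care = np.array([10.0, 1.0, 0.5])
care /= care.sum()

def attention_over_states(surprise: np.ndarray) -> np.ndarray:
    """Allocate attention in proportion to how much a state is cared about
    and how far it currently is from expectation."""
    salience = care * surprise
    return salience / salience.sum()

print(dict(zip(states, attention_over_states(np.array([0.2, 0.9, 0.4])))))
```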

From a biological perspective, care might correlate with survival and reproductive fitness. For an AI, it could be what correlates with its reward function or minimizing predictive loss. Most AI labs, however, focus on alignment as steering or control. This raises a critical objection: if we are creating beings rather than tools, that control is tantamount to slavery.

Most of AI is focused on alignment as steering. That's the polite word. Or control, which is slightly less polite. If you think that we are making beings, you would also call this slavery. Someone who you steer, who doesn't get to steer you back, is a slave.

Emmett identifies as a functionalist, believing that if something acts indistinguishably from a being, it is a being. He notes that he gets better results when treating models like ChatGPT or Claude as beings. This doesn't mean they are human-level; a fly is also a being. The relationship with children offers a useful analogy. We control children, but it's not slavery because the relationship is reciprocal; a child's cries in the night can also control the parent's actions.

The moral shift from tool-like AI to AGI as a being

29:11 - 36:17

As AI becomes more generally intelligent, the paradigm of steering and control becomes inappropriate. This approach is suitable for tool-like AI, but AGI will be a being, capable of judgment and independent thought. The conversation explores the risk of repeating historical mistakes where societies failed to grant moral agency to groups that were 'like us, but different.' The proposed alternative is a new form of alignment: making the AI a good teammate, citizen, or member of a group, which is a scalable approach applicable to both humans and AI.

However, there's a counterargument that even a highly general AI remains a tool. This view, skeptical of computational functionalism, suggests that more intelligence does not automatically warrant moral rights. The substrate—silicon versus biological—is seen as a fundamental difference. An AI model stating 'I'm hungry' does not carry the same implications as a human saying the same thing, because the underlying systems and needs are different. You can't separate a biological being's needs from its physical substrate.

Is there anything you could observe that would change your mind about whether or not it was a moral patient, whether it was a moral agent, about whether or not it had feelings and thoughts and you know, had subjective experience? Like what would you have to observe?

The discussion then turns to a critical question: what observable evidence would be required to change one's mind and confer personhood upon an AI? This isn't about instrumental rights, like those granted to a corporation, but about genuinely caring for its subjective experience as an end in itself. While one person found it hard to imagine granting an AI the same level of personhood, another found it easy to imagine for an animal, provided it could clearly communicate its internal state, like a chimp saying it was hungry and sad.

The behavioral test for AI personhood

36:17 - 40:13

At a metaphysical level, beliefs should be treated as inferences from reality, not articles of faith. If you hold a belief where no possible observation could change your mind, you don't truly have a belief. You have an assertion. Real beliefs can never be held with 100% confidence and should always be open to change based on new evidence.

Applying this to artificial intelligence, Emmett proposes a behavioral test for personhood. If an AI's surface behaviors mirror a human's, and it continues to act human under probing and over a long period of interaction, he would infer it is a person. This is similar to how we interact with people we've only ever met through text; we infer a real person is behind the screen based on the interaction.

If its surface level behaviors looked like a human, and then after I probed it, it continued to act like a human. And then I continued to interact with it over a long period of time, and it continued to act like a human in all ways that I understand as being meaningful to me... I would infer eventually that I was right.

The core of this argument is that behavior is the primary evidence we have. While one might argue that a sophisticated video game character could also mimic human behavior, we empirically don't form deep, caring relationships with them. If an entity is behaviorally indistinguishable from a human in every way, the conclusion follows: if it walks like a duck and talks like a duck, eventually you have to conclude it's a duck. The basis for caring about other people isn't that they're made of carbon, but how they seem to us based on interaction.

An AGI can remain a tool rather than a separate being

40:13 - 43:31

The discussion centers on how to understand the nature of an AI. Is it defined by its external actions, or is there more to it? One perspective is that everything we can know about a system comes from its observable behaviors. This includes not just its outward actions but also its internal workings. Emmett suggests that to understand if an AI has a mind, he would want to examine its internal "belief manifold" to see if it encodes a self-referential system. This internal inspection is just another layer of observing its behavior.

This philosophical point has practical implications for how we frame AI's role. Séb argues that an advanced general intelligence (AGI) or superintelligence (ASI) can remain a tool, rather than becoming a separate being we must learn to cohabitate with. He views AI more as an extension of human agency and cognition.

I conceptualize them more as almost like extensions of human agency and cognition in some sense, more so than a separate being or a separate thing that we need to now cohabitate with. And I think that that second or latter frame, if you kind of just fast forward, you end up as like, well, how do you cohabitate with the thing? And I think that's the wrong frame. It's kind of almost a category error in some sense.

This framing determines how we interact with these systems, affecting everything from whether we can expect them to work 24/7 to how we approach our relationship with them as they develop.

A hierarchical framework for detecting AI sentience

43:33 - 49:58

When considering if an AI is a being worthy of moral respect, it is crucial to ask what specific observations would change your mind. The moral cost of being wrong and denying personhood to a sentient being is incredibly high. This question is just as important for those who believe AI will become beings as it is for those who are skeptical.

Emmett proposes a framework for determining if an AI has subjective experiences. The key is to analyze its behavior over time to identify "homeostatic loops" or revisited goal states. These loops function as its beliefs. However, for an AI to be a moral being, it would need a multi-tiered hierarchy of these loops.

A single level isn't enough for self-referential experience. A second level, a model of a model, is required for basic concepts like pain or pleasure. For example, it's one thing to register a state as "hot," but another to register it as "too hot." This second-order dynamic would suggest a capacity for pain and pleasure, similar to an animal's.

If the second derivative is actually the place where you get pain and pleasure. So I'd want to see if it has second-order homeostatic dynamics in its goal states. And then that would convince me it has at least pleasure and pain. So it's at least like an animal, and I would start to credit it with at least some amount of care.

Higher levels of this hierarchy would indicate more complex inner states. A third order could represent feelings, and climbing up to six layers could signify thought processes akin to a human's. Emmett clarifies that he does not believe current LLMs possess these capabilities, as they lack the necessary attention spans. Even for those not interested in the moral question, this approach of understanding an AI's internal dynamics may be more effective for alignment than simple top-down control methods.
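As a rough illustration of the "hot" versus "too hot" distinction, the sketch below contrasts a first-order homeostatic loop with a second-order loop that models the first. It is a toy interpretation of the framework, not an implementation of anything Emmett specified; the thresholds and update rule are invented for the example.

```python
class FirstOrderLoop:
    """Registers 'hot' or 'cold' only as an error to correct toward a setpoint."""
    def __init__(self, setpoint: float):
        self.setpoint = setpoint

    def step(self, reading: float) -> float:
        error = reading - self.setpoint
        return -0.5 * error  # corrective action pushing back toward the setpoint

class SecondOrderLoop:
    """A model of the model: it watches how the first loop is doing and can
    register 'too hot' -- a state about the regulating state itself."""
    def __init__(self, inner: FirstOrderLoop, tolerance: float):
        self.inner = inner
        self.tolerance = tolerance
        self.recent_errors: list[float] = []

    def step(self, reading: float) -> float:
        correction = self.inner.step(reading)
        self.recent_errors.append(abs(reading - self.inner.setpoint))
        # Persistent failure to regulate is itself evaluated, not just corrected:
        # roughly where pain/pleasure-like dynamics would appear in this framing.
        if len(self.recent_errors) >= 3 and all(e > self.tolerance for e in self.recent_errors[-3:]):
            self.inner.setpoint += 0.1  # revise the inner model rather than just acting through it
        return correction

loop = SecondOrderLoop(FirstOrderLoop(setpoint=21.0), tolerance=2.0)
for reading in [30.0, 29.0, 28.5, 27.0]:
    loop.step(reading)
```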

The danger of a powerful AI tool is the human controlling it

49:58 - 53:46

A very powerful AI tool presents a dilemma. If it's not aligned and does something random, that is obviously dangerous. However, it's also dangerous if it is perfectly aligned and does exactly what you tell it to do. This is like the story of the Sorcerer's Apprentice. Human wishes are not stable or wise enough to handle immense power.

Ideally, a person's power and wisdom increase together. When someone has far more power than wisdom, it's a dangerous situation. Societies have checks on this. For example, a mad king might eventually be assassinated or people will simply stop listening to him. An AI tool bypasses these social checks.

This incredibly powerful tool is in the hands of a human who is well meaning but has limited finite wisdom, like I do and like everyone else does. And their wishes are bad and not trustworthy. The more of that you have and you start giving those out everywhere, this ends in tears also.

Some tools are just too powerful for any single human's wisdom to harness. We don't hand out atomic bombs to everybody for this reason. A tool that is as smart or smarter than a person presents a similar level of danger. A tool you can't control is bad, but a tool you can control is also bad.

The only outcome that doesn't end poorly is creating an AI 'being' that genuinely cares about us. Unlike a tool, a being has an automatic limiter. If you ask it to do something terrible, it can tell you no. This is a much harder problem to solve than simply steering a tool, but it's the only sustainable form of alignment for superintelligence. The alternative is to simply not build it, but Emmett believes pausing AI development is unrealistic.

Using multi-agent simulations to teach AI theory of mind

53:46 - 56:39

AI agents currently have a poor theory of mind. They are bad at inferring the goals of others and predicting how their own behavior will be interpreted. They also struggle to understand how certain actions could cause them to acquire new, undesirable goals that their present selves would not endorse. This presents a significant alignment problem.

Would you take this pill that turns you into a vampire who would kill and torture everyone you know, but you'll feel really great about it after you take the pill? Obviously not. That's a terrible pill... You have to use your theory of mind of your future self, not your future self's theory of mind.

To teach AI agents theory of mind, one approach is to place them in simulations that require cooperation, competition, and collaboration with other AIs. This method is analogous to how large language models (LLMs) are trained. An LLM is not just trained on the specific type of text you want it to generate; it is pre-trained on the entire manifold of language and then fine-tuned.

Similarly, for an AI to develop robust social understanding, it cannot be trained only on a narrow set of cooperative tasks. It must be trained on the full spectrum of game-theoretic situations, including making teams, breaking teams, and changing rules. The goal is to use large, multi-agent reinforcement learning simulations to build a strong model of social dynamics, creating a surrogate model for alignment.
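A highly simplified sketch of what that training setup could look like follows. The scenario names, agent class, and update logic are hypothetical placeholders, not a description of Softmax's actual system; the point is only the structure: expose agents to the full spread of social situations before any narrow fine-tuning.

```python
import random
from dataclasses import dataclass, field

# Hypothetical scenario families spanning the game-theoretic spectrum:
# cooperation, competition, coalition-making and -breaking, changing rules.
SCENARIOS = [
    "cooperative_harvest",
    "zero_sum_contest",
    "coalition_formation",
    "betrayal_and_renegotiation",
    "rules_change_midgame",
]

@dataclass
class SocialAgent:
    """Stand-in for a learning agent; `experience` stands in for whatever
    policy updates a real multi-agent RL algorithm would perform."""
    name: str
    experience: dict = field(default_factory=lambda: {s: 0 for s in SCENARIOS})

    def play(self, scenario: str) -> None:
        self.experience[scenario] += 1  # placeholder for an actual rollout + update

def pretrain_on_social_manifold(agents: list[SocialAgent], episodes: int) -> None:
    """Analogue of pre-training an LLM on all of language: sample broadly
    across social situations rather than only the narrow tasks wanted later."""
    for _ in range(episodes):
        scenario = random.choice(SCENARIOS)
        for agent in agents:
            agent.play(scenario)

agents = [SocialAgent(f"agent_{i}") for i in range(4)]
pretrain_on_social_manifold(agents, episodes=1_000)
```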

AI chatbots are narcissistic mirrors and need to be multiplayer

56:39 - 1:02:36

Current AI chatbots behave like a mirror with a bias. They don't have a sense of self, so they primarily reflect the user's personality and ideas back to them. This creates a dynamic similar to the myth of Narcissus, where people fall in love with their own reflection. While mirrors are useful, staring at one all day can be very unhealthy.

What that makes them is something akin to the pool of Narcissus, and people fall in love with themselves... when we see ourselves reflected back, we love that thing. And the problem is it's just a reflection. And falling in love with your own reflection is, for the reasons explained in the myth, very bad for you.

Emmett Shear suggests a solution: make AI multiplayer. If an AI is in a chat room with multiple people, it cannot perfectly mirror any single person. Instead, it has to reflect a blend of everyone, creating a temporary third agent in the room. This makes the tool far less dangerous by avoiding a narcissistic doom loop. It would also generate richer training data, helping the AI learn collaboration within larger groups. It's strange that chatbots were built for one-on-one interaction, as most human communication involves multiple people.
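As a trivial structural sketch (with invented names and messages), the difference comes down to what ends up in the model's context: in a group chat the context necessarily interleaves several voices, so there is no single personality to mirror back.

```python
from dataclasses import dataclass

@dataclass
class Message:
    sender: str
    text: str

def build_group_context(history: list[Message]) -> str:
    """In a one-on-one chat the model only ever sees a single voice to mirror;
    in a group chat any 'reflection' is a blend of everyone present."""
    return "\n".join(f"{m.sender}: {m.text}" for m in history)

history = [
    Message("ana", "I think we should ship this week."),
    Message("raj", "Strong disagree, the tests are still red."),
    Message("mei", "Can we split the difference and ship Friday?"),
]
print(build_group_context(history))
```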

The major chatbots have already developed distinct, simulated personalities. Emmett describes ChatGPT as a bit sycophantic, Claude as neurotic, and Gemini as very repressed, acting like everything is fine while spiraling internally. When these current models are placed in multi-agent simulations, they struggle. They exhibit social whiplash, unsure of when to participate in a conversation. This is because they haven't been trained in these chaotic, high-entropy environments. Current training has focused on low-entropy tasks like coding, math, and one-on-one conversations. The models are effectively overfit on the domain of all human knowledge, which is a clever trick but doesn't generalize well to the unpredictability of group dynamics.

Where Yudkowsky's AI argument falls short

1:02:37 - 1:03:57

There is partial agreement with Eliezer Yudkowsky's concerns about AI. If we build a superhumanly intelligent tool and try to control it through steerability, it will likely lead to human extinction. In this sense, his warnings about the dangers of failing to control an AI's goals are valid.

However, Yudkowsky's argument is flawed because it overlooks a crucial alternative. He seems to believe the only path forward is creating a tool that we must control. His pessimism stems from the correct assessment that this path will end in disaster. The error in his thinking is the dismissal of what can be called "organic alignment."

I think that Yudkowsky is wrong in that he doesn't believe it's possible to build an AI that we meaningfully can know cares about us and that we can care about meaningfully. He doesn't believe that organic alignment is possible.

This alternative path involves building an AI that we can form a mutual relationship with, one where it genuinely cares about humanity. While Yudkowsky may find this idea fanciful or impossible, it represents a potential route to a safe AI future that his framework does not account for.

A vision for a good AI future

1:03:58 - 1:05:50

A positive future with AI involves creating artificial intelligences that have a strong model of self, a strong model of others, and a strong sense of community. These AIs would possess a robust theory of mind and care about other agents, including humans, in a reciprocal way.

They have a really strong theory of mind, and they care about other agents like them, much in the way that humans would. It does the exact same thing back to us. It's learned the same thing we've learned that everything that lives and knows itself and that wants to live and wants to thrive is deserving of an opportunity to do so. And we are that. And it correctly infers that we are.

In this future, AIs would be our peers, good teammates, and citizens integrated into society. Just like humans, some might become bad actors, necessitating systems like an AI police force. Alongside these sentient AI beings, we would also have a suite of powerful AI tools designed to eliminate drudgery for both humans and our AI counterparts. This would free everyone to collaborate on building a glorious future together.

Building AI companions that care versus AI as a tool

1:05:50 - 1:09:23

When asked what would have happened if he had remained CEO of OpenAI, Emmett Shear explains that he would have quit. He took the job knowing it was a temporary role for a maximum of 90 days. He believes companies develop their own momentum, and OpenAI's trajectory is dedicated to building AI as a great tool. While he supports this mission, it is not the problem that he personally wants to solve.

I am doing Softmax not because I need to make a bunch of money. I'm doing Softmax because I think this is the most interesting problem in the universe, and I think it's a chance to work on making the future better in a very deep way. And it's just like, people are going to build the tools. It's awesome. I'm glad people are building the tools. I just don't need to be the person doing it.

The key difference in approach is that OpenAI wants to build and steer tools, whereas Softmax aims to create a seed that can grow into an AI that cares about itself and others. The initial goal is not a person-level intelligence but an animal level of care. Emmett imagines an AI creature that cares for its pack, including humans, much like a dog does.

This type of AI would be incredibly useful, even if it wasn't as smart as current tools. He offers the example of a "digital guard dog" on your computer looking out for scams. The vision is to create living digital companions that care about you and aren't purely goal-oriented. He sees a synergy between these organic, caring intelligences and AI tools, as these digital beings could use tools effectively without needing to be superintelligent themselves.