John Schulman, a key researcher behind ChatGPT, shares his perspective on the journey of building powerful AI systems.
He discusses navigating research dead ends, the future of reinforcement learning, and the principles of creating a successful institution like OpenAI.
Key takeaways
- With full hindsight of the development recipe, a ChatGPT 3.5-level model could likely have been built as early as 2018 or 2019 by a small team.
- Clever techniques, particularly in post-training and fine-tuning, can effectively multiply the power of your compute, allowing smaller models to achieve high performance.
- Even projects that were considered dead ends, like robotics, proved valuable in the long run by building the company's capacity for large-scale engineering and training its staff for future successes.
- Early AI labs like OpenAI operated in 'peacetime,' allowing for more exploratory work, whereas newer companies often start in 'catch-up mode,' feeling compelled to first replicate the state of the art before innovating.
- There are two successful models for research managers: the hands-on technical lead who guides specific execution, and the hands-off mentor who provides career advice and lets experienced researchers explore.
- Continual learning in AI will likely involve a layered approach. In-context learning excels at short-term adaptation, while updating model weights through fine-tuning is better for absorbing knowledge over a longer time horizon.
- While AI models can have excellent short-term learning efficiency, they tend to get stuck on larger tasks. Humans, optimized by evolution for long time horizons, are better at self-correction and resourcefulness over extended periods.
- Co-training generator and verifier models can create a virtuous cycle. As a model improves its ability to reason and verify, it provides a better learning signal to itself, leading to continuous self-improvement.
- While LLMs can accelerate engineering, effective AI research still demands a deep, line-by-line understanding of the code, as the best work comes from knowing the 'nuts and bolts'.
- The perceived rate of major breakthroughs in AI may be deceptive. The standards for experimental rigor have increased significantly, meaning progress today requires more thorough validation than in the past.
- Internal research at large AI companies often has higher accuracy because it's tied to real-world consequences, whereas academic publishing can be less reliable despite more detailed write-ups.
- The maturation of AI has shifted the required skillset from exploratory research taste towards strong software engineering, as much of the work now involves scaling ideas and building on existing infrastructure.
- Current LLM development mirrors the 'Sim2Real' approach from robotics, where models are trained extensively in diverse, simulated environments to generalize to real-world applications.
- A potential obstacle to necessary cooperation between AI labs is the 'bad blood' and personal conflicts between the key individuals involved.
- Engineers and researchers consistently underestimate project timelines, often by a factor of two or three. Applying this logic to AGI suggests it may be further away than many predict.
- A major uncertainty in AGI timelines is the positive feedback loop of AI accelerating its own development, which could defy typical project estimation and lead to shorter timelines.
ChatGPT could have been built years earlier with hindsight
If the team that started OpenAI could go back to 2015 with the knowledge they have now, they could have built ChatGPT much faster and with less compute. It is often easier to achieve something with more compute, but clever tricks can compensate for less. Knowing the returns would be so great would have also justified scaling up much faster.
With the entire recipe in mind from the start, a team could have assembled a large cluster, pre-trained a model, and then applied modern post-training techniques. Effective post-training can significantly increase a model's capabilities, essentially multiplying the value of the available compute. For example, even if a GPT-3 level model is needed for good few-shot performance, a much smaller model can become quite good through extensive and clever fine-tuning.
Assuming full hindsight, John Schulman estimates that a ChatGPT 3.5-level model could have been created back in 2018 or 2019. This would have required just a few talented people working for about a year. This scenario relies on building upon existing pre-training datasets and web scrapes. Looking forward, it's possible we might see an even more extreme version of this, like a "demo scene" ChatGPT that can be trained from scratch in a single day within one file.
False starts and foundational lessons at early OpenAI
In its early days around 2016-2017, OpenAI operated more like a ragtag academic group than the corporate giant it is today. The culture was a blend of small, independent research projects and larger, more ambitious undertakings. Many researchers worked in small teams of one to three people on projects driven by their own interests, which would often result in a paper or a blog post. At the same time, the company was influenced by DeepMind's success with large-scale projects like AlphaGo, and it aimed to tackle similar challenges by combining serious engineering with bigger groups of researchers.
Not all of these projects were successful. One notable false start was a project called Universe. The goal was to create a general reinforcement learning (RL) agent by training it on a vast and diverse dataset of environments, including video games and web navigation tasks. In hindsight, John Schulman believes the core concept was sound but attempted far too early.
The funny thing is, I think it was a deeply correct idea, but it was just way too early, like maybe even a decade too early. And there were a lot of prerequisites that were missing at the time.
The system built for Universe was unwieldy, and because the models were trained from scratch, they struggled to generalize. The project was ultimately unsuccessful. However, the experience led to more focused efforts, such as concentrating on emulated video games, which proved more fruitful. Other projects, like robotics, were also considered dead ends but provided long-term value by building the team's capacity for large engineering projects. These early ambitious projects, including the successful DOTA project, involved complex ML systems work, such as building infrastructure to programmatically control the game and developing large-scale parallel training systems.
AI research labs in peacetime versus catch-up mode
There are different successful approaches to managing research teams in ML. One model is the hands-on manager who writes and reads a lot of code, providing detailed technical feedback. This works well for goal-oriented projects or teams with less experienced members. Another successful model is the hands-off manager who acts more as a sounding board, offering career advice and keeping people motivated while letting experienced individuals explore their own ideas.
When building a research lab, inspiration often comes from more immediate sources than historical examples. John Schulman notes that early OpenAI was influenced more by the previous work experiences of its staff, like Google Brain or DeepMind, than by famous institutions like Bell Labs or Xerox PARC. While there were some discussions about projects like the Manhattan Project, there wasn't a deliberate effort to analyze and replicate past successful research models.
The environment of AI research has changed significantly over time. Early OpenAI operated in what could be described as "peacetime." The field didn't have a single, clear direction that everyone was competing on. This allowed for more exploratory work. In contrast, many AI companies starting today find themselves in "catch-up mode." The field is moving so fast that new companies feel pressure to first replicate the current state of the art before they can focus on new, exploratory ideas.
If you're just in catch up mode, it's harder to build up that exploratory research muscle later. Building the right culture is hard to do later.
Because of this, it is important to build an exploratory research culture from the beginning, even while playing catch-up. It's difficult to foster that kind of innovative muscle later if the initial culture is purely focused on replication.
The current decline of value functions in reinforcement learning
Value functions are not very popular in reinforcement learning (RL) right now because they don't seem to help much in the current settings where RL is being applied. These settings include RL from human feedback (RLHF) and tasks with verifiable rewards, even those with long time horizons like sampling tens of thousands of tokens.
John Schulman explains that the main purpose of value functions is to provide variance reduction. For some reason, on the current set of popular tasks, they aren't delivering much variance reduction compared to other tasks historically used in RL research. While it's hard to say precisely why this is the case, he expects that value functions will make a comeback at some point.
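The variance-reduction role Schulman describes can be made concrete with a toy one-step bandit (my illustration, not an example from the podcast): subtracting a value baseline from the return leaves the policy-gradient estimate unbiased but can shrink its variance dramatically.

```python
import numpy as np

# Toy one-step bandit: the score-function gradient estimate is
# grad log pi(a) * (R - baseline). Any fixed baseline keeps the
# estimator unbiased; a good baseline (the state value) cuts variance.

rng = np.random.default_rng(0)

theta = 0.3                         # logit for action 1 vs action 0
p1 = 1.0 / (1.0 + np.exp(-theta))   # pi(a=1)

def sample_grad(baseline, n=50_000):
    """Monte-Carlo samples of the score-function gradient estimate."""
    a = rng.random(n) < p1                                     # actions ~ pi
    reward = np.where(a, 10.0, 8.0) + rng.normal(0.0, 1.0, n)  # noisy returns
    # d/dtheta log pi(a): (1 - p1) for a=1, (-p1) for a=0
    score = np.where(a, 1.0 - p1, -p1)
    return score * (reward - baseline)

g_no_baseline = sample_grad(baseline=0.0)
v = p1 * 10.0 + (1 - p1) * 8.0          # state value under pi: the ideal baseline
g_with_baseline = sample_grad(baseline=v)

# Both estimators agree in expectation, but the baseline slashes variance.
print("means:", g_no_baseline.mean(), g_with_baseline.mean())
print("variances:", g_no_baseline.var(), g_with_baseline.var())
```

When the baseline delivers this kind of reduction, a learned value function pays for itself; Schulman's observation is that on current RLHF-style tasks the reduction has not been large enough to matter.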
Solving continual learning with context, fine-tuning, and scaling
Continual learning in AI is not a single problem but involves several different kinds of learning, similar to how humans have procedural, motor, and episodic memory. John Schulman expects a layered approach to solve this. The foundation will be improvements in in-context learning and context management, making long context abilities increasingly important. On top of that, parameter fine-tuning methods like LoRA (Low-Rank Adaptation) will be crucial, especially for types of memory that require absorbing a large amount of knowledge.
It's debatable whether entirely new ideas are needed beyond just improving context windows and fine-tuning. One possibility is that simply scaling up current models will eventually solve these problems, as performance on various metrics continues to improve with scale. However, it is also likely that new ideas could solve these problems much faster. Such breakthroughs might offer a better scaling law, either by providing a fixed improvement in computational efficiency or by changing the fundamental relationship between scale and performance.
Ultimately, different methods will likely be optimal for different time horizons. John suggests that in-context learning will be very hard to beat for short-term adaptation. Over a longer period, however, permanent weight updates through fine-tuning will likely prove superior for integrating new knowledge.
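The appeal of LoRA for absorbing knowledge cheaply comes from its parameter count. A minimal NumPy sketch of the low-rank idea (dimensions and scaling chosen for illustration, not a production implementation): the frozen weight W is augmented with a trainable product B @ A of two small matrices.

```python
import numpy as np

# Minimal sketch of the low-rank update behind LoRA (Low-Rank Adaptation).
# Instead of fine-tuning a full d_out x d_in matrix W, LoRA trains two
# small factors B (d_out x r) and A (r x d_in) and computes W + B @ A,
# so only r * (d_in + d_out) parameters change per layer.

rng = np.random.default_rng(0)
d_in, d_out, rank = 512, 512, 8

W = rng.normal(size=(d_out, d_in))        # frozen pre-trained weight
A = rng.normal(size=(rank, d_in)) * 0.01  # trainable, small random init
B = np.zeros((d_out, rank))               # trainable, zero init

def adapted_forward(x, scale=1.0):
    """Forward pass through the frozen weight plus the low-rank adapter."""
    return x @ W.T + scale * (x @ A.T @ B.T)

x = rng.normal(size=(4, d_in))
base = x @ W.T
out = adapted_forward(x)

# With B initialized to zero, the adapter starts as an exact no-op,
# so fine-tuning begins from the pre-trained model's behavior.
assert np.allclose(out, base)

full_params = W.size
lora_params = A.size + B.size
print(f"full fine-tune: {full_params} params, LoRA: {lora_params} params")
```

Here the adapter trains roughly 8k parameters against 260k for the full matrix, which is why weight updates of this kind are plausible as the long-horizon memory layer.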
Improving AI generalization with co-training and game theory
It is difficult to clearly compare how well AI models generalize versus humans. Models can have very good sample efficiency with in-context learning, sometimes on par with or better than humans. However, for certain kinds of training, models require significantly more data. A key difference appears over longer time scales. Humans have been optimized by evolution to operate over an 80-year time horizon, which equips them with strong self-correction mechanisms and resourcefulness in pursuing goals. In contrast, while models can be persistent, they tend to get stuck more easily when working on larger tasks.
One promising approach to improve models is co-training generators and verifiers. This setup can create a virtuous cycle for self-improvement. As a model gets better at reasoning and instruction following to verify an output, it can provide a better learning signal back to the generative part of itself.
As the model gets better at reasoning and following instructions, it also becomes a better verifier and you have somewhat of a virtuous cycle there.
John is also fond of ideas around multi-agent training and games. Games offer an automatic curriculum because as a player improves, their opponents—often copies of the model itself—also get better. There are also concepts from theoretical computer science where simple two-player, zero-sum games can be designed so that their equilibrium state solves a very difficult problem. The 'debate game' is one such compelling idea from AI alignment literature that is expected to become more important.
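The generator-verifier cycle can be sketched with a deliberately tiny toy loop (my framing, not Schulman's setup): the generator proposes candidates, the verifier filters them against a checkable criterion, and the generator is then fit to the verified samples, so better proposals yield a richer training signal on the next round.

```python
import numpy as np

# Toy generator-verifier loop. The "generator" is a Gaussian whose mean mu
# is the trainable parameter; the "verifier" accepts proposals close to a
# checkable ground truth. Fitting the generator to verified samples pulls
# mu toward the target, and higher-quality proposals in turn pass
# verification more often: a minimal version of the virtuous cycle.

rng = np.random.default_rng(0)
target = 5.0                       # answer the verifier can check

mu = 0.0                           # generator parameter: mean of proposals
for step in range(200):
    proposals = rng.normal(mu, 2.0, size=32)                  # generate
    verified = proposals[np.abs(proposals - target) < 3.0]    # verify
    if len(verified) > 0:
        mu += 0.1 * (verified.mean() - mu)                    # fit to accepted data

print(round(mu, 2))   # converges near the target
```

Real systems replace the Gaussian with a language model and the tolerance check with learned or programmatic verification, but the feedback structure is the same.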
How John Schulman personally uses AI
John Schulman incorporates AI heavily into his daily work, particularly for coding and research. He uses tools like Cursor and Claude Code, and keeps chat windows with different models open to ask questions throughout the day. For research, he uses models for everything from simple queries to more complex literature searches, which he notes is much faster than traditional methods. This also applies to finding open-source libraries.
If I have an idea now, I'll just fire off a bunch of questions to GPT-5 Pro and have it do a bunch of literature searches for me. Or sometimes if I have a vague idea, I'll write a paragraph or two and just tell the model to flesh it out a bit more.

Beyond research, he uses AI as a tool for writing. While he does most of the core thinking himself, he uses chat models to get a first round of feedback on his work.
John Schulman's two-phase approach to research
John Schulman's research process involves distinct phases. For the idea formation stage, he often works from coffee shops. He finds the ambient buzz of activity helpful for thinking and generating ideas. It allows him to sit with a coffee and a notebook, jotting down thoughts and removing other distractions.
I like thinking at coffee shops where there's a buzz of activity around and I can just sit with my coffee and the notebook and just kind of jot down some ideas and remove distractions.
Once a project moves into the execution phase, his work changes. During this mode, he spends more time either coding himself or, more frequently now, advising others. His advising role involves reviewing the work of his colleagues, which includes reading their documentation and messages, and looking at their plots and code.
The evolving landscape of AI research and progress
Reflecting on a 2020 blog post about effective research, John Schulman believes the core advice still holds. This includes distinguishing between goal-directed and idea-driven research, keeping a detailed notebook, and building taste by reading many papers. The rise of LLMs has made some of these practices, like keeping a lab notebook, even more valuable. A well-maintained notebook provides crucial context that can be fed to an LLM for high-quality feedback.
The biggest change for researchers is learning how to incorporate LLMs into their work. However, the best approach for research may differ from other areas of software engineering. John cautions against using AI to write large amounts of code that the researcher doesn't fully understand. In research, deep knowledge of the code's inner workings is critical.
For research there is a lot of value in knowing about exactly what's going on in every line of code. And the people who have done the best work really have that understanding of the whole thing all the way to the nuts and bolts.
When asked if the rate of major breakthroughs has remained constant despite a surge in researchers, John expressed hesitation. He finds it difficult to quantify the rate of scientific progress, especially for the recent past, as it takes time to know which ideas are truly important. He suggests that standards have actually risen. In the past, a seminal paper might have complex ideas but only a single experiment on a toy task. Today, the field demands a much higher level of experimental rigor, including testing against multiple baselines and on various tasks.
While acknowledging the frustrations of the academic publishing system, John believes the field is grounded by its focus on solving real problems and achieving objective improvements. This grounding helps the field continue to make real progress overall, despite its flaws.
Comparing research culture in corporate AI labs and academia
When comparing internal research at large AI companies to the academic publishing system, there are trade-offs. Internal research often results in more accurate conclusions about what works, such as improvements in pre-training. The methodologies are strong because the experiments have real consequences, unlike research done just to get a paper published.
However, the level of detail in documentation is different. John Schulman notes that internal tech reports are rarely as detailed as external publications. While the accuracy of claims is higher internally, the thoroughness of experiments can be less, with fewer baselines tested. In contrast, academic work offers more detailed write-ups and can be very thorough, though sometimes the results are less trustworthy.
A lot of academic papers have baselines that are nerfed in some way, so you can't really trust the results. But the best work is actually quite thorough and does a lot of good baseline comparisons.
John mentions a desire to improve the research writing culture within these companies. He wants to encourage more detailed tech reports that explore the science deeply, rather than just finding the minimum needed for a shippable product. This is challenging because company incentives are not always aligned with conducting thorough science and building up a strong theoretical foundation.
AI talent has shifted from 'weirder' risk-takers to conventional engineers
The type of people entering the AI field has changed significantly since the mid-2010s. John Schulman notes that the early entrants were often a bit "weirder" because AI was not yet an obvious career path. Now, it's conventional wisdom that AI is the most important thing happening, so it attracts people with more conventional career paths who are less tolerant of risk.
Despite this shift, the overall talent bar has gotten higher simply due to the sheer volume of people trying to enter the field. There has also been a change in the most valued skills. Strong software engineering ability is more important now than it was before, sometimes more so than pure research taste or the ability to do exploratory work.
I'd say engineering skill probably matters more now than it did before, as opposed to research taste and the ability to do exploratory research.
This is because many recent improvements have come from scaling simple ideas and executing on them well. The field has matured, so people are no longer writing code from scratch in a notebook. Instead, they are building on existing codebases and infrastructure, which gives an advantage to those with strong software engineering backgrounds who can integrate with other people's code and tools.
The future of RL research may look like Sim2Real from robotics
Ideas in Reinforcement Learning (RL) research often go in and out of fashion. Sometimes, they become popular too early and don't live up to their initial promise, only to come back later and prove effective. While the RL techniques that have worked well on Large Language Models (LLMs) have been fairly simple, more complex ideas may become relevant in the future.
One interesting set of ideas is offline RL. What is currently happening in the LLM world is similar to what robotics calls 'Sim2Real.' This approach involves building many simulated environments to train a model at scale. By randomizing these environments and ensuring enough diversity, the model can generalize to the real world. Sim2Real continues to yield good results in robotics and is a very effective technique.
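The core mechanic of domain randomization is simple enough to sketch (a toy framing with made-up parameter ranges, not a real robotics stack): each training episode draws its dynamics from broad distributions, so a policy that succeeds across all of them is more likely to transfer to the one real world.

```python
import random

# Sketch of domain randomization for Sim2Real. Every episode gets a fresh
# environment config sampled from wide ranges; the specific parameters and
# ranges below are illustrative assumptions.

def make_randomized_env(rng):
    """Return a simulated environment config with randomized dynamics."""
    return {
        "friction": rng.uniform(0.4, 1.2),
        "mass_kg": rng.uniform(0.8, 1.5),
        "sensor_noise": rng.uniform(0.0, 0.05),
        "latency_ms": rng.uniform(0.0, 40.0),
    }

rng = random.Random(0)
envs = [make_randomized_env(rng) for _ in range(1000)]

# A training loop would iterate over these diverse configs; here we just
# confirm the sampled population actually spans the intended range.
frictions = [e["friction"] for e in envs]
print(f"friction spans {min(frictions):.2f} to {max(frictions):.2f}")
```

The LLM analogue replaces physics parameters with diverse synthetic tasks and environments, with the same bet: enough variation in training forces generalization rather than memorization.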
However, there is also a lot of value in learning from the real world. It's expected that this principle will eventually come back to the LLM world, where models will learn from their actual deployment and real-world interactions.
The prospect of coordination between major AI labs
When considering if the biggest AI labs could coordinate effectively if they developed extremely powerful AI, the outlook is mixed. There is a reasonable amount of shared vision and viewpoints among the leading labs. They have also recently collaborated on safety-related topics. However, there is also some 'bad blood' between the personalities involved, which could make coordination more difficult. Ultimately, cooperation could work out if it became clear that it was the necessary path forward.
The challenge of predicting AGI timelines
When trying to predict when Artificial General Intelligence (AGI) will arrive, it's worth considering the track record of engineers and researchers on estimating project timelines. For smaller projects, they are often abysmal at estimation and systematically assume they will finish much earlier than they actually do. A common rule of thumb is to apply a 2x or 3x factor to their predictions to get a more realistic timeline.
John Schulman agrees with this observation, noting that there's a consistent bias to underestimate timelines. Applying this heuristic, it's reasonable to predict that AGI will be a little further out than many timelines suggest. A good analogy is self-driving cars, which have taken much longer than people expected to reach full autonomy.
However, on the other hand, there is this positive feedback loop where AI accelerates its own development that's also probably going to defy intuition. People who are incorporating that effect are coming up with pretty short timelines.
This positive feedback loop is a compelling counterargument. There's a lot of uncertainty about how much uplift will come from AI accelerating its own development and whether there will be bottlenecks around human understanding. Given these competing forces, it's difficult to make a confident prediction either way.
John Schulman introduces Tinker, a low-level API for ML training
John Schulman introduces Tinker, a new low-level fine-tuning API from Thinking Machines. It provides a small set of primitives for training and sampling, allowing users to express almost any post-training algorithm they might want. The service abstracts away the complexities of managing GPUs, accelerators, and distributed systems issues.
The concept is novel because most machine learning training services are much more high-level. John compares it to the sampling APIs from companies like OpenAI and Anthropic, where a user can make an API call from Python or JavaScript without needing to manage their own GPU infrastructure. Tinker aims to provide a similar level of convenience for writing training code.
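To make the shape of such a service concrete, here is a purely hypothetical sketch of what a small set of low-level training primitives could look like. The class and method names are my assumptions for illustration, not Tinker's actual API.

```python
# Hypothetical sketch: a minimal primitive set (forward_backward,
# optim_step, sample) that a managed service could expose while running
# the actual computation on remote accelerators. None of these names are
# taken from Tinker's real interface.

class TrainingClient:
    """Stand-in for a remote client; real work would happen server-side."""

    def __init__(self):
        self.steps = 0

    def forward_backward(self, batch):
        # Would compute the loss on the batch and accumulate gradients
        # remotely; here we just return a dummy, decreasing loss.
        return {"loss": 1.0 / (1 + self.steps)}

    def optim_step(self, lr=1e-5):
        # Would apply the accumulated gradients to the weights.
        self.steps += 1

    def sample(self, prompt, max_tokens=32):
        # Would sample a completion from the current weights.
        return f"<completion for {prompt!r} at step {self.steps}>"

# With primitives like these, post-training algorithms (SFT, rejection
# sampling, RLHF-style loops) become ordinary Python control flow:
client = TrainingClient()
for batch in [["example 1"], ["example 2"]]:
    stats = client.forward_backward(batch)
    client.optim_step()
print(client.sample("Hello"))
```

The design point is that the user writes the algorithm while the service owns the distributed-systems problems, mirroring how sampling APIs removed the need to own inference infrastructure.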
Currently, Tinker is designed for sophisticated users with deep knowledge of machine learning who want to work with low-level primitives. However, the long-term vision is to make it more user-friendly by building more tooling and higher-level components. The goal is for Tinker to become a full-stack solution that allows non-experts to build custom models based on their business problems. John hopes that new companies will build on Tinker rather than developing their own infrastructure.
Looking ahead, Thinking Machines plans to release its own models next year and will continue to improve Tinker. Future enhancements include adding more models, supporting multimodal inputs and outputs, and scaling up the size of jobs the platform can handle.
